
Pdf Ocr Skill
Extract Chinese and English text from scanned PDFs and image files using local or cloud OCR engines without manual retyping.
Overview
pdf-ocr-skill is an agent skill that extracts Chinese and English text from scanned PDFs and images through four configurable OCR engines. It is most often used in the Build phase and also fits Validate and Idea.
Install
npx skills add https://github.com/yejinlei/pdf-ocr-skill --skill pdf-ocr-skill
What is this skill?
- Four OCR engines: RapidOCR, RapidDoc, PaddleOCR (local), and SiliconFlow DeepSeek-OCR (cloud)
- Processes scanned PDFs (via page rasterization) and JPG, PNG, BMP, GIF, TIFF, WEBP images
- Chinese and English recognition with structure-aware output order
- Default local RapidOCR path needs no API key; automatic fallback to SiliconFlow when local init fails
- Configurable through the OCR_ENGINE variable; the SiliconFlow API key and model are set in .env
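The engine selection described in the list above can be sketched as a small helper. This is a minimal illustration, not the skill's own code: `resolve_engine` is a hypothetical name, and the validation rules (rejecting unknown names, requiring a key only for the cloud engine) are assumptions. The engine names and the variables OCR_ENGINE and SILICON_FLOW_API_KEY are taken from the skill's .env example.

```python
import os

# Engine names as listed in the skill's .env example.
VALID_ENGINES = ("rapid", "rapidoc", "paddle", "siliconflow")


def resolve_engine(env=None):
    """Return the engine selected by OCR_ENGINE (hypothetical helper).

    `env` is any mapping of variable names to values; it defaults to
    os.environ so the function can be tested without touching the
    real environment.
    """
    env = os.environ if env is None else env
    # Fall back to the documented default when the variable is unset or empty.
    engine = (env.get("OCR_ENGINE") or "rapid").strip().lower()
    if engine not in VALID_ENGINES:
        raise ValueError(f"Unknown OCR_ENGINE: {engine!r}")
    # Assumption: only the cloud engine needs an API key.
    if engine == "siliconflow" and not env.get("SILICON_FLOW_API_KEY"):
        raise ValueError("OCR_ENGINE=siliconflow requires SILICON_FLOW_API_KEY")
    return engine


if __name__ == "__main__":
    print(resolve_engine())
```

Checking the configuration up front like this surfaces a missing key before any page is rasterized, which may be cheaper than failing on the first cloud request.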
Adoption & trust: 1.4k installs on skills.sh; 7 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have image-only or scanned PDFs and cannot search, quote, or feed them into your agent workflow without slow manual transcription.
Who is it for?
Indie builders who ingest mixed Chinese/English scans locally first and only pay for cloud OCR when accuracy or engine failures demand it.
Skip if: you need guaranteed table layout, form extraction, or redaction pipelines without post-processing. This skill targets text extraction; it is not a full document intelligence platform.
When should I use this skill?
You need to extract Chinese or English text from scanned PDFs or image files and want local-first OCR with optional cloud fallback.
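A local-first flow with cloud fallback could look roughly like the sketch below. It is an assumption-laden illustration rather than the skill's implementation: `ocr_with_fallback` and the three stub factories are hypothetical, and an "engine" is modelled as a plain callable that takes a path and returns text. The only behaviour taken from the listing is that the switch happens when the local engine fails to initialize.

```python
def ocr_with_fallback(path, make_local, make_cloud):
    """Run OCR locally, switching to the cloud engine only if local setup fails.

    make_local / make_cloud are zero-argument factories that return a
    callable engine(path) -> str. Both are hypothetical stand-ins.
    """
    try:
        engine, used = make_local(), "local"
    except Exception:
        # Local initialization failed (for example, a missing dependency).
        engine, used = make_cloud(), "cloud"
    return {"engine": used, "text": engine(path)}


# Stub factories for demonstration only.
def working_local():
    return lambda path: f"local text from {path}"


def broken_local():
    raise ImportError("local OCR package is not installed")


def working_cloud():
    return lambda path: f"cloud text from {path}"


if __name__ == "__main__":
    print(ocr_with_fallback("scan.pdf", working_local, working_cloud))
    print(ocr_with_fallback("scan.pdf", broken_local, working_cloud))
```

Note that this sketch falls back only on initialization errors. A recognition error raised later by the local engine would still propagate to the caller.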
What do I get? / Deliverables
You get ordered plain text from each page or image file, ready to paste into docs, tickets, or downstream automation, with engine choice controlled by OCR_ENGINE.
- Plain-text extraction per PDF page or image
- Engine-selected OCR output suitable for docs or downstream scripts
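Because PDFs and images go through different entry points, a caller may want to pick the method from the file extension. The sketch below is one way to do that. `pick_method` and `IMAGE_SUFFIXES` are hypothetical names; the method names `ocr_pdf` and `ocr_image_file` come from the skill's SKILL.md, and the extension set is an assumed mapping of the listed formats (JPG, PNG, BMP, GIF, TIFF, WEBP).

```python
from pathlib import Path

# Assumed file extensions for the image formats the listing names.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def pick_method(path):
    """Return the processor method name to call for this file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "ocr_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "ocr_image_file"
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


if __name__ == "__main__":
    for name in ("report.PDF", "page-01.tiff", "photo.jpeg"):
        print(name, "->", pick_method(name))
```

With a processor instance in hand, the result can be used as `getattr(processor, pick_method(path))(path)`.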
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Document ingestion and OCR most often happen while you are building or maintaining project docs, specs, and reference materials from scanned sources. The docs subphase is the canonical shelf because the skill turns scanned PDFs and images into machine-readable text for READMEs, specs, and knowledge bases.
Where it fits
OCR a competitor’s scanned whitepaper PDF into text you can summarize in your opportunity notes.
Extract clauses from a photographed scope document before locking MVP requirements.
Batch-convert legacy scan assets into copy-pasteable text for your product documentation repo.
How it compares
Use instead of one-off OCR SaaS uploads when you want a scriptable, agent-invokable skill inside your repo.
Common Questions / FAQ
Who is pdf-ocr-skill for?
Solo and indie builders who work with scanned PDFs or photos of documents and need searchable text in Chinese or English inside Claude Code, Cursor, or similar agents.
When should I use pdf-ocr-skill?
Use it during Idea research to digitize competitor PDFs, during Validate to quote scanned briefs, and during Build docs work to turn scans into README or spec text. It fits best when local RapidOCR is accurate enough or SiliconFlow is configured for harder pages.
Is pdf-ocr-skill safe to install?
Review the Security Audits panel on this Prism page before installing; cloud mode sends images to SiliconFlow when enabled, and local engines read files from disk as configured in your environment.
SKILL.md
# PDF OCR Skill

## Chinese Version

The PDF OCR skill extracts text from scanned PDF files and image files. The skill supports multiple OCR engines, including:

- **RapidOCR** (local engine): no API key required, free to use, fast recognition
- **SiliconFlow large model** (cloud engine): high-accuracy OCR using an AI large model

### Features

- Text extraction from scanned PDF files
- Text recognition for multiple image formats (JPG, PNG, BMP, GIF, TIFF, WEBP)
- **Four engines**: RapidOCR (local), RapidDoc (enhanced), PaddleOCR (local), and the SiliconFlow API (cloud)
- Chinese and English text recognition
- Preserves the order and structure of the text
- Automatically converts PDF pages to images for recognition
- Smart engine switching: falls back to the SiliconFlow API automatically when RapidOCR fails to initialize

### Installation

#### Dependencies

```bash
pip install pymupdf pillow requests python-dotenv
```

#### Optional dependency (recommended)

Install RapidOCR to enable local recognition:

```bash
pip install rapidocr_onnxruntime
```

### Environment variables

1. Copy the `.env.example` file and rename the copy to `.env`
2. Set the following options as needed:

```env
# OCR engine selection
# - "rapid": RapidOCR local engine (default, no API key required)
# - "rapidoc": RapidDoc enhanced engine (no API key required)
# - "paddle": PaddleOCR local engine (no API key required)
# - "siliconflow": SiliconFlow API engine (API key required)
OCR_ENGINE=rapid

# If you use the SiliconFlow API engine, also set the following:
SILICON_FLOW_API_KEY=your_api_key_here
SILICON_FLOW_OCR_MODEL=deepseek-ai/DeepSeek-OCR
```

### Quick start

#### Using the default engine (local RapidOCR)

```python
# Import the OCR processor
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance (uses RapidOCR by default)
processor = PDFOCRProcessor()

# Run OCR on the PDF
result = processor.ocr_pdf('path/to/your/scanned.pdf')

# Read the result
print(f"Recognition complete, {result['page_count']} page(s)")
print(f"Engine used: {result['engine']}")
print(result['text'])
```

#### Using the SiliconFlow API engine

```python
# Import the OCR processor
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance that uses the SiliconFlow API
processor = PDFOCRProcessor(engine="siliconflow")

# Run OCR on the PDF
result = processor.ocr_pdf('path/to/your/scanned.pdf')

# Read the result
print(f"Recognition complete, {result['page_count']} page(s)")
print(result['text'])
```

#### Recognizing an image file

```python
# Import the OCR processor
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance
processor = PDFOCRProcessor()  # or PDFOCRProcessor(engine="siliconflow")

# Run OCR on the image
result = processor.ocr_image_file('path/to/your/image.jpg')

# Read the result
print(f"Recognized text: {result['text']}")
```

### Command-line usage

```bash
# Use the default RapidOCR engine
python pdf_ocr_processor.py your_document.pdf

# Use the SiliconFlow API engine
python pdf_ocr_processor.py your_document.pdf siliconflow

# Use the RapidDoc enhanced engine
python pdf_ocr_processor.py your_document.pdf rapidoc

# Use the PaddleOCR engine
python pdf_ocr_processor.py your_document.pdf paddle
```

### Advanced examples

#### Batch-processing multiple PDF files

```python
import os
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance
processor = PDFOCRProcessor()

# Process every PDF file in a directory
pdf_dir = "path/to/pdf/files"
output_dir = "path/to/output"
os.makedirs(output_dir, exist_ok=True)

for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_dir, pdf_file)
        output_path = os.path.join(output_dir, f"{os.path.splitext(pdf_file)[0]}.txt")
        print(f"Processing file: {pdf_file}")
        try:
            result = processor.ocr_pdf(pdf_path)
            # Save the recognized text to a text file
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write("=== PDF OCR Result ===\n")
                f.write(f"File name: {pdf_file}\n")
                f.write(f"Pages: {result['page_count']}\n")
                f.write(f"Engine used: {result['engine']}\n\n")
                f.write(result['text'])
            print(f"Done, result saved to: {output_path}")
        except Exception as e:
            print(f"Processing failed: {e}")
```

#### Combining two engines

```python
from scripts.pdf_ocr_processor import PDFOCRProcessor

def process_with_best_engine(pdf_path):
    """Try RapidOCR first; use the SiliconFlow API if the result is poor."""
    # Start with the local RapidOCR engine
    rapid_processor = PDFOCRProcessor(engine="rapid")
    rapid_result = rapid_processor.ocr_pdf(pdf_path)

    # Simple quality check (for example, the length of the recognized text)
    text_length = len(rapid_result['text'])