Pdf Ocr Skill

Name: Pdf Ocr Skill
Author: yejinlei

yejinlei/pdf-ocr-skill·MIT

Extract Chinese and English text from scanned PDFs and image files using local or cloud OCR engines without manual retyping.

Overview

pdf-ocr-skill is an agent skill that extracts Chinese and English text from scanned PDFs and images via four configurable OCR engines. It is most often used in the Build phase, and also fits Validate and Idea.

Install

npx skills add https://github.com/yejinlei/pdf-ocr-skill --skill pdf-ocr-skill

What is this skill?

Four OCR engines: RapidOCR, RapidDoc, PaddleOCR (local), and SiliconFlow DeepSeek-OCR (cloud)
Processes scanned PDFs (via page rasterization) and JPG, PNG, BMP, GIF, TIFF, WEBP images
Chinese and English recognition with structure-aware output order
The default local RapidOCR engine needs no API key; the skill falls back to SiliconFlow automatically when local initialization fails
The engine is selected with OCR_ENGINE; the SiliconFlow API key and model are set in .env

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1.4k installs on skills.sh; 7 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You have image-only or scanned PDFs and cannot search, quote, or feed them into your agent workflow without slow manual transcription.

Who is it for?

Indie builders who ingest mixed Chinese/English scans locally first and only pay for cloud OCR when accuracy or engine failures demand it.

Skip if: you need guaranteed layout-preserving tables, forms, or redaction pipelines without post-processing. This skill targets text extraction, not full document intelligence.

When should I use this skill?

You need to extract Chinese or English text from scanned PDFs or image files and want local-first OCR with optional cloud fallback.

What do I get? / Deliverables

You get ordered plain text from each page or image file, ready to paste into docs, tickets, or downstream automation, with engine choice controlled by OCR_ENGINE.

Plain-text extraction per PDF page or image
Engine-selected OCR output suitable for docs or downstream scripts

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Document ingestion and OCR most often happen while you are building or maintaining project docs, specs, and reference materials from scanned sources. The docs subphase is the canonical shelf because the skill turns scanned PDFs and images into machine-readable text for READMEs, specs, and knowledge bases.

Also useful

Validate: Scope & plan
Idea: Opportunity & market research

Where it fits

Example use (Idea: Opportunity & market research)

OCR a competitor’s scanned whitepaper PDF into text you can summarize in your opportunity notes.

Example use (Validate: Scope & plan)

Extract clauses from a photographed scope document before locking MVP requirements.

Example use (Build: Docs & content)

Batch-convert legacy scan assets into copy-pasteable text for your product documentation repo.

How it compares

Use instead of one-off OCR SaaS uploads when you want a scriptable, agent-invokable skill inside your repo.

Common Questions / FAQ

Who is pdf-ocr-skill for?

Solo and indie builders who work with scanned PDFs or photos of documents and need searchable text in Chinese or English inside Claude Code, Cursor, or similar agents.

When should I use pdf-ocr-skill?

Use it during Idea research when digitizing competitor PDFs, during Validate when quoting scanned briefs, and during Build docs work when turning scans into README or spec text—especially when local RapidOCR is enough or SiliconFlow is configured for harder pages.

Is pdf-ocr-skill safe to install?

Review the Security Audits panel on this Prism page before installing; cloud mode sends images to SiliconFlow when enabled, and local engines read files from disk as configured in your environment.

SKILL.md


# PDF OCR Skill

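## Chinese Version

The PDF OCR skill extracts text from scanned PDF files and image files. It supports multiple OCR engines, including:
- **RapidOCR** (local engine): no API key required, free to use, fast recognition
- **SiliconFlow large model** (cloud engine): high-accuracy OCR using an AI large model

### Features

- Extracts text from scanned PDF files
- Recognizes text in multiple image formats (JPG, PNG, BMP, GIF, TIFF, WEBP)
- **Four-engine support**: RapidOCR (local), RapidDoc (enhanced), PaddleOCR (local), and the SiliconFlow API (cloud)
- Recognizes Chinese and English text
- Preserves the order and structure of the text
- Automatically converts PDF pages to images for recognition
- Smart engine switching: falls back to the SiliconFlow API automatically when RapidOCR fails to initialize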
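
The engine switching described in the last bullet could be sketched as a try-in-order loop. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `init_with_fallback`, `make_local_engine`, and `make_cloud_engine` are hypothetical names, and the stand-in factories only simulate a local engine that fails to initialize.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def init_with_fallback(candidates: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Return (name, engine) for the first factory that initializes without raising."""
    errors = []
    for name, factory in candidates:
        try:
            return name, factory()
        except Exception as exc:  # any init failure moves on to the next candidate
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no OCR engine could be initialized: " + "; ".join(errors))


# Hypothetical stand-ins for real engine constructors.
def make_local_engine():
    raise ImportError("rapidocr_onnxruntime is not installed")


def make_cloud_engine():
    return {"kind": "cloud"}


name, engine = init_with_fallback(
    [("rapid", make_local_engine), ("siliconflow", make_cloud_engine)]
)
print(name)
```

Collecting the per-engine error messages keeps the final failure readable when every candidate fails.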

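### Installation

#### Required dependencies

```bash
pip install pymupdf pillow requests python-dotenv
```

#### Optional dependencies (recommended)

Install RapidOCR for local recognition:

```bash
pip install rapidocr_onnxruntime
```

### Environment variable configuration

1. Copy the `.env.example` file and rename the copy to `.env`
2. Configure the following options as needed:

```env
# OCR engine selection
# - "rapid": RapidOCR local engine (default, no API key required)
# - "rapidoc": RapidDoc enhanced engine (no API key required)
# - "paddle": PaddleOCR local engine (no API key required)
# - "siliconflow": SiliconFlow API engine (API key required)
OCR_ENGINE=rapid

# Required only when using the SiliconFlow API engine:
SILICON_FLOW_API_KEY=your_api_key_here
SILICON_FLOW_OCR_MODEL=deepseek-ai/DeepSeek-OCR
```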
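
One way to check these settings before running OCR is sketched below. It is an illustration only: `resolve_engine` is a hypothetical helper, and the case-insensitive matching and the rejection of the `your_api_key_here` placeholder are assumptions rather than documented behavior of the skill.

```python
import os

# Engine names taken from the .env comments above.
VALID_ENGINES = ("rapid", "rapidoc", "paddle", "siliconflow")
PLACEHOLDER_KEY = "your_api_key_here"


def resolve_engine(env=None):
    """Return the engine name to use, checking the SiliconFlow key when it is needed."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty OCR_ENGINE means the default local engine.
    engine = (env.get("OCR_ENGINE") or "rapid").strip().lower()
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown OCR_ENGINE {engine!r}; expected one of {VALID_ENGINES}")
    if engine == "siliconflow":
        key = (env.get("SILICON_FLOW_API_KEY") or "").strip()
        if not key or key == PLACEHOLDER_KEY:
            raise ValueError("OCR_ENGINE=siliconflow requires SILICON_FLOW_API_KEY")
    return engine


print(resolve_engine({}))
print(resolve_engine({"OCR_ENGINE": "Paddle"}))
```

If the `.env` file is loaded with python-dotenv's `load_dotenv()`, its values become available through `os.environ`, so calling `resolve_engine()` with no argument would pick them up.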

### Quick start

#### Using the default engine (local RapidOCR recognition)

```python
# Import the OCR processor
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance (uses RapidOCR by default)
processor = PDFOCRProcessor()

# Run OCR on the PDF
result = processor.ocr_pdf('path/to/your/scanned.pdf')

# Read the recognition result
print(f"OCR complete: {result['page_count']} page(s)")
print(f"Engine used: {result['engine']}")
print(result['text'])
```
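
The example above reads three keys from the result: `text`, `page_count`, and `engine`. A typed view of that shape might look like the sketch below. `OCRResult` and `summarize` are hypothetical names introduced here, and the real result may carry additional keys.

```python
from typing import TypedDict


class OCRResult(TypedDict):
    """Keys used by the quick-start example; assumed, not an exhaustive schema."""
    text: str
    page_count: int
    engine: str


def summarize(result: OCRResult, preview_chars: int = 80) -> str:
    """Build a one-line summary with a short, single-line preview of the text."""
    preview = result["text"][:preview_chars].replace("\n", " ")
    return f"{result['page_count']} page(s) via {result['engine']}: {preview}"


sample: OCRResult = {"text": "Hello\nworld", "page_count": 2, "engine": "rapid"}
print(summarize(sample))
```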

#### Using the SiliconFlow API engine

```python
# Import the OCR processor
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance that uses the SiliconFlow API
processor = PDFOCRProcessor(engine="siliconflow")

# Run OCR on the PDF
result = processor.ocr_pdf('path/to/your/scanned.pdf')

# Read the recognition result
print(f"OCR complete: {result['page_count']} page(s)")
print(result['text'])
```

#### Recognizing image files

```python
# Import the OCR processor
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance
processor = PDFOCRProcessor()  # or PDFOCRProcessor(engine="siliconflow")

# Run OCR on the image
result = processor.ocr_image_file('path/to/your/image.jpg')

# Read the recognition result
print(f"Recognized text: {result['text']}")
```
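
Because PDFs and images go through different methods (`ocr_pdf` and `ocr_image_file`), a small dispatcher keyed on the file extension can help when inputs are mixed. The sketch below is an assumption-laden illustration: `pick_method` is a hypothetical helper, and the `.jpeg` and `.tif` aliases are added here as common variants of the listed JPG and TIFF formats.

```python
from pathlib import Path

# Formats listed under Features, plus the assumed .jpeg and .tif aliases.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def pick_method(path: str) -> str:
    """Return the name of the processor method that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "ocr_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "ocr_image_file"
    raise ValueError(f"unsupported file type: {suffix or '(none)'}")


print(pick_method("scans/contract.PDF"))
print(pick_method("photos/page1.webp"))
```

With a processor instance in hand, the call would then be `getattr(processor, pick_method(path))(path)`.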

### Command-line usage

```bash
# Use the default RapidOCR engine
python pdf_ocr_processor.py your_document.pdf

# Use the SiliconFlow API engine
python pdf_ocr_processor.py your_document.pdf siliconflow

# Use the RapidDoc enhanced engine
python pdf_ocr_processor.py your_document.pdf rapidoc

# Use the PaddleOCR engine
python pdf_ocr_processor.py your_document.pdf paddle
```
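
The commands above take a file path followed by an optional engine name. If you wrap the skill in your own script, the same positional shape could be parsed with `argparse` as sketched here. This is not how `pdf_ocr_processor.py` is known to parse its arguments; `build_parser` is a hypothetical helper.

```python
import argparse

ENGINES = ("rapid", "rapidoc", "paddle", "siliconflow")


def build_parser() -> argparse.ArgumentParser:
    """Parser with a required path and an optional engine that defaults to 'rapid'."""
    parser = argparse.ArgumentParser(description="OCR a scanned PDF")
    parser.add_argument("path", help="path to the PDF file")
    parser.add_argument("engine", nargs="?", default="rapid", choices=ENGINES,
                        help="OCR engine to use")
    return parser


args = build_parser().parse_args(["your_document.pdf", "paddle"])
print(args.path, args.engine)
```

Restricting `engine` to `choices` makes a mistyped engine name fail early with a usage message instead of reaching the OCR step.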

### Advanced usage examples

#### Batch-processing multiple PDF files

```python
import os
from scripts.pdf_ocr_processor import PDFOCRProcessor

# Create a processor instance
processor = PDFOCRProcessor()

# Process every PDF file in a directory
pdf_dir = "path/to/pdf/files"
output_dir = "path/to/output"
os.makedirs(output_dir, exist_ok=True)

for pdf_file in os.listdir(pdf_dir):
    # Match the extension case-insensitively so .PDF files are included
    if pdf_file.lower().endswith('.pdf'):
        pdf_path = os.path.join(pdf_dir, pdf_file)
        output_path = os.path.join(output_dir, f"{os.path.splitext(pdf_file)[0]}.txt")

        print(f"Processing file: {pdf_file}")
        try:
            result = processor.ocr_pdf(pdf_path)

            # Save the recognition result to a text file
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write("=== PDF OCR Result ===\n")
                f.write(f"File name: {pdf_file}\n")
                f.write(f"Pages: {result['page_count']}\n")
                f.write(f"Engine used: {result['engine']}\n\n")
                f.write(result['text'])

            print(f"Done. Result saved to: {output_path}")
        except Exception as e:
            print(f"Processing failed: {e}")
```

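#### Combining two engines

```python
from scripts.pdf_ocr_processor import PDFOCRProcessor

def process_with_best_engine(pdf_path, min_text_length=100):
    """Try RapidOCR first; fall back to the SiliconFlow API if the result looks poor."""
    # Start with the local RapidOCR engine
    rapid_processor = PDFOCRProcessor(engine="rapid")
    rapid_result = rapid_processor.ocr_pdf(pdf_path)

    # Roughly assess recognition quality (for example, by the length of the recognized text)
    text_length = len(rapid_result['text'])

    # Assumed completion: the source example ends above. min_text_length is an
    # illustrative threshold, not a value taken from the skill.
    if text_length >= min_text_length:
        return rapid_result
    # Too little text was recognized, so retry with the cloud engine
    return PDFOCRProcessor(engine="siliconflow").ocr_pdf(pdf_path)
```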
