Docpick

Name: Docpick
Author: QuartzUnit

QuartzUnit/docpick

1 repo stars
Updated April 15, 2026
QuartzUnit/docpick

Docpick is an MCP server that schema-driven extracts documents with local OCR and an LLM and returns structured JSON.

About

Docpick is a Model Context Protocol server that extracts structured data from documents using a schema you supply, combining local OCR with an LLM so agents can go from file to JSON in one step. developers shipping agent-backed SaaS, internal tools, or CLI workflows use it when they need repeatable field capture from PDFs and images without hand-labeling every document. Install the PyPI package, connect via stdio, and invoke extraction from your agent during validate (parsing specs), build (ingesting user uploads), or operate (reprocessing failed extractions). It fits the integrations lane of the journey: register the server, pass documents and schemas, and consume JSON in your app logic rather than maintaining a separate microservice for every doc type.

Schema-driven extraction: you define the target JSON shape and the server maps document content into it
Local OCR plus LLM on the docpick PyPI package (0.1.2) over stdio MCP transport
Document in, structured JSON out—suited for forms, receipts, IDs, and mixed layouts
stdio MCP server (identifier docpick) for Claude Code, Cursor, and other MCP hosts
No mandatory cloud OCR API in the positioning—designed for local processing workflows

Docpick by the numbers

Data as of Jul 7, 2026 (Skillselion catalog sync)

terminal

claude mcp add docpick -- uvx docpick

Add your badge

Show developers this MCP server is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/mcp/tool/io.github.ArkNill/docpick.svg)](https://skillselion.com/mcp/tool/io.github.ArkNill/docpick)

repo stars	★ 1
Package	docpick
Transport	STDIO
Auth	None
Last updated	April 15, 2026
Repository	QuartzUnit/docpick ↗

What it does

Turn scanned or digital documents into structured JSON from your coding agent using a schema you define, with local OCR plus an LLM.

Who is it for?

Best when you're adding document upload or inbox flows and want MCP-callable extraction with defined JSON shapes and local-first processing.

Skip if: Skip if you only need full-text search in repos with no document schemas, or and require a fully managed cloud document AI with SLAs and no local stack.

What you get

After you register docpick, your agent can pass a document plus schema and get consistent structured JSON you can validate, store, or ship in features.

Structured JSON matching your supplied schema per document
Agent-callable MCP extraction workflow over stdio
Repeatable document-ingest path for build and operate iterations

By the numbers

MCP server version 0.1.2
PyPI package identifier docpick with stdio transport
Source repository QuartzUnit/docpick on GitHub

README.md

Docpick

한국어 문서 · llms.txt

Document in, Structured JSON out. Locally. With your schema.

docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document — invoices, receipts, bills of lading, tax forms, and more.

Zero cloud dependency — runs entirely on your machine (CPU or GPU)
Custom schemas — define your own Pydantic models or use 8 built-in document schemas
Validation built-in — checkdigit verification, cross-field rules, cross-document consistency
Apache 2.0 — no GPL/AGPL dependencies

Install

pip install docpick            # core (LLM extraction only)
pip install docpick[paddle]    # + PaddleOCR (recommended)
pip install docpick[easyocr]   # + EasyOCR (Korean-optimized)
pip install docpick[got]       # + GOT-OCR2.0 (GPU, vision-language)
pip install docpick[all]       # all OCR backends

Requirements: Python 3.11+ / LLM endpoint (vLLM, Ollama, or OpenAI-compatible)

Quick Start

Python API

from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema

pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)

print(result.data)           # Structured dict matching schema
print(result.validation)     # Validation errors/warnings
print(result.confidence)     # Per-field confidence scores

CLI

# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json

# OCR only (no LLM)
docpick ocr document.png --lang ko,en

# Validate extracted JSON
docpick validate result.json --schema invoice

# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4

# List available schemas
docpick schemas list

# Show schema details
docpick schemas show invoice

Built-in Schemas

Schema	Document Type	Key Validations
`invoice`	Commercial invoices	Line item sums, tax ID checkdigit, date order
`receipt`	Retail/restaurant receipts	Total = subtotal + tax + tip
`bill_of_lading`	Ocean/air B/L	Container weight sums, ISO 6346, HS code format
`purchase_order`	Purchase orders	PO total = line items, delivery date order
`kr_tax_invoice`	Korean e-tax invoice (세금계산서)	Business number checkdigit (x2), supply/tax/total sums
`bank_statement`	Bank statements	IBAN mod97, period date order
`id_document`	Passport/ID (ICAO 9303)	MRZ, ISO 3166 country codes, date ranges
`certificate_of_origin`	Certificate of Origin	ISO 3166 alpha-2 country codes

Custom Schemas

Define your own schema with Pydantic:

from pydantic import BaseModel
from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule

class MyDocument(BaseModel):
    """Custom document schema."""
    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]

pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)

Or use a JSON Schema file:

docpick extract document.pdf --schema my_schema.json

Validation

Check Digit Algorithms

Algorithm	Use Case
`kr_business_number`	Korean business registration number (10 digits)
`luhn`	Credit card numbers
`iso_6346`	Shipping container numbers
`iban_mod97`	International bank account numbers
`awb_mod7`	Air waybill numbers
`mrz`	Machine Readable Zone (passport/ID)

Cross-Field Rules

Rule	Description
`SumEqualsRule`	Sum of fields equals target (with tolerance)
`DateBeforeRule`	Date A must precede Date B
`RequiredFieldRule`	Field must be non-null and non-empty
`FieldEqualsRule`	Two fields must be equal
`RangeRule`	Numeric field within min/max bounds
`RegexRule`	Field matches regex pattern

Cross-Document Validation

Validate consistency across related documents (e.g., Invoice + B/L + Packing List):

from docpick.validation.cross_document import create_trade_document_validator

validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)

OCR Engines

Engine	Type	GPU	Languages	Best For
PaddleOCR	Traditional OCR	Optional	111	General documents (default)
EasyOCR	Traditional OCR	Optional	80+	Korean text
GOT-OCR2.0	Vision-Language	Required	Multi	Complex layouts
VLM	Vision-Language	Required	Multi	Direct image → JSON

2-Tier Auto Engine

The default auto engine uses confidence-based fallback:

Tier 1 (CPU): PaddleOCR → EasyOCR
Tier 2 (GPU): GOT-OCR2.0 → VLM

If Tier 1 average confidence falls below threshold (default 0.7), automatically escalates to Tier 2.

LLM Providers

Provider	Endpoint	Default Model
vLLM	`http://localhost:8000/v1`	Qwen/Qwen3.5-32B-AWQ
Ollama	`http://localhost:11434`	qwen3.5:7b

Configure via CLI or YAML:

docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b

Error Handling

The pipeline is designed to be resilient:

OCR failure → automatic fallback to next available engine
LLM JSON parse failure → automatic retry with correction prompt (up to 1 retry)
Partial results → returns whatever was extracted, with errors logged in result.errors
Document load failure → returns empty result with error message

result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)

Batch Processing

Process entire directories with parallel workers:

from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema

processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)

print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")

Architecture

flowchart TD
    A["📄 Document\n(PDF / Image)"] --> B["DocumentLoader\n(pypdfium2)"]
    B --> C["Tier 1: OCR\n(PaddleOCR / EasyOCR)\nCPU"]
    C --> D{"Confidence\n≥ threshold?"}
    D -->|"yes"| F["LLM Extractor\n(vLLM / Ollama)\nSchema prompt"]
    D -->|"no"| E["Tier 2: VLM\n(GOT / VLM)\nGPU"]
    E --> F
    F --> G["Pydantic Validation"]
    G --> H["✅ ExtractionResult"]

License

Apache 2.0 — all dependencies are Apache 2.0 or MIT licensed.

_{Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.}

Recommended MCP Servers

0Latency MemoryPersistent memory layer for AI agents.

0nMCP — Universal AI API OrchestratorUniversal AI API Orchestrator — 1,554 tools, 96 services.

0xHumans Protocol MCPMCP for AI agents: financing, skills, lending on Base

1k Patient Mcp1k patient MCP server

1trippulse1trip PULSE: 21-tool AI travel planner.

3D AI Agent Avatar3D AI Agent Avatar — render any GLB, give it a Solana wallet, a voice, and pump.fun powers.89

How it compares

MCP document-extraction integration, not a general web-to-markdown scraper or semantic code search tool.

FAQ

Who is docpick for?

Developers using Claude Code, Cursor, or Codex who need agents to turn PDFs and images into JSON that matches their own schemas.

When should I use docpick?

Use it when you are scoping or building features that ingest forms, invoices, or scans and you want structured output without rewriting parsers for every layout.

How do I add docpick to my agent?

Install the docpick package from PyPI (0.1.2), add an MCP stdio server entry with identifier docpick in your host config, restart the client, and call its extraction tools with your schema.

AI & LLM Toolsautomationllm

About

Docpick by the numbers

Add your badge

What it does

Who is it for?

What you get

By the numbers

Docpick

Install

Quick Start

Python API

CLI

Built-in Schemas

Custom Schemas

Validation

Check Digit Algorithms

Cross-Field Rules

Cross-Document Validation

OCR Engines

2-Tier Auto Engine

LLM Providers

Error Handling

Batch Processing

Architecture

License

Recommended MCP Servers

How it compares

FAQ

Who is docpick for?

When should I use docpick?

How do I add docpick to my agent?

This week in AI coding