
DeepSeek-OCR
Run DeepSeek-OCR locally or via vLLM to turn images and PDFs into markdown or structured text inside document pipelines and agent workflows.
Overview
DeepSeek-OCR is an agent skill most often used in Build (also Validate, Operate) that runs DeepSeek’s vision-language OCR on images and PDFs with multiple prompt modes and vLLM or Transformers backends.
Install
npx skills add https://github.com/aradotso/trending-skills --skill deepseek-ocr
What is this skill?
- DeepSeek-OCR vision-language OCR with "Contexts Optical Compression"
- Multiple prompt modes: document-to-markdown, free OCR, figure parsing, and grounding
- Native and dynamic resolution paths for images and PDFs
- vLLM high-throughput path (v0.8.5+cu118) or HuggingFace Transformers
- Prereqs called out: CUDA 11.8+, PyTorch 2.6.0, Python 3.12.9 via conda
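The prompt modes listed above can be kept in one small lookup so a pipeline selects a mode by name. This is a minimal sketch: the two prompt strings are copied from the SKILL.md examples further down this page, while the key names (`free_ocr`, `document_to_markdown`) and the `build_request` helper are hypothetical and not part of DeepSeek-OCR or vLLM. Prompts for figure parsing and grounding queries are left out because this page does not show them; take those from the upstream repository.

```python
# Prompt strings copied from the SKILL.md examples on this page.
# The dictionary keys are illustrative names, not upstream identifiers.
PROMPTS = {
    "free_ocr": "<image>\nFree OCR.",
    "document_to_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
}


def build_request(mode, image):
    """Return one request dict in the shape the vLLM examples pass to llm.generate."""
    if mode not in PROMPTS:
        raise ValueError(f"unknown prompt mode {mode!r}; known: {sorted(PROMPTS)}")
    return {"prompt": PROMPTS[mode], "multi_modal_data": {"image": image}}
```

In the vLLM examples the `image` argument would be a `PIL.Image` opened with `.convert("RGB")`; the helper itself does not inspect it.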
Adoption & trust: 1.3k installs on skills.sh; 31 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have scanned documents or PDFs and need reliable markdown or text extraction without rebuilding GPU inference and prompt recipes from scratch.
Who is it for?
Solo builders with a CUDA machine who want document-to-markdown, figure parsing, or grounding OCR inside agent or CLI pipelines.
Skip if: you only need quick CPU-only OCR, your team cannot host GPU workloads, or your workflow requires certified commercial OCR SLAs without self-hosting.
When should I use this skill?
Use it when the request is to use DeepSeek OCR, extract text from an image with DeepSeek, OCR a PDF, convert a document to markdown, or run vLLM DeepSeek-OCR inference.
What do I get? / Deliverables
You get a documented install and inference workflow so your agent can extract text via DeepSeek-OCR and feed downstream indexing, editing, or validation steps.
- Structured text or markdown from images/PDFs
- Configured vLLM or Transformers inference command
- Prompt mode selection for OCR vs grounding
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
OCR is most often wired in while building agent tooling and document ingestion, even though the same capability supports Validate-phase prototypes and Operate-phase document jobs. Agent-tooling is the canonical shelf because the skill centers on inference setup, prompt modes, and vLLM/HuggingFace execution, not mobile UI or generic backend CRUD alone.
Where it fits
Smoke-test whether scanned pitch decks parse cleanly before you commit to a RAG architecture.
Add a DeepSeek-OCR step so your agent converts uploaded PDFs to markdown for chunking and embedding.
Stand up a vLLM worker that serves OCR with document-to-markdown prompts for an internal API.
Re-run figure-parsing mode on new template PDFs when customers report extraction drift.
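For the "convert PDFs to markdown for chunking and embedding" use case above, the step after OCR might look like the sketch below. It is not part of the skill: `chunk_markdown` and its `max_chars` parameter are hypothetical names, and the splitting rule (break at markdown headings, then pack paragraphs up to a size limit) is one simple choice among many.

```python
import re

# Matches ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown, max_chars=1000):
    """Split markdown at headings, then pack paragraphs into size-limited chunks."""
    sections = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if lines or title:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if lines or title:
        sections.append((title, "\n".join(lines).strip()))

    chunks = []
    for title, body in sections:
        current = ""
        for paragraph in body.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chars:
                # Adding this paragraph would exceed the limit: emit and restart.
                chunks.append({"title": title, "text": current})
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append({"title": title, "text": current})
    return chunks
```

The sketch treats any line starting with `#` as a heading, so `#` comment lines inside fenced code blocks in the OCR output would be split incorrectly. A single paragraph longer than `max_chars` is kept whole rather than cut mid-sentence.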
How it compares
Self-hosted VLM OCR skill—not a cloud Document AI one-liner or a browser-only screenshot copy tool.
Common Questions / FAQ
Who is deepseek-ocr for?
Developers shipping document AI, knowledge bases, or agent tools who already run Python GPU stacks and want DeepSeek-OCR procedures in the agent.
When should I use deepseek-ocr?
In Build when ingesting PDFs/images; in Validate when prototyping doc understanding; in Operate when batch-converting support attachments—always with GPU setup time budgeted.
Is deepseek-ocr safe to install?
It pulls upstream model code and heavy dependencies; review Security Audits on this page, pin versions, and avoid sending sensitive documents to shared GPUs you do not control.
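One way to act on the "pin versions" advice is to compare the installed packages against the versions named in the install steps before running inference. The sketch below uses only the standard-library `importlib.metadata`; the `PINS` table repeats the versions from SKILL.md, and `installed_version` and `check_pins` are hypothetical helper names.

```python
from importlib import metadata

# Versions taken from the install steps in SKILL.md on this page.
PINS = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Local build tags such as "+cu118" are ignored for the comparison.
        base = found.split("+", 1)[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `check_pins(PINS)` inside the activated environment returns an empty dict when everything matches. The `lookup` parameter exists so the check can be exercised without the GPU stack installed.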
SKILL.md
# DeepSeek-OCR

> Skill by [ara.so](https://ara.so) - Daily 2026 Skills collection.

DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.

---

## Installation

### Prerequisites

- CUDA 11.8+, PyTorch 2.6.0
- Python 3.12.9 (via conda recommended)

### Setup

```bash
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu118

# Download vllm-0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
```

### Alternative: upstream vLLM (nightly)

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

---

## Model Download

Model is available on HuggingFace: `deepseek-ai/DeepSeek-OCR`

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")
```

---

## Inference: vLLM (Recommended for Production)

### Single Image - Streaming

```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params
)
print(outputs[0].outputs[0].text)
```

### Batch Images

```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")}
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)
for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)
```

### PDF Processing (via vLLM scripts)

```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py: set INP