
Smart OCR
Pull readable text, bounding boxes, and confidence scores from screenshots, scans, and photos using PaddleOCR inside an agent workflow.
Overview
Smart OCR is an agent skill for the Build phase that extracts text from images and scanned documents using PaddleOCR with multilingual support and position metadata.
Install
npx skills add https://github.com/skills.volces.com --skill smart-ocr
What is this skill?
- PaddleOCR-based extraction with angle classification for skewed scans
- 100+ language support including mixed Chinese and English prompts
- Returns per-line text, quadrilateral boxes, and confidence scores
- Works on screenshots, scanned PDFs, business cards, and handwritten images
- Example Python init and result parsing included in the skill
- PaddleOCR library referenced at ~69k GitHub stars in skill metadata
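The quadrilateral boxes mentioned above are four `[x, y]` corner points per line, which is awkward for cropping or layout work. The following is a minimal sketch, assuming the `[box, (text, confidence)]` entry shape shown in the skill's SKILL.md, of how a box could be reduced to an axis-aligned rectangle and how lines could be put into rough reading order. The helper names `quad_to_rect` and `reading_order`, the `row_tolerance` parameter, and the sample values are hypothetical and are not part of PaddleOCR or the skill.

```python
from typing import List, Sequence, Tuple

Quad = Sequence[Sequence[float]]  # four [x, y] corners, as in the skill's result format


def quad_to_rect(quad: Quad) -> Tuple[float, float, float, float]:
    """Return the axis-aligned rectangle (x_min, y_min, x_max, y_max) enclosing a quad."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return min(xs), min(ys), max(xs), max(ys)


def reading_order(lines: List[list], row_tolerance: float = 10.0) -> List[list]:
    """Sort [box, (text, confidence)] entries top-to-bottom, then left-to-right.

    Entries whose top edge lies within row_tolerance pixels of the first entry
    in a row are treated as belonging to that row.
    """
    by_top = sorted(lines, key=lambda item: quad_to_rect(item[0])[1])
    rows: List[List[list]] = []
    row_top = None
    for item in by_top:
        top = quad_to_rect(item[0])[1]
        if row_top is None or top - row_top > row_tolerance:
            rows.append([])  # start a new row
            row_top = top
        rows[-1].append(item)
    # Within each row, order entries by their left edge.
    return [
        item
        for row in rows
        for item in sorted(row, key=lambda item: quad_to_rect(item[0])[0])
    ]


# Illustrative data only; real entries would come from the OCR result.
sample = [
    [[[200, 12], [320, 12], [320, 40], [200, 40]], ("World", 0.97)],
    [[[10, 10], [150, 14], [148, 42], [8, 38]], ("Hello", 0.99)],
    [[[10, 80], [180, 80], [180, 110], [10, 110]], ("Second line", 0.91)],
]

for box, (text, confidence) in reading_order(sample):
    print(quad_to_rect(box), text, confidence)
```

The row tolerance is a crude heuristic: it works for roughly horizontal text and would need tuning, or a different grouping strategy, for multi-column or strongly rotated layouts.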
Adoption & trust: 1 install on skills.sh; 1/1 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You have screenshots, scans, or photos full of text but no fast way to feed that content into search, agents, or databases.
Who is it for?
Indie builders adding document ingestion, screenshot parsing, or multilingual OCR to Python-backed agent tools.
Skip if: You need production OCR at scale but do not have your own hosting, compliance review, and image pre-processing pipeline.
When should I use this skill?
Use it when a user provides an image or scanned document and asks to extract, read, or OCR its text, optionally with language hints.
What do I get? / Deliverables
You receive machine-readable text lines with bounding boxes and confidence so downstream code, RAG, or agents can cite or transform the content.
- Structured OCR lines with text, boxes, and confidence scores
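To make the deliverable concrete, here is a minimal sketch of turning one page of OCR output into JSON-serializable records that downstream code or a RAG index could consume. It assumes the `[box, (text, confidence)]` entry shape shown in the skill's SKILL.md; the function name `to_records`, the `min_confidence` parameter, and the sample values are hypothetical. The `None` guard is also an assumption, since some PaddleOCR releases are understood to return `None` for a page with no detections.

```python
import json
from typing import Any, Dict, List, Optional


def to_records(page: Optional[list], min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Convert one page of [box, (text, confidence)] entries into plain dicts."""
    records = []
    for box, (text, confidence) in page or []:  # treat a None page as empty
        if confidence >= min_confidence:
            records.append({
                "text": text,
                "box": [[float(x), float(y)] for x, y in box],
                "confidence": round(float(confidence), 4),
            })
    return records


# Illustrative data only; real values would come from the OCR result for a page.
sample_page = [
    [[[10, 10], [150, 10], [150, 40], [10, 40]], ("Invoice", 0.98)],
    [[[10, 60], [90, 60], [90, 90], [10, 90]], ("T0tal", 0.42)],
]

records = to_records(sample_page, min_confidence=0.5)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Filtering on confidence before indexing is a design choice rather than a requirement: a low threshold keeps more text at the cost of more recognition noise, so the cutoff is best chosen per document type.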
How it compares
Skill-wrapped PaddleOCR integration, not a hosted SaaS OCR API with managed SLAs.
Common Questions / FAQ
Who is smart-ocr for?
Solo developers and agent users who need local or scriptable OCR from PaddleOCR during Build integrations.
When should I use smart-ocr?
While building features that ingest scans, screenshots, or photos, especially when you need 100+ languages or mixed-language business cards and forms.
Is smart-ocr safe to install?
OCR runs Python and reads image files; confirm dependency sources and review the Security Audits panel on this Prism page before enabling in sensitive environments.
SKILL.md
# Smart OCR Skill

## Overview

This skill enables intelligent text extraction from images and scanned documents using **PaddleOCR**, a leading OCR engine supporting 100+ languages. Extract text from photos, screenshots, scanned PDFs, and handwritten documents with high accuracy.

## How to Use

1. Provide the image or scanned document
2. Optionally specify language(s) to detect
3. I'll extract text with position and confidence data

**Example prompts:**

- "Extract all text from this screenshot"
- "OCR this scanned PDF document"
- "Read the text from this business card photo"
- "Extract Chinese and English text from this image"

## Domain Knowledge

### PaddleOCR Fundamentals

```python
from paddleocr import PaddleOCR

# Initialize OCR engine
ocr = PaddleOCR(use_angle_cls=True, lang='en')

# Run OCR on image
result = ocr.ocr('image.png', cls=True)

# Result structure: [[box, (text, confidence)], ...]
for line in result[0]:
    box = line[0]      # [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    text = line[1][0]  # Extracted text
    conf = line[1][1]  # Confidence score
    print(f"{text} ({conf:.2f})")
```

### Supported Languages

```python
# Common language codes (as listed by the skill; accepted codes vary by
# PaddleOCR release, so check the supported-language list for your version)
languages = {
    'en': 'English',
    'ch': 'Chinese (Simplified)',
    'cht': 'Chinese (Traditional)',
    'japan': 'Japanese',
    'korean': 'Korean',
    'french': 'French',
    'german': 'German',
    'spanish': 'Spanish',
    'russian': 'Russian',
    'arabic': 'Arabic',
    'hindi': 'Hindi',
    'vi': 'Vietnamese',
    'th': 'Thai',
    # ... 100+ languages supported
}

# Use specific language
ocr = PaddleOCR(lang='ch')            # Chinese
ocr = PaddleOCR(lang='japan')         # Japanese
ocr = PaddleOCR(lang='multilingual')  # Auto-detect
```

### Configuration Options

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    # Detection settings
    det_model_dir=None,        # Custom detection model
    det_limit_side_len=960,    # Max side length for detection
    det_db_thresh=0.3,         # Binarization threshold
    det_db_box_thresh=0.5,     # Box score threshold

    # Recognition settings
    rec_model_dir=None,        # Custom recognition model
    rec_char_dict_path=None,   # Custom character dictionary

    # Angle classification
    use_angle_cls=True,        # Enable angle classification
    cls_model_dir=None,        # Custom classification model

    # Language
    lang='en',                 # Language code

    # Performance
    use_gpu=True,              # Use GPU if available
    gpu_mem=500,               # GPU memory limit (MB)
    enable_mkldnn=True,        # CPU optimization

    # Output
    show_log=False,            # Suppress logs
)
```

### Processing Different Sources

#### Image Files

```python
# Single image
result = ocr.ocr('image.png')

# Multiple images
images = ['img1.png', 'img2.png', 'img3.png']
for img in images:
    result = ocr.ocr(img)
    process_result(result)  # placeholder for your own result handler
```

#### PDF Files (Scanned)

```python
import os

from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    """OCR a scanned PDF."""
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)

    all_text = []
    for i, img in enumerate(images):
        # Save temp image
        temp_path = f'temp_page_{i}.png'
        img.save(temp_path)

        # OCR the image
        result = ocr.ocr(temp_path)

        # Extract text (result[0] may be None when a page has no detected text)
        page_text = '\n'.join([line[1][0] for line in (result[0] or [])])
        all_text.append(f"--- Page {i+1} ---\n{page_text}")

        os.remove(temp_path)

    return