
Translate Pdf
Extract every unique text span from a PDF with PyMuPDF so you can translate or localize document content without retyping pages by hand.
Overview
translate-pdf is an agent skill for the Build phase that extracts unique PDF text spans with PyMuPDF for downstream translation or localization.
Install
npx skills add https://github.com/wshuyi/translate-pdf-skill --skill translate-pdf
What is this skill?
- PyMuPDF page walk: dict blocks → lines → spans with stripped text
- Dedupes strings via a set and returns a sorted list for stable diffing
- Optional --output JSON with ensure_ascii=False for CJK and RTL locales
- Fails fast with install hint when pymupdf is missing
- Pairs naturally with a translate-pdf workflow after extraction
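The walk-and-dedupe behavior in the list above can be sketched without PyMuPDF installed, by operating on dictionaries shaped like the output of `page.get_text("dict")`. The `unique_span_texts` helper and the sample page data are hypothetical and exist only for illustration; the real skill reads its pages from a PDF.

```python
import json


def unique_span_texts(page_dicts):
    """Collect stripped, non-empty span texts, deduplicated and sorted."""
    seen = set()
    for page in page_dicts:
        for block in page.get("blocks", []):
            if block.get("type") != 0:  # the skill keeps only type-0 (text) blocks
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if text:
                        seen.add(text)
    return sorted(seen)


# Hypothetical sample data, not taken from a real PDF.
sample_pages = [
    {"blocks": [
        {"type": 0, "lines": [{"spans": [{"text": " Hello "}, {"text": "世界"}]}]},
        {"type": 1},  # non-text block, skipped
    ]},
    {"blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Hello"}, {"text": "   "}]}]},
    ]},
]

texts = unique_span_texts(sample_pages)
# ensure_ascii=False keeps CJK characters readable instead of \uXXXX escapes.
payload = json.dumps(texts, ensure_ascii=False, indent=2)
print(payload)
```

Sorting the set is what makes repeated runs diff cleanly: set iteration order is not guaranteed, while the sorted list is.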
Adoption & trust: 832 installs on skills.sh; 12 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a PDF full of prose but no clean, deduplicated string list to feed a translator or agent workflow.
Who is it for?
Indie builders localizing docs, papers, or onboarding PDFs who want a scriptable extract step inside their agent repo.
Skip if: you only need OCR on scanned, image-only PDFs, or you want full layout-preserving reflow without a separate translation pipeline.
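A rough way to spot the "Skip if" case is to check whether any text block contains a non-blank span. The sketch below assumes page dictionaries shaped like `page.get_text("dict")` output and assumes that non-text content arrives as blocks whose `type` is not 0, as the skill's own filter implies. The `has_text_layer` name and the sample data are hypothetical.

```python
def has_text_layer(page_dicts):
    """Return True if any type-0 block holds at least one non-blank span."""
    for page in page_dicts:
        for block in page.get("blocks", []):
            if block.get("type") != 0:
                continue
            for line in block.get("lines", []):
                spans = line.get("spans", [])
                if any(span.get("text", "").strip() for span in spans):
                    return True
    return False


# Hypothetical samples: one page with only a non-text block, one with text.
scanned_like = [{"blocks": [{"type": 1}]}]
with_text = [{"blocks": [{"type": 0, "lines": [{"spans": [{"text": "Chapter 1"}]}]}]}]

print(has_text_layer(scanned_like))
print(has_text_layer(with_text))
```

If this returns False for every page, extraction would produce an empty list and an OCR step is likely needed first.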
When should I use this skill?
You need to turn a PDF into a deduplicated list of text strings before translation or batch editing.
What do I get? / Deliverables
You get a sorted, unique text inventory as JSON or stdout, ready for batch translation or glossary tooling in the same skill chain.
- Sorted unique text strings (stdout or JSON file)
- Optional JSON path with extraction count message
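The sorted list can be turned into a starting point for the `{"original": "translated"}` mapping that `translate_pdf.py` reads. The sketch below maps each string to itself, on the assumption that identical values are left untouched (the script only replaces a span when the new text differs from the original). The `build_translation_template` helper and the sample strings are hypothetical.

```python
import json


def build_translation_template(texts):
    """Map each extracted string to itself as a placeholder translation."""
    return {text: text for text in texts}


# Hypothetical extracted strings standing in for the skill's JSON output.
extracted = ["Getting started", "Install", "世界"]

template = build_translation_template(extracted)
template_json = json.dumps(template, ensure_ascii=False, indent=2)
print(template_json)
```

One caveat visible in the scripts themselves: extraction strips whitespace from each span, while the translation step looks up the raw span text, so spans with leading or trailing spaces may not match keys built this way.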
Journey fit
Document ingestion and prep sit in Build under docs—turning PDFs into structured text is what you do before shipping localized copies or agent-readable corpora. The skill ships a Python CLI that dumps sorted unique strings to JSON or stdout, which is the standard docs-pipeline step ahead of translation or CMS import.
How it compares
Use instead of manually copy-pasting from a PDF viewer when you need deduplicated strings for automation.
Common Questions / FAQ
Who is translate-pdf for?
Solo and indie builders who work with PDF source docs and want agent-driven extraction before translation or content migration.
When should I use translate-pdf?
During Build docs work, when you receive a PDF manual or paper and need JSON or CLI output of every unique text span before machine translation or CMS import.
Is translate-pdf safe to install?
Review the Security Audits panel on this Prism page and inspect the skill’s Python deps and file paths before running on sensitive PDFs.
SKILL.md
extract_texts.py

#!/usr/bin/env python3
"""
Extract all unique text strings from a PDF file.

Usage:
    python extract_texts.py <input.pdf> [--output <texts.json>]
"""

import json
import sys
import argparse

try:
    import pymupdf
except ImportError:
    print("Error: pymupdf not installed. Run: pip install pymupdf")
    sys.exit(1)


def extract_texts(input_path: str) -> list:
    """Extract all unique text strings from PDF."""
    doc = pymupdf.open(input_path)
    all_texts = set()

    for page in doc:
        text_dict = page.get_text("dict")
        for block in text_dict["blocks"]:
            if block.get("type") != 0:
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if text:
                        all_texts.add(text)

    doc.close()
    return sorted(all_texts)


def main():
    parser = argparse.ArgumentParser(description="Extract text from PDF")
    parser.add_argument("input_pdf", help="Input PDF file")
    parser.add_argument("--output", "-o", help="Output JSON file (optional)")
    args = parser.parse_args()

    texts = extract_texts(args.input_pdf)

    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            json.dump(texts, f, ensure_ascii=False, indent=2)
        print(f"Extracted {len(texts)} unique texts to {args.output}")
    else:
        for t in texts:
            print(t)
        print(f"\n--- Total: {len(texts)} unique texts ---")


if __name__ == "__main__":
    main()

translate_pdf.py

#!/usr/bin/env python3
"""
PDF Translation Script - Replace text in PDF while preserving structure and style.

Usage:
    python translate_pdf.py <input.pdf> <translations.json> <output.pdf> [--font <fontname>]

Arguments:
    input.pdf           Input PDF file path
    translations.json   JSON file with translation mappings: {"original": "translated", ...}
    output.pdf          Output PDF file path
    --font              Font name for target language (default: helv, use china-ss for Chinese, japan for Japanese)
"""

import json
import sys
import argparse

try:
    import pymupdf
except ImportError:
    print("Error: pymupdf not installed. Run: pip install pymupdf")
    sys.exit(1)


def translate_pdf(input_path: str, translations: dict, output_path: str, fontname: str = "helv"):
    """
    Translate text in PDF using provided translation mappings.

    Args:
        input_path: Path to input PDF
        translations: Dict mapping original text to translated text
        output_path: Path for output PDF
        fontname: Font name for translated text (helv, china-ss, japan, korea, etc.)
    """
    doc = pymupdf.open(input_path)
    translated_count = 0
    total_spans = 0

    for page in doc:
        text_dict = page.get_text("dict")
        replacements = []

        for block in text_dict["blocks"]:
            if block.get("type") != 0:
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    total_spans += 1
                    original_text = span.get("text", "")
                    if not original_text.strip():
                        continue
                    if original_text in translations:
                        new_text = translations[original_text]
                        if new_text != original_text:
                            bbox = span["bbox"]
                            font_size = span["size"]
                            color = span.get("color", 0)
                            if isinstance(color, int):
                                r = (color >> 16 & 0xFF) / 255
                                g = (color >> 8 & 0xFF) / 255
                                b = (color & 0xFF) / 255
                                text_color = (r, g, b)
                            else:
                                text_color = (0, 0, 0)
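The color handling in the `translate_pdf` fragment unpacks a packed integer into three floats. The same arithmetic can be checked in isolation, assuming the integer is laid out as 0xRRGGBB, which is what the shifts in the script imply. The `srgb_int_to_rgb` name is hypothetical.

```python
def srgb_int_to_rgb(color):
    """Unpack a 24-bit 0xRRGGBB integer into (r, g, b) floats in [0, 1]."""
    r = ((color >> 16) & 0xFF) / 255  # top byte: red
    g = ((color >> 8) & 0xFF) / 255   # middle byte: green
    b = (color & 0xFF) / 255          # low byte: blue
    return (r, g, b)


print(srgb_int_to_rgb(0xFF0000))
print(srgb_int_to_rgb(0x336699))
print(srgb_int_to_rgb(0x000000))
```

The explicit parentheses around each shift are only for readability: in Python `>>` binds tighter than `&`, so `color >> 16 & 0xFF` in the script evaluates the same way.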