Translate Pdf

Name: Translate Pdf
Author: wshuyi

wshuyi/translate-pdf-skill

Extract every unique text span from a PDF with PyMuPDF so you can translate or localize document content without retyping pages by hand.

Overview

translate-pdf is an agent skill for the Build phase that extracts unique PDF text spans with PyMuPDF for downstream translation or localization.

Install

npx skills add https://github.com/wshuyi/translate-pdf-skill --skill translate-pdf

What is this skill?

PyMuPDF page walk: dict blocks → lines → spans with stripped text
Dedupes strings via a set and returns a sorted list for stable diffing
Optional --output JSON with ensure_ascii=False for CJK and RTL locales
Fails fast with install hint when pymupdf is missing
Pairs naturally with a translate-pdf workflow after extraction
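The dedupe, sort, and ensure_ascii=False behavior listed above can be tried with the standard library alone. This is a minimal sketch, not the skill's own code: the helper name to_inventory_json and the sample strings are hypothetical.

```python
import json


def to_inventory_json(spans, ensure_ascii=False):
    """Strip, dedupe, and sort span texts, then serialize them as JSON."""
    unique = {s.strip() for s in spans if s.strip()}
    return json.dumps(sorted(unique), ensure_ascii=ensure_ascii, indent=2)


# Hypothetical span texts: one duplicate after stripping, one whitespace-only span.
sample_spans = ["Hello ", "你好", "Hello", "   ", "مرحبا"]

readable = to_inventory_json(sample_spans)                    # keeps CJK/RTL characters as-is
escaped = to_inventory_json(sample_spans, ensure_ascii=True)  # emits \uXXXX escapes instead

print(readable)
```

With ensure_ascii=False the JSON file stays readable for translators working in CJK or RTL scripts. Both forms decode to the same list.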

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 832 installs on skills.sh; 12 GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

You have a PDF full of prose but no clean, deduplicated string list to feed a translator or agent workflow.

Who is it for?

Indie builders localizing docs, papers, or onboarding PDFs who want a scriptable extract step inside their agent repo.

Skip if: You only need OCR on scanned, image-only PDFs, or you need full layout-preserving reflow without a separate translation pipeline.

When should I use this skill?

You need to turn a PDF into a deduplicated list of text strings before translation or batch editing.

What do I get? / Deliverables

You get a sorted, unique text inventory as JSON or on stdout, ready for batch translation or glossary tooling in the same repo's skill chain.

Sorted unique text strings (stdout or JSON file)
Optional JSON path with extraction count message
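One way to carry the extracted list into the translation step is to turn it into the {"original": "translated"} mapping that the translate_pdf.py usage text describes. The sketch below is an assumption about workflow, not part of the skill: the helper name build_translation_template and the sample strings are hypothetical.

```python
import json


def build_translation_template(texts):
    """Map every extracted string to itself as a placeholder translation."""
    return {text: text for text in texts}


# Hypothetical contents of a texts.json file produced by the extract step.
texts = json.loads('["Chapter 1", "Introduction", "Page"]')

template = build_translation_template(texts)
template["Introduction"] = "Einführung"  # a translator or agent fills in entries

print(json.dumps(template, ensure_ascii=False, indent=2))
```

Mapping untranslated strings to themselves is a deliberate choice here: the translate_pdf.py excerpt only replaces a span when the new text differs from the original, so placeholder entries are left untouched.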

Recommended Skills

Lark Mail · larksuite/cli

Feishu email skill covering compose, send, reply, forward, search, drafts, attachments, contacts, and mail rules via lar…
209k installs · 13.7k stars

Lark Slides · larksuite/cli

Template and markup for building themed Lark Office slide presentations, including title slide styling for company meeti…
162k installs · 13.7k stars

Pptx · anthropics/skills

pptx is Anthropic’s agent skill for PowerPoint work inside Claude-powered coding and assistant flows. Solo builders reac…
138k installs · 148k stars

Pdf · anthropics/skills

pdf is a journey-wide Anthropic agent skill for anything involving PDF files: reading and extracting text or tables, mer…
130k installs · 148k stars

Lark Markdown · larksuite/cli

CLI-oriented skill for Lark Drive native Markdown: create, read, overwrite, diff, and localized patch with clear boundar…
125k installs · 13.7k stars

Docx · anthropics/skills

End-to-end Word document skill for creation, extraction, and structured editing of professional .docx files using pandoc…
118k installs · 148k stars

Journey fit

Primary fit

Document ingestion and prep sit in the Build phase under docs: turning PDFs into structured text is what you do before shipping localized copies or agent-readable corpora. The skill ships a Python CLI that dumps sorted unique strings to JSON or stdout, a common docs-pipeline step ahead of translation or CMS import.

Also useful

Grow: Content & marketing

How it compares

Use it instead of manually copying and pasting from a PDF viewer when you need deduplicated strings for automation.

Common Questions / FAQ

Who is translate-pdf for?

Solo and indie builders who work with PDF source docs and want agent-driven extraction before translation or content migration.

When should I use translate-pdf?

During Build-phase docs work, when you receive a PDF manual or paper and need every unique text span as JSON or CLI output before machine translation or CMS import.

Is translate-pdf safe to install?

Review the Security Audits panel on this Prism page and inspect the skill’s Python deps and file paths before running on sensitive PDFs.

SKILL.md


#!/usr/bin/env python3
"""
Extract all unique text strings from a PDF file.

Usage:
    python extract_texts.py <input.pdf> [--output <texts.json>]
"""

import json
import sys
import argparse

try:
    import pymupdf
except ImportError:
    print("Error: pymupdf not installed. Run: pip install pymupdf")
    sys.exit(1)


def extract_texts(input_path: str) -> list:
    """Extract all unique text strings from PDF."""
    doc = pymupdf.open(input_path)
    all_texts = set()

    for page in doc:
        text_dict = page.get_text("dict")

        for block in text_dict["blocks"]:
            if block.get("type") != 0:
                continue

            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if text:
                        all_texts.add(text)

    doc.close()
    return sorted(all_texts)


def main():
    parser = argparse.ArgumentParser(description="Extract text from PDF")
    parser.add_argument("input_pdf", help="Input PDF file")
    parser.add_argument("--output", "-o", help="Output JSON file (optional)")

    args = parser.parse_args()

    texts = extract_texts(args.input_pdf)

    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            json.dump(texts, f, ensure_ascii=False, indent=2)
        print(f"Extracted {len(texts)} unique texts to {args.output}")
    else:
        for t in texts:
            print(t)
        print(f"\n--- Total: {len(texts)} unique texts ---")


if __name__ == "__main__":
    main()
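The nested walk in extract_texts can be tried without a PDF or PyMuPDF by feeding the same loop a hand-built dictionary shaped like the value the script reads from page.get_text("dict"). This is a sketch under that assumption: the function name collect_span_texts and the sample data are hypothetical, and only the keys the script above reads are included.

```python
def collect_span_texts(text_dict):
    """Walk blocks -> lines -> spans and return sorted unique stripped texts."""
    texts = set()
    for block in text_dict["blocks"]:
        if block.get("type") != 0:  # the script keeps only type-0 blocks
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if text:  # drop whitespace-only spans
                    texts.add(text)
    return sorted(texts)


# Hypothetical page data: a duplicate "Title", a blank span, and a non-text block.
fake_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Title "}, {"text": " "}]}]},
        {"type": 1},
        {"type": 0, "lines": [{"spans": [{"text": "Body"}, {"text": "Title"}]}]},
    ]
}

result = collect_span_texts(fake_page)
print(result)
```

Because each span is stripped before it enters the set, "Title " and "Title" collapse into one entry.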


#!/usr/bin/env python3
"""
PDF Translation Script - Replace text in PDF while preserving structure and style.

Usage:
    python translate_pdf.py <input.pdf> <translations.json> <output.pdf> [--font <fontname>]

Arguments:
    input.pdf         Input PDF file path
    translations.json JSON file with translation mappings: {"original": "translated", ...}
    output.pdf        Output PDF file path
    --font            Font name for target language (default: helv, use china-ss for Chinese, japan for Japanese)
"""

import json
import sys
import argparse

try:
    import pymupdf
except ImportError:
    print("Error: pymupdf not installed. Run: pip install pymupdf")
    sys.exit(1)


def translate_pdf(input_path: str, translations: dict, output_path: str, fontname: str = "helv"):
    """
    Translate text in PDF using provided translation mappings.

    Args:
        input_path: Path to input PDF
        translations: Dict mapping original text to translated text
        output_path: Path for output PDF
        fontname: Font name for translated text (helv, china-ss, japan, korea, etc.)
    """
    doc = pymupdf.open(input_path)

    translated_count = 0
    total_spans = 0

    for page in doc:
        text_dict = page.get_text("dict")
        replacements = []

        for block in text_dict["blocks"]:
            if block.get("type") != 0:
                continue

            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    total_spans += 1
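                    original_text = span.get("text", "")
                    # extract_texts.py strips each span before saving it, so
                    # look up the stripped form to match keys built from its output.
                    lookup_key = original_text.strip()

                    if not lookup_key:
                        continue

                    if lookup_key in translations:
                        new_text = translations[lookup_key]

                        if new_text != lookup_key: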
                            bbox = span["bbox"]
                            font_size = span["size"]
                            color = span.get("color", 0)

                            if isinstance(color, int):
                                r = (color >> 16 & 0xFF) / 255
                                g = (color >> 8 & 0xFF) / 255
                                b = (color & 0xFF) / 255
                                text_color = (r, g, b)
                            else:
                                text_color = (0, 0, 0)
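The color handling in the excerpt above unpacks a packed integer into three floats between 0 and 1. A standalone sketch of that arithmetic follows, assuming the integer is a 24-bit 0xRRGGBB value as the shifts imply. The function name int_to_rgb is hypothetical, and the fallback to black mirrors the else branch in the excerpt.

```python
def int_to_rgb(color):
    """Convert a packed 0xRRGGBB integer to an (r, g, b) tuple of floats in [0, 1]."""
    if not isinstance(color, int):
        return (0, 0, 0)  # same fallback as the excerpt: default to black
    r = ((color >> 16) & 0xFF) / 255
    g = ((color >> 8) & 0xFF) / 255
    b = (color & 0xFF) / 255
    return (r, g, b)


print(int_to_rgb(0xFF0000))  # (1.0, 0.0, 0.0)
print(int_to_rgb(0x000000))  # (0.0, 0.0, 0.0)
```

The explicit parentheses around each shift do not change the result, since Python already evaluates >> before &. They only make the order easier to read.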
