
Document Conversion
Convert PDFs, HTML, and other sources to clean markdown for agents, trying MCP markitdown first and falling back to native Read/WebFetch.
Overview
document-conversion is an agent skill that converts files and URLs to sanitized markdown via MCP markitdown with native fallbacks. It is most often used in Build (docs), and also fits Idea (research) and Grow (content).
Install
npx skills add https://github.com/athola/claude-night-market --skill document-conversion
What is this skill?
- Tier 1: MCP markitdown (construct the URI, then call convert_to_markdown)
- Tier 2 native fallbacks: PDF via Read with 20-page chunking; HTML via WebFetch
- Detects Tier 1 outage via tool-not-found, connection refused, or per-file conversion errors
- Applies leyline:content-sanitization to successful markdown output
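- Documents limitations: tables become plain text and equations are lost on the PDF fallback; navigation noise remains on the HTML fallback
- 2-tier conversion pipeline (MCP markitdown, then native tools), with a Tier 3 user notification for formats neither tier covers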
- PDF chunking in 20-page increments
- estimated_tokens: 400 on fallback-tiers module frontmatter
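The tier order and the 20-page chunking listed above can be sketched in Python. This is a minimal illustration, not the skill's implementation: the skill runs as agent tool calls, so `tier1`, `tier2`, and `sanitize` are hypothetical callables standing in for `convert_to_markdown`, the Read/WebFetch fallbacks, and `leyline:content-sanitization`, and the function names are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Error fragments the skill text treats as "Tier 1 is unavailable".
OUTAGE_MARKERS = ("tool not found", "server not running", "connection refused")


def page_chunks(total_pages: int, chunk_size: int = 20) -> List[str]:
    """Build page-range strings ("1-20", "21-40", ...) covering total_pages."""
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        chunks.append(f"{start}-{end}")
    return chunks


def classify_tier1_error(message: str) -> str:
    """Label a Tier 1 failure as an outage or a per-file conversion error."""
    lowered = message.lower()
    if any(marker in lowered for marker in OUTAGE_MARKERS):
        return "tier1_unavailable"
    return "tier1_conversion_error"


def convert_with_fallback(
    source: str,
    tier1: Callable[[str], str],     # hypothetical stand-in for convert_to_markdown
    tier2: Callable[[str], str],     # hypothetical stand-in for Read / WebFetch
    sanitize: Callable[[str], str],  # hypothetical stand-in for content sanitization
) -> Tuple[str, str]:
    """Return (sanitized_markdown, note), where note records which tier ran."""
    try:
        markdown, note = tier1(source), "tier1"
    except Exception as exc:
        # Both failure kinds lead to Tier 2; the reason is kept for debugging.
        reason = classify_tier1_error(str(exc))
        markdown, note = tier2(source), f"tier2 ({reason})"
    return sanitize(markdown), note
```

Under these assumptions, `page_chunks(45)` yields `["1-20", "21-40", "41-45"]`, and the note returned by `convert_with_fallback` records which tier produced the markdown, which helps when tracing quality issues back to the fallback path.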
Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
Your agent cannot use a PDF spec or HTML article because the content is trapped in binary or noisy page markup.
Who is it for?
Builders ingesting contracts, research PDFs, or marketing HTML into agent context when markitdown MCP may or may not be running.
Skip if: you need pixel-perfect layout reproduction, reliable equation extraction from PDFs without Tier 1, or bulk OCR of scanned archives.
When should I use this skill?
You need markdown from office or web documents and want MCP markitdown first with Read/WebFetch fallbacks plus content sanitization.
What do I get? / Deliverables
You get markdown text through Tier 1 MCP or Tier 2 Read/WebFetch, passed through content sanitization, ready for summarization or implementation planning.
- Sanitized markdown representation of the source document
- Implicit tier used (MCP vs native fallback) for debugging quality issues
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Document-to-markdown pipelines are shelved under Build → docs because they produce agent-ready knowledge artifacts from raw files. Fallback-tier conversion (MCP then Read/WebFetch) is documentation ingestion work, not frontend UI or API coding.
Where it fits
Pull a competitor whitepaper PDF into markdown before you scope features.
Convert an uploaded API spec PDF into sanitized markdown for implementation tasks.
Fetch an HTML blog post and sanitize it before drafting a newsletter summary.
Ingest a pricing PDF from a prospect email to compare against your planned tiers.
How it compares
A structured fallback workflow across MCP and native tools, not a single-purpose markitdown-only snippet.
Common Questions / FAQ
Who is document-conversion for?
Solo builders and indie teams who need agents to ingest PDFs and HTML into markdown for planning, coding, or content reuse.
When should I use document-conversion?
Use it in Build (docs) for specs; Idea (research) when pulling competitor PDFs; Grow (content) when repurposing web articles—always when you need sanitized markdown, not raw binary.
Is document-conversion safe to install?
Tier 2 uses Read and WebFetch on paths and URLs you supply; review fetched domains and local files, and check the Security Audits panel on this Prism page before enabling network tools.
SKILL.md
# Fallback Tier Instructions

## Tier 1: MCP markitdown

For all supported formats, the approach is the same:

1. Construct the URI (see `modules/uri-construction.md`)
2. Call `convert_to_markdown` with the URI
3. If the call succeeds, the result is markdown text
4. Apply `leyline:content-sanitization` to the output

**Detecting Tier 1 availability**: If the MCP tool call returns an error like "tool not found", "server not running", or "connection refused", Tier 1 is unavailable. Proceed to Tier 2.

If the tool exists but returns a conversion error for the specific file (corrupt file, unsupported variant), also proceed to Tier 2.

## Tier 2: Native Tool Fallbacks

### PDF

Use the Read tool with the `pages` parameter:

```
Read(file_path="/path/to/file.pdf", pages="1-20")
```

For remote PDFs, first fetch with WebFetch to get a local path or use the URL directly with Read if supported.

**Chunking strategy for large PDFs:**

- Pages 1-20: first chunk
- Pages 21-40: second chunk
- Continue in 20-page increments
- Concatenate results

**Limitations**: Tables render as plain text. Equations are lost. Scanned pages produce no text. Images are not extracted.

### HTML

Use WebFetch with the URL:

```
WebFetch(url="https://example.com/article.html")
```

**Limitations**: Includes navigation, headers, footers, and boilerplate. Manually identify the main content section and discard the rest.

### Images (PNG, JPG, GIF, WebP)

Use the Read tool to display the image visually:

```
Read(file_path="/path/to/image.png")
```

Claude sees the image and can describe its contents.

**Limitations**: No OCR text extraction. No EXIF metadata. Good for visual inspection, not for extracting text from screenshots or scanned documents.

### CSV

Use the Read tool to get raw comma-separated text:

```
Read(file_path="/path/to/data.csv")
```

Then format the first N rows as a markdown table manually if needed for presentation.

### JSON and XML

Use the Read tool directly. The structured format is readable as-is.
Summarize or extract relevant sections rather than converting the entire file.

## Tier 3: User Notification

For formats with no Tier 2 coverage, inform the user.

**Formats requiring Tier 3:** DOCX, PPTX, XLSX/XLS, MSG, audio (MP3/WAV/M4A), ZIP archives, EPUB.

**Notification template:**

> This {format} file requires the markitdown MCP server
> for conversion. Without it, I cannot extract the content.
>
> **Option A**: Install markitdown-mcp by adding to
> `.mcp.json`:
> ```json
> {"mcpServers": {"markitdown": {"type": "stdio",
>   "command": "uvx", "args": ["markitdown-mcp"]}}}
> ```
>
> **Option B**: Convert the file to PDF or HTML manually,
> then I can process it with built-in tools.

**Do NOT guess or fabricate content** from a document you cannot read. Clearly state the limitation.

---
name: format-matrix
description: >-
  Document format support matrix showing conversion
  quality across the three fallback tiers.
estimated_tokens: 300
---

# Format Support Matrix

Quality ratings: High (preserves structure, tables, images), Medium (readable but loses some formatting), Low (raw text or visual only), None (not supported at this tier).

## Office Documents

| Format | Tier 1 (markitdown) | Tier 2 (native) | Notes |
|--------|---------------------|------------------|-------|
| PDF | High: structure, tables, OCR | Medium: Read tool, 20pp chunks | Native loses table formatting |
| DOCX | High: headings, lists, tables | None | Tier 3 only without markitdown |
| PPTX | High: slide-by-slide, speaker notes | None | Tier 3 only |
| XLSX/XLS | High: tables to markdown | None | Tier 3 only |
| MSG | High: email headers and body | None | Outlook format, Tier 3 only |

## Web and Data Formats

| Format | Tier 1 (markitdown) | Tier 2 (native) | Notes |
|--------|-----------