Document Conversion

Name: Document Conversion
Author: athola

athola/claude-night-market

Convert PDFs, HTML, and other sources to clean markdown for agents using MCP markitdown first and native Read/WebFetch fallbacks.

Overview

document-conversion is an agent skill that converts files and URLs to sanitized markdown via MCP markitdown, with native fallbacks. It is most often used in Build (docs), and also fits Idea (research) and Grow (content).

Install

npx skills add https://github.com/athola/claude-night-market --skill document-conversion

What is this skill?

Tier 1: MCP markitdown (construct the URI, then call convert_to_markdown)
Tier 2 native fallbacks: PDF via Read with 20-page chunking; HTML via WebFetch
Falls back to Tier 2 on tool-not-found, connection-refused, or per-file conversion errors
Applies leyline:content-sanitization to successful markdown output
Documents limitations: tables become plain text and equations are lost on the PDF fallback; navigation noise remains on the HTML fallback
Fallback pipeline: MCP markitdown, then native tools, then a user notification for formats that neither tier covers
estimated_tokens: 400 on the fallback-tiers module frontmatter

Compatible agents: Claude Code, Cursor, any compatible agent

Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

What problem does it solve?

Your agent cannot use a PDF spec or HTML article because the content is trapped in binary or noisy page markup.

Who is it for?

Builders ingesting contracts, research PDFs, or marketing HTML into agent context when markitdown MCP may or may not be running.

Skip if you need pixel-perfect layout reproduction, reliable equation extraction from PDFs without Tier 1, or bulk OCR of scanned archives.

When should I use this skill?

You need markdown from office or web documents and want MCP markitdown first with Read/WebFetch fallbacks plus content sanitization.

What do I get? / Deliverables

You get markdown text through Tier 1 MCP or Tier 2 Read/WebFetch, passed through content sanitization, ready for summarization or implementation planning.

Sanitized markdown representation of the source document
Implicit tier used (MCP vs native fallback) for debugging quality issues


Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Document-to-markdown pipelines are shelved under Build → docs because they produce agent-ready knowledge artifacts from raw files. Fallback-tier conversion (MCP then Read/WebFetch) is documentation ingestion work, not frontend UI or API coding.

Also useful

Idea: Opportunity & market research

Grow: Content & marketing

Where it fits

Example use (Idea: Opportunity & market research)

Pull a competitor whitepaper PDF into markdown before you scope features.

Example use (Build: Docs & content)

Convert an uploaded API spec PDF into sanitized markdown for implementation tasks.

Example use (Grow: Content & marketing)

Fetch an HTML blog post and sanitize it before drafting a newsletter summary.

Example use (Validate: Scope & plan)

Ingest a pricing PDF from a prospect email to compare against your planned tiers.

How it compares

Structured fallback workflow across MCP and native tools—not a single-purpose markitdown-only snippet.

Common Questions / FAQ

Who is document-conversion for?

Solo builders and indie teams who need agents to ingest PDFs and HTML into markdown for planning, coding, or content reuse.

When should I use document-conversion?

Use it in Build (docs) for specs; Idea (research) when pulling competitor PDFs; Grow (content) when repurposing web articles—always when you need sanitized markdown, not raw binary.

Is document-conversion safe to install?

Tier 2 uses Read and WebFetch on paths and URLs you supply; review fetched domains and local files, and check the Security Audits panel on this Prism page before enabling network tools.

SKILL.md - Document Conversion

# Fallback Tier Instructions

## Tier 1: MCP markitdown

For all supported formats, the approach is the same:

1. Construct the URI (see `modules/uri-construction.md`)
2. Call `convert_to_markdown` with the URI
3. If the call succeeds, the result is markdown text
4. Apply `leyline:content-sanitization` to the output

**Detecting Tier 1 availability**: If the MCP tool call
returns an error like "tool not found", "server not
running", or "connection refused", Tier 1 is unavailable.
Proceed to Tier 2.

If the tool exists but returns a conversion error for
the specific file (corrupt file, unsupported variant),
also proceed to Tier 2.
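
The detection rule above can be sketched as a small classifier, assuming the Tier 1 error arrives as a plain string. The names `next_tier`, `tier1_is_down`, and `UNAVAILABLE_MARKERS` are hypothetical and chosen for illustration only; the marker strings mirror the examples in the text, and a real MCP client may word its errors differently.

```python
from typing import Optional

# Assumed marker strings, copied from the examples in the text above.
UNAVAILABLE_MARKERS = ("tool not found", "server not running", "connection refused")


def next_tier(error_message: Optional[str]) -> int:
    """Return 1 when Tier 1 succeeded (no error), otherwise 2.

    Both an unavailable server and a per-file conversion error
    lead to Tier 2, as the instructions describe.
    """
    return 1 if error_message is None else 2


def tier1_is_down(error_message: str) -> bool:
    """True when the error suggests the server itself is unavailable,
    as opposed to a failure on one specific file."""
    lowered = error_message.lower()
    return any(marker in lowered for marker in UNAVAILABLE_MARKERS)
```

One possible use of the second function: a per-file conversion error says nothing about the next file, so Tier 1 is still worth trying for it, whereas an unavailable server is likely to fail again.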

## Tier 2: Native Tool Fallbacks

### PDF

Use the Read tool with the `pages` parameter:

```
Read(file_path="/path/to/file.pdf", pages="1-20")
```

For remote PDFs, first fetch with WebFetch to get a
local path or use the URL directly with Read if supported.

**Chunking strategy for large PDFs:**

- Pages 1-20: first chunk
- Pages 21-40: second chunk
- Continue in 20-page increments
- Concatenate results
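
As a minimal sketch of the chunking arithmetic, the helper below generates the page-range strings to pass as the `pages` parameter. It assumes the total page count is already known; `page_chunks` is a hypothetical helper name, not part of the Read tool.

```python
from typing import Iterator


def page_chunks(total_pages: int, chunk_size: int = 20) -> Iterator[str]:
    """Yield page-range strings such as "1-20", "21-40" covering the document."""
    for start in range(1, total_pages + 1, chunk_size):
        # Clamp the final chunk to the last page of the document.
        end = min(start + chunk_size - 1, total_pages)
        yield f"{start}-{end}"


print(list(page_chunks(45)))  # ['1-20', '21-40', '41-45']
```

Each yielded string corresponds to one Read call; the results are then concatenated in order.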

**Limitations**: Tables render as plain text. Equations
are lost. Scanned pages produce no text. Images are
not extracted.

### HTML

Use WebFetch with the URL:

```
WebFetch(url="https://example.com/article.html")
```

**Limitations**: Includes navigation, headers, footers,
and boilerplate. Manually identify the main content
section and discard the rest.

### Images (PNG, JPG, GIF, WebP)

Use the Read tool to display the image visually:

```
Read(file_path="/path/to/image.png")
```

Claude sees the image and can describe its contents.

**Limitations**: No OCR text extraction. No EXIF metadata.
Good for visual inspection, not for extracting text from
screenshots or scanned documents.

### CSV

Use the Read tool to get raw comma-separated text:

```
Read(file_path="/path/to/data.csv")
```

Then format the first N rows as a markdown table manually
if needed for presentation.
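
For illustration, the manual formatting step could look like the following sketch, which uses Python's standard `csv` module. The function name `csv_to_markdown` and the `max_rows` default are assumptions made for this example, and the first CSV row is assumed to be the header.

```python
import csv
import io


def csv_to_markdown(csv_text: str, max_rows: int = 5) -> str:
    """Render the header plus the first max_rows data rows as a markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1 : max_rows + 1]
    width = len(header)

    def fmt(cells):
        # Pad or trim ragged rows to the header width and escape pipes,
        # which would otherwise split a cell in markdown.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

Using the `csv` module rather than splitting on commas keeps quoted fields that contain commas intact.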

### JSON and XML

Use the Read tool directly. The structured format is
readable as-is. Summarize or extract relevant sections
rather than converting the entire file.
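
One way to decide which sections are relevant is to outline the top-level structure first. The sketch below covers JSON only and assumes the file content is already available as text; `outline` is a hypothetical helper name.

```python
import json


def outline(json_text: str) -> dict:
    """Map each top-level key to a short description of its value."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        # Arrays and scalars have no keys to list; report the root type.
        return {"<root>": type(data).__name__}
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            summary[key] = type(value).__name__
    return summary
```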

## Tier 3: User Notification

For formats with no Tier 2 coverage, inform the user.

**Formats requiring Tier 3:**
DOCX, PPTX, XLSX/XLS, MSG, audio (MP3/WAV/M4A),
ZIP archives, EPUB.

**Notification template:**

> This {format} file requires the markitdown MCP server
> for conversion. Without it, I cannot extract the content.
>
> **Option A**: Install markitdown-mcp by adding to
> `.mcp.json`:
> ```json
> {"mcpServers": {"markitdown": {"type": "stdio",
>   "command": "uvx", "args": ["markitdown-mcp"]}}}
> ```
>
> **Option B**: Convert the file to PDF or HTML manually,
> then I can process it with built-in tools.

**Do NOT guess or fabricate content** from a document you
cannot read. Clearly state the limitation.
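
The Tier 3 decision can be sketched as an extension lookup, assuming the format is identified by file extension alone. `TIER3_EXTENSIONS`, `TEMPLATE`, and `tier3_notice` are hypothetical names; the extension set is transcribed from the list above, and the message is the first sentence pair of the notification template.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the "Formats requiring Tier 3" list above.
TIER3_EXTENSIONS = {
    ".docx", ".pptx", ".xlsx", ".xls", ".msg",
    ".mp3", ".wav", ".m4a", ".zip", ".epub",
}

TEMPLATE = (
    "This {format} file requires the markitdown MCP server for conversion. "
    "Without it, I cannot extract the content."
)


def tier3_notice(file_path: str) -> Optional[str]:
    """Return the notification text for a Tier 3 format, or None otherwise."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in TIER3_EXTENSIONS:
        return None
    return TEMPLATE.format(format=suffix.lstrip(".").upper())
```

Returning `None` for other formats leaves the Tier 2 path untouched, so the notice is only produced when no native fallback exists.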


---
name: format-matrix
description: >-
  Document format support matrix showing conversion quality
  across the three fallback tiers.
estimated_tokens: 300
---

# Format Support Matrix

Quality ratings: High (preserves structure, tables, images),
Medium (readable but loses some formatting), Low (raw text
or visual only), None (not supported at this tier).

## Office Documents

| Format | Tier 1 (markitdown) | Tier 2 (native) | Notes |
|--------|---------------------|------------------|-------|
| PDF | High: structure, tables, OCR | Medium: Read tool, 20pp chunks | Native loses table formatting |
| DOCX | High: headings, lists, tables | None | Tier 3 only without markitdown |
| PPTX | High: slide-by-slide, speaker notes | None | Tier 3 only |
| XLSX/XLS | High: tables to markdown | None | Tier 3 only |
| MSG | High: email headers and body | None | Outlook format, Tier 3 only |

## Web and Data Formats

| Format | Tier 1 (markitdown) | Tier 2 (native) | Notes |
|--------|-----------



