
Defuddle
Pull clean article markdown and metadata from noisy web pages or local HTML for research and docs.
Overview
Defuddle is an agent skill that extracts the main article content from URLs or local HTML into clean Markdown with JSON metadata. It is most often used in the Idea phase and also fits Grow and Build.
Install
npx skills add https://github.com/joeseesun/defuddle-skill --skill defuddle
What is this skill?
- Strips ads, sidebars, nav, and clutter to return main article content
- Default CLI: defuddle parse "<url>" -m -j for Markdown plus JSON metadata
- Five-step workflow: extract, present a summary preview, ask for a save directory on first use, save the file, confirm the path
- Triggers: defuddle, extract article, clean this page, strip clutter, web extract
- One-time global npm install of defuddle and jsdom when the binary is missing
Adoption & trust: 1.1k installs on skills.sh; 103 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need the substance of a web article but the page HTML is full of ads, navigation, and sidebars that waste tokens and ruin summaries.
Who is it for?
Indie builders researching competitors, docs, or essays who want agent-friendly Markdown without writing a custom scraper.
Skip if: the page needs an interactive login or heavy JavaScript rendering that the CLI cannot handle, or the workflow requires a full-site crawl rather than single-article extraction.
When should I use this skill?
Use it when you want to extract or clean web page content, strip clutter from HTML, get article text from a URL, or convert pages to clean Markdown. Triggers include defuddle, extract article, clean this page, and get content from URL.
What do I get? / Deliverables
You receive clean Markdown, metadata fields like title and wordCount, a short preview, and an optional saved file in your chosen directory.
- Clean Markdown article body
- JSON metadata including title, author, and wordCount
- Optional saved file in user-chosen directory
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The canonical shelf is Idea research because the default workflow is built around reading and saving source material before you build or publish. The Research subphase covers URL and HTML extraction, clutter removal, and summarizing title, author, and word count for later citation.
Where it fits
Extract a competitor’s manifesto post into Markdown for your positioning notes.
Pull a long-form source article cleanly before outlining your newsletter or blog rewrite.
Archive a third-party API guide as Markdown while writing integration docs.
How it compares
An opinionated single-URL article extractor with a save workflow, not a headless browser automation suite or a generic wget mirror.
Common Questions / FAQ
Who is defuddle for?
Solo builders and agents who read the open web or local HTML and need clutter-free Markdown plus metadata for notes, research briefs, or content drafts.
When should I use defuddle?
In Idea research when capturing articles, in Grow content when repurposing sources, or in Build docs when archiving reference pages—whenever triggers like extract article, clean this page, or get content from URL appear.
Is defuddle safe to install?
The skill runs a global npm CLI that fetches URLs you provide; review the Security Audits panel on this Prism page and pin or audit the defuddle and jsdom packages before use on sensitive networks.
SKILL.md
# Defuddle - Web Content Extraction

Extract main article content from web pages, removing ads, sidebars, navigation, and other clutter. Output clean Markdown with metadata.

## Prerequisites

Before first use, check if `defuddle` is installed:

```bash
command -v defuddle >/dev/null 2>&1 || npm install -g defuddle jsdom
```

## Default Workflow

When user provides a URL, follow this workflow:

### Step 1: Extract content as Markdown + JSON metadata

Always use both `-m` and `-j` flags to get markdown content with full metadata:

```bash
defuddle parse "<url>" -m -j
```

### Step 2: Present a summary to the user

Show the user:

- **Title**: from JSON `title` field
- **Author**: from JSON `author` field
- **Source**: domain
- **Word count**: from JSON `wordCount` field
- A brief preview (first 2-3 sentences)

### Step 3: Ask where to save

If this is the **first time** using defuddle in this conversation, ask the user:

> "Save to which directory? (e.g. `~/Documents`, `~/Desktop`, or a custom path)"

Remember the user's chosen directory for subsequent uses in the same conversation.

### Step 4: Save as Markdown file

Write the file with frontmatter + full content:

```markdown
---
title: {title}
author: {author}
source: {url}
date: {published or "Unknown"}
clipped: {today's date YYYY-MM-DD}
wordCount: {wordCount}
---

# {title}

{markdown content}
```

**File naming**: Use the article title as filename, sanitized for filesystem:

- Replace special characters with spaces
- Trim whitespace
- Example: `The Shape of the Essay Field.md`

### Step 5: Confirm to user

Tell the user the file path where it was saved.
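Steps 4 and 5 can be sketched in Python. This is a minimal illustration rather than part of the skill: the helper names `sanitize_filename`, `build_note`, and `save_note` are hypothetical, and so is the `Untitled` fallback. The skill does not define "special characters", so the set used here is an assumption, as is collapsing repeated spaces. The `meta` dict is assumed to use the JSON field names the skill documents for `-j` output.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

# Assumption: "special characters" means anything that is not a letter,
# digit, underscore, hyphen, or whitespace. The skill does not define the set.
_SPECIAL = re.compile(r"[^\w\s-]")


def sanitize_filename(title: str) -> str:
    """Turn an article title into a filesystem-friendly .md filename."""
    cleaned = _SPECIAL.sub(" ", title)              # special characters -> spaces
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse runs (an extra), trim
    return f"{cleaned or 'Untitled'}.md"            # 'Untitled' fallback is hypothetical


def build_note(meta: dict, url: str, today: Optional[date] = None) -> str:
    """Assemble frontmatter plus content following the Step 4 template.

    Values are written as-is; a title containing ':' would need YAML quoting,
    which this sketch does not attempt.
    """
    today = today or date.today()
    title = meta.get("title", "")
    lines = [
        "---",
        f"title: {title}",
        f"author: {meta.get('author', '')}",
        f"source: {url}",
        f"date: {meta.get('published') or 'Unknown'}",
        f"clipped: {today.isoformat()}",
        f"wordCount: {meta.get('wordCount', '')}",
        "---",
        "",
        f"# {title}",
        "",
        meta.get("content", ""),
    ]
    return "\n".join(lines) + "\n"


def save_note(directory: str, meta: dict, url: str) -> Path:
    """Write the note into `directory` and return the saved path (Step 5)."""
    target = Path(directory).expanduser() / sanitize_filename(meta.get("title", ""))
    target.write_text(build_note(meta, url), encoding="utf-8")
    return target
```

Calling `save_note("~/Documents", meta, url)` returns the path that Step 5 reports back to the user.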
## CLI Reference

```bash
defuddle parse <source> [options]
```

**Arguments:**

- `<source>`: URL (`https://...`) or local HTML file path

**Options:**

| Flag | Description |
|------|-------------|
| `-m, --markdown` | Convert content to Markdown |
| `-j, --json` | Output as JSON with full metadata |
| `-o, --output <file>` | Write to file instead of stdout |
| `-p, --property <name>` | Extract single property (title, description, domain, author, published, wordCount, content) |
| `--debug` | Verbose logging |

## JSON Response Fields

When using `-j`, the response includes:

- `title`: Article title
- `author`: Author name
- `published`: Publication date
- `description`: Meta description
- `content`: Extracted Markdown (when `-m` used)
- `domain`: Source domain
- `favicon`: Favicon URL
- `image`: Featured image URL
- `site`: Site name
- `wordCount`: Word count
- `parseTime`: Processing time in ms

## Notes

- Requires Node.js and npm
- `jsdom` is required as a peer dependency
- Works best with article-style pages (blogs, news, documentation)
- Not designed for SPAs or JavaScript-heavy pages (e.g. WeChat articles need browser rendering)
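As a rough illustration of how the CLI flags and JSON fields fit together, the Python sketch below shells out to `defuddle parse <source> -m -j` and builds the Step 2 summary. It assumes the command prints a single JSON object on stdout and that `content` holds the Markdown. The function names `extract` and `summarize` are hypothetical, and the sentence splitting used for the preview is deliberately naive.

```python
import json
import re
import shutil
import subprocess


def extract(source: str) -> dict:
    """Run `defuddle parse <source> -m -j` and return the parsed metadata.

    Assumption: the CLI writes one JSON object to stdout when -j is given.
    """
    if shutil.which("defuddle") is None:
        raise RuntimeError(
            "defuddle not found; install with: npm install -g defuddle jsdom"
        )
    result = subprocess.run(
        ["defuddle", "parse", source, "-m", "-j"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)


def summarize(meta: dict, sentences: int = 3) -> str:
    """Build the summary shown to the user: title, author, source, count, preview."""
    content = (meta.get("content") or "").strip()
    # Naive sentence split on ., !, or ? followed by whitespace (an assumption;
    # it will mis-split abbreviations and Markdown headings).
    parts = re.split(r"(?<=[.!?])\s+", content)
    preview = " ".join(parts[:sentences])
    return "\n".join(
        [
            f"Title: {meta.get('title') or 'Unknown'}",
            f"Author: {meta.get('author') or 'Unknown'}",
            f"Source: {meta.get('domain') or 'Unknown'}",
            f"Word count: {meta.get('wordCount', 'Unknown')}",
            f"Preview: {preview}",
        ]
    )
```

Keeping `summarize` separate from `extract` means the summary logic can be exercised on a hand-written dict without network access or a defuddle install.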