
Web Content Fetcher
Pull clean Markdown from any article, doc, or WeChat URL so your agent can read, cite, or summarize without hand-copying HTML.
Overview
Web Content Fetcher is an agent skill that extracts the main article content from a URL as clean Markdown via Scrapling or Jina Reader. It is most often used in the Idea phase and also fits Validate, Build, and Grow.
Install
npx skills add https://github.com/shirenchuang/web-content-fetcher --skill web-content-fetcher
What is this skill?
- Primary path: Scrapling fetch.py with domain-aware fast vs --stealth routing for clean Markdown
- Fallback: Jina Reader (r.jina.ai) when Scrapling fails or dependencies are missing; about 1–2 s on simple pages
- Preserves headings, links, images, lists, and code blocks in Markdown output
- Optional max_chars cap on extracted body length
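- Bilingual triggers: fetch/read/scrape URLs, plus Chinese phrases such as 帮我读一下这篇文章 ("read this article for me"), 抓取这个网页 ("scrape this web page"), and 提取正文 ("extract the main text")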
- Two-tier strategy: Scrapling script primary, Jina Reader fallback
- Jina free tier noted at 200 requests per day in skill docs
Adoption & trust: 2.7k installs on skills.sh; 579 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need an agent to work from a live web page, but raw HTML and bot blocking make copy-paste and naive fetches unreliable.
Who is it for?
Builders who routinely feed blog posts, docs, or Chinese media URLs into agents for research and content workflows.
Skip if: you need heavy bulk crawling at scale, want paywalled content you are not authorized to access, or target pages that only work inside an interactive login session.
When should I use this skill?
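Use it when the user wants to fetch, read, extract, scrape, or summarize content from a URL (blog posts, news, WeChat articles, docs), or says something like "read this page for me", 帮我读一下这篇文章 ("read this article for me"), 抓取这个网页 ("scrape this web page"), or 提取正文 ("extract the main text").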
What do I get? / Deliverables
You receive structured Markdown of the page body ready for summarization, citation, or downstream research notes.
- Clean Markdown body with preserved structure (headings, links, images, lists, code)
- Truncated extract when max_chars is set
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Idea is the primary shelf because fetching and reading external pages is the backbone of early opportunity and competitor research before you commit to build. Research is where URL ingestion lives: blog posts, news, docs, and WeChat Official Account (微信公众号) pages become inputs to your notes and specs.
Where it fits
- Idea: Pull a competitor launch post into Markdown before you sketch positioning.
- Validate: Read a long product teardown article to sanity-check feature scope against market write-ups.
- Build: Ingest a third-party API guide as Markdown while wiring an integration.
- Grow: Fetch a reference article to outline your own SEO post without losing link structure.
How it compares
This is an agent-side URL-to-Markdown skill, not a hosted MCP browser; pair it with summarization skills after the fetch.
Common Questions / FAQ
Who is web-content-fetcher for?
Solo builders and indie hackers who want their coding agent to read external articles and documentation as Markdown without writing scrape glue each time.
When should I use web-content-fetcher?
When you need to fetch, read, extract, or summarize a URL, including WeChat and documentation pages. Typical cases are Idea competitor research, Validate scope reading, Build doc ingestion, and Grow content reference pulls.
Is web-content-fetcher safe to install?
It runs network fetches and optional local Python scripts; review the Security Audits panel on this page and avoid sending authenticated or sensitive URLs through third-party readers.
SKILL.md
# Web Content Fetcher

Given a URL, return its main content as clean Markdown: headings, links, images, lists, and code blocks all preserved.

## Extraction Strategy

Always try **one method per URL**; don't cascade blindly. Pick the right one upfront.

```
URL
 │
 ├─ 1. Scrapling script (preferred)
 │      Run fetch.py; check the domain routing table to decide fast vs --stealth.
 │      Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback, only if Scrapling fails or dependencies not installed)
        web_fetch("https://r.jina.ai/<url>")
        Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
        Does NOT work for: WeChat (403), some Chinese platforms.
```

### Scrapling script

```bash
python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]
```

`<SKILL_DIR>` is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:

- **Default (fast):** HTTP fetch, ~1-3s, works for most sites
- **`--stealth`:** Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without `--stealth`, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify `--stealth` manually. The only reason to force it is when you already know the site needs it (see routing table), which saves the initial fast attempt.
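As a rough illustration of the strategy above, the following Python sketch assembles the `fetch.py` command line and the Jina Reader fallback address. The helper names (`needs_stealth`, `build_fetch_command`, `jina_reader_url`) and the `STEALTH_DOMAINS` set are hypothetical and not part of the skill. The three domains are the ones the routing table marks `--stealth`, and the argument order follows the usage line `fetch.py "<url>" [max_chars] [--stealth]`. The code only builds strings; it does not run the script or contact any server.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Assumption: hosts the routing table lists with --stealth, matched exactly.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True when the URL's host is one of the known stealth domains."""
    host = (urlparse(url).hostname or "").lower()
    return host in STEALTH_DOMAINS


def build_fetch_command(
    skill_dir: str, url: str, max_chars: Optional[int] = None
) -> List[str]:
    """Build the argv list for scripts/fetch.py in the documented order."""
    cmd = ["python3", f"{skill_dir}/scripts/fetch.py", url]
    if max_chars is not None:
        cmd.append(str(max_chars))
    if needs_stealth(url):
        # Forcing stealth skips the initial fast attempt for known sites.
        cmd.append("--stealth")
    return cmd


def jina_reader_url(url: str) -> str:
    """Fallback address: the target URL appended to the r.jina.ai prefix."""
    return "https://r.jina.ai/" + url


if __name__ == "__main__":
    print(build_fetch_command("/path/to/skill", "https://juejin.cn/post/1", 15000))
    print(jina_reader_url("https://example.com/article"))
```

The returned list can be passed to `subprocess.run` by a caller that wants to execute the script. Hosts outside the set get no flag, which leaves the decision to the script's own automatic fallback.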
## Domain Routing

Use this table to pick the right mode on the first call:

| Domain | Command | Why |
|--------|---------|-----|
| `mp.weixin.qq.com` | `fetch.py <url> --stealth` | JS-rendered content |
| `zhuanlan.zhihu.com` | `fetch.py <url> --stealth` | Anti-scraping + JS |
| `juejin.cn` | `fetch.py <url> --stealth` | JS-rendered SPA |
| `sspai.com` | `fetch.py <url>` | Static HTML |
| `blog.csdn.net` | `fetch.py <url>` | Static HTML |
| `ruanyifeng.com` | `fetch.py <url>` | Static blog |
| `openai.com` | `fetch.py <url>` | Static HTML |
| `blog.google` | `fetch.py <url>` | Static HTML |
| Everything else | `fetch.py <url>` | Auto-fallback handles it |

## Script Options

```bash
# Basic: auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
```

## Install Dependencies

First use only. The script checks and tells you if anything is missing:

```bash
pip install scrapling html2text
```

If on system-managed Python (macOS/Linux), add `--break-system-packages` or use a venv.

## Failure Rules

- Same URL fails once → give up, tell the user "unable to extract content from this URL"
- Do not retry: each failed call wastes context tokens

<div align="center">

# Web Content Fetcher

**Web page content extraction · Free forever · Supports WeChat Official Accounts**

</div>

---

## Introduction

Web Content Fetcher is a lightweight web page content extraction tool that automatically converts any web page into clean Markdown, preserving headings, links, images, and list structure.

**Core advantages:**

- Scrapling-first extraction, with built-in fast / ste