
Crawl4ai
Crawl competitor sites, docs, and landing pages into markdown or JSON for agent context without hand-copying pages.
Overview
Crawl4ai is an agent skill, most often used in the Idea phase (and also in Build), that teaches the Crawl4AI crwl CLI for turning URLs into markdown or JSON for research and agent pipelines.
Install
npx skills add https://github.com/brettdavies/crawl4ai-skill --skill crawl4ai
What is this skill?
- crwl CLI with markdown, JSON, and verbose cache-bypass modes
- Browser, crawler, and extraction configuration sections for tuned runs
- LLM Q&A, structured data extraction, and content filtering flows
- Output format switches (-o markdown/json) for agent-ready corpora
- Complete examples and best-practices section in the skill guide
Adoption & trust: 632 installs on skills.sh; 31 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need live web content as structured text for research or agents, but copying pages by hand or writing one-off scrapers wastes time.
Who is it for?
Solo builders scraping docs, competitor pages, or blogs into markdown/JSON during research or agent prototyping.
Skip if: you only need a one-off curl fetch, already operate a managed crawl service, or cannot allow browser or network automation on the machine.
When should I use this skill?
You need the Crawl4AI crwl CLI for markdown/JSON crawls, LLM Q&A on pages, or structured extraction—not a generic HTTP one-liner.
What do I get? / Deliverables
You run configured crwl crawls with chosen output formats and extraction options, producing agent-ready markdown or JSON from target URLs.
- Markdown or JSON crawl output
- Configured browser/crawler/extraction settings for repeat runs
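One way these deliverables might be captured on disk is sketched below. It assumes `crwl` is installed and prints its result to standard output; the `-o` switch comes from the skill guide, while the `save_crawl` helper, the `CRWL_BIN` override, and the file names are hypothetical.

```shell
#!/bin/sh
# CRWL_BIN is a hypothetical override (not part of crwl) so the helper can be
# dry-run against another command.
CRWL_BIN="${CRWL_BIN:-crwl}"

# Hypothetical helper: run one crawl and write the result to a file.
save_crawl() {
  url="$1"      # page to crawl
  format="$2"   # "markdown" or "json", passed to crwl's -o switch
  out="$3"      # destination file
  "$CRWL_BIN" "$url" -o "$format" > "$out"
}

# Example calls, commented out so sourcing this sketch has no side effects:
# save_crawl https://example.com markdown example.md
# save_crawl https://example.com json example.json
```

Keeping the format as an argument means the same helper can produce both a readable markdown copy and a JSON copy of a page.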
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Research-heavy crawling (competitors, market pages, documentation) is where solo builders first reach for a CLI crawler before build-time ingestion pipelines. The crwl CLI is optimized for discovery and extraction during opportunity and competitor research, not for production deploy config.
Where it fits
Crawl rival pricing and feature pages to markdown before you commit to a niche.
Pull official docs into JSON for a side-by-side API comparison memo.
Archive a target customer’s public site structure to size an MVP integration.
Run crwl with extraction config to feed an agent’s knowledge base on each release.
Refresh crawled competitor blogs into markdown for a weekly positioning digest.
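Recurring uses like the weekly digest could be scripted roughly as follows. This is a sketch under assumptions: `crwl` prints markdown to standard output, and the `url_slug` and `crawl_list` helpers, the `CRWL_BIN` override, and the URL list file are all hypothetical names introduced for illustration.

```shell
#!/bin/sh
# CRWL_BIN is a hypothetical override (not part of crwl) for dry runs.
CRWL_BIN="${CRWL_BIN:-crwl}"

# Hypothetical helper: turn a URL into a safe file name.
url_slug() {
  printf '%s' "$1" \
    | sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9._-]|_|g' -e 's|_*$||'
}

# Hypothetical helper: crawl every URL in a list file (one per line)
# into <out_dir>/<slug>.md.
crawl_list() {
  list="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  while IFS= read -r url; do
    # Skip blank lines and comment lines.
    case "$url" in ''|'#'*) continue ;; esac
    # Detach stdin so the command cannot consume the rest of the list.
    "$CRWL_BIN" "$url" -o markdown < /dev/null \
      > "$out_dir/$(url_slug "$url").md" \
      || echo "crawl failed: $url" >&2
  done < "$list"
}

# Example call, commented out so sourcing this sketch has no side effects:
# crawl_list competitors.txt digest/
```

A failed crawl is reported on standard error and the loop continues, so one unreachable page does not stop the rest of the batch.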
How it compares
CLI skill package for local crwl usage, not a hosted web-scraper MCP or no-code monitor.
Common Questions / FAQ
Who is crawl4ai for?
Indie developers and agent builders who want Crawl4AI’s crwl CLI documented as procedural steps inside Claude Code, Cursor, or similar agents.
When should I use crawl4ai?
Use it in Idea/research to crawl competitors and references, in Validate when grounding a landing or scope doc in live pages, and in Build/integrations when feeding crawled markdown into RAG or automation—always via explicit crwl invocations.
Is crawl4ai safe to install?
Review the Security Audits panel on this Prism page for install risk and dependency signals before running pip install crawl4ai and crawl4ai-setup on your machine.
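Before installing, it may help to check what is already present on the machine. The sketch below assumes a POSIX shell; `require_cmd` is a hypothetical helper, and the two install commands it prints are the ones named in the skill guide. It only reports and installs nothing.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether the crwl CLI is already available.
if require_cmd crwl; then
  echo "crwl found at: $(command -v crwl)"
else
  echo "crwl not found; after reviewing the audits, install with:"
  echo "  pip install crawl4ai"
  echo "  crawl4ai-setup"
fi
```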
SKILL.md
# Crawl4AI CLI Guide

<!-- Reference: Tier 2 - Command-line interface for Crawl4AI -->

## Table of Contents

<!-- Lines 1-24 -->

- [Crawl4AI CLI Guide](#crawl4ai-cli-guide)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
  - [Basic Usage](#basic-usage)
    - [Quick Example - Advanced Usage](#quick-example---advanced-usage)
  - [Configuration](#configuration)
    - [Browser Configuration](#browser-configuration)
    - [Crawler Configuration](#crawler-configuration)
    - [Extraction Configuration](#extraction-configuration)
  - [Advanced Features](#advanced-features)
    - [LLM Q\&A](#llm-qa)
    - [Structured Data Extraction](#structured-data-extraction)
    - [Content Filtering](#content-filtering)
  - [Output Formats](#output-formats)
  - [Complete Examples](#complete-examples)
  - [Best Practices \& Tips](#best-practices--tips)
  - [Recap](#recap)
  - [See Also](#see-also)

---

## Installation

<!-- Lines 27-38 -->

The Crawl4AI CLI (`crwl`) is installed automatically with the library:

```bash
pip install crawl4ai
crawl4ai-setup
```

---

## Basic Usage

<!-- Lines 39-68 -->

The `crwl` command provides a simple interface to the Crawl4AI library:

```bash
# Basic crawling - returns markdown
crwl https://example.com

# Specify output format
crwl https://example.com -o markdown

# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See usage examples
crwl --example
```

### Quick Example - Advanced Usage

<!-- Lines 59-70 -->

```bash
# Extract structured data using CSS schema
crwl "https://www.infoq.com/ai-ml-data-eng/" \
    -e docs/examples/cli/extract_css.yml \
    -s docs/examples/cli/css_schema.json \
    -o json
```

---

## Configuration

<!-- Lines 51-160 -->

### Browser Configuration

<!-- Lines 51-75 -->

Browser settings via YAML file or command line:

```yaml
# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true
```

```bash
# Using config file
crwl https://example.com -B browser.yml

# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"
```

**Key Parameters:**

| Parameter | Description |
|-----------|-------------|
| `headless` | Run without GUI (true/false) |
| `viewport_width` | Browser width in pixels |
| `viewport_height` | Browser height in pixels |
| `user_agent_mode` | "random" or specific UA string |

For all browser parameters: [BrowserConfig Reference](complete-sdk-reference.md#1-browserconfig--controlling-the-browser) (lines 1977-2020)

### Crawler Configuration

<!-- Lines 76-110 -->

Control crawling behavior:

```yaml
# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true
```

```bash
# Using config file
crwl https://example.com -C crawler.yml

# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"
```

**Key Parameters:**

| Parameter | Description |
|-----------|-------------|
| `cache_mode` | bypass, enabled, disabled |
| `wait_until` | networkidle, domcontentloaded |
| `page_timeout` | Max page load time (ms) |
| `css_selector` | Focus on specific element |
| `scan_full_page` | Enable infinite scroll handling |

For all crawler parameters: [CrawlerRunConfig Reference](complete-sdk-reference.md#2-crawlerrunconfig--controlling-each-crawl) (lines 2020-2330)

### Extraction Configuration

<!-- Lines 111-160 -->

Two extraction types supported:

**1. CSS/XPath-based extraction:**

```yaml
# extract_css.yml
type: "json-css"
params:
  verbose: true
```

```json
// css_schema.json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    { "n