
Parallel Web Extract
Extract webpage, article, or PDF content from one or more URLs with parallel-cli in a forked context to save agent tokens versus inline fetch.
Overview
parallel-web-extract is an agent skill most often used in Build (also Idea) that fetches URL content via parallel-cli in a forked context and saves structured JSON for token-efficient agent workflows.
Install
npx skills add https://github.com/parallel-web/parallel-agent-skills --skill parallel-web-extractWhat is this skill?
- parallel-cli extract with JSON written to /tmp using a descriptive lowercase hyphenated filename per URL batch
- Runs in forked context via parallel:parallel-subagent for token-efficient extraction
- Supports --objective, repeatable -q keywords, --full-content, --full-content-max-chars, and --no-excerpts
- User-invocable with argument-hint for one or more URLs; prefer over built-in WebFetch per skill guidance
- Requires parallel-cli and internet access; allowed-tools scoped to Bash(parallel-cli:*)
- Output path pattern uses inline /tmp/$FILENAME.json substitution (filename chosen per URL, not a shell variable)
Adoption & trust: 8.1k installs on skills.sh; 56 GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
Built-in web fetch blows up context or fails on heavy pages, and you need reliable multi-URL extraction output your agent can cite without re-downloading in the main thread.
Who is it for?
Agent builders who already use Parallel’s CLI and want a documented, forked extract path instead of ad-hoc WebFetch.
Skip if: Offline-only workflows, environments without parallel-cli, or cases where you must not write extracts to local /tmp JSON files.
When should I use this skill?
Fetching any URL—webpages, articles, PDFs, or JS-heavy sites—and preferring parallel-cli extract in forked context over built-in WebFetch.
What do I get? / Deliverables
You get a /tmp JSON extract file per run (with excerpts or full content per flags) ready for summarization, specs, or downstream automation.
- JSON extract file under /tmp with excerpts or full content per flags
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build → integrations because the skill wires an external CLI (parallel-cli) into the agent runtime as the preferred URL fetch path. Integrations subphase matches forked subagent execution, Bash allowlisting, and JSON output files agents consume downstream.
Where it fits
Wire extract JSON into a custom agent step that summarizes vendor API docs before codegen.
Pull three competitor landing pages to JSON for a positioning comparison doc.
Batch-extract release notes URLs with --objective focused on breaking changes.
Use --full-content with a max char cap to archive a long article for in-repo reference.
How it compares
Forked parallel-cli extract package—not built-in WebFetch and not a general browser automation skill.
Common Questions / FAQ
Who is parallel-web-extract for?
Solo builders and agent authors using Claude Code–class runtimes who want Parallel’s extract command as the default URL ingestion step.
When should I use parallel-web-extract?
Use it in Build/integrations when implementing agent tools that ingest docs or articles, and in Idea/research when batch-fetching references—whenever you have one or more URLs and want JSON on disk via parallel-cli.
Is parallel-web-extract safe to install?
It invokes Bash against parallel-cli and reads arbitrary URLs; review the Security Audits panel on this page and constrain allowed URLs and /tmp output in your environment.
SKILL.md
READMESKILL.md - Parallel Web Extract
# URL Extraction Extract content from: $ARGUMENTS ## Command Choose a short, descriptive filename based on the URL or content (e.g., `vespa-docs`, `react-hooks-api`). Use lowercase with hyphens, no spaces. Substitute it into the command **inline** — `$FILENAME` is a placeholder, not a shell variable. ```bash parallel-cli extract "$ARGUMENTS" --json -o "/tmp/$FILENAME.json" ``` Concrete example: ```bash parallel-cli extract "https://docs.parallel.ai" --json -o "/tmp/parallel-docs.json" ``` Note: `-o` always saves JSON. The extension must be `.json`. Options if needed: - `--objective "focus area"` to focus extraction on a specific goal (also silences the "neither objective nor search_queries" warning that V1 emits when neither is set) - `-q "keyword"` (repeatable) to prioritize keywords in excerpts - `--full-content` to include the complete page body (for long articles, PDFs, or when excerpts may not capture what you need) - `--full-content-max-chars N` to cap full-content size per result - `--no-excerpts` to strip excerpts when you only want full content ## Handling failed extractions If the response has an `errors` field, an empty `results` array, or a 404/timeout for the URL, do NOT fabricate content. Tell the user the extraction failed, surface the upstream status, and suggest: - Verifying the URL (the page may have moved) - Retrying with `--full-content` if excerpts came back empty but the page exists - Using `parallel-cli search` to locate the current URL if the page was renamed ## Response format Return content as: **[Page Title](URL)** Then the extracted content verbatim, with these rules: - Keep content verbatim - do not paraphrase or summarize - Parse lists exhaustively - extract EVERY numbered/bulleted item - Strip only obvious noise: nav menus, footers, ads - Preserve all facts, names, numbers, dates, quotes After the response, mention the output file path (`/tmp/$FILENAME.json`) so the user knows it's available for follow-up questions. ## Setup If `parallel-cli` is not found, install and authenticate: ```bash /parallel:parallel-cli-setup ``` If `parallel-cli extract` returns `403`, tell the user balance is likely required. Offer to run `parallel-cli balance get`, and if needed ask for explicit confirmation before running `parallel-cli balance add <amount_cents>`. Then retry the original extract command.