
Web Scraping
Run a fallback scraping cascade—requests, trafilatura, Playwright stealth—for research pipelines, content ingestion, or integration prototypes with anti-bot awareness.
Overview
web-scraping is an agent skill most often used in Idea (also Build integrations, Grow content) that implements cascading extractors with trafilatura, Playwright stealth, and social-tool fallbacks for reliable page conten
Install
npx skills add https://github.com/jamditis/claude-skills-journalism --skill web-scrapingWhat is this skill?
- Scraping cascade with abstract Scraper classes and automatic method fallback
- Trafilatura for fast article extraction; BeautifulSoup and requests baseline
- Playwright sync and async paths with playwright_stealth for anti-bot bypass
- Documented use of yt-dlp and instaloader for social/video sources
- Poison pill detection and ethical scraping considerations in the methodology
Adoption & trust: 5.1k installs on skills.sh; 252 GitHub stars; 0/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need text and metadata from sites that block simple requests, vary layout, or lack a documented API.
Who is it for?
Indie builders running journalism-style research, competitive monitoring, or content aggregation where respectful fallback strategies matter.
Skip if: Scraping that violates site terms, credential stuffing, or cases where a licensed API is cheaper and compliant.
When should I use this skill?
Extracting content from websites, handling paywalls, implementing scraping cascades, or processing social media with requests, trafilatura, Playwright stealth, yt-dlp, or instaloader.
What do I get? / Deliverables
You get a layered Scraper pipeline with traced extraction methods and patterns for anti-bot and poison-pill edge cases, ready to plug into research or ingestion code.
- Scraper cascade modules with ScrapingResult and method tracking
- Playwright or trafilatura-based extractors for articles and social URLs
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Idea/research because journalists and builders often install this skill first to gather sources, competitors, and public web evidence before committing to a product shape. Web extraction is a core research motion: collecting articles, social signals, and page text when no official API exists.
Where it fits
Collect competitor blog posts and press pages into a normalized corpus before scoping a newsletter product.
Snapshot pricing and feature pages when no structured competitor API is available.
Implement the Scraper ABC chain as a nightly job feeding your app's content store.
Monitor public sources for mentions to fuel lifecycle or editorial alerts.
How it compares
Procedural cascade methodology with Python tooling—not a hosted scraper SaaS or a single MCP fetch wrapper.
Common Questions / FAQ
Who is web-scraping for?
Solo developers and researchers using coding agents to collect public web content for analysis, datasets, or editorial workflows.
When should I use web-scraping?
In Idea/research to gather sources and competitor pages; in Build/integrations when wiring ingestion jobs; in Grow/content when automating monitoring—whenever SKILL.md triggers mention paywalls, cascades, or social extraction.
Is web-scraping safe to install?
Skills that drive browsers and network calls need careful review; check the Security Audits panel on this page and run scrapers only against URLs you are allowed to access.
SKILL.md
READMESKILL.md - Web Scraping
# Web scraping methodology Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling. ## Scraping cascade architecture Implement multiple extraction strategies with automatic fallback: ```python from abc import ABC, abstractmethod from typing import Optional import requests from bs4 import BeautifulSoup import trafilatura #for .py files from playwright.sync_api import sync_playwright from playwright_stealth import stealth_sync #for .ipynb files import asyncio from playwright.async_api import async_playwright class ScrapingResult: def __init__(self, content: str, title: str, method: str): self.content = content self.title = title self.method = method # Track which method succeeded class Scraper(ABC): @abstractmethod def fetch(self, url: str) -> Optional[ScrapingResult]: ... class TrafilaturaCscraper(Scraper): """Fast, lightweight extraction for standard articles.""" def fetch(self, url: str) -> Optional[ScrapingResult]: try: downloaded = trafilatura.fetch_url(url) if not downloaded: return None content = trafilatura.extract( downloaded, include_comments=False, include_tables=True, favor_recall=True ) if not content or len(content) < 100: return None # Extract title separately soup = BeautifulSoup(downloaded, 'html.parser') title = soup.find('title') title_text = title.get_text() if title else '' return ScrapingResult(content, title_text, 'trafilatura') except Exception: return None class RequestsScraper(Scraper): """HTTP requests with rotating user agents.""" USER_AGENTS = [ 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36', ] def fetch(self, url: str) -> Optional[ScrapingResult]: import random headers = { 'User-Agent': random.choice(self.USER_AGENTS), 'Accept': 'text/html,application/xhtml+xml', 'Accept-Language': 'en-US,en;q=0.9', } try: response = requests.get(url, headers=headers, timeout=30) response.raise_for_status() soup = BeautifulSoup(response.text, 'html.parser') # Remove script/style elements for element in soup(['script', 'style', 'nav', 'footer', 'aside']): element.decompose() # Find main content main = soup.find('main') or soup.find('article') or soup.find('body') content = main.get_text(separator='\n', strip=True) if main else '' title = soup.find('title') title_text = title.get_text() if title else '' if len(content) < 100: return None return ScrapingResult(content, title_text, 'requests') except Exception: return None class PlaywrightScraper(Scraper): """Heavy JavaScript rendering with stealth mode for anti-bot bypass.""" def fetch(self, url: str) -> Optional[ScrapingResult]: try: with sync_playwright() as p: browser = p.chromium.launch(headless=True) context = browser.new_context( viewport={'width': 1920, 'height': 1080}, user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/5