Web Scraping

The canonical shelf is Idea/research because journalists and builders often install this skill first, to gather sources, competitor information, and public web evidence before committing to a product shape. Web extraction is a core research activity: collecting articles, social signals, and page text when no official API exists.

Where it fits

Example use

Idea: Competitor & landscape research

Collect competitor blog posts and press pages into a normalized corpus before scoping a newsletter product.

Example use

Snapshot pricing and feature pages when no structured competitor API is available.

Example use

Implement the Scraper ABC chain as a nightly job feeding your app's content store.

Example use

Monitor public sources for mentions to fuel lifecycle or editorial alerts.

How it compares

This skill is a procedural cascade methodology with Python tooling, not a hosted scraper SaaS or a single MCP fetch wrapper.

Common Questions / FAQ

Who is web-scraping for?

Solo developers and researchers using coding agents to collect public web content for analysis, datasets, or editorial workflows.

When should I use web-scraping?

Use it in Idea/research to gather sources and competitor pages, in Build/integrations when wiring ingestion jobs, and in Grow/content when automating monitoring. More generally, use it whenever the triggers in SKILL.md apply: paywalls, cascades, or social extraction.

Is web-scraping safe to install?

Skills that drive browsers and network calls need careful review; check the Security Audits panel on this page and run scrapers only against URLs you are allowed to access.
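One way to act on "only URLs you are allowed to access" is to consult a site's robots.txt before fetching. The sketch below uses Python's standard `urllib.robotparser` on rules supplied as text, so it runs offline. The helper name `is_allowed`, the `research-bot` user agent, and the sample rules are illustrative assumptions, not part of the skill. robots.txt only expresses crawler preferences; terms of service and legal limits need a separate check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "research-bot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/report"))
```

In a real job, the rules would be downloaded once per host and cached, and any URL that fails the check would be skipped before a scraper runs.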

SKILL.md - Web Scraping

# Web scraping methodology

Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.

## Scraping cascade architecture

Implement multiple extraction strategies with automatic fallback:

```python
from abc import ABC, abstractmethod
from typing import Optional
import requests
from bs4 import BeautifulSoup
import trafilatura

# For .py scripts: synchronous Playwright API
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

# For .ipynb notebooks: async Playwright API (a notebook already runs an event loop)
import asyncio
from playwright.async_api import async_playwright

class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded

class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]: ...

class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None

            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True
            )

            if not content or len(content) < 100:
                return None

            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''

            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None

class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        import random

        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove non-content elements (scripts, styles, navigation, footers, sidebars)
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()

            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''

            title = soup.find('title')
            title_text = title.get_text() if title else ''

            if len(content) < 100:
                return None

            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None

class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/5
```

What is this skill?

Scraping cascade with abstract Scraper classes and automatic method fallback

Trafilatura for fast article extraction; BeautifulSoup and requests baseline

Playwright sync and async paths with playwright_stealth for anti-bot bypass

Documented use of yt-dlp and instaloader for social/video sources

Poison pill detection and ethical scraping considerations in the methodology
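The excerpt names poison pill detection without showing it. Assuming the term means a response that fetches successfully but contains a block page, CAPTCHA, or paywall stub instead of the article, a minimal check might look like the sketch below. The function name `detect_poison_pill`, the marker phrases, and the 2000-character cutoff are hypothetical; the 100-character minimum mirrors the threshold used in the SKILL.md code.

```python
from typing import Optional

# Hypothetical marker phrases; tune these for the sites you actually scrape.
BLOCK_MARKERS = (
    "enable javascript",
    "verify you are human",
    "access denied",
    "subscribe to continue reading",
)


def detect_poison_pill(content: str, min_length: int = 100,
                       max_marker_length: int = 2000) -> Optional[str]:
    """Return a reason string if content looks like a block page, else None."""
    text = content.strip().lower()
    if len(text) < min_length:
        return "too_short"
    # Only flag short pages, so long articles that merely quote a
    # marker phrase are not rejected.
    if len(text) < max_marker_length:
        for marker in BLOCK_MARKERS:
            if marker in text:
                return f"marker:{marker}"
    return None
```

A scraper could call this on extracted text and return `None` when a reason comes back, which hands the URL to the next strategy in the cascade.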

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 5.1k installs on skills.sh; 252 GitHub stars; 0/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

You get a layered Scraper pipeline with traced extraction methods and patterns for anti-bot and poison-pill edge cases, ready to plug into research or ingestion code.

Scraper cascade modules with ScrapingResult and method tracking

Playwright or trafilatura-based extractors for articles and social URLs
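The fallback behaviour behind these deliverables can be sketched as a loop over strategies that stops at the first usable result. This is a minimal illustration, not the skill's own runner: `run_cascade`, the `Fetcher` alias, and the two stand-in strategies are assumed names, and `ScrapingResult` is redefined here only so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ScrapingResult:
    content: str
    title: str
    method: str  # which strategy produced the content


# A fetcher takes a URL and returns a result, or None to signal "try the next one".
Fetcher = Callable[[str], Optional[ScrapingResult]]


def run_cascade(url: str, fetchers: Sequence[Fetcher]) -> Optional[ScrapingResult]:
    """Try each fetcher in order and return the first result that is not None."""
    for fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            # Treat a crash like a miss so one broken strategy cannot stop the cascade.
            continue
        if result is not None:
            return result
    return None


# Stand-in strategies so the sketch runs offline.
def fast_but_failing(url: str) -> Optional[ScrapingResult]:
    return None


def slow_but_working(url: str) -> Optional[ScrapingResult]:
    return ScrapingResult(content="rendered page text", title="Example", method="playwright")


result = run_cascade("https://example.com/article", [fast_but_failing, slow_but_working])
print(result.method if result else "all strategies failed")
```

With the real pipeline, the list would hold the `fetch` methods of the Scraper subclasses, presumably ordered from cheapest to most expensive so the browser is launched only when lighter methods return nothing. The `method` field then records which strategy succeeded.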

Journey fit

Spans multiple journey phases: a primary shelf plus alternate fits, as listed under "Where it fits" above.