
Data Scraper Agent
Scaffold a scheduled Python agent that scrapes public sources, enriches rows with Gemini Flash, and syncs results to Notion, Sheets, or Supabase on free GitHub Actions.
Overview
Data Scraper Agent is an agent skill most often used in Build (also Grow analytics and Operate iterate) that helps you build a scheduled, AI-enriched public-data collection pipeline on Python, Gemini Flash, and GitHub Ac
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill data-scraper-agent
What is this skill?
- Three-layer architecture: scrape on schedule, LLM enrich, persist to Notion, Sheets, or Supabase
- Trigger phrases include "monitor X", "build a bot that checks", and "collect data from" public APIs or sites
- Designed to run on a schedule at no cost: GitHub Actions for hosting plus Gemini Flash for enrichment
- Feedback loop so classification and scoring can improve from user decisions over time
- Python-first recipes for job boards, prices, news, GitHub, sports, listings, and generic public sources
- Three-layer pipeline: COLLECT → ENRICH → STORE
- Runs on GitHub Actions with Gemini Flash called out as the free LLM enrichment layer
Adoption & trust: 4.2k installs on skills.sh; 210k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to monitor public listings, prices, or feeds automatically but only have ad-hoc scripts and no free scheduled enrich-and-store workflow.
Who is it for?
Solo builders who want hands-off monitoring of public web or API data with free GitHub Actions hosting and a Notion, Sheets, or Supabase sink.
Skip if: your project must scrape authenticated private dashboards, would violate site terms, or needs enterprise-scale crawl infrastructure beyond Actions limits.
When should I use this skill?
Use it when the user wants to scrape or monitor public websites or APIs; says "build a bot that checks" or "monitor X for me"; wants to track jobs, prices, news, or repos; or asks how to automate data collection without paid hosting.
What do I get? / Deliverables
You deploy a repeatable COLLECT → ENRICH → STORE agent with cron-style runs, structured storage, and a path to improve scoring from feedback.
- Scheduled scraper jobs with enrichment step and database sync
- Configurable storage mapping for Notion, Google Sheets, or Supabase
- Feedback-informed classification or scoring hooks for later iterations
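As a rough illustration of the COLLECT → ENRICH → STORE deliverable, the sketch below wires three stages together in Python. Everything in it is an assumption made for this example rather than code from the skill: the function names (`collect`, `enrich`, `store`, `run_pipeline`), the in-memory list standing in for Notion, Sheets, or Supabase, the keyword score standing in for the Gemini call, and the choice to de-duplicate rows by URL.

```python
from typing import Callable

Item = dict  # one scraped row, e.g. {"title": ..., "url": ...}


def collect() -> list[Item]:
    """COLLECT: stand-in for a scraper; returns hard-coded sample rows."""
    return [
        {"title": "Senior Python Engineer", "url": "https://example.com/jobs/1"},
        {"title": "Office Manager", "url": "https://example.com/jobs/2"},
    ]


def enrich(items: list[Item]) -> list[Item]:
    """ENRICH: stand-in for the LLM step; scores rows by a keyword match."""
    return [
        {**item, "score": 1.0 if "python" in item["title"].lower() else 0.0}
        for item in items
    ]


DATABASE: list[Item] = []  # hypothetical stand-in for Notion, Sheets, or Supabase


def store(items: list[Item]) -> int:
    """STORE: append rows not already stored (keyed by URL); return how many were new."""
    known = {row["url"] for row in DATABASE}
    new_rows = [item for item in items if item["url"] not in known]
    DATABASE.extend(new_rows)
    return len(new_rows)


def run_pipeline(
    collect_fn: Callable[[], list[Item]] = collect,
    enrich_fn: Callable[[list[Item]], list[Item]] = enrich,
    store_fn: Callable[[list[Item]], int] = store,
) -> int:
    """Run one scheduled pass: collect, enrich, then store."""
    return store_fn(enrich_fn(collect_fn()))


if __name__ == "__main__":
    print(run_pipeline())  # first run stores both sample rows
    print(run_pipeline())  # second run finds nothing new
```

Passing the stages in as callables keeps each layer replaceable, so a real scraper, LLM client, or storage sink could be swapped in without touching the orchestration.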
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build because the skill produces the collector, enricher, and storage integration code and workflow wiring. Integrations matches the COLLECT → ENRICH → STORE pipeline and third-party sinks (Notion, Sheets, Supabase, GitHub Actions).
Where it fits
Prove a niche job-board signal by running a thin scraper plus Sheets sink before committing to a full product UI.
Implement the COLLECT → ENRICH → STORE layers with GitHub Actions cron and Supabase tables.
Feed enriched rows into dashboards or lifecycle emails when competitor pricing moves.
Tune Gemini prompts and scraper selectors after user feedback labels false positives.
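One way to picture the feedback step in the last item: user decisions are kept in a small JSON file (the skill describes a JSON feedback file stored in the repo), and recent decisions are folded back into the enrichment prompt. The file layout, the `load_feedback`, `record_feedback`, and `feedback_prompt_section` names, and the prompt wording are all hypothetical.

```python
import json
from pathlib import Path


def load_feedback(path: Path) -> dict:
    """Read the feedback file, or start an empty record if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"positive": [], "negative": []}


def record_feedback(path: Path, title: str, keep: bool) -> dict:
    """Save one user decision; assumed layout: {"positive": [...], "negative": [...]}."""
    data = load_feedback(path)
    bucket = "positive" if keep else "negative"
    if title not in data[bucket]:
        data[bucket].append(title)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data


def feedback_prompt_section(data: dict, limit: int = 5) -> str:
    """Turn the most recent decisions into text that can be appended to a prompt."""
    liked = "\n".join(f"- {t}" for t in data["positive"][-limit:]) or "- (none yet)"
    disliked = "\n".join(f"- {t}" for t in data["negative"][-limit:]) or "- (none yet)"
    return "\n".join([
        "Items the user kept:",
        liked,
        "Items the user rejected (treat similar items as low relevance):",
        disliked,
    ])
```

Because the file is plain JSON, committing it alongside the code would let each scheduled run pick up the latest decisions without any extra infrastructure.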
How it compares
Use for an opinionated free-stack monitoring agent, not for a single interactive scrape snippet or a hosted no-code scraper SaaS.
Common Questions / FAQ
Who is data-scraper-agent for?
It is for indie developers and solo founders who want a scheduled, LLM-enriched data bot stored in Notion, Sheets, or Supabase without paying for always-on servers.
When should I use data-scraper-agent?
Use it in Build when wiring integrations; in Grow when you need ongoing market or listing analytics; and in Operate when you iterate schedules and enrichment rules after watching production runs.
Is data-scraper-agent safe to install?
It implies network access, secrets for APIs and storage, and scheduled jobs. Review the Security Audits panel on this page and scope repository secrets before enabling Actions.
SKILL.md
# Data Scraper Agent

Build a production-ready, AI-powered data collection agent for any public data source. Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.

**Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase**

## When to Activate

- User wants to scrape or monitor any public website or API
- User says "build a bot that checks...", "monitor X for me", "collect data from..."
- User wants to track jobs, prices, news, repos, sports scores, events, listings
- User asks how to automate data collection without paying for hosting
- User wants an agent that gets smarter over time based on their decisions

## Core Concepts

### The Three Layers

Every data scraper agent has three layers:

```
COLLECT    →   ENRICH         →   STORE
  │              │                  │
Scraper        AI (LLM)           Database
runs on        scores/            Notion /
schedule       summarises         Sheets /
               & classifies       Supabase
```

### Free Stack

| Layer | Tool | Why |
|---|---|---|
| **Scraping** | `requests` + `BeautifulSoup` | No cost, covers 80% of public sites |
| **JS-rendered sites** | `playwright` (free) | When HTML scraping fails |
| **AI enrichment** | Gemini Flash via REST API | 500 req/day, 1M tokens/day (free) |
| **Storage** | Notion API | Free tier, great UI for review |
| **Schedule** | GitHub Actions cron | Free for public repos |
| **Learning** | JSON feedback file in repo | Zero infra, persists in git |

### AI Model Fallback Chain

Build agents to auto-fallback across Gemini models on quota exhaustion:

```
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
```

### Batch API Calls for Efficiency

Never call the LLM once per item.
Always batch:

```python
# `items` and `call_ai` are placeholders; `chunks` is an assumed helper, defined here for completeness.
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within free tier
```

---

## Workflow

### Step 1: Understand the Goal

Ask the user:

1. **What to collect:** "What data source? URL / API / RSS / public endpoint?"
2. **What to extract:** "What fields matter? Title, price, URL, date, score?"
3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or local file?"
4. **How to enrich:** "Do you want AI to score, summarise, classify, or match each item?"
5. **Frequency:** "How often should it run? Every hour, daily, weekly?"

Common examples to prompt:

- Job boards → score relevance to resume
- Product prices → alert on drops
- GitHub repos → summarise new releases
- News feeds → classify by topic + sentiment
- Sports results → extract stats to tracker
- Events calendar → filter by interest

---

### Step 2: Design the Agent Architecture

Generate this directory structure for the user:

```
my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fet