Data Scraper Agent

The canonical shelf is Build, because the skill produces the collector, enricher, and storage integration code along with the workflow wiring. The Integrations category matches the COLLECT → ENRICH → STORE pipeline and its third-party sinks (Notion, Sheets, Supabase, GitHub Actions).

Also useful

Validate: Prototype & spike

Where it fits

Example use

Prove a niche job-board signal by running a thin scraper plus Sheets sink before committing to a full product UI.

Example use

Implement the COLLECT → ENRICH → STORE layers with GitHub Actions cron and Supabase tables.

Example use

Feed enriched rows into dashboards or lifecycle emails when competitor pricing moves.

Example use

Tune Gemini prompts and scraper selectors after user feedback labels false positives.

How it compares

Use for an opinionated free-stack monitoring agent, not for a single interactive scrape snippet or a hosted no-code scraper SaaS.

Common Questions / FAQ

Who is data-scraper-agent for?

It is for indie developers and solo founders who want a scheduled, LLM-enriched data bot stored in Notion, Sheets, or Supabase without paying for always-on servers.

When should I use data-scraper-agent?

Use it in Build when wiring integrations; in Grow when you need ongoing market or listing analytics; and in Operate when you iterate schedules and enrichment rules after watching production runs.

Is data-scraper-agent safe to install?

Installing it implies network access, secrets for APIs and storage, and scheduled jobs. Review the Security Audits panel on this page and scope repository secrets before enabling Actions.

SKILL.md


# Data Scraper Agent

Build a production-ready, AI-powered data collection agent for any public data source.
Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.

**Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase**

## When to Activate

- User wants to scrape or monitor any public website or API
- User says "build a bot that checks...", "monitor X for me", "collect data from..."
- User wants to track jobs, prices, news, repos, sports scores, events, listings
- User asks how to automate data collection without paying for hosting
- User wants an agent that gets smarter over time based on their decisions

## Core Concepts

### The Three Layers

Every data scraper agent has three layers:

```
COLLECT   →   ENRICH      →   STORE
   │            │               │
Scraper       AI (LLM)        Database
runs on       scores,         Notion /
schedule      summarises      Sheets /
              & classifies    Supabase
```
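
As a rough illustration of how the three layers compose, the sketch below wires hypothetical `collect`, `enrich`, and `store` callables into one run. The `Item` shape, the function names, and the in-memory stand-ins for the scraper, the LLM, and the database are all assumptions made for this example, not part of the skill itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Item:
    """One scraped record; the field names are illustrative assumptions."""
    title: str
    url: str
    enrichment: dict = field(default_factory=dict)


def run_pipeline(
    collect: Callable[[], Iterable[Item]],
    enrich: Callable[[list[Item]], list[Item]],
    store: Callable[[list[Item]], int],
) -> int:
    """Run COLLECT -> ENRICH -> STORE once and return the number of stored items."""
    items = list(collect())
    if not items:
        return 0  # nothing scraped, so skip the LLM and the database entirely
    return store(enrich(items))


# In-memory stand-ins so the sketch runs without network access.
def fake_collect() -> list[Item]:
    return [Item("Backend Engineer", "https://example.com/jobs/1")]


def fake_enrich(items: list[Item]) -> list[Item]:
    for item in items:
        item.enrichment = {"score": 0.5}  # a real agent would call the LLM here
    return items


database: list[Item] = []


def fake_store(items: list[Item]) -> int:
    database.extend(items)
    return len(items)
```

Passing the layers in as callables keeps each one replaceable: a Sheets sink can be swapped for a Supabase one without touching the scraper or the enrichment step.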

### Free Stack

| Layer | Tool | Why |
|---|---|---|
| **Scraping** | `requests` + `BeautifulSoup` | No cost, covers 80% of public sites |
| **JS-rendered sites** | `playwright` (free) | When HTML scraping fails |
| **AI enrichment** | Gemini Flash via REST API | 500 req/day, 1M tokens/day — free |
| **Storage** | Notion API | Free tier, great UI for review |
| **Schedule** | GitHub Actions cron | Free for public repos |
| **Learning** | JSON feedback file in repo | Zero infra, persists in git |
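
The "Learning" row depends on a JSON file that lives in the repository. A minimal sketch of reading and updating such a file follows, assuming a hypothetical layout with `accepted` and `rejected` lists of item titles. The file layout and both function names are assumptions for illustration.

```python
import json
from pathlib import Path


def load_feedback(path: Path) -> dict:
    """Return stored feedback, or an empty structure if the file does not exist yet."""
    if not path.exists():
        return {"accepted": [], "rejected": []}
    return json.loads(path.read_text(encoding="utf-8"))


def record_decision(path: Path, item_title: str, accepted: bool) -> dict:
    """Append one user decision and write the file back so it can be committed."""
    feedback = load_feedback(path)
    key = "accepted" if accepted else "rejected"
    if item_title not in feedback[key]:  # avoid duplicate entries across runs
        feedback[key].append(item_title)
    path.write_text(json.dumps(feedback, indent=2, sort_keys=True), encoding="utf-8")
    return feedback
```

Writing with `sort_keys=True` and a fixed indent keeps the file's git diffs small, which helps when the file is committed after every run.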

### AI Model Fallback Chain

Build agents to auto-fallback across Gemini models on quota exhaustion:

```
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
```
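
One way the fallback chain could be expressed is sketched below. The model order is copied from the chain above; the `QuotaExhausted` exception, the `send` callable, and `fake_send` are hypothetical names introduced for this example, and a real client would need to map its own quota errors (for instance an HTTP 429 response) onto that exception.

```python
from typing import Callable, Sequence

# Model order copied from the chain above.
MODEL_CHAIN = (
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
)


class QuotaExhausted(Exception):
    """Hypothetical error a client raises when a model's quota is used up."""


def call_with_fallback(
    prompt: str,
    send: Callable[[str, str], str],
    models: Sequence[str] = MODEL_CHAIN,
) -> tuple[str, str]:
    """Try each model in order and return (model_used, response_text)."""
    for model in models:
        try:
            return model, send(model, prompt)
        except QuotaExhausted:
            continue  # this model is out of quota, move to the next one
    raise QuotaExhausted("all models in the chain are out of quota")


# Stand-in transport for demonstration: pretends the first two models are exhausted.
def fake_send(model: str, prompt: str) -> str:
    if model in {"gemini-2.0-flash-lite", "gemini-2.0-flash"}:
        raise QuotaExhausted(model)
    return f"{model} answered: {prompt}"
```

With `fake_send`, `call_with_fallback("hello", fake_send)` skips the two exhausted models and returns the response from `gemini-2.5-flash`.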

### Batch API Calls for Efficiency

Never call the LLM once per item. Always batch:

```python
# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within free tier
```
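
The snippet above leaves `chunks` and `call_ai` undefined. A possible runnable version is sketched here, assuming `call_ai` accepts a list of items and returns one result per item in the same order. That contract and the `enrich_in_batches` name are assumptions for this example.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(
    items: Sequence[T],
    call_ai: Callable[[Sequence[T]], list],
    size: int = 5,
) -> list:
    """Send one request per batch and flatten the per-batch results."""
    results = []
    for batch in chunks(items, size):
        results.extend(call_ai(batch))  # one API call covers the whole batch
    return results
```

For 33 items and a batch size of 5 this makes six full batches plus one batch of 3, which is the 7 calls quoted above.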

---

## Workflow

### Step 1: Understand the Goal

Ask the user:

1. **What to collect:** "What data source? URL / API / RSS / public endpoint?"
2. **What to extract:** "What fields matter? Title, price, URL, date, score?"
3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or local file?"
4. **How to enrich:** "Do you want AI to score, summarise, classify, or match each item?"
5. **Frequency:** "How often should it run? Every hour, daily, weekly?"

Common examples to prompt:
- Job boards → score relevance to resume
- Product prices → alert on drops
- GitHub repos → summarise new releases
- News feeds → classify by topic + sentiment
- Sports results → extract stats to tracker
- Events calendar → filter by interest
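
The answers to the five questions can be captured in a small structure before any code is generated. The sketch below is one possible shape: `AgentSpec`, its field names, and the frequency keys are assumptions for illustration, and the chosen run times are arbitrary. GitHub Actions evaluates cron schedules in UTC.

```python
from dataclasses import dataclass

# Cron expressions for the frequencies named in question 5 (UTC on GitHub Actions).
CRON_BY_FREQUENCY = {
    "hourly": "0 * * * *",  # minute 0 of every hour
    "daily": "0 6 * * *",   # 06:00 every day (the hour is an arbitrary choice)
    "weekly": "0 6 * * 1",  # 06:00 every Monday
}


@dataclass(frozen=True)
class AgentSpec:
    """Answers to the Step 1 questions; all field names are hypothetical."""
    source: str                      # 1. what to collect
    extract_fields: tuple[str, ...]  # 2. what to extract
    storage: str                     # 3. how to store
    enrichment: str                  # 4. how to enrich
    frequency: str                   # 5. how often to run

    def cron(self) -> str:
        """Translate the chosen frequency into a cron expression."""
        try:
            return CRON_BY_FREQUENCY[self.frequency]
        except KeyError:
            raise ValueError(f"unsupported frequency: {self.frequency!r}") from None


spec = AgentSpec(
    source="https://example.com/jobs",
    extract_fields=("title", "url", "date"),
    storage="notion",
    enrichment="score relevance to resume",
    frequency="daily",
)
```

Here `spec.cron()` returns `"0 6 * * *"`, which could then be placed under `on.schedule` in the workflow file.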

---

### Step 2: Design the Agent Architecture

Generate this directory structure for the user:

```
my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fet
```

What is this skill?

Three-layer architecture: scrape on schedule, LLM enrich, persist to Notion, Sheets, or Supabase

Trigger phrases include "monitor X", "build a bot that checks...", and "collect data from..." for public APIs or sites

Designed to run entirely on free tiers: scheduled via GitHub Actions, with Gemini Flash for enrichment

Feedback loop so classification and scoring can improve from user decisions over time

Python-first recipes for job boards, prices, news, GitHub, sports, listings, and generic public sources

Three-layer pipeline: COLLECT → ENRICH → STORE

Runs on GitHub Actions with Gemini Flash called out as the free LLM enrichment layer

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 4.2k installs on skills.sh; 210k GitHub stars; 2/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

You deploy a repeatable COLLECT → ENRICH → STORE agent with cron-style runs, structured storage, and a path to improve scoring from feedback.

Scheduled scraper jobs with enrichment step and database sync

Configurable storage mapping for Notion, Google Sheets, or Supabase

Feedback-informed classification or scoring hooks for later iterations

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit
