Web Scraping Automation

Name: Web Scraping Automation
Author: aaaaqwq

aaaaqwq/claude-code-skills

Automate fetching website or API data with Python or Node scripts, including parsing, scheduling, and mandatory browser process cleanup after scrapes.

Overview

web-scraping-automation is an agent skill, most often used in the Build phase (also Idea and Grow), that automates website and API data collection with Python or JavaScript scrapers and enforces mandatory browser cleanup.

Install

npx skills add https://github.com/aaaaqwq/claude-code-skills --skill web-scraping-automation

What is this skill?

End-to-end flow: target analysis, stack choice, implementation, and post-run Chrome/Selenium cleanup via pkill
Python stack: requests, BeautifulSoup4, Scrapy, Selenium, Playwright
JavaScript stack: axios, cheerio, puppeteer, node-fetch
REST and GraphQL API call, test, and response parsing workflows
Explicit anti–resource-leak rule to avoid Gateway CPU overload from lingering browser processes
Mandatory Chrome/Selenium cleanup via browser.close/quit plus pkill -f chrome

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 560 installs on skills.sh; 69 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You need repeatable data from the web or an API but lack a safe script pattern, stack choice, and cleanup discipline for headless browsers.

Who is it for?

Solo builders prototyping data pipelines, price monitors, or research harvesters where they control the runtime and legal scope.

Skip if: you run compliance-heavy production crawls without your own legal review, or your team forbids shell and browser automation in the agent environment.

When should I use this skill?

When the user needs to scrape web content, call and parse APIs, create crawler scripts, handle anti-scraping, or schedule data collection (including Chinese prompts about 爬取/API).

What do I get? / Deliverables

You get runnable crawler or API client scripts with parsing, optional scheduling hooks, and closed browser processes after each job.

Scraper or API automation script with parsing and storage hooks
Documented cleanup pattern for browser automation runs


Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Build: Integrations & version control

Most owners install this when they need working fetch-and-parse automation wired into their product or ops scripts—the Build integrations shelf. Scrapers and API clients are integration work: HTTP, browsers, storage, and cron-style jobs connecting external data to your codebase.

Also useful

Idea: Opportunity & market research
Grow: Analytics & insights

Where it fits

Example use

Idea: Competitor & landscape research

Harvest public pricing or feature lists from competitor sites to compare positioning before you commit to a build.

Example use

Build: Integrations & version control

Wire a Playwright script that logs in, extracts tables, and writes JSON your backend importer consumes.

Example use

Grow: Content & marketing

Schedule a nightly fetch of industry news or listings to feed a content or alert pipeline.

Example use

Operate: Iteration & experiments

Re-run an API health poll and store snapshots when a partner endpoint changes response shape.

How it compares

You get scripts and workflows that you own and run yourself, rather than a managed scraping API or a passive, read-only MCP connector.

Common Questions / FAQ

Who is web-scraping-automation for?

Solo and indie developers who want agent-guided scrapers and API callers in Python or JavaScript with explicit process cleanup after browser automation.

When should I use web-scraping-automation?

In Idea for competitor or catalog research, in Build when integrating external data sources, and in Grow or Operate for scheduled ingestion—always with cleanup after Playwright or Selenium.

Is web-scraping-automation safe to install?

It requests Bash, network, and browser-related actions; check the Security Audits panel on this page and never aim scrapers at sites or data you are not authorized to access.

SKILL.md


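# Website Scraping and API Automation

## What it does
This skill automates website data scraping and API calls, including:
- Analyzing and scraping website structure
- Calling and testing REST/GraphQL APIs
- Creating automated crawler scripts
- Parsing and cleaning data
- Handling anti-scraping mechanisms
- Scheduling jobs and storing data

## Usage scenarios
- "Scrape the product information from this website"
- "Call this API for me and parse the returned data"
- "Create a script that fetches news on a schedule"
- "Analyze the API documentation for this website"
- "Bypass this website's anti-scraping restrictions"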
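
The scheduled-fetch scenario in the list above is usually handled by cron or a job runner. Where neither is available, a small standard-library loop is one option. The sketch below is only an illustration: the names `seconds_until` and `run_daily` are hypothetical, and it assumes one run per day at a fixed local time.

```python
import time
from datetime import datetime, timedelta


def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return (target - now).total_seconds()


def run_daily(job, hour, minute, max_runs=None, sleep=time.sleep):
    """Call job() once a day at hour:minute; stop after max_runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(seconds_until(hour, minute))
        job()
        runs += 1
```

Passing `sleep` and `now` as parameters keeps the timing logic testable without actually waiting.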

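## Tech stack

### ⚠️ Resource cleanup rule (mandatory)

**After every browser-based scraping task finishes, the Chrome/Selenium processes must be closed automatically!**

```python
import subprocess

# Playwright example
from playwright.sync_api import sync_playwright

def scrape_website():
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            # ... scraping logic ...
            browser.close()
    finally:
        # ⚠️ Force-kill leftover processes, even if scraping raised.
        # Note: this matches every process whose command line contains
        # "chrome", not only the browser this script launched.
        subprocess.run(['pkill', '-f', 'chrome'], capture_output=True)

# Selenium example
from selenium import webdriver

driver = webdriver.Chrome()
try:
    # ... scraping logic ...
    pass
finally:
    driver.quit()
    # ⚠️ Make sure nothing is left behind
    subprocess.run(['pkill', '-f', 'chrome'], capture_output=True)
```

**Reason**: Avoids memory leaks and lingering resource usage, and keeps the Gateway from being overloaded at 100% CPU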
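
The close-then-pkill pattern above can be packaged once so that every scrape gets it, including on failure. This is a minimal sketch, not part of the skill itself: `browser_cleanup` is a hypothetical helper name, and it assumes a Linux or macOS host where `pkill` is installed.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def browser_cleanup(pattern="chrome", run=subprocess.run):
    """Run the wrapped scrape, then always pkill leftover browser processes."""
    try:
        yield
    finally:
        # pkill exits non-zero when nothing matches; that is fine here,
        # so the return code is deliberately not checked.
        run(["pkill", "-f", pattern], capture_output=True)


# Usage (scrape_website is whatever scraping function you already have):
#
#     with browser_cleanup():
#         scrape_website()
```

Because `pkill -f` matches on the full command line, a narrower `pattern` can limit the kill to the automation browser when other Chrome instances share the machine.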

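### Python scraping
- **requests**: HTTP request library
- **BeautifulSoup4**: HTML parsing
- **Scrapy**: full-featured crawling framework
- **Selenium**: browser automation
- **Playwright**: modern browser automation

### JavaScript scraping
- **axios**: HTTP client
- **cheerio**: jQuery-style HTML parsing on the server
- **puppeteer**: Chrome automation
- **node-fetch**: Fetch API for Node.js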

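## Workflow
1. **Target analysis**:
   - Inspect the site structure and where the data lives
   - Analyze the API endpoints and authentication method
   - Assess the anti-scraping mechanisms

2. **Solution design**:
   - Choose a suitable tech stack
   - Design the data extraction strategy
   - Plan error handling and retry mechanisms

3. **Script development**:
   - Write the crawler code
   - Implement the data parsing logic
   - Add logging and monitoring

4. **Testing and optimization**:
   - Verify data accuracy
   - Optimize performance and stability
   - Handle edge cases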
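
The error handling and retry step in the workflow above is often implemented with exponential backoff. The following is one possible sketch under that assumption; the helper name `retry` and its parameters are hypothetical, and the skill does not prescribe a particular retry policy.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Narrowing `exceptions` to network errors (for example timeouts and connection failures) avoids retrying on bugs in the parsing code.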

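## Best practices
- Follow robots.txt rules
- Set a reasonable interval between requests
- Set a User-Agent and other request headers
- Implement retry on errors
- De-duplicate and validate data
- Use a proxy pool (if needed)
- Save raw data and logs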
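
Two of the practices above, honoring robots.txt and de-duplicating, can be combined when building a crawl queue. The sketch below uses the standard library's `urllib.robotparser`; the function name `allowed_urls` and the `my-bot` user agent are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the unique URLs that robots_txt lets user_agent fetch, in order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    seen = set()
    result = []
    for url in urls:
        if url in seen:
            continue  # de-duplicate
        seen.add(url)
        if parser.can_fetch(user_agent, url):
            result.append(url)
    return result


robots = "User-agent: *\nDisallow: /private/\n"
queue = allowed_urls(robots, "my-bot", [
    "https://example.com/a",
    "https://example.com/private/x",
    "https://example.com/a",
])
```

In a real crawler the rules would come from the site itself, for instance via `parser.set_url(...)` followed by `parser.read()`, rather than from a string.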

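## Common scenario examples

### 1. Simple web page scraping
```python
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    # A timeout keeps the call from hanging on an unresponsive server
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract data, skipping items that lack a title or price element
    data = []
    for item in soup.select('.product'):
        title = item.select_one('.title')
        price = item.select_one('.price')
        if title is None or price is None:
            continue
        data.append({
            'title': title.get_text(strip=True),
            'price': price.get_text(strip=True)
        })
    return data
```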
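
Scraped prices such as the `price` field above arrive as display strings, so a cleaning step usually follows. A minimal sketch, assuming prices use a comma as the thousands separator and a dot as the decimal point; `parse_price` is a hypothetical helper, not something the skill defines.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first number from text like '$1,299.00'; None if absent."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))
```

`Decimal` is used instead of `float` so that monetary values are not subject to binary rounding.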

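### 2. API call
```python
import requests

def call_api(endpoint, params=None):
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN',  # placeholder: supply a real token
        'Content-Type': 'application/json'
    }
    response = requests.get(endpoint, headers=headers, params=params, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()
```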
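
The example above covers REST. For GraphQL, the common convention is a POST whose JSON body carries `query` and `variables`, with failures reported in an `errors` array of the response. The sketch below follows that convention using only the standard library (`urllib` rather than `requests`, so it has no third-party dependency); both function names are hypothetical.

```python
import json
import urllib.request


def build_graphql_request(endpoint, query, variables=None, token=None):
    """Build a POST request carrying a GraphQL query as a JSON body."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def parse_graphql_response(raw):
    """Return the `data` field, raising if the server reported errors."""
    payload = json.loads(raw)
    if payload.get("errors"):
        messages = "; ".join(e.get("message", "unknown error") for e in payload["errors"])
        raise RuntimeError(f"GraphQL errors: {messages}")
    return payload.get("data")


# Usage against a real endpoint (not executed here):
#
#     req = build_graphql_request(url, "query { viewer { id } }", token="...")
#     with urllib.request.urlopen(req, timeout=10) as resp:
#         data = parse_graphql_response(resp.read())
```

Checking `errors` matters because GraphQL servers often return HTTP 200 even when the query failed.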

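### 3. Dynamic web page scraping
```python
import subprocess

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_dynamic_page(url):
    driver = webdriver.Chrome()
    try:
        # Implicit wait: element lookups poll for up to 10 s before giving up
        driver.implicitly_wait(10)
        driver.get(url)

        # Extract data
        elements = driver.find_elements(By.CLASS_NAME, 'item')
        return [elem.text for elem in elements]
    finally:
        driver.quit()  # always close the browser, even on errors
        # Mandatory cleanup of leftover processes (see the cleanup rule above)
        subprocess.run(['pkill', '-f', 'chrome'], capture_output=True)
```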

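## Anti-scraping countermeasures
- **Request header spoofing**: mimic a real browser
- **Proxy rotation**: use a proxy pool
- **CAPTCHA handling**: OCR or a third-party service
- **Cookie management**: maintain session state
- **Request rate control**: avoid triggering limits
- **JavaScript rendering**: use Selenium/Playwright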
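
Request rate control from the list above can be as simple as enforcing a minimum gap between requests. The `Throttle` class below is a hypothetical sketch of that idea; the interval to use is an assumption that depends on the target site's limits.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: create one Throttle per target host and call wait() before each request.
#
#     throttle = Throttle(min_interval=2.0)
#     for url in urls:
#         throttle.wait()
#         fetch(url)
```

The first call never sleeps; later calls sleep only for the part of the interval that has not already elapsed.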

## Data storage options
- **CSV/Excel**: simple data export
- **JSON**: structured data storage
- **Database**: MySQL, PostgreSQL, MongoDB
- **Cloud storage**: S3, OSS
- **Data warehouse**: for large-scale data analysis
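
For the two file-based options above, CSV and JSON, the standard library is enough. The sketch below writes the same records to both formats; `save_records` is a hypothetical helper and it assumes each record is a flat dict of simple values.

```python
import csv
import json


def save_records(records, csv_path, json_path):
    """Write a list of flat dicts to both a CSV file and a JSON file."""
    # Union of all keys, sorted so the CSV column order is stable
    fieldnames = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` keeps non-ASCII text such as Chinese product titles readable in the JSON output.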


