
Web Scraping Automation
Automate fetching website or API data with Python or Node scripts, including parsing, scheduling, and mandatory browser process cleanup after scrapes.
Overview
web-scraping-automation is an agent skill most often used in Build (also Idea, Grow) that automates website and API data collection with Python or JavaScript scrapers and mandatory browser cleanup.
Install
npx skills add https://github.com/aaaaqwq/claude-code-skills --skill web-scraping-automation
What is this skill?
- End-to-end flow: target analysis, stack choice, implementation, and post-run Chrome/Selenium cleanup via pkill
- Python stack: requests, BeautifulSoup4, Scrapy, Selenium, Playwright
- JavaScript stack: axios, cheerio, puppeteer, node-fetch
- REST and GraphQL API call, test, and response parsing workflows
- Explicit anti–resource-leak rule to avoid Gateway CPU overload from lingering browser processes
- Mandatory Chrome/Selenium cleanup via browser.close/quit plus pkill -f chrome
Adoption & trust: 560 installs on skills.sh; 69 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need repeatable data from the web or an API but lack a safe script pattern, stack choice, and cleanup discipline for headless browsers.
Who is it for?
Solo builders prototyping data pipelines, price monitors, or research harvesters where they control the runtime and legal scope.
Skip if: you run compliance-heavy production crawls without your own legal review, or your team forbids shell and browser automation in the agent environment.
When should I use this skill?
When the user needs to scrape web content, call and parse APIs, create crawler scripts, handle anti-scraping, or schedule data collection (including Chinese prompts about 爬取/API).
What do I get? / Deliverables
You get runnable crawler or API client scripts with parsing, optional scheduling hooks, and closed browser processes after each job.
- Scraper or API automation script with parsing and storage hooks
- Documented cleanup pattern for browser automation runs
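One way to package the cleanup pattern as a reusable helper is sketched below. It is an illustration, not code shipped with the skill: `managed_browser`, `factory`, `sweep`, and `run` are hypothetical names, and the `pkill` sweep assumes a Unix-like host.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def managed_browser(factory, sweep=True, run=subprocess.run):
    """Yield a browser or driver and always shut it down afterwards.

    factory: zero-argument callable that returns the browser object.
    sweep:   also run the `pkill -f chrome` sweep after shutdown.
    run:     injectable stand-in for subprocess.run (useful in tests).
    """
    browser = factory()
    try:
        yield browser
    finally:
        try:
            # Selenium drivers expose quit(); Playwright browsers expose close().
            shutdown = getattr(browser, "quit", None) or browser.close
            shutdown()
        finally:
            if sweep:
                # Matches every process whose command line contains "chrome",
                # including browsers this script did not start.
                run(["pkill", "-f", "chrome"], capture_output=True)
```

With Selenium this could be used as `with managed_browser(webdriver.Chrome) as driver: driver.get(url)`. The driver is shut down even if the scrape raises.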
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Most owners install this when they need working fetch-and-parse automation wired into their product or ops scripts—the Build integrations shelf. Scrapers and API clients are integration work: HTTP, browsers, storage, and cron-style jobs connecting external data to your codebase.
Where it fits
Harvest public pricing or feature lists from competitor sites to compare positioning before you commit to a build.
Wire a Playwright script that logs in, extracts tables, and writes JSON your backend importer consumes.
Schedule a nightly fetch of industry news or listings to feed a content or alert pipeline.
Re-run an API health poll and store snapshots when a partner endpoint changes response shape.
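For the last scenario, detecting that a response "changed shape" could be done by reducing each JSON payload to its keys and value types and comparing the results. The sketch below assumes the payload has already been decoded with `json.loads` or `response.json()`. `json_shape` and `shape_changed` are hypothetical helper names, and storing the snapshot is left to the caller.

```python
import json
from typing import Any


def json_shape(value: Any) -> Any:
    """Reduce a decoded JSON value to its structure: keys and type names only."""
    if isinstance(value, dict):
        return {key: json_shape(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Summarise a list by the distinct shapes of its items.
        shapes = {json.dumps(json_shape(item), sort_keys=True) for item in value}
        return [json.loads(shape) for shape in sorted(shapes)]
    return type(value).__name__


def shape_changed(previous: Any, current: Any) -> bool:
    """Return True when two payloads differ in keys or value types."""
    return json_shape(previous) != json_shape(current)
```

A poller could call `shape_changed(last_payload, new_payload)` on each run and write a new snapshot only when it returns `True`, so routine value changes do not create noise.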
How it compares
Delivers owned scripts and workflows—not a managed scrape API or passive MCP read-only connector.
Common Questions / FAQ
Who is web-scraping-automation for?
Solo and indie developers who want agent-guided scrapers and API callers in Python or JavaScript with explicit process cleanup after browser automation.
When should I use web-scraping-automation?
In Idea for competitor or catalog research, in Build when integrating external data sources, and in Grow or Operate for scheduled ingestion—always with cleanup after Playwright or Selenium.
Is web-scraping-automation safe to install?
It requests Bash, network, and browser-related actions; check the Security Audits panel on this page and never aim scrapers at sites or data you are not authorized to access.
SKILL.md
SKILL.md - Web Scraping Automation
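# Website Scraping and API Automation

## Overview

This skill automates website data scraping and API calls, including:

- Analyzing and scraping website structure
- Calling and testing REST/GraphQL APIs
- Creating automated crawler scripts
- Parsing and cleaning data
- Handling anti-scraping mechanisms
- Scheduling jobs and storing data

## Use Cases

- "Scrape the product information from this website"
- "Call this API for me and parse the returned data"
- "Create a script that fetches news on a schedule"
- "Analyze this website's API documentation"
- "Bypass this website's anti-scraping restrictions"

## Tech Stack

### ⚠️ Resource Cleanup Rule (Mandatory)

**Every scraping task that uses a browser must automatically close its Chrome/Selenium processes when it finishes!**

```python
import subprocess

from playwright.sync_api import sync_playwright
from selenium import webdriver


# Playwright example
def scrape_website():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # ... scraping logic ...
        finally:
            browser.close()

    # ⚠️ Force-clean leftover processes.
    # Note: this matches every process whose command line contains "chrome".
    subprocess.run(['pkill', '-f', 'chrome'], capture_output=True)


# Selenium example
def scrape_with_selenium():
    driver = webdriver.Chrome()
    try:
        # ... scraping logic ...
        pass
    finally:
        driver.quit()
        # ⚠️ Make sure nothing is left running
        subprocess.run(['pkill', '-f', 'chrome'], capture_output=True)
```

**Reason**: Avoid memory leaks and wasted resources, and keep the Gateway from being overloaded at 100% CPU.

### Python Scraping

- **requests**: HTTP request library
- **BeautifulSoup4**: HTML parsing
- **Scrapy**: full-featured crawling framework
- **Selenium**: browser automation
- **Playwright**: modern browser automation

### JavaScript Scraping

- **axios**: HTTP client
- **cheerio**: server-side jQuery
- **puppeteer**: Chrome automation
- **node-fetch**: Fetch API

## Workflow

1. **Target analysis**:
   - Inspect the site structure and where the data lives
   - Analyze API endpoints and authentication methods
   - Assess anti-scraping mechanisms

2. **Solution design**:
   - Choose a suitable tech stack
   - Design the data extraction strategy
   - Plan error handling and retry logic

3. **Script development**:
   - Write the scraper code
   - Implement the data parsing logic
   - Add logging and monitoring

4. **Testing and optimization**:
   - Verify data accuracy
   - Optimize performance and stability
   - Handle edge cases

## Best Practices

- Follow robots.txt rules
- Set reasonable intervals between requests
- Send a User-Agent and appropriate request headers
- Implement retry on errors
- Deduplicate and validate data
- Use a proxy pool (if needed)
- Keep raw data and logs

## Common Scenario Examples

### 1. Simple Web Page Scraping

```python
import requests
from bs4 import BeautifulSoup


def scrape_website(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract data
    data = []
    for item in soup.select('.product'):
        data.append({
            'title': item.select_one('.title').text,
            'price': item.select_one('.price').text
        })
    return data
```

### 2. API Call

```python
import requests


def call_api(endpoint, params=None):
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN',
        'Content-Type': 'application/json'
    }
    response = requests.get(endpoint, headers=headers, params=params, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    return response.json()
```

### 3. Dynamic Page Scraping

```python
from selenium import webdriver
from selenium.webdriver.common.by import By


def scrape_dynamic_page(url):
    driver = webdriver.Chrome()
    try:
        # Element lookups poll for up to 10 seconds before giving up
        driver.implicitly_wait(10)
        driver.get(url)

        # Extract data
        elements = driver.find_elements(By.CLASS_NAME, 'item')
        return [elem.text for elem in elements]
    finally:
        # Always close the browser, even if the scrape fails
        driver.quit()
```

## Anti-Scraping Countermeasures

- **Request header spoofing**: mimic a real browser
- **Proxy rotation**: use a proxy pool
- **CAPTCHA handling**: OCR or third-party services
- **Cookie management**: maintain session state
- **Request rate control**: avoid triggering limits
- **JavaScript rendering**: use Selenium/Playwright

## Data Storage Options

- **CSV/Excel**: simple data export
- **JSON**: structured data storage
- **Databases**: MySQL, PostgreSQL, MongoDB
- **Cloud storage**: S3, OSS
- **Data warehouse**: for large-scale data analysis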
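
Two of the best practices listed earlier, honoring robots.txt and retrying on errors, can be combined in one small helper. This is a minimal sketch using only the standard library. `build_robots`, `polite_fetch`, the `example-bot` user agent, and the backoff values are hypothetical, and `fetch` stands for whatever function actually performs the request (for example a wrapper around `requests.get`).

```python
import time
import urllib.robotparser
from typing import Any, Callable, Iterable, Optional


def build_robots(lines: Iterable[str]) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(list(lines))
    return parser


def polite_fetch(
    url: str,
    fetch: Callable[[str], Any],
    robots: urllib.robotparser.RobotFileParser,
    user_agent: str = "example-bot",
    retries: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Fetch url if robots.txt allows it, retrying with exponential backoff.

    Returns None when robots.txt disallows the URL for user_agent.
    """
    if not robots.can_fetch(user_agent, url):
        return None
    for attempt in range(retries):
        try:
            return fetch(url)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x base_delay, ...
    return None
```

The default `retry_on=(OSError,)` covers network errors from both `urllib` and `requests`, whose `RequestException` derives from `IOError`. Passing `sleep` as a parameter keeps the helper easy to test without real delays.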