Web Crawler

Name: Web Crawler
Author: starchild-ai-agent

starchild-ai-agent/official-skills

3.7k installs
18 repo stars
Updated July 27, 2026

web-crawler is an agent skill for scraping web pages and social surfaces, including YouTube, TikTok, Instagram, and LinkedIn, with structured extraction.

About

The web-crawler skill covers web scraping plus social data: YouTube, TikTok, Instagram, LinkedIn, and other platforms.

Preferred entry: call exports.py (don't hand-roll requests)

Ready-made helpers live in skills/web-crawler/exports.py. Prefer them over writing your own proxied_get/proxied_post calls: they already inject the proxy credentials, so there is no API key to find (don't read $SCRAPECREATORS_API_KEY / $FIRECRAWL_API_KEY, don't check .env, don't ask the user).

```python
import sys; sys.path.insert(0, "/data/workspace/skills/web-crawler")
from exports import scrape_markdown, youtube_transcript, sc_get
scrape_markdown("https://example.com/article")          # Firecrawl fallback
youtube_transcript("https://youtube.com/watch?v=ID")     # ScrapeCreators
sc_get("/v1/tiktok/profile", handle="charlidamelio")     # any SC endpoint

from exports import archive_fallback                      # paywall / Firecrawl-403
archive_fallback("https://www.nytimes.com/.../article.html")  # archive snapshot
```

Named wrappers exist for the high-frequency actions (YouTube/TikTok transcript & video, IG/Twitter/Reddit posts, profiles, Google/Reddit search). For any other ScrapeCreators endpoint use sc_get(path, **params); it auto-strips leading @/# from handles/hashtags.

Web Crawler by the numbers

3,711 all-time installs (skills.sh)
+174 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #114 of 2,742 Automation & Workflows skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

web-crawler capabilities & compatibility

Use cases: web scraping · research

npx skills add https://github.com/starchild-ai-agent/official-skills --skill web-crawler

Installs	3.7k
repo stars	★ 18
Security audit	1 / 3 scanners passed
Last updated	July 27, 2026
Repository	starchild-ai-agent/official-skills ↗

How do I scrape web and social surfaces, including YouTube, TikTok, Instagram, and LinkedIn, with structured extraction?

Scrape web and social surfaces including YouTube, TikTok, Instagram, and LinkedIn with structured extraction.

Who is it for?

Developers who need help with automation and workflow tasks during build work.

Skip if: the task falls outside the Automation & Workflows scope described in SKILL.md.

When should I use this skill?

Scrape web and social surfaces including YouTube, TikTok, Instagram, and LinkedIn with structured extraction.

What you get

A completed automation workflow that follows the steps and validation in SKILL.md.

Structured post extracts
Transcript text
Comment and trending data

Files

SKILL.md (Markdown, GitHub ↗)

Preferred entry: call exports.py (don't hand-roll requests)

Ready-made helpers live in skills/web-crawler/exports.py. Prefer them over writing your own proxied_get/proxied_post calls — they already inject the proxy credentials, so there is no API key to find (don't read $SCRAPECREATORS_API_KEY / $FIRECRAWL_API_KEY, don't check .env, don't ask the user).

import sys; sys.path.insert(0, "/data/workspace/skills/web-crawler")
from exports import scrape_markdown, youtube_transcript, sc_get
scrape_markdown("https://example.com/article")          # Firecrawl fallback
youtube_transcript("https://youtube.com/watch?v=ID")     # ScrapeCreators
sc_get("/v1/tiktok/profile", handle="charlidamelio")     # any SC endpoint

from exports import archive_fallback                      # paywall / Firecrawl-403
archive_fallback("https://www.nytimes.com/.../article.html")  # archive snapshot

Named wrappers exist for the high-frequency actions (YouTube/TikTok transcript & video, IG/Twitter/Reddit posts, profiles, Google/Reddit search). For any other ScrapeCreators endpoint use sc_get(path, **params) — it auto-strips leading @/# from handles/hashtags. The intent-routing tables below still tell you which endpoint to pass. Pass caller_id="chat:<thread>" (or job:/preview:) for cost tracking.

Quick trigger rules (read this first)

Use this skill immediately when any of these conditions is true:

web_fetch returns HTTP 401/403/429/5xx
Response is an anti-bot challenge page (for example Cloudflare "Attention Required", "Just a moment", or challenge/captcha pages)
The page is JS-heavy and the first fetch misses required detail fields (for example publish time, author, listing code, updated time, price breakdown)
Search results do not contain the requested field and the value must be extracted from the target page itself

Fallback rule:

If ordinary fetch is blocked or incomplete, switch to Firecrawl fallback in this skill before asking the user for screenshots/manual text.
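
The trigger rules above can be sketched as a small pre-check that runs on the result of an ordinary fetch. This is a minimal illustration, not part of the skill: `needs_firecrawl_fallback`, `CHALLENGE_MARKERS`, and the substring test for required fields are hypothetical, and the marker list only covers the example phrases quoted above.

```python
# Status codes the trigger rules name explicitly; 5xx is handled as a range.
BLOCKED_STATUSES = {401, 403, 429}

# Phrases taken from the examples above; illustrative, not exhaustive.
CHALLENGE_MARKERS = ("attention required", "just a moment", "captcha")


def needs_firecrawl_fallback(status_code, body, required_fields=()):
    """Return a short reason string when a trigger fires, else None.

    Hypothetical helper: required_fields is a naive substring check that
    stands in for "the first fetch misses required detail fields".
    """
    if status_code in BLOCKED_STATUSES or 500 <= status_code <= 599:
        return f"http {status_code}"
    text = (body or "").lower()
    for marker in CHALLENGE_MARKERS:
        if marker in text:
            return f"challenge page ({marker})"
    missing = [name for name in required_fields if name.lower() not in text]
    if missing:
        return "missing fields: " + ", ".join(missing)
    return None
```

A non-None result means the next step is the Firecrawl fallback rather than a request to the user for screenshots or pasted text.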

Error signatures -> action

Signature	Action
`web_fetch` HTTP 401/403/429/5xx	Call Firecrawl `POST /v2/scrape` once with `formats:["markdown","links"]` + `onlyMainContent:true`
Cloudflare/challenge page text in body	Same Firecrawl call as above
Markdown still misses key fields	Retry once with `formats:["rawHtml"]`
Firecrawl itself returns 403 / empty (hard paywall: NYT, WSJ, Economist, FT, Bloomberg)	Call `archive_fallback(url)` — recovers full text from a web archive snapshot

Paywall / Firecrawl-blocked fallback chain (use `archive_fallback`)

When Firecrawl can't get the page either (it returns 403, or markdown comes back empty) the site is behind a hard paywall or aggressive WAF. Do NOT keep retrying Firecrawl. Recover the article from a web archive snapshot instead:

from exports import archive_fallback
res = archive_fallback("https://www.nytimes.com/.../article.html")
res["markdown"]      # full text, or "" if no snapshot exists anywhere
res["source"]        # "archive.today" | "wayback" | None
res["snapshot_url"]  # the snapshot that was scraped

How it works (and why this order):

archive.today first (archive.ph / archive.is mirrors). User-triggered, real-browser captures; historically preserves full text behind paywalls. Best bet for NYT/WSJ/Economist. We scrape its /newest/ snapshot via Firecrawl (archive.today has its own Cloudflare, so scrape it through Firecrawl, never web_fetch it directly).

Wayback Machine second (archive.org). Automated crawler that honors robots.txt and paywalls, so it often has NO full text for hard paywalls, but it's a good fallback for ordinary 403/Cloudflare pages that aren't paywalled.

Limitation: archives only return text someone already saved. If res["markdown"] == "", no snapshot exists — stop, tell the user, and try the outlet's official API/RSS or a different source. Do not fabricate the article.
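
One way to act on the returned dict is sketched below. It assumes only the three keys documented above (markdown, source, snapshot_url); the helper name `summarize_archive_result` and the message wording are hypothetical.

```python
def summarize_archive_result(res):
    """Turn an archive_fallback-style dict into (ok, message).

    Hypothetical helper: it never invents article text. An empty
    markdown field is reported as "no snapshot" so the caller can stop.
    """
    markdown = res.get("markdown") or ""
    if not markdown:
        return False, ("No archived snapshot found; try the outlet's "
                       "official API/RSS or a different source.")
    source = res.get("source") or "unknown archive"
    snapshot = res.get("snapshot_url") or "unknown snapshot"
    return True, f"Recovered {len(markdown)} characters from {source} ({snapshot})"
```

Returning a flag alongside the message keeps the "stop and tell the user" branch explicit instead of relying on an empty string being noticed downstream.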

What each service is for

ScrapeCreators — Social media data extraction (27+ platforms)

Use for any request involving social media profiles, posts, videos, comments, transcripts, search, ads, trending content, or engagement metrics. Covers TikTok, Instagram, YouTube, LinkedIn, Facebook, Twitter/X, Reddit, Threads, Bluesky, Pinterest, Snapchat, Twitch, Kick, Truth Social, TikTok Shop, Google search, and link-in-bio services (Linktree, Komi, Pillar, Linkbio, Linkme, Amazon Shop).

Base URL: https://api.scrapecreators.com

Auth: No user-supplied key needed. sc-proxy injects platform credentials automatically; just send the request. The x-api-key header can be any value or omitted entirely. Do NOT bail out or ask the user for a key if $SCRAPECREATORS_API_KEY looks unset; that env var is intentionally not required.

Method: All endpoints use GET requests with query params. Responses are JSON.

Firecrawl — Fallback web page scraper

Only a fallback crawler for one web page when ordinary fetching fails. Use POST /v2/scrape with a single url and focused formats like markdown, html, rawHtml, links, summary, or constrained json/question/highlights extraction.

Auth: No user-supplied key needed. sc-proxy injects the Firecrawl credential automatically when you call through core.http_client.proxied_post — just send the request. Do NOT read $FIRECRAWL_API_KEY from env, do NOT check .env for it, and do NOT ask the user for a Firecrawl key if it looks unset; that env var is intentionally not required. The same proxy-injection model as ScrapeCreators applies here.

Do not use Firecrawl crawl/map/search/agent/browser endpoints. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.
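
The format restrictions above could be enforced before a request is sent. The sketch below is an assumption-laden illustration: `build_scrape_payload` and `ALLOWED_FORMATS` are hypothetical names, the allowed set is copied from the formats listed in this section, and only plain string format names are handled (structured format objects are out of scope here).

```python
# Formats this section lists as acceptable for POST /v2/scrape.
ALLOWED_FORMATS = {"markdown", "html", "rawHtml", "links", "summary",
                   "json", "question", "highlights"}


def build_scrape_payload(url, formats=("markdown", "links"), only_main_content=True):
    """Build a single-page scrape payload, rejecting formats outside the allowed set."""
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("url must be a single absolute http(s) URL")
    rejected = [fmt for fmt in formats if fmt not in ALLOWED_FORMATS]
    if rejected:
        raise ValueError("formats not allowed by this skill: " + ", ".join(rejected))
    return {"url": url, "formats": list(formats), "onlyMainContent": only_main_content}
```

Failing locally on something like `screenshot` avoids spending a paid call on a request the proxy policy is expected to refuse.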

---

ScrapeCreators — Intent routing

Map user intent to the right endpoint. Endpoint paths use the pattern /v1/platform/action.

Important: After selecting an endpoint from the tables below, fetch its OpenAPI spec at https://docs.scrapecreators.com/{path}/openapi.json for full parameter details, types, and example response before making the actual API call. For example: https://docs.scrapecreators.com/v1/tiktok/profile/openapi.json
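
The spec URL pattern above is mechanical, so it can be derived from the endpoint path. A minimal sketch, assuming the pattern holds for every endpoint in the tables; `openapi_spec_url` is a hypothetical helper name.

```python
DOCS_BASE = "https://docs.scrapecreators.com"


def openapi_spec_url(endpoint_path):
    """Map an endpoint path such as '/v1/tiktok/profile' to its per-endpoint spec URL."""
    path = endpoint_path.strip().strip("/")
    if not path:
        raise ValueError("endpoint_path is empty")
    return f"{DOCS_BASE}/{path}/openapi.json"
```

For example, `openapi_spec_url("/v1/tiktok/profile")` yields the URL quoted in the paragraph above.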

Profiles / User Info

Platform	Endpoint	Primary Param	Example
TikTok	`/v1/tiktok/profile`	handle	`stoolpresidente`
Instagram	`/v1/instagram/profile`	handle	`jane`
YouTube	`/v1/youtube/channel`	handle, channelId, or url	`ThePatMcAfeeShow`
LinkedIn (person)	`/v1/linkedin/profile`	url	`https://www.linkedin.com/in/parrsam/`
LinkedIn (company)	`/v1/linkedin/company`	url	`https://linkedin.com/company/shopify`
Facebook	`/v1/facebook/profile`	url	`https://www.facebook.com/mantraindianfolsom`
Twitter/X	`/v1/twitter/profile`	handle	`elonmusk`
Reddit	`/v1/reddit/subreddit/details`	subreddit or url	`AskReddit`
Threads	`/v1/threads/profile`	handle	`zuck`
Bluesky	`/v1/bluesky/profile`	handle	`jay.bsky.team`
Pinterest	`/v1/pinterest/user/boards`	handle	`pinterest`
Truth Social	`/v1/truthsocial/profile`	handle	`realDonaldTrump`
Twitch	`/v1/twitch/profile`	handle	`ninja`
Snapchat	`/v1/snapchat/profile`	handle	`djkhaled`

Posts / Content Feeds

Platform	Endpoint	Primary Param	Example
TikTok videos	`/v3/tiktok/profile/videos`	handle	`stoolpresidente`
Instagram posts	`/v2/instagram/user/posts`	handle	`jane`
Instagram reels	`/v1/instagram/user/reels`	handle or user_id	`jane` or `2700692569`
Instagram highlights	`/v1/instagram/user/highlights`	handle or user_id	`jane` or `2700692569`
YouTube videos	`/v1/youtube/channel/videos`	handle or channelId	`ThePatMcAfeeShow`
YouTube shorts	`/v1/youtube/channel/shorts`	handle or channelId	`starterstory`
YouTube playlist	`/v1/youtube/playlist`	playlist_id	`PLP32wGpgzmIlInfgKVFfCwVsxgGqZNIiS`
LinkedIn posts	`/v1/linkedin/company/posts`	url	`https://linkedin.com/company/shopify`
Facebook posts	`/v1/facebook/profile/posts`	url or pageId	`https://www.facebook.com/pacemorby`
Facebook reels	`/v1/facebook/profile/reels`	url	`https://www.facebook.com/Spurs`
Facebook photos	`/v1/facebook/profile/photos`	url	`https://www.facebook.com/Spurs`
Facebook group posts	`/v1/facebook/group/posts`	url or group_id	`742354120555345`
Twitter tweets	`/v1/twitter/user/tweets`	handle	`elonmusk`
Reddit posts	`/v1/reddit/subreddit`	subreddit	`AskReddit`
Threads posts	`/v1/threads/user/posts`	handle	`zuck`
Bluesky posts	`/v1/bluesky/user/posts`	handle or user_id	`jay.bsky.team`
Truth Social posts	`/v1/truthsocial/user/posts`	handle or user_id	`realDonaldTrump`
Pinterest board	`/v1/pinterest/board`	url	`https://www.pinterest.com/...`

Single Post / Video Details

Platform	Endpoint	Primary Param	Example
TikTok	`/v2/tiktok/video`	url	`https://www.tiktok.com/@randomspamvideos25/video/7251387037834595630`
Instagram	`/v1/instagram/post`	url	`https://www.instagram.com/reel/DOq6eV6iIgD`
Instagram highlight	`/v1/instagram/user/highlight/detail`	id	`18067016518767507`
YouTube	`/v1/youtube/video`	url	`https://www.youtube.com/watch?v=Y2Ah_DFr8cw`
YouTube community post	`/v1/youtube/community-post`	url	`https://www.youtube.com/post/Ugkxvj2KoApYAXoqLWnKVr6zZe5JjeHrQeP8`
LinkedIn	`/v1/linkedin/post`	url	`https://www.linkedin.com/pulse/being-father-has-made-me-better-leader...`
Facebook	`/v1/facebook/post`	url	`https://www.facebook.com/reel/1535656380759655`
Twitter/X	`/v1/twitter/tweet`	url	`https://twitter.com/elonmusk/status/...`
Twitter/X community	`/v1/twitter/community`	url	`https://twitter.com/i/communities/...`
Twitter/X community tweets	`/v1/twitter/community/tweets`	url	`https://twitter.com/i/communities/...`
Reddit	`/v1/reddit/post/comments`	url	`https://www.reddit.com/r/AskReddit/comments/...`
Threads	`/v1/threads/post`	url	`https://www.threads.net/@zuck/post/...`
Bluesky	`/v1/bluesky/post`	url	`https://bsky.app/profile/.../post/...`
Truth Social	`/v1/truthsocial/post`	url	`https://truthsocial.com/@realDonaldTrump/posts/...`
Pinterest	`/v1/pinterest/pin`	url	`https://www.pinterest.com/pin/...`
Twitch clip	`/v1/twitch/clip`	url	`https://clips.twitch.tv/...`
Kick clip	`/v1/kick/clip`	url	`https://kick.com/...`

Comments

Platform	Endpoint	Primary Param	Example
TikTok	`/v1/tiktok/video/comments`	url	`https://www.tiktok.com/@stoolpresidente/video/7499229683859426602`
Instagram	`/v2/instagram/post/comments`	url	`https://www.instagram.com/reel/DOq6eV6iIgD`
YouTube	`/v1/youtube/video/comments`	url	`https://www.youtube.com/watch?v=dQw4w9WgXcQ`
Facebook	`/v1/facebook/post/comments`	url or feedback_id	`https://www.facebook.com/reel/753347914167361`
Reddit	`/v1/reddit/post/comments`	url	`https://www.reddit.com/r/AskReddit/comments/...`

Transcripts

Platform	Endpoint	Example	Note
TikTok	`/v1/tiktok/video/transcript`	`url=https://www.tiktok.com/...&lang=en`	also via `/v2/tiktok/video` with get_transcript=true
Instagram	`/v2/instagram/media/transcript`	`url=https://www.instagram.com/reel/...`	AI-powered, 10-30s, under 2min
YouTube	`/v1/youtube/video/transcript`	`url=https://www.youtube.com/watch?v=bjVIDXPP7Uk`	also included in `/v1/youtube/video` response
Facebook	`/v1/facebook/post/transcript`	`url=https://www.facebook.com/reel/...`	under 2min only
Twitter/X	`/v1/twitter/tweet/transcript`	`url=https://twitter.com/...`	AI-powered, slow

Search

Platform	Endpoint	Primary Param	Example
TikTok users	`/v1/tiktok/search/users`	query	`funny`
TikTok videos (keyword)	`/v1/tiktok/search/keyword`	query	`funny`
TikTok videos (hashtag)	`/v1/tiktok/search/hashtag`	hashtag	`fyp`
TikTok top (photos+videos)	`/v1/tiktok/search/top`	query	`funny`
Instagram reels	`/v2/instagram/reels/search`	query	`dogs`
YouTube	`/v1/youtube/search`	query	`funny`
YouTube hashtag	`/v1/youtube/search/hashtag`	hashtag	`funny`
Reddit (all)	`/v1/reddit/search`	query	`best programming languages`
Reddit (in subreddit)	`/v1/reddit/subreddit/search`	subreddit + query	`AskReddit` + `funny`
Threads posts	`/v1/threads/search`	query	`AI`
Threads users	`/v1/threads/search/users`	query	`zuck`
Pinterest	`/v1/pinterest/search`	query	`home decor`
Google	`/v1/google/search`	query	`best restaurants in NYC`

Ad Libraries

Platform	Endpoint	Primary Param	Example
Facebook ads search	`/v1/facebook/adLibrary/search/ads`	query	`running`
Facebook company ads	`/v1/facebook/adLibrary/company/ads`	pageId or companyName	`Lululemon`
Facebook ad detail	`/v1/facebook/adLibrary/ad`	id or url	`702369045530963`
Facebook find companies	`/v1/facebook/adLibrary/search/companies`	query	`Nike`
Google company ads	`/v1/google/company/ads`	domain or advertiser_id	`nike.com`
Google ad detail	`/v1/google/ad`	url	`https://adstransparency.google.com/...`
Google find advertisers	`/v1/google/adLibrary/advertisers/search`	query	`Nike`
LinkedIn ads search	`/v1/linkedin/ads/search`	company or keyword	`Shopify`
LinkedIn ad detail	`/v1/linkedin/ad`	url	`https://www.linkedin.com/ad/...`
Reddit ads search	`/v1/reddit/ads/search`	query	`gaming`
Reddit ad detail	`/v1/reddit/ad`	id	`t3_abc123`

Trending / Popular

Content	Endpoint	Param	Example
Trending feed	`/v1/tiktok/get-trending-feed`	region (required)	`US`
Popular videos	`/v1/tiktok/videos/popular`
Popular creators	`/v1/tiktok/creators/popular`
Popular hashtags	`/v1/tiktok/hashtags/popular`
Popular songs	`/v1/tiktok/songs/popular`
Song details	`/v1/tiktok/song`	clipId	`7439295283975702544`
Videos using song	`/v1/tiktok/song/videos`	clipId	`7439295283975702544`
Trending shorts (YT)	`/v1/youtube/shorts/trending`

Followers / Following / Live (TikTok only)

Type	Endpoint	Example
Following	`/v1/tiktok/user/following`	`handle=stoolpresidente`
Followers	`/v1/tiktok/user/followers`	`handle=stoolpresidente`
Audience demographics	`/v1/tiktok/user/audience` (26 credits!)	`handle=shakira`
Live stream	`/v1/tiktok/user/live`	`handle=thejustalex`

TikTok Shop

Type	Endpoint	Primary Param	Example
Search products	`/v1/tiktok/shop/search`	query	`shoes`
Store products	`/v1/tiktok/shop/products`	url	`https://www.tiktok.com/shop/store/goli-nutrition/7495794203056835079`
Product detail	`/v1/tiktok/product`	url	`https://www.tiktok.com/shop/pdp/goli-ashwagandha-gummies.../1729587769570529799`
Product reviews	`/v1/tiktok/shop/product/reviews`	url or product_id	`1731578642912612516`
User showcase	`/v1/tiktok/user/showcase`	handle	`mrtiktokreviews`

Link-in-Bio / Other

Service	Endpoint	Param	Example
Linktree	`/v1/linktree`	url	`https://linktr.ee/...`
Komi	`/v1/komi`	url	`https://komi.io/...`
Pillar	`/v1/pillar`	url	`https://pillar.io/...`
Linkbio	`/v1/linkbio`	url	`https://linkbio.co/...`
Linkme	`/v1/linkme`	url	`https://linkme.bio/...`
Amazon Shop	`/v1/amazon/shop`	url	`https://www.amazon.com/shop/...`
Instagram basic profile	`/v1/instagram/basic/profile`	userId	`314216`
Instagram embed HTML	`/v1/instagram/user/embed`	handle	`jane`
Age/Gender detect	`/v1/detect/age-gender`	url (social profile)	`https://www.tiktok.com/@charlidamelio`
Credit balance	`/v1/credit/balance`	(none)

ScrapeCreators pagination

Paginated endpoints return a cursor/token in the response. Pass it back as a query param to get the next page.

Cursor Field	Used By
`cursor`	TikTok comments/search/song videos, Instagram comments, Reddit subreddit search, Pinterest, Bluesky, Facebook reels/photos/posts/comments, TikTok Shop products/user showcase
`max_cursor`	TikTok profile videos
`min_time`	TikTok following/followers
`continuationToken`	YouTube (all paginated endpoints)
`after`	Reddit posts, Reddit search
`next_max_id`	Instagram posts, Truth Social posts
`max_id`	Instagram reels
`page`	TikTok popular/shop, Instagram reels search, LinkedIn company posts, TikTok Shop reviews
`paginationToken`	LinkedIn ads
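
A bounded pagination loop over these cursor fields might look like the sketch below. It rests on two assumptions that the per-endpoint OpenAPI spec should confirm: the response exposes the next cursor at the top level, and under the same name as the query param. `paginate` and `fetch_page` are hypothetical; `fetch_page` is any callable that accepts query params as keyword arguments and returns a dict.

```python
def paginate(fetch_page, cursor_field, max_pages=3, **params):
    """Yield up to max_pages pages from a cursor-paginated endpoint.

    Assumes the next cursor sits at page[cursor_field]; stops when the
    cursor is missing, empty, or repeats.
    """
    seen = set()
    for _ in range(max_pages):
        page = fetch_page(**params)
        yield page
        cursor = page.get(cursor_field)
        if not cursor or cursor in seen:
            break
        seen.add(cursor)
        params[cursor_field] = cursor  # pass the cursor back as a query param
```

With the helpers from exports.py this could be driven as `paginate(functools.partial(sc_get, "/v1/tiktok/video/comments"), "cursor", url=video_url)`. The small default for `max_pages` is deliberate, since each page is a billed request.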

ScrapeCreators known limitations

Handles: pass without the @ symbol. Use charlidamelio not @charlidamelio. Applies to TikTok, Instagram, Twitter, Threads, Bluesky, Snapchat, Twitch, Pinterest, Truth Social
YouTube handles: pass without the @ symbol. Use ThePatMcAfeeShow not @ThePatMcAfeeShow. You can also pass a channelId or full URL instead
Hashtags: pass without the # symbol. Use fyp not #fyp. Applies to TikTok and YouTube hashtag search endpoints
Twitter: returns ~100 most popular tweets, not chronological/latest
Threads: only last 20-30 posts visible publicly
Facebook posts: only 3 posts per page (API limitation)
Facebook group posts: only 3 posts per page (same limitation)
LinkedIn company posts: max 7 pages
Instagram play counts: IG-only views (excludes cross-posted FB views)
Truth Social: only prominent users (Trump, Vance, etc.) work publicly
Transcripts: all transcript endpoints require video under 2 minutes
Reddit subreddit names: case-sensitive! Use "AskReddit" not "askreddit"

---

Access patterns

ScrapeCreators (social media)

Use Python with core.http_client.proxied_get so sc-proxy injects credentials and bills correctly. Include a typed SC-CALLER-ID header (chat:, job:, preview:, etc.) for cost tracking. Do not read `$SCRAPECREATORS_API_KEY` from env and do not ask the user for a key — the proxy handles it.

from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

transcript = proxied_get(
    "https://api.scrapecreators.com/v1/youtube/video/transcript",
    params={"url": "https://www.youtube.com/watch?v=VIDEO_ID", "language": "en"},
    headers=headers,
    timeout=20,
).json()

Bash/curl works too (proxy is transparent), but Python is preferred for cost tracking:

curl -s "https://api.scrapecreators.com/v1/tiktok/profile?handle=charlidamelio" \
  -H "x-api-key: any"

Each endpoint has its own OpenAPI spec at https://docs.scrapecreators.com/{path}/openapi.json. Always fetch the per-endpoint spec first to get full parameter details before making the actual API call. The full spec is at https://docs.scrapecreators.com/openapi.json (large file — prefer per-endpoint specs).

Common optional params:

`trim` (boolean): reduces response payload size. Use when you only need key metrics.
`region` (string): 2-letter country code for proxy location. Does NOT filter by region — just routes through that country's proxy.

Firecrawl (web page fallback via transparent proxy)

No Firecrawl API key required — sc-proxy injects it. Don't look it up in .env or ask the user. Just call:

from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()

Decision rules

Route every request to the right backend. The user should never need to specify which API to use.

Social media request (profile, posts, comments, search, ads, trending, transcripts)

Use ScrapeCreators. Match the user's intent to an endpoint from the routing tables above. Strip @ from handles and # from hashtags before calling. Fetch the per-endpoint OpenAPI spec first for full param details.

YouTube URL or YouTube content request

Use ScrapeCreators for YouTube (channel info, videos, shorts, playlists, comments, transcripts, search, trending shorts). Match the user's intent to the appropriate YouTube endpoint from the routing tables above.

For a YouTube URL, use /v1/youtube/video to get video details and transcript. If the user's goal is content analysis, summarization, quote extraction, or topic mining, the transcript is included in the video response.

For a YouTube topic query, use /v1/youtube/search to find relevant videos, then call /v1/youtube/video only for the videos needed. Avoid fetching many videos by default.

Blocked or JS-heavy web page

Use Firecrawl once with formats:["markdown","links"] and onlyMainContent:true. Treat the returned Markdown as the extraction substrate, not as final truth: parse the title, price/value fields, specs, body description, image URLs, outbound links, and obvious contact/location hints from the page structure.

General web-page extraction lessons:

Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. web_fetch may return boilerplate while Firecrawl can still recover the real main content.
Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
If Markdown misses important layout or structured fields, retry once with rawHtml; use json, question, or highlights only when the user asked for narrow extraction and the schema/prompt is specific.
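
As a rough illustration of turning returned Markdown into a generic structured summary, the sketch below pulls out a title, image URLs, links, and lines containing numbers. It is deliberately site-agnostic and simplistic: `summarize_markdown` and the regular expressions are hypothetical, and real pages will need more careful parsing.

```python
import re

HEADING_RE = re.compile(r"^#\s+(.+)$", re.MULTILINE)       # first-level heading
IMAGE_RE = re.compile(r"!\[[^\]]*\]\((\S+?)\)")            # ![alt](url)
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((\S+?)\)")      # [text](url), not images


def summarize_markdown(markdown, max_snippets=5):
    """Extract a generic summary from Markdown without site-specific labels."""
    title = HEADING_RE.search(markdown)
    # Lines with digits are kept as evidence snippets (prices, dates, specs).
    number_lines = [
        line.strip() for line in markdown.splitlines()
        if re.search(r"\d", line) and not line.lstrip().startswith(("!", "[", "#"))
    ]
    return {
        "title": title.group(1).strip() if title else None,
        "image_urls": IMAGE_RE.findall(markdown),
        "links": [url for _label, url in LINK_RE.findall(markdown)],
        "evidence_snippets": number_lines[:max_snippets],
    }
```

The snippets are kept verbatim so that any value reported to the user can be traced back to the page text.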

Cost discipline

ScrapeCreators — most endpoints cost 1 credit per request. Exceptions: /v1/tiktok/user/audience costs 26 credits; /v1/tiktok/video/transcript with use_ai_as_fallback=true costs +10 credits; /v1/google/company/ads with get_ad_details=true costs 25 credits. Warn users before calling expensive endpoints.
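
The credit rules in the paragraph above can be checked before a call is made. This sketch encodes only the costs stated there; `estimate_credits` is a hypothetical helper, and reading "+10 credits" as 1 base credit plus 10 is an assumption.

```python
def estimate_credits(path, **params):
    """Estimate ScrapeCreators credits for one request from the stated rules."""
    def flag(name):
        value = params.get(name)
        return value is True or str(value).lower() == "true"

    if path == "/v1/tiktok/user/audience":
        return 26
    if path == "/v1/google/company/ads" and flag("get_ad_details"):
        return 25
    if path == "/v1/tiktok/video/transcript" and flag("use_ai_as_fallback"):
        return 1 + 10  # assumed: base credit plus the AI-fallback surcharge
    return 1
```

Any estimate above 1 is a reasonable point at which to warn the user before proceeding.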

Firecrawl — billed per page plus expensive modifiers.

Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites or bulk-scrape entire feeds with this skill.

If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.

If the proxy returns 429, back off; do not parallelize around the limit.

If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.
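
The three failure rules above amount to a small decision function. The sketch below is one possible reading: `next_step`, the action labels, and the exponential backoff parameters are all hypothetical, since the rules say only to back off without specifying a schedule.

```python
def next_step(status_code, attempt, max_backoff_attempts=3, base_delay=2.0):
    """Return (action, delay_seconds) for a proxy or upstream response.

    attempt counts earlier tries of the same request, starting at 0.
    """
    if status_code == 403:
        # Outside the allowed use case: retrying will not help.
        return ("change_approach", 0.0)
    if status_code == 429:
        if attempt >= max_backoff_attempts:
            return ("report_failure", 0.0)
        # Serial exponential backoff; never parallelize around the limit.
        return ("backoff", base_delay * 2 ** attempt)
    if status_code >= 400:
        # Report the exact failure instead of repeating a paid call.
        return ("report_failure", 0.0)
    return ("ok", 0.0)
```

For instance, `next_step(429, 1)` suggests waiting 4 seconds under these assumed defaults, while any 403 ends the retry loop immediately.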

"""
web-crawler skill exports — script-mode helpers.

Why this file exists: agents kept hand-rolling proxied_get/proxied_post calls,
which made them wonder "where's the API key?" and waste turns. There is NO key
to find — sc-proxy injects ScrapeCreators + Firecrawl credentials automatically.
Call these functions and ignore auth entirely.

Usage from a bash block:
    python3 - <<'EOF'
    import sys
    sys.path.insert(0, "/data/workspace/skills/web-crawler")
    from exports import youtube_transcript, scrape_page, sc_get
    print(youtube_transcript("https://www.youtube.com/watch?v=VIDEO_ID"))
    EOF

Or via the platform loader:
    from core.skill_tools import web_crawler
    web_crawler.scrape_page("https://example.com/article")

NO API KEY NEEDED for any function here. Do not read $SCRAPECREATORS_API_KEY or
$FIRECRAWL_API_KEY, do not check .env, do not ask the user. Proxy handles it.
"""
from core.http_client import proxied_get, proxied_post

SC_BASE = "https://api.scrapecreators.com"
FC_BASE = "https://api.firecrawl.dev"
_DEFAULT_CALLER = "chat:web-crawler"


def _headers(caller_id=None):
    return {"SC-CALLER-ID": caller_id or _DEFAULT_CALLER}


def _strip(value):
    """Normalize a handle/hashtag: drop leading @ or # that the API rejects."""
    if isinstance(value, str):
        # Trim whitespace first so a value like " @name" still loses its prefix.
        return value.strip().lstrip("@#").strip()
    return value


# ---------------------------------------------------------------------------
# Generic backends — use these for any endpoint not wrapped below.
# ---------------------------------------------------------------------------
def sc_get(path, caller_id=None, timeout=30, **params):
    """Generic ScrapeCreators GET. `path` is the endpoint, e.g.
    '/v1/tiktok/profile'. Pass query params as kwargs:
        sc_get('/v1/tiktok/profile', handle='charlidamelio')
    Handle/hashtag values are auto-stripped of leading @/#.
    Returns parsed JSON. No api key needed — proxy injects it.
    """
    if not path.startswith("/"):
        path = "/" + path
    for k in ("handle", "hashtag"):
        if k in params:
            params[k] = _strip(params[k])
    resp = proxied_get(SC_BASE + path, params=params,
                       headers=_headers(caller_id), timeout=timeout)
    resp.raise_for_status()
    return resp.json()


def scrape_page(url, formats=None, only_main_content=True, caller_id=None,
                timeout=90, **extra):
    """Firecrawl fallback for ONE web page when ordinary fetch is blocked
    (403/429/anti-bot/JS-heavy). Returns parsed JSON; the markdown lives at
    result['data']['markdown']. No api key needed — proxy injects it.
        scrape_page('https://example.com/article')
        scrape_page(url, formats=['rawHtml'])   # retry when markdown misses fields
    """
    payload = {
        "url": url,
        "formats": formats or ["markdown", "links"],
        "onlyMainContent": only_main_content,
        "timeout": 60000,
    }
    payload.update(extra)
    resp = proxied_post(FC_BASE + "/v2/scrape", json=payload,
                        headers=_headers(caller_id), timeout=timeout)
    resp.raise_for_status()
    return resp.json()


def scrape_markdown(url, caller_id=None, **kw):
    """Convenience: scrape_page and return just the markdown string (or '')."""
    data = scrape_page(url, caller_id=caller_id, **kw)
    return (data.get("data") or {}).get("markdown", "")


def wayback_snapshot_url(url, caller_id=None, timeout=30):
    """Ask the Internet Archive for the newest available Wayback snapshot of
    `url`. Returns the snapshot URL string, or None if none archived.
    Note: Wayback honors robots.txt / paywalls, so hard paywalls (NYT/WSJ)
    are often NOT captured here — try archive.today first for those.
    """
    resp = proxied_get("https://archive.org/wayback/available",
                       params={"url": url}, headers=_headers(caller_id),
                       timeout=timeout)
    resp.raise_for_status()
    snaps = (resp.json().get("archived_snapshots") or {}).get("closest") or {}
    return snaps.get("url") if snaps.get("available") else None


def archive_fallback(url, caller_id=None, timeout=120):
    """Last-resort full-text recovery for a page Firecrawl couldn't get
    (hard paywall / Cloudflare returning 403 even through Firecrawl).

    Strategy, in order:
      1. archive.today — scrape the `/newest/` snapshot via Firecrawl. This
         site captures with a real browser and historically preserves full
         text behind paywalls (NYT, WSJ, Economist). Best for paywalls.
      2. Wayback Machine — if archive.today has nothing, fall back to the
         Internet Archive's newest snapshot and scrape that.

    Returns a dict: {markdown, source, snapshot_url}. markdown == "" means
    no archived copy exists anywhere — then stop and tell the user / try a
    different source. archive_fallback CANNOT create a snapshot that nobody
    ever saved; it only retrieves existing ones.
    """
    # 1. archive.today — try its mirror domains; /newest/ resolves latest capture
    for host in ("https://archive.ph/newest/", "https://archive.today/newest/",
                 "https://archive.is/newest/"):
        try:
            md = scrape_markdown(host + url, caller_id=caller_id, timeout=timeout)
            if md and len(md) > 800:  # filter out "no snapshot" / chrome-only pages
                return {"markdown": md, "source": "archive.today",
                        "snapshot_url": host + url}
        except Exception:
            continue
    # 2. Wayback Machine fallback
    try:
        snap = wayback_snapshot_url(url, caller_id=caller_id)
        if snap:
            md = scrape_markdown(snap, caller_id=caller_id, timeout=timeout)
            if md:
                return {"markdown": md, "source": "wayback",
                        "snapshot_url": snap}
    except Exception:
        pass
    return {"markdown": "", "source": None, "snapshot_url": None}


# ---------------------------------------------------------------------------
# High-frequency named wrappers (thin sugar over sc_get).
# ---------------------------------------------------------------------------
def youtube_transcript(url, language="en", caller_id=None):
    return sc_get("/v1/youtube/video/transcript", url=url, language=language,
                  caller_id=caller_id)


def youtube_video(url, caller_id=None):
    return sc_get("/v1/youtube/video", url=url, caller_id=caller_id)


def tiktok_video(url, caller_id=None):
    return sc_get("/v2/tiktok/video", url=url, caller_id=caller_id)


def tiktok_transcript(url, lang="en", caller_id=None):
    return sc_get("/v1/tiktok/video/transcript", url=url, lang=lang,
                  caller_id=caller_id)


def tiktok_profile(handle, caller_id=None):
    return sc_get("/v1/tiktok/profile", handle=handle, caller_id=caller_id)


def instagram_post(url, caller_id=None):
    return sc_get("/v1/instagram/post", url=url, caller_id=caller_id)


def instagram_profile(handle, caller_id=None):
    return sc_get("/v1/instagram/profile", handle=handle, caller_id=caller_id)


def twitter_tweet(url, caller_id=None):
    return sc_get("/v1/twitter/tweet", url=url, caller_id=caller_id)


def reddit_post(url, caller_id=None):
    return sc_get("/v1/reddit/post/comments", url=url, caller_id=caller_id)


def reddit_search(query, caller_id=None, **params):
    return sc_get("/v1/reddit/search", query=query, caller_id=caller_id, **params)


def google_search(query, caller_id=None, **params):
    return sc_get("/v1/google/search", query=query, caller_id=caller_id, **params)


def linkedin_profile(url, caller_id=None):
    return sc_get("/v1/linkedin/profile", url=url, caller_id=caller_id)


__all__ = [
    "sc_get", "scrape_page", "scrape_markdown",
    "archive_fallback", "wayback_snapshot_url",
    "youtube_transcript", "youtube_video",
    "tiktok_video", "tiktok_transcript", "tiktok_profile",
    "instagram_post", "instagram_profile",
    "twitter_tweet", "reddit_post", "reddit_search",
    "google_search", "linkedin_profile",
]


How it compares

web-crawler is an agent skill for scraping web and social surfaces, including YouTube, TikTok, Instagram, and LinkedIn, with structured extraction. It is not a generic alternative to other scrapers.

FAQ

Who is web-crawler for?

Developers in the Automation & Workflows category who follow agent-guided SKILL.md steps.

When should I use web-crawler?

Use it when you need to scrape web and social surfaces, including YouTube, TikTok, Instagram, and LinkedIn, with structured extraction.

Is web-crawler safe to install?

Review the Security Audits panel on this page before installing in production.


Preferred entry: call exports.py (don't hand-roll requests)

Quick trigger rules (read this first)

Error signatures -> action

Paywall / Firecrawl-blocked fallback chain (use archive_fallback)

What each service is for

ScrapeCreators — Social media data extraction (27+ platforms)

Firecrawl — Fallback web page scraper

ScrapeCreators — Intent routing

Profiles / User Info

Posts / Content Feeds

Single Post / Video Details

Comments

Transcripts

Search

Ad Libraries

Trending / Popular

Followers / Following / Live (TikTok only)

TikTok Shop

Link-in-Bio / Other

ScrapeCreators pagination

ScrapeCreators known limitations

Access patterns

ScrapeCreators (social media)

Firecrawl (web page fallback via transparent proxy)

Decision rules

Social media request (profile, posts, comments, search, ads, trending, transcripts)

YouTube URL or YouTube content request

Blocked or JS-heavy web page

Cost discipline


Paywall / Firecrawl-blocked fallback chain (use `archive_fallback`)