
Robots Txt
Configure, audit, or fix robots.txt for search and AI crawlers without accidentally blocking important URLs.
Overview
Robots-txt is an agent skill for the Launch phase that configures and audits robots.txt crawl rules for search engines and AI bots.
Install
npx skills add https://github.com/kostja94/marketing-skills --skill robots-txt
What is this skill?
- Covers Disallow/Allow rules, Sitemap references, and Clean-param-style directives
- Audits for accidental blocks on important paths and for misconfigured wildcards
- Explains crawl control vs index control and points to the indexing skill for noindex workflows
- Includes AI crawler strategy (e.g., GPTBot) alongside classic search bots
- Uses project context files when present for site URL and indexing scope
- Skill metadata version: 1.2.0
Adoption & trust: 866 installs on skills.sh; 586 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are unsure whether robots.txt is blocking revenue pages, allowing unwanted AI scrapers, or missing a Sitemap reference.
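A minimal sketch of that check, using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` domain, the list of important paths, and the `audit` helper are all hypothetical and chosen only for illustration. The standard-library parser may not match every crawler's behavior exactly (for example, around wildcards and rule precedence), so treat the result as a first pass rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def audit(robots_txt, must_allow, site="https://example.com"):
    """Report blocked important paths, GPTBot access, and declared sitemaps."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Paths that a search crawler should reach but the rules disallow.
    blocked = [
        path for path in must_allow
        if not parser.can_fetch("Googlebot", site + path)
    ]
    return {
        "blocked_important_paths": blocked,
        "gptbot_allowed_on_home": parser.can_fetch("GPTBot", site + "/"),
        # site_maps() returns None when no Sitemap line is present (Python 3.8+).
        "sitemaps": parser.site_maps() or [],
    }


report = audit(ROBOTS_TXT, ["/", "/pricing/", "/blog/"])
print(report)
```

An empty `blocked_important_paths` list and a non-empty `sitemaps` list are the two signals to look for; whether GPTBot should be allowed depends on your own crawler strategy.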
Who is it for?
Solo founders shipping a public site who need trustworthy crawler rules and an audit pass before scaling SEO or GEO efforts.
Skip if: Purely private apps with no public HTTP surface, or cases where the real issue is noindex/canonical tags rather than crawl access.
When should I use this skill?
Use it when the user mentions robots.txt, crawler rules, block crawlers, AI crawlers, GPTBot, allow/disallow, disallow path, crawl directives, user-agent, block Googlebot, fix robots.txt, robots.txt blocking, or search engine crawling.
What do I get? / Deliverables
You publish or fix a robots.txt aligned with your indexing scope and crawler strategy, with clear next steps for page-level indexing when needed.
- Audited or rewritten robots.txt with User-agent, Allow/Disallow, and Sitemap lines
- Documented AI crawler allow/block strategy aligned to product goals
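To make the shape of such a deliverable concrete, here is a small sketch that renders a robots.txt from a crawl policy. The `build_robots_txt` helper and its parameters are hypothetical, as are the domain and sitemap URL; blocking GPTBot is shown as one possible policy, not a recommendation.

```python
def build_robots_txt(disallow_paths, blocked_agents, sitemap_url):
    """Render robots.txt text: a default group, one group per blocked agent, a Sitemap line."""
    lines = ["User-agent: *"]
    # An empty Disallow value means nothing is disallowed for this group.
    lines += [f"Disallow: {path}" for path in disallow_paths] or ["Disallow:"]
    for agent in blocked_agents:
        # Each blocked crawler gets its own group that disallows the whole site.
        lines += ["", f"User-agent: {agent}", "Disallow: /"]
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    disallow_paths=["/admin/", "/api/", "/staging/"],
    blocked_agents=["GPTBot"],
    sitemap_url="https://example.com/sitemap.xml",
)
print(robots_txt)
```

Keeping the policy as data (paths, agents, sitemap URL) makes the AI crawler decision easy to document and review alongside the generated file.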
Journey fit
Crawl directives belong in Launch when the site is going public and crawler behavior must match indexing goals. robots.txt is technical SEO—path-level crawl control sits squarely under the seo subphase, separate from page-level indexing rules.
How it compares
Handles path-level crawl policy in robots.txt—not a substitute for the indexing skill’s page-level index exclusions.
Common Questions / FAQ
Who is robots-txt for?
Solo builders and small teams responsible for technical SEO, launch readiness, and controlling how search and AI crawlers access public paths.
When should I use robots-txt?
At Launch when adding or reviewing robots.txt, after site migrations, when tuning GPTBot or other AI crawlers, or when fixing ‘blocked by robots.txt’ coverage issues.
Is robots-txt safe to install?
The skill offers editorial guidance only for your repo and hosting. Review the Security Audits panel on this page, and validate changes in Search Console before relying on production crawl behavior.
SKILL.md
# SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

**When invoking**: On **first use**, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On **subsequent use** or when the user asks to skip, go directly to the main output.

## Scope (Technical SEO)

- **Robots.txt**: Configure Disallow/Allow, Sitemap, Clean-param; audit for accidental blocks
- **Crawler access**: Path-level crawl control; AI crawler allow/block strategy
- **Differentiation**: robots.txt = crawl control (who accesses what paths); noindex = index control (what gets indexed). See **indexing** for page-level exclusions.

## Initial Assessment

**Check for project context first:** If `.claude/project-context.md` or `.cursor/project-context.md` exists, read it for site URL and indexing goals.

Identify:

1. **Site URL**: Base domain (e.g., `https://example.com`)
2. **Indexing scope**: Full site, partial, or specific paths to exclude
3. **AI crawler strategy**: Allow search/indexing vs. block training data crawlers

## Best Practices

### Purpose and Limitations

| Point | Note |
|-------|------|
| **Purpose** | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet) |
| **Advisory** | Rules are advisory; malicious crawlers may ignore |
| **Public** | robots.txt is publicly readable; use noindex or auth for sensitive content. See **indexing** |

### Crawl vs Index vs Link Equity (Quick Reference)

| Tool | Controls | Prevents indexing? |
|------|----------|-------------------|
| **robots.txt** | Crawl (path-level) | No; blocked URLs may still appear in SERP |
| **noindex** (meta / X-Robots-Tag) | Index (page-level) | Yes. See **indexing** |
| **nofollow** | Link equity only | No; does not control indexing |

### When to Use robots.txt vs noindex

| Use | Tool | Example |
|-----|------|---------|
| **Path-level** (whole directory) | robots.txt | `Disallow: /admin/`, `Disallow: /api/`, `Disallow: /staging/` |
| **Page-level** (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See **indexing** for full list |
| **Critical** | Do NOT block in robots.txt | Pages that use noindex: crawlers must access the page to read the directive |

**Paths to block in robots.txt**: /admin/, /api/, /staging/, temp files.

**Paths to use noindex** (allow crawl): /login/, /signup/, /thank-you/, etc.; see **indexing**.

### Location and Format

| Item | Requirement |
|------|-------------|
| **Path** | Site root: `https://example.com/robots.txt` |
| **Encoding** | UTF-8 plain text |
| **Standard** | RFC 9309 (Robots Exclusion Protocol) |

### Core Directives

| Directive | Purpose | Example |
|-----------|---------|---------|
| `User-agent:` | Target crawler | `User-agent: Googlebot`, `User-agent: *` |
| `Disallow:` | Block path prefix | `Disallow: /admin/` |
| `Allow:` | Allow path (can override Disallow) | `Allow: /public/` |
| `Sitemap:` | Declare sitemap absolute URL | `Sitemap: https://example.com/sitemap.xml` |
| `Clean-param:` | Strip query params (Yandex) | See below |

### Critical: Do Not Block

| Do not block | Reason |
|--------------|--------|
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| `/_next/` (Next.js) | Breaks CSS/JS loading; static assets in GSC "Crawled - not indexed" is expected. See **indexing** |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots