Fetching Salesforce Docs

Name: Fetching Salesforce Docs
Author: forcedotcom

forcedotcom/sf-skills

2.5k installs
763 repo stars
Updated July 24, 2026

fetching-salesforce-docs is a Salesforce skill that retrieves and grounds answers in official Salesforce web documentation using a targeted extraction playbook.

About

The fetching-salesforce-docs skill retrieves official Salesforce documentation from developer.salesforce.com, help.salesforce.com, architect.salesforce.com, admin.salesforce.com, and lightningdesignsystem.com using a targeted retrieval playbook. Before fetching, it classifies each request into one of five doc families: developer, help, architect/admin, design system, or legacy atlas. It rejects broad landing pages, shell-rendered help pages, and third-party blogs, and it never falls back to PDFs. For JS-heavy pages it prefers browser-rendered extraction, verifies that the exact concept or identifier appears on the page, and follows only the best one to three official child links when needed. Help articles count as evidence only when the real article body is extracted, not Loading or CSS Error chrome, and soft-404 shells are rejected. Grounded answers must cite the title, exact URL, source type, and any extraction caveats. Optional Playwright scripts in scripts/ support extraction, but the playbook works without them. Code changes, deployments, and metadata generation are out of scope.

Official-domain-only retrieval across developer, help, architect, admin, and SLDS.
Classify-then-fetch workflow with concept-level acceptance rules.
Browser-rendered extraction guidance for JS-heavy Salesforce pages.
Rejection rules for landing pages, shell content, and third-party summaries.
Optional Playwright extractors with requirements.txt dependencies.

Fetching Salesforce Docs by the numbers

2,500 all-time installs (skills.sh)
+7 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #154 of 1,901 Documentation skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

fetching-salesforce-docs capabilities & compatibility

Capabilities: doc family classification before fetch · targeted official url retrieval without broad cr · browser rendered extraction for js heavy pages · child link follow up limited to one to three off · acceptance and rejection rules for evidence qual · optional playwright extraction scripts for help
Works with: salesforce
Use cases: documentation · research · api development
Pricing: Free

From the docs

What fetching-salesforce-docs says it does

Avoid third-party blogs, videos, or summary articles unless the user explicitly asks for them.

SKILL.md

npx skills add https://github.com/forcedotcom/sf-skills --skill fetching-salesforce-docs

Installs	2.5k
repo stars	★ 763
Security audit	2 / 3 scanners passed
Last updated	July 24, 2026
Repository	forcedotcom/sf-skills ↗

How do I fetch reliable official Salesforce documentation when pages are JS-heavy or help articles fail naive scraping?

Retrieve authoritative Salesforce docs from developer, help, architect, admin, and SLDS domains with reliable extraction.

Who is it for?

Salesforce developers needing official Apex, LWC, setup, Agentforce, or SLDS references instead of blog summaries.

Skip if: the task involves code changes, deployments, metadata generation, or anything else that does not require documentation retrieval.

When should I use this skill?

User asks for official Salesforce docs, API references, help articles, SLDS guidance, or Agentforce documentation.

What you get

Grounded citation with article title, exact official URL, source type, and caveat when extraction was partial.

accurate Salesforce doc excerpts
child-page deep links

Files

SKILL.md (Markdown)

fetching-salesforce-docs

Use this skill to retrieve and ground answers in official Salesforce documentation on the public web.

This skill provides a reliable online retrieval playbook for Salesforce docs that are hard to fetch, especially help.salesforce.com, JS-heavy developer.salesforce.com, Lightning Design System docs on lightningdesignsystem.com, and other official Salesforce-owned doc pages such as architect.salesforce.com and admin.salesforce.com.

Optional extraction scripts are available in scripts/ — see the Reference File Index below.

Scope


In scope	Official Salesforce doc retrieval: Apex, API, LWC, metadata, Agentforce, setup articles, SLDS, architect/admin guidance
Out of scope	Third-party blogs, PDF fallback, local corpus indexing, benchmark workflows, generating code or metadata

Required Inputs

Before fetching, identify:

The exact concept, identifier, class, method, or feature name being requested
The likely doc family (developer docs, help articles, design system, architect/admin)

No additional setup is required to use the retrieval playbook in this skill. The optional extraction scripts require playwright — see requirements.txt.

Official Sources Only

Prefer Salesforce-owned documentation sources:

developer.salesforce.com
help.salesforce.com
architect.salesforce.com
admin.salesforce.com
lightningdesignsystem.com
other official Salesforce documentation pages when Salesforce uses them as the source of truth

Avoid third-party blogs, videos, or summary articles unless the user explicitly asks for them.

Do not fall back to PDFs.

Retrieval Workflow

1. Classify the request first

Before fetching anything, identify the likely doc family.

Family	Typical Source	Use For
Developer docs	`developer.salesforce.com/docs/...`	Apex, APIs, LWC, metadata, Agentforce developer docs
Help docs	`help.salesforce.com/...`	setup, admin, product configuration
Architect/Admin docs	`architect.salesforce.com/...`, `admin.salesforce.com/...`	best practices, patterns, well-architected guidance, admin enablement
Design system docs	`lightningdesignsystem.com/...`	SLDS, Cosmos, design tokens, component and styling guidance
Legacy atlas docs	`developer.salesforce.com/docs/atlas.en-us.*`	older official guide and reference docs
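
The table can also be read in reverse, to check which family a candidate URL belongs to before fetching it. The following is a minimal sketch under that assumption; the function name `classify_doc_family` and the family labels are hypothetical and are not defined by the skill.

```python
from urllib.parse import urlparse

# Host-to-family mapping taken from the table above. The family labels are
# illustrative names, not identifiers defined by the skill.
HOST_FAMILIES = {
    "developer.salesforce.com": "developer",
    "help.salesforce.com": "help",
    "architect.salesforce.com": "architect_admin",
    "admin.salesforce.com": "architect_admin",
    "lightningdesignsystem.com": "design_system",
}


def classify_doc_family(url: str) -> str:
    """Return the doc family for an official URL, or "unknown" otherwise."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    family = HOST_FAMILIES.get(host, "unknown")
    # Legacy atlas guides live on the developer host under /docs/atlas.en-us.*
    if family == "developer" and parsed.path.startswith("/docs/atlas.en-us."):
        return "legacy_atlas"
    return family
```

A result of "unknown" is a hint that the URL is probably not an official source and should not be used as final evidence.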

2. Identify the exact concept

Extract the real target before you search:

exact API/class/method name
exact feature name
exact product phrase
exact setup concept

Examples:

Lightning Message Service
Wire Service
System.StubProvider
Agentforce Actions
Messaging for In-App and Web allowed domains

3. Prefer targeted official retrieval

Do not broad-crawl Salesforce docs.

Instead:

1. identify the most likely official guide root or article
2. if search is needed, restrict it to official Salesforce domains only
3. fetch that official page
4. check whether the exact concept actually appears on the page
5. if not, inspect and follow the most relevant 1–3 official child links
6. stop once you have grounded evidence
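
The steps above can be sketched as a small loop. This is an illustration only, not the skill's implementation: `find_grounded_page`, the injected `fetch` and `rank_links` callables, and the one-level link depth are all assumptions made so the example stays self-contained.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher returns (page_text, child_links) for a URL. It is injected so the
# sketch stays independent of any particular HTTP client or browser.
Fetcher = Callable[[str], Tuple[str, List[str]]]


def find_grounded_page(
    start_url: str,
    concept: str,
    fetch: Fetcher,
    rank_links: Callable[[List[str], str], List[str]],
    max_children: int = 3,
) -> Optional[str]:
    """Return the first URL whose text contains the concept, else None."""
    needle = concept.lower()
    text, links = fetch(start_url)
    if needle in text.lower():
        return start_url  # grounded evidence found, stop here
    # Follow only the best few child links instead of crawling broadly.
    for child in rank_links(links, concept)[:max_children]:
        child_text, _ = fetch(child)
        if needle in child_text.lower():
            return child
    return None
```

Returning None rather than the closest page keeps the caller honest: a missing match should be reported as weak evidence, not papered over with a landing page.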

4. Do not stop at broad landing pages

A guide landing page is not enough unless it clearly contains the exact requested concept.

This is especially important for:

LWC docs
Agentforce docs
broad platform guide homepages
help landing pages that link to the real article

5. For `developer.salesforce.com`

Use this playbook:

start with the most likely official guide root
if the page is JS-heavy, prefer browser-rendered extraction
check whether the exact concept appears on the page
if the concept is missing, inspect official child links and follow the best matching 1–3 links
prefer exact concept pages over broad guide roots
legacy atlas pages are valid if they are the real official reference for the concept

6. For `help.salesforce.com`

Help pages often fail with naive fetching.

Use this playbook:

prefer exact articleView?id=... URLs when available
use browser-rendered extraction when plain fetch returns shell content
treat outputs like Loading, Sorry to interrupt, CSS Error, or mostly chrome/navigation text as failed extraction, not evidence
look for the real article body, not just header, nav, or footer text
reject shell pages and soft-404 pages such as:
"We looked high and low but couldn't find that page"
generic empty help shells
if starting from a nearby guide or hub page, follow linked Help articles until you reach the real article body
if extraction still fails after targeted retries, return the best official Help URLs you found and explicitly say that article-body extraction was unsuccessful
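
As a small illustration of the first rule in this playbook, the sketch below checks whether a URL has the exact `articleView?id=...` shape. The helper name `is_help_article_url` is hypothetical, and the check assumes the article id is always carried in the `id` query parameter, as in the example URLs in this document.

```python
from urllib.parse import parse_qs, urlparse


def is_help_article_url(url: str) -> bool:
    """True when the URL is a help.salesforce.com articleView link with an id."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host != "help.salesforce.com":
        return False
    # The article viewer path ends in "articleView" (for example /s/articleView).
    if not parsed.path.rstrip("/").endswith("articleView"):
        return False
    # Require a non-empty id parameter that names the article.
    return bool(parse_qs(parsed.query).get("id"))
```

A hub or landing URL fails this check, which is a signal to keep following linked Help articles rather than answering from the hub.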

Acceptance Rules

A page is good enough to answer from only when at least one of these is true:

the exact identifier appears on the page
the exact concept phrase appears on the page
multiple query-specific phrases appear in the correct official context

A page is not good enough when:

it is only a broad landing page
it is a shell page with little real article text
it is from the wrong product area
it does not contain the requested identifier or concept
it is a third-party explanation when an official page should exist
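
The three acceptance conditions can be approximated with plain text matching. The sketch below is hedged accordingly: `is_acceptable` is a hypothetical helper, matching is case-insensitive by assumption, "multiple" is read as at least two phrases, and the "correct official context" condition is not checked at all.

```python
from typing import Iterable


def is_acceptable(
    page_text: str,
    identifier: str = "",
    concept: str = "",
    query_phrases: Iterable[str] = (),
    min_phrase_hits: int = 2,
) -> bool:
    """Apply the acceptance rules to extracted page text."""
    text = page_text.lower()
    # Rule 1: the exact identifier appears on the page.
    if identifier and identifier.lower() in text:
        return True
    # Rule 2: the exact concept phrase appears on the page.
    if concept and concept.lower() in text:
        return True
    # Rule 3: multiple query-specific phrases appear (threshold is an assumption).
    hits = sum(1 for phrase in query_phrases if phrase and phrase.lower() in text)
    return hits >= min_phrase_hits
```

Passing the product area and source host through a separate check would be needed to cover the wrong-product-area and third-party cases listed above.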

Rejection Rules

Reject these as final evidence:

broad guide homepages without the exact concept
release notes when a concept/reference page is expected
admin blog posts when developer docs are requested
third-party blogs when official docs are available
shell-rendered pages with no real article body
pages whose titles sound right but whose body does not contain the requested concept

Grounding Requirements

When answering, include:

1. guide/article title
2. exact official URL
3. source type (developer doc page, atlas reference page, or help article page)
4. any caveat if extraction was partial or browser-rendered

If evidence is weak, say so plainly.
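
One way to keep these four fields together is a small record type. This is a sketch only: the `Citation` class, its field names, and the rendered layout are assumptions, and only the three source-type strings come from the list above.

```python
from dataclasses import dataclass

# The three source types named in the grounding requirements.
SOURCE_TYPES = {"developer doc page", "atlas reference page", "help article page"}


@dataclass(frozen=True)
class Citation:
    title: str
    url: str
    source_type: str
    caveat: str = ""  # empty when extraction was complete

    def __post_init__(self) -> None:
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source type: {self.source_type}")

    def render(self) -> str:
        """Format the citation as a single line for inclusion in an answer."""
        line = f"{self.title} ({self.source_type}): {self.url}"
        return f"{line} [caveat: {self.caveat}]" if self.caveat else line
```

Rejecting unknown source types at construction time makes it harder to cite a blog post or release note as if it were a reference page.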

Examples

Example: Lightning Message Service

Do not stop at the general LWC guide root. Find the exact LWC page for Lightning Message Service or follow the most relevant child links from the LWC docs until the exact concept appears.

Example: Wire Service

Do not answer from the LWC homepage unless Wire Service is actually present there. Follow the relevant child doc page for wire service or wire adapters.

Example: Agentforce Actions

Do not answer from a broad Agentforce landing page or a blog post. Find the official Agentforce developer page for actions, or follow the best matching child pages from the official Agentforce docs.

Example: Messaging for In-App and Web allowed domains

Prefer official Help articles and browser-rendered extraction. Reject generic help shells. Follow linked Help articles from nearby official messaging docs if needed.

Example: System.StubProvider

Prefer the official Salesforce reference/developer page where the exact identifier appears. Do not substitute a broader Apex landing page if the identifier is absent.

Non-Goals

This skill should not:

maintain a local documentation corpus
rely on a local index
use PDF fallback
run benchmark workflows
depend on repo-specific scripts to be useful

Output Expectations

For each retrieval, include:

1. Guide or article title
2. Exact official URL
3. Source type (developer doc page / atlas reference page / help article page)
4. Any caveat if extraction was partial or browser-rendered

If evidence is weak, say so plainly rather than forcing an answer.

---

Reference File Index

File	When to read
`scripts/extract_salesforce_doc.py`	Use to fetch any official Salesforce doc URL; automatically routes `help.salesforce.com` into the dedicated Help extractor and supports browser-rendered extraction for all Salesforce-owned doc hosts
`scripts/extract_help_salesforce.py`	Use directly when targeting `help.salesforce.com` `articleView` URLs; use when the wrapper is not appropriate
`scripts/runtime_bootstrap.py`	Imported by the extraction scripts to resolve the isolated fetching-salesforce-docs Python runtime and Playwright browser path; not called directly
`requirements.txt`	Lists Python dependencies (`playwright`, `playwright-stealth`) needed to run the extraction scripts

fetching-salesforce-docs

What it is

fetching-salesforce-docs is a prompt-only skill.

It gives a practical retrieval playbook for official Salesforce docs on the public web, especially when:

developer.salesforce.com pages are JS-heavy
help.salesforce.com pages return shell content
architect.salesforce.com / admin.salesforce.com pages need browser-rendered extraction
lightningdesignsystem.com pages contain official SLDS guidance
the real answer is on a child page, not the guide homepage

What it is not

This skill does not include:

local corpus workflows
indexing
benchmark workflows
any required helper CLI dependency
PDF fallback guidance

Use it for

official Salesforce docs lookup
hard-to-fetch Help articles
Apex / API / LWC / Agentforce documentation grounding
deciding when to follow child links from broad official guide pages
rejecting weak results such as shells, landing pages, and third-party summaries

Optional utility

A tiny wrapper is available for official Salesforce doc URLs:

python3 skills/fetching-salesforce-docs/scripts/extract_salesforce_doc.py \
  --url "https://help.salesforce.com/s/articleView?id=service.miaw_security.htm&type=5" \
  --pretty

Behavior:

automatically routes help.salesforce.com URLs into the dedicated Help extractor
supports official Salesforce-owned doc hosts such as developer.salesforce.com, architect.salesforce.com, admin.salesforce.com, lightningdesignsystem.com, and other official Salesforce documentation pages
supports optional best-effort stealth mode via --stealth

Dependencies for the helper scripts live in:

skills/fetching-salesforce-docs/requirements.txt

The installer sets up an isolated runtime under ~/.claude/.fetching-salesforce-docs-runtime, installs those Python packages there, and installs the Playwright Chromium browser automatically during install/update.

The underlying Help extractor is also available directly at:

python3 skills/fetching-salesforce-docs/scripts/extract_help_salesforce.py \
  --url "https://help.salesforce.com/s/articleView?id=service.miaw_security.htm&type=5" \
  --pretty
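
The Help extractor prints a JSON object to stdout. The sketch below shows one way a caller might reduce that output to the fields needed for a grounded answer. The field names `ok`, `likelyShell`, `title`, `url`, `text`, and `childLinks` are the ones the Help extractor script returns; the helper `summarize_extraction` itself is hypothetical, and whether the wrapper emits the same fields for non-Help hosts is an assumption.

```python
import json
from typing import Any, Dict


def summarize_extraction(raw_json: str, excerpt_chars: int = 200) -> Dict[str, Any]:
    """Reduce the extractor's JSON output to a short, answer-ready summary."""
    result = json.loads(raw_json)
    text = result.get("text", "")
    return {
        # Treat the page as evidence only if extraction succeeded and the
        # content does not look like a shell page.
        "usable": bool(result.get("ok")) and not result.get("likelyShell", False),
        "title": result.get("title", "Untitled"),
        "url": result.get("url"),
        "excerpt": text[:excerpt_chars],
        "child_link_count": len(result.get("childLinks", [])),
    }
```

When `usable` is false, the playbook's guidance still applies: return the best official URLs found and say that article-body extraction was unsuccessful.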

Key idea

Keep retrieval:

official-source-first
HTML-only
targeted
child-link aware
strict about exact concept matching

#!/usr/bin/env python3
"""
Extract article content from help.salesforce.com using a real browser, deep shadow DOM
traversal, and Salesforce Help-specific heuristics.

Why this exists:
- help.salesforce.com is heavily client-rendered
- the real article body often lives inside custom elements and shadow roots
- naive HTML fetching often returns shell text like "Loading", "Sorry to interrupt",
  or CSS/runtime error wrappers instead of the actual documentation

This script:
- renders the page with Playwright
- waits for the Help article app to hydrate
- traverses nested shadow roots
- prioritizes Salesforce Help article-body containers such as `.slds-text-longform`
- returns structured JSON with the extracted article text and official child links

Example:
  python3 skills/fetching-salesforce-docs/scripts/extract_help_salesforce.py \
    --url "https://help.salesforce.com/s/articleView?id=service.miaw_security.htm&type=5"
"""

from __future__ import annotations

import argparse
import json
import re
import sys
from typing import Any, Dict, List, Tuple
from urllib.parse import urlparse

from runtime_bootstrap import maybe_reexec_in_sf_docs_runtime

maybe_reexec_in_sf_docs_runtime(__file__)

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright

try:
    from playwright_stealth import Stealth
except ImportError:
    Stealth = None

try:
    from playwright_stealth import stealth_sync
except ImportError:
    stealth_sync = None


USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/122.0.0.0 Safari/537.36"
)

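# Matched by looks_like_shell() as case-insensitive substrings of the title and
# text, so a short token such as "loading" also matches words like "uploading".
STRONG_SHELL_TOKENS = [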
    "loading",
    "sorry to interrupt",
    "css error",
    "enable javascript",
    "we looked high and low",
    "couldn't find that page",
    "404 error",
]

WEAK_SHELL_TOKENS = [
    "sign in",
    "cookie preferences",
]

NOISE_LINES = {
    "table of contents",
    "close",
    "search",
}


def apply_stealth(page) -> bool:
    if stealth_sync is not None:
        try:
            stealth_sync(page)
            return True
        except Exception:
            pass
    if Stealth is not None:
        try:
            Stealth().apply_stealth_sync(page)
            return True
        except Exception:
            return False
    return False


def _looks_like_section_banner(line: str) -> bool:
    stripped = line.strip()
    if not stripped or len(stripped) > 120:
        return False
    if not any(ch.isalpha() for ch in stripped):
        return False
    return stripped.upper() == stripped


def normalize_text(text: str) -> str:
    text = text.replace("\u00a0", " ").replace("\r", "")
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def cleanup_help_text(text: str, title: str = "") -> str:
    text = normalize_text(text)
    if not text:
        return text

    lines = [line.strip() for line in text.splitlines()]
    cleaned: List[str] = []
    normalized_title = title.strip().lower()

    for line in lines:
        if not line:
            if cleaned and cleaned[-1] != "":
                cleaned.append("")
            continue

        lowered = line.lower().strip()
        if lowered in NOISE_LINES:
            continue
        if lowered.startswith("you are here:"):
            continue
        if lowered in {"salesforce help", "docs"}:
            continue

        if normalized_title and "|" in line:
            title_pos = lowered.find(normalized_title)
            if title_pos >= 0:
                line = line[title_pos:].strip()
                lowered = line.lower()

        cleaned.append(line)

    while cleaned and cleaned[0] == "":
        cleaned.pop(0)

    if normalized_title and len(cleaned) >= 2:
        first = cleaned[0].strip()
        second = cleaned[1].strip()
        if _looks_like_section_banner(first) and second.lower() == normalized_title:
            cleaned.pop(0)

    if normalized_title and cleaned:
        first = cleaned[0].strip()
        if "|" in first:
            title_pos = first.lower().find(normalized_title)
            if title_pos >= 0:
                cleaned[0] = first[title_pos:].strip()

    text = "\n".join(cleaned)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text


def looks_like_shell(title: str, text: str) -> bool:
    haystack = f"{title}\n{text}".lower()
    if any(token in haystack for token in STRONG_SHELL_TOKENS):
        return True
    if any(token in haystack for token in WEAK_SHELL_TOKENS):
        return len(text.strip()) < 600
    return False


def _split_blocks(text: str) -> List[str]:
    blocks = [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]
    return blocks


def _is_heading_line(line: str) -> bool:
    stripped = line.strip()
    if not stripped or len(stripped) > 100:
        return False
    if stripped.endswith(":"):
        return False
    if stripped.lower().startswith(("available in:", "this article applies to:", "this article doesn", "view supported editions")):
        return False
    if _looks_like_section_banner(stripped):
        return True
    if stripped == stripped.title() and any(ch.isalpha() for ch in stripped):
        return True
    return False


def _classify_metadata_block(block: str) -> Tuple[str, str] | None:
    stripped = block.strip()
    lowered = stripped.lower()
    if lowered.startswith("required editions"):
        return "required_editions", stripped
    if lowered.startswith("user permissions"):
        return "user_permissions", stripped
    if lowered.startswith("important"):
        return "important", stripped
    if lowered.startswith("this article applies to:"):
        return "applies_to", stripped
    if lowered.startswith("this article doesn"):
        return "does_not_apply_to", stripped
    if lowered.startswith("available in:"):
        return "availability", stripped
    if lowered.startswith("needed"):
        return "needed", stripped
    return None


def structure_help_text(text: str, title: str = "") -> Dict[str, Any]:
    blocks = _split_blocks(text)
    normalized_title = title.strip().lower()
    if blocks and normalized_title and blocks[0].strip().lower() == normalized_title:
        blocks = blocks[1:]

    metadata: Dict[str, List[str]] = {}
    content_blocks: List[str] = []
    sections: List[Dict[str, str]] = []

    i = 0
    while i < len(blocks):
        block = blocks[i]
        meta = _classify_metadata_block(block)
        if meta:
            key, value = meta
            metadata.setdefault(key, []).append(value)
            i += 1
            continue

        heading_candidate = block.strip()
        if _is_heading_line(heading_candidate) and i + 1 < len(blocks):
            next_block = blocks[i + 1]
            next_meta = _classify_metadata_block(next_block)
            if not next_meta and next_block.strip().lower() != normalized_title:
                section = {
                    "heading": heading_candidate,
                    "text": next_block.strip(),
                }
                sections.append(section)
                content_blocks.append(f"{heading_candidate}\n{next_block.strip()}".strip())
                i += 2
                continue

        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) >= 2 and _is_heading_line(lines[0]):
            sections.append({
                "heading": lines[0],
                "text": "\n".join(lines[1:]).strip(),
            })
        content_blocks.append(block)
        i += 1

    intro = content_blocks[0] if content_blocks else ""
    body = "\n\n".join(content_blocks[1:]) if len(content_blocks) > 1 else ""

    compact_metadata = {key: "\n\n".join(values) for key, values in metadata.items()}
    return {
        "intro": intro,
        "body": body,
        "metadata": compact_metadata,
        "sections": sections,
        "contentBlocks": content_blocks,
    }


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Extract article text from help.salesforce.com")
    parser.add_argument("--url", required=True, help="help.salesforce.com article URL")
    parser.add_argument("--timeout", type=int, default=60, help="Timeout in seconds (default: 60)")
    parser.add_argument("--stealth", action="store_true", help="Best-effort stealth mode for bot-sensitive pages")
    parser.add_argument("--pretty", action="store_true", help="Pretty-print JSON")
    return parser.parse_args()


def validate_url(url: str) -> None:
    host = (urlparse(url).hostname or "").lower()
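    # Match the host exactly (or a true subdomain) so look-alike hosts such as
    # "nothelp.salesforce.com" are rejected.
    if host != "help.salesforce.com" and not host.endswith(".help.salesforce.com"):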
        raise SystemExit(f"URL must be on help.salesforce.com: {url}")


def extract(url: str, timeout_seconds: int, use_stealth: bool = False) -> Dict[str, Any]:
    timeout_ms = timeout_seconds * 1000

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=USER_AGENT, viewport={"width": 1440, "height": 1400})
        stealth_used = apply_stealth(page) if use_stealth else False

        try:
            response = page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            http_status = response.status if response else None

            # Let the client app boot, then wait for the article shell to hydrate.
            page.wait_for_timeout(1500)
            try:
                page.wait_for_function(
                    r"""
                    () => {
                      const hosts = Array.from(document.querySelectorAll('c-hc-article-viewer, c-hc-documentation-article'));
                      return hosts.some(el => ((el.innerText || '').trim().length > 400));
                    }
                    """,
                    timeout=min(timeout_ms, 30000),
                )
            except PlaywrightTimeoutError:
                # Continue anyway — some pages still expose enough content after network idle.
                pass

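            try:
                page.wait_for_load_state("networkidle", timeout=timeout_ms)
            except PlaywrightTimeoutError:
                # Pages that keep polling may never go idle; extract whatever has rendered.
                pass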
            page.wait_for_timeout(1000)

            payload = page.evaluate(
                r"""
                () => {
                  function normalize(text) {
                    return String(text || '')
                      .replace(/\u00a0/g, ' ')
                      .replace(/\r/g, '')
                      .replace(/\n{3,}/g, '\n\n')
                      .trim();
                  }

                  function isVisible(el) {
                    if (!el || !el.getBoundingClientRect) return false;
                    const rect = el.getBoundingClientRect();
                    const style = window.getComputedStyle(el);
                    return rect.width > 0 && rect.height > 0 && style.visibility !== 'hidden' && style.display !== 'none';
                  }

                  function allRoots() {
                    const roots = [document];
                    const queue = [document];
                    while (queue.length) {
                      const current = queue.shift();
                      if (!current || !current.querySelectorAll) continue;
                      const elements = current.querySelectorAll('*');
                      for (const el of elements) {
                        if (el.shadowRoot) {
                          roots.push(el.shadowRoot);
                          queue.push(el.shadowRoot);
                        }
                      }
                    }
                    return roots;
                  }

                  function deepQueryAll(selector) {
                    const results = [];
                    const seen = new Set();
                    for (const root of allRoots()) {
                      if (!root.querySelectorAll) continue;
                      for (const el of root.querySelectorAll(selector)) {
                        if (!seen.has(el)) {
                          seen.add(el);
                          results.push(el);
                        }
                      }
                    }
                    return results;
                  }

                  function collectLinks(scope) {
                    const urls = new Set();
                    const nodes = scope && scope.querySelectorAll ? scope.querySelectorAll('a[href]') : [];
                    for (const a of nodes) {
                      const href = a.href || a.getAttribute('href') || '';
                      if (!href) continue;
                      if (href.startsWith('javascript:') || href.startsWith('mailto:')) continue;
                      urls.add(href);
                    }
                    return Array.from(urls);
                  }

                  const title = document.title || normalize(document.querySelector('title')?.innerText || 'Untitled');
                  const helpArticleId = new URL(window.location.href).searchParams.get('id');
                  const childLinks = new Set();
                  for (const root of allRoots()) {
                    for (const link of collectLinks(root)) childLinks.add(link);
                  }

                  const selectorConfigs = [
                    { selector: '#content.slds-text-longform', strategy: 'help-longform-id', base: 300 },
                    { selector: '.slds-text-longform#content', strategy: 'help-longform-id', base: 300 },
                    { selector: '.slds-text-longform', strategy: 'help-longform', base: 260 },
                    { selector: 'c-hc-documentation-article', strategy: 'help-article-host', base: 160 },
                    { selector: 'article', strategy: 'article', base: 120 },
                    { selector: 'main', strategy: 'main', base: 100 },
                    { selector: 'doc-content-layout', strategy: 'legacy-doc-layout', base: 90 },
                    { selector: 'doc-xml-content', strategy: 'legacy-doc-xml', base: 90 },
                    { selector: 'doc-amf-reference .markdown-content', strategy: 'legacy-amf-markdown', base: 90 },
                  ];

                  const candidates = [];
                  for (const cfg of selectorConfigs) {
                    const nodes = deepQueryAll(cfg.selector);
                    for (const node of nodes) {
                      if (!isVisible(node)) continue;
                      const text = normalize(node.innerText || node.textContent || '');
                      if (text.length < 200) continue;
                      let score = cfg.base + Math.min(text.length, 5000) / 25;
                      const lowered = text.toLowerCase();
                      if (lowered.includes('table of contents')) score -= 80;
                      if (lowered.includes('sorry to interrupt')) score -= 500;
                      if (lowered.includes('css error')) score -= 500;
                      if (lowered.includes(title.toLowerCase())) score += 40;
                      candidates.push({
                        strategy: cfg.strategy,
                        selector: cfg.selector,
                        score,
                        text,
                        html: (node.innerHTML || '').slice(0, 4000),
                        links: collectLinks(node).slice(0, 200),
                      });
                    }
                  }

                  // Last-resort body fallback.
                  const bodyText = normalize(document.body?.innerText || '');
                  if (bodyText.length >= 200) {
                    candidates.push({
                      strategy: 'body',
                      selector: 'body',
                      score: Math.min(bodyText.length, 5000) / 50,
                      text: bodyText,
                      html: (document.body?.innerHTML || '').slice(0, 4000),
                      links: Array.from(childLinks).slice(0, 200),
                    });
                  }

                  candidates.sort((a, b) => b.score - a.score);
                  const best = candidates[0] || null;

                  return {
                    url: window.location.href,
                    title,
                    helpArticleId,
                    httpStatus: null,
                    strategy: best ? best.strategy : 'none',
                    selector: best ? best.selector : null,
                    text: best ? best.text : '',
                    htmlExcerpt: best ? best.html : '',
                    contentLinks: best ? best.links : [],
                    childLinks: Array.from(childLinks).slice(0, 200),
                    candidateCount: candidates.length,
                  };
                }
                """
            )
            payload["httpStatus"] = http_status

            raw_text = normalize_text(payload.get("text", ""))
            cleaned_text = cleanup_help_text(raw_text, payload.get("title", ""))
            structured = structure_help_text(cleaned_text, payload.get("title", ""))
            likely_shell = looks_like_shell(payload.get("title", ""), cleaned_text)
            ok = bool(cleaned_text) and len(cleaned_text) >= 400 and not likely_shell

            return {
                "ok": ok,
                "url": payload.get("url", url),
                "httpStatus": payload.get("httpStatus"),
                "title": payload.get("title") or "Untitled",
                "helpArticleId": payload.get("helpArticleId"),
                "strategy": payload.get("strategy"),
                "selector": payload.get("selector"),
                "likelyShell": likely_shell,
                "stealthRequested": use_stealth,
                "stealthAvailable": stealth_sync is not None or Stealth is not None,
                "stealthUsed": stealth_used,
                "rawText": raw_text,
                "text": cleaned_text,
                "intro": structured.get("intro", ""),
                "body": structured.get("body", ""),
                "metadata": structured.get("metadata", {}),
                "sections": structured.get("sections", []),
                "contentBlocks": structured.get("contentBlocks", []),
                "contentLinks": payload.get("contentLinks", []),
                "childLinks": payload.get("childLinks", []),
                "candidateCount": payload.get("candidateCount", 0),
            }
        finally:
            page.close()
            browser.close()


def main() -> int:
    args = parse_args()
    validate_url(args.url)
    result = extract(args.url, args.timeout, use_stealth=args.stealth)
    dump = json.dumps(result, indent=2 if args.pretty else None)
    print(dump)
    return 0 if result.get("ok") else 1


if __name__ == "__main__":
    raise SystemExit(main())

#!/usr/bin/env python3
"""
Tiny wrapper for Salesforce documentation extraction.

Behavior:
- If the URL is on help.salesforce.com, automatically route to the dedicated
  Help extractor with shadow DOM heuristics.
- Otherwise, use a lightweight browser-rendered extractor for official
  Salesforce-owned documentation sites such as developer.salesforce.com,
  architect.salesforce.com, admin.salesforce.com, lightningdesignsystem.com,
  and other supported official documentation hosts.

Examples:
  python3 skills/fetching-salesforce-docs/scripts/extract_salesforce_doc.py \
    --url "https://help.salesforce.com/s/articleView?id=service.miaw_security.htm&type=5" \
    --pretty

  python3 skills/fetching-salesforce-docs/scripts/extract_salesforce_doc.py \
    --url "https://developer.salesforce.com/docs/platform/lwc/guide/use-message-channel-intro.html" \
    --pretty

  python3 skills/fetching-salesforce-docs/scripts/extract_salesforce_doc.py \
    --url "https://architect.salesforce.com/well-architected/overview" \
    --stealth --pretty
"""

from __future__ import annotations

import argparse
import json
import re
from typing import Any, Dict
from urllib.parse import urlparse

from runtime_bootstrap import maybe_reexec_in_sf_docs_runtime

maybe_reexec_in_sf_docs_runtime(__file__)

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import sync_playwright

try:
    from playwright_stealth import Stealth
except ImportError:
    Stealth = None

try:
    from playwright_stealth import stealth_sync
except ImportError:
    stealth_sync = None

from extract_help_salesforce import extract as extract_help_salesforce


USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/122.0.0.0 Safari/537.36"
)

STRONG_SHELL_TOKENS = [
    "loading",
    "sorry to interrupt",
    "css error",
    "enable javascript",
    "we looked high and low",
    "couldn't find that page",
    "404 error",
]

WEAK_SHELL_TOKENS = [
    "sign in",
    "cookie preferences",
]

OFFICIAL_DOC_EXACT_HOSTS = {
    "salesforce.com",
    "lightningdesignsystem.com",
}

OFFICIAL_DOC_SUFFIXES = (
    ".salesforce.com",
    ".lightningdesignsystem.com",
)


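def normalize_text(text: str) -> str:
    """Replace non-breaking spaces, drop carriage returns, and collapse runs of blank lines, spaces, and tabs."""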
    text = text.replace("\u00a0", " ").replace("\r", "")
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


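def looks_like_shell(title: str, text: str) -> bool:
    """Heuristically flag placeholder or error pages instead of real content.

    A strong token anywhere in the title or text rejects the page outright.
    A weak token rejects it only when the extracted text is short (< 600 chars).
    """
    haystack = f"{title}\n{text}".lower()
    if any(token in haystack for token in STRONG_SHELL_TOKENS):
        return True
    if any(token in haystack for token in WEAK_SHELL_TOKENS):
        return len(text.strip()) < 600
    return False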


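def apply_stealth(page) -> bool:
    """Best-effort stealth patching; returns True only if a patch was applied.

    Tries `stealth_sync` first, then `Stealth`, depending on which import succeeded above.
    """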
    if stealth_sync is not None:
        try:
            stealth_sync(page)
            return True
        except Exception:
            pass
    if Stealth is not None:
        try:
            Stealth().apply_stealth_sync(page)
            return True
        except Exception:
            return False
    return False


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Extract official Salesforce documentation from a URL")
    parser.add_argument("--url", required=True, help="Official Salesforce doc URL")
    parser.add_argument("--timeout", type=int, default=60, help="Timeout in seconds (default: 60)")
    parser.add_argument("--stealth", action="store_true", help="Best-effort stealth mode for bot-sensitive pages")
    parser.add_argument("--pretty", action="store_true", help="Pretty-print JSON")
    return parser.parse_args()


def is_official_salesforce_host(host: str) -> bool:
    """Return True for salesforce.com, lightningdesignsystem.com, and any subdomain of either."""
    host = (host or "").lower()
    return host in OFFICIAL_DOC_EXACT_HOSTS or any(host.endswith(suffix) for suffix in OFFICIAL_DOC_SUFFIXES)


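def route_kind(url: str) -> str:
    """Pick the extractor for a URL: "help" or "official"; exit on any other host."""
    host = (urlparse(url).hostname or "").lower()
    # Match the Help host exactly (or a subdomain of it) so that an unrelated
    # host whose name merely ends in "help" is not routed to the Help extractor.
    if host == "help.salesforce.com" or host.endswith(".help.salesforce.com"):
        return "help"
    if is_official_salesforce_host(host):
        return "official"
    raise SystemExit(f"Unsupported host for fetching-salesforce-docs extractor: {host or url}")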


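def extract_official_salesforce_doc(url: str, timeout_seconds: int, use_stealth: bool = False) -> Dict[str, Any]:
    """Render `url` in headless Chromium and return the highest-scoring content candidate as a JSON-serializable dict."""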
    timeout_ms = timeout_seconds * 1000
    host = (urlparse(url).hostname or "").lower()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=USER_AGENT, viewport={"width": 1440, "height": 1400})
        stealth_used = apply_stealth(page) if use_stealth else False

        try:
            response = page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            http_status = response.status if response else None
            # Give client-side rendering a moment to start.
            page.wait_for_timeout(1500)
            # Wait (up to 15 s) for a main content container to hold real text;
            # on timeout, fall through and extract whatever has rendered.
            try:
                page.wait_for_function(
                    r"""
                    () => {
                      const el = document.querySelector('main, article, [role="main"]');
                      const text = (el?.innerText || el?.textContent || '').trim();
                      return text.length > 200;
                    }
                    """,
                    timeout=min(timeout_ms, 15000),
                )
            except PlaywrightTimeoutError:
                pass
            # Then wait (up to 15 s) for network activity to settle, again best-effort.
            try:
                page.wait_for_load_state("networkidle", timeout=min(timeout_ms, 15000))
            except PlaywrightTimeoutError:
                pass
            page.wait_for_timeout(500)

            payload = page.evaluate(
                r"""
                () => {
                  function normalize(text) {
                    return String(text || '')
                      .replace(/\u00a0/g, ' ')
                      .replace(/\r/g, '')
                      .replace(/\n{3,}/g, '\n\n')
                      .trim();
                  }

                  function isVisible(el) {
                    if (!el || !el.getBoundingClientRect) return false;
                    const rect = el.getBoundingClientRect();
                    const style = window.getComputedStyle(el);
                    return rect.width > 0 && rect.height > 0 && style.visibility !== 'hidden' && style.display !== 'none';
                  }

                  function allRoots() {
                    const roots = [document];
                    const queue = [document];
                    while (queue.length) {
                      const current = queue.shift();
                      if (!current || !current.querySelectorAll) continue;
                      const elements = current.querySelectorAll('*');
                      for (const el of elements) {
                        if (el.shadowRoot) {
                          roots.push(el.shadowRoot);
                          queue.push(el.shadowRoot);
                        }
                      }
                    }
                    return roots;
                  }

                  function deepQueryAll(selector) {
                    const results = [];
                    const seen = new Set();
                    for (const root of allRoots()) {
                      if (!root.querySelectorAll) continue;
                      for (const el of root.querySelectorAll(selector)) {
                        if (!seen.has(el)) {
                          seen.add(el);
                          results.push(el);
                        }
                      }
                    }
                    return results;
                  }

                  function collectLinks(scope) {
                    const urls = new Set();
                    const nodes = scope && scope.querySelectorAll ? scope.querySelectorAll('a[href]') : [];
                    for (const a of nodes) {
                      const href = a.href || a.getAttribute('href') || '';
                      if (!href) continue;
                      if (href.startsWith('javascript:') || href.startsWith('mailto:')) continue;
                      urls.add(href);
                    }
                    return Array.from(urls);
                  }

                  const title = document.title || normalize(document.querySelector('title')?.innerText || 'Untitled');
                  const childLinks = new Set();
                  for (const root of allRoots()) {
                    for (const link of collectLinks(root)) childLinks.add(link);
                  }

                  const selectorConfigs = [
                    { selector: 'article', strategy: 'article', base: 260 },
                    { selector: 'main', strategy: 'main', base: 220 },
                    { selector: '[role="main"]', strategy: 'role-main', base: 220 },
                    { selector: '.slds-text-longform', strategy: 'longform', base: 200 },
                    { selector: '.markdown-content', strategy: 'markdown-content', base: 190 },
                    { selector: '.content-body', strategy: 'content-body', base: 180 },
                    { selector: '.article-body', strategy: 'article-body', base: 180 },
                    { selector: '.article-content', strategy: 'article-content', base: 180 },
                    { selector: '.post-content', strategy: 'post-content', base: 170 },
                    { selector: '.main-content', strategy: 'main-content', base: 170 },
                    { selector: '.tds-content', strategy: 'tds-content', base: 165 },
                    { selector: '.siteforceContentArea .content', strategy: 'siteforce-content', base: 160 },
                    { selector: 'doc-content-layout', strategy: 'legacy-doc-layout', base: 150 },
                    { selector: 'doc-xml-content', strategy: 'legacy-doc-xml', base: 145 },
                    { selector: 'doc-amf-reference .markdown-content', strategy: 'legacy-amf-markdown', base: 150 },
                    { selector: 'main .content, article .content', strategy: 'nested-content', base: 140 },
                  ];

                  const candidates = [];
                  for (const cfg of selectorConfigs) {
                    const nodes = deepQueryAll(cfg.selector);
                    for (const node of nodes) {
                      if (!isVisible(node)) continue;
                      const text = normalize(node.innerText || node.textContent || '');
                      if (text.length < 200) continue;
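                      // Score = selector prior + length bonus (capped at 5000 chars),
                      // then adjusted: a title match raises it; navigation, consent, or sign-in text lowers it.
                      let score = cfg.base + Math.min(text.length, 5000) / 30;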
                      const lowered = text.toLowerCase();
                      if (lowered.includes(title.toLowerCase())) score += 50;
                      if (lowered.includes('table of contents')) score -= 80;
                      if (lowered.includes('cookie preferences')) score -= 120;
                      if (lowered.includes('sign in')) score -= 120;
                      candidates.push({
                        strategy: cfg.strategy,
                        selector: cfg.selector,
                        score,
                        text,
                        links: collectLinks(node).slice(0, 200),
                      });
                    }
                  }

                  const bodyText = normalize(document.body?.innerText || '');
                  if (bodyText.length >= 200) {
                    candidates.push({
                      strategy: 'body',
                      selector: 'body',
                      score: Math.min(bodyText.length, 5000) / 50,
                      text: bodyText,
                      links: Array.from(childLinks).slice(0, 200),
                    });
                  }

                  candidates.sort((a, b) => b.score - a.score);
                  const best = candidates[0] || null;

                  return {
                    url: window.location.href,
                    title,
                    strategy: best ? best.strategy : 'none',
                    selector: best ? best.selector : null,
                    text: best ? best.text : '',
                    contentLinks: best ? best.links : [],
                    childLinks: Array.from(childLinks).slice(0, 200),
                    candidateCount: candidates.length,
                  };
                }
                """
            )

            text = normalize_text(payload.get("text", ""))
            likely_shell = looks_like_shell(payload.get("title", ""), text)
            ok = bool(text) and len(text) >= 300 and not likely_shell

            return {
                "ok": ok,
                "url": payload.get("url", url),
                "httpStatus": http_status,
                "title": payload.get("title") or "Untitled",
                "host": host,
                "hostKind": "official-salesforce",
                "strategy": payload.get("strategy"),
                "selector": payload.get("selector"),
                "likelyShell": likely_shell,
                "stealthRequested": use_stealth,
                "stealthAvailable": stealth_sync is not None or Stealth is not None,
                "stealthUsed": stealth_used,
                "text": text,
                "contentLinks": payload.get("contentLinks", []),
                "childLinks": payload.get("childLinks", []),
                "candidateCount": payload.get("candidateCount", 0),
            }
        finally:
            page.close()
            browser.close()


def main() -> int:
    args = parse_args()
    kind = route_kind(args.url)

    if kind == "help":
        result = extract_help_salesforce(args.url, args.timeout, use_stealth=args.stealth)
        result["routedVia"] = "extract_help_salesforce"
        result.setdefault("hostKind", "help")
    else:
        result = extract_official_salesforce_doc(args.url, args.timeout, use_stealth=args.stealth)
        result["routedVia"] = "generic_official_salesforce_extractor"

    dump = json.dumps(result, indent=2 if args.pretty else None)
    print(dump)
    return 0 if result.get("ok") else 1


if __name__ == "__main__":
    raise SystemExit(main())

from __future__ import annotations

import os
import sys
from pathlib import Path


def sf_docs_runtime_root() -> Path:
    """Directory that holds the skill's private runtime (venv and Playwright browsers)."""
    return Path.home() / ".claude" / ".fetching-salesforce-docs-runtime"


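def sf_docs_runtime_python() -> Path:
    """Path to the runtime venv's interpreter (POSIX or Windows layout).

    Returns the first existing candidate, else the platform's expected path.
    """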
    root = sf_docs_runtime_root() / "venv"
    candidates = [
        root / "bin" / "python",
        root / "bin" / "python3",
        root / "Scripts" / "python.exe",
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return candidates[0] if os.name != "nt" else candidates[-1]


def prepare_sf_docs_runtime_env(env: dict[str, str] | None = None) -> dict[str, str]:
    """Copy `env` (default: os.environ) and fill in the runtime paths if they are unset."""
    runtime_root = sf_docs_runtime_root()
    target = dict(env or os.environ)
    target.setdefault("PLAYWRIGHT_BROWSERS_PATH", str(runtime_root / "ms-playwright"))
    target.setdefault("SF_DOCS_RUNTIME_ROOT", str(runtime_root))
    return target


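def maybe_reexec_in_sf_docs_runtime(script_path: str) -> bool:
    """Re-run the script under the runtime venv's interpreter when one exists.

    Returns False when no re-exec is needed or possible. On a successful
    os.execve the current process is replaced, so the final return is not reached.
    """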
    runtime_python = sf_docs_runtime_python()
    os.environ.update(prepare_sf_docs_runtime_env())

    if os.environ.get("SF_DOCS_RUNTIME_ACTIVE") == "1":
        return False
    if not runtime_python.exists():
        return False

    try:
        current_python = Path(sys.executable).resolve()
        target_python = runtime_python.resolve()
        if current_python == target_python:
            os.environ["SF_DOCS_RUNTIME_ACTIVE"] = "1"
            return False
    except OSError:
        pass

    env = prepare_sf_docs_runtime_env()
    env["SF_DOCS_RUNTIME_ACTIVE"] = "1"
    os.execve(
        str(runtime_python),
        [str(runtime_python), str(Path(script_path).resolve()), *sys.argv[1:]],
        env,
    )
    return True
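
The wrapper script above prints a single JSON object and exits with status 0 only when `ok` is true. As a minimal sketch of how a caller might turn that output into an accept or reject decision, the helper below reads only fields the script emits (`ok`, `likelyShell`, `httpStatus`, `title`, `url`, `routedVia`). The function name `summarize_extraction` and the shape of its return value are hypothetical and not part of the skill.

```python
import json
from typing import Any, Dict, List


def summarize_extraction(stdout: str, exit_code: int) -> Dict[str, Any]:
    """Hypothetical helper: decide whether an extractor run is usable evidence."""
    result = json.loads(stdout)
    reasons: List[str] = []

    # The scripts exit non-zero whenever "ok" is false.
    if exit_code != 0 or not result.get("ok"):
        reasons.append("extractor reported ok=false")
    # "likelyShell" marks loading, error, or soft-404 placeholder pages.
    if result.get("likelyShell"):
        reasons.append("page looks like a shell")
    # "httpStatus" may be None when no response object was available.
    status = result.get("httpStatus")
    if isinstance(status, int) and status >= 400:
        reasons.append(f"HTTP {status}")

    return {
        "accepted": not reasons,
        "reasons": reasons,
        "title": result.get("title"),
        "url": result.get("url"),
        "routedVia": result.get("routedVia"),
    }
```

A caller could run the script with `subprocess.run(..., capture_output=True, text=True)` and pass `stdout` and `returncode` to this helper; collecting the reasons keeps any extraction caveats available for the grounded answer.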

Forks & variants (1)

Fetching Salesforce Docs has 1 known copy in the catalog, with 491 installs. It canonicalizes to this original listing.

forcedotcom - 491 installs

How it compares

Pick fetching-salesforce-docs over generic web-search skills when official Salesforce pages are JavaScript-rendered and naive scrapers return empty shells.

FAQ

Which domains are in scope?

developer.salesforce.com, help.salesforce.com, architect.salesforce.com, admin.salesforce.com, and lightningdesignsystem.com.
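
As a rough illustration, the five hosts listed above can be enforced with a strict allowlist check. This is a sketch under stated assumptions, not the skill's own logic: the bundled wrapper script is broader and accepts any `salesforce.com` or `lightningdesignsystem.com` subdomain. The name `is_in_scope` and the tolerance for a leading `www.` are assumptions made for this example.

```python
from urllib.parse import urlparse

# The five hosts named in the FAQ answer.
IN_SCOPE_HOSTS = {
    "developer.salesforce.com",
    "help.salesforce.com",
    "architect.salesforce.com",
    "admin.salesforce.com",
    "lightningdesignsystem.com",
}


def is_in_scope(url: str) -> bool:
    """Hypothetical check: True only when the URL's host is one of the listed hosts."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: treat a leading "www." as equivalent to the bare host.
    if host.startswith("www."):
        host = host[4:]
    return host in IN_SCOPE_HOSTS
```

Comparing the parsed hostname exactly, rather than searching the URL string, keeps look-alike hosts such as `developer.salesforce.com.example.org` out of scope.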

When is a page not good enough evidence?

When it is only a broad landing page, shell-rendered chrome, wrong product area, or missing the requested identifier or concept.
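
The "missing the requested identifier or concept" test can be approximated with a whitespace-insensitive, case-insensitive substring check on the extracted text. The sketch below is one possible reading of that rule; `mentions_concept` is a hypothetical helper, and a real acceptance check would also need to weigh product area and page type.

```python
import re


def mentions_concept(page_text: str, concept: str) -> bool:
    """Hypothetical check: does the extracted text contain the requested identifier?"""

    def squash(value: str) -> str:
        # Collapse all whitespace runs and ignore case before comparing.
        return re.sub(r"\s+", " ", value).strip().lower()

    needle = squash(concept)
    # An empty concept never counts as a match.
    return bool(needle) and needle in squash(page_text)
```

For example, a page whose text contains `System.StubProvider` passes for that identifier, while a broad guide landing page that never names it fails and should be rejected in favor of a more specific child page.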

Are PDF fallbacks allowed?

No. The skill avoids PDF fallbacks and third-party blogs unless the user explicitly requests them.

Is Fetching Salesforce Docs safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
