
Benchmark
Measure Core Web Vitals, API latency percentiles, and build/HMR times to catch regressions before merge or launch.
Overview
Benchmark is an agent skill most often used in Ship (also Launch and Build) that measures web, API, and build performance baselines and detects regressions against stated targets.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill benchmarkWhat is this skill?
- Mode 1 page performance via browser MCP: LCP, CLS, INP, FCP, TTFB with documented targets
- Mode 2 API benchmarks: 100 iterations with p50/p95/p99 and 10 concurrent load
- Mode 3 build performance: cold build, hot reload/HMR, and dev feedback loop timing
- Resource budgets: page weight, JS bundle, CSS, images, and third-party script sizing
- Designed for before/after PR comparison and pre-launch performance gate checks
- Page targets include LCP < 2.5s, CLS < 0.1, INP < 200ms, FCP < 1.8s, TTFB < 800ms
- API mode hits each endpoint 100 times and tests 10 concurrent requests
- Resource targets include total page weight < 1MB and JS bundle < 200KB gzipped
Adoption & trust: 3.5k installs on skills.sh; 210k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You cannot tell whether a change or launch candidate actually improved or harmed speed across the page, API, and build pipeline.
Who is it for?
Solo builders who need numeric baselines before launch or who want objective before/after PR perf diffs.
Skip if: Pure feature work with no performance concern, or teams that already have mature continuous perf CI and only need orchestration docs.
When should I use this skill?
Before/after a PR, setting baselines, user reports slowness, pre-launch perf checks, or comparing stack alternatives.
What do I get? / Deliverables
You capture repeatable metrics (Core Web Vitals, latency percentiles, build times) you can compare across PRs and gate release decisions with evidence.
- Documented performance baseline metrics per mode
- Before/after comparison suitable for PR or launch review
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Ship perf is the canonical shelf because the skill’s primary job is regression detection and SLA-style targets tied to release readiness, not ideation or growth analytics. Perf subphase matches page, API, and build benchmarking modes with explicit targets (LCP, p95 latency, cold build time).
Where it fits
Run page Mode 1 on staging after a bundle change to see if LCP crossed 2.5s.
Gate go-live by confirming FCP and total page weight targets before sharing the public URL.
Hit each API route 100 times and log p95 against your SLA before picking a framework variant.
Re-baseline endpoints when users report slowdowns and attach numbers to an incident fix PR.
How it compares
Structured measurement ritual with targets—not a substitute for load-testing infrastructure or production APM dashboards.
Common Questions / FAQ
Who is benchmark for?
Solo and indie developers using Claude Code-style agents who ship web and API products and need regression-aware performance measurement.
When should I use benchmark?
Use it in Ship perf work before/after PRs, in Launch prep to verify Core Web Vitals targets, in Build when comparing stack alternatives or slow HMR, and whenever stakeholders report subjective slowness.
Is benchmark safe to install?
It may navigate URLs and hit APIs you configure; review the Security Audits panel on this page and only benchmark endpoints and environments you own or are allowed to load test.
SKILL.md
READMESKILL.md - Benchmark
# Benchmark — Performance Baseline & Regression Detection ## When to Use - Before and after a PR to measure performance impact - Setting up performance baselines for a project - When users report "it feels slow" - Before a launch — ensure you meet performance targets - Comparing your stack against alternatives ## How It Works ### Mode 1: Page Performance Measures real browser metrics via browser MCP: ``` 1. Navigate to each target URL 2. Measure Core Web Vitals: - LCP (Largest Contentful Paint) — target < 2.5s - CLS (Cumulative Layout Shift) — target < 0.1 - INP (Interaction to Next Paint) — target < 200ms - FCP (First Contentful Paint) — target < 1.8s - TTFB (Time to First Byte) — target < 800ms 3. Measure resource sizes: - Total page weight (target < 1MB) - JS bundle size (target < 200KB gzipped) - CSS size - Image weight - Third-party script weight 4. Count network requests 5. Check for render-blocking resources ``` ### Mode 2: API Performance Benchmarks API endpoints: ``` 1. Hit each endpoint 100 times 2. Measure: p50, p95, p99 latency 3. Track: response size, status codes 4. Test under load: 10 concurrent requests 5. Compare against SLA targets ``` ### Mode 3: Build Performance Measures development feedback loop: ``` 1. Cold build time 2. Hot reload time (HMR) 3. Test suite duration 4. TypeScript check time 5. Lint time 6. Docker build time ``` ### Mode 4: Before/After Comparison Run before and after a change to measure impact: ``` /benchmark baseline # saves current metrics # ... make changes ... /benchmark compare # compares against baseline ``` Output: ``` | Metric | Before | After | Delta | Verdict | |--------|--------|-------|-------|---------| | LCP | 1.2s | 1.4s | +200ms | WARNING: WARN | | Bundle | 180KB | 175KB | -5KB | ✓ BETTER | | Build | 12s | 14s | +2s | WARNING: WARN | ``` ## Output Stores baselines in `.ecc/benchmarks/` as JSON. Git-tracked so the team shares baselines. ## Integration - CI: run `/benchmark compare` on every PR - Pair with `/canary-watch` for post-deploy monitoring - Pair with `/browser-qa` for full pre-ship checklist