Benchmark

Name: Benchmark
Author: affaan-m

affaan-m/everything-claude-code

Measure Core Web Vitals, API latency percentiles, and build/HMR times to catch regressions before merge or launch.

Overview

Benchmark is an agent skill most often used in Ship (also Launch and Build) that measures web, API, and build performance baselines and detects regressions against stated targets.

Install

npx skills add https://github.com/affaan-m/everything-claude-code --skill benchmark

What is this skill?

Mode 1 page performance via browser MCP, with documented targets: LCP < 2.5s, CLS < 0.1, INP < 200ms, FCP < 1.8s, TTFB < 800ms
Mode 2 API benchmarks: 100 requests per endpoint reported as p50/p95/p99 latency, plus a load test with 10 concurrent requests
Mode 3 build performance: cold build, hot reload/HMR, and dev feedback loop timing
Resource budgets: total page weight (target < 1MB), JS bundle (target < 200KB gzipped), CSS, images, and third-party script sizing
Designed for before/after PR comparison and pre-launch performance gate checks

Compatible agents: Claude Code, Cursor, Codex

Adoption & trust: 3.5k installs on skills.sh; 210k GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You cannot tell whether a change or launch candidate actually improved or harmed speed across the page, API, and build pipeline.

Who is it for?

Solo builders who need numeric baselines before launch or who want objective before/after PR perf diffs.

Skip if: Pure feature work with no performance concern, or teams that already have mature continuous perf CI and only need orchestration docs.

When should I use this skill?

Before and after a PR, when setting baselines, when users report slowness, for pre-launch perf checks, or when comparing stack alternatives.

What do I get? / Deliverables

You capture repeatable metrics (Core Web Vitals, latency percentiles, build times) you can compare across PRs and gate release decisions with evidence.

Documented performance baseline metrics per mode
Before/after comparison suitable for PR or launch review

Journey fit

Spans multiple journey phases: a primary shelf plus the alternate fits below.

Primary fit

Ship perf is the canonical shelf because the skill’s primary job is regression detection and SLA-style targets tied to release readiness, not ideation or growth analytics. Perf subphase matches page, API, and build benchmarking modes with explicit targets (LCP, p95 latency, cold build time).

Also useful

Launch: Distribution & launch channels
Build: Backend, data & payments

Where it fits

Ship: Performance
Example use: Run page Mode 1 on staging after a bundle change to see if LCP crossed 2.5s.

Launch: Distribution & launch channels
Example use: Gate go-live by confirming FCP and total page weight targets before sharing the public URL.

Build: Backend, data & payments
Example use: Hit each API route 100 times and log p95 against your SLA before picking a framework variant.

Operate: Monitoring & observability
Example use: Re-baseline endpoints when users report slowdowns and attach numbers to an incident fix PR.

How it compares

A structured measurement ritual with explicit targets. It is not a substitute for load-testing infrastructure or production APM dashboards.

Common Questions / FAQ

Who is benchmark for?

Solo and indie developers using Claude Code-style agents who ship web and API products and need regression-aware performance measurement.

When should I use benchmark?

Use it in Ship perf work before and after PRs, in Launch prep to verify Core Web Vitals targets, in Build when comparing stack alternatives or investigating slow HMR, and whenever stakeholders report subjective slowness.

Is benchmark safe to install?

It may navigate URLs and hit APIs you configure; review the Security Audits panel on this page and only benchmark endpoints and environments you own or are allowed to load test.

SKILL.md

# Benchmark — Performance Baseline & Regression Detection

## When to Use

- Before and after a PR to measure performance impact
- Setting up performance baselines for a project
- When users report "it feels slow"
- Before a launch — ensure you meet performance targets
- Comparing your stack against alternatives

## How It Works

### Mode 1: Page Performance

Measures real browser metrics via browser MCP:

```
1. Navigate to each target URL
2. Measure Core Web Vitals:
   - LCP (Largest Contentful Paint) — target < 2.5s
   - CLS (Cumulative Layout Shift) — target < 0.1
   - INP (Interaction to Next Paint) — target < 200ms
   - FCP (First Contentful Paint) — target < 1.8s
   - TTFB (Time to First Byte) — target < 800ms
3. Measure resource sizes:
   - Total page weight (target < 1MB)
   - JS bundle size (target < 200KB gzipped)
   - CSS size
   - Image weight
   - Third-party script weight
4. Count network requests
5. Check for render-blocking resources
```
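The targets listed above can be checked mechanically once the metrics have been collected. The following is a minimal sketch, not the skill's own implementation: the metric key names, the `check_targets` helper, and the reading of "1MB" and "200KB" as decimal byte counts are all assumptions made for illustration. It only compares numbers; collecting them through the browser MCP is out of scope here.

```python
from typing import Dict, List

# Targets copied from the Mode 1 list. Timings are in seconds, CLS is unitless,
# and the two size budgets are in bytes. Key names are hypothetical.
TARGETS: Dict[str, float] = {
    "lcp_s": 2.5,
    "cls": 0.1,
    "inp_s": 0.2,
    "fcp_s": 1.8,
    "ttfb_s": 0.8,
    "page_weight_bytes": 1_000_000,  # assumption: "1MB" read as 10^6 bytes
    "js_gzip_bytes": 200_000,        # assumption: "200KB" read as 200 * 10^3 bytes
}


def check_targets(measured: Dict[str, float]) -> List[str]:
    """Return the names of measured metrics that miss their target.

    Every target is a strict upper bound ("< 2.5s"), so a value equal to
    the limit counts as a miss. Metrics that were not measured are skipped.
    """
    failures = []
    for name, limit in TARGETS.items():
        value = measured.get(name)
        if value is not None and value >= limit:
            failures.append(name)
    return failures


# Hypothetical measurements for a single page.
sample = {"lcp_s": 2.7, "cls": 0.05, "inp_s": 0.15, "fcp_s": 1.2, "ttfb_s": 0.4}
print(check_targets(sample))  # ['lcp_s']
```

Returning the list of failing metric names, rather than a single pass/fail flag, keeps the result usable both as a gate and as a report line.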

### Mode 2: API Performance

Benchmarks API endpoints:

```
1. Hit each endpoint 100 times
2. Measure: p50, p95, p99 latency
3. Track: response size, status codes
4. Test under load: 10 concurrent requests
5. Compare against SLA targets
```
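As a rough illustration of steps 1, 2, and 4, the sketch below times a callable 100 times sequentially, reports nearest-rank percentiles, and then fires 10 calls concurrently. It is an assumption-laden sketch rather than the skill's actual method: `percentile`, `time_call`, `benchmark`, and `fake_endpoint` are hypothetical names, and the nearest-rank definition is one of several common percentile conventions. In real use the callable would perform an HTTP request against an endpoint you are allowed to test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def time_call(call: Callable[[], object]) -> float:
    """Return the wall-clock duration of one call, in milliseconds."""
    start = time.perf_counter()
    call()
    return (time.perf_counter() - start) * 1000


def benchmark(call: Callable[[], object],
              iterations: int = 100,
              concurrency: int = 10) -> Dict[str, float]:
    """Time sequential calls, then one batch of concurrent calls."""
    sequential = [time_call(call) for _ in range(iterations)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        under_load = list(pool.map(lambda _: time_call(call), range(concurrency)))
    return {
        "p50_ms": percentile(sequential, 50),
        "p95_ms": percentile(sequential, 95),
        "p99_ms": percentile(sequential, 99),
        "concurrent_max_ms": max(under_load),
    }


def fake_endpoint() -> None:
    """Stand-in for a real request so the sketch runs offline."""
    time.sleep(0.001)


if __name__ == "__main__":
    for key, value in benchmark(fake_endpoint).items():
        print(f"{key}: {value:.2f}")
```

With 100 samples, nearest-rank p99 is simply the 99th smallest value, so a single slow outlier shows up in the maximum but not in p99. Whether that is the right convention for your SLA is a judgment call.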

### Mode 3: Build Performance

Measures development feedback loop:

```
1. Cold build time
2. Hot reload time (HMR)
3. Test suite duration
4. TypeScript check time
5. Lint time
6. Docker build time
```
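Most of the steps above reduce to "run a command and time it". A minimal sketch of that idea follows, assuming each step can be expressed as a single command line. The `time_command` and `time_steps` helpers and the step names are hypothetical, and the stand-in commands exist only so the example runs anywhere. Hot reload timing is not covered, since it depends on the dev server rather than on a command that exits.

```python
import subprocess
import sys
import time
from typing import Dict, List


def time_command(argv: List[str]) -> float:
    """Run a command to completion and return its wall-clock time in seconds.

    check=True raises CalledProcessError on a non-zero exit code, so a
    failed build is never recorded as a fast build.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def time_steps(steps: Dict[str, List[str]]) -> Dict[str, float]:
    """Time each named step in order and return seconds per step."""
    return {name: time_command(argv) for name, argv in steps.items()}


# Stand-in commands (a no-op Python process). Replace them with your
# project's real build, test, type-check, and lint commands.
steps = {
    "cold_build": [sys.executable, "-c", "pass"],
    "lint": [sys.executable, "-c", "pass"],
}

for name, seconds in time_steps(steps).items():
    print(f"{name}: {seconds:.2f}s")
```

For a cold build measurement to be meaningful, caches and build output would need to be cleared before the timed run; the sketch leaves that to the caller.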

### Mode 4: Before/After Comparison

Run before and after a change to measure impact:

```
/benchmark baseline    # saves current metrics
# ... make changes ...
/benchmark compare     # compares against baseline
```

Output:
```
| Metric | Before | After | Delta  | Verdict  |
|--------|--------|-------|--------|----------|
| LCP    | 1.2s   | 1.4s  | +200ms | ⚠ WARN   |
| Bundle | 180KB  | 175KB | -5KB   | ✓ BETTER |
| Build  | 12s    | 14s   | +2s    | ⚠ WARN   |
```
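The comparison table can be derived from two metric snapshots. The sketch below shows one way to do that under a stated assumption: the source does not say how verdicts are decided, so this version treats every metric as lower-is-better and flags any increase as WARN and any decrease as BETTER. The `compare` helper and the metric keys are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, float, float, str]


def compare(before: Dict[str, float], after: Dict[str, float]) -> List[Row]:
    """Build (metric, before, after, delta, verdict) rows.

    Assumption: lower is better for every metric, and any change at all
    is reported. A real gate would probably allow some noise tolerance.
    """
    rows: List[Row] = []
    for metric, old in before.items():
        if metric not in after:
            continue  # metric was not measured in the second run
        new = after[metric]
        delta = new - old
        if delta > 0:
            verdict = "WARN"
        elif delta < 0:
            verdict = "BETTER"
        else:
            verdict = "SAME"
        rows.append((metric, old, new, delta, verdict))
    return rows


# Same figures as the sample table, with units folded into the key names.
before = {"lcp_ms": 1200, "bundle_kb": 180, "build_s": 12}
after = {"lcp_ms": 1400, "bundle_kb": 175, "build_s": 14}

print("| Metric | Before | After | Delta | Verdict |")
print("|--------|--------|-------|-------|---------|")
for metric, old, new, delta, verdict in compare(before, after):
    print(f"| {metric} | {old} | {new} | {delta:+} | {verdict} |")
```

With these inputs the verdicts match the sample table: WARN for LCP and build time, BETTER for the bundle.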

## Output

Baselines are stored as JSON in `.ecc/benchmarks/`. The directory is Git-tracked so the whole team shares the same baselines.
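For illustration, a baseline round trip might look like the sketch below. Only the `.ecc/benchmarks/` directory and the JSON format come from the text above; the file name, the flat metric layout, and the `save_baseline` and `load_baseline` helpers are assumptions, since the actual schema is not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def baseline_path(root: Path, name: str) -> Path:
    """Hypothetical layout: one JSON file per named baseline."""
    return root / ".ecc" / "benchmarks" / f"{name}.json"


def save_baseline(root: Path, name: str, metrics: Dict[str, float]) -> Path:
    """Write metrics as pretty-printed JSON and return the file path."""
    path = baseline_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and a trailing newline keep Git diffs small and stable.
    text = json.dumps(metrics, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")
    return path


def load_baseline(root: Path, name: str) -> Dict[str, float]:
    """Read a previously saved baseline."""
    return json.loads(baseline_path(root, name).read_text(encoding="utf-8"))


# Demonstrate in a temporary directory instead of a real repository.
with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp)
    save_baseline(repo_root, "baseline", {"lcp_ms": 1200, "bundle_kb": 180})
    print(load_baseline(repo_root, "baseline"))
```

Because the files are committed, writing them deterministically matters more than compactness: a reviewer should see only the metrics that actually changed.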

## Integration

- CI: run `/benchmark compare` on every PR
- Pair with `/canary-watch` for post-deploy monitoring
- Pair with `/browser-qa` for full pre-ship checklist
