
Benchmark E2E
Run the vercel-plugin end-to-end benchmark pipeline to prove skill injection, dev servers, and overnight improvement loops work on realistic projects.
Overview
Benchmark E2E is an agent skill for the Ship phase that runs the vercel-plugin multi-stage benchmark suite and outputs scorecards plus improvement reports.
Install
npx skills add https://github.com/vercel-labs/vercel-plugin --skill benchmark-e2e
What is this skill?
- Single-command pipeline: runner → verify → analyze → report
- Full suite spans 9 realistic projects; --quick runs the first 3
- Exercises claude --print with VERCEL_PLUGIN_LOG_LEVEL=trace for skill injection
- Verifies dev servers return 200 with non-empty HTML
- Produces report.md and report.json scorecards for overnight self-improvement loops
- Full benchmark suite covers 9 projects (~2–3 hours)
- --quick mode runs the first 3 projects (~30–45 minutes)
- Pipeline has 4 sequential stages: runner, verify, analyze, report
Adoption & trust: 203 installs on skills.sh; 187 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You changed skill injection or plugin behavior and have no realistic multi-project proof that dev servers and agent sessions still succeed.
Who is it for?
Plugin and agent-skill maintainers working inside vercel-plugin who want a repeatable E2E gate before merging risky injection changes.
Skip if: you only need casual app deploys or production monitoring, or your team lacks the vercel-plugin benchmark scripts and the Bun toolchain.
When should I use this skill?
Use it when you maintain vercel-plugin and need realistic E2E validation of skill injection, dev servers, and improvement reporting.
What do I get? / Deliverables
You get verified project runs, analyzed conversation metrics, and report.md/report.json recommendations to feed overnight self-improvement loops.
- report.md and report.json with scorecards and recommendations
- Per-project verification status and analyzed session metrics
Journey fit
Shipping confidence for plugin authors depends on automated E2E verification, which maps to the testing slice of Ship. The Testing subphase covers multi-project suites, dev-server health checks, and regression-style benchmark orchestration.
How it compares
A skill-packaged eval orchestrator focused on realistic plugin behavior, not a generic Playwright tutorial or a single-app smoke test.
Common Questions / FAQ
Who is benchmark-e2e for?
Developers maintaining vercel-plugin or similar agent-injection stacks who need automated, multi-project end-to-end validation.
When should I use benchmark-e2e?
In Ship testing before releasing plugin changes, after altering skill injection, or when setting up overnight benchmark-and-report improvement cycles.
Is benchmark-e2e safe to install?
It runs local shells, dev servers, and claude --print sessions against test directories; check the Security Audits panel on this page and point --base at an isolated directory you control.
SKILL.md
# Benchmark E2E

Single-command pipeline that creates projects, exercises skill injection via `claude --print`, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

## Quick Start

```bash
# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts

# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick
```

Options:

| Flag | Description | Default |
|------|-------------|---------|
| `--quick` | Run only first 3 projects | `false` |
| `--base <path>` | Override base directory | `~/dev/vercel-plugin-testing` |
| `--timeout <ms>` | Per-project timeout (forwarded to runner) | `900000` (15 min) |

## Pipeline Stages

The orchestrator chains four stages sequentially, aborting on failure:

1. **runner**: Creates test dirs, installs plugin, runs `claude --print` with `VERCEL_PLUGIN_LOG_LEVEL=trace`
2. **verify**: Detects package manager, launches dev server, polls for 200 with non-empty HTML
3. **analyze**: Matches JSONL sessions to projects via `run-manifest.json`, extracts metrics
4. **report**: Generates `report.md` and `report.json` with scorecards and recommendations

## Contracts

### `run-manifest.json`

Written by the runner at `<base>/results/run-manifest.json`. Links all downstream stages to the same run.

```typescript
interface BenchmarkRunManifest {
  runId: string;      // UUID for this pipeline run
  timestamp: string;  // ISO 8601
  baseDir: string;    // Absolute path to base directory
  projects: Array<{
    slug: string;        // e.g. "01-recipe-platform"
    cwd: string;         // Absolute path to project dir
    promptHash: string;  // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}
```

The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
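
As a rough illustration of how a downstream stage might consume this contract, the sketch below validates manifest text and looks up the project that owns a given working directory. The helper names `parseManifest` and `findProjectForCwd` are hypothetical, and matching a session by its working directory (with POSIX-style paths) is an assumption; the actual analyzer may correlate sessions differently.

```typescript
// Shape copied from the run-manifest.json contract.
interface BenchmarkRunManifest {
  runId: string;
  timestamp: string;
  baseDir: string;
  projects: Array<{
    slug: string;
    cwd: string;
    promptHash: string;
    expectedSkills: string[];
  }>;
}

type ManifestProject = BenchmarkRunManifest["projects"][number];

// Hypothetical helper: parse manifest text and check the fields used below.
function parseManifest(text: string): BenchmarkRunManifest {
  const raw = JSON.parse(text) as Partial<BenchmarkRunManifest>;
  if (
    typeof raw.runId !== "string" ||
    typeof raw.timestamp !== "string" ||
    typeof raw.baseDir !== "string" ||
    !Array.isArray(raw.projects)
  ) {
    throw new Error("run-manifest.json is missing required fields");
  }
  for (const project of raw.projects) {
    if (
      typeof project.slug !== "string" ||
      typeof project.cwd !== "string" ||
      !Array.isArray(project.expectedSkills)
    ) {
      throw new Error("run-manifest.json contains a malformed project entry");
    }
  }
  return raw as BenchmarkRunManifest;
}

// Hypothetical helper: find the project whose directory equals or contains
// the given working directory. Assumes POSIX-style "/" separators.
function findProjectForCwd(
  manifest: BenchmarkRunManifest,
  sessionCwd: string,
): ManifestProject | undefined {
  const stripTrailingSlash = (p: string): string => p.replace(/\/+$/, "");
  const target = stripTrailingSlash(sessionCwd);
  return manifest.projects.find((project) => {
    const root = stripTrailingSlash(project.cwd);
    // Require a separator after the root so "01-foo-2" does not match "01-foo".
    return target === root || target.startsWith(root + "/");
  });
}
```

Reading the file is left to the caller, so the helpers stay independent of the runtime. Comparing on a path boundary rather than a bare prefix avoids attributing a session to a sibling project whose slug shares the same leading characters.
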
### `events.jsonl`

The orchestrator writes NDJSON events to `<base>/results/events.jsonl` tracking pipeline lifecycle:

```jsonc
// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner", "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner", "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify", "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
```

### `report.json`

Machine-readable report at `<base>/results/report.json` for programmatic consumption:

```typescript
interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string;  // Skill that was expected but not injected
    glob: string;   // Suggested pathPattern glob
    tool: string;   // Tool name that should trigger injection
  }>;
}
```

## Overnight Automation Loop

Run the pipeline repeatedly with a cooldown between iterations:

```bash
while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done
```

Each run produces timestamped `report.json` and `report.md` files. Compare across runs to track improvement.

## Self-Improvement Cycle

The pipeline enables a closed feedback loop:

1. **Run**: `bun run scripts/benchmark-e2e.ts` exercises the p