Benchmark E2e

Name: Benchmark E2e
Author: vercel-labs

vercel-labs/vercel-plugin

1.1k installs
229 repo stars
Updated July 27, 2026

About

The benchmark-e2e skill is an end-to-end benchmark suite for vercel-plugin. It runs realistic projects through skill injection, launches dev servers, verifies everything works, analyzes conversation logs, and produces an improvement report for overnight self-improvement loops.

Benchmark E2e by the numbers

1,112 all-time installs (skills.sh)
Ranked #515 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/vercel-labs/vercel-plugin --skill benchmark-e2e

How do I use benchmark-e2e for the task described in its SKILL.md triggers?

benchmark-e2e is an end-to-end benchmark suite for vercel-plugin. It runs realistic projects through skill injection, launches dev servers, verifies everything works, analyzes conversation logs, and produces an improvement report.

Who is it for?

Teams invoking benchmark-e2e when the user request matches documented triggers and prerequisites.

Skip if: cached docs are missing, the request is a negative trigger, or another sibling skill owns the workflow.

What you get

Step-by-step guidance grounded in benchmark-e2e documentation and reference files.

benchmark run results
dev-server verification logs
improvement reports

By the numbers

Full benchmark suite runs 9 projects in roughly 2-3 hours
Quick mode benchmarks first 3 projects in about 30-45 minutes

Files

SKILL.md (Markdown)

Benchmark E2E

Single-command pipeline that creates projects, exercises skill injection via `claude --print`, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

Quick Start

# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts

# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick

Options:

Flag	Description	Default
`--quick`	Run only first 3 projects	`false`
`--base <path>`	Override base directory	`~/dev/vercel-plugin-testing`
`--timeout <ms>`	Per-project timeout (forwarded to runner)	`900000` (15 min)
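
As a rough illustration of how the three flags and their defaults fit together, the sketch below parses an argument list into an options object. The names `BenchmarkOptions`, `DEFAULT_OPTIONS`, and `parseBenchmarkArgs` are hypothetical and are not taken from `scripts/benchmark-e2e.ts`; only the flag names and default values come from the table above.

```typescript
// Hypothetical option shape; only the flags and defaults come from the table.
interface BenchmarkOptions {
  quick: boolean;
  base: string;
  timeoutMs: number;
}

// Defaults copied from the options table.
const DEFAULT_OPTIONS: BenchmarkOptions = {
  quick: false,
  base: "~/dev/vercel-plugin-testing",
  timeoutMs: 900000, // 15 minutes
};

// Illustrative parser; the real script may parse its arguments differently.
function parseBenchmarkArgs(argv: string[]): BenchmarkOptions {
  const options: BenchmarkOptions = { ...DEFAULT_OPTIONS };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--quick") {
      options.quick = true;
    } else if (flag === "--base" || flag === "--timeout") {
      // Both of these flags consume the next argument as their value.
      const value = argv[++i];
      if (value === undefined) {
        throw new Error(`Missing value for ${flag}`);
      }
      if (flag === "--base") {
        options.base = value;
      } else {
        const ms = Number(value);
        if (!Number.isFinite(ms) || ms <= 0) {
          throw new Error(`Invalid --timeout value: ${value}`);
        }
        options.timeoutMs = ms;
      }
    } else {
      throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return options;
}
```

For example, `parseBenchmarkArgs(["--quick", "--timeout", "60000"])` keeps the default base directory while enabling quick mode with a one-minute per-project timeout.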

Pipeline Stages

The orchestrator chains four stages sequentially, aborting on failure:

1. **runner** - Creates test dirs, installs plugin, runs `claude --print` with `VERCEL_PLUGIN_LOG_LEVEL=trace`
2. **verify** - Detects package manager, launches dev server, polls for 200 with non-empty HTML
3. **analyze** - Matches JSONL sessions to projects via `run-manifest.json`, extracts metrics
4. **report** - Generates `report.md` and `report.json` with scorecards and recommendations
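
The "chain sequentially, abort on failure" behaviour can be sketched as follows. This is a simplified model, not the orchestrator's code: each stage is represented as a plain function returning an exit code, whereas the real stages presumably run as separate scripts. `Stage`, `StageResult`, and `runPipeline` are hypothetical names.

```typescript
// Stage names come from the list above; everything else is an assumption.
type StageName = "runner" | "verify" | "analyze" | "report";

interface Stage {
  name: StageName;
  run: () => number; // returns an exit code; 0 means success
}

interface StageResult {
  stage: StageName;
  exitCode: number;
}

// Runs stages in order and stops at the first non-zero exit code,
// so stages after the failing one never start.
function runPipeline(stages: Stage[]): StageResult[] {
  const results: StageResult[] = [];
  for (const stage of stages) {
    const exitCode = stage.run();
    results.push({ stage: stage.name, exitCode });
    if (exitCode !== 0) {
      break;
    }
  }
  return results;
}
```

The returned array doubles as a record of how far the run got: its last entry is either the final stage or the one that failed.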

Contracts

`run-manifest.json`

Written by the runner at <base>/results/run-manifest.json. Links all downstream stages to the same run.

interface BenchmarkRunManifest {
  runId: string;           // UUID for this pipeline run
  timestamp: string;       // ISO 8601
  baseDir: string;         // Absolute path to base directory
  projects: Array<{
    slug: string;          // e.g. "01-recipe-platform"
    cwd: string;           // Absolute path to project dir
    promptHash: string;    // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}

The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
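
A minimal sketch of that correlation step is shown below, assuming each session record exposes the working directory it ran in. That assumption, the helper name `findProjectByCwd`, and the sample paths in the usage note are illustrative only.

```typescript
// Same shape as the interface above, repeated so the sketch is self-contained.
interface ManifestProject {
  slug: string;
  cwd: string;
  promptHash: string;
  expectedSkills: string[];
}

interface RunManifest {
  runId: string;
  timestamp: string;
  baseDir: string;
  projects: ManifestProject[];
}

// Hypothetical helper: returns the project whose directory contains the
// session's working directory, or undefined when nothing matches.
function findProjectByCwd(
  manifest: RunManifest,
  sessionCwd: string,
): ManifestProject | undefined {
  return manifest.projects.find(
    (project) =>
      sessionCwd === project.cwd || sessionCwd.startsWith(project.cwd + "/"),
  );
}
```

Matching on the trailing `/` avoids a false positive where one project directory name is a prefix of another, such as `/base/01-app` and `/base/01-app-v2`.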

`events.jsonl`

The orchestrator writes NDJSON events to <base>/results/events.jsonl tracking pipeline lifecycle:

// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner",   "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner",   "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify",   "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
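
To check whether a run stopped early, the event log can be parsed line by line. The sketch below assumes the file on disk holds one plain JSON object per line (the `//` comments and `[...]` in the listing above are documentation shorthand, not valid JSON). `PipelineEvent`, `parseEvents`, and `findAbort` are hypothetical names.

```typescript
// Event shape inferred from the examples above.
interface PipelineEvent {
  stage: string;
  event: string;
  timestamp: string;
  data: Record<string, unknown>;
}

// Parses NDJSON text: one JSON object per non-empty line.
function parseEvents(ndjson: string): PipelineEvent[] {
  return ndjson
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as PipelineEvent);
}

// Returns the pipeline-level abort event, or undefined if the run completed.
function findAbort(events: PipelineEvent[]): PipelineEvent | undefined {
  return events.find(
    (e) => e.stage === "pipeline" && e.event === "abort",
  );
}
```

When an abort event is found, its `data.failedStage` and `data.slug` fields point at the stage and project to inspect first.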

`report.json`

Machine-readable report at <base>/results/report.json for programmatic consumption:

interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string;   // Skill that was expected but not injected
    glob: string;    // Suggested pathPattern glob
    tool: string;    // Tool name that should trigger injection
  }>;
}
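
One way to consume the `gaps` array programmatically is to count how many projects are missing each skill, which highlights the skills whose injection patterns need the most attention. This is a sketch under the assumption that `missing` lists skill names exactly as they appear in `expected`; `ReportGap` and `countMissingSkills` are hypothetical names.

```typescript
// Mirrors one entry of the `gaps` array in the interface above.
interface ReportGap {
  slug: string;
  expected: string[];
  actual: string[];
  missing: string[];
}

// Counts, per skill, how many projects expected it but never received it.
function countMissingSkills(gaps: ReportGap[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const gap of gaps) {
    for (const skill of gap.missing) {
      counts.set(skill, (counts.get(skill) ?? 0) + 1);
    }
  }
  return counts;
}
```

A skill that is missing across several projects more likely indicates a pattern problem in the plugin than a quirk of one prompt.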

Overnight Automation Loop

Run the pipeline repeatedly with a cooldown between iterations:

while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done

Each run produces timestamped report.json and report.md files. Compare across runs to track improvement.

Self-Improvement Cycle

The pipeline enables a closed feedback loop:

1. **Run** - `bun run scripts/benchmark-e2e.ts` exercises the plugin against realistic projects
2. **Read gaps** - `report.json` lists which skills were expected but never injected, with exact slugs
3. **Apply fixes** - Use `suggestedPatterns` entries (copy-pasteable YAML) to add missing frontmatter patterns; use `recommendations` to fix hook logic
4. **Re-run** - Execute the pipeline again to verify the gaps are closed
5. **Compare** - Diff `report.json` across runs: `verdict` should trend from "fail" → "partial" → "pass"
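
The compare step could be sketched as a small diff over two reports. The code below assumes both reports have already been loaded and reduced to their `verdict` and `gaps` fields; `RunSummary`, `VERDICT_RANK`, and `compareRuns` are hypothetical names, and the ordering fail < partial < pass follows the trend described above.

```typescript
type Verdict = "pass" | "partial" | "fail";

// Subset of report.json needed for the comparison.
interface RunSummary {
  verdict: Verdict;
  gaps: Array<{ slug: string; missing: string[] }>;
}

// Ordering used to decide whether the verdict moved in the right direction.
const VERDICT_RANK: Record<Verdict, number> = { fail: 0, partial: 1, pass: 2 };

// Flattens gaps into "slug:skill" keys so two runs can be compared directly.
function missingKeys(run: RunSummary): string[] {
  const keys: string[] = [];
  for (const gap of run.gaps) {
    for (const skill of gap.missing) {
      keys.push(`${gap.slug}:${skill}`);
    }
  }
  return keys;
}

function compareRuns(previous: RunSummary, current: RunSummary) {
  const before = missingKeys(previous);
  const after = missingKeys(current);
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  return {
    verdictImproved:
      VERDICT_RANK[current.verdict] > VERDICT_RANK[previous.verdict],
    closed: before.filter((key) => !afterSet.has(key)), // fixed since last run
    opened: after.filter((key) => !beforeSet.has(key)), // new regressions
  };
}
```

A non-empty `opened` list is worth checking even when the verdict improves, since a fix for one skill can shadow the pattern of another.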

For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.

Prompt Table

Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.

#	Slug	Expected Skills
01	recipe-platform	auth, vercel-storage, nextjs
02	trivia-game	vercel-storage, nextjs
03	code-review-bot	ai-sdk, nextjs
04	conference-tickets	payments, email, auth
05	content-aggregator	cron-jobs, ai-sdk
06	finance-tracker	cron-jobs, email
07	multi-tenant-blog	routing-middleware, cms, auth
08	status-page	cron-jobs, vercel-storage, observability
09	dog-walking-saas	payments, auth, vercel-storage, env-vars
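
For scripting against this table, it can be encoded as data. The sketch below copies the rows verbatim and assumes the full slug is the number joined to the name with a hyphen, as in the `01-recipe-platform` and `04-conference-tickets` examples elsewhere in this document. `PROMPT_TABLE`, `selectProjects`, and `projectsExpecting` are hypothetical names.

```typescript
interface PromptRow {
  id: string;
  name: string;
  expectedSkills: string[];
}

// Rows copied from the prompt table above.
const PROMPT_TABLE: PromptRow[] = [
  { id: "01", name: "recipe-platform", expectedSkills: ["auth", "vercel-storage", "nextjs"] },
  { id: "02", name: "trivia-game", expectedSkills: ["vercel-storage", "nextjs"] },
  { id: "03", name: "code-review-bot", expectedSkills: ["ai-sdk", "nextjs"] },
  { id: "04", name: "conference-tickets", expectedSkills: ["payments", "email", "auth"] },
  { id: "05", name: "content-aggregator", expectedSkills: ["cron-jobs", "ai-sdk"] },
  { id: "06", name: "finance-tracker", expectedSkills: ["cron-jobs", "email"] },
  { id: "07", name: "multi-tenant-blog", expectedSkills: ["routing-middleware", "cms", "auth"] },
  { id: "08", name: "status-page", expectedSkills: ["cron-jobs", "vercel-storage", "observability"] },
  { id: "09", name: "dog-walking-saas", expectedSkills: ["payments", "auth", "vercel-storage", "env-vars"] },
];

// Quick mode is documented as "first 3 projects"; the full suite runs all 9.
function selectProjects(quick: boolean): PromptRow[] {
  return quick ? PROMPT_TABLE.slice(0, 3) : PROMPT_TABLE;
}

// Full slugs (e.g. "05-content-aggregator") of projects expecting a skill.
function projectsExpecting(skill: string): string[] {
  return PROMPT_TABLE
    .filter((row) => row.expectedSkills.indexOf(skill) !== -1)
    .map((row) => `${row.id}-${row.name}`);
}
```

Note that quick mode, as encoded here, exercises only `auth`, `vercel-storage`, `nextjs`, and `ai-sdk`, so a gap in any other skill will only surface in the full suite.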

Cleanup

rm -rf ~/dev/vercel-plugin-testing

How it compares

Use benchmark-e2e for vercel-plugin regression across realistic agent projects; use app-level Playwright suites when testing your own product UI instead of plugin skill injection.

FAQ

Is Benchmark E2e safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
