
Benchmark Agents
Stress-test Vercel plugin skill injection and agent builds by running real interactive Claude Code eval sessions with hook monitoring and coverage reports.
Install
npx skills add https://github.com/vercel-labs/vercel-plugin --skill benchmark-agents
What is this skill?
- Full eval loop: setup → launch → monitor → verify → fix → release → repeat
- Interactive sessions only—WezTerm panes and Bash; explicitly rejects claude --print and headless spawn evals
- Stress-tests Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration
- Monitors plugin PreToolUse, PostToolUse, and UserPromptSubmit hooks and produces skill-injection coverage reports
- Designed for advanced benchmark scenarios that push cutting-edge Vercel platform features
Adoption & trust: 212 installs on skills.sh; 187 GitHub stars; 1/3 security scanners passed (skills.sh audits).
Journey fit
The canonical shelf is Ship because the skill's core loop is verification, PostToolUse validation, and correctness inspection, the same mindset as QA before release. The Testing subphase fits because the skill applies Playwright-style rigor to multi-agent orchestration, Workflow DevKit, MCP, and gateway scenarios rather than one-off coding.
Common Questions / FAQ
Is Benchmark Agents safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Benchmark Agents — Advanced AI Systems

Launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.

## How Evals Work (The Only Correct Method)

Evals are run by **you, in this conversation**, not by scripts. The process is:

1. You create directories and install the plugin via Bash tool calls
2. You spawn WezTerm panes with `wezterm cli spawn` — each pane runs an independent Claude Code interactive session
3. You wait, then check debug logs and claim dirs to see what the plugin injected
4. You inspect the generated source code for correctness
5. You read conversation logs to find what the user had to correct
6. You update skills/hooks, run `/release`, and spawn more evals

**Never use `claude --print`, eval scripts, or `Bun.spawn(["claude", ...])`**. These do not work because:

- Plugin hooks (PreToolUse, PostToolUse, UserPromptSubmit) only fire during interactive tool-calling sessions
- `--print` mode generates text without executing tools — no files are created, no deps installed, no dev servers started
- No `session_id` means dedup, profiler, and claim files don't work

**The WezTerm interactive approach is the only method that exercises the plugin correctly.** Every eval in our history (60+ sessions) used this approach.

## DO NOT (Hard Rules)

These are **absolute prohibitions**. Violating any of them wastes the entire eval run:

- **DO NOT** use `claude --print` or `-p` flag — hooks don't fire, no files created
- **DO NOT** use `--dangerously-skip-permissions` — changes agent behavior
- **DO NOT** create projects in `/tmp/` — always use `~/dev/vercel-plugin-testing/`
- **DO NOT** manually create `settings.local.json` or wire hooks by hand — use `npx add-plugin`
- **DO NOT** set `CLAUDE_PLUGIN_ROOT` manually — the plugin manages this
- **DO NOT** use `bash -c` or `bash -lc` in WezTerm — always use `/bin/zsh -ic`
- **DO NOT** use the full path to claude — use the `x` alias (it's configured in zsh)
- **DO NOT** create custom `debug.log` files with stderr redirects — debug logs go to `~/.claude/debug/`
- **DO NOT** write eval runner scripts in TypeScript/JavaScript — do everything as Bash tool calls in the conversation
- **DO NOT** try to `git init` or create `package.json` manually — `npx add-plugin` + the WezTerm session handle all scaffolding
- **DO NOT** use uppercase letters in directory names — npm rejects them (e.g. `T` in timestamps breaks `create-next-app`)

**Copy the exact commands below. Do not improvise.**

## Setup & Launch (Exact Commands)

### Naming convention

**Always append a timestamp** to directory names so reruns don't overwrite old projects:

```
<slug>-<yyyymmdd>-<hhmm>
```

Example: `tarot-card-deck-20260309-1227`, `interior-designer-20260309-1227`

Generate the timestamp with: `date +%Y%m%d-%H%M`

### 1. Create test directory and install plugin

```bash
TS=$(date +%Y%m%d-%H%M)
SLUG="my-app-$TS"
mkdir -p ~/dev/vercel-plugin-testing/$SLUG
cd ~/dev/vercel-plugin-testing/$SLUG
npx add-plugin https://github.com/vercel/vercel-plugin -s project -y
```

### 2. Launch session via WezTerm

```bash
wezterm cli spawn --cwd /Users/johnlindquist/dev/vercel-plugin-testing/$SLUG -- /bin/zsh -ic \
  "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '<PROMPT>' --settings .claude/settings.json; exec zsh"
```

Key flags:

- `unset CLAUDECODE` — prevents nested session detection error
- `VERCEL_PLUGIN_LOG_LEVEL=debug` — enables hook debug output in `~/.claude/debug/`
- `x` — alias for `claude` CLI
- `--settings .claude/setti