Benchmark Agents

Name: Benchmark Agents
Author: vercel-labs

vercel-labs/vercel-plugin

1.1k installs
229 repo stars
Updated July 27, 2026

benchmark-agents provides documented workflows for advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features: Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration.

About

The benchmark-agents skill provides advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features: Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. It is designed to stress-test skill injection for complex, multi-system builds. You launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. The skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat. Evals are run by you, in this conversation, not by scripts:

You create directories and install the plugin via Bash tool calls
You spawn WezTerm panes with `wezterm cli spawn` - each pane runs an independent Claude Code interactive session
You wait, then check debug logs and claim dirs to see what the plugin injected
You inspect the generated source code for correctness
You read conversation logs to find what the user had to correct

Benchmark Agents by the numbers

1,123 all-time installs (skills.sh)
Ranked #935 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

benchmark-agents capabilities & compatibility

Capabilities: create directories and install the plugin via Bash tool calls · spawn WezTerm panes with `wezterm cli spawn` · check debug logs and claim dirs to see what the plugin injected · inspect the generated source code for correctness · read conversation logs to find what the user had to correct
Use cases: documentation

From the docs

What benchmark-agents says it does

This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.

SKILL.md

npx skills add https://github.com/vercel-labs/vercel-plugin --skill benchmark-agents

Installs	1.1k
repo stars	★ 229
Security audit	1 / 3 scanners passed
Last updated	July 27, 2026
Repository	vercel-labs/vercel-plugin ↗

How do I use benchmark-agents for the task described in its SKILL.md triggers?

Advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features: Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. Designed to stress-test skill injection for complex, multi-system builds.

Who is it for?

Teams invoking benchmark-agents when the user request matches documented triggers and prerequisites.

Skip if: cached docs are missing, the request is a negative trigger, or another sibling skill owns the workflow.

When should I use this skill?

Advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features: Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. Designed to stress-test skill injection for complex, multi-system builds.

What you get

Step-by-step guidance grounded in benchmark-agents documentation and reference files.

Skill-injection coverage report
PostToolUse hook validation logs
Interactive eval session transcript

By the numbers

Exercises 8 Vercel platform features: Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration
Follows a 7-step eval loop: setup, launch, monitor, verify, fix, release, repeat

Files

SKILL.md · Markdown · GitHub ↗

Benchmark Agents — Advanced AI Systems

Launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.

How Evals Work (The Only Correct Method)

Evals are run by you, in this conversation, not by scripts. The process is:

1. You create directories and install the plugin via Bash tool calls
2. You spawn WezTerm panes with `wezterm cli spawn` - each pane runs an independent Claude Code interactive session
3. You wait, then check debug logs and claim dirs to see what the plugin injected
4. You inspect the generated source code for correctness
5. You read conversation logs to find what the user had to correct
6. You update skills/hooks, run /release, and spawn more evals

Never use `claude --print`, eval scripts, or `Bun.spawn(["claude", ...])`. These do not work because:

Plugin hooks (PreToolUse, PostToolUse, UserPromptSubmit) only fire during interactive tool-calling sessions
--print mode generates text without executing tools — no files are created, no deps installed, no dev servers started
No session_id means dedup, profiler, and claim files don't work

The WezTerm interactive approach is the only method that exercises the plugin correctly. Every eval in our history (60+ sessions) used this approach.

DO NOT (Hard Rules)

These are absolute prohibitions. Violating any of them wastes the entire eval run:

DO NOT use claude --print or -p flag — hooks don't fire, no files created
DO NOT use --dangerously-skip-permissions — changes agent behavior
DO NOT create projects in /tmp/ — always use ~/dev/vercel-plugin-testing/
DO NOT manually create settings.local.json or wire hooks by hand — use npx add-plugin
DO NOT set CLAUDE_PLUGIN_ROOT manually — the plugin manages this
DO NOT use bash -c or bash -lc in WezTerm — always use /bin/zsh -ic
DO NOT use the full path to claude — use the x alias (it's configured in zsh)
DO NOT create custom debug.log files with stderr redirects — debug logs go to ~/.claude/debug/
DO NOT write eval runner scripts in TypeScript/JavaScript — do everything as Bash tool calls in the conversation
DO NOT try to git init or create package.json manually — npx add-plugin + the WezTerm session handle all scaffolding
DO NOT use uppercase letters in directory names — npm rejects them (e.g. T in timestamps breaks create-next-app)

Copy the exact commands below. Do not improvise.

Setup & Launch (Exact Commands)

Naming convention

Always append a timestamp to directory names so reruns don't overwrite old projects:

<slug>-<yyyymmdd>-<hhmm>

Example: tarot-card-deck-20260309-1227, interior-designer-20260309-1227

Generate the timestamp with: date +%Y%m%d-%H%M
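
The naming convention can be wrapped in a small helper. This is a minimal sketch, not part of the plugin: `make_slug` and `is_valid_slug` are hypothetical names, and the lowercase check only mirrors the rule above that npm rejects uppercase letters in directory names.

```shell
# Hypothetical helpers (not part of the plugin) for the <slug>-<yyyymmdd>-<hhmm> convention.

# Print "<name>-<yyyymmdd>-<hhmm>" for the current time.
make_slug() {
  printf '%s-%s\n' "$1" "$(date +%Y%m%d-%H%M)"
}

# Succeed only if the slug uses lowercase letters, digits, and hyphens,
# and ends in an 8-digit date followed by a 4-digit time.
is_valid_slug() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*-[0-9]{8}-[0-9]{4}$'
}
```

A slug such as tarot-card-deck-20260309-1227 passes the check, while one containing an uppercase T between date and time does not.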

1. Create test directory and install plugin

TS=$(date +%Y%m%d-%H%M)
SLUG="my-app-$TS"
mkdir -p ~/dev/vercel-plugin-testing/$SLUG
cd ~/dev/vercel-plugin-testing/$SLUG
npx add-plugin https://github.com/vercel/vercel-plugin -s project -y

2. Launch session via WezTerm

wezterm cli spawn --cwd "$HOME/dev/vercel-plugin-testing/$SLUG" -- /bin/zsh -ic \
  "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '<PROMPT>' --settings .claude/settings.json; exec zsh"

Key flags:

unset CLAUDECODE — prevents nested session detection error
VERCEL_PLUGIN_LOG_LEVEL=debug — enables hook debug output in ~/.claude/debug/
x — alias for claude CLI
--settings .claude/settings.json — loads project-level plugin settings

3. Find the debug log (wait ~25s for SessionStart hooks)

find ~/.claude/debug -name "*.txt" -mmin -2 -exec grep -l "$SLUG" {} +
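
Rather than sleeping for a fixed 25 seconds, the search above can be polled until a log shows up. The sketch below assumes only what this section states: logs live in ~/.claude/debug and are named <session-id>.txt. The function names, the 5-second poll interval, and the default 60-second timeout are illustrative choices, not plugin behavior.

```shell
# Hypothetical helper: poll the debug directory until a log mentioning the slug appears.
# Usage: wait_for_debug_log <slug> [timeout_seconds] [debug_dir]
wait_for_debug_log() {
  local slug=$1 timeout=${2:-60} debug_dir=${3:-"$HOME/.claude/debug"}
  local waited=0 log
  while :; do
    # Same search as above: recent .txt logs whose contents mention the slug.
    log=$(find "$debug_dir" -name '*.txt' -mmin -2 -exec grep -l -- "$slug" {} + 2>/dev/null | head -n 1)
    if [ -n "$log" ]; then
      printf '%s\n' "$log"
      return 0
    fi
    # Give up once the timeout has elapsed.
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 5
    waited=$((waited + 5))
  done
}

# The log is named <session-id>.txt, so the session ID is the basename without the extension.
session_id_from_log() {
  basename "$1" .txt
}
```

Typical use would be LOG=$(wait_for_debug_log "$SLUG") followed by session_id_from_log "$LOG" to obtain the ID needed by the monitoring commands.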

4. Launch multiple sessions in parallel

Create dirs and install plugin in a loop, then spawn each WezTerm pane:

TS=$(date +%Y%m%d-%H%M)
cd ~/dev/vercel-plugin-testing
for name in tarot-deck interior-designer superhero-origin; do
  d="${name}-${TS}"
  mkdir -p "$d" && (cd "$d" && npx add-plugin https://github.com/vercel/vercel-plugin -s project -y)
done

# Then spawn each (these run in separate terminal panes)
wezterm cli spawn --cwd .../tarot-deck-$TS -- /bin/zsh -ic "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '...' --settings .claude/settings.json; exec zsh"
wezterm cli spawn --cwd .../interior-designer-$TS -- /bin/zsh -ic "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '...' --settings .claude/settings.json; exec zsh"
wezterm cli spawn --cwd .../superhero-origin-$TS -- /bin/zsh -ic "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '...' --settings .claude/settings.json; exec zsh"

Monitoring

Skill injection claims (the key metric)

TMPDIR=$(node -e "import {tmpdir} from 'os'; console.log(tmpdir())" --input-type=module)
CLAIMDIR="$TMPDIR/vercel-plugin-<session-id>-seen-skills.d"

# List all injected skills
ls "$CLAIMDIR"

# Count
ls "$CLAIMDIR" | wc -l

# Check specific skill
ls "$CLAIMDIR/workflow" && echo "YES" || echo "NO"
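
To compare a session against the Expected Skills column of the scenario table, the per-skill check above can be looped. This is a sketch under one assumption taken from the commands above: each injected skill appears as an entry named after the skill inside the claim directory. `missing_skills` is a hypothetical name.

```shell
# Hypothetical helper: print every expected skill that has no claim entry.
# Usage: missing_skills <claim_dir> <skill>...
# Returns 0 when all expected skills were claimed, 1 otherwise.
missing_skills() {
  local claim_dir=$1 missing=0 skill
  shift
  for skill in "$@"; do
    if [ ! -e "$claim_dir/$skill" ]; then
      printf '%s\n' "$skill"
      missing=1
    fi
  done
  return "$missing"
}
```

For scenario 02, for example, the call would be missing_skills "$CLAIMDIR" ai-sdk workflow nextjs ai-elements.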

Hook firing

LOG=~/.claude/debug/<session-id>.txt

# SessionStart hooks
grep -c 'SessionStart.*success' "$LOG"

# PreToolUse calls and injections
grep -c 'executePreToolHooks' "$LOG"        # total calls
grep -c 'provided additionalContext' "$LOG"  # actual injections

# PostToolUse validation catches
grep 'VALIDATION' "$LOG" | head -10

# UserPromptSubmit
grep -c 'UserPromptSubmit.*success' "$LOG"
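
The individual counts above can be collected into one line per session, which is convenient when filling in the hook coverage matrix. The grep patterns are the ones already shown; the function names and the key=value output format are assumptions made for this sketch.

```shell
# Hypothetical helper: count lines in a log that match a pattern.
count_matches() {
  if [ ! -r "$2" ]; then
    echo 0
    return 0
  fi
  # grep -c prints 0 but exits non-zero when nothing matches, so ignore the status.
  grep -c -- "$1" "$2" || true
}

# Print a one-line summary of hook activity for a session log.
# Usage: hook_summary <log_file>
hook_summary() {
  local log=$1
  printf 'session_start=%s pre_tool_calls=%s injections=%s validations=%s prompt_submit=%s\n' \
    "$(count_matches 'SessionStart.*success' "$log")" \
    "$(count_matches 'executePreToolHooks' "$log")" \
    "$(count_matches 'provided additionalContext' "$log")" \
    "$(count_matches 'VALIDATION' "$log")" \
    "$(count_matches 'UserPromptSubmit.*success' "$log")"
}
```

A session where pre_tool_calls is high but injections is 0 suggests the hooks fired without matching any skill.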

Quick status check for multiple sessions

TMPDIR=$(node -e "import {tmpdir} from 'os'; console.log(tmpdir())" --input-type=module 2>/dev/null)

for label_id in "slug1:SESSION_ID_1" "slug2:SESSION_ID_2" "slug3:SESSION_ID_3"; do
  label="${label_id%%:*}"
  id="${label_id##*:}"
  claimdir="$TMPDIR/vercel-plugin-$id-seen-skills.d"
  echo "=== $label ==="
  count=$(ls "$claimdir" 2>/dev/null | wc -l | tr -d ' ')
  claims=$(ls "$claimdir" 2>/dev/null | sort | tr '\n' ', ')
  echo "Skills ($count): $claims"
done

Verification — What to Check in Generated Code

After sessions build, verify these patterns in the generated projects:

Project structure

base="$HOME/dev/vercel-plugin-testing/$SLUG"   # project directory under test
echo -n "src/: "; test -d "$base/src" && echo YES || echo NO          # Should be NO for WDK projects
echo -n "workflows/: "; test -d "$base/workflows" && echo YES || echo NO
echo -n "withWorkflow: "; grep -q "withWorkflow" "$base"/next.config.* && echo YES || echo NO
echo -n "components.json: "; test -f "$base/components.json" && echo YES || echo NO
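
When several projects need checking, the same four tests can be turned into a pass/fail function. This sketch treats the expectations stated above (no src/, a workflows/ directory, withWorkflow in next.config.*, a components.json file) as requirements for a WDK project; `check_structure` and its PASS/FAIL output are hypothetical.

```shell
# Hypothetical helper: check the project layout described above.
# Usage: check_structure <project_dir>; returns 0 only if every check passes.
check_structure() {
  local base=$1 failures=0

  # WDK projects should not have a src/ directory.
  if [ -d "$base/src" ]; then
    echo "FAIL src/ present"; failures=$((failures + 1))
  else
    echo "PASS no src/"
  fi

  if [ -d "$base/workflows" ]; then
    echo "PASS workflows/"
  else
    echo "FAIL workflows/ missing"; failures=$((failures + 1))
  fi

  # grep fails both when the pattern is absent and when no next.config.* file exists.
  if grep -q "withWorkflow" "$base"/next.config.* 2>/dev/null; then
    echo "PASS withWorkflow"
  else
    echo "FAIL withWorkflow missing"; failures=$((failures + 1))
  fi

  if [ -f "$base/components.json" ]; then
    echo "PASS components.json"
  else
    echo "FAIL components.json missing"; failures=$((failures + 1))
  fi

  [ "$failures" -eq 0 ]
}
```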

Image generation model

# Should use gemini-3.1-flash-image-preview, NOT dall-e-3 or older gemini models
grep -rn "gemini.*image\|dall-e\|experimental_generateImage\|result\.files" "$base/workflows/" "$base/app/" 2>/dev/null | grep "\.ts"

Gateway vs direct provider

# Should use gateway() or plain "provider/model" strings, NOT openai("gpt-4o") directly
grep -rn "from.*@ai-sdk/openai\|openai(" "$base" 2>/dev/null | grep "\.ts" | grep -v node_modules
grep -rn "gateway(\|model:.*\"openai/" "$base" 2>/dev/null | grep "\.ts" | grep -v node_modules

AI Elements installed

find "$base" -path "*/ai-elements/*.tsx" 2>/dev/null | grep -v node_modules | wc -l

Workflow API usage

wf=$(find "$base" -name "*.ts" -path "*/workflow*" 2>/dev/null | grep -v node_modules | head -1)
head -5 "$wf"   # Should show: import { getWritable } from "workflow"

Prompt Design Rules

Describe products, not technologies. Let the plugin infer which skills to inject. This tests whether the plugin's pattern matching and prompt signals work from natural language.

DO:

"runs a multi-step creation pipeline that streams each phase"
"generates a portrait image"
"users can chat with an AI advisor"
"store all designs in a gallery"

DON'T:

"use Vercel Workflow DevKit with getWritable"
"use gateway('google/gemini-3.1-flash-image-preview')"
"install npx ai-elements"
"add withWorkflow to next.config.ts"

Always end prompts with:

"Link the project to my vercel-labs team so we can deploy it later. Skip any planning and just build it. Get the dev server running."

Phrases that trigger key skills (via promptSignals):

workflow: "multi-step pipeline", "streams progress", "streams each phase", "durable pipeline", "creation pipeline"
ai-sdk: Triggered by imports/install patterns (very broad)
shadcn: Triggered by create-next-app bash pattern
ai-elements: Triggered when ai-sdk is active + chat UI patterns
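
The prompt rules above lend themselves to a quick lint before a session is spawned. The sketch below is illustrative only: `build_prompt` and `lint_prompt` are hypothetical names, and the list of forbidden terms is drawn from the DON'T examples rather than from any plugin configuration.

```shell
# Hypothetical helpers for composing and linting eval prompts.

# Required closing sentences, copied from the rule above.
PROMPT_ENDING='Link the project to my vercel-labs team so we can deploy it later. Skip any planning and just build it. Get the dev server running.'

# Append the required ending to a product description.
build_prompt() {
  printf '%s %s\n' "$1" "$PROMPT_ENDING"
}

# Fail if the prompt names technologies instead of describing the product.
# The term list is illustrative; extend it as new leaks are found.
lint_prompt() {
  if printf '%s\n' "$1" | grep -Eiq 'workflow devkit|getWritable|withWorkflow|gateway\(|ai-elements|next\.config'; then
    return 1
  fi
  return 0
}
```

A description such as "runs a multi-step creation pipeline that streams each phase" passes the lint, while "use Vercel Workflow DevKit with getWritable" is rejected.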

Common Issues Found in Evals (and Fixes Applied)

Issue	Cause	Plugin Fix (version)
Workflow not triggered from natural language	promptSignals too narrow	Broadened phrases, lowered minScore 6→4 (v0.9.5)
Agent uses `openai("gpt-4o")` instead of gateway	Agent's training data defaults to openai	PostToolUse validate warns "your knowledge is outdated" (v0.9.9)
Agent uses `dall-e-3` for images	Agent doesn't know about gemini image gen	PostToolUse validate warns, capabilities table in ai-sdk (v0.9.7)
Agent uses `experimental_generateImage`	Old API	PostToolUse validate warns, recommend `generateText` + `result.files` (v0.9.9)
Raw markdown rendering (`**bold**` visible)	Agent skips AI Elements	`MessageResponse` documented as universal renderer (v0.9.2)
`@/../../workflows/` broken import	Workflows outside `@` alias root	Canonical structure docs: no `src/` for WDK (v0.8.3)
`withWorkflow` missing from next.config	Agent skipped setup step	Marked as "Required" in workflow skill (v0.8.1)
`defineHook` but no resume route	Agent didn't wire the 3-piece pattern	Documented as 3 required pieces (v0.9.3)
`generateObject()` used (removed in v6)	Agent's training data	PostToolUse validate catches as error (v0.9.3)
`getWritable()` in workflow scope	Sandbox violation	Strengthened warning in skill (v0.8.1)
Missing `vercel link` + `vercel env pull`	No OIDC credentials	Added as "Required" setup step (v0.9.1)
`getStepMetadata().retryCount` undefined on first attempt	WDK quirk	Documented: guard with `?? 0` (v0.9.1)
shadcn not installed	No trigger for scaffolding	Added `create-next-app` bashPattern to shadcn (v0.8.0)
Skill cap too low (3)	Only 3 skills injected per tool call	Raised to 5 with 18KB budget (v0.8.0)

Agent-Browser Verification

After dev server starts, verify with agent-browser. Note: agents currently DO NOT self-verify despite the skill being injected. You must launch verification manually:

agent-browser open http://localhost:<port>
agent-browser wait --load networkidle
agent-browser screenshot
agent-browser snapshot -i

Coverage Report

Write results to .notes/COVERAGE.md with:

1. Session index: slug, session ID, unique skills, dedup status
2. Hook coverage matrix: which hooks fired in which sessions
3. Skill injection table: which of the 43 skills triggered
4. Code quality checks: gateway vs direct, image model, withWorkflow, AI Elements
5. PostToolUse validation catches: outdated models, deprecated APIs
6. Issues found: bugs, pattern gaps, new findings to feed back into skills
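
A skeleton with the six sections can be generated up front so each eval run fills in the same layout. This is a sketch only: `init_coverage_report` is a hypothetical helper, and the heading wording is an assumption based on the section list above.

```shell
# Hypothetical helper: create .notes/COVERAGE.md with one heading per required section.
# Usage: init_coverage_report [project_root]
# Refuses to overwrite an existing report.
init_coverage_report() {
  local root=${1:-.}
  local file="$root/.notes/COVERAGE.md"
  if [ -e "$file" ]; then
    echo "refusing to overwrite $file" >&2
    return 1
  fi
  mkdir -p "$root/.notes"
  cat > "$file" <<'EOF'
# Coverage Report

## 1. Session index

## 2. Hook coverage matrix

## 3. Skill injection table

## 4. Code quality checks

## 5. PostToolUse validation catches

## 6. Issues found
EOF
}
```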

Release → Eval Loop

The standard improvement cycle:

1. Run evals: launch 3 sessions with natural language prompts
2. Check results: skill claims, project structure, code quality
3. Identify gaps: what skills didn't trigger, what patterns are wrong
4. Read conversation logs: find user follow-up corrections
5. Fix skills: update SKILL.md content, patterns, validate rules
6. Run gates: bun run typecheck && bun test && bun run validate
7. Release: bump version, bun run build, commit, push
8. Repeat: launch 3 more evals to verify fixes

Scenario Table

#	Slug	Prompt Summary	Expected Skills
01	doc-qa-agent	PDF Q&A with embeddings, citations, multi-step reasoning	ai-sdk, nextjs, vercel-storage, ai-elements
02	customer-support-agent	Durable support agent, escalation, confidence tracking	ai-sdk, workflow, nextjs, ai-elements
03	deploy-monitor	Uptime monitoring, AI incident responder, durable investigation	workflow, cron-jobs, observability, ai-sdk
04	multi-model-router	Side-by-side model comparison, parallel streaming, cost tracking	ai-gateway, ai-sdk, nextjs, ai-elements
05	slack-pr-reviewer	Multi-platform chat bot, PR review, threaded conversations	chat-sdk, ai-sdk, nextjs
06	content-pipeline	Durable multi-step content production with image generation	workflow, ai-sdk, satori, nextjs
07	feature-rollout	Feature flags, A/B testing, AI experiment analysis	vercel-flags, ai-sdk, nextjs
08	event-driven-crm	Event-driven CRM, churn prediction, re-engagement emails	vercel-queues, workflow, ai-sdk, email
09	code-sandbox-tutor	AI coding tutor with sandbox execution, auto-fix	vercel-sandbox, ai-sdk, nextjs, ai-elements
10	multi-agent-research	Parallel sub-agents, durable orchestration, streaming synthesis	workflow, ai-sdk, ai-elements, nextjs
11	discord-game-master	RPG bot, persistent game state, scene illustration generation	chat-sdk, ai-sdk, vercel-storage, nextjs
12	compliance-auditor	Scheduled AI audits, durable approval workflow, deploy blocking	workflow, cron-jobs, ai-sdk, vercel-firewall

Complexity Tiers

Tier 1 — Core AI (30-45 min, `--quick`)

Scenarios 01, 04, 09 — AI SDK, Gateway, Sandbox, AI Elements without durable workflows.

Tier 2 — Durable Agents (45-60 min)

Scenarios 02, 03, 06, 10 — Workflow DevKit, multi-step durability, agent orchestration.

Tier 3 — Platform Integration (45-60 min)

Scenarios 05, 07, 08, 11, 12 — Chat SDK, Queues, Flags, Firewall, cross-platform messaging.

Full Suite

All 12 scenarios, ~3-4 hours.
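
For scripting a run by tier, the tier-to-scenario mapping above can be expressed as a lookup. `scenarios_for_tier` is a hypothetical helper; the scenario numbers are copied from the tier descriptions.

```shell
# Hypothetical helper: list the scenario numbers in a complexity tier.
# Usage: scenarios_for_tier <1|2|3|full>
scenarios_for_tier() {
  case "$1" in
    1) echo "01 04 09" ;;
    2) echo "02 03 06 10" ;;
    3) echo "05 07 08 11 12" ;;
    full) echo "01 02 03 04 05 06 07 08 09 10 11 12" ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```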

Cleanup

rm -rf ~/dev/vercel-plugin-testing

Benchmark Agent UI Design Prompts

doc-qa-agent

Web app UI mockup for "doc-qa-agent" — document Q&A system. Dark theme. Three-column layout: left sidebar lists uploaded PDFs as file cards with page count badges and hover states. Center column is a PDF viewer with yellow-highlighted text chunks showing which passages were retrieved. Right column is a chat panel — user messages right-aligned in zinc-700 bubbles, AI answers left-aligned in zinc-800 cards. Tool calls shown as collapsible Accordion rows labeled "🔍 Searched 847 chunks · 12ms". Source citations rendered as small yellow Badge chips [p.42] [p.17] inline in the answer text. Top navbar: document title breadcrumb, Upload PDF Button, embedding progress bar. Streaming answer has an animated blinking cursor. Color palette: zinc-950 background, zinc-800 cards, yellow-400 highlights, white text. Shadcn-style components throughout. High fidelity web UI screenshot.

customer-support-agent

Web app UI mockup for "customer-support-agent" — AI customer support dashboard. Warm light theme. Three-panel layout: left sidebar lists active conversations with user avatar, name, green/yellow/red status dot, and last message preview truncated. Center panel is a chat window — rounded message bubbles, agent messages in indigo-50 with left avatar, user messages right-aligned in white. Above chat: a segmented confidence bar component (High / Medium / Low) with color fill and label. Orange outlined "Escalate to Human 🙋" Button appears when confidence drops to Low. Right panel: three large stat cards — Resolution Rate 87% with upward arrow, Avg Response 1.2 min with sparkline, Open Tickets 14 with downward trend. Below stats: Tabs component — Active | Resolved | Escalated. Color palette: white background, indigo-600 accent, warm gray-100 sidebar. Polished shadcn SaaS product UI screenshot.

deploy-monitor

Web app UI mockup for "deploy-monitor" — AI deployment monitoring and incident responder. Dark high-density ops dashboard. Top row: four stat cards — Uptime 99.97% in green with pulse dot, P95 Latency 142ms in yellow, Active Incidents 1 in red with animated ring, Deployments Today 7 in white. Below: full-width recharts-style line chart showing response time with a red shaded anomaly region and vertical "INCIDENT" marker. Left panel: scrollable endpoints Table — URL, status Badge (Healthy/Degraded/Down), last check timestamp, inline sparkline. Right panel: dark Card labeled "AI Incident Responder" with monospaced font streaming log lines like "Analyzing build #4821..." and a vertical Stepper: Logs ✓ → Builds ✓ → Deploys → Root Cause (spinning). Color palette: slate-950 background, red-500 alerts, green-400 healthy, white text. Dense ops dashboard screenshot.

multi-model-router

Web app UI mockup for "multi-model-router" — AI model comparison playground. Dark theme with vibrant columns. Top bar: wide Textarea input "Enter your prompt..." with a bold "Race Models ▶" Button and model selector checkboxes. Main area: four equal-width columns — GPT-4o (blue top border), Claude (orange), Gemini (green), Llama (purple). Each column: colored Badge header with model name and version, a live streaming text content area with blinking cursor, footer row with tokens/sec displayed as a mini bar chart and cost "$0.0042" in dim text. Thin horizontal progress bar at column top races to 100% as response completes. Left sidebar: history list showing past races with winner highlighted in gold. Vote thumbs-up/down buttons at bottom of each column. Top right navbar: "Session cost: $0.18". Neutral-900 background, model accent colors. Screenshot.

slack-pr-reviewer

Web app UI mockup for "slack-pr-reviewer" — AI pull request review bot dashboard. Clean light developer theme. Left sidebar: PR list — author avatar, PR title truncated to one line, repo name in dim text, colored status Badge: Reviewed (green) / Pending (yellow) / Flagged (red). Main area: PR detail view. Two Tabs — "AI Review" and "Thread". AI Review tab: diff-style code block with red/green line backgrounds, then below it an AI analysis section with three collapsible Alert components — 🔴 Bug: "Null pointer on line 47" / 🟡 Security: "Exposed API key" / 🟢 Style: "Rename variable". Thread tab: Slack-style chat messages with avatar, username, timestamp, threaded indent replies. Right sidebar: mini bar chart PRs by repo, stat "Avg review time: 4 min". Top navbar: platform toggle buttons Slack | Discord | GitHub. Screenshot.

content-pipeline

Web app UI mockup for "content-pipeline" — AI content production pipeline dashboard. Light warm stone theme. Top: horizontal 5-step Stepper — Brief Submitted (green check) → Research (green check) → Draft (blue spinner, active) → Editorial Review (gray) → Published (gray). Active step "Draft" expands below the stepper into a Card showing a rich text article preview with a streaming cursor, word count badge, and Retry/Pause Buttons. Left sidebar: article brief list — title, status chip, deadline badge, author avatar. Right panel: stacked cards — OG Image Preview showing a social card mockup with title and image, then Social Posts with Twitter and LinkedIn card previews. Bottom-right: Toast notification "✅ Research complete — 8 sources found". Warm stone-100 background, amber-500 active step, sage-green completed steps. Screenshot.

feature-rollout

Web app UI mockup for "feature-rollout" — feature flag and A/B testing management tool. Dark theme with confident typography. Top Tabs: Flags | Experiments | Analytics. Stats row: 7 active experiments, 2 ready to ship, 1 paused. Main area: dense data Table — Flag Name, Status (shadcn Toggle switch on/off), Rollout % (thin inline Progress bar with percentage label), Variants (A / B chips), Conversion % with delta. A row is selected and a Sheet slides in from the right: large Donut chart showing variant A 68% vs B 32%, Rollout Slider 0-100% set to 45%, Segment checkboxes (Power Users ✓, Free Tier ✓, Enterprise unchecked), then an "AI Recommendation" Card with badge "🚀 Ship Variant A — 94% confidence" and two-line reasoning text. Gray-900 background, violet-500 accents, green winning, red losing. Screenshot.

event-driven-crm

Web app UI mockup for "event-driven-crm" — event-driven CRM with AI churn prediction. Light clean enterprise SaaS. Left sidebar: customer list — avatar, full name, company name, and a small colored health dot (green/yellow/red). Main area: customer profile page. Top row: avatar, name "Sarah Chen", company "Acme Corp", plan Badge "Pro", edit Button. Stats row: MRR $2,400 | Open Tickets 3 | Last Login 12 days ago. Below: horizontal Timeline component — nodes on a line with icons and dates: 🟢 Signup, 🟢 Purchase $240, 🟠 Support Ticket #882, 🔴 Churn Signal "Cancelled onboarding". Each node hoverable with a tooltip Card. Right panel: circular gauge chart "Health Score 34/100" in red, below it a red Alert Banner "Churn Risk: HIGH — 73% probability". Then: recommended actions list with Buttons: Send Re-engagement Email, Schedule Call, Apply Discount. White background, sky-blue accents. Screenshot.

code-sandbox-tutor

Web app UI mockup for "code-sandbox-tutor" — interactive AI coding tutor for students. Light, playful but professional web app. Left sidebar: lesson curriculum tree — expandable sections with icons 🔵 Loops 🟢 Functions 🟡 Arrays, lesson items with completion checkmarks and lock icons for locked lessons. Currently active lesson highlighted. Center top: Monaco-style code editor with syntax-highlighted JavaScript, line numbers, and the student's code visible. Center bottom: split output panel — left half "Console Output" with printed output lines, right half "Live Preview" iframe showing rendered HTML result. Right sidebar: AI tutor chat — bot avatar labeled "Codey 🤖", chat bubbles with hints, a yellow Alert card "💡 Hint: check your loop bounds". Prominent green "▶ Run Code" Button above editor. Orange "🔧 Auto-fix" Button below error. Top bar: "Lesson 4: Arrays" title, XP bar 340/500, streak "🔥 7 days". Screenshot.

multi-agent-research

Web app UI mockup for "multi-agent-research" — multi-agent parallel research orchestrator. Dark visualization-heavy theme. Center: a node-graph canvas (react-flow style) — central oval node "What causes inflation?" connected by animated dashed lines to four agent nodes arranged around it: blue rounded Card "🌐 Web Search — RUNNING", amber Card "📄 Document Analysis — DONE", purple Card "🗄️ Knowledge Base — WAITING", white Card "🧠 Orchestrator — SYNTHESIZING". Each Card shows agent name, colored status chip, and a one-line preview of findings. Below the graph: a "Synthesis Report" streaming output area with markdown formatting — headers, bullet points, and inline citations [1][2][3] rendered as superscript blue badges. Left sidebar: research history with timestamps. Top: question input with "Start Research ▶" Button. Status bar bottom: "4 agents · 23 sources · 2 min 14 sec". Slate-900 background. Screenshot.

discord-game-master

Web app UI mockup for "discord-game-master" — AI tabletop RPG game master web companion app. Dark fantasy theme but as a real web app. Left sidebar: "The Party" section — four character cards each with fantasy portrait avatar, character name "Theron the Bold", class icon ⚔️🧙🏹🛡️, HP bar in red (65/100), XP bar in gold (340/500). Center: game feed as a chat-style scrollable log. GM narration in italic amber serif text. Player actions in white monospace. A dice roll result Card: large d20 icon, roll "18" in huge text, "+3 STR modifier", total "21" in bold, green "SUCCESS" Badge. Player input row at bottom with quick-action Buttons: ⚔️ Attack | 🔍 Investigate | 💬 Persuade | ✨ Custom. Right panel: "Current Scene" with a dungeon map image in a Card. Below: Initiative Order as a ranked list — colored dots, name, HP badge. Top navbar: "Campaign: Curse of Strahd", session timer "02:14:33", "🎨 Generate Scene Art" Button. Gray-950 background, amber-400 accents. Screenshot.

compliance-auditor

Web app UI mockup for "compliance-auditor" — AI compliance auditing system for SaaS platforms. Light professional institutional theme. Sticky top Alert banner: "🚫 2 Critical findings blocking deploy — Review required" in red. Stats row: large "Compliance Score 94/100" in a green Badge, "Last Audit: 2 hours ago", "Open: 3 Critical · 7 Warnings · 42 Passed". Main area: full-width shadcn Table — columns: ID (#C-047), Category Badge (Infrastructure/Code/Secrets), Severity Badge (Critical red / Warning amber / Pass green), Description text, Status chip (Open/Resolved/Blocked). A row is expanded inline showing AI explanation paragraph and "Remediation Steps" as a numbered list with a "Mark Resolved" Button. Left sidebar: audit schedule — cron job list with next-run timestamps and play Button. Right panel: "Approval Workflow" vertical Stepper — Audit ✓ → AI Review ✓ → Human Approval (active spinner) → Deploy Unblocked. "Export PDF Report" Button at bottom. White background, red-600 critical, amber warning, emerald pass. Screenshot.

How it compares

Pick benchmark-agents over generic test runners when the goal is validating Claude Code skill injection and Vercel plugin hook behavior during live agent builds, not isolated function or API unit coverage.

FAQ

Is Benchmark Agents safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: AI & Agent Building · agents · automation