
Agent Eval
Run reproducible head-to-head benchmarks of Claude Code, Codex, Aider, and peers on your repo tasks before you standardize on one coding agent.
Overview
agent-eval is an agent skill that compares coding agents on YAML-defined repo tasks using pass rate, cost, time, and consistency metrics. It is most often used in the Validate phase (also Ship and Build).
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-eval
What is this skill?
- YAML task definitions with prompts, touched files, and multi-type judges (pytest, grep)
- Head-to-head metrics: pass rate, cost, time, and consistency across coding agents
- Git worktree isolation for reproducible runs on pinned commits
- CLI-oriented eval workflow to replace vibe-based "which agent is best?" debates
- Regression checks when models or agent tooling updates ship
Adoption & trust: 4k installs on skills.sh; 210k GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are choosing or renewing a coding agent based on opinions, with no reproducible pass-rate or cost data on your real tasks.
Who is it for?
Indie leads evaluating Claude Code vs Codex vs Aider on representative bugs and features with pytest-backed success criteria.
Skip if: You only do casual chat-based coding and have no git repo, no fixed test harness, or no appetite to maintain YAML task definitions.
When should I use this skill?
Comparing coding agents on your own codebase, measuring performance before adopting a new tool or model, running regression checks after agent updates, or producing data-backed agent selection for a team.
What do I get? / Deliverables
You get comparable eval runs on pinned tasks and judges so you can document which agent meets your bar before adoption or after an upgrade regression.
- YAML task definitions with judges and optional commit pins
- Comparative pass-rate, cost, time, and consistency results across agents
- Reproducible worktree-isolated eval runs for regression tracking
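As a rough illustration of how such comparative results could be derived, the sketch below aggregates per-run records into pass rate, mean cost, and mean time per agent. The `RunRecord` fields, the `summarize` function, and the sample data are hypothetical names and values chosen for this example, not agent-eval's actual data model or output.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunRecord:
    """One agent run on one task (hypothetical schema for illustration)."""
    agent: str
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # cost may be unavailable for some agents


def summarize(records):
    """Group runs by agent and compute pass rate, mean cost, and mean time."""
    by_agent = defaultdict(list)
    for record in records:
        by_agent[record.agent].append(record)

    summary = {}
    for agent, runs in by_agent.items():
        passes = sum(r.passed for r in runs)
        costs = [r.cost_usd for r in runs if r.cost_usd is not None]
        summary[agent] = {
            "runs": len(runs),
            "passes": passes,
            "pass_rate": passes / len(runs),
            "mean_cost_usd": mean(costs) if costs else None,
            "mean_seconds": mean(r.seconds for r in runs),
        }
    return summary


# Made-up sample data, not real benchmark results.
SAMPLE_RUNS = [
    RunRecord("agent-a", True, 40.0, 0.10),
    RunRecord("agent-a", True, 50.0, 0.14),
    RunRecord("agent-b", True, 30.0),
    RunRecord("agent-b", False, 46.0),
]

if __name__ == "__main__":
    for agent, stats in summarize(SAMPLE_RUNS).items():
        print(agent, stats)
```

Keeping cost optional mirrors the skill's note that API spend is only reported when available, so an agent without cost data still gets a pass rate and timing.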
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Validate because the skill exists to produce data-backed agent selection and scope decisions before you commit team workflow to a single tool. Scope fits YAML-defined tasks, pinned commits, and pass-rate metrics that bound what you will ask agents to do reliably on your codebase.
Where it fits
Pin three representative tasks in YAML and pick the agent with the best pass rate and cost before buying seats.
Re-run the same task suite after a model bump to catch regressions on HTTP retry or test refactors.
Document which agent handles your monorepo layout reliably for future skill and MCP investments.
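The re-run-after-a-model-bump case can be sketched as a comparison of pass rates between a baseline suite run and a new one. The function name, the dictionary format, the `refactor-tests` task name, and the 0.1 tolerance below are assumptions for illustration; agent-eval's own report format may differ.

```python
def find_regressions(baseline, current, tolerance=0.1):
    """Return tasks whose pass rate dropped by more than `tolerance`.

    Both arguments map task name -> pass rate in [0, 1] (hypothetical format).
    Tasks missing from `current` are reported too, since a task that no
    longer runs cannot be shown to pass.
    """
    regressions = {}
    for task, old_rate in baseline.items():
        new_rate = current.get(task)
        if new_rate is None or old_rate - new_rate > tolerance:
            regressions[task] = (old_rate, new_rate)
    return regressions


# Made-up pass rates before and after a model update.
BASELINE = {"add-retry-logic": 1.0, "refactor-tests": 2 / 3}
AFTER_BUMP = {"add-retry-logic": 1 / 3, "refactor-tests": 2 / 3}

if __name__ == "__main__":
    for task, (old, new) in find_regressions(BASELINE, AFTER_BUMP).items():
        print(f"{task}: {old:.2f} -> {new:.2f}")
```

A tolerance is used rather than strict equality because agents are non-deterministic, so small run-to-run differences should not be flagged as regressions.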
How it compares
Use it instead of ad-hoc prompt shootouts: it relies on structured tasks and judges rather than a single subjective coding session.
Common Questions / FAQ
Who is agent-eval for?
Solo builders and tiny teams who need quantified agent comparisons on their own repositories before standardizing tooling.
When should I use agent-eval?
In Validate when scoping which agent to adopt; in Build when tuning agent-tooling investments; in Ship when regression-testing agent or model updates on fixed YAML tasks.
Is agent-eval safe to install?
Check the Security Audits panel on this page. The workflow runs shell commands, git, and tests against your repo, so review the CLI source and isolate runs via worktrees.
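One way to get that kind of worktree isolation is plain `git worktree`. The sketch below is an assumption about how a throwaway checkout could be created and cleaned up, not agent-eval's actual implementation; the helper names and the temporary-directory layout are hypothetical. It relies only on the standard `git worktree add --detach` and `git worktree remove --force` commands.

```python
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


def worktree_add_command(repo, path, commit):
    """Build the git command that checks out `commit` into a new worktree."""
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(path), commit]


def worktree_remove_command(repo, path):
    """Build the git command that deletes the worktree, even if it is dirty."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)]


@contextmanager
def isolated_worktree(repo, commit):
    """Yield a throwaway checkout of `commit`; the base checkout stays untouched."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "run"  # git creates this directory itself
        subprocess.run(worktree_add_command(repo, path, commit), check=True)
        try:
            yield path
        finally:
            # Always unregister the worktree so the repo metadata stays clean.
            subprocess.run(worktree_remove_command(repo, path), check=True)
```

Used as `with isolated_worktree("./my-project", "abc1234") as checkout:`, an agent and its judges could then be run with `checkout` as the working directory, leaving the main checkout unchanged.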
SKILL.md
# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes; this tool systematizes it.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

> **Note:** Install agent-eval from its repository after reviewing the source.

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```

### Git Worktree Isolation

Each agent run gets its own git worktree, with no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.

### Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |

## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance, since agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task, because LLM judges add noise
- **Track cost alongside pass rate**: a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions**: they are test fixtures, treat them as