Agent Eval

Name: Agent Eval
Author: affaan-m

affaan-m/everything-claude-code

Run reproducible head-to-head benchmarks of Claude Code, Codex, Aider, and peers on your repo tasks before you standardize on one coding agent.

Overview

agent-eval is an agent skill that compares coding agents on YAML-defined repo tasks using pass rate, cost, time, and consistency metrics. It is most often used in the Validate phase and also fits Ship and Build.

Install

npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-eval

What is this skill?

YAML task definitions with prompts, touched files, and multi-type judges (pytest, grep)
Head-to-head metrics: pass rate, cost, time, and consistency across coding agents
Git worktree isolation for reproducible runs on pinned commits
CLI-oriented eval workflow to replace vibe-based "which agent is best?" debates
Regression checks when models or agent tooling updates ship

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 4k installs on skills.sh; 210k GitHub stars; 1/3 security scanners passed (skills.sh audits).

What problem does it solve?

You are choosing or renewing a coding agent based on opinions, with no reproducible pass-rate or cost data on your real tasks.

Who is it for?

Indie leads evaluating Claude Code vs Codex vs Aider on representative bugs and features with pytest-backed success criteria.

Skip if: You only do casual, chat-only coding and have no git repo, no fixed test harness, or no appetite to maintain YAML task definitions.

When should I use this skill?

Comparing coding agents on your own codebase, measuring performance before adopting a new tool or model, running regression checks after agent updates, or producing data-backed agent selection for a team.

What do I get? / Deliverables

You get comparable eval runs on pinned tasks and judges so you can document which agent meets your bar before adoption or after an upgrade regression.

YAML task definitions with judges and optional commit pins
Comparative pass-rate, cost, time, and consistency results across agents
Reproducible worktree-isolated eval runs for regression tracking

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Canonical shelf is Validate because the skill exists to produce data-backed agent selection and scope decisions before you commit team workflow to a single tool. Scope fits YAML-defined tasks, pinned commits, and pass-rate metrics that bound what you will ask agents to do reliably on your codebase.

Also useful

Ship: Testing & QA

Build: Agent skills & templates

Where it fits

Example use

Validate: Scope & plan

Pin three representative tasks in YAML and pick the agent with the best pass rate and cost before buying seats.

Example use

Ship: Testing & QA

Re-run the same task suite after a model bump to catch regressions on HTTP retry or test refactors.

Example use

Build: Agent skills & templates

Document which agent handles your monorepo layout reliably for future skill and MCP investments.

How it compares

Use instead of ad-hoc prompt shootouts—structured tasks and judges, not a single subjective coding session.

Common Questions / FAQ

Who is agent-eval for?

Solo builders and tiny teams who need quantified agent comparisons on their own repositories before standardizing tooling.

When should I use agent-eval?

In Validate when scoping which agent to adopt; in Build when tuning agent-tooling investments; in Ship when regression-testing agent or model updates on fixed YAML tasks.

Is agent-eval safe to install?

Check the Security Audits panel on this page; the workflow runs shell, git, and tests against your repo—review the CLI source and isolate runs via worktrees.

SKILL.md

# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

> **Note:** Install agent-eval from its repository after reviewing the source.

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```
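
A task file like the one above can be sanity-checked before any agent is run. The sketch below is a hypothetical validator, not agent-eval's own loader: it assumes the YAML has already been parsed into a Python dict, and the `validate_task` name, the required-field list, and the set of judge types are assumptions drawn from the examples in this document.

```python
# Hypothetical pre-flight check for a parsed task definition.
# This is an illustration, not part of agent-eval itself.
REQUIRED_FIELDS = ("name", "repo", "prompt", "judge")  # assumed to be mandatory
KNOWN_JUDGE_TYPES = {"pytest", "command", "grep", "llm"}  # types shown in this README


def validate_task(task: dict) -> list[str]:
    """Return a list of problems; an empty list means the task looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not task.get(field):
            problems.append(f"missing required field: {field}")
    for i, judge in enumerate(task.get("judge") or []):
        jtype = judge.get("type")
        if jtype not in KNOWN_JUDGE_TYPES:
            problems.append(f"judge[{i}]: unknown type {jtype!r}")
        elif jtype in ("pytest", "command") and not judge.get("command"):
            problems.append(f"judge[{i}]: {jtype} judge needs a command")
        elif jtype == "grep" and not (judge.get("pattern") and judge.get("files")):
            problems.append(f"judge[{i}]: grep judge needs pattern and files")
    if not task.get("commit"):
        # The pin is optional, so this is only a warning.
        problems.append("warning: no commit pinned, results may not be reproducible")
    return problems


# The example task from above, written out as the dict a YAML parser would produce.
example_task = {
    "name": "add-retry-logic",
    "description": "Add exponential backoff retry to the HTTP client",
    "repo": "./my-project",
    "files": ["src/http_client.py"],
    "prompt": "Add retry logic with exponential backoff to all HTTP requests.\n",
    "judge": [
        {"type": "pytest", "command": "pytest tests/test_http_client.py -v"},
        {"type": "grep", "pattern": "exponential_backoff|retry", "files": "src/http_client.py"},
    ],
    "commit": "abc1234",
}

problems = validate_task(example_task)
print(problems)  # prints []
```

Returning a list of problems instead of raising on the first one lets a task author fix everything in a single pass.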

### Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. Separate worktrees keep runs isolated and reproducible: agents cannot interfere with each other or corrupt the base repo.
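
The isolation described here can be approximated with plain `git worktree` commands. The following is a minimal sketch under stated assumptions, not the tool's actual implementation: the helper names (`worktree_add_cmd`, `worktree_remove_cmd`, `run_in_worktree`) are hypothetical, and it only relies on the standard `git worktree add --detach` and `git worktree remove --force` subcommands.

```python
import subprocess
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def worktree_add_cmd(repo: Path, dest: Path, commit: str) -> list[str]:
    """Build the git command that checks out `commit` into a new detached worktree at `dest`."""
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), commit]


def worktree_remove_cmd(repo: Path, dest: Path) -> list[str]:
    """Build the git command that deletes the worktree, even if the agent left it dirty."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(dest)]


def run_in_worktree(repo: Path, dest: Path, commit: str, action: Callable[[Path], T]) -> T:
    """Create the worktree, call action(dest), and always clean up afterwards."""
    subprocess.run(worktree_add_cmd(repo, dest, commit), check=True)
    try:
        return action(dest)
    finally:
        subprocess.run(worktree_remove_cmd(repo, dest), check=True)


# Show the command that would be run for the pinned commit from the example task.
add_cmd = worktree_add_cmd(Path("my-project"), Path("runs/claude-code-1"), "abc1234")
print(" ".join(add_cmd))
```

Using `--detach` avoids creating a branch per run, and the `finally` block removes the worktree even when the agent or a judge fails.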

### Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
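
As a rough illustration of how these four metrics could be derived from raw run records, here is a small sketch. The `RunResult` record and the `summarize` function are hypothetical, and averaging cost and time across runs is an assumption; the document does not say whether the report shows means or totals.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    """One agent run on one task (hypothetical record shape)."""
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent does not report spend


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(r.passed for r in runs)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": round(mean(costs), 4) if costs else None,  # assumption: mean, not total
        "mean_seconds": round(mean(r.seconds for r in runs), 1),
    }


runs = [
    RunResult(passed=True, seconds=40.0, cost_usd=0.08),
    RunResult(passed=True, seconds=36.0, cost_usd=0.09),
    RunResult(passed=False, seconds=38.0, cost_usd=0.07),
]
summary = summarize(runs)
print(summary["pass_rate"], f'{summary["consistency_pct"]}%')  # prints: 2/3 67%
```

Cost is left as `None` when no run reports it, which matches the "when available" caveat in the table.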

## Workflow

### 1. Define Tasks

Create a `tasks/` directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

### 2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:
1. Creates a fresh git worktree from the specified commit
2. Hands the prompt to the agent
3. Runs the judge criteria
4. Records pass/fail, cost, and time

### 3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```

```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```

## Judge Types

### Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```
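
A deterministic judge of this kind reduces to "run a command in the worktree and check the exit code". The sketch below shows that idea in Python; `command_judge` and its `timeout_s` parameter are hypothetical names, and how agent-eval actually executes judges is not specified here.

```python
import subprocess
from pathlib import Path


def command_judge(command: str, workdir: Path, timeout_s: float = 600) -> bool:
    """Pass when the command exits with status 0 inside the given directory."""
    try:
        completed = subprocess.run(
            command,
            shell=True,          # the task file stores the command as a single string
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung test suite counts as a failure
    return completed.returncode == 0


# Trivial commands stand in for `pytest tests/ -v` or `npm run build`.
ok = command_judge("exit 0", Path("."))
failed = command_judge("exit 3", Path("."))
print(ok, failed)  # prints: True False
```

Because the command comes from a task file and runs through the shell, only run task definitions you trust.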

### Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```
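
A pattern judge can be approximated with a regular-expression search over the files matched by the glob. This is a sketch only: `grep_judge` is a hypothetical helper, it uses Python's `re` syntax instead of grep's, and it assumes the judge passes when any matched file contains the pattern.

```python
import re
import tempfile
from pathlib import Path


def grep_judge(pattern: str, files_glob: str, workdir: Path) -> bool:
    """Pass when at least one file matching the glob contains the pattern."""
    regex = re.compile(pattern, re.MULTILINE)  # ^ and $ match per line, as in grep
    for path in workdir.glob(files_glob):
        if path.is_file() and regex.search(path.read_text(encoding="utf-8", errors="ignore")):
            return True
    return False


# Demo against a throwaway directory that stands in for an agent's worktree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src" / "net").mkdir(parents=True)
    (root / "src" / "net" / "client.py").write_text("class HttpRetry:\n    pass\n", encoding="utf-8")
    found = grep_judge("class.*Retry", "src/**/*.py", root)
    missing = grep_judge("class.*Backoff", "src/**/*.py", root)

print(found, missing)  # prints: True False
```

A pattern match only shows that certain text exists, so it works best alongside a test-based judge.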

### Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```

## Best Practices

- **Start with 3-5 tasks** that represent your real workload, not toy examples
- **Run at least 3 trials** per agent to capture variance — agents are non-deterministic
- **Pin the commit** in your task YAML so results are reproducible across days/weeks
- **Include at least one deterministic judge** (tests, build) per task — LLM judges add noise
- **Track cost alongside pass rate** — a 95% agent at 10x the cost may not be the right choice
- **Version your task definitions** — they are test fixtures, treat them as code
