
Eval Harness
Define pass/fail evals and pass@k checks so Claude Code tasks stay reliable before you ship agent workflow changes.
Overview
Eval Harness is an agent skill most often used in Build (also Ship, Operate) that implements eval-driven development with capability and regression evals and pass@k-style reliability checks for Claude Code.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill eval-harnessWhat is this skill?
- Eval-driven development philosophy with pre-implementation success criteria
- Capability eval and regression eval markdown templates with checklists
- pass@k reliability framing for agent task completion
- Continuous eval runs during prompt and workflow changes
- Tools: Read, Write, Edit, Bash, Grep, Glob for harness automation
- pass@k metrics for reliability measurement
- Capability eval and regression eval template types
Adoption & trust: 5.2k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You change prompts and agent workflows without defined pass/fail tests, so regressions slip through until users complain.
Who is it for?
Solo builders maintaining Claude Code skills, hooks, or multi-step agent workflows who want EDD-style gates.
Skip if: One-off chat tasks with no reusable workflow, or teams that only run manual QA with no agent eval artifacts.
When should I use this skill?
Setting up eval-driven development, defining pass/fail criteria for Claude Code tasks, measuring reliability with pass@k, creating regression suites, or benchmarking agent performance.
What do I get? / Deliverables
You get documented capability and regression evals with explicit success criteria and a repeatable pass@k mindset before merging agent changes.
- Capability and regression eval markdown specs
- Documented success criteria and pass@k tracking approach
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Eval harnesses are authored while you build agent tooling, even though you rerun them in Ship and Operate. EDD templates and capability/regression evals are procedural agent infrastructure, not app frontend or SEO work.
Where it fits
Author capability eval checklists before shipping a new custom skill.
Run regression evals against a baseline SHA before tagging an agent workflow release.
Compare pass@k across model upgrades to decide whether to roll back prompt changes.
How it compares
Structured eval markdown and pass@k framing—not a hosted benchmark SaaS or a single end-to-end Playwright suite.
Common Questions / FAQ
Who is eval-harness for?
Indie developers and small teams running Claude Code who want formal eval templates and regression tracking for agent tasks.
When should I use eval-harness?
In Build when defining agent evals; in Ship when gating releases with regression suites; in Operate when benchmarking model or prompt changes across versions.
Is eval-harness safe to install?
It may use shell and file tools to run harnesses—review the Security Audits panel on this page and scope Bash access to your repo.
SKILL.md
READMESKILL.md - Eval Harness
# Eval Harness Skill A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles. ## When to Activate - Setting up eval-driven development (EDD) for AI-assisted workflows - Defining pass/fail criteria for Claude Code task completion - Measuring agent reliability with pass@k metrics - Creating regression test suites for prompt or agent changes - Benchmarking agent performance across model versions ## Philosophy Eval-Driven Development treats evals as the "unit tests of AI development": - Define expected behavior BEFORE implementation - Run evals continuously during development - Track regressions with each change - Use pass@k metrics for reliability measurement ## Eval Types ### Capability Evals Test if Claude can do something it couldn't before: ```markdown [CAPABILITY EVAL: feature-name] Task: Description of what Claude should accomplish Success Criteria: - [ ] Criterion 1 - [ ] Criterion 2 - [ ] Criterion 3 Expected Output: Description of expected result ``` ### Regression Evals Ensure changes don't break existing functionality: ```markdown [REGRESSION EVAL: feature-name] Baseline: SHA or checkpoint name Tests: - existing-test-1: PASS/FAIL - existing-test-2: PASS/FAIL - existing-test-3: PASS/FAIL Result: X/Y passed (previously Y/Y) ``` ## Grader Types ### 1. Code-Based Grader Deterministic checks using code: ```bash # Check if file contains expected pattern grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL" # Check if tests pass npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL" # Check if build succeeds npm run build && echo "PASS" || echo "FAIL" ``` ### 2. Model-Based Grader Use Claude to evaluate open-ended outputs: ```markdown [MODEL GRADER PROMPT] Evaluate the following code change: 1. Does it solve the stated problem? 2. Is it well-structured? 3. Are edge cases handled? 4. Is error handling appropriate? Score: 1-5 (1=poor, 5=excellent) Reasoning: [explanation] ``` ### 3. Human Grader Flag for manual review: ```markdown [HUMAN REVIEW REQUIRED] Change: Description of what changed Reason: Why human review is needed Risk Level: LOW/MEDIUM/HIGH ``` ## Metrics ### pass@k "At least one success in k attempts" - pass@1: First attempt success rate - pass@3: Success within 3 attempts - Typical target: pass@3 > 90% ### pass^k "All k trials succeed" - Higher bar for reliability - pass^3: 3 consecutive successes - Use for critical paths ## Eval Workflow ### 1. Define (Before Coding) ```markdown ## EVAL DEFINITION: feature-xyz ### Capability Evals 1. Can create new user account 2. Can validate email format 3. Can hash password securely ### Regression Evals 1. Existing login still works 2. Session management unchanged 3. Logout flow intact ### Success Metrics - pass@3 > 90% for capability evals - pass^3 = 100% for regression evals ``` ### 2. Implement Write code to pass the defined evals. ### 3. Evaluate ```bash # Run capability evals [Run each capability eval, record PASS/FAIL] # Run regression evals npm test -- --testPathPattern="existing" # Generate report ``` ### 4. Report ```markdown EVAL REPORT: feature-xyz ======================== Capability Evals: create-user: PASS (pass@1) validate-email: PASS (pass@2) hash-password: PASS (pass@1) Overall: 3/3 passed Regression Evals: login-flow: PASS session-mgmt: PASS logout-flow: PASS Overall: 3/3 passed Metrics: pass@1: 67% (2/3) pass@3: 100% (3/3) Status: READY FOR REVIEW ``` ## Integration Patterns ### Pre-Implementation ``` /eval define feature-name ``` Creates eval definition file at `.claude/evals/feature-name.md` ### During Implementation ``` /eval check feature-name ``` Runs current evals a