
Agent Evaluation
Install this when you need repeatable rubrics to judge whether Claude Code skills, commands, and agents actually improve after you change prompts or context.
Overview
Agent-evaluation is an agent skill most often used in Build (also Ship, Operate) that measures and improves Claude Code commands, skills, and agents with outcome-focused, multi-dimensional rubrics.
Install
npx skills add https://github.com/neolabhq/context-engineering-kit --skill agent-evaluation
What is this skill?
- Multi-dimensional rubrics: factual accuracy, completeness, citations, source quality, and tool efficiency
- Outcome-focused evaluation that accepts multiple valid agent paths if end results are sound
- LLM-as-judge scaling paired with human review for edge cases and regressions
- BrowseComp research anchor: three drivers explain ~95% of browsing-agent performance variance
- Designed for validating context-engineering choices and catching quality regressions between runs
Adoption & trust: 546 installs on skills.sh; 1.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You changed prompts or context for an agent but have no disciplined way to know if quality improved or regressed across non-deterministic runs.
Who is it for?
Solo builders maintaining a growing library of Claude Code skills who want LLM-judge plus human spot checks before merging prompt changes.
Skip if: One-off scripts with a single obvious pass/fail, or teams that refuse any LLM-as-judge workflow and only want traditional automated tests.
When should I use this skill?
Testing prompt effectiveness, validating context engineering choices, or measuring improvement quality for Claude Code commands, skills, and agents.
What do I get? / Deliverables
You get repeatable evaluation dimensions, actionable scores, and regression signals so you can ship context-engineering changes with justified confidence.
- Multi-dimensional evaluation scores
- Regression comparison notes
- Actionable rubric feedback for context changes
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build → agent-tooling because the skill targets measuring and tuning agent artifacts (commands, skills, agents) while you engineer them. Agent-evaluation sits beside skill authoring and context design—you run it whenever you iterate on procedural knowledge, not as a one-off app feature ship check.
Where it fits
Score a rewritten skill.md against your rubric after tightening invokeWhen triggers and tool permissions.
Run paired before/after agent sessions to catch regressions in citation accuracy before you tag a skill release.
Re-benchmark a browsing agent monthly using outcome-focused metrics when users report missed sources.
How it compares
Use instead of ad-hoc “run it twice and eyeball” checks when you need structured agent QA across runs.
Common Questions / FAQ
Who is agent-evaluation for?
Indie builders and small teams who author Claude Code skills, slash commands, or sub-agents and need systematic quality measurement—not just manual chat retries.
When should I use agent-evaluation?
During Build when tuning agent-tooling; in Ship when validating prompt changes before release; and in Operate when iterating on production agent behavior after user-reported failures.
Is agent-evaluation safe to install?
It is documentation-style evaluation guidance; review the Security Audits panel on this Prism page before enabling any companion automation that executes shell or network calls.
SKILL.md
SKILL.md - Agent Evaluation
# Evaluation Methods for Claude Code Agents

Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.

## Core Concepts

Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.

The key insight is that agents may find alternative paths to goals, so the evaluation should judge whether they achieve the right outcomes while following reasonable processes.
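
As a rough illustration of judging outcomes rather than paths, the sketch below scores repeated runs of the same task. Everything in it is hypothetical and not part of the skill: the `AgentRun` structure, the exact-match comparison (a stand-in for an LLM judge or human review), and the `max_tool_calls` threshold used as a loose "reasonable process" check.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """One recorded agent run (hypothetical structure for illustration)."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)


def outcome_ok(run: AgentRun, expected: str) -> bool:
    # Compare only the end result, ignoring which tools produced it.
    # Exact match is a placeholder; a real setup would likely use a judge.
    return run.final_answer.strip().lower() == expected.strip().lower()


def process_reasonable(run: AgentRun, max_tool_calls: int = 15) -> bool:
    # Loose process check (assumed threshold): no runaway exploration.
    return len(run.tool_calls) <= max_tool_calls


def pass_rate(runs: list[AgentRun], expected: str) -> float:
    # Aggregate over repeated runs, since one run is not representative
    # of a non-deterministic agent.
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(outcome_ok(r, expected) and process_reasonable(r) for r in runs)
    return passed / len(runs)


# Three runs of the same task that took different paths.
runs = [
    AgentRun("Paris", ["search", "read", "read"]),
    AgentRun(" paris ", ["search"] * 10),
    AgentRun("Lyon", ["search"]),
]
print(f"pass rate: {pass_rate(runs, 'Paris'):.2f}")
```

The first two runs use three and ten tool calls respectively, yet both count as passes because the end result matches; only the third fails, and it fails on outcome rather than on path.
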
**Performance Drivers: The 95% Finding**

Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

| Factor | Variance Explained | Implication |
|--------|-------------------|-------------|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |

Implications for Claude Code development:

- **Token budgets matter**: Evaluate with realistic token constraints
- **Model upgrades beat token increases**: Upgrading models provides larger gains than increasing token budgets
- **Multi-agent validation**: The finding validates architectures that distribute work across subagents with separate context windows

## Evaluation Challenges

### Non-Determinism and Multiple Valid Paths

Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.

**Solution**: Evaluate outcomes, not exact execution paths. Judge whether the agent achieves the right result through a reasonable process.

### Context-Dependent Failures

Agent failures often depend on context in subtle ways. An agent might succeed on complex queries but fail on simple ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates.

**Solution**: Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.

### Composite Quality Dimensions

Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.
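
A small sketch of why a single aggregate number can hide this: the two score profiles below are invented for illustration (they are not measurements from the skill or from BrowseComp), and `summarize` is a hypothetical helper. Both agents have the same mean, but their weak spots differ.

```python
from statistics import mean

# Hypothetical per-dimension scores on a 0.0-1.0 scale.
agent_a = {
    "factual_accuracy": 0.9,
    "completeness": 0.8,
    "coherence": 0.8,
    "tool_efficiency": 0.3,
    "process_quality": 0.7,
}
agent_b = {
    "factual_accuracy": 0.5,
    "completeness": 0.7,
    "coherence": 0.7,
    "tool_efficiency": 0.9,
    "process_quality": 0.7,
}


def summarize(scores: dict[str, float]) -> dict[str, object]:
    # Report the aggregate alongside the weakest dimension so that
    # a low score is not averaged away.
    weakest = min(scores, key=scores.get)
    return {
        "mean": round(mean(scores.values()), 3),
        "weakest": weakest,
        "weakest_score": scores[weakest],
    }


for name, scores in (("agent_a", agent_a), ("agent_b", agent_b)):
    print(name, summarize(scores))
```

Reporting the weakest dimension next to the mean keeps an accuracy problem from looking the same as an efficiency problem.
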
**Solution**: Evaluation rubrics must capture multiple dimensions with appropriate weighting for the use case.

## Evaluation Rubric Design

### Multi-Dimensional Rubric

Effective rubrics cover key dimensions with descriptive levels:

**Instruction Following** (weight: 0.30)

- Excellent (1.0): All instructions followed precisely
- Good (0.8): Minor deviations that don't affect outcome
- Acceptable (0.6): Major instructions followed, minor ones missed
- Poor (0.3): Significant instructions ignored
- Failed (0.0): Fundamentally misunderstood the task

**Output Completeness** (weight: 0.25)

- Excellent: All requested aspects thoroughly covered
- Good: Most aspects c