
Eval
Rank and pick a winner across parallel AgentHub agent runs using a metric command or an LLM judge on diffs.
Overview
eval is an agent skill most often used in Ship (also Build) that ranks parallel AgentHub session outputs using metric commands or an LLM judge comparing git diffs.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill eval
What is this skill?
- Supports /hub:eval on latest or a specific session id with optional --judge to force LLM ranking
- Metric mode runs a configured eval command in each agent worktree and ranks by metric and direction
- LLM judge mode diffs base vs agent branches and scores correctness, simplicity, and quality
- Reads per-agent result posts from .agenthub/board/results for judge context
- Outputs ordered rankings with winner summary suitable for picking which agent branch to keep
- LLM judge ranks on three criteria: correctness, simplicity, and quality
- Metric output table includes rank, agent id, metric value, delta, and file count
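As a rough illustration of the metric-mode ranking described in the list above, the sketch below sorts agents by metric and direction and attaches a delta. It is not the skill's `result_ranker.py`: the `AgentResult` type, the `rank_results` helper, and the choice to compute the delta against a single baseline value are all assumptions. The numbers mirror the sample table in SKILL.md, and the 180 ms baseline is only inferred from that table's deltas.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical container for one agent's evaluation outcome."""
    agent: str          # e.g. "agent-2"
    metric: float       # value parsed from the eval command's output
    files_changed: int  # number of files touched on the agent branch


def rank_results(results, baseline, direction="lower"):
    """Sort agents best-first and attach each agent's delta from the baseline."""
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    ordered = sorted(results, key=lambda r: r.metric, reverse=(direction == "higher"))
    return [
        (rank, r.agent, r.metric, r.metric - baseline, r.files_changed)
        for rank, r in enumerate(ordered, start=1)
    ]


# Values taken from the sample metric table; baseline=180 is an assumption.
rows = rank_results(
    [
        AgentResult("agent-1", 165.0, 3),
        AgentResult("agent-2", 142.0, 2),
        AgentResult("agent-3", 190.0, 1),
    ],
    baseline=180.0,
)
for rank, agent, metric, delta, files in rows:
    print(f"{rank:<6}{agent:<9}{metric:g}ms  {delta:+g}ms  {files}")
print(f"Winner: {rows[0][1]} ({rows[0][2]:g}ms)")
```

With `direction="higher"` the same helper puts the largest value first, which would suit metrics such as throughput or test pass rate.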
Adoption & trust: 1.4k installs on skills.sh; 17.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You ran several agents on the same task and need a fair, repeatable way to pick which result is best without manually diffing every branch.
Who is it for?
Solo builders orchestrating AgentHub sessions who already configured eval metrics or want LLM-based diff ranking after parallel runs complete.
Skip if: you run single-agent workflows with no session id, worktrees, or result posts, or you have not run agents yet and only need implementation help.
When should I use this skill?
Evaluate and rank agent results by metric or LLM judge for an AgentHub session (/hub:eval).
What do I get? / Deliverables
You get a ranked table with a declared winner and enough deltas and file context to merge or continue from the top agent branch.
- Ordered ranking table with winner declaration
- Per-agent metric or judge rationale tied to diffs and result posts
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship because the skill’s job is to judge finished agent outputs against criteria—same mental model as test gates before you merge a chosen solution. Testing fits metric-based eval commands and pass/fail style rankings; LLM-judge mode still acts as a structured quality gate on session results.
Where it fits
After spawning three agents with different prompts, run /hub:eval to see which branch passes your latency metric.
Use metric mode with your test command in each worktree before merging the winning agent’s changes.
Force --judge to compare diffs for correctness and simplicity when no numeric metric is configured.
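For the judge case, the inputs are each agent's diff against the base branch and its result post. A minimal sketch of gathering them follows, assuming a hypothetical mapping from agent number to branch name (the real branch naming is not specified here) and hypothetical helper names `diff_command` and `collect_judge_input`. Only the `git diff {base_branch}...{agent_branch}` form and the `.agenthub/board/results/agent-{i}-result.md` path come from SKILL.md.

```python
import subprocess
from pathlib import Path


def diff_command(base_branch, agent_branch):
    """Build the three-dot diff command comparing an agent branch with its base."""
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def collect_judge_input(repo, base_branch, agent_branches):
    """Gather the diff and result post for each agent.

    agent_branches maps an agent number (1, 2, ...) to its branch name;
    this mapping is an assumption made for the example.
    """
    repo = Path(repo)
    bundle = {}
    for i, branch in agent_branches.items():
        diff = subprocess.run(
            diff_command(base_branch, branch),
            cwd=repo,
            capture_output=True,
            text=True,
            check=True,  # raise if git reports an error, e.g. an unknown branch
        ).stdout
        post = repo / ".agenthub" / "board" / "results" / f"agent-{i}-result.md"
        bundle[f"agent-{i}"] = {
            "diff": diff,
            # A missing result post is treated as empty rather than as an error.
            "result_post": post.read_text() if post.exists() else "",
        }
    return bundle
```

The returned dictionary could then be rendered into whatever prompt the judge uses. How the skill actually prompts the judge is not shown in this document.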
How it compares
Use for session-level winner selection after parallel agents finish, not as a substitute for writing tests or implementing features in one checkout.
Common Questions / FAQ
Who is eval for?
Indie builders and small teams using AgentHub-style parallel agents who need to compare multiple completed runs on one task.
When should I use eval?
After a Ship testing pass once sessions finish, in Build agent-tooling work when tuning skills, or before a merge when you must choose among the agent-1, agent-2, and agent-3 branches.
Is eval safe to install?
It runs shell eval commands and reads git diffs across worktrees; review the Security Audits panel on this page and only point --eval-cmd at scripts you trust.
SKILL.md
# /hub:eval - Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

## Usage

```
/hub:eval                    # Eval latest session using configured criteria
/hub:eval 20260317-143022    # Eval specific session
/hub:eval --judge            # Force LLM judge mode (ignore metric config)
```

## What It Does

### Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:

```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
```

### LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness**: Does it solve the task?
   - **Simplicity**: Fewer lines changed is better (when equal correctness)
   - **Quality**: Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

```
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```

### Hybrid Mode

1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings

## After Eval

1. Update session state:

```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```

2. Tell the user:
   - Ranked results with winner highlighted
   - Next step: `/hub:merge` to merge the winner
   - Or `/hub:merge {session-id} --agent {winner}` to be explicit
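
The Hybrid Mode rule (hand off to the LLM judge when the top agents are within 10% of each other) leaves the exact comparison open. The sketch below takes one possible reading: compare the best and second-best metric values and measure the gap relative to the best. The function name `needs_judge_tiebreak`, the `tolerance` parameter, and this interpretation of "within 10%" are assumptions, not the skill's documented behavior.

```python
def needs_judge_tiebreak(metrics, direction="lower", tolerance=0.10):
    """Return True when the runner-up is within `tolerance` of the best metric."""
    if len(metrics) < 2:
        return False  # nothing to break a tie against
    ordered = sorted(metrics, reverse=(direction == "higher"))
    best, runner_up = ordered[0], ordered[1]
    if best == 0:
        # Relative gap is undefined at zero; only an exact match counts as a tie.
        return runner_up == 0
    return abs(runner_up - best) / abs(best) <= tolerance


# Sample metric table values: 142 vs 165 is a gap of about 16%, so no tie-break.
print(needs_judge_tiebreak([142, 165, 190]))  # False
# A closer runner-up (150) is about 6% away, so the judge would be consulted.
print(needs_judge_tiebreak([142, 150, 190]))  # True
```

Under this reading, the sample metric run would declare agent-2 the winner on the metric alone, and the judge would only be consulted for closer finishes.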