Eval

Name: Eval
Author: alirezarezvani

alirezarezvani/claude-skills

1.4k installs
23.5k repo stars
Updated July 17, 2026

eval is an agent skill that evaluates and ranks agent results by metric or LLM judge for an AgentHub session.

About

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

---
name: "eval"
description: "Evaluate and rank agent results by metric or LLM judge for an AgentHub session."
command: /hub:eval
---

# /hub:eval - Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

## Usage

```
/hub:eval                    # Eval latest session using configured criteria
/hub:eval 20260317-143022    # Eval specific session
/hub:eval --judge            # Force LLM judge mode (ignore metric config)
```

## What It Does

### Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:

```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
```

### LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness** - Does it solve the task?
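
The metric-mode ranking shown in the sample output above can be illustrated with a minimal Python sketch. This is not the code in `result_ranker.py`, which is not shown on this page. The `AgentResult` type, the `rank_results` and `format_table` helpers, and the `"lower"`/`"higher"` direction values are all hypothetical names chosen for illustration. The baseline of 180 is inferred from the deltas in the sample output (for example, 142 + 38 = 180).

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical record for one agent's measured outcome."""
    agent: str
    metric: float  # measured value, e.g. latency in ms
    files: int     # number of files the agent changed


def rank_results(results, baseline, direction="lower"):
    """Return (rank, result, delta) tuples, best first.

    Assumption: direction is "lower" when smaller metrics win and
    "higher" when larger metrics win. The real flag values may differ.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(results, key=lambda r: r.metric,
                     reverse=(direction == "higher"))
    # Delta is measured against the baseline, so negative means "below baseline".
    return [(rank, r, r.metric - baseline)
            for rank, r in enumerate(ordered, start=1)]


def format_table(ranked, unit="ms"):
    """Render the ranking as a plain-text table with padded columns."""
    rows = [("RANK", "AGENT", "METRIC", "DELTA", "FILES")]
    for rank, r, delta in ranked:
        rows.append((str(rank), r.agent, f"{r.metric:g}{unit}",
                     f"{delta:+g}{unit}", str(r.files)))
    return "\n".join("  ".join(cell.ljust(7) for cell in row).rstrip()
                     for row in rows)


if __name__ == "__main__":
    sample = [
        AgentResult("agent-1", 165, 3),
        AgentResult("agent-2", 142, 2),
        AgentResult("agent-3", 190, 1),
    ]
    ranked = rank_results(sample, baseline=180, direction="lower")
    print(format_table(ranked))
    print(f"Winner: {ranked[0][1].agent} ({ranked[0][1].metric:g}ms)")
```

Keeping the sort direction as an explicit argument lets the same ranking handle both "smaller is better" metrics such as latency and "larger is better" metrics such as throughput.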

Eval by the numbers

1,401 all-time installs (skills.sh)
+2 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #492 of 2,159 Testing & QA skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

At a glance

eval capabilities & compatibility

Capabilities: /hub:eval — evaluate agent results · get the diff: `git diff {base_branch}...{agent_b · read the agent's result post from `.agenthub/boa · compare all diffs and rank by: · **correctness** — does it solve the task?
Use cases: documentation

From the docs

What eval says it does

Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

SKILL.md

Get the diff: `git diff {base_branch}...{agent_branch}`

SKILL.md

Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`

SKILL.md

Compare all diffs and rank by: - **Correctness** — Does it solve the task?

SKILL.md
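
The first two judge-mode steps quoted above (get the diff, read the result post) could be gathered with a short Python sketch like the one below. It assumes agents are numbered from 1 in the order their branches are listed, which the excerpt does not state. The helper names `diff_command`, `result_post_path`, and `collect_judge_inputs` are hypothetical. Only the `git diff` three-dot form and the result-post path pattern come from the docs.

```python
import subprocess
from pathlib import Path


def diff_command(base_branch, agent_branch):
    """Build the three-dot diff command quoted in the docs."""
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def result_post_path(repo_root, agent_index):
    """Path of an agent's result post, following the documented pattern."""
    return (Path(repo_root) / ".agenthub" / "board" / "results"
            / f"agent-{agent_index}-result.md")


def collect_judge_inputs(repo_root, base_branch, agent_branches):
    """Collect each agent's diff and result post for a later comparison step.

    Assumption: agent i corresponds to the i-th branch, counting from 1.
    """
    inputs = []
    for i, branch in enumerate(agent_branches, start=1):
        diff = subprocess.run(
            diff_command(base_branch, branch),
            cwd=repo_root, capture_output=True, text=True, check=True,
        ).stdout
        post = result_post_path(repo_root, i)
        inputs.append({
            "agent": f"agent-{i}",
            "diff": diff,
            # A missing post is recorded as empty text instead of failing.
            "result_post": post.read_text() if post.exists() else "",
        })
    return inputs
```

The comparison and ranking step itself is left out, since the excerpt only names the first criterion (correctness) and does not describe how the judge weighs the diffs.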

npx skills add https://github.com/alirezarezvani/claude-skills --skill eval
