
Evaluation Methodology
Calibrate how agent skills are scored on triggering, depth, and quality when you build or audit skills for Claude Code and similar agents.
Overview
Evaluation Methodology is an agent skill most often used in Build (also Ship review, Operate iterate) that documents anchored 0.0–1.0 rubrics and judge rules for scoring agent skills on dimensions including Triggering Ac
Install
npx skills add https://github.com/wshobson/agents --skill evaluation-methodology
What is this skill?
- Authoritative anchored rubrics on a 0.0–1.0 scale with five anchor points per dimension
- Triggering Accuracy dimension weighted 0.25 with F1-style recall/precision reasoning
- Judge protocol: 10 mental test prompts (5 should trigger, 5 should not)
- Deep-depth layer blend: static 15%, judge 25%, Monte Carlo 60%
- Ground truth for calibrating eval-judge, score disputes, and new judge models
- Four scoring dimensions assessed by eval-judge
- Five anchor points per dimension on a 0.0–1.0 scale
- Triggering Accuracy composite weight 0.25
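One way to read the deep-depth layer blend listed above is as a weighted sum of the three layer scores. The sketch below assumes exactly that; the source states only the percentages, so the combination rule, the function name `blend_deep_depth`, and the sample layer scores are all hypothetical.

```python
# Assumption: the three layers combine as a plain weighted sum.
# The weights come from the listing; everything else is illustrative.
DEEP_DEPTH_WEIGHTS = {"static": 0.15, "judge": 0.25, "monte_carlo": 0.60}


def blend_deep_depth(layer_scores: dict[str, float]) -> float:
    """Blend per-layer scores (each in 0.0-1.0) into one dimension score."""
    missing = DEEP_DEPTH_WEIGHTS.keys() - layer_scores.keys()
    if missing:
        raise ValueError(f"missing layer scores: {sorted(missing)}")
    for name in DEEP_DEPTH_WEIGHTS:
        if not 0.0 <= layer_scores[name] <= 1.0:
            raise ValueError(f"{name} score must be within 0.0-1.0")
    return sum(w * layer_scores[name] for name, w in DEEP_DEPTH_WEIGHTS.items())


# Hypothetical layer scores for a single dimension.
example = {"static": 0.80, "judge": 0.70, "monte_carlo": 0.60}
print(round(blend_deep_depth(example), 3))
```

Under this reading the Monte Carlo layer dominates: a 0.1 change in its score moves the blended score by 0.06, against 0.015 for the static layer.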
Adoption & trust: 3.4k installs on skills.sh; 36.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are shipping or debating skill quality without a shared, reproducible scoring standard for descriptions, triggers, and depth.
Who is it for?
Skill maintainers, eval-judge operators, and indie builders calibrating SKILL.md descriptions and reference depth before publishing to a skills directory.
Skip if: you only need a one-off feature prompt with no skill package, routing, or formal evaluation cycle.
When should I use this skill?
Calibrating skill scores, training judges, or editing SKILL.md descriptions for routing accuracy.
What do I get? / Deliverables
You align expectations with eval-judge using fixed anchors, prompt-based triggering tests, and documented composite weights before you change SKILL.md or challenge a score.
- Shared mental model of dimension anchors and weights
- Triggering test prompt design aligned with F1-style judging
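To make the F1-style judging concrete, the sketch below computes F1 over a 10-prompt test (5 prompts that should trigger the skill, 5 that should not). It is a minimal illustration using the standard precision/recall definitions, not the judge's actual implementation; the function name `f1_from_prompt_test` and the sample outcomes are hypothetical.

```python
def f1_from_prompt_test(fired_on_should: list[bool],
                        fired_on_should_not: list[bool]) -> float:
    """Return F1 for a triggering test.

    fired_on_should[i] is True if the skill fired on the i-th prompt
    that should trigger it (a true positive).
    fired_on_should_not[i] is True if the skill fired on the i-th prompt
    that should NOT trigger it (a false positive).
    """
    true_pos = sum(fired_on_should)
    false_neg = len(fired_on_should) - true_pos
    false_pos = sum(fired_on_should_not)
    if true_pos == 0:
        return 0.0  # no correct activations, so precision and recall are both 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical outcome: fires on 4 of 5 relevant prompts, 1 of 5 irrelevant ones.
score = f1_from_prompt_test(
    [True, True, True, True, False],
    [True, False, False, False, False],
)
print(round(score, 2))
```

Because F1 is the harmonic mean of precision and recall, a description that fires on everything (perfect recall, poor precision) scores no better than one that misses real uses, which is why test prompts should cover both sides.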
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Skill authors and evaluators reach for this reference while designing SKILL.md packages and agent-tooling in the Build phase. Anchored rubrics and judge methodology belong on the agent-tooling shelf because they define how procedural skills are measured before ship.
Where it fits
Decide whether a new skill idea warrants a full SKILL.md package by checking if its description would pass triggering prompts.
Rewrite frontmatter description after comparing against five-anchor Triggering Accuracy expectations.
Align a code review of skill changes with judge dimensions before merging to main.
File a score dispute or retrain judge expectations using the authoritative rubric text.
How it compares
Reference rubric for judges—not a workflow skill that generates plans or runs Monte Carlo jobs by itself.
Common Questions / FAQ
Who is evaluation-methodology for?
It is for people who author, review, or dispute scores on agent skills—especially when tuning frontmatter descriptions so Claude Code invokes skills at the right time.
When should I use evaluation-methodology?
Use it while building agent-tooling packages, before ship-time quality review, and when iterating published skills—any time you need Triggering Accuracy anchors or judge calibration across validate, build, and operate decisions.
Is evaluation-methodology safe to install?
It is documentation-only reference text with no shell or network behavior described here; review the Security Audits panel on this Prism page before installing any package from the repo.
SKILL.md
# Judge Rubrics — Anchored Scoring Reference

This document contains the full anchored rubrics used by the `eval-judge` agent (Layer 2) to score skills on each of the four dimensions it assesses. Each dimension uses a 0.0–1.0 scale with five anchor points. The judge interpolates between anchors based on the evidence gathered from reading SKILL.md and any `references/` files.

These rubrics are the authoritative scoring standard. When calibrating expectations, filing score disputes, or training new judge models, use these anchors as ground truth.

---

## Dimension 1 — Triggering Accuracy

**Weight in composite:** 0.25 (highest)

**Layer blend (deep depth):** static 15%, judge 25%, Monte Carlo 60%

### What is being measured

Triggering accuracy measures whether the skill's `description` field in the frontmatter causes Claude Code to invoke the skill at the right times. A skill with perfect triggering accuracy fires on every prompt that genuinely needs it (high recall) and never fires on prompts where it is irrelevant (high precision). The score is conceptually the F1 of precision and recall across a representative prompt distribution.

### How the judge scores it

The judge generates 10 mental test prompts: 5 that should trigger the skill and 5 that should not. It assesses whether the description would lead Claude Code's routing model to activate (or not activate) for each prompt. The F1 score of this 10-prompt evaluation becomes the dimension score.

The judge also considers whether the description provides actionable trigger signals rather than just naming or describing the skill in passive terms.

### Anchored Rubric

**0.0 – 0.19 (Grade F) — Unusable trigger**

The description is absent, empty, or so vague that it provides no routing signal.

Examples:

- Description is under 10 characters
- Description is just the skill name: "evaluation-methodology"
- Description describes what the skill is, not when to use it: "A skill about evaluation"
- Description uses entirely passive language with no conditional framing

A skill at this level will almost never be autonomously invoked. It may be invoked if the user explicitly names it, but that defeats the purpose of a plugin ecosystem.

**0.20 – 0.39 (Grade F/D) — Weak trigger**

The description exists and is somewhat meaningful but has major gaps:

- Mentions the domain but lacks trigger phrases ("Use when..." or similar)
- Trigger language is present but maps to only one narrow use case
- Description would trigger the skill on clearly wrong prompts (precision failure)
- Description would miss 3+ of the 5 should-trigger test prompts (recall failure)

Example of a 0.30-scoring description:

> "PluginEval quality methodology — dimensions, rubrics, statistical methods."

This names the topic but provides no trigger signal. The routing model cannot infer when to use it.

**0.40 – 0.59 (Grade D/C) — Partial trigger**

The description has some trigger signal but is imprecise:

- Contains "Use when" but only one specific context
- Would correctly handle 3 of 5 should-trigger prompts
- Some false positives — would fire for adjacent but wrong use cases
- Trigger phrase is generic ("Use when working with evaluations") rather than specific

Example of a 0.50-scoring description:

> "PluginEval quality methodology — dimensions, rubrics. Use when understanding evaluation."

Better — has a trigger phrase — but "understanding evaluation" is too generic. It would catch some legitimate uses but also fire for unrelated evaluation tasks.

**0.60 – 0.79 (Grade C/B) — Good trigger**

Description clearly identifies when to invoke the skill with only minor gaps:

- Contains "Use when..." or "Use this skill when..." with at least two specific contexts
- Would correctly handle 4 of 5 should-trigger prompts
- Precision is good (few false positives)
- May miss edge-case trigger scenarios not explicitly listed

Example of a 0.70-scoring description:

> "PluginEval quality methodology. Use this skill when understandi