Evaluation Methodology

Name: Evaluation Methodology
Author: wshobson

wshobson/agents

4.8k installs
38.3k repo stars
Updated July 22, 2026

evaluation-methodology is an agent skill for PluginEval quality methodology - dimensions, rubrics, statistical methods, and scoring formulas. Use this skill when u

About

The evaluation-methodology skill covers the PluginEval quality methodology: dimensions, rubrics, statistical methods, and scoring formulas. Use this skill when understanding how plugin quality is measured, when interpreting a low score on a specific dimension, when deciding how to improve a skill's triggering accuracy or orchestration fitness, when calibrating scoring thresholds for your marketplace, or when explaining quality badges to external partners like Neon.


Evaluation Methodology by the numbers

4,791 all-time installs (skills.sh)
+156 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #278 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/wshobson/agents --skill evaluation-methodology

Security audit	3 / 3 scanners passed

How do I run evaluation-methodology tasks with correct setup and documented commands?

PluginEval quality methodology - dimensions, rubrics, statistical methods, and scoring formulas. Use this skill when understanding how plugin quality is measured, when interpreting a low score on a

Who is it for?

Developers automating evaluation methodology via agent-guided SKILL.md workflows.

Skip if: unrelated tooling already covers the task without this skill's documented flow.

What you get

Repeatable evaluation-methodology workflows with grounded commands and expected outputs.

Dimension score interpretation
Rubric calibration notes
Improvement recommendations

By the numbers

3 evaluation layers: static, LLM judge, and Monte Carlo
10 composite scoring dimensions in PluginEval
6 static sub-checks plus 4 judge rubric dimensions with 5 anchors each

Files

SKILL.md (Markdown)

Evaluation Methodology

This document is the authoritative reference for how PluginEval measures plugin and skill quality. It covers the three evaluation layers, all ten scoring dimensions, the composite formula, badge thresholds, anti-pattern flags, Elo ranking, and actionable improvement tips.

Related: Full rubric anchors

---

The Three Evaluation Layers

PluginEval stacks three complementary layers. Each layer produces a score between 0.0 and 1.0 for each applicable dimension, and later layers override or blend with earlier ones according to per-dimension blend weights.

Layer 1 — Static Analysis

Speed: < 2 seconds. No LLM calls. Deterministic.

The static analyzer (layers/static.py) runs six sub-checks directly against the parsed SKILL.md:

Sub-check	What it measures
`frontmatter_quality`	Name presence, description length, trigger-phrase quality
`orchestration_wiring`	Output/input documentation, code block count, orchestrator anti-pattern
`progressive_disclosure`	Line count vs. sweet-spot (200–600 lines), references/ and assets/ bonuses
`structural_completeness`	Heading density, code blocks, examples section, troubleshooting section
`token_efficiency`	MUST/NEVER/ALWAYS density, duplicate-line repetition ratio
`ecosystem_coherence`	Cross-references to other skills/agents, "related"/"see also" mentions

These six sub-checks feed directly into six of the ten final dimensions (via STATIC_TO_DIMENSION mapping). The remaining four dimensions — output_quality, scope_calibration, robustness, and part of triggering_accuracy — receive no static contribution and rely entirely on Layer 2 and/or Layer 3.

Anti-pattern penalty is applied multiplicatively to the Layer 1 score:

penalty = max(0.5, 1.0 − 0.05 × anti_pattern_count)

Each additional detected anti-pattern reduces the score by 5%, flooring at 50%.
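As a quick check of the penalty formula above, here is a minimal sketch in Python. The function name `anti_pattern_penalty` is illustrative and not taken from the PluginEval source; only the formula itself comes from this document.

```python
def anti_pattern_penalty(anti_pattern_count: int) -> float:
    """Multiplier from the formula above: 0.05 off per flag, floored at 0.5."""
    return max(0.5, 1.0 - 0.05 * anti_pattern_count)


# Tabulate the multiplier for a few flag counts.
for count in (0, 1, 5, 10, 15):
    print(count, round(anti_pattern_penalty(count), 2))
```

The floor is reached at 10 flags, so any count beyond that leaves the multiplier at 0.5.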

Layer 2 — LLM Judge

Speed: 30–90 seconds. One or more LLM calls (Sonnet by default). Non-deterministic.

The eval-judge agent reads the SKILL.md and any references/ files, then scores four dimensions using anchored rubrics (see references/rubrics.md):

1. Triggering accuracy: F1 score derived from 10 mental test prompts
2. Orchestration fitness: worker purity assessment (0–1 rubric)
3. Output quality: simulates 3 realistic tasks; assesses instruction quality
4. Scope calibration: judges depth and breadth relative to the skill's category

The judge returns a structured JSON object (no markdown fences) that the eval engine merges into the composite. When judges > 1, scores are averaged and Cohen's kappa is reported as an inter-judge agreement metric.

Layer 3 — Monte Carlo Simulation

Speed: 5–20 minutes. N=50 simulated Agent SDK invocations (default). Statistical.

Monte Carlo runs N real prompts through the skill and records:

Activation rate — Fraction of prompts that triggered the skill
Output consistency — Coefficient of variation (CV) across quality scores
Failure rate — Error/crash fraction with Clopper-Pearson exact CIs
Token efficiency — Median token count, IQR, outlier count
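To make the Clopper-Pearson interval for the failure rate concrete, the sketch below computes it with the standard library only. It is an illustration of the textbook exact method, not PluginEval's own code; the helper names `binom_cdf` and `clopper_pearson` are hypothetical, and the bounds are found by bisection on the binomial tail probabilities.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for k failures in n runs."""

    def bisect(pred: Callable[[float], bool]) -> float:
        # pred(p) holds on the left part of [0, 1]; locate where it flips.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if pred(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound solves P(X >= k | p) = alpha / 2 (tail grows with p).
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2
    )
    # Upper bound solves P(X <= k | p) = alpha / 2 (CDF shrinks with p).
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper


# Example: 2 failed runs out of the default N=50 simulations.
low, high = clopper_pearson(2, 50)
print(f"failure rate 0.04, 95% CI [{low:.4f}, {high:.4f}]")
```

The exact interval is conservative, which suits small N: with zero observed failures it still reports a non-zero upper bound rather than claiming a failure rate of exactly 0.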

The Layer 3 composite formula:

mc_score = 0.40 × activation_rate
         + 0.30 × (1 − min(1.0, CV))
         + 0.20 × (1 − failure_rate)
         + 0.10 × efficiency_norm

where efficiency_norm = max(0, 1 − median_tokens / 8000).
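The Layer 3 formula above can be sketched as a single function. The function and parameter names are illustrative; the weights and the 8000-token normalizer are the ones stated in this document.

```python
def mc_score(
    activation_rate: float,
    cv: float,
    failure_rate: float,
    median_tokens: float,
    token_budget: float = 8000.0,
) -> float:
    """Layer 3 composite: activation, consistency, reliability, efficiency."""
    efficiency_norm = max(0.0, 1.0 - median_tokens / token_budget)
    return (
        0.40 * activation_rate
        + 0.30 * (1.0 - min(1.0, cv))   # CV is capped at 1.0 before inverting
        + 0.20 * (1.0 - failure_rate)
        + 0.10 * efficiency_norm
    )


# Example inputs (made up for illustration): 90% activation, CV of 0.2,
# 4% failures, median of 2000 tokens per run.
print(round(mc_score(0.90, 0.20, 0.04, 2000), 3))  # 0.867
```

Because activation carries 40% of the weight, a skill that rarely fires cannot score well here even with perfectly consistent output.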

---

Composite Scoring Formula

The final score is a weighted blend across all three layers for each dimension, then summed:

composite = Σ(dimension_weight × blended_dimension_score) × 100 × anti_pattern_penalty

Dimension Weights

Dimension	Weight	Why it matters
`triggering_accuracy`	0.25	A skill that never fires — or fires incorrectly — has no value
`orchestration_fitness`	0.20	Skills must be pure workers; supervisor logic belongs in agents
`output_quality`	0.15	Correct, complete output is the primary deliverable
`scope_calibration`	0.12	Neither a stub nor a bloated monster
`progressive_disclosure`	0.10	SKILL.md is lean; detail lives in references/
`token_efficiency`	0.06	Minimal context waste per invocation
`robustness`	0.05	Handles edge cases without crashing
`structural_completeness`	0.03	Correct sections in the right order
`code_template_quality`	0.02	Working, copy-paste-ready examples
`ecosystem_coherence`	0.02	Cross-references; no duplication with siblings

Layer Blend Weights

Each dimension draws from different layers at different ratios. With all three layers active (--depth deep or certify):

Dimension	Static	Judge	Monte Carlo
`triggering_accuracy`	0.15	0.25	0.60
`orchestration_fitness`	0.10	0.70	0.20
`output_quality`	0.00	0.40	0.60
`scope_calibration`	0.30	0.55	0.15
`progressive_disclosure`	0.80	0.20	0.00
`token_efficiency`	0.40	0.10	0.50
`robustness`	0.00	0.20	0.80
`structural_completeness`	0.90	0.10	0.00
`code_template_quality`	0.30	0.70	0.00
`ecosystem_coherence`	0.85	0.15	0.00

At --depth standard (static + judge only), blends are renormalized to drop the Monte Carlo column. At --depth quick (static only), all weight falls on Layer 1.

Blended Score Calculation

For a given depth, the blended score for dimension d is:

blended[d] = Σ( layer_weight[d][layer] × layer_score[d][layer] )
             ─────────────────────────────────────────────────────
             Σ( layer_weight[d][layer] for available layers )

This normalization ensures that skipping Monte Carlo at standard depth doesn't artificially deflate scores.
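The renormalization can be sketched as follows, using the blend table above. This is an illustration, not the PluginEval implementation: the names `BLEND` and `blended_score` are hypothetical, and returning `None` when no available layer carries weight (for example `output_quality` at quick depth) is an assumption, since the document does not say how that case is handled.

```python
from typing import Dict, Optional

# Layer blend weights copied from the table above.
BLEND: Dict[str, Dict[str, float]] = {
    "triggering_accuracy":     {"static": 0.15, "judge": 0.25, "monte_carlo": 0.60},
    "orchestration_fitness":   {"static": 0.10, "judge": 0.70, "monte_carlo": 0.20},
    "output_quality":          {"static": 0.00, "judge": 0.40, "monte_carlo": 0.60},
    "scope_calibration":       {"static": 0.30, "judge": 0.55, "monte_carlo": 0.15},
    "progressive_disclosure":  {"static": 0.80, "judge": 0.20, "monte_carlo": 0.00},
    "token_efficiency":        {"static": 0.40, "judge": 0.10, "monte_carlo": 0.50},
    "robustness":              {"static": 0.00, "judge": 0.20, "monte_carlo": 0.80},
    "structural_completeness": {"static": 0.90, "judge": 0.10, "monte_carlo": 0.00},
    "code_template_quality":   {"static": 0.30, "judge": 0.70, "monte_carlo": 0.00},
    "ecosystem_coherence":     {"static": 0.85, "judge": 0.15, "monte_carlo": 0.00},
}


def blended_score(dimension: str, layer_scores: Dict[str, float]) -> Optional[float]:
    """Weighted mean over the layers that actually ran for this dimension."""
    weights = BLEND[dimension]
    available = {
        layer: w for layer, w in weights.items() if layer in layer_scores and w > 0
    }
    total = sum(available.values())
    if total == 0:
        return None  # assumption: no contributing layer means no score
    return sum(w * layer_scores[layer] for layer, w in available.items()) / total


# Standard depth: static and judge ran, Monte Carlo did not.
print(blended_score("triggering_accuracy", {"static": 0.8, "judge": 0.6}))
# Quick depth: output_quality has no static weight, so nothing contributes.
print(blended_score("output_quality", {"static": 0.8}))
```

In the first call the weights 0.15 and 0.25 are rescaled to sum to 1, giving (0.12 + 0.15) / 0.40 = 0.675 rather than the deflated 0.27 an unnormalized sum would produce.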

---

Interpreting Dimension Scores

Each dimension score is a float in [0.0, 1.0]. The CLI converts it to a letter grade:

Grade	Score range	Meaning
A	0.90 – 1.00	Excellent — no meaningful improvement needed
B	0.80 – 0.89	Good — minor gaps only
C	0.70 – 0.79	Adequate — one or two clear improvement areas
D	0.60 – 0.69	Marginal — needs targeted work
F	< 0.60	Failing — significant remediation required

When reading a report, focus first on the lowest-graded dimension that has the highest weight. A D in triggering_accuracy (weight 0.25) costs far more than a D in ecosystem_coherence (weight 0.02).
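One way to act on this advice is to rank dimensions by weighted shortfall, weight × (1 − score). The sketch below is a heuristic of my own for illustration, not something the CLI is documented to compute; `letter_grade` and `prioritize` are hypothetical names, and treating the grade boundaries as inclusive lower bounds is an assumption.

```python
from typing import Dict, List, Tuple

WEIGHTS: Dict[str, float] = {
    "triggering_accuracy": 0.25, "orchestration_fitness": 0.20,
    "output_quality": 0.15, "scope_calibration": 0.12,
    "progressive_disclosure": 0.10, "token_efficiency": 0.06,
    "robustness": 0.05, "structural_completeness": 0.03,
    "code_template_quality": 0.02, "ecosystem_coherence": 0.02,
}


def letter_grade(score: float) -> str:
    """Map a [0.0, 1.0] dimension score to the grade table above."""
    for grade, floor in (("A", 0.90), ("B", 0.80), ("C", 0.70), ("D", 0.60)):
        if score >= floor:
            return grade
    return "F"


def prioritize(scores: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Sort dimensions by composite points lost: weight * (1 - score) * 100."""
    ranked = [
        (dim, letter_grade(s), round(WEIGHTS[dim] * (1.0 - s) * 100, 2))
        for dim, s in scores.items()
    ]
    return sorted(ranked, key=lambda row: row[2], reverse=True)


# Two dimensions with the same D grade but very different weights.
for row in prioritize({"triggering_accuracy": 0.65, "ecosystem_coherence": 0.65}):
    print(row)
```

Both dimensions grade D, yet the triggering gap costs 8.75 composite points against 0.7 for ecosystem coherence, which is the asymmetry the paragraph above describes.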

Confidence intervals appear in the report when Layer 2 or Layer 3 has run. Narrow CIs (narrower than ±5 points) indicate stable scores. Wide CIs suggest inconsistency, often caused by an ambiguous description or instructions that work for some prompt styles but not others.

---

Quality Badges

Badges require both a composite score threshold AND an Elo threshold (when Elo is available). The Badge.from_scores() logic checks composite first, then Elo if provided:

Badge	Composite	Elo	Meaning
Platinum ★★★★★	≥ 90	≥ 1600	Reference quality — suitable for gold corpus
Gold ★★★★	≥ 80	≥ 1500	Production ready
Silver ★★★	≥ 70	≥ 1400	Functional, has improvement opportunities
Bronze ★★	≥ 60	≥ 1300	Minimum viable — not yet recommended for users
—	< 60	any	Does not meet minimum bar

The Elo threshold is skipped when Elo has not been computed (i.e., at quick or standard depth without certify). A skill can earn a badge on composite score alone in those cases.
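The badge table can be read as a top-down tier lookup. The sketch below is not the actual `Badge.from_scores()` code, which this document does not show; `badge_for` is a hypothetical stand-in. It assumes that a skill missing the Elo bar for one tier falls through to the next tier where both thresholds are met, which the document does not confirm.

```python
from typing import Optional

# (badge, minimum composite, minimum Elo), highest tier first.
TIERS = [
    ("Platinum", 90, 1600),
    ("Gold", 80, 1500),
    ("Silver", 70, 1400),
    ("Bronze", 60, 1300),
]


def badge_for(composite: float, elo: Optional[float] = None) -> Optional[str]:
    """Return the highest tier whose thresholds are met, or None."""
    for name, min_composite, min_elo in TIERS:
        # The Elo check is skipped when no Elo rating was computed.
        if composite >= min_composite and (elo is None or elo >= min_elo):
            return name
    return None


print(badge_for(76.5))        # composite only (quick or standard depth)
print(badge_for(92, 1650))    # both thresholds met for the top tier
print(badge_for(92, 1450))    # assumption: falls through to a lower tier
print(badge_for(59))          # below the minimum bar
```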

---

Anti-Pattern Flags

The static analyzer detects the six anti-patterns below. Each carries a severity multiplier that feeds into the penalty formula.

OVER_CONSTRAINED

Trigger: More than 15 occurrences of MUST, ALWAYS, or NEVER in the SKILL.md.

Problem: Overly prescriptive instructions reduce model flexibility, increase token overhead, and signal that the author is trying to micromanage every output rather than providing principled guidance.

Fix: Audit every MUST/ALWAYS/NEVER. Replace directive language with explanatory framing where possible. Reserve hard constraints for genuine safety or correctness requirements. Target fewer than 10 such directives per 100 lines.
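A rough self-audit for this flag can be scripted before running the analyzer. The sketch below is an approximation, not the analyzer's own check: matching only upper-case whole words is an assumption, and `directive_stats` is a hypothetical name.

```python
import re
from typing import Tuple

# Assumption: only upper-case, whole-word directives are counted.
DIRECTIVE = re.compile(r"\b(?:MUST|ALWAYS|NEVER)\b")


def directive_stats(skill_md: str) -> Tuple[int, float]:
    """Return (directive count, directives per 100 lines)."""
    line_count = max(1, len(skill_md.splitlines()))
    count = len(DIRECTIVE.findall(skill_md))
    return count, 100.0 * count / line_count


sample = (
    "You MUST validate input.\n"
    "NEVER skip the schema check.\n"
    "A plain line.\n"
    "This must stay lowercase.\n"
)
count, per_100 = directive_stats(sample)
print(count, per_100, "OVER_CONSTRAINED" if count > 15 else "ok")
```

On a real file, read the text with `pathlib.Path("SKILL.md").read_text()` and compare the density against the target of fewer than 10 per 100 lines.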

EMPTY_DESCRIPTION

Trigger: The frontmatter description field is fewer than 20 characters after stripping.

Problem: Without a meaningful description, the Claude Code plugin system cannot determine when to invoke the skill. The skill becomes invisible to autonomous invocation.

Fix: Write a description of 60–120 characters or more that includes:

A "Use this skill when..." or "Use when..." trigger clause
Two or more concrete contexts separated by commas or "or"

MISSING_TRIGGER

Trigger: The description does not contain "use when", "use this skill when", "use proactively", or "trigger when" (case-insensitive).

Problem: Even a long description is useless for autonomous invocation if it doesn't include a clear trigger signal. The system's routing model needs an explicit cue.

Fix: Prepend "Use this skill when..." to the description, followed by specific scenarios. Example: "Use this skill when measuring plugin quality, interpreting score reports, or explaining badge thresholds to a team."
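The two description checks above (EMPTY_DESCRIPTION and MISSING_TRIGGER) can be approximated with a short script. This is a sketch built from the trigger conditions as stated in this document, not the analyzer's source; `description_flags` is a hypothetical name.

```python
import re
from typing import List

# The four trigger phrases listed above, matched case-insensitively.
TRIGGER = re.compile(
    r"use when|use this skill when|use proactively|trigger when",
    re.IGNORECASE,
)


def description_flags(description: str) -> List[str]:
    """Return the description-related anti-pattern flags that would apply."""
    flags = []
    if len(description.strip()) < 20:
        flags.append("EMPTY_DESCRIPTION")
    if not TRIGGER.search(description):
        flags.append("MISSING_TRIGGER")
    return flags


print(description_flags("A skill about evaluation"))
print(description_flags(
    "Use this skill when measuring plugin quality, interpreting score "
    "reports, or explaining badge thresholds to a team."
))
```

The first description is long enough to avoid EMPTY_DESCRIPTION but still lacks a trigger phrase; the second, taken from the example above, passes both checks.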

BLOATED_SKILL

Trigger: SKILL.md exceeds 800 lines AND the skill has no references/ directory.

Problem: A monolithic SKILL.md forces the entire document into context on every invocation, wasting tokens on content only needed in edge cases.

Fix: Create a references/ directory and move supporting material there:

Detailed rubrics → references/rubrics.md
Extended examples → references/examples.md
Configuration reference → references/config.md

The SKILL.md should link to these files with [text](references/filename.md) so the model can fetch them on demand.

ORPHAN_REFERENCE

Trigger: SKILL.md contains a markdown link [text](references/filename) where filename does not exist in the references/ directory.

Problem: Dead links waste tokens on context that will never resolve and confuse the model.

Fix: Either create the missing reference file or remove the dead link.
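Dead reference links can be found before the analyzer runs. The sketch below takes the SKILL.md text and the set of file names present in references/, which keeps it testable without touching the disk; the regex and the name `orphan_references` are illustrative assumptions rather than the analyzer's logic.

```python
import re
from typing import List, Set

# Captures the file name in markdown links of the form [text](references/name).
REFERENCE_LINK = re.compile(r"\[[^\]]*\]\(references/([^)\s#]+)")


def orphan_references(skill_md: str, existing_files: Set[str]) -> List[str]:
    """Return linked reference file names that are not in existing_files."""
    linked = set(REFERENCE_LINK.findall(skill_md))
    return sorted(linked - existing_files)


text = "See [rubrics](references/rubrics.md) and [examples](references/examples.md)."
print(orphan_references(text, {"rubrics.md"}))
```

On a real skill, the set can be built with `{p.name for p in pathlib.Path("references").iterdir()}`.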

DEAD_CROSS_REF

Trigger: SKILL.md references another skill or agent by relative path and that path cannot be resolved from the skills/ directory.

Problem: Broken ecosystem links undermine the plugin's coherence score and may cause the model to attempt navigation to non-existent files.

Fix: Verify the referenced skill exists. Update the path or remove the reference.

---

Elo Ranking

PluginEval uses an Elo/Bradley-Terry rating system to rank a skill against the gold corpus.

Starting rating: 1500 (the corpus median by convention).

K-factor: 32 (standard for moderate-stakes ratings).

Expected score formula (standard Elo):

E(A vs B) = 1 / (1 + 10^((B_rating − A_rating) / 400))

Rating update after each matchup:

new_rating = old_rating + 32 × (actual_score − expected_score)

where actual_score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

Confidence intervals are computed via 500-sample bootstrap, reported as 95% CI. Corpus percentile reflects pairwise win rate against the gold corpus. Position bias check: Pairs are evaluated in both orders; disagreements are flagged.
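The two Elo formulas above translate directly into code. The function names are illustrative; the 400-point scale and K-factor of 32 are the values stated in this section.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def updated_rating(
    old_rating: float, opponent_rating: float, actual_score: float, k: float = 32.0
) -> float:
    """New rating after one matchup; actual_score is 1.0, 0.5, or 0.0."""
    return old_rating + k * (actual_score - expected_score(old_rating, opponent_rating))


# A new skill at the starting rating beats a corpus skill rated the same.
print(expected_score(1500, 1500))        # 0.5
print(updated_rating(1500, 1500, 1.0))   # 1516.0
```

A win against an equally rated opponent moves the rating by half the K-factor; a win against a much stronger opponent moves it by nearly the full 32 points.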

The plugin-eval init command builds the corpus index from a plugins directory:

plugin-eval init ./plugins --corpus-dir ~/.plugineval/corpus

---

CLI Reference

Score a skill (quick static analysis only)

plugin-eval score ./path/to/skill --depth quick

Returns Layer 1 results in < 2 seconds. Useful for fast feedback during authoring.

Score with LLM judge (default)

plugin-eval score ./path/to/skill

Runs static + LLM judge (standard depth). Takes 30–90 seconds.

Score with full output as JSON

plugin-eval score ./path/to/skill --output json

Emits structured JSON including composite.score, composite.dimensions, and layers[0].anti_patterns. Suitable for CI integration:

plugin-eval score ./path/to/skill --depth quick --output json --threshold 70
# exits with code 1 if score < 70

Full certification (all three layers + Elo)

plugin-eval certify ./path/to/skill

Runs static + LLM judge + Monte Carlo (50 simulations) + Elo ranking. Takes 15–20 minutes. Assigns a quality badge. Use before publishing a skill to the marketplace.

Head-to-head comparison

plugin-eval compare ./skill-a ./skill-b

Evaluates both skills at quick depth and prints a dimension-by-dimension comparison table. Useful for deciding between two implementations or measuring improvement before/after a rewrite.

Initialize corpus for Elo

plugin-eval init ./plugins

Builds the local corpus index at ~/.plugineval/corpus. Required before Elo ranking works.

Scripting the Composite Formula

Reproduce the composite score offline (pre-commit hook, CI gate):

def composite_score(dimension_scores: dict, anti_pattern_count: int = 0) -> float:
    """Replicate the PluginEval composite formula."""
    WEIGHTS = {
        "triggering_accuracy":    0.25,
        "orchestration_fitness":  0.20,
        "output_quality":         0.15,
        "scope_calibration":      0.12,
        "progressive_disclosure": 0.10,
        "token_efficiency":       0.06,
        "robustness":             0.05,
        "structural_completeness":0.03,
        "code_template_quality":  0.02,
        "ecosystem_coherence":    0.02,
    }
    raw = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    penalty = max(0.5, 1.0 - 0.05 * anti_pattern_count)
    return round(raw * 100 * penalty, 2)

# Example: a skill with a weak triggering score
scores = {
    "triggering_accuracy":    0.65,  # D — needs description work
    "orchestration_fitness":  0.85,
    "output_quality":         0.80,
    # … fill in remaining 7 dimensions …
}
# composite_score(scores, anti_pattern_count=1) → ~76.5 once all ten dimensions are
# filled in; the exact value depends on the seven scores elided above.

JSON Output Format

Top-level shape of --output json:

{
  "composite": { "score": 76.5, "badge": "Silver", "elo": null },
  "dimensions": {
    "triggering_accuracy": { "score": 0.65, "grade": "D", "ci_low": 0.60, "ci_high": 0.70 },
    "orchestration_fitness": { "score": 0.85, "grade": "B", "ci_low": 0.80, "ci_high": 0.90 }
  },
  "layers": [
    { "name": "static", "duration_ms": 1243, "anti_patterns": ["OVER_CONSTRAINED"] },
    { "name": "judge", "duration_ms": 48200, "judges": 1, "kappa": null }
  ]
}

Parse composite.score in CI to gate deployments:

score=$(plugin-eval score ./my-skill --output json | python3 -c "import sys,json; print(json.load(sys.stdin)['composite']['score'])")
if (( $(echo "$score < 70" | bc -l) )); then
  echo "Quality gate failed: score $score < 70"
  exit 1
fi

---

Tips for Improving a Skill's Score

Work through dimensions in weight order. The largest gains come from fixing the top-weighted dimensions first.

Which Dimension to Improve First

Use this table when a score report shows multiple D/F grades and you need to prioritize effort.

Dimension	Weight	Typical fix effort	Score impact / hour	Fix first if…
`triggering_accuracy`	0.25	Low — description rewrite	High	Score < 70 overall
`orchestration_fitness`	0.20	Medium — restructure sections	High	Skill mixes worker + supervisor logic
`output_quality`	0.15	Medium — add examples	Medium	Judge score < 0.70
`scope_calibration`	0.12	Low — move content to references/	Medium	File is < 100 or > 800 lines
`progressive_disclosure`	0.10	Low — create references/ dir	Medium	No references/ directory exists
`token_efficiency`	0.06	Low — reduce MUST/ALWAYS/NEVER	Low	Anti-pattern count ≥ 3
`robustness`	0.05	Low — add Troubleshooting section	Low	No edge-case handling documented
`structural_completeness`	0.03	Very low — add headings/code blocks	Low	Fewer than 4 H2 headings
`code_template_quality`	0.02	Very low — add language tags	Very low	Code blocks missing language tags
`ecosystem_coherence`	0.02	Very low — add Related section	Very low	No cross-references at all

Rule of thumb: Fix triggering_accuracy before anything else — at weight 0.25 it delivers more composite-score gain per hour than all low-weight dimensions combined.

Triggering Accuracy (weight 0.25)

Include "Use this skill when..." followed by 3–4 comma-separated specific contexts.
Add "proactively" if the skill should auto-activate without an explicit user request.
Mental test: write 5 prompts that should trigger it and 5 that should not. Does your description discriminate? If not, add or tighten the context phrases.

Orchestration Fitness (weight 0.20)

Document what the skill receives and what it returns — not what it orchestrates.
Avoid "orchestrate", "coordinate", "dispatch", "manage workflow" in SKILL.md.
Include an "Output format" section and 2+ code blocks showing concrete worker behavior.

Output Quality (weight 0.15)

Give specific, actionable instructions — not just goals.
Cover at least one edge case explicitly (empty input, malformed data, etc.).
Include an examples section showing representative inputs and expected outputs.
The more concrete the instructions, the higher the judge will score this dimension.

Scope Calibration (weight 0.12)

Target 200–600 lines. Below 100 is a stub; above 800 without references/ is bloat.
Move background reading, extended examples, and reference tables to references/.
Very narrow skills should be merged with a sibling; very broad ones should be split.

Progressive Disclosure (weight 0.10)

Add a references/ directory (earns 0.15–0.25 bonus) and keep SKILL.md focused on the execution path. An assets/ directory adds a further bonus.

Token Efficiency (weight 0.06)

Audit MUST/ALWAYS/NEVER count. Target < 1 per 10 lines.
Consolidate near-duplicate bullet points and repeated-structure tables.

Robustness (weight 0.05)

Add a "Troubleshooting" or "Edge Cases" section covering at least 3 failure modes.
State what the skill returns when it cannot complete its task.

Structural Completeness (weight 0.03)

Ensure at least 4 H2/H3 headings, 3 code blocks, an Examples section, and a Troubleshooting section.

Code Template Quality (weight 0.02)

All code blocks must be syntactically valid and copy-paste ready with language tags.

Ecosystem Coherence (weight 0.02)

Add a "## Related" section listing sibling skills or agents with relative paths.
Avoid duplicating content that already exists in another skill — link to it instead.

---

Troubleshooting

"Score is much lower than expected after adding content"

The anti-pattern penalty accumulates with every flag. Run with --output json and inspect layers[0].anti_patterns. With 5 anti-patterns the multiplier is 0.75, so the composite falls to 75% of its raw value regardless of how good the content is, and further flags push it down toward the 0.50 floor. Fix the flags first.

"triggering_accuracy is low despite a detailed description"

The _description_pushiness scorer looks for specific syntactic patterns, not just length. Verify your description contains the phrase "Use this skill when" or "Use when" (exact phrasing matters — it's a regex match). Also check that you have multiple use cases separated by commas or "or" to earn the specificity bonus.

"LLM judge scores vary significantly between runs"

This is expected for ambiguous skills. The judge generates 10 mental test prompts non-deterministically. Improve score stability by tightening the description and adding concrete examples. When judges > 1, averaged scores will be more stable. For statistically bounded scores, run Monte Carlo via --depth deep or certify.

"progressive_disclosure score is low even though the file is the right length"

Check whether the file is in the 200–600 line sweet spot. Files shorter than 100 lines score only 0.20 on this sub-check. Also confirm that references/ files are not empty — the scorer checks for non-empty reference files, not just the directory.

"compare shows my rewrite scores lower than the original"

Quick depth (--depth quick) only runs static analysis. If the rewrite moved content to references/ and shortened SKILL.md significantly, static scores for structural completeness may drop even though overall quality improved. Run --depth standard for a fairer comparison that includes the LLM judge's assessment of content quality.

---

References

Full Rubric Anchors — all 4 judge dimensions

Related Agents

eval-judge (../../agents/eval-judge.md): the LLM judge that scores Layer 2 dimensions (triggering_accuracy, orchestration_fitness, output_quality, scope_calibration). Invoke directly when you need to re-run only the judge layer or inspect its reasoning.

eval-orchestrator (../../agents/eval-orchestrator.md): the top-level orchestrator that sequences all three layers, merges results, assigns badges, and writes the final report. Invoke when running a full certification pass or comparing two skills head-to-head.

Judge Rubrics — Anchored Scoring Reference

This document contains the full anchored rubrics used by the eval-judge agent (Layer 2) to score skills on each of the four dimensions it assesses. Each dimension uses a 0.0–1.0 scale with five anchor points. The judge interpolates between anchors based on the evidence gathered from reading SKILL.md and any references/ files.

These rubrics are the authoritative scoring standard. When calibrating expectations, filing score disputes, or training new judge models, use these anchors as ground truth.

---

Dimension 1 — Triggering Accuracy

Weight in composite: 0.25 (highest)

Layer blend (deep depth): static 15%, judge 25%, Monte Carlo 60%

What is being measured

Triggering accuracy measures whether the skill's description field in the frontmatter causes Claude Code to invoke the skill at the right times. A skill with perfect triggering accuracy fires on every prompt that genuinely needs it (high recall) and never fires on prompts where it is irrelevant (high precision). The score is conceptually the F1 of precision and recall across a representative prompt distribution.

How the judge scores it

The judge generates 10 mental test prompts: 5 that should trigger the skill and 5 that should not. It assesses whether the description would lead Claude Code's routing model to activate (or not activate) for each prompt. The F1 score of this 10-prompt evaluation becomes the dimension score.

The judge also considers whether the description provides actionable trigger signals rather than just naming or describing the skill in passive terms.
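The F1 calculation over the 10 test prompts can be sketched as below. The judge performs this assessment mentally, so the code only illustrates the arithmetic; `f1_from_prompts` is a hypothetical name, and returning 0.0 when there are no true positives is an assumption.

```python
from typing import Sequence


def f1_from_prompts(
    should_trigger: Sequence[bool], would_trigger: Sequence[bool]
) -> float:
    """F1 of predicted activation against intended activation."""
    pairs = list(zip(should_trigger, would_trigger))
    tp = sum(1 for s, w in pairs if s and w)
    fp = sum(1 for s, w in pairs if not s and w)
    fn = sum(1 for s, w in pairs if s and not w)
    if tp == 0:
        return 0.0  # assumption: no correct activations scores zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 5 prompts that should trigger, then 5 that should not.
should = [True] * 5 + [False] * 5
# The description fires on 4 of the 5 positives and on 1 negative.
would = [True, True, True, True, False, True, False, False, False, False]
print(round(f1_from_prompts(should, would), 2))  # 0.8
```

One missed positive and one false positive give precision and recall of 0.8 each, so the dimension score lands at 0.8.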

Anchored Rubric

0.0 – 0.19 (Grade F) — Unusable trigger

The description is absent, empty, or so vague that it provides no routing signal. Examples:

Description is under 10 characters
Description is just the skill name: "evaluation-methodology"
Description describes what the skill is, not when to use it: "A skill about evaluation"
Description uses entirely passive language with no conditional framing

A skill at this level will almost never be autonomously invoked. It may be invoked if the user explicitly names it, but that defeats the purpose of a plugin ecosystem.

0.20 – 0.39 (Grade F/D) — Weak trigger

The description exists and is somewhat meaningful but has major gaps:

Mentions the domain but lacks trigger phrases ("Use when..." or similar)
Trigger language is present but maps to only one narrow use case
Description would trigger the skill on clearly wrong prompts (precision failure)
Description would miss 3+ of the 5 should-trigger test prompts (recall failure)

Example of a 0.30-scoring description:

"PluginEval quality methodology — dimensions, rubrics, statistical methods."

This names the topic but provides no trigger signal. The routing model cannot infer when to use it.

0.40 – 0.59 (Grade D/C) — Partial trigger

The description has some trigger signal but is imprecise:

Contains "Use when" but only one specific context
Would correctly handle 3 of 5 should-trigger prompts
Some false positives — would fire for adjacent but wrong use cases
Trigger phrase is generic ("Use when working with evaluations") rather than specific

Example of a 0.50-scoring description:

"PluginEval quality methodology — dimensions, rubrics. Use when understanding evaluation."

Better — has a trigger phrase — but "understanding evaluation" is too generic. It would catch some legitimate uses but also fire for unrelated evaluation tasks.

0.60 – 0.79 (Grade C/B) — Good trigger

Description clearly identifies when to invoke the skill with only minor gaps:

Contains "Use when..." or "Use this skill when..." with at least two specific contexts
Would correctly handle 4 of 5 should-trigger prompts
Precision is good (few false positives)
May miss edge-case trigger scenarios not explicitly listed

Example of a 0.70-scoring description:

"PluginEval quality methodology. Use this skill when understanding how plugin quality is measured or when interpreting evaluation results."

Good — two explicit trigger contexts — but misses calibration and stakeholder scenarios.

0.80 – 1.00 (Grade A/B) — Excellent trigger

Description is precise and comprehensive:

Contains "Use when..." or "Use this skill when..." with 3+ specific, distinct contexts
Would correctly handle all 5 should-trigger prompts
Would correctly NOT trigger on all 5 should-not prompts
Contexts are concrete and discriminative (not "when evaluating" but "when interpreting dimension scores and letter grades" or "when calibrating scoring thresholds")

Optionally includes "proactively" for skills that should auto-activate

Example of a 0.90-scoring description:

"PluginEval quality methodology - dimensions, rubrics, statistical methods. Use this skill when understanding how plugin quality is measured, interpreting evaluation results, calibrating scoring thresholds, or explaining quality badges to stakeholders."

Four specific, distinct contexts. Fires on exactly the right prompts.

Good Trigger Description Patterns

Start with a one-sentence summary of what the skill covers
Follow immediately with "Use this skill when..." and list 3+ concrete scenarios
Name specific technologies, output types, or file formats when relevant
Disambiguate from adjacent skills (e.g., "when interpreting results, not when running evaluations; use the eval command for that")

Keep the total description under 200 characters for clean display in the CLI

Common Mistakes

Using "Use when interpreting results" — too generic; results of what?
Listing only one trigger context — needs 3+ to score above 0.70
Passive descriptions ("This skill covers...") that never state when to use the skill
Combining trigger and description without separating them clearly

---

Dimension 2 — Orchestration Fitness

Weight in composite: 0.20 (second highest)

Layer blend (deep depth): static 10%, judge 70%, Monte Carlo 20%

What is being measured

Orchestration fitness measures whether a skill behaves as a pure worker in the agent → skill hierarchy. A skill should receive a delegated task, execute it using its own instructions, and return structured output. It should NOT:

Make decisions about which other tools or skills to call
Manage multi-step workflows across multiple agents
Act as a supervisor that delegates to sub-workers
Contain conditional orchestration logic

This dimension is predominantly judge-assessed (70% judge weight) because static analysis cannot reliably detect orchestration intent from surface patterns alone.

How the judge scores it

The judge reads the SKILL.md in full and asks: does this skill's instruction set define a worker (receives task → executes → returns output) or an orchestrator (plans → delegates → aggregates)? It looks for specific signals in both directions.

Worker signals (positive):

Documents what it receives (inputs/parameters)
Documents what it returns (output format, structure)
Instructions are self-contained execution steps
Code blocks show the skill doing work, not calling other skills
Scoped, focused responsibilities

Orchestrator signals (negative):

Uses words like "orchestrate", "coordinate", "dispatch", "delegate", "manage workflow"
Contains logic like "if X, call skill Y; if Z, call agent W"
Describes itself as a "supervisor" or "orchestrator"
Output is routing decisions rather than execution results
References multiple external agents by name in a decision tree

Anchored Rubric

0.0 – 0.19 (Grade F) — Standalone agent

The skill is written as a fully autonomous agent that manages its own tool calls, sub-task delegation, and workflow coordination. It has no defined input/output contract. It reads like an agent system prompt, not a worker instruction set.

Example characteristics:

"You will first assess the situation, then call the appropriate specialist..."
Dispatches to other skills based on internal logic
Has no "Input:" or "Output:" sections
Describes a complete agentic loop

0.20 – 0.39 (Grade F/D) — Mixed roles

The skill mixes worker and orchestrator responsibilities. It does some work itself but also contains orchestration logic. The boundaries are unclear.

Example characteristics:

Has an output format but also contains "if the user asks for X, also invoke Y"
Worker sections mixed with supervisor-style conditional routing
Returns both results and routing recommendations
Ambiguous whether it executes or coordinates

0.40 – 0.59 (Grade D/C) — Functional worker with structural issues

The skill is mostly a worker but the output format is not structured for supervisor consumption. The calling agent cannot easily parse or route on the output.

Example characteristics:

Produces narrative/prose output rather than structured data
No explicit output format documentation
Assumes the calling agent "just knows" what to do with the result
Instructions are adequate for execution but not for composability

0.60 – 0.79 (Grade C/B) — Clean worker, minor gaps

The skill functions as a clean worker. Inputs and outputs are documented. The instructions produce output that a supervisor agent can consume. Minor issues remain.

Example characteristics:

Has input and output documentation, but output schema could be more explicit
Instructions are worker-style throughout with only one or two ambiguous lines
Code blocks show worker behavior but coverage is incomplete
No orchestration language but also no explicit composability design

0.80 – 1.00 (Grade A/B) — Pure worker

The skill is a composable, contract-defined worker. It is clear what it takes in and what it produces. The output format is specified in a way that a calling agent can rely on.

Example characteristics:

Explicit "## Input" and "## Output" or "## Returns" sections
Output format is structured (JSON schema, typed fields, or clearly specified markdown)
Instructions are execution steps with no decision-tree routing to external services
Code blocks demonstrate realistic worker behavior
Skill is designed to be called repeatedly with different inputs
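One way to self-check the first characteristic above before submitting a skill is to look for the contract headings directly. The sketch below is a hypothetical helper, not part of PluginEval. It assumes second-level Markdown headings named exactly "Input" and "Output" or "Returns", as in the rubric wording.

```python
import re

# Matches second-level Markdown headings only ("## Title"), one per line.
_H2 = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def has_io_contract(skill_text: str) -> dict[str, bool]:
    """Report whether explicit input and output sections are present."""
    headings = {title.strip().lower() for title in _H2.findall(skill_text)}
    return {
        "input": "input" in headings,
        "output": bool(headings & {"output", "returns"}),
    }


# Made-up SKILL.md fragment for illustration.
sample = """# Lint Report

## Input
A file path and an optional rule set name.

## Returns
A JSON object with an `issues` array.
"""
print(has_io_contract(sample))
```

Passing this check says nothing about whether the documented output is actually structured. It only confirms that the sections exist for the judge to read.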

Good Signals vs. Bad Signals

Good signals (push score up):

Documents expected inputs and output format explicitly
Produces artifacts a supervisor agent can consume without parsing prose
Uses imperative instructions ("Analyze X and return Y"), not conditional delegation
Has 2+ code blocks showing concrete worker behavior
Output format section uses a schema, template, or typed field list

Bad signals (push score down):

Contains "orchestrate", "coordinate", "dispatch" in instruction text
References other skills as execution dependencies (not just "see also")
Manages multi-step workflows that span multiple tool boundaries internally
Output is described as "a comprehensive report" with no structure specification
Skill tells the model to "decide" what to do next rather than do the work

Common Mistakes

Documenting what the skill "does" without specifying what it "returns"
Including "Related skills" sections that imply the skill will call them
Writing instructions as if the skill controls the entire conversation
Mixing the worker's execution logic with stakeholder communication steps

---

Dimension 3 — Output Quality

Weight in composite: 0.15 (third highest)

Layer blend (deep depth): static 0%, judge 40%, Monte Carlo 60%

What is being measured

Output quality measures whether the skill's instructions would guide Claude to produce correct, complete, and useful output across a representative range of real-world tasks. This dimension is entirely empirical — static analysis cannot assess whether instructions will produce quality outputs, so the layer blend is 0% static.

At deep depth, Monte Carlo simulation (60% blend) produces actual outputs from real prompts and scores them. At standard depth (judge only), the judge simulates three tasks mentally.

How the judge scores it

The judge selects three realistic tasks that the skill is designed to handle — varying from simple to complex. For each task, it mentally executes the skill's instructions and assesses whether the resulting output would be:

Correct — factually accurate, technically valid
Complete — covers all aspects the task requires
Useful — actionable, well-formatted, appropriate length

The average across three tasks becomes the dimension score.
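The averaging step can be sketched as follows. The document states only that the average across the three tasks becomes the dimension score. Rating each criterion on a 0.0 to 1.0 scale and averaging the three criteria into a per-task score are assumptions made here, and all ratings are invented.

```python
from statistics import mean

# Hypothetical judge ratings for three tasks of increasing complexity.
task_ratings = [
    {"correct": 0.9, "complete": 0.8, "useful": 0.7},  # simple task
    {"correct": 0.8, "complete": 0.6, "useful": 0.7},  # moderate task
    {"correct": 0.6, "complete": 0.5, "useful": 0.4},  # complex task
]


def judge_output_quality(ratings: list[dict[str, float]]) -> float:
    """Average the per-task scores; each per-task score is a criterion mean."""
    per_task = [mean(task.values()) for task in ratings]
    return mean(per_task)


score = judge_output_quality(task_ratings)
print(round(score, 3))
```

Because the tasks range from simple to complex, a skill that only handles the simple case is pulled down by the other two tasks rather than passing on the easy one alone.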

Anchored Rubric

0.0 – 0.19 (Grade F) — Instructions produce incorrect output

Following the skill's instructions would lead Claude to produce wrong answers or actively harmful output. The instructions contain factual errors, logical contradictions, or directives that produce the opposite of the intended result.

Example characteristics:

Incorrect formulas or algorithms presented as correct
Contradictory instructions that cannot both be followed
Instructions that assume wrong tool behaviors
Missing critical information that would cause systematic failure

0.20 – 0.39 (Grade F/D) — Incomplete, major gaps

Instructions produce output for simple cases but fail on anything non-trivial. Major aspects of the skill's domain are unaddressed. A user following this skill would get partial help for basic requests and no help for moderate complexity.

Example characteristics:

Handles the "hello world" case but not any realistic variant
Critical decision points have no guidance (the model must guess)
Output format is undefined — model produces inconsistent structure
No examples to calibrate expected quality

0.40 – 0.59 (Grade D/C) — Adequate for basic cases

Instructions produce reasonable output for straightforward tasks but struggle with any complexity. The skill is usable but requires the user to fill in significant gaps.

Example characteristics:

Basic case is well-handled; complex case guidance is thin or absent
Output format is suggested but not enforced
Edge cases are not addressed — model must improvise
Examples are present but only cover the simplest scenario

0.60 – 0.79 (Grade C/B) — Good for most cases

Instructions produce quality output for the majority of realistic tasks. A few edge cases or complex scenarios may be handled suboptimally but the core use cases work well.

Example characteristics:

Three or more concrete examples covering varied complexity
Output format is clearly specified
At least one edge case addressed explicitly
Instructions are actionable and specific, not just descriptive
Output would be correct and useful for 80%+ of real invocations

0.80 – 1.00 (Grade A/B) — Excellent across the board

Instructions are comprehensive, specific, and produce high-quality output for even complex or edge-case tasks. The skill represents a genuine expertise distillation.

Example characteristics:

Examples cover simple, moderate, and complex cases
Output format is precisely specified with schema or template
Multiple edge cases addressed with specific handling guidance
Instructions are expert-level — they encode domain knowledge, not just procedure
A user following the instructions would produce output comparable to an expert
Troubleshooting guidance is provided for failure modes
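For quick interpretation of a dimension score, the rubric bands can be expressed as a lookup. The band floors and grade labels below are copied from the band headings in this document. Placing scores that fall between printed bands (for example 0.195) in the lower band is an assumption, as is the function name.

```python
from bisect import bisect_right

# Lower bounds of the five rubric bands and their grade labels as printed.
_BAND_FLOORS = [0.0, 0.20, 0.40, 0.60, 0.80]
_BAND_GRADES = ["F", "F/D", "D/C", "C/B", "A/B"]


def rubric_band(score: float) -> str:
    """Map a dimension score in [0.0, 1.0] to its rubric grade label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    # bisect_right counts how many floors are <= score; subtract 1 for the index.
    return _BAND_GRADES[bisect_right(_BAND_FLOORS, score) - 1]


for value in (0.10, 0.20, 0.59, 0.79, 1.00):
    print(value, rubric_band(value))
```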

Judge Checks for Output Quality

When assessing code examples and technical instructions, the judge verifies:

All code blocks are syntactically correct and would run without modification
Workflows are shown end-to-end, not as fragments requiring integration
Error handling is included for the most common failure modes
APIs referenced are current (not deprecated in the skill's target environment)
Version constraints are stated when the skill targets a specific library version
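Skill authors can pre-check the first item above for Python examples with the standard library. The sketch below is a hypothetical helper, not the judge's actual procedure. It extracts fenced blocks tagged `python` and tries to parse each one, which verifies syntax only and says nothing about whether the code runs correctly.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this example contains no nested fence
_BLOCK = re.compile(
    rf"^{FENCE}python[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def python_blocks_with_syntax_errors(skill_text: str) -> list[tuple[int, str]]:
    """Return (block_index, error_message) for each block that fails to parse."""
    failures = []
    for index, source in enumerate(_BLOCK.findall(skill_text), start=1):
        try:
            ast.parse(source)
        except SyntaxError as exc:
            failures.append((index, exc.msg))
    return failures


# Made-up skill text: the first block parses, the second does not.
sample = (
    f"{FENCE}python\nprint('ok')\n{FENCE}\n"
    "Some prose between the examples.\n"
    f"{FENCE}python\ndef broken(:\n    pass\n{FENCE}\n"
)
failures = python_blocks_with_syntax_errors(sample)
print([index for index, _ in failures])
```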

Common Mistakes

Describing what good output looks like without explaining how to produce it
Providing examples of output without explaining the reasoning behind them
Instructions that are too vague to follow ("produce a comprehensive analysis")
Missing error handling — what should the skill do when the input is malformed?
Using placeholder pseudocode instead of real, runnable examples

---

Dimension 4 — Scope Calibration

Weight in composite: 0.12 (fourth highest)

Layer blend (deep depth): static 30%, judge 55%, Monte Carlo 15%

What is being measured

Scope calibration measures whether the skill is the right size for its purpose. Too thin (stub) and it provides no value. Too broad (bloated) and it wastes tokens, confuses the model, and overlaps with sibling skills. The ideal skill is exactly as large as it needs to be — comprehensive for its defined domain, not a line longer.

This dimension leans on the judge (55% judge blend) because "right size" is context-dependent and cannot be settled by line counts alone. A skill covering a complex framework legitimately needs more content than a skill covering a simple utility function.

How the judge scores it

The judge assesses scope by asking:

1. Does the skill cover all the important aspects of its stated domain?
2. Does it cover anything outside its stated domain?
3. Is the depth appropriate, neither superficial nor excessively detailed?
4. Is the content density high (every line earns its place) or padded?

The judge also considers the skill's category (reference documentation, workflow assistant, code generator, etc.) when calibrating expectations.

Anchored Rubric

0.0 – 0.19 (Grade F) — Stub

The skill is a placeholder. It has a name and description, but the body contains fewer than 50 lines or covers less than half of its stated domain. Someone invoking this skill would receive fragmentary guidance insufficient to complete any real task.

Example characteristics:

Fewer than 50 lines total
Body is a bulleted list of topics without elaboration
The description promises more than the content delivers
A competent practitioner would need to fill in all the gaps themselves

0.20 – 0.39 (Grade F/D) — Too narrow

The skill covers its domain but only the surface layer. Important aspects exist but are mentioned without sufficient depth to be actionable. The skill is not a stub but it is thin enough that users will frequently run into unaddressed scenarios.

Example characteristics:

50–100 lines covering 2–3 of the skill's 6+ important aspects
Core happy path is documented; anything unusual is missing
No examples or only one trivial example
Useful as a starting point but not as a self-sufficient reference

0.40 – 0.59 (Grade D/C) — Slightly off-scope

The skill is either moderately under-scoped (missing a few important aspects) or slightly over-scoped (includes content that belongs in a different skill). The content that exists is reasonable in quality but the overall package is not well-calibrated.

Example characteristics:

Under-scoped: Covers most aspects but one or two important ones are absent or cursory
Over-scoped: Includes content that duplicates a sibling skill or is only tangentially related to the skill's stated domain

May be the right total size but wrong distribution of content across topics

0.60 – 0.79 (Grade C/B) — Well-scoped with minor issues

The skill covers its domain well. Important aspects are addressed at appropriate depth. One or two gaps remain, or there is a small amount of tangential content, but these are minor issues.

Example characteristics:

80–90% of the important aspects covered at useful depth
A practitioner could complete most tasks using only this skill
Any content outside the core domain is clearly supporting material, not distraction
Minor gaps would affect fewer than 20% of invocations

0.80 – 1.00 (Grade A/B) — Perfectly calibrated

The skill is exactly what it needs to be. It covers all important aspects of its domain at the right depth, with no padding and no gaps. Every section earns its place. The skill could be used as a reference implementation for its category.

Example characteristics:

Comprehensive coverage of all important aspects without redundancy
Each section directly supports completing the skill's stated purpose
Appropriate use of references/ for supporting material that doesn't belong in the main execution path

Content density is high — no filler, no repetition
Would satisfy a senior practitioner working on a complex variant of the skill's task
Serves as a model for what this category of skill should look like

Skill Category Calibration Norms

Scope expectations vary by skill category. Use these as baseline calibration guides:

Category	Target lines (SKILL.md)	Pattern
Reference / Documentation	200–500	Deep coverage + references/ for extended material
Workflow / Process	150–300	Step-by-step + decision points + worked example
Code generator	100–200	Instructions + references/ for templates
Diagnostic / Debugging	200–400	Decision trees + failure modes + procedures
Integration / Configuration	150–350	Setup + options + copy-paste examples
Coordination / Planning	100–200	Decisions + checklists + handoff protocol
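The table above can be turned into a simple length check, sketched below. The line ranges are copied from the table. The dictionary keys, the function name, and counting every line of the file (including blank lines) are assumptions. The ranges are baselines for calibration, not hard limits.

```python
# Target SKILL.md line ranges per category, copied from the table above.
TARGET_LINES = {
    "reference": (200, 500),
    "workflow": (150, 300),
    "code_generator": (100, 200),
    "diagnostic": (200, 400),
    "integration": (150, 350),
    "coordination": (100, 200),
}


def check_length(category: str, skill_text: str) -> tuple[int, str]:
    """Return the line count and where it falls relative to the target range."""
    low, high = TARGET_LINES[category]
    line_count = len(skill_text.splitlines())
    if line_count < low:
        return line_count, "below target"
    if line_count > high:
        return line_count, "above target"
    return line_count, "within target"


# A made-up 120-line skill body.
fake_skill = "placeholder line\n" * 120
print(check_length("code_generator", fake_skill))
print(check_length("workflow", fake_skill))
```

A result outside the range is a prompt to review coverage and density, since the judge weighs scope against the skill's stated domain rather than against length alone.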

Common Mistakes

Writing a stub and planning to "expand later" — submit when the content is ready
Including content that belongs in a sibling skill to inflate scope
Treating a narrowly-scoped skill as too thin: a single-purpose utility skill can be 100 lines and perfectly calibrated
Over-explaining background theory that the model already knows; focus on the domain-specific guidance the model cannot infer from training data alone
Adding filler headings ("Overview", "Introduction") that restate the description without adding actionable content

---

Rubric Calibration and Consistency

Inter-Judge Agreement

When running with judges > 1, PluginEval reports Cohen's kappa to measure agreement between judge instances. Target kappa ≥ 0.70 for a stable, well-defined skill.

Kappa range	Interpretation
≥ 0.80	Strong agreement — skill is clearly written
0.60 – 0.79	Moderate agreement — skill has some ambiguous sections
0.40 – 0.59	Fair agreement — skill needs clarity improvements
< 0.40	Poor agreement — skill is ambiguous or judges are not calibrated

Low kappa on a specific dimension points to the area needing clarification. Low triggering_accuracy kappa usually means the description supports several different interpretations of when to use the skill.
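For reference, Cohen's kappa for two raters can be computed as in the sketch below. Cohen's kappa is defined for exactly two raters over categorical labels. How PluginEval discretizes scores into categories, and how it handles more than two judges, is not specified here, so the use of rubric grade labels as categories and the sample ratings are assumptions.

```python
from collections import Counter


def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Made-up band labels from two judge instances over ten items.
judge_a = ["A/B"] * 5 + ["C/B"] * 5
judge_b = ["A/B"] * 4 + ["C/B"] * 5 + ["A/B"]
kappa = cohens_kappa(judge_a, judge_b)
print(round(kappa, 2))
```

Here the judges agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa is 0.6. That falls in the "Moderate agreement" row of the table above and below the 0.70 target.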

Calibration Corpus

The gold corpus (initialized via plugin-eval init) provides Platinum and Gold-badged skills as calibration anchors. Before running a batch evaluation, compare your expected scores against one or two corpus entries to verify your judge is calibrated correctly.

If your judge consistently scores a known Platinum skill below 85 on any dimension, check for model version drift or prompt injection in the skill content that may be confusing the judge.

Score Drift Across Model Versions

Judge model upgrades can shift scores by ± 5–10 points on subjective dimensions (output_quality, scope_calibration). After any model upgrade, re-certify the top 10 corpus entries to establish new baseline calibration. If drift exceeds 5 points on any dimension, update the anchored examples in this rubric document to reflect the new model's scoring behavior.
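The re-certification comparison can be sketched as a per-dimension diff against the stored baseline. The 5-point threshold comes from the paragraph above. The 0 to 100 scale, the function name, and every score shown are assumptions for illustration.

```python
DRIFT_THRESHOLD = 5.0  # points, per the guidance above


def dimensions_over_threshold(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = DRIFT_THRESHOLD,
) -> dict[str, float]:
    """Return {dimension: signed drift} where the absolute drift exceeds the threshold."""
    return {
        dim: round(current[dim] - baseline[dim], 2)
        for dim in baseline
        if dim in current and abs(current[dim] - baseline[dim]) > threshold
    }


# Made-up scores for one corpus entry before and after a judge model upgrade.
baseline = {"output_quality": 88.0, "scope_calibration": 91.0, "orchestration_fitness": 86.0}
after_upgrade = {"output_quality": 81.5, "scope_calibration": 93.0, "orchestration_fitness": 86.0}
drifted = dimensions_over_threshold(baseline, after_upgrade)
print(drifted)
```

Keeping the sign of the drift shows whether the new judge model scores more harshly or more leniently, which matters when deciding how to adjust the anchored examples.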
