
Evaluation Framework
Reuse one weighted scoring and threshold rubric when your agent evaluates knowledge intake, skill quality, or other artifacts instead of duplicating criteria in every plugin.
Overview
evaluation-framework is an agent skill most often used in Build (also Validate, Ship) that supplies shared weighted scoring and threshold patterns for plugin evaluation rubrics.
Install
npx skills add https://github.com/athola/claude-night-market --skill evaluation-framework
What is this skill?
- Shared scoring methodology and threshold patterns for dependent rubrics
- Weighted criteria pattern (e.g. novelty at 25%) documented for downstream modules
- Integration via leyline:evaluation-framework dependencies in YAML frontmatter
- References scoring-patterns submodule for consistent numeric grading
- Single source of truth for evaluation terminology across plugins
- Novelty criterion example weighted at 25%
- Structure compliance scored 0–100 with weighted framework
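The weighted-criteria pattern in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: only the 25% novelty weight and the 0-100 scale come from the skill's documentation, and the Accept/Review/Reject bands mirror the example in its decision-thresholds module. The other criterion names, their weights, and the function names are hypothetical.

```python
# Hypothetical sketch: only the 25% novelty weight and the 0-100 scale are
# taken from the skill's docs; the other criteria and weights are invented.
WEIGHTS = {
    "novelty": 0.25,
    "relevance": 0.30,
    "accuracy": 0.30,
    "clarity": 0.15,
}

# (minimum score, decision), checked from highest to lowest. Using lower
# bounds means fractional totals such as 79.5 still land in a band.
THRESHOLDS = [(80, "Accept"), (60, "Review"), (0, "Reject")]


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-100) into one weighted 0-100 total."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the weighted criteria")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def decide(total: float) -> str:
    """Map a 0-100 total onto the first threshold band it reaches."""
    for minimum, decision in THRESHOLDS:
        if total >= minimum:
            return decision
    raise ValueError("total must be between 0 and 100")


example = {"novelty": 90, "relevance": 70, "accuracy": 80, "clarity": 60}
total = weighted_score(example)
print(round(total, 2), decide(total))
```

Keeping the weights and thresholds in two small tables is the point of the pattern: a dependent rubric would swap in its own criteria while reusing the same combine-then-threshold steps.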
Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
Scoring rules get duplicated, and pass thresholds drift out of sync, across every skill that judges knowledge or agent output.
Who is it for?
Maintainers of multiple agent plugins who want one evaluation vocabulary and scoring pattern.
Skip if: you only need a single ad-hoc checklist with no shared rubric across skills.
When should I use this skill?
Use it when integrating or authoring evaluation rubrics that should depend on the shared leyline:evaluation-framework module.
What do I get? / Deliverables
Dependent modules link one framework so rubrics share methodology while keeping domain criteria local.
- Dependency-linked evaluation module docs
- Consistent scoring-pattern references across plugins
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build → agent-tooling because the skill is packaged for Claude Night Market plugins that assess agent skills and knowledge modules. It standardizes how other skills (memory-palace, abstract) wire evaluation modules—not a one-off test runner.
Where it fits
Define novelty and relevance weights before accepting research into a memory palace.
Refactor abstract quality-metrics to depend on the shared framework instead of inline scoring.
Align pre-merge skill eval gates with the same threshold language your intake rubric uses.
How it compares
A reusable rubric package for other skills, not a standalone test runner or MCP server.
Common Questions / FAQ
Who is evaluation-framework for?
Solo builders and plugin authors who evaluate knowledge intake, skill quality, or structured artifacts and want one shared scoring model.
When should I use evaluation-framework?
Use it while authoring rubrics in Build/agent-tooling, when scoping intake quality in Validate, and when aligning review gates in Ship before you merge eval logic into memory-palace or similar skills.
Is evaluation-framework safe to install?
Review the Security Audits panel on this Prism page and inspect the skill repo before wiring it into production evaluation paths.
SKILL.md
# Integration Guide

How to integrate the evaluation-framework skill into your plugin.

## For memory-palace (Knowledge Intake)

The memory-palace evaluation rubric can now depend on this shared framework:

```yaml
# In knowledge-intake/modules/evaluation-rubric.md
---
name: evaluation-rubric
dependencies: [leyline:evaluation-framework]
---

# Knowledge Evaluation Rubric

Based on the [evaluation-framework](leyline:evaluation-framework) with domain-specific criteria for knowledge intake.

## Criteria (following evaluation-framework pattern)

### 1. Novelty (25%)
See [scoring-patterns](leyline:evaluation-framework/modules/scoring-patterns.md) for methodology.

[Rest of domain-specific details...]
```

## For abstract (Quality Metrics)

The abstract quality-metrics module can reference this framework:

```yaml
# In skills-eval/modules/quality-metrics.md
---
name: quality-metrics
dependencies: [leyline:evaluation-framework]
---

# Quality Metrics Framework

Based on [evaluation-framework](leyline:evaluation-framework) for skill quality assessment.

## Scoring Categories (following evaluation-framework pattern)

### Structure Compliance (0-100)
Uses weighted scoring from [evaluation-framework](leyline:evaluation-framework).

[Rest of domain-specific details...]
```

## Benefits of Integration

### Reduced Duplication
- Common scoring methodology in one place
- Shared threshold patterns
- Single source of truth for evaluation concepts

### Consistency
- Same terminology across plugins
- Consistent scoring scales
- Unified decision-making patterns

### Maintainability
- Update evaluation patterns once
- All consumers benefit from improvements
- Clear dependency chain

## Migration Path

1. **Add Dependency**: Update frontmatter to include `leyline:evaluation-framework`
2. **Reference Core Patterns**: Link to framework for common concepts
3. **Focus on Domain**: Keep only domain-specific details in your skill
4. **Remove Duplication**: Delete explanations now in framework

## Example: Before and After

### Before (Duplicated)

```markdown
# My Evaluation

## Weighted Scoring
We use a weighted scoring system where each criterion has a weight...
[300 lines of generic explanation]

## Domain-Specific Criteria
[50 lines of actual domain logic]
```

### After (DRY with Framework)

```markdown
# My Evaluation

Uses [evaluation-framework](leyline:evaluation-framework) for weighted scoring.

## Domain-Specific Criteria
[50 lines of actual domain logic with references to framework patterns]
```

**Result**: 350 lines → 60 lines, clearer focus on domain logic.

---
name: decision-thresholds
description: Patterns for designing threshold-based decision frameworks with clear actions
category: evaluation
tags: [thresholds, decisions, automation, gates]
estimated_tokens: 700
---

# Decision Thresholds

Patterns and best practices for designing effective threshold-based decision frameworks.

## Core Concepts

### What Are Thresholds?

Thresholds are score ranges that map to specific decisions or actions. They transform continuous scores into discrete decision points.

```
Score Range → Decision → Action
80-100 → Accept → Deploy immediately
60-79 → Review → Manual approval needed
0-59 → Reject → Send back for revision
```

### Why Use Thresholds?

- **Consistency**: Same score always gets same decision
- **Automation**: Enable automated decision-making
- **Clarity**: Clear criteria for each outcome
- **Accountability**: Documented decision logic

## Threshold Design Patterns

### Binary Thresholds

Simplest pattern - pass or fail:

```yaml
thresholds:
  70-100: Pass
  0-69: Fail
```

Use when:
- Decision is truly binary (deploy/don't deploy)
- No middle ground exists
- Automation is critical

### Multi-Tier Thresholds

Multiple decision levels with different actions:

```yaml
thresholds:
  90-100: Excellent - Fast track
  75-89: Good - Standard process
  60-74: Fair - Additional review
  40-59: Poor - Major revisions ne