Evaluation Framework

Canonical shelf is Build → agent-tooling because the skill is packaged for Claude Night Market plugins that assess agent skills and knowledge modules. It standardizes how other skills (memory-palace, abstract) wire evaluation modules; it is not a one-off test runner.

Where it fits

Example use

Define novelty and relevance weights before accepting research into a memory palace.

Example use

Refactor abstract quality-metrics to depend on the shared framework instead of inline scoring.

Example use

Align pre-merge skill eval gates with the same threshold language your intake rubric uses.

How it compares

A reusable rubric package for other skills, not a standalone test runner or MCP server.

Common Questions / FAQ

Who is evaluation-framework for?

Solo builders and plugin authors who evaluate knowledge intake, skill quality, or structured artifacts and want one shared scoring model.

When should I use evaluation-framework?

Use it while authoring rubrics in Build/agent-tooling, when scoping intake quality in Validate, and when aligning review gates in Ship before you merge eval logic into memory-palace or similar skills.

Is evaluation-framework safe to install?

Review the Security Audits panel on this Prism page and inspect the skill repo before wiring it into production evaluation paths.

SKILL.md

SKILL.md - Evaluation Framework

# Integration Guide

How to integrate the evaluation-framework skill into your plugin.

## For memory-palace (Knowledge Intake)

The memory-palace evaluation rubric can now depend on this shared framework:

```markdown
# In knowledge-intake/modules/evaluation-rubric.md
---
name: evaluation-rubric
dependencies: [leyline:evaluation-framework]
---

# Knowledge Evaluation Rubric

Based on the [evaluation-framework](leyline:evaluation-framework) with domain-specific criteria for knowledge intake.

## Criteria (following evaluation-framework pattern)

### 1. Novelty (25%)
See [scoring-patterns](leyline:evaluation-framework/modules/scoring-patterns.md) for methodology.

[Rest of domain-specific details...]
```
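
The weighted-criteria pattern the rubric refers to can be sketched in Python as follows. This is a minimal illustration, not the framework's actual scoring-patterns implementation: only the Novelty weight (25%) comes from the rubric above, while the other criterion names, their weights, and the `weighted_score` helper are hypothetical.

```python
# Minimal sketch of weighted scoring; not the framework's actual code.
# Only novelty=0.25 comes from the rubric; the other entries are hypothetical.
WEIGHTS = {
    "novelty": 0.25,
    "relevance": 0.25,    # hypothetical weight
    "credibility": 0.50,  # hypothetical criterion and weight
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (each 0-100) into a single 0-100 total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


total = weighted_score({"novelty": 80, "relevance": 60, "credibility": 70}, WEIGHTS)
print(total)  # 70.0
```

Validating that the weights sum to 1.0 keeps the total on the same 0-100 scale as the individual criteria, so it can be compared directly against shared thresholds.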

## For abstract (Quality Metrics)

The abstract quality-metrics module can reference this framework:

```markdown
# In skills-eval/modules/quality-metrics.md
---
name: quality-metrics
dependencies: [leyline:evaluation-framework]
---

# Quality Metrics Framework

Based on [evaluation-framework](leyline:evaluation-framework) for skill quality assessment.

## Scoring Categories (following evaluation-framework pattern)

### Structure Compliance (0-100)
Uses weighted scoring from [evaluation-framework](leyline:evaluation-framework).

[Rest of domain-specific details...]
```

## Benefits of Integration

### Reduced Duplication
- Common scoring methodology in one place
- Shared threshold patterns
- Single source of truth for evaluation concepts

### Consistency
- Same terminology across plugins
- Consistent scoring scales
- Unified decision-making patterns

### Maintainability
- Update evaluation patterns once
- All consumers benefit from improvements
- Clear dependency chain

## Migration Path

1. **Add Dependency**: Update frontmatter to include `leyline:evaluation-framework`
2. **Reference Core Patterns**: Link to framework for common concepts
3. **Focus on Domain**: Keep only domain-specific details in your skill
4. **Remove Duplication**: Delete explanations now in framework

## Example: Before and After

### Before (Duplicated)

```markdown
# My Evaluation

## Weighted Scoring

We use a weighted scoring system where each criterion has a weight...
[300 lines of generic explanation]

## Domain-Specific Criteria
[50 lines of actual domain logic]
```

### After (DRY with Framework)

```markdown
# My Evaluation

Uses [evaluation-framework](leyline:evaluation-framework) for weighted scoring.

## Domain-Specific Criteria
[50 lines of actual domain logic with references to framework patterns]
```

**Result**: 350 lines → 60 lines, clearer focus on domain logic.


---
name: decision-thresholds
description: Patterns for designing threshold-based decision frameworks with clear actions
category: evaluation
tags: [thresholds, decisions, automation, gates]
estimated_tokens: 700
---

# Decision Thresholds

Patterns and best practices for designing effective threshold-based decision frameworks.

## Core Concepts

### What Are Thresholds?

Thresholds are score ranges that map to specific decisions or actions. They transform continuous scores into discrete decision points.

```
Score Range → Decision → Action
80-100      → Accept   → Deploy immediately
60-79       → Review   → Manual approval needed
0-59        → Reject   → Send back for revision
```
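
As a rough illustration, the table above can be encoded as an ordered list of lower bounds. The `decide` function and the tuple layout are assumptions made for this sketch, not part of the framework.

```python
# Sketch of the score -> decision -> action table above.
# `decide` and THRESHOLDS are illustrative names, not framework APIs.
THRESHOLDS = [
    # (minimum score, decision, action), highest tier first
    (80, "Accept", "Deploy immediately"),
    (60, "Review", "Manual approval needed"),
    (0, "Reject", "Send back for revision"),
]


def decide(score: float) -> tuple[str, str]:
    """Return the (decision, action) pair for a 0-100 score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for minimum, decision, action in THRESHOLDS:
        if score >= minimum:
            return decision, action
    raise AssertionError("unreachable: the last tier starts at 0")


print(decide(79.5))  # ('Review', 'Manual approval needed')
```

Storing only the lower bound of each tier avoids gaps between ranges such as 79 and 80, so fractional scores still map to exactly one decision.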

### Why Use Thresholds?

- **Consistency**: Same score always gets same decision
- **Automation**: Enable automated decision-making
- **Clarity**: Clear criteria for each outcome
- **Accountability**: Documented decision logic

## Threshold Design Patterns

### Binary Thresholds

Simplest pattern - pass or fail:

```yaml
thresholds:
  70-100: Pass
  0-69:   Fail
```

Use when:
- Decision is truly binary (deploy/don't deploy)
- No middle ground exists
- Automation is critical
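
For an automated gate, the binary pattern reduces to a single comparison. The sketch below is hypothetical: the names `passes` and `gate_exit_code` are invented here, and the only value taken from the text is the pass cutoff of 70.

```python
# Sketch of a binary pass/fail gate; names are illustrative, not framework APIs.
PASS_THRESHOLD = 70  # lower bound of the Pass range above


def passes(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    """True when the score falls in the Pass range."""
    return score >= threshold


def gate_exit_code(score: float) -> int:
    """Return 0 on pass and 1 on fail, following the usual process exit-code convention."""
    return 0 if passes(score) else 1


print(passes(70), passes(69))  # True False
```

Returning an exit code makes the gate easy to call from a script or CI step, where a non-zero status blocks the pipeline.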

### Multi-Tier Thresholds

Multiple decision levels with different actions:

```yaml
thresholds:
  90-100: Excellent - Fast track
  75-89:  Good - Standard process
  60-74:  Fair - Additional review
  40-59:  Poor - Major revisions needed
```

What is this skill?

Shared scoring methodology and threshold patterns for dependent rubrics

Weighted criteria pattern (e.g. novelty at 25%) documented for downstream modules

Integration via leyline:evaluation-framework dependencies in YAML frontmatter

References scoring-patterns submodule for consistent numeric grading

Single source of truth for evaluation terminology across plugins

Novelty criterion example weighted at 25%

Structure compliance scored 0–100 with weighted framework

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1 install on skills.sh; 304 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

Journey fit

Spans multiple journey phases: Build is the primary shelf, with Validate and Ship as alternate fits.
