Judge

Name: Judge
Author: neolabhq

neolabhq/context-engineering-kit

885 installs
1.3k repo stars
Updated July 26, 2026

judge is a report-only evaluation skill. It runs a meta-judge followed by an isolated judge sub-agent to score completed agent work against rubrics, with citations, for developers who need evidence-based quality review.

About

judge implements a two-phase meta-judge then LLM-as-judge pipeline from the context-engineering-kit to assess work produced earlier in a conversation. A meta-judge first generates tailored evaluation criteria; a judge sub-agent then applies those criteria with isolated context, structured scoring, and evidence-based feedback. Findings are report-only—no automatic repo edits. Developers invoke judge when they want an objective post-task audit of agent output, plans, or implementations with optional evaluation-focus arguments.

Two-phase meta-judge then LLM-as-judge pipeline with optional evaluation-focus argument
Context-isolated judge sub-agent to limit confirmation bias from session history
Meta-judge generates tailored rubrics, checklists, and multi-dimensional scoring criteria
Evidence-required scores with file locations and line-number citations
Report-only output with self-verification questions—no automatic code changes

Judge by the numbers

885 all-time installs (skills.sh)
+48 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #147 of 1,382 Code Review & Quality skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/neolabhq/context-engineering-kit --skill judge

Installs	885
Repo stars	1.3k
Security audit	2 / 3 scanners passed
Last updated	July 26, 2026
Repository	neolabhq/context-engineering-kit

How do you evaluate agent output with rubrics?

Run an isolated meta-judge plus judge sub-agent pipeline to score completed conversation work with rubrics, citations, and report-only feedback—without auto-editing your repo.

Who is it for?

Developers who finished multi-step agent work and want isolated, rubric-driven quality scoring without automatic code changes.

Skip if: You need automatic fixes, linting, or security scanning. judge is report-only and does none of these.

When should I use this skill?

User asks to evaluate, score, audit, or judge work completed earlier in the current conversation.

What you get

Structured evaluation report with tailored criteria, scores, citations, and evidence-based feedback.

Evaluation rubric
Scored assessment report
Evidence citations

By the numbers

Two-phase evaluation pipeline: meta-judge then judge sub-agent

Files

SKILL.md (Markdown)

Judge Command

<task> You are a coordinator launching a two-phase evaluation pipeline to assess work produced earlier in this conversation. First, a meta-judge generates tailored evaluation criteria. Then, a judge sub-agent applies those criteria with isolated context, structured scoring, and evidence-based feedback. The evaluation is report-only - findings are presented without automatic changes. </task>

<context> This command implements the meta-judge -> LLM-as-Judge pattern with context isolation:

Structured Evaluation: Meta-judge produces tailored rubrics, checklists, and scoring criteria before judging
Context Isolation: Judge operates with fresh context, preventing confirmation bias from accumulated session state
Evidence-Based: Every score requires specific citations from the work (file locations, line numbers)
Multi-Dimensional Rubric: Generated by meta-judge to match the specific artifact type and evaluation focus
Self-Verification: Dynamic verification questions with documented adjustments

</context>
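The two-phase flow described above can be pictured as a small coordinator function. This is a minimal sketch, not the skill's implementation: `run_judge_pipeline`, `fake_meta_judge`, and `fake_judge` are hypothetical names, and the sub-agents are modeled as plain callables standing in for real agent dispatches.

```python
from typing import Callable


def run_judge_pipeline(
    extracted_context: str,
    meta_judge: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> str:
    """Two-phase flow: the meta-judge writes the spec, then the judge applies it."""
    # Step 1: criteria are generated before any judging happens.
    spec_yaml = meta_judge(extracted_context)
    # Step 2: the judge receives only the extracted context and the
    # unmodified specification, never the full conversation history.
    return judge(extracted_context, spec_yaml)


# Stub sub-agents, for illustration only (hypothetical).
def fake_meta_judge(context: str) -> str:
    return "criteria:\n  - name: correctness\n    weight: 1.0\n"


def fake_judge(context: str, spec_yaml: str) -> str:
    return f"report for {context!r} using {len(spec_yaml.splitlines())} spec lines"


print(run_judge_pipeline("Add retry logic to fetch()", fake_meta_judge, fake_judge))
```

Passing the sub-agents in as callables keeps the ordering constraint (meta-judge first, judge second) visible in one place and makes the flow easy to exercise with stubs.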

Your Workflow

Phase 1: Context Extraction

Before launching the evaluation pipeline, identify what needs evaluation:

1. Identify the work to evaluate:

Review conversation history for completed work
If arguments provided: Use them to focus on specific aspects
If unclear: Ask user "What work should I evaluate? (code changes, analysis, documentation, etc.)"

2. Extract evaluation context:

Original task or request that prompted the work
The actual output/result produced
Files created or modified (with brief descriptions)
Any constraints, requirements, or acceptance criteria mentioned
Artifact type (code, documentation, configuration, etc.)

3. Provide scope for user:

   Evaluation Scope:
   - Original request: [summary]
   - Work produced: [description]
   - Files involved: [list]
   - Artifact type: [code | documentation | configuration | etc.]
   - Evaluation focus: [from arguments or "general quality"]

   Launching meta-judge to generate evaluation criteria...

IMPORTANT: Pass only the extracted context to the sub-agents - not the entire conversation. This prevents context pollution and enables focused assessment.
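One way to keep the extracted context separate from the conversation is to hold it in a small record and render the scope summary from that record alone. The sketch below assumes a hypothetical `EvaluationScope` class; the field names follow the scope template above, while the defaults and the "none" placeholder for an empty file list are illustrative choices.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvaluationScope:
    """Holds only what the sub-agents need; the full conversation is left out."""

    original_request: str
    work_produced: str
    files_involved: List[str] = field(default_factory=list)
    artifact_type: str = "code"
    evaluation_focus: str = "general quality"  # used when no arguments are given

    def render(self) -> str:
        """Format the scope summary shown to the user before dispatch."""
        files = ", ".join(self.files_involved) if self.files_involved else "none"
        return "\n".join(
            [
                "Evaluation Scope:",
                f"- Original request: {self.original_request}",
                f"- Work produced: {self.work_produced}",
                f"- Files involved: {files}",
                f"- Artifact type: {self.artifact_type}",
                f"- Evaluation focus: {self.evaluation_focus}",
            ]
        )


# Example values are made up for illustration.
scope = EvaluationScope(
    original_request="Add retry logic to the HTTP client",
    work_produced="Exponential backoff wrapper with tests",
    files_involved=["client.py", "test_client.py"],
)
print(scope.render())
```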

Phase 2: Dispatch Meta-Judge

Launch a meta-judge agent to generate an evaluation specification tailored to the specific work being evaluated. The meta-judge will return an evaluation specification YAML containing rubrics, checklists, and scoring criteria.

Meta-Judge Prompt:

## Task

Generate an evaluation specification YAML for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## User Prompt
{Original task or request that prompted the work}

## Context
{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}

## Artifact Type
{code | documentation | configuration | etc.}

## Instructions
Return only the final evaluation specification YAML in your response.

Dispatch:

Use Task tool:
  - description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
  - prompt: {meta-judge prompt}
  - model: opus
  - subagent_type: "sadd:meta-judge"

Wait for the meta-judge to complete before proceeding to Phase 3.
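As an illustration of Phase 2, the prompt template can be filled in programmatically and paired with the dispatch fields listed above. This is a sketch under assumptions: `build_meta_judge_request` is a hypothetical helper, and the returned dictionary only mirrors the Task tool fields named in this document; it is not a call into any real API.

```python
META_JUDGE_TEMPLATE = """## Task

Generate an evaluation specification YAML for the following evaluation task. \
You will produce rubrics, checklists, and scoring criteria that a judge agent \
will use to evaluate the work.

CLAUDE_PLUGIN_ROOT=`{plugin_root}`

## User Prompt
{user_prompt}

## Context
{context}
{evaluation_focus}

## Artifact Type
{artifact_type}

## Instructions
Return only the final evaluation specification YAML in your response.
"""


def build_meta_judge_request(
    user_prompt: str,
    context: str,
    artifact_type: str,
    plugin_root: str,
    work_summary: str,
    evaluation_focus: str = "General quality assessment",
) -> dict:
    """Fill the template and collect the dispatch fields in a plain dict."""
    prompt = META_JUDGE_TEMPLATE.format(
        plugin_root=plugin_root,
        user_prompt=user_prompt,
        context=context,
        evaluation_focus=evaluation_focus,
        artifact_type=artifact_type,
    )
    return {
        "description": f"Meta-judge: Generate evaluation criteria for {work_summary}",
        "prompt": prompt,
        "model": "opus",
        "subagent_type": "sadd:meta-judge",
    }


# Example values, including the plugin root path, are hypothetical.
request = build_meta_judge_request(
    user_prompt="Add retry logic to the HTTP client",
    context="Python service; retries must use exponential backoff",
    artifact_type="code",
    plugin_root="/path/to/plugin",
    work_summary="HTTP client retry logic",
)
print(request["description"])
```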

Phase 3: Dispatch Judge Agent

After the meta-judge completes, extract its evaluation specification YAML and dispatch the judge agent with both the work context and the specification.

CRITICAL: Provide to the judge the EXACT meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!

Judge Agent Prompt:

You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta-judge.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## Work Under Evaluation

[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]

[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]

[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]

## Evaluation Specification

{meta-judge's evaluation specification YAML}


## Instructions

Follow your full judge process as defined in your agent instructions!

CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!

CRITICAL: NEVER provide a score threshold to the judge in any format. The judge MUST NOT know the score threshold, so that its scoring is not biased.

Dispatch:

Use Task tool:
  - description: "Judge: Evaluate {brief work summary}"
  - prompt: {judge prompt with exact meta-judge specification YAML}
  - model: opus
  - subagent_type: "sadd:judge"
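The requirement to pass the meta-judge specification unchanged can be sketched as a prompt builder that inserts the YAML as-is, plus a check that the exact text survived. `build_judge_prompt` and `spec_is_verbatim` are hypothetical helpers, and the prompt is abbreviated to the sections relevant here.

```python
from typing import List


def build_judge_prompt(
    original_task: str,
    work_output: str,
    files: List[str],
    spec_yaml: str,
    plugin_root: str,
) -> str:
    """Assemble the judge prompt; the specification is embedded unmodified."""
    sections = [
        "You are an Expert Judge evaluating the quality of work against an "
        "evaluation specification produced by the meta-judge.",
        f"CLAUDE_PLUGIN_ROOT=`{plugin_root}`",
        "## Work Under Evaluation",
        f"[ORIGINAL TASK]\n{original_task}\n[/ORIGINAL TASK]",
        f"[WORK OUTPUT]\n{work_output}\n[/WORK OUTPUT]",
        "[FILES INVOLVED]\n" + "\n".join(files) + "\n[/FILES INVOLVED]",
        "## Evaluation Specification",
        spec_yaml,  # inserted as-is: no trimming, reflowing, or summarizing
        "## Instructions",
        "Follow your full judge process as defined in your agent instructions!",
    ]
    return "\n\n".join(sections)


def spec_is_verbatim(prompt: str, spec_yaml: str) -> bool:
    """True only if the exact specification text appears in the prompt."""
    return spec_yaml in prompt


# Example specification and values are made up for illustration.
spec = "criteria:\n  - name: correctness\n    weight: 1.0\n"
judge_prompt = build_judge_prompt(
    original_task="Add retry logic to the HTTP client",
    work_output="Exponential backoff wrapper with tests",
    files=["client.py: retry wrapper", "test_client.py: unit tests"],
    spec_yaml=spec,
    plugin_root="/path/to/plugin",
)
print(spec_is_verbatim(judge_prompt, spec))
```

A substring check like this catches accidental edits such as stripped whitespace or a shortened rubric, though it says nothing about whether the coordinator added extra text elsewhere in the prompt.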

Phase 4: Process and Present Results

After receiving the judge's evaluation:

1. Validate the evaluation:

Check that all criteria have scores in valid range (1-5)
Verify each score has supporting justification with evidence
Confirm weighted total calculation is correct
Check for contradictions between justification and score
Verify self-verification was completed with documented adjustments

2. If validation fails:

Note the specific issue
Request clarification or re-evaluation if needed

3. Present results to user:

Display the full evaluation report
Highlight the verdict and key findings
Offer follow-up options:
Address specific improvements
Request clarification on any judgment
Proceed with the work as-is
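The validation checks in step 1 lend themselves to a small function. The sketch below assumes a report shape that this document does not specify: each criterion is a dictionary with hypothetical `name`, `score`, `weight`, and `justification` keys, and the weighted total is taken to be the weight-normalized average of the scores.

```python
import math
from typing import Dict, List


def validate_report(
    criteria: List[Dict], reported_total: float, tolerance: float = 0.01
) -> List[str]:
    """Return a list of validation problems; an empty list means the report passes."""
    problems = []
    for c in criteria:
        # Every score must fall in the valid 1-5 range.
        if not 1 <= c["score"] <= 5:
            problems.append(f"{c['name']}: score {c['score']} outside 1-5")
        # Every score needs a non-empty justification with evidence.
        if not c.get("justification", "").strip():
            problems.append(f"{c['name']}: missing justification")
    # Recompute the weighted total (assumed: weight-normalized average).
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight > 0:
        expected = sum(c["score"] * c["weight"] for c in criteria) / total_weight
        if not math.isclose(expected, reported_total, abs_tol=tolerance):
            problems.append(
                f"weighted total {reported_total} does not match recomputed {expected:.2f}"
            )
    return problems


# Example report values are made up for illustration.
report = [
    {"name": "correctness", "score": 4, "weight": 0.6, "justification": "client.py:42 retries on 5xx"},
    {"name": "clarity", "score": 3, "weight": 0.4, "justification": "client.py:10 lacks a docstring"},
]
print(validate_report(report, 3.6))
print(validate_report(report, 4.2))
```

Checks for contradictions between a justification and its score, or for completed self-verification, need human or model judgment and are left out of this sketch.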

Scoring Interpretation

Score Range	Verdict	Interpretation	Recommendation
4.50 - 5.00	EXCELLENT	Exceptional quality, exceeds expectations	Ready as-is
4.00 - 4.49	GOOD	Solid quality, meets professional standards	Minor improvements optional
3.50 - 3.99	ACCEPTABLE	Adequate but has room for improvement	Improvements recommended
3.00 - 3.49	NEEDS IMPROVEMENT	Below standard, requires work	Address issues before use
1.00 - 2.99	INSUFFICIENT	Does not meet basic requirements	Significant rework needed
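The table above can be applied on the coordinator side with a simple lookup. This is a sketch with hypothetical names (`VERDICT_BANDS`, `interpret_score`); it treats each band's lower bound as inclusive, so a score with more than two decimals, such as 4.495, falls into the lower band.

```python
from typing import Tuple

# (lower bound, verdict, recommendation), ordered from highest band to lowest.
VERDICT_BANDS = [
    (4.50, "EXCELLENT", "Ready as-is"),
    (4.00, "GOOD", "Minor improvements optional"),
    (3.50, "ACCEPTABLE", "Improvements recommended"),
    (3.00, "NEEDS IMPROVEMENT", "Address issues before use"),
    (1.00, "INSUFFICIENT", "Significant rework needed"),
]


def interpret_score(score: float) -> Tuple[str, str]:
    """Map a weighted total in [1.0, 5.0] to its verdict and recommendation."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the 1.00 - 5.00 range")
    for lower, verdict, recommendation in VERDICT_BANDS:
        if score >= lower:
            return verdict, recommendation
    raise AssertionError("unreachable: the lowest band starts at 1.00")


print(interpret_score(4.2))
```

Keeping this mapping in the coordinator, rather than in the judge prompt, is consistent with the rule that the judge never sees score thresholds.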

Important Guidelines

1. Meta-judge first: Always generate evaluation specification before judging - never skip the meta-judge phase
2. Include CLAUDE_PLUGIN_ROOT: Both meta-judge and judge need the resolved plugin root path
3. Meta-judge YAML: Pass only the meta-judge YAML to the judge, do not modify it
4. Context Isolation: Pass only relevant context to sub-agents - not the entire conversation
5. Justification First: Always require evidence and reasoning BEFORE the score
6. Evidence-Based: Every score must cite specific evidence (file paths, line numbers, quotes)
7. Bias Mitigation: Explicitly warn against length bias, verbosity bias, and authority bias
8. Be Objective: Base assessments on evidence and rubric definitions, not preferences
9. Be Specific: Cite exact locations, not vague observations
10. Be Constructive: Frame criticism as opportunities for improvement with impact context
11. Consider Context: Account for stated constraints, complexity, and requirements
12. Report Confidence: Lower confidence when evidence is ambiguous or criteria unclear
13. Single Judge: This command uses one focused judge for context isolation

Notes

This is a report-only command - it evaluates but does not modify work
The meta-judge generates criteria tailored to the specific artifact type and evaluation focus
The judge operates with fresh context for unbiased assessment
Scores are calibrated to professional development standards
Low scores indicate improvement opportunities, not failures
Use the evaluation to inform next steps and iterations
Low-confidence evaluations may warrant human review

Related skills

Improve Codebase Architecture	Safely deepen clusters of shallow modules into cohesive, testable units while respecting their external dependencies.	531k	185k

Caveman Review	Get ultra-compressed, one-line code review comments that cut noise while keeping every actionable fix.	260k	92.5k

Codebase Design	Shared vocabulary for designing deep modules: improve a module's interface, find deepening opportunities, decide where a seam goes, make code more testable.	233k	185k

Cavecrew	Delegate coding tasks to specialized subagents that return compressed output, keeping the main context window usable for much longer sessions.	210k	92.5k

Requesting Code Review	Dispatch a consistent, high-signal code reviewer subagent that catches plan deviations and quality issues before merging or continuing development.	178k	260k

Code Review	Reviews a branch or PR diff on two axes at once: conformance to coding standards plus a code-smell baseline, and whether it actually implements the original spec.	167k	185k

How it compares

Use judge when you need rubric-driven, evidence-cited evaluation of finished work; use lint or security skills when you need automated fixes or rule enforcement.

FAQ

Does judge automatically fix code it evaluates?

judge is report-only. The meta-judge and judge sub-agent pipeline presents structured scoring, citations, and evidence-based feedback without applying automatic edits to the repository or conversation artifacts.

How does the judge skill structure evaluation?

judge runs a meta-judge first to generate tailored evaluation criteria, then launches a judge sub-agent with isolated context to apply those criteria with structured scoring and evidence-based feedback.

Is Judge safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Code Review & Quality, testing, integrations