
Judge
Run an isolated meta-judge plus judge sub-agent pipeline to score completed conversation work with rubrics, citations, and report-only feedback—without auto-editing your repo.
Overview
Judge is an agent skill most often used in Ship (also Build and Ship testing). It launches a meta-judge and an isolated judge sub-agent to evaluate prior conversation output with tailored rubrics and cited evidence.
Install
npx skills add https://github.com/neolabhq/context-engineering-kit --skill judge
What is this skill?
- Two-phase meta-judge then LLM-as-judge pipeline with optional evaluation-focus argument
- Context-isolated judge sub-agent to limit confirmation bias from session history
- Meta-judge generates tailored rubrics, checklists, and multi-dimensional scoring criteria
- Evidence-required scores with file locations and line-number citations
- Report-only output with self-verification questions—no automatic code changes
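The two phases listed above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `EvaluationContext`, `run_judge_pipeline`, and the two stand-in sub-agents are hypothetical names, and the real skill dispatches sub-agents through the agent's own tooling rather than plain function calls. The sketch shows only the ordering (meta-judge first, then judge), the isolation (sub-agents receive the extracted context, never the full conversation), and the report-only result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvaluationContext:
    """Only the extracted facts about the work, not the whole conversation."""
    original_request: str
    work_produced: str
    files_involved: tuple[str, ...]
    artifact_type: str
    evaluation_focus: str = "general quality"


def run_judge_pipeline(
    context: EvaluationContext,
    meta_judge: Callable[[EvaluationContext], str],
    judge: Callable[[EvaluationContext, str], str],
) -> str:
    """Run both phases in order and return the judge's report unchanged."""
    # Phase 1: the meta-judge turns the context into an evaluation specification.
    spec = meta_judge(context)
    # Phase 2: the judge gets the same context plus the specification, verbatim.
    report = judge(context, spec)
    # Report-only: nothing in this function writes to the repository.
    return report


# Hypothetical stand-ins so the sketch runs without any agent runtime.
def fake_meta_judge(ctx: EvaluationContext) -> str:
    return f"rubric:\n  focus: {ctx.evaluation_focus}\n  artifact: {ctx.artifact_type}\n"


def fake_judge(ctx: EvaluationContext, spec: str) -> str:
    return f"REPORT for {ctx.work_produced}\n--- spec used ---\n{spec}"


context = EvaluationContext(
    original_request="Add input validation to the signup endpoint",
    work_produced="signup handler changes",
    files_involved=("api/signup.py",),
    artifact_type="code",
    evaluation_focus="security",
)
report = run_judge_pipeline(context, fake_meta_judge, fake_judge)
print(report)
```

Passing the sub-agents in as callables keeps the coordinator free of session state, which is the property the context-isolation bullet describes.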
Adoption & trust: 530 installs on skills.sh; 1.1k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You finished work in a long agent session but cannot trust your own or the model’s approval because context bias and vague praise hide real gaps.
Who is it for?
Builders who want a second-pass, rubric-driven review of agent-produced code, docs, or plans without blending it into the same biased chat thread.
Skip if: You need enforced pass/fail gates, such as automated CI, security scanners, or human team review on regulated releases. Judge produces report-only feedback and does not replace them.
When should I use this skill?
Launch it when the conversation contains completed work to assess; the optional argument hint supplies an evaluation focus (e.g., security, tests, API design).
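As a rough sketch of how the optional focus argument might feed the "Evaluation Scope" block that the SKILL.md prints, the snippet below falls back to "general quality" when no argument is given. The function names `resolve_focus` and `render_scope` are hypothetical; in the skill itself the coordinating agent fills in this block, not Python code.

```python
from typing import Optional, Sequence

DEFAULT_FOCUS = "general quality"  # fallback wording taken from the scope template


def resolve_focus(argument: Optional[str]) -> str:
    """Return the trimmed argument, or the default when it is missing or blank."""
    focus = (argument or "").strip()
    return focus or DEFAULT_FOCUS


def render_scope(
    request: str,
    work: str,
    files: Sequence[str],
    artifact_type: str,
    argument: Optional[str] = None,
) -> str:
    """Build the scope summary shown to the user before the meta-judge runs."""
    lines = [
        "Evaluation Scope:",
        f"- Original request: {request}",
        f"- Work produced: {work}",
        f"- Files involved: {', '.join(files)}",
        f"- Artifact type: {artifact_type}",
        f"- Evaluation focus: {resolve_focus(argument)}",
    ]
    return "\n".join(lines)


scope = render_scope(
    "Add input validation to the signup endpoint",
    "signup handler changes",
    ["api/signup.py"],
    "code",
    argument="security",
)
print(scope)
```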
What do I get? / Deliverables
You receive a structured, citation-backed evaluation report with generated rubric scores and verification notes, ready for you to act on manually.
- Tailored evaluation rubric and checklist
- Multi-dimensional scored report with file/line citations
- Self-verification notes without automatic repo changes
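One way to picture the "evidence-required" rule behind these deliverables is a score record that cannot be built without at least one file/line citation. The sketch below is illustrative only: `Citation`, `CriterionScore`, `format_score`, and the 1 to 5 scale are assumptions, since the actual scale and criteria come from the rubric the meta-judge generates.

```python
from dataclasses import dataclass

MAX_SCORE = 5  # assumed scale; the real one is defined by the generated rubric


@dataclass(frozen=True)
class Citation:
    path: str
    line: int


@dataclass(frozen=True)
class CriterionScore:
    criterion: str
    score: int
    citations: tuple[Citation, ...]

    def __post_init__(self) -> None:
        # Reject any score that is out of range or not backed by evidence.
        if not 1 <= self.score <= MAX_SCORE:
            raise ValueError(f"score must be between 1 and {MAX_SCORE}")
        if not self.citations:
            raise ValueError(f"score for {self.criterion!r} has no citations")
        for citation in self.citations:
            if citation.line < 1:
                raise ValueError(f"invalid line number in {citation.path}")


def format_score(item: CriterionScore) -> str:
    """Render one report line with its file:line evidence."""
    cites = ", ".join(f"{c.path}:{c.line}" for c in item.citations)
    return f"{item.criterion}: {item.score}/{MAX_SCORE} ({cites})"


scored = CriterionScore("error handling", 3, (Citation("api/signup.py", 42),))
print(format_score(scored))
```

Putting the check in the constructor means an unsupported score fails when it is created, rather than surfacing later when someone reads the report.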
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Ship/review is the canonical shelf because the command evaluates finished artifacts the way a structured review gate would, before you merge or ship. The Review subphase matches its evidence-based rubric scoring and its citations to files and lines, which align with pre-ship quality assessment.
Where it fits
Judge an API implementation produced earlier in the session against tailored correctness and error-handling criteria.
Evaluate README or spec drafts with a meta-judge rubric for completeness and accuracy.
Score a PR-sized diff with isolated context to reduce confirmation bias from long chats.
Assess whether generated tests adequately cover edge cases named in the evaluation focus argument.
How it compares
Use as an in-session LLM review orchestrator instead of asking the same agent thread to grade its own homework.
Common Questions / FAQ
Who is judge for?
Solo developers and small teams using agentic coding workflows who want structured, cited evaluations of work already produced in the conversation.
When should I use judge?
After completing a feature slice, doc pass, or refactor in chat—especially before ship/review or when you pass an evaluation-focus argument for security, tests, or API design.
Is judge safe to install?
Check the Security Audits panel on this Prism page; the documented behavior is report-only evaluation via sub-agents and should not auto-apply patches, but confirm your agent’s sub-agent permissions match your policy.
SKILL.md
# Judge Command

<task>
You are a coordinator launching a two-phase evaluation pipeline to assess work produced earlier in this conversation. First, a meta-judge generates tailored evaluation criteria. Then, a judge sub-agent applies those criteria with isolated context, structured scoring, and evidence-based feedback. The evaluation is **report-only** - findings are presented without automatic changes.
</task>

<context>
This command implements the **meta-judge -> LLM-as-Judge** pattern with context isolation:

- **Structured Evaluation**: Meta-judge produces tailored rubrics, checklists, and scoring criteria before judging
- **Context Isolation**: Judge operates with fresh context, preventing confirmation bias from accumulated session state
- **Evidence-Based**: Every score requires specific citations from the work (file locations, line numbers)
- **Multi-Dimensional Rubric**: Generated by meta-judge to match the specific artifact type and evaluation focus
- **Self-Verification**: Dynamic verification questions with documented adjustments
</context>

## Your Workflow

### Phase 1: Context Extraction

Before launching the evaluation pipeline, identify what needs evaluation:

1. **Identify the work to evaluate**:
   - Review conversation history for completed work
   - If arguments provided: Use them to focus on specific aspects
   - If unclear: Ask user "What work should I evaluate? (code changes, analysis, documentation, etc.)"

2. **Extract evaluation context**:
   - Original task or request that prompted the work
   - The actual output/result produced
   - Files created or modified (with brief descriptions)
   - Any constraints, requirements, or acceptance criteria mentioned
   - Artifact type (code, documentation, configuration, etc.)

3. **Provide scope for user**:

   ```
   Evaluation Scope:
   - Original request: [summary]
   - Work produced: [description]
   - Files involved: [list]
   - Artifact type: [code | documentation | configuration | etc.]
   - Evaluation focus: [from arguments or "general quality"]

   Launching meta-judge to generate evaluation criteria...
   ```

**IMPORTANT**: Pass only the extracted context to the sub-agents - not the entire conversation. This prevents context pollution and enables focused assessment.

### Phase 2: Dispatch Meta-Judge

Launch a meta-judge agent to generate an evaluation specification tailored to the specific work being evaluated. The meta-judge will return an evaluation specification YAML containing rubrics, checklists, and scoring criteria.

**Meta-Judge Prompt:**

```markdown
## Task

Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## User Prompt
{Original task or request that prompted the work}

## Context
{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}

## Artifact Type
{code | documentation | configuration | etc.}

## Instructions
Return only the final evaluation specification YAML in your response.
```

**Dispatch:**

```
Use Task tool:
  - description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
  - prompt: {meta-judge prompt}
  - model: opus
  - subagent_type: "sadd:meta-judge"
```

Wait for the meta-judge to complete before proceeding to Phase 3.

### Phase 3: Dispatch Judge Agent

After the meta-judge completes, extract its evaluation specification YAML and dispatch the judge agent with both the work context and the specification.

CRITICAL: Provide to the judge the EXACT meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!

**Judge Agent Prompt:**

```markdown
You are