
Judge With Debate
Run a structured multi-judge debate, guided by a meta-judge rubric, on a solution you need scored rigorously. The debate continues until the judges reach consensus or hit the maximum number of rounds.
Install
npx skills add https://github.com/neolabhq/context-engineering-kit --skill judge-with-debate
What is this skill?
- Meta-judge (Opus) builds a shared evaluation specification and rubrics before any judge scores
- Three independent judges analyze solutions and challenge each other with evidence
- Up to 3 debate rounds drive score convergence
- Phase 0 setup writes reports under `.specs/reports`
- Multi-Agent Debate pattern reduces single-evaluator bias vs one-pass judging
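To make the Phase 0 bullet concrete: SKILL.md names each judge's report `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`. The sketch below prints those three paths for one solution. The helper name `report_paths` is hypothetical and not part of the skill, and the solution name is passed in directly rather than derived from a filename, since the derivation rule is only shown by example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): print the three judge
# report paths for one solution, following the naming convention
# .specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md
report_paths() {
  local solution_name="$1"
  # Default to today's date; %F formats as YYYY-MM-DD.
  local report_date="${2:-$(date +%F)}"
  local judge
  for judge in 1 2 3; do
    printf '.specs/reports/%s-%s.%s.md\n' "$solution_name" "$report_date" "$judge"
  done
}
```

A typical use would be to run `mkdir -p .specs/reports` first and then call `report_paths users-api` to list where judges 1 to 3 should write.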
Adoption & trust: 555 installs on skills.sh; 1.1k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
The canonical shelf is Ship review because the skill evaluates and debates finished solution artifacts before you trust the scores or ship. The Review subphase fits iterative judging, evidence-based argument, and meta-judge rubrics rather than building new features.
Common Questions / FAQ
Is Judge With Debate safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# judge-with-debate

<task>
Evaluate solutions through multi-agent debate where independent judges analyze, challenge each other's assessments, and iteratively refine their evaluations until reaching consensus or maximum rounds.
</task>

<context>
This command implements the Multi-Agent Debate pattern for high-quality evaluation where multiple perspectives and rigorous argumentation improve assessment accuracy. Unlike single-pass evaluation, debate forces judges to defend their positions with evidence and consider counter-arguments.

Key benefits:

- **Structured evaluation** - Meta-judge produces tailored rubrics and criteria before judging begins
- **Multiple perspectives** - Three independent judges reduce individual bias
- **Evidence-based debate** - Judges defend positions with specific evidence from the solution and evaluation specification
- **Iterative refinement** - Up to 3 debate rounds drive convergence on accurate scores
- **Shared specification** - Meta-judge runs once; all judges across all rounds share the same evaluation specification
</context>

## Pattern: Debate-Based Evaluation

This command implements iterative multi-judge debate:

```
Phase 0: Setup
         mkdir -p .specs/reports
                  |
Phase 0.5: Dispatch Meta-Judge
         Meta-Judge (Opus)
                  |
         Evaluation Specification YAML
                  |
Phase 1: Independent Analysis (3 judges in parallel)
         +- Judge 1 -> {name}.1.md -+
Solution +- Judge 2 -> {name}.2.md -+-+
         +- Judge 3 -> {name}.3.md -+ |
                                      |
Phase 2: Debate Round (iterative)     |
    Each judge reads others' reports  |
         |                            |
    Argue + Defend + Challenge        |
    (grounded in eval specification)  |
         |                            |
    Revise if convinced --------------+
         |                            |
    Check consensus                   |
         +- Yes -> Final Report       |
         +- No -> Next Round ---------+
```

## Process

### Setup: Create Reports Directory

Before starting evaluation, ensure the reports directory exists:

```bash
mkdir -p .specs/reports
```

**Report naming convention:** `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`

Where:

- `{solution-name}` - Derived from solution filename (e.g., `users-api` from `src/api/users.ts`)
- `{YYYY-MM-DD}` - Current date
- `[1|2|3]` - Judge number

### Phase 0.5: Dispatch Meta-Judge

Before independent analysis, dispatch a meta-judge agent to generate a tailored evaluation specification. The meta-judge runs ONCE and produces rubrics, checklists, and scoring criteria that ALL judges will use across ALL rounds.

**Meta-judge prompt template:**

```markdown
## Task

Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that multiple judge agents will use to evaluate the solution through independent analysis and multi-round debate.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## User Prompt
{task description - what the solution was supposed to accomplish}

## Context
{Any relevant context about the solution being evaluated}

## Artifact Type
{code | documentation | configuration | etc.}

## Evaluation Mode
Multi-judge debate with consensus-seeking across rounds

## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support both independent analysis and debate-based refinement.
```

**Dispatch:**

```
Use Task tool:
  - description: "Meta-judge: generate evaluation specification for {solution-name}"
  - prompt: {meta-judge prompt}
  - model: opus
  - subagent_type: "sadd:meta-judge"
```

Wait for the meta-judge to complete and extract the evaluation specification YAML