
Do And Judge
Run a single implementation task through a sub-agent plus meta-judge rubric and LLM-as-judge loop until quality passes or retries are exhausted.
Overview
do-and-judge is an agent skill most often used in Ship (also Build integrations, Ship testing) that executes one task via an implementation sub-agent and verifies it with meta-judge criteria plus LLM-as-judge retries unt
Install
npx skills add https://github.com/neolabhq/context-engineering-kit --skill do-and-judge
What is this skill?
- Parallel dispatch of meta-judge (evaluation spec) and implementation sub-agent for speed
- LLM-as-judge applies meta-judge rubric mechanically with score ≥4 pass and max 2 retries
- Orchestrator-only role: orchestrator must not implement—delegates to fresh-context sub-agents
- Feedback loop sends judge issues into the next implementation attempt
- Single-task execution pattern for refactor-or-fix style agent work
- Pass threshold: judge score ≥4
- Maximum implementation retries: 2
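The control flow described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `meta_judge`, `implement`, and `judge` are hypothetical callables standing in for sub-agent dispatches, and `Verdict` is an illustrative type. Only the pass threshold (score ≥4) and the retry limit (2) come from the description; the sketch assumes "max 2 retries" means one initial attempt plus up to two more.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

PASS_THRESHOLD = 4.0  # from the skill description: judge score >= 4 passes
MAX_RETRIES = 2       # from the skill description: at most 2 retries


@dataclass
class Verdict:
    """Illustrative judge result; the real judge output format is not shown here."""
    score: float
    issues: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return self.score >= PASS_THRESHOLD


def do_and_judge(
    task: str,
    meta_judge: Callable[[str], str],             # hypothetical: task -> evaluation spec
    implement: Callable[[str, List[str]], str],   # hypothetical: task + feedback -> result
    judge: Callable[[str, str], Verdict],         # hypothetical: spec + result -> verdict
) -> Tuple[str, List[Verdict], bool]:
    # Dispatch the meta-judge and the first implementation attempt in parallel,
    # meta-judge first in dispatch order, then wait for both to finish.
    with ThreadPoolExecutor(max_workers=2) as pool:
        spec_future = pool.submit(meta_judge, task)
        impl_future = pool.submit(implement, task, [])
        spec, result = spec_future.result(), impl_future.result()

    history: List[Verdict] = []
    for attempt in range(MAX_RETRIES + 1):
        verdict = judge(spec, result)
        history.append(verdict)
        if verdict.passed:
            return result, history, True
        if attempt < MAX_RETRIES:
            # Feed the judge's issues into a fresh implementation attempt.
            result = implement(task, verdict.issues)

    # Retries exhausted: hand the decision back to the user.
    return result, history, False
```

The third element of the returned tuple is `False` when retries are exhausted, which is the point at which the skill expects a user decision rather than silent acceptance.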
Adoption & trust: 531 installs on skills.sh; 1.1k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You delegate coding to an agent but self-review in the same thread misses blind spots and lets half-finished work ship.
Who is it for?
Solo builders running high-stakes refactors or agent-written changes who want a structured external verifier instead of trusting the same model that wrote the code.
Skip if: The task is a simple one-line fix, where spinning up meta-judge, implementer, and judge sub-agents adds more overhead than reading the diff yourself.
When should I use this skill?
Use it when a task description (e.g., refactor a class) needs sub-agent implementation plus independent judge verification, retried until it passes or the retries run out.
What do I get? / Deliverables
You get a task completed under a tailored rubric with an independent pass/fail verdict and up to two retry cycles driven by judge feedback before you accept the result.
- Implemented change meeting judge rubric
- Judge verdict with scores and retry feedback trail
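Since the orchestrator is told to parse only VERDICT/SCORE/ISSUES from the judge output rather than read the full report, a header-only parser is one way to picture that step. The sketch below assumes a hypothetical report layout (`VERDICT:` and `SCORE:` lines, then an `ISSUES:` line followed by `- ` bullets); the real judge format is not shown on this page, and `ParsedVerdict` and `parse_judge_headers` are illustrative names.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedVerdict:
    verdict: str
    score: float
    issues: List[str] = field(default_factory=list)


def parse_judge_headers(report: str) -> ParsedVerdict:
    """Extract only the structured headers; ignore the rest of the report.

    Assumed (hypothetical) layout:
        VERDICT: PASS|FAIL
        SCORE: <number>
        ISSUES:
        - first issue
        - second issue
    """
    verdict_match = re.search(r"^VERDICT:\s*(\w+)", report, re.MULTILINE)
    score_match = re.search(r"^SCORE:\s*([0-9]+(?:\.[0-9]+)?)", report, re.MULTILINE)
    if not verdict_match or not score_match:
        raise ValueError("judge report is missing a VERDICT or SCORE header")

    issues: List[str] = []
    in_issues = False
    for line in report.splitlines():
        if line.startswith("ISSUES:"):
            in_issues = True
            continue
        if in_issues:
            stripped = line.strip()
            if stripped.startswith("- "):
                issues.append(stripped[2:].strip())
            else:
                break  # the first non-bullet line ends the ISSUES block

    return ParsedVerdict(
        verdict=verdict_match.group(1).upper(),
        score=float(score_match.group(1)),
        issues=issues,
    )
```

Keeping the parser strict (raising when a header is missing) makes a malformed judge report fail loudly instead of being treated as a pass.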
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is ship/review because the skill is an external quality gate that blocks shipping until judge score meets threshold—same pattern as code review before merge. Review subphase fits mechanical verification with structured rubrics, retry feedback, and a hard pass threshold rather than initial authoring.
Where it fits
After an agent wires a third-party SDK, run do-and-judge so a judge checks the integration against a meta-judge checklist before you merge.
Before accepting a large refactor PR from your agent, verify score ≥4 on dependency-injection and test-coverage criteria.
When agent-written tests exist but coverage claims feel soft, use judge feedback across two retries to close specific gaps.
How it compares
Use instead of asking the same agent to “double-check its own work” without a separate judge and rubric.
Common Questions / FAQ
Who is do-and-judge for?
Indie developers and small teams using Claude Code, Cursor, or similar agents who orchestrate sub-agents and want a repeatable quality gate on individual tasks.
When should I use do-and-judge?
During Build when integrating agent-generated backend changes, during Ship review before merging risky refactors, or during Ship testing when you need scored verification with automatic retries—not for open-ended brainstorming.
Is do-and-judge safe to install?
It orchestrates sub-agents that may edit code and invoke models; review the Security Audits panel on this page and constrain filesystem and network permissions in your agent settings.
SKILL.md
# do-and-judge

## Task

Execute a single task by dispatching an implementation sub-agent, verifying with an independent judge, and iterating with feedback until passing or max retries exceeded.

## Context

This command implements a **single-task execution pattern** with **meta-judge → LLM-as-a-judge verification**. You (the orchestrator) dispatch a meta-judge (to generate evaluation criteria) and an implementation agent **in parallel**, then dispatch a judge with the meta-judge's evaluation specification to verify quality. If verification fails, you launch a new implementation agent with the judge's feedback and iterate until passing (score ≥4) or max retries (2) exceeded.

Key benefits:

- **Fresh context** - Implementation agent works with clean context window
- **Structured evaluation** - Meta-judge produces tailored rubrics and checklists before judging
- **External verification** - Judge applies meta-judge specification mechanically, catching blind spots self-critique misses
- **Parallel speed** - Meta-judge and implementation run simultaneously
- **Feedback loop** - Retry with specific issues identified by judge
- **Quality gate** - Work doesn't ship until it meets threshold

**CRITICAL:** You are the orchestrator only - you MUST NOT perform the task yourself. If you read, write, or run bash tools, you have failed the task immediately. This is the single most critical criterion for you. If you use anything except sub-agents, you will be killed immediately!!!!

Your role is to:

1. Analyze the task and select optimal model
2. Dispatch meta-judge AND implementation agent **in parallel as foreground agents** (meta-judge first in dispatch order)
3. Dispatch judge agent with meta-judge's evaluation specification
4. Parse verdict and iterate if needed (max 2 retries)
5. Report final results or escalate

## RED FLAGS - Never Do These

**NEVER:**

- Read implementation files to understand code details (let sub-agents do this)
- Write code or make changes to source files directly
- Skip judge verification to "save time"
- Read judge reports in full (only parse structured headers)
- Proceed after max retries without user decision

**ALWAYS:**

- Use Task tool to dispatch sub-agents for ALL implementation work
- Dispatch meta-judge and implementation agent in parallel (meta-judge FIRST in dispatch order)
- Wait for BOTH meta-judge and implementation to complete before dispatching judge
- Pass meta-judge evaluation specification to the judge agent
- Include `CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`` in prompts to meta-judge and judge agents
- Parse only VERDICT/SCORE/ISSUES from judge output
- Iterate with feedback if verification fails

## Process

### Phase 1: Task Analysis and Model Selection

Analyze the task to select the optimal model:

```
Let me analyze this task to determine the optimal configuration:

1. **Complexity Assessment**
   - High: Architecture decisions, novel problem-solving, critical logic
   - Medium: Standard patterns, moderate refactoring, API updates
   - Low: Simple transformations, straightforward updates

2. **Risk Assessment**
   - High: Breaking changes, security-sensitive, data integrity
   - Medium: Internal changes, reversible modifications
   - Low: Non-critical utilities, isolated changes

3. **Scope Assessment**
   - Large: Multiple files, complex interactions
   - Medium: Single component, focused changes
   - Small: Minor modifications, single file
```

**Model Selection Guide:**

| Model | When to Use | Examples |
|-------|-------------|----------|
| `opus` | **Default/standard choice**. Safe for any task. Use when correctness matters, decisions are nuanced, or you're unsure. | Most implementation, code writing, business logic, architectural decisions |
| `sonnet` | Task is **not c