
Peer Review
Run a second model as a critical reviewer so your primary coding agent does not rubber-stamp its own output.
Install
npx skills add https://github.com/juliusbrussee/cavekit --skill peer-review
What is this skill?
- Six review modes: Diff Critique, Design Challenge, Threaded Debate, Delegated Scrutiny, Deciding Vote, Coverage Audit
- Explicit mandate: the reviewer finds what the builder missed rather than offering agreement or politeness
- MCP-based peer setup so any model can act as reviewer against the builder agent
- Peer review iteration loops alternating builder fixes and reviewer passes
- Codex Loop Mode combining Cavekit, Ralph Loop, and Codex as reviewer via CLI or MCP fallback
Adoption & trust: 15 installs on skills.sh; 1k GitHub stars; 2/3 security scanners passed (skills.sh audits).
Recommended Skills
Improve Codebase Architecture (mattpocock/skills)
Zoom Out (mattpocock/skills)
Caveman Review (juliusbrussee/caveman)
Requesting Code Review (obra/superpowers)
Receiving Code Review (obra/superpowers)
Request Refactor Plan (mattpocock/skills)
Journey fit
Primary fit
The canonical shelf is Ship review, because peer review is the quality gate before merge and launch, even though the same loop also applies while building features. Diff Critique, Design Challenge, and Coverage Audit map directly to human code review and the pre-release scrutiny subphase.
Common Questions / FAQ
Is Peer Review safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Peer Review

Use a second AI agent to review and challenge the first agent's work. The peer reviewer exists to find what the builder missed -- not to agree, not to be polite, and not to rubber-stamp. This is the single most effective quality gate you can add beyond automated tests.

## Core Principle

> **The peer reviewer's job is to find what the builder missed, not to agree.**

A review that says "looks good" is a wasted review. The peer review model should be given explicit instructions to be critical, to challenge assumptions, and to look for what is *not* there rather than what is.

---

## Why Peer Review Works

LLMs have blind spots. Every model has patterns it over-relies on, edge cases it misses, and architectural assumptions it makes implicitly. A second model -- or the same model with a different prompt and role -- catches a different set of issues.

**The analogy:** In traditional engineering, code review exists because the author has cognitive blind spots about their own work. The same principle applies to AI agents, but the blind spots are different: they are systematic patterns in training data, context window limitations, and prompt interpretation biases.
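
The instruction to give the reviewer an explicit critical mandate could be sketched roughly as follows. This is a minimal illustration, not the skill's actual prompt: the wording of `REVIEWER_MANDATE` and the names `ReviewRequest`, `build_review_request`, and `is_wasted_review` are all hypothetical, and the "no itemized findings means rubber stamp" check is an assumed heuristic.

```python
from dataclasses import dataclass

# Hypothetical wording: the skill's real reviewer prompt is not shown here.
REVIEWER_MANDATE = (
    "You are a critical peer reviewer. Your job is to find what the builder "
    "missed, not to agree. Challenge assumptions and look for what is NOT "
    "there: missing error handling, untested edge cases, unstated "
    "architectural assumptions. Do not reply with approval alone."
)


@dataclass
class ReviewRequest:
    system: str  # role instructions for the reviewing model
    user: str    # the material under review


def build_review_request(changeset: str, requirements: str) -> ReviewRequest:
    """Pair the fault-finding mandate with the material under review."""
    user = (
        "Requirements:\n" + requirements.strip() + "\n\n"
        "Changeset:\n" + changeset.strip() + "\n\n"
        "List concrete problems, one per line, each starting with '- '."
    )
    return ReviewRequest(system=REVIEWER_MANDATE, user=user)


def is_wasted_review(review_text: str) -> bool:
    """Assumed heuristic: a review with no itemized findings is a rubber stamp."""
    findings = [
        line for line in review_text.splitlines()
        if line.strip().startswith("- ")
    ]
    return len(findings) == 0
```

A caller might send `build_review_request(diff, spec)` to whichever second model is acting as reviewer, then re-prompt when `is_wasted_review` returns `True`. How the request reaches the model (CLI, MCP, or otherwise) is deliberately left out of the sketch.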
**What peer review catches that automated tests miss:**

- Architectural over-engineering or under-engineering
- Missing error handling patterns
- Security vulnerabilities the builder didn't consider
- Cavekit requirements that were technically met but poorly implemented
- Dead code, unused imports, and unnecessary complexity
- Performance pitfalls that only manifest at scale
- Missing edge cases not covered by the cavekit

---

## Review Modes

| Mode | Timing | Mechanism |
|------|--------|-----------|
| **Diff Critique** | After implementation completes | A second model inspects the changeset with a fault-finding prompt; the builder incorporates valid fixes |
| **Design Challenge** | During the planning phase | A second model proposes alternative designs; the builder evaluates both against spec requirements and selects the stronger option |
| **Threaded Debate** | When exploring complex trade-offs | Multiple exchanges occur on a persistent conversation thread so context accumulates across turns |
| **Delegated Scrutiny** | For substantial review tasks | A dedicated teammate agent manages the full peer review interaction and delivers a consolidated findings report to the lead |
| **Deciding Vote** | When two approaches conflict | The lead presents both options to the peer review model, which analyzes trade-offs and recommends a path forward |
| **Coverage Audit** | During the validation phase | Test coverage data and gap analysis are fed to the peer review model for independent assessment of testing thoroughness |

### Choosing the Right Mode

```
Need peer review
├─ Reviewing completed code?
│  ├─ Small changeset (< 500 lines) → Diff Critique
│  └─ Large changeset or full feature → Delegated Scrutiny
├─ Designing architecture?
│  ├─ Single decision point → Deciding Vote
│  └─ Full system design → Design Challenge
├─ Debating trade-offs?
│  ├─ Need extended back-and-forth → Threaded Debate
│  └─ Need a decisive answer → Dec