
Verification Loops
Embed checkpoint and continuous graders so your coding agent’s recommendations are validated against rules and data before you ship or act on them.
Install
npx skills add https://github.com/itallstartedwithaidea/agent-skills --skill verification-loops
What is this skill?
- Checkpoint verification at stage boundaries (pre-commit, post-analysis, before-deploy)
- Continuous assertions during generation to catch drift and hallucination early
- pass@k metrics: multiple candidates graded and best selected by consensus
- Typed grader design patterned on production agent evaluation methodology
- Trust model for autonomous agents via embedded evaluation loops, not post-hoc hope
Adoption & trust: 1 install on skills.sh; 18 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
Journey fit
The canonical shelf is Ship because the skill centers on proving that agent outputs are correct before commit, deploy, or user-facing surfacing, which is core release-gate thinking. The Testing subphase fits because the skill covers systematic evaluation pipelines, pass@k selection, and multi-stage review gates rather than one-off debugging.
Common Questions / FAQ
Is Verification Loops safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Verification Loops

Part of [Agent Skills™](https://github.com/itallstartedwithaidea/agent-skills) by [googleadsagent.ai™](https://googleadsagent.ai)

## Description

Verification Loops are systematic evaluation pipelines that validate agent outputs at every stage of execution. The fundamental challenge of autonomous agents is trust: how do you know the agent did the right thing? Verification Loops solve this by embedding checkpoint evaluations, continuous assertions, and multi-stage review gates throughout the agent's execution pipeline. This skill draws from the evaluation methodology used in production at [googleadsagent.ai™](https://googleadsagent.ai), where Buddy™ verifies every Google Ads recommendation against historical data, budget constraints, and domain rules before surfacing it to users.

The distinction between checkpoint and continuous verification is critical. Checkpoint verification evaluates outputs at defined stage boundaries (pre-commit, post-analysis, before-deploy). Continuous verification runs assertions in real time during generation, catching drift and hallucination before they propagate. Both approaches are complemented by pass@k metrics: generating multiple candidate outputs and selecting the best one based on grader consensus.

Production verification systems employ typed graders: deterministic graders for schema and constraint validation, LLM-as-judge graders for semantic quality assessment, and human-in-the-loop graders for high-stakes decisions. The combination creates a layered verification net that catches errors at the earliest and cheapest point in the pipeline.

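The description uses pass@k for generating several candidates and picking the best one. As the term is commonly used in code-generation evaluation, pass@k is also a metric: the probability that at least one of k sampled candidates passes. A minimal sketch of the usual unbiased estimator follows, assuming n candidates were graded and c of them passed. The function name `passAtK` is hypothetical and is not part of this skill.

```typescript
// Hypothetical helper (not part of the skill): unbiased pass@k estimate.
// n = candidates generated, c = candidates that passed grading, k = sample budget.
// pass@k = 1 - C(n - c, k) / C(n, k), computed as a running product for stability.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  // Fewer than k failures: every size-k sample contains at least one pass.
  if (n - c < k) return 1;
  // C(n - c, k) / C(n, k) equals the product of (1 - k / i) for i in (n - c, n].
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}
```

Tracking this value over time is one way to do the pass@k benchmarking the skill mentions; selecting the best of k candidates by grader consensus is a separate step that uses the same graded samples.
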
## Use When

- Agent outputs directly influence business decisions or user-facing content
- Regulatory or compliance requirements demand audit trails for AI-generated content
- Multi-step workflows need quality gates between stages
- You need to measure and improve agent accuracy over time (pass@k benchmarking)
- Generated code must pass tests before being committed or deployed
- Analysis results must be validated against ground truth or business rules

## How It Works

```mermaid
graph TD
    A[Agent Output] --> B[Stage 1: Deterministic Grader]
    B -->|Pass| C[Stage 2: LLM-as-Judge]
    B -->|Fail| D[Reject + Feedback Loop]
    C -->|Pass| E[Stage 3: Confidence Scoring]
    C -->|Fail| D
    E -->|High Confidence| F[Accept Output]
    E -->|Low Confidence| G{pass@k Available?}
    G -->|Yes| H[Generate k Candidates]
    H --> I[Rank by Grader Consensus]
    I --> F
    G -->|No| J[Human-in-the-Loop Review]
    J --> F
    D --> K[Error Context Injection]
    K --> L[Re-generation with Feedback]
    L --> B
```

The verification pipeline processes every agent output through three stages. Stage 1 applies deterministic graders (schema validation, constraint checking, type verification) that are fast and cheap. Stage 2 invokes an LLM-as-judge that evaluates semantic correctness, completeness, and coherence. Stage 3 computes a confidence score from the combined grader signals. Low-confidence outputs trigger pass@k generation, where multiple candidates are produced and ranked by grader consensus. Rejected outputs receive specific error feedback that is injected into the re-generation prompt.

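The routing described above can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the `Signals` shape, the `route` function, the plain averaging of the two scores, and the 0.8 threshold are all hypothetical and are not taken from the skill's implementation.

```typescript
// Hypothetical types and threshold, for illustration only.
type Decision = "accept" | "regenerate" | "best_of_k" | "human_review";

interface Signals {
  deterministicPass: boolean;  // Stage 1: schema, constraint, and type checks
  deterministicScore: number;  // Stage 1 score in [0, 1]
  judgePass: boolean;          // Stage 2: LLM-as-judge verdict
  judgeScore: number;          // Stage 2 score in [0, 1]
}

function route(s: Signals, canSampleK: boolean, threshold = 0.8): Decision {
  // A failure at stage 1 or stage 2 rejects the output; the caller then
  // injects the grader feedback into the re-generation prompt.
  if (!s.deterministicPass || !s.judgePass) return "regenerate";

  // Stage 3: combine grader scores into a confidence value (a plain mean here,
  // which is an assumption; a weighted scheme would work the same way).
  const confidence = (s.deterministicScore + s.judgeScore) / 2;
  if (confidence >= threshold) return "accept";

  // Low confidence: sample k candidates if possible, otherwise escalate to a human.
  return canSampleK ? "best_of_k" : "human_review";
}
```

Keeping the routing in a pure function like this makes the gate itself easy to unit-test, independent of how the graders produce their scores.
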
## Implementation

**Multi-Stage Verification Pipeline:**

```typescript
// A grader evaluates one output; `type` names the verification layer it belongs to.
interface Grader {
  name: string;
  type: "deterministic" | "llm_judge" | "human";
  evaluate(output: string, context: VerificationContext): Promise<GradeResult>;
}

// Verdict, numeric score, and feedback text returned by a single grader.
interface GradeResult {
  pass: boolean;
  score: number;
  feedback: string;
}

class VerificationPipeline {
  // Each inner array is the group of graders that make up one stage.
  private stages: Grader[][] = [];

  addStage(graders: Grader[]): void {
    this.stages.push(graders);
  }

  async verify(output: string, context: VerificationContext): Promise<VerificationResult> {
    const stageResults: StageResult[]