Verification Loops

Name: Verification Loops
Author: itallstartedwithaidea

itallstartedwithaidea/agent-skills

Embed checkpoint and continuous graders so your coding agent’s recommendations are validated against rules and data before you ship or act on them.

Install

npx skills add https://github.com/itallstartedwithaidea/agent-skills --skill verification-loops

What is this skill?

Checkpoint verification at stage boundaries (pre-commit, post-analysis, before-deploy)
Continuous assertions during generation to catch drift and hallucination early
pass@k metrics: multiple candidates graded and best selected by consensus
Typed grader design patterned on production agent evaluation methodology
Trust model for autonomous agents via embedded evaluation loops, not post-hoc hope

Adoption & trust: 1 install on skills.sh; 18 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).


Journey fit

Primary fit

The canonical shelf is Ship because the skill centers on proving that agent outputs are correct before commit, deploy, or user-facing surfacing, which is core release-gate thinking. The Testing subphase fits because the skill covers systematic evaluation pipelines, pass@k selection, and multi-stage review gates rather than one-off debugging.

Common Questions / FAQ

Is Verification Loops safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

SKILL.md


# Verification Loops

Part of [Agent Skills™](https://github.com/itallstartedwithaidea/agent-skills) by [googleadsagent.ai™](https://googleadsagent.ai)

## Description

Verification Loops are systematic evaluation pipelines that validate agent outputs at every stage of execution. The fundamental challenge of autonomous agents is trust — how do you know the agent did the right thing? Verification Loops solve this by embedding checkpoint evaluations, continuous assertions, and multi-stage review gates throughout the agent's execution pipeline. This skill draws from the evaluation methodology used in production at [googleadsagent.ai™](https://googleadsagent.ai), where Buddy™ verifies every Google Ads recommendation against historical data, budget constraints, and domain rules before surfacing it to users.

The distinction between checkpoint and continuous verification is critical. Checkpoint verification evaluates outputs at defined stage boundaries (pre-commit, post-analysis, before-deploy). Continuous verification runs assertions in real-time during generation, catching drift and hallucination before they propagate. Both approaches are complemented by pass@k metrics — generating multiple candidate outputs and selecting the best one based on grader consensus.
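To make the continuous side of that distinction concrete, here is a minimal sketch that re-checks a set of assertions after every generated chunk and stops at the first violation. The names `StreamAssertion`, `verifyStream`, and the two sample rules are hypothetical and chosen for illustration; the skill does not specify how its continuous assertions are implemented.

```typescript
// Hypothetical names: StreamAssertion, verifyStream, and the sample rules are
// illustrative assumptions, not part of the skill.
interface StreamAssertion {
  name: string;
  check: (textSoFar: string) => boolean;
}

interface StreamVerdict {
  ok: boolean;
  accepted: string; // text that passed every assertion before any violation
  failed?: string;  // name of the first assertion that failed, if any
}

// Re-run every assertion on the accumulated text after each chunk and stop at
// the first violation, so a bad chunk is never appended to the accepted text.
function verifyStream(chunks: string[], assertions: StreamAssertion[]): StreamVerdict {
  let accepted = "";
  for (const chunk of chunks) {
    const candidate = accepted + chunk;
    for (const assertion of assertions) {
      if (!assertion.check(candidate)) {
        return { ok: false, accepted, failed: assertion.name };
      }
    }
    accepted = candidate;
  }
  return { ok: true, accepted };
}

// Sample rules (assumed): cap the length and forbid negative budgets.
const sampleAssertions: StreamAssertion[] = [
  { name: "max-length", check: (t) => t.length <= 200 },
  { name: "no-negative-budget", check: (t) => !/budget:\s*-\d/.test(t) },
];

const cleanRun = verifyStream(["Raise bid by 5%. ", "budget: 120"], sampleAssertions);
const driftedRun = verifyStream(["Raise bid. ", "budget: -40", " per day"], sampleAssertions);
console.log(cleanRun.ok, driftedRun.ok, driftedRun.failed);
```

Because the assertions run on the accumulated text rather than on each chunk in isolation, a violation that straddles a chunk boundary is still caught once the second half arrives.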

Production verification systems employ typed graders: deterministic graders for schema and constraint validation, LLM-as-judge graders for semantic quality assessment, and human-in-the-loop graders for high-stakes decisions. The combination creates a layered verification net that catches errors at the earliest and cheapest point in the pipeline.
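As an illustration of the deterministic tier, the sketch below validates a JSON recommendation against a shape check and a budget constraint. The `BudgetRecommendation` shape, the `gradeBudgetRecommendation` function, and the budget cap are assumptions made for this example; only the `GradeResult` fields (`pass`, `score`, `feedback`) mirror the skill's own listing.

```typescript
// GradeResult mirrors the fields shown in the skill's Implementation listing.
interface GradeResult {
  pass: boolean;
  score: number;
  feedback: string;
}

// Hypothetical output shape, assumed for illustration only.
interface BudgetRecommendation {
  campaignId: string;
  dailyBudget: number;
}

function reject(feedback: string): GradeResult {
  return { pass: false, score: 0, feedback };
}

// Deterministic grader: no model calls, just parsing, type checks, and a range check.
function gradeBudgetRecommendation(output: string, maxDailyBudget: number): GradeResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return reject("Output is not valid JSON.");
  }
  if (typeof parsed !== "object" || parsed === null) {
    return reject("Output must be a JSON object.");
  }
  const rec = parsed as Partial<BudgetRecommendation>;
  if (typeof rec.campaignId !== "string" || rec.campaignId.length === 0) {
    return reject("campaignId must be a non-empty string.");
  }
  if (typeof rec.dailyBudget !== "number" || !isFinite(rec.dailyBudget)) {
    return reject("dailyBudget must be a finite number.");
  }
  if (rec.dailyBudget < 0 || rec.dailyBudget > maxDailyBudget) {
    return reject("dailyBudget must be between 0 and " + maxDailyBudget + ".");
  }
  return { pass: true, score: 1, feedback: "All deterministic checks passed." };
}

const withinCap = gradeBudgetRecommendation('{"campaignId":"c-1","dailyBudget":50}', 100);
const overCap = gradeBudgetRecommendation('{"campaignId":"c-1","dailyBudget":500}', 100);
const notJson = gradeBudgetRecommendation("not json", 100);
console.log(withinCap.pass, overCap.feedback, notJson.feedback);
```

Each rejection carries a specific message, which is what makes a cheap grader like this useful as the first gate: the failure reason can be handed straight back to the agent.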

## Use When

- Agent outputs directly influence business decisions or user-facing content
- Regulatory or compliance requirements demand audit trails for AI-generated content
- Multi-step workflows need quality gates between stages
- You need to measure and improve agent accuracy over time (pass@k benchmarking)
- Generated code must pass tests before being committed or deployed
- Analysis results must be validated against ground truth or business rules
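For the benchmarking use case in the list above, one common way to report pass@k is the unbiased estimator used in code-generation evaluation: with n sampled outputs of which c pass the graders, pass@k = 1 - C(n - c, k) / C(n, k). The sketch below assumes that definition; the skill itself does not state which formula it uses, and the function name `passAtK` is hypothetical.

```typescript
// Assumption: pass@k follows the standard unbiased estimator
//   pass@k = 1 - C(n - c, k) / C(n, k)
// where n = samples generated, c = samples that passed, k = sampling budget.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("Require 1 <= k <= n and 0 <= c <= n.");
  }
  // Fewer than k failures means any k-subset must contain a passing sample.
  if (n - c < k) return 1;
  // C(n - c, k) / C(n, k) expands to a product, which avoids large factorials.
  let probAllFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    probAllFail *= 1 - k / i;
  }
  return 1 - probAllFail;
}

console.log(passAtK(10, 3, 1)); // approximately 0.3: 3 of 10 samples pass
console.log(passAtK(4, 2, 2));  // approximately 0.8333: 1 - C(2,2)/C(4,2) = 5/6
```

Tracking this number per release gives a comparable accuracy figure over time, independent of how many samples were drawn in a given run.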

## How It Works

```mermaid
graph TD
    A[Agent Output] --> B[Stage 1: Deterministic Grader]
    B -->|Pass| C[Stage 2: LLM-as-Judge]
    B -->|Fail| D[Reject + Feedback Loop]
    C -->|Pass| E[Stage 3: Confidence Scoring]
    C -->|Fail| D
    E -->|High Confidence| F[Accept Output]
    E -->|Low Confidence| G{pass@k Available?}
    G -->|Yes| H[Generate k Candidates]
    H --> I[Rank by Grader Consensus]
    I --> F
    G -->|No| J[Human-in-the-Loop Review]
    J --> F
    D --> K[Error Context Injection]
    K --> L[Re-generation with Feedback]
    L --> B
```

The verification pipeline processes every agent output through three stages. Stage 1 applies deterministic graders — schema validation, constraint checking, type verification — that are fast and cheap. Stage 2 invokes an LLM-as-judge that evaluates semantic correctness, completeness, and coherence. Stage 3 computes a confidence score from the combined grader signals. Low-confidence outputs trigger pass@k generation, where multiple candidates are produced and ranked by grader consensus. Rejected outputs receive specific error feedback that is injected into the re-generation prompt.
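The confidence scoring and candidate ranking described above could be sketched as follows. The scoring rule (mean grader score, forced to zero when any grader fails), the synchronous grader signature, the threshold, and the names `confidenceOf` and `selectByConsensus` are all assumptions for illustration; the source does not define how consensus is computed.

```typescript
interface GradeResult {
  pass: boolean;
  score: number;
  feedback: string;
}

// Hypothetical synchronous grader signature, used to keep the sketch short.
type SyncGrader = (candidate: string) => GradeResult;

// Assumed rule: confidence is the mean score, or 0 if any grader rejects.
function confidenceOf(grades: GradeResult[]): number {
  if (grades.length === 0) return 0;
  let sum = 0;
  for (const grade of grades) {
    if (!grade.pass) return 0;
    sum += grade.score;
  }
  return sum / grades.length;
}

// Grade all k candidates and keep the highest-confidence one. A null result
// means no candidate cleared the threshold and a human should review.
function selectByConsensus(
  candidates: string[],
  graders: SyncGrader[],
  threshold: number
): { best: string | null; confidence: number } {
  let best: string | null = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = confidenceOf(graders.map((grade) => grade(candidate)));
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? { best, confidence: bestScore } : { best: null, confidence: bestScore };
}

// Two toy graders (assumed): one requires a number, one rewards brevity.
const hasNumber: SyncGrader = (c) =>
  /\d/.test(c)
    ? { pass: true, score: 1, feedback: "ok" }
    : { pass: false, score: 0, feedback: "Recommendation must quantify the change." };
const isConcise: SyncGrader = (c) =>
  c.length <= 80
    ? { pass: true, score: 1 - c.length / 160, feedback: "ok" }
    : { pass: false, score: 0, feedback: "Recommendation exceeds 80 characters." };

const picked = selectByConsensus(
  ["Increase budget", "Increase budget by 10%"],
  [hasNumber, isConcise],
  0.8
);
const escalated = selectByConsensus(["Increase budget"], [hasNumber, isConcise], 0.8);
console.log(picked.best, escalated.best);
```

Returning `null` rather than the least-bad candidate keeps the human-in-the-loop branch of the diagram explicit: the caller has to decide what happens when nothing clears the bar.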

## Implementation

**Multi-Stage Verification Pipeline:**

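```typescript
// NOTE: the source listing is truncated partway through verify(). The
// VerificationContext, StageResult, and VerificationResult shapes and the rest
// of verify() are assumptions that follow the description above.
type VerificationContext = Record<string, unknown>;

interface Grader {
  name: string;
  type: "deterministic" | "llm_judge" | "human";
  evaluate(output: string, context: VerificationContext): Promise<GradeResult>;
}

interface GradeResult {
  pass: boolean;
  score: number;
  feedback: string;
}

interface StageResult { stage: number; grades: GradeResult[]; passed: boolean; }
interface VerificationResult { accepted: boolean; stageResults: StageResult[]; feedback: string[]; }

class VerificationPipeline {
  private stages: Grader[][] = [];

  addStage(graders: Grader[]): void {
    this.stages.push(graders);
  }

  async verify(output: string, context: VerificationContext): Promise<VerificationResult> {
    const stageResults: StageResult[] = [];
    for (let stage = 0; stage < this.stages.length; stage++) {
      // Graders within a stage run concurrently; stages run in order.
      const grades = await Promise.all(this.stages[stage].map((g) => g.evaluate(output, context)));
      const passed = grades.every((g) => g.pass);
      stageResults.push({ stage, grades, passed });
      if (!passed) {
        // Stop at the first failing stage and return its feedback for re-generation.
        return { accepted: false, stageResults, feedback: grades.filter((g) => !g.pass).map((g) => g.feedback) };
      }
    }
    return { accepted: true, stageResults, feedback: [] };
  }
}
```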
