Agentic Eval

Name: Agentic Eval
Author: github

github/awesome-copilot

9.9k installs
37.1k repo stars
Updated July 28, 2026

Patterns and code templates for implementing iterative agent evaluation and refinement loops with self-critique, structured evaluation, and optimization cycles.

About

Agentic Evaluation Patterns provides techniques that let agents assess and refine their own outputs through structured feedback loops. Developers use it when building quality-critical generation systems (code, reports, analysis) where single-shot outputs are insufficient. Key workflows include basic self-reflection with pass/fail criteria, separate evaluator-optimizer pipelines with scored dimensions, test-driven code refinement, and LLM-as-judge comparison strategies. The skill emphasizes structured JSON output for reliable parsing, iteration limits to prevent runaway loops, and convergence detection. Patterns scale from simple criteria checks to rubric-based scoring across weighted dimensions.

Basic reflection loop: generate, self-critique with JSON feedback, refine based on failed criteria
Evaluator-Optimizer pattern separates generation and evaluation for clearer component responsibilities
Code-specific test-driven refinement: generate code, auto-generate tests, iterate on failures
Three evaluation strategies: outcome-based comparison, LLM-as-judge pairwise scoring, rubric-based weighted dimensions
Safety practices: iteration limits (3-5), convergence detection, structured JSON parsing, full history logging

Agentic Eval by the numbers

9,891 all-time installs (skills.sh)
+61 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #79 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

agentic-eval capabilities & compatibility

Capabilities: self critique · structured evaluation · iterative optimization · code testing · rubric scoring
Use cases: code review · testing · debugging · refactoring

From the docs

What agentic-eval says it does

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

agentic-eval.md

Use structured JSON output for reliable parsing of critique results.

agentic-eval.md

npx skills add https://github.com/github/awesome-copilot --skill agentic-eval


Security audit	3 / 3 scanners passed

What it does

Implement iterative self-critique and refinement loops so agents can evaluate and improve their own outputs across code, reports, and analysis.

Who is it for?

Quality-critical generation tasks with defined evaluation criteria (code generation, report writing, analysis); systems where iterative refinement is feasible.

Skip if: Real-time latency-critical systems; tasks without clear success metrics; single-pass generation where iteration cost prohibits refinement.

When should I use this skill?

Building agent systems for code generation, analysis reports, or structured content; implementing quality gates; designing test-driven agent workflows.

What you get

Agents produce higher-quality outputs (code, reports, analysis) through structured evaluation cycles with measurable convergence criteria.

Reflection loop implementation with critique phase
Evaluator-Optimizer class with generate/evaluate/optimize functions
Test-driven code refinement loop

By the numbers

3-5 recommended max iterations per refinement cycle
0.8 typical score threshold for 'good enough' stopping criterion

Files

SKILL.md (Markdown)

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

Quality-critical generation: Code, reports, analysis requiring high accuracy
Tasks with clear evaluation criteria: Defined success metrics exist
Content requiring specific standards: Style guides, compliance, formatting

---

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

import json

# `llm` is assumed to be a caller-supplied function: prompt string in, completion string out.
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique: request one JSON entry per criterion so the reply can be parsed
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Return only JSON that maps each criterion to
        {{"status": "PASS" or "FAIL", "feedback": "..."}}.
        """)

        # Raises json.JSONDecodeError if the model does not return valid JSON
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output

Key insight: Use structured JSON output for reliable parsing of critique results.
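
One way to make that parsing step robust is to validate the critique before acting on it. The sketch below assumes the critique is a JSON object mapping each criterion to a `status` and `feedback` entry, which is the shape the Pattern 1 loop reads. The helper names `parse_critique` and `failed_feedback` are hypothetical and not part of the skill.

```python
import json


def parse_critique(raw: str, criteria: list[str]) -> dict[str, dict[str, str]]:
    """Parse and validate a critique; raise ValueError if it is unusable.

    Assumed shape (hypothetical, matching what the reflection loop reads):
    {"<criterion>": {"status": "PASS" | "FAIL", "feedback": "<text>"}, ...}
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"critique is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("critique must be a JSON object")

    validated = {}
    for criterion in criteria:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            raise ValueError(f"missing entry for criterion: {criterion!r}")
        # Normalize case so "pass" and "PASS" are treated the same
        status = str(entry.get("status", "")).upper()
        if status not in {"PASS", "FAIL"}:
            raise ValueError(f"invalid status for {criterion!r}: {entry.get('status')!r}")
        validated[criterion] = {"status": status, "feedback": str(entry.get("feedback", ""))}
    return validated


def failed_feedback(critique: dict[str, dict[str, str]]) -> dict[str, str]:
    """Return the feedback text for failed criteria only."""
    return {name: entry["feedback"] for name, entry in critique.items() if entry["status"] == "FAIL"}
```

Raising a single exception type lets the calling loop decide whether to retry the critique or stop, instead of failing on a `KeyError` deep inside the refinement step.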

---

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

import json

# `llm` is assumed to be a caller-supplied function: prompt string in, completion string out.
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        # The doubled braces are literal braces inside the f-string
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            # Stop as soon as the output is "good enough"
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

---

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    # `run_tests` is assumed to return {"success": bool, "error": str}; it is not defined here.
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Include the spec so the fix stays on task
            code = llm(f"Fix error: {result['error']}\nSpec: {spec}\nCode: {code}")
        return code
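
The skill does not define `run_tests`. A minimal sketch of one possible implementation follows, assuming the generated code is saved as `solution.py` (so the generated tests must import from `solution`) and that pytest is installed when the default runner is used. The file names and the `runner` and `timeout` parameters are illustrative choices, not part of the skill.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code, tests, runner=(sys.executable, "-m", "pytest", "-q"), timeout=60):
    """Write code and tests to a temporary directory, run them, report the outcome.

    Returns {"success": bool, "error": str}, the shape the refinement loop reads.
    Assumption: tests import the code under test from a module named `solution`.
    """
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code, encoding="utf-8")
        Path(tmp, "test_solution.py").write_text(tests, encoding="utf-8")
        try:
            proc = subprocess.run(
                [*runner, "test_solution.py"],
                cwd=tmp,              # run inside the temp dir so `import solution` resolves
                capture_output=True,
                text=True,
                timeout=timeout,      # guard against generated code that never returns
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"tests timed out after {timeout}s"}
    # Exit code 0 means every test passed; the combined output is the feedback
    return {"success": proc.returncode == 0, "error": (proc.stdout + proc.stderr).strip()}
```

Because this executes model-generated code, it is best run inside a sandbox or container. Passing `runner=(sys.executable,)` runs the test file as a plain script, which can be handy where pytest is unavailable.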

---

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

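Use an LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both outputs must appear in the prompt, otherwise the judge has nothing to compare
    return llm(
        f"Compare outputs A and B for {criteria}. Which is better and why?\n"
        f"Output A: {output_a}\nOutput B: {output_b}"
    )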

Rubric-Based

Score outputs against weighted dimensions.

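import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Ask for a JSON object so the ratings can be parsed, e.g. {"accuracy": 4, ...}
    scores = json.loads(llm(
        f"Rate 1-5 for each dimension: {list(rubric.keys())}\n"
        f"Return only a JSON object mapping each dimension to its rating.\n"
        f"Output: {output}"
    ))
    # Weighted mean of the 1-5 ratings, divided by 5; with weights summing to 1 this lies in 0.2-1.0
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5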
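
To make the arithmetic concrete, here is a small worked sketch of the weighted score with the LLM call factored out. The helper name `weighted_score`, the validation checks, and the sample ratings are illustrative assumptions rather than part of the skill.

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}


def weighted_score(scores: dict, rubric: dict, max_rating: int = 5) -> float:
    """Combine per-dimension ratings (1..max_rating) into a single score."""
    missing = [dim for dim in rubric if dim not in scores]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for dim in rubric:
        if not 1 <= scores[dim] <= max_rating:
            raise ValueError(f"rating for {dim!r} is outside 1..{max_rating}")

    total_weight = sum(spec["weight"] for spec in rubric.values())
    weighted = sum(scores[dim] * rubric[dim]["weight"] for dim in rubric)
    # Dividing by total_weight keeps the result in range even if weights do not sum to 1;
    # rounding avoids float noise when comparing against a threshold such as 0.8
    return round(weighted / (total_weight * max_rating), 4)


ratings = {"accuracy": 4, "clarity": 5, "completeness": 3}
print(weighted_score(ratings, RUBRIC))  # (4*0.4 + 5*0.3 + 3*0.3) / 5 = 0.8
```

On a 1 to 5 scale the lowest possible result is 0.2 rather than 0, so a 0.8 threshold corresponds to a weighted average rating of 4.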

---

Best Practices

Practice	Rationale
Clear criteria	Define specific, measurable evaluation criteria upfront
Iteration limits	Set max iterations (3-5) to prevent infinite loops
Convergence check	Stop if output score isn't improving between iterations
Log history	Keep full trajectory for debugging and analysis
Structured output	Use JSON for reliable parsing of evaluation results
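
The iteration limit, convergence check, and history logging from the table can be combined in one loop. The following is a minimal sketch, assuming `generate`, `evaluate`, and `optimize` are caller-supplied callables and that `evaluate` returns a numeric score; the function name and the `min_improvement` parameter are hypothetical.

```python
from typing import Callable


def refine_until_converged(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    optimize: Callable[[str, float], str],
    score_threshold: float = 0.8,
    max_iterations: int = 3,
    min_improvement: float = 0.01,
) -> tuple[str, list[dict]]:
    """Refine until the score is good enough, stops improving, or the limit is hit.

    Returns the best-scoring output seen and the full iteration history.
    """
    output = generate(task)
    history: list[dict] = []
    best_output, best_score = output, float("-inf")

    for iteration in range(max_iterations):
        score = evaluate(output, task)
        history.append({"iteration": iteration, "score": score, "output": output})

        # Convergence check: did this attempt beat the best score by a meaningful margin?
        improved = score - best_score >= min_improvement
        if score > best_score:
            best_output, best_score = output, score

        last_iteration = iteration == max_iterations - 1
        if score >= score_threshold or not improved or last_iteration:
            break
        output = optimize(output, score)

    return best_output, history
```

Returning the best output rather than the latest one guards against a refinement step that makes things worse, and the history list gives the trajectory needed for debugging.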

---

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
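
For the last checklist item, one possible approach is a parser that never raises and reports whether parsing succeeded. This sketch assumes the evaluator reply should contain a JSON object with a numeric `overall_score`, as in the Evaluator-Optimizer pattern; the function name `parse_evaluation` and the `parse_ok` and `raw` keys are hypothetical.

```python
import json
import re


def parse_evaluation(raw: str, default_score: float = 0.0) -> dict:
    """Best-effort parse of an evaluator reply; never raises.

    The result always contains "overall_score" and "parse_ok".
    """
    candidates = [raw]
    # Models often wrap JSON in prose, so also try the outermost {...} span
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        score = data.get("overall_score")
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            data["overall_score"] = min(1.0, max(0.0, float(score)))  # clamp to 0-1
            data["parse_ok"] = True
            return data

    # Fall back to a low score so the loop keeps refining instead of crashing
    return {"overall_score": default_score, "parse_ok": False, "raw": raw}
```

A caller can check `parse_ok` to decide between retrying the evaluation and counting the failure against the iteration limit.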

Related skills

Setup Matt Pocock Skills · Scaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la · 462k · 185k

Lark Skill Maker · Quickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md. · 379k · 15.8k

Caveman · Slash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents. · 378k · 92.5k

Lark Apps · Connect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access. · 375k

Running Claude Code Via Litellm Copilot · Run Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API. · 270k · 72

Codex Pet · Generate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro. · 246k · 8

Forks & variants (1)

Agentic Eval has 1 known copy in the catalog totaling 95 installs. They canonicalize to this original listing.

smithery.ai - 95 installs

How it compares

Choose agentic-eval over generic AI skills when the task is designing evaluation infrastructure for agents, not writing a single prompt or API call.

FAQ

When should I stop iterating in a refinement loop?

Set a score threshold (e.g., 0.8 or higher) and a maximum iteration limit (3-5 is typical). Stop early if the score plateaus between iterations to avoid wasted LLM calls.

Why use JSON for evaluation output?

Structured JSON enables reliable programmatic parsing of scores and feedback, reducing failures from unstructured text parsing and enabling downstream optimization logic.

How does the Evaluator-Optimizer pattern differ from basic reflection?

Evaluator-Optimizer separates concerns: dedicated generate(), evaluate(), optimize() functions with explicit scoring. Basic reflection uses a single loop. Separation improves testability and reusability.

Is Agentic Eval safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Building · agents · automation