
Agentic Eval
Add reflection, rubric judges, and evaluator–optimizer loops so agent-written code and reports meet quality bars before merge or release.
Overview
Agentic Eval is an agent skill that implements reflection and evaluator–optimizer loops to improve AI agent outputs. It is most often used in the Ship phase, and also fits Build (agent tooling) and Operate (iterate).
Install
npx skills add https://github.com/github/awesome-copilot --skill agentic-eval
What is this skill?
- Generate → Evaluate → Critique → Refine loop with configurable max iterations
- Pattern catalog: basic reflection, evaluator–optimizer, rubric-based and LLM-as-judge systems
- Targets quality-critical code, reports, and analysis with explicit success metrics
- Test-driven code refinement workflows for agent outputs
- When-to-use guardrails for tasks with defined evaluation criteria
- Pattern 1 (basic reflection) takes a configurable max_iterations parameter (example default: 3)
Adoption & trust: 9.5k installs on skills.sh; 34.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Single-shot agent generations often fail rubrics, tests, or compliance rules, and without a repeatable loop there is no way to measure and fix quality.
Who is it for?
Builders running quality-critical agent workflows who can define success metrics, rubrics, or automated checks.
Skip if: your tasks are simple one-off chats with no evaluation criteria, or your team is unwilling to pay for extra LLM rounds of refinement.
When should I use this skill?
Implementing self-critique, evaluator–optimizer pipelines, rubric or LLM-as-judge evaluation, or iterative improvement for quality-critical agent outputs.
What do I get? / Deliverables
You can wire iterative evaluate-and-refine flows with clear criteria so code, reports, and analysis converge toward acceptable quality before release.
- Evaluation loop design (reflection or evaluator–optimizer) with iteration caps
- Improved final outputs that passed stated criteria or judges
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The canonical shelf is ship testing because the skill's core is measuring and improving outputs against criteria, analogous to QA for agent behavior. Evaluation pipelines, test-driven refinement, and LLM-as-judge map directly to the testing and quality assurance subphase.
Where it fits
Design a reflection loop around your custom tool-calling agent before exposing it to users.
Run rubric-based LLM judges on generated reports prior to a release candidate.
Measure drift in agent answers and tighten critique prompts after support tickets spike.
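As an illustration of the rubric-based release gate described above, here is a minimal sketch. The rubric dimensions, weights, threshold, and the `judge` callable are all hypothetical and not taken from the skill; in practice `judge` would wrap an LLM call that returns per-dimension scores.

```python
from typing import Callable

# Hypothetical rubric: dimension name -> weight. Not defined by the skill.
RUBRIC = {"accuracy": 0.5, "clarity": 0.3, "completeness": 0.2}


def rubric_score(scores: dict[str, float], rubric: dict[str, float] = RUBRIC) -> float:
    """Return the weighted average of per-dimension scores (each in [0, 1])."""
    missing = set(rubric) - set(scores)
    if missing:
        # Fail loudly instead of silently scoring an incomplete judgment.
        raise ValueError(f"judge omitted dimensions: {sorted(missing)}")
    total_weight = sum(rubric.values())
    return sum(rubric[dim] * scores[dim] for dim in rubric) / total_weight


def passes_gate(
    report: str,
    judge: Callable[[str], dict[str, float]],
    threshold: float = 0.8,
) -> bool:
    """True if the judged report meets the (assumed) release threshold."""
    return rubric_score(judge(report)) >= threshold


def stub_judge(report: str) -> dict[str, float]:
    """Stand-in for an LLM judge; returns fixed scores for demonstration."""
    return {"accuracy": 0.9, "clarity": 0.8, "completeness": 0.5}
```

With the stub scores the weighted result is about 0.79, so `passes_gate("draft", stub_judge)` rejects the report at the default threshold of 0.8. Raising an error on a missing dimension is a design choice: it keeps a malformed judge response from passing the gate by accident.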
How it compares
Evaluation methodology patterns for agents—not a hosted eval platform or a single lint rule.
Common Questions / FAQ
Who is agentic-eval for?
Solo and indie developers designing agent pipelines who need self-critique, judges, or test-driven refinement beyond one-shot completion.
When should I use agentic-eval?
While building agent features to design eval loops, during ship testing before merging generated code, and in operate when iterating on production agent quality with rubrics.
Is agentic-eval safe to install?
Patterns may suggest calling LLMs on your data; review the Security Audits panel on this Prism page and avoid sending secrets into judge prompts.
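One way to act on the advice about secrets is to scrub text before it is interpolated into a judge prompt. The following is a rough sketch only: the `redact` helper and both patterns are assumptions made for illustration, they are not part of the skill, and they will not catch every credential format.

```python
import re

# Hypothetical patterns; extend them for the credential formats you actually use.
KEY_VALUE_SECRET = re.compile(
    r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)(\S+)"
)
PRIVATE_KEY_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def redact(text: str) -> str:
    """Mask obvious secrets so they are not sent to a judge model."""
    # Keep the key name and separator, replace only the value.
    text = KEY_VALUE_SECRET.sub(r"\1\2[REDACTED]", text)
    # Replace whole PEM-style private key blocks.
    text = PRIVATE_KEY_BLOCK.sub("[REDACTED PRIVATE KEY]", text)
    return text
```

A judge prompt would then be built from `redact(output)` rather than from the raw output. Pattern-based redaction is a best-effort filter, so it complements, rather than replaces, keeping secrets out of agent inputs in the first place.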
SKILL.md
# Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

## Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

```
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘
```

## When to Use

- **Quality-critical generation**: Code, reports, analysis requiring high accuracy
- **Tasks with clear evaluation criteria**: Defined success metrics exist
- **Content requiring specific standards**: Style guides, compliance, formatting

---

## Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

```python
import json

# `llm` is a placeholder for your own model call (prompt in, text out).
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)

        # Expected shape: {criterion: {"status": "PASS" | "FAIL", "feedback": str}}
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    # Iteration cap reached: return the latest attempt even if some criteria still fail
    return output
```

**Key insight**: Use structured JSON output for reliable parsing of critique results.

---

## Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.
```python
import json

# `llm` is a placeholder for your own model call (prompt in, text out).
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        # Doubled braces render as literal braces inside the f-string
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            # Stop as soon as the score clears the threshold
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```

---

## Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

```python
# `llm` and `run_tests` are placeholders you supply; `run_tests` is expected
# to return a dict with "success" (bool) and "error" (str) keys.
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Feed the failure back to the model and try again
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
```

---

## Evaluation Strategies

### Outcome-Based

Evaluate whether output achieves the ex