ADK Eval Guide

Name: ADK Eval Guide
Author: google

google/adk-docs

Run and interpret Google ADK agent evaluations—metrics, evalsets, LLM-as-judge, and trajectory scoring—using the documented eval-fix loop.

Overview

ADK Eval Guide is an agent skill most often used in Ship (also Build) that documents Google ADK evaluation methodology—metrics, evalsets, LLM-as-judge, and trajectory scoring—for debugging agent quality.

Install

npx skills add https://github.com/google/adk-docs --skill adk-eval-guide

What is this skill?

MUST READ before any ADK evaluation—methodology for metrics, evalset schema, and LLM-as-judge
Reference map covers criteria guide (8 criteria), user simulation, built-in tools eval, and multimodal eval patterns
Scaffolded path: make eval, tests/eval/evalsets, and eval_config.json; non-scaffold uses adk eval CLI directly
Documents eval-fix loop: diagnose sub-threshold scores, fix root cause, re-run
Explicitly not for API cheatsheet, deploy guide, or project scaffolding—those are sibling ADK skills

Compatible agents: Claude Code, Codex, Cursor, any compatible agent

Adoption & trust: 2.6k installs on skills.sh; 1.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

Your ADK agent eval scores fail or fluctuate and you lack a systematic map from metrics and evalsets to concrete fixes.

Who is it for?

Builders with an ADK agent repo who need to run adk eval or make eval and interpret criteria, trajectory, and judge results.

Skip if: you are writing ADK API handlers, working through production deployment steps, or setting up an initial project scaffold and eval is not yet on the roadmap.

When should I use this skill?

Evaluating ADK agent quality, running adk eval or make eval, or debugging eval results; do not use for API patterns, deploy, or scaffold-only setup.

What do I get? / Deliverables

You run evaluations with the correct commands and references, diagnose failure modes in the eval-fix loop, and improve agent quality before deploy.

Executed eval run with interpreted metrics
Diagnosis notes tied to eval-fix loop
Targeted fixes to prompts, tools, or evalset cases

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Canonical shelf is Ship/testing because the skill is mandatory reading before adk eval runs and centers on quality gates, thresholds, and debugging failed scores. Testing subphase matches evalsets, criteria metrics, user simulation, multimodal eval, and iterative eval-fix workflow rather than app UI work.

Also useful

Build: Agent skills & templates

Operate: Iteration & experiments

Where it fits

Build: Agent skills & templates

Example use: Author evalsets and eval_config.json while shaping tool definitions and prompts.

Ship: Testing & QA

Example use: Run make eval or adk eval and walk the eval-fix loop before release.

Ship: Code review

Example use: Use the criteria reference to explain why a trajectory mismatch should block merge.

Operate: Iteration & experiments

Example use: Re-baseline scores after changing the judge model or adding multimodal cases.

How it compares

ADK-specific eval playbook—not generic unit-test skills or the ADK deploy/scaffold companions.

Common Questions / FAQ

Who is adk-eval-guide for?

Solo and small-team developers building Google ADK agents who must measure and improve quality with evalsets and automated judges.

When should I use adk-eval-guide?

Use in Ship/testing before merging agent changes; in Build/agent-tooling when designing evalsets; whenever running adk eval or debugging below-threshold metrics.

Is adk-eval-guide safe to install?

Evaluation may call LLM judges and tools defined in your project. Review the Security Audits panel on this page, and treat evalsets like any other test data that may contain secrets.

SKILL.md

# ADK Evaluation Guide

> **Scaffolded project?** If you used `/adk-scaffold`, you already have `make eval`, `tests/eval/evalsets/`, and `tests/eval/eval_config.json`. Start with `make eval` and iterate from there.
>
> **Non-scaffolded?** Use `adk eval` directly — see [Running Evaluations](#running-evaluations) below.

## Reference Files

| File | Contents |
|------|----------|
| `references/criteria-guide.md` | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools — trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |

---

## The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

### How to iterate

1. **Start small**: Begin with 1-2 eval cases, not the full suite
2. **Run eval**: `make eval` (or `adk eval` if no Makefile)
3. **Read the scores** — identify what failed and why
4. **Fix the code** — adjust prompts, tool logic, instructions, or the evalset
5. **Rerun eval** — verify the fix worked
6. **Repeat steps 3-5** until the case passes
7. **Only then** add more eval cases and expand coverage

**Expect 5-10+ iterations.** This is normal — each iteration makes the agent better.
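
Step 3 of the loop ("read the scores") can be sketched as a small threshold check. This is a minimal illustration, not part of ADK: the helper `failing_metrics` is hypothetical, and the scores and thresholds are made-up numbers standing in for whatever your eval run and `eval_config.json` actually contain.

```python
def failing_metrics(scores, thresholds):
    """Return {metric: (score, threshold)} for metrics that are missing or below threshold."""
    failures = {}
    for metric, threshold in thresholds.items():
        score = scores.get(metric)  # None when the run did not report this metric
        if score is None or score < threshold:
            failures[metric] = (score, threshold)
    return failures


# Illustrative numbers only (assumption), not output from a real eval run.
thresholds = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8}
scores = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.62}

for metric, (score, threshold) in failing_metrics(scores, thresholds).items():
    print(f"{metric}: {score} < {threshold} -> diagnose, fix, rerun")
```

An empty result means every configured metric met its threshold, which is the signal in step 7 to add more eval cases.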

### What to fix when scores fail

| Failure | What to change |
|---------|---------------|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset `tool_uses`, or switch to `IN_ORDER`/`ANY_ORDER` match type |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| `rubric_based` score low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER`/`ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1` |
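
The match types mentioned in the table decide whether extra or reordered tool calls count as a mismatch. The sketch below illustrates the general idea on tool names only; it is not ADK's implementation, the function `trajectory_matches` and the tool names are hypothetical, and it assumes the usual reading of the three modes (`EXACT` allows nothing extra, `IN_ORDER` allows extra calls between expected ones, `ANY_ORDER` ignores ordering). Check `references/criteria-guide.md` for the authoritative semantics, including how arguments are compared.

```python
from collections import Counter


def trajectory_matches(expected, actual, match_type="EXACT"):
    """Compare two lists of tool names under a given match type (names only, no args)."""
    if match_type == "EXACT":
        # Same tools, same order, nothing extra.
        return list(actual) == list(expected)
    if match_type == "IN_ORDER":
        # Expected calls must appear as a subsequence of the actual calls.
        remaining = iter(actual)
        return all(name in remaining for name in expected)
    if match_type == "ANY_ORDER":
        # Every expected call must be present; order and extras are ignored.
        return not (Counter(expected) - Counter(actual))
    raise ValueError(f"unknown match type: {match_type}")


# Hypothetical tool names for illustration.
expected = ["search", "summarize"]
actual = ["search", "lookup", "summarize"]  # one extra call in the middle

for mode in ("EXACT", "IN_ORDER", "ANY_ORDER"):
    print(mode, trajectory_matches(expected, actual, mode))
```

In this example the extra `lookup` call fails only the strictest mode, which is why relaxing the match type is one of the listed remedies for an agent that calls extra tools.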

---

## Choosing the Right Criteria

| Goal | Recommended Metric |
|------|--------------------|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + `hallucinations_v1` / `safety_v1` (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | `tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md`) |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.
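
As a rough illustration of the first row of the table, the snippet below builds a config that pairs the two deterministic metrics with thresholds. It assumes the config file holds a top-level `criteria` mapping from metric name to threshold; the helper `build_eval_config` is hypothetical and the threshold values are placeholders, so verify the exact schema against `references/criteria-guide.md` before copying the result into `tests/eval/eval_config.json`.

```python
import json


def build_eval_config(thresholds):
    """Wrap metric thresholds in a top-level "criteria" mapping (assumed schema)."""
    return {"criteria": dict(thresholds)}


# Placeholder thresholds for a fast, deterministic regression run (assumption).
config = build_eval_config(
    {
        "tool_trajectory_avg_score": 1.0,
        "response_match_score": 0.8,
    }
)

print(json.dumps(config, indent=2))
```

LLM-judged metrics such as `final_response_match_v2` may need extra settings (for example a judge model), which is why this sketch sticks to the two deterministic ones.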

---

## Running Evaluations

```bash
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json
```
