
ADK Eval Guide
Run and interpret Google ADK agent evaluations—metrics, evalsets, LLM-as-judge, and trajectory scoring—using the documented eval-fix loop.
Overview
ADK Eval Guide is an agent skill, used most often in the Ship phase (and also in Build), that documents Google ADK evaluation methodology (metrics, evalsets, LLM-as-judge, and trajectory scoring) for debugging agent quality.
Install
npx skills add https://github.com/google/adk-docs --skill adk-eval-guide
What is this skill?
- MUST READ before any ADK evaluation—methodology for metrics, evalset schema, and LLM-as-judge
- Reference map covers criteria guide (8 criteria), user simulation, built-in tools eval, and multimodal eval patterns
- Scaffolded path: make eval, tests/eval/evalsets, and eval_config.json; non-scaffold uses adk eval CLI directly
- Documents eval-fix loop: diagnose sub-threshold scores, fix root cause, re-run
- Explicitly not for API cheatsheet, deploy guide, or project scaffolding—those are sibling ADK skills
Adoption & trust: 2.6k installs on skills.sh; 1.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your ADK agent eval scores fail or fluctuate and you lack a systematic map from metrics and evalsets to concrete fixes.
Who is it for?
Builders with an ADK agent repo who need to run adk eval or make eval and interpret criteria, trajectory, and judge results.
Skip if: you are writing ADK API handlers, working on production deployment steps, or setting up an initial project scaffold and eval is not yet on the roadmap.
When should I use this skill?
Evaluating ADK agent quality, running adk eval or make eval, or debugging eval results; do not use for API patterns, deploy, or scaffold-only setup.
What do I get? / Deliverables
You run evaluations with the correct commands and references, diagnose failure modes in the eval-fix loop, and improve agent quality before deploy.
- Executed eval run with interpreted metrics
- Diagnosis notes tied to eval-fix loop
- Targeted fixes to prompts, tools, or evalset cases
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship/testing because the skill is mandatory reading before adk eval runs and centers on quality gates, thresholds, and debugging failed scores. Testing subphase matches evalsets, criteria metrics, user simulation, multimodal eval, and iterative eval-fix workflow rather than app UI work.
Where it fits
Author evalsets and eval_config.json while shaping tool definitions and prompts.
Run make eval or adk eval and walk the eval-fix loop before release.
Use criteria reference to explain why a trajectory mismatch should block merge.
Re-baseline scores after changing judge model or adding multimodal cases.
How it compares
ADK-specific eval playbook—not generic unit-test skills or the ADK deploy/scaffold companions.
Common Questions / FAQ
Who is adk-eval-guide for?
Solo and small-team developers building Google ADK agents who must measure and improve quality with evalsets and automated judges.
When should I use adk-eval-guide?
Use in Ship/testing before merging agent changes; in Build/agent-tooling when designing evalsets; whenever running adk eval or debugging below-threshold metrics.
Is adk-eval-guide safe to install?
Evaluation may call LLM judges and tools defined in your project. Review the Security Audits panel on this page, and treat evalsets as test data that may contain secrets.
SKILL.md
# ADK Evaluation Guide

> **Scaffolded project?** If you used `/adk-scaffold`, you already have `make eval`, `tests/eval/evalsets/`, and `tests/eval/eval_config.json`. Start with `make eval` and iterate from there.
>
> **Non-scaffolded?** Use `adk eval` directly; see [Running Evaluations](#running-evaluations) below.

## Reference Files

| File | Contents |
|------|----------|
| `references/criteria-guide.md` | Complete metrics reference: all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing: ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools: trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs: evalset schema, built-in metric limitations, custom evaluator pattern |

---

## The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, and rerun; don't just report the failure.

### How to iterate

1. **Start small**: Begin with 1-2 eval cases, not the full suite
2. **Run eval**: `make eval` (or `adk eval` if no Makefile)
3. **Read the scores**: identify what failed and why
4. **Fix the code**: adjust prompts, tool logic, instructions, or the evalset
5. **Rerun eval**: verify the fix worked
6. **Repeat steps 3-5** until the case passes
7. **Only then** add more eval cases and expand coverage

**Expect 5-10+ iterations.** This is normal; each iteration makes the agent better.
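
Step 3 of the loop above (read the scores and identify what failed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names come from this guide, but the `failing_metrics` helper, the plain-dict score format, and the threshold values are hypothetical and do not reflect ADK's actual output format or API.

```python
from typing import Dict, List


def failing_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the metrics scoring below threshold, largest shortfall first.

    A metric that has a threshold but no reported score is treated as 0.0,
    so it is always flagged.
    """
    gaps = {
        name: thresholds[name] - scores.get(name, 0.0)
        for name in thresholds
        if scores.get(name, 0.0) < thresholds[name]
    }
    return sorted(gaps, key=lambda name: gaps[name], reverse=True)


# Hypothetical thresholds and scores, for illustration only.
thresholds = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8}
scores = {"tool_trajectory_avg_score": 0.5, "response_match_score": 0.82}

print(failing_metrics(scores, thresholds))  # ['tool_trajectory_avg_score']
```

Sorting by shortfall surfaces the metric furthest from passing, which is usually the one to diagnose first before rerunning.
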
### What to fix when scores fail

| Failure | What to change |
|---------|---------------|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset `tool_uses`, or switch to `IN_ORDER`/`ANY_ORDER` match type |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response (this metric is semantic, not lexical) |
| `rubric_based` score low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER`/`ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1` |

---

## Choosing the Right Criteria

| Goal | Recommended Metric |
|------|--------------------|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + `hallucinations_v1` / `safety_v1` (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | `tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md`) |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.

---

## Running Evaluations

```bash
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json
#