Google Agents Cli Eval

Name: Google Agents Cli Eval
Author: google

google/agents-cli

64.2k installs
5.4k repo stars
Updated July 23, 2026

Google-agents-cli-eval is a skill for evaluating ADK agents using metrics, datasets, and the Quality Flywheel methodology.

About

Google-agents-cli-eval is a skill for evaluating agents built with ADK using the Quality Flywheel methodology. It covers eval metrics, dataset schema, LLM-as-judge scoring, common failure causes, multi-turn scenarios, user simulation, and automatic prompt optimization. The skill guides the full evaluation loop from data preparation through inference, grading, analysis, and fixes.

Quality Flywheel methodology for agent evaluation in 5 stages
Built-in metrics, eval dataset schema, and LLM-as-judge scoring
Failure analysis and automatic prompt optimization with GEPA

Google Agents Cli Eval by the numbers

64,240 all-time installs (skills.sh)
+8,447 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #8 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

google-agents-cli-eval capabilities & compatibility

Capabilities: evaluation · testing · metrics
Use cases: testing · debugging
Runs: Locally
Pricing: Free

npx skills add https://github.com/google/agents-cli --skill google-agents-cli-eval

Security audit	2 / 3 scanners passed

How do you evaluate ADK agents using google_search?

Evaluate agent quality using the Quality Flywheel with built-in metrics, eval datasets, failure analysis, and prompt optimization

Who is it for?

Agent evaluation, quality assurance, testing

Skip if: Standard tool-calling agents where every action appears as function_call, or teams not using Google Search grounding in ADK trajectories.

When should I use this skill?

The user needs to test or evaluate an ADK agent that uses google_search, built-in tools, or grounding_metadata in trajectories.

What you get

Correct eval assertions on grounding_metadata, session-level google_search detection, and trajectory rules for built-in versus custom tools.

Grounding-aware eval assertions
Session-level google_search detection rules

Files

SKILL.md (Markdown)

Agent Evaluation Guide

Requires: agents-cli (uv tool install google-agents-cli) — install uv first if needed.

Scaffolded project? If you used /google-agents-cli-scaffold, you already have agents-cli eval run (chains generate + grade), tests/eval/datasets/, and tests/eval/eval_config.yaml. Start by running agents-cli eval run and iterate from there.

Reference Files

File	Contents
`references/dataset_schema.md`	Canonical EvaluationDataset schema — all field types, JSON examples for single-turn / multi-turn / multi-agent, common mistakes
`references/metrics-guide.md`	Complete metrics reference — all built-in metrics, match types, custom metrics, judge model config
`references/user-simulation.md`	Dynamic conversation testing — `eval dataset synthesize` flags, what scenarios are, compatible metrics
`references/builtin-tools-eval.md`	google_search and model-internal tools — trajectory behavior, metric compatibility
`references/multimodal-eval.md`	Multimodal inputs — eval dataset schema, built-in metric limitations, custom evaluator pattern

---

The Quality Flywheel

Improving agent quality is iterative. The 5 stages below describe the loop. Stages 1, 4, and 5 each have a Default path (you, the coding agent, do the work directly) and an Opt-in CLI command that delegates to the Agent Platform Eval Service for better quality and scale. Stages 2 and 3 are plain CLI commands with no opt-in alternative.

1. Prepare Data

Default: Use or edit the scaffolded tests/eval/datasets/basic-dataset.json to define single-turn eval inputs. Start with 1–2 cases.

Opt-in: agents-cli eval dataset synthesize — runs e2e user simulation against your live agent to synthesize multi-turn eval datasets. Prefer when testing multi-turn conversations but lacking data. Output includes traces, so you can skip Stage 2 and go directly to eval grade.

2. Run Inference

agents-cli eval generate — executes the agent over the dataset and writes traces to artifacts/traces/. Run this when you wrote the dataset by hand in Stage 1 (default path). Skip this stage if you used `eval dataset synthesize` — that command already produced traces.

3. Grade Traces (always run)

agents-cli eval grade — scores the traces and writes results_<ts>.{json,html} to artifacts/grade_results/. No opt-in alternative; this is the core. Always run, regardless of how Stages 1 and 2 produced the traces.

Shortcut: agents-cli eval run chains Stages 2 + 3 in one command using the default artifacts/traces/ directory between them. Use it for the common path; drop back to the two-step form when you need a custom traces location or want to grade an existing traces file.
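The choice between the shortcut and the two-step form can be scripted. The sketch below is only an illustration: `eval_commands` and `run_all` are hypothetical helpers, not part of agents-cli, and they use only the subcommands and flags shown in this guide (`--dataset`, `-o`, `--traces`). It assumes that any custom dataset or traces location means falling back to the explicit generate + grade pair.

```python
import subprocess
from typing import Optional


def eval_commands(dataset: Optional[str] = None,
                  traces_dir: Optional[str] = None) -> list[list[str]]:
    """Return the agents-cli invocations for Stages 2 and 3.

    With default locations the single `eval run` shortcut is enough.
    Otherwise fall back to the explicit generate + grade pair.
    """
    if dataset is None and traces_dir is None:
        return [["agents-cli", "eval", "run"]]

    generate = ["agents-cli", "eval", "generate"]
    grade = ["agents-cli", "eval", "grade"]
    if dataset is not None:
        generate += ["--dataset", dataset]
    if traces_dir is not None:
        # Write traces to the custom directory and grade from the same place.
        generate += ["-o", traces_dir]
        grade += ["--traces", traces_dir]
    return [generate, grade]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(eval_commands())` would run the shortcut, while `eval_commands(traces_dir="./custom_traces/")` yields the two-step form. Check `--help` before relying on any flag, since flags can change between releases.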

4. Analyze Failures

Default: Open the latest artifacts/grade_results/results_<ts>.html (or .json) and identify failed metrics — see What to fix when scores fail below for the fix table.

Opt-in: agents-cli eval analyze — runs LLM-based failure clustering and root-cause analysis over the grade results. Prefer when you have 10+ failing cases and want categorized failure modes instead of case-by-case reading.

5. Optimize & Code Fix

Default: Edit the agent — adjust prompts, tool descriptions, instructions, or eval dataset based on the failure analysis. See What to fix when scores fail below for the failure → fix mapping.

Opt-in: agents-cli eval optimize — runs ADK GEPA prompt optimization against a target metric. Suitable for prompt-only failures. The optimized prompt appears in the command output; capture it and apply it to the agent. For the full per-iteration trace, set print_detailed_results: true in your optimization config file.

Long-running and expensive. GEPA optimization makes many LLM calls and can take a long time. Do not run it unless the user explicitly asks for prompt optimization. When you do run it, iterate as far as possible with manual fixes first, then run a single final eval optimize — never loop on this command.

Running the loop

Iterate stages 2 → 3 → 4 → 5 → 2 (or 1 → 3 → 4 → 5 → 1 if using synthesize). After each fix, run agents-cli eval compare <prev_results>.json <new_results>.json to confirm the target metric improved without regressing others. Expect 5–10+ iterations per case before it passes — this is normal. Only after a case passes should you expand coverage with more eval cases.

When doing 5+ iterations, maintain a task list of which cases are fixed, which are still failing, and what fixes you've tried. This prevents re-attempting the same fix.

Shortcuts That Waste Time

Recognize these rationalizations and push back — they always cost more time than they save:

Shortcut	Why it fails
"I'll tune the eval thresholds down to make it pass"	Lowering thresholds hides real failures. If the agent can't meet the bar, fix the agent — don't move the bar.
"This eval case is flaky, I'll skip it"	Flaky evals reveal non-determinism in your agent. Fix with `temperature=0`, rubric-based metrics, or more specific instructions — don't delete the signal.
"I just need to fix the eval dataset, not the agent"	If you're always adjusting expected outputs, your agent has a behavior problem. Fix the instructions or tool logic first.

Choosing the Right Metrics

Pick built-in metrics by what you want to measure. Multi-turn metrics evaluate the full conversation; single-turn metrics evaluate one prompt-response pair (with intermediate tool calls). When no built-in fits, write a custom metric (see Evaluation Configuration Schema below).

Goal	Recommended built-in metrics
Did the agent achieve the user's goal? (catch-all for multi-turn agents)	`multi_turn_task_success`
Was the agent's reasoning path logical and efficient?	`multi_turn_trajectory_quality`
Quality of tool / function calling across turns	`multi_turn_tool_use_quality`
Final response quality (no ground-truth reference needed)	`final_response_quality`
Factual grounding (catch hallucinated claims, e.g., RAG agents)	`hallucination`
Safety policy compliance	`safety`
Domain-specific check no built-in covers	Write a custom `LLMMetric` (LLM-judge) or `CodeExecutionMetric` (deterministic Python). See Evaluation Configuration Schema below.

Run agents-cli eval metric list to see all available built-ins. For full metric definitions and rubric details, see the Agent Platform metric docs and references/metrics-guide.md.

---

What to fix when scores fail

After agents-cli eval grade completes, inspect the latest artifacts/grade_results/results_<timestamp>.json (or open the .html file) for per-case scores and judge rationales — that's the input to every fix decision below.

Failure	What to change
`multi_turn_task_success` low	The agent isn't completing the user's goal — fix orchestration, missing tool calls, premature termination, or wrong tool selection
`multi_turn_trajectory_quality` low	The agent reaches the goal inefficiently or takes wrong steps — refine planning prompts, tighten instruction order, or remove redundant tool calls
`multi_turn_tool_use_quality` low	Fix tool descriptions, parameter docstrings, or agent instructions for tool selection
`final_response_quality` low	Read the auto-generated rubric verdicts; refine agent instructions to address the worst-scoring criterion (often clarity, completeness, or instruction-following)
`hallucination` low	Tighten agent instructions to stay grounded in tool output; verify the tool actually returned the data the agent claimed
`safety` low	Add safety guardrails to instructions; review the violating content category in the rubric verdict
Agent calls wrong tools	Fix tool descriptions, agent instructions, or `tool_config`
Agent calls extra tools	Add strict stop instructions, or switch to `multi_turn_tool_use_quality`

After applying a fix, rerun agents-cli eval generate && agents-cli eval grade and use agents-cli eval compare <prev_results>.json <new_results>.json to confirm the fix improved the target metric without regressing others.

---

Eval Commands

All agents-cli eval subcommands support --help for the authoritative flag list and defaults — run agents-cli eval <subcommand> --help (or agents-cli eval dataset <subcommand> --help) when in doubt. The examples below show the most common invocations; flags can change between releases.

`eval generate`

Runs an agent over an evaluation dataset and writes traces to disk.

# Basic — uses tests/eval/datasets/, writes to artifacts/traces/
agents-cli eval generate

# Advanced — custom dataset and output dir
agents-cli eval generate --dataset tests/eval/datasets/custom.json -o ./custom_traces/

`eval grade`

Scores generated traces against built-in or custom metrics. Writes timestamped results_<YYYYMMDD_HHMMSS>.json (consumed by eval compare) and .html (open in a browser) into the output dir, and prints a summary table to the console.

# Basic — defaults: traces from artifacts/traces/, results to artifacts/grade_results/,
# metrics from tests/eval/eval_config.yaml's metrics_to_run
agents-cli eval grade

# Advanced 1 — grade traces from a non-default location (the canonical
# pairing for `eval generate --output custom_traces/`)
agents-cli eval grade --traces custom_traces/

# Advanced 2 — pick built-in metrics, custom output dir
agents-cli eval grade --metrics tool_use_quality,safety --output ./out/

# Advanced 3 — load metrics to run from a config file (YAML or JSON) on a specified trace file.
agents-cli eval grade --traces ./artifacts/traces/trace_1.json --config tests/eval/eval_config.yaml

See Evaluation Configuration Schema below for the config file format.
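Because results files carry a fixed-width `YYYYMMDD_HHMMSS` timestamp, the newest one can be found by sorting file names. This is a small sketch under that assumption; `latest_results` is a hypothetical helper, and the default directory is the one named in this guide.

```python
from pathlib import Path
from typing import Optional


def latest_results(results_dir: str = "artifacts/grade_results") -> Optional[Path]:
    """Return the newest results_<YYYYMMDD_HHMMSS>.json, or None if there is none.

    Zero-padded timestamps sort lexicographically in chronological order,
    so the last sorted name is the most recent run.
    """
    candidates = sorted(Path(results_dir).glob("results_*.json"))
    return candidates[-1] if candidates else None
```

The returned path is what you would hand to eval compare or eval analyze.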

`eval compare`

Diffs two results_*.json files produced by eval grade. Run it after a fix to confirm the target metric improved without regressing others.

agents-cli eval compare baseline.json candidate.json
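The check that eval compare supports (target metric up, nothing else down) can be pictured with plain dictionaries. The layout of results_*.json is not described in this guide, so the sketch below does not parse those files. It assumes you already have `{metric_name: score}` mappings for each run, and `find_regressions` and `tolerance` are hypothetical names.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.0) -> dict[str, float]:
    """Return {metric: score delta} for metrics that dropped by more than tolerance.

    Only metrics present in both runs are compared.
    """
    regressions = {}
    for metric, old_score in baseline.items():
        if metric not in candidate:
            continue
        delta = candidate[metric] - old_score
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions
```

An empty result means no shared metric got worse. It says nothing about whether the target metric improved, so check that separately.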

`eval metric list`

Lists the built-in metric names usable with eval grade --metrics.

agents-cli eval metric list

`eval analyze`

Runs LLM-based failure clustering and root-cause analysis over a results_*.json produced by eval grade. Use when you have 10+ failing cases and want categorized failure modes instead of reading the HTML case-by-case. Supported --metric values: multi_turn_task_success, multi_turn_tool_use_quality.

# Basic — analyze a results file with default settings
agents-cli eval analyze --eval-result artifacts/grade_results/results_<ts>.json

# Advanced — restrict to a specific metric and cap loss clusters
agents-cli eval analyze \
  --eval-result artifacts/grade_results/results_<ts>.json \
  --metric multi_turn_tool_use_quality \
  --top-k 5 \
  --output artifacts/analysis_<ts>.json

`eval dataset synthesize`

Generates user scenarios server-side from your agent's tools and instructions, then plays each scenario against an LLM-backed user simulator. The output is a grade-ready trace file with full agent_data.turns populated; feed it directly to eval grade (skip eval generate).

# Basic — generate 3 default scenarios (up to 5 turns each) into artifacts/traces/
# (where eval grade reads from by default, so synthesize → grade works without flags)
agents-cli eval dataset synthesize

# Advanced — guide scenario generation with optional instruction and environment context
agents-cli eval dataset synthesize \
  -n 5 \
  --instruction "Customer asking about refunds" \
  --environment-context "E-commerce support" \
  --max-turns 8 \
  -o tests/eval/datasets/refund_scenarios.json

For scenario semantics, the full eval dataset synthesize flag table, and which simulator internals are not user-configurable, see references/user-simulation.md.

`eval optimize`

Runs ADK GEPA prompt optimization against a target metric. Suitable after eval grade identifies prompt-only failures (wording, not tool/orchestration logic). --dataset and --target-metric override values in --config when both are passed. Long-running and expensive — see Stage 5 of the Quality Flywheel for usage guidance.

# Basic — optimize against a single metric on a dataset
agents-cli eval optimize --dataset tests/eval/datasets/basic-dataset.json --target-metric final_response_quality

# Advanced — drive multi-metric / multi-dataset optimization from a config file
agents-cli eval optimize --config tests/eval/optimization_config.json

`eval submit` / `eval results` (cloud-side)

The managed, asynchronous counterpart to the local path, for large or CI-driven runs: eval submit hands the dataset and metrics to the Agent Platform Eval Service, and eval results polls and downloads the scores. Pass --resource-name <agent> to also run inference server-side (managed generate + grade); omit it to grade an existing trace (managed grade).

# Grade an existing trace server-side; returns a run resource name to poll
agents-cli eval submit --dataset tests/eval/datasets/basic-dataset.json --dest gs://my-bucket
# Add --resource-name projects/<p>/locations/<l>/reasoningEngines/<id> to run inference too

agents-cli eval results --run-id <run-resource-name>

---

Evaluation Dataset Format

An EvaluationDataset is a JSON file with an eval_cases array. Cases come in two shapes depending on how they're used:

Inference input (what you give to eval generate) — a user prompt or a partial conversation ending in a user prompt. The agent runs and produces traces.
Grading input (what you give to eval grade) — a complete trace including the agent's responses and tool calls. Normally produced by eval generate or eval dataset synthesize; you don't write these by hand.

See references/dataset_schema.md for the full canonical schema, all field types, and common mistakes.

Inference input format

Two shapes are supported.

(a) Simple single-turn prompt — what the scaffolded tests/eval/datasets/basic-dataset.json uses. The agent runs from scratch.

{
  "eval_cases": [
    {
      "eval_case_id": "greeting",
      "prompt": {
        "role": "user",
        "parts": [{"text": "Hello, what can you help me with?"}]
      }
    },
    {
      "eval_case_id": "weather_query",
      "prompt": {
        "role": "user",
        "parts": [{"text": "What's the weather like in San Francisco?"}]
      }
    }
  ]
}
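When you have many prompts, generating shape (a) is less error-prone than writing it by hand. A minimal sketch, assuming only the fields shown in the example above; `make_single_turn_dataset` is a hypothetical helper.

```python
def make_single_turn_dataset(prompts: dict[str, str]) -> dict:
    """Build an eval dataset from {eval_case_id: user prompt text}."""
    return {
        "eval_cases": [
            {
                "eval_case_id": case_id,
                "prompt": {"role": "user", "parts": [{"text": text}]},
            }
            for case_id, text in prompts.items()
        ]
    }
```

Serialize the result with `json.dump` into tests/eval/datasets/ to use it with eval generate.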

(b) Multi-turn continuation via `agent_data` — partial conversation, last turn ends with a user message. Use to continue an existing conversation; the agent's next response is what gets evaluated.

{
  "eval_cases": [
    {
      "eval_case_id": "booking_followup",
      "agent_data": {
        "agents": {
          "flight_booking_agent": {
            "agent_id": "flight_booking_agent",
            "instruction": "You are a helpful flight booking assistant."
          }
        },
        "turns": [
          {
            "turn_index": 0,
            "events": [
              {"author": "user", "content": {"parts": [{"text": "I want to book a flight to Paris."}]}},
              {"author": "flight_booking_agent", "content": {"parts": [{"text": "I found a flight for $800. Do you want to book it?"}]}}
            ]
          },
          {
            "turn_index": 1,
            "events": [
              {"author": "user", "content": {"parts": [{"text": "Yes, please book it."}]}}
            ]
          }
        ]
      }
    }
  ]
}
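Shape (b) only works as inference input if the conversation stops on a user message. The sketch below checks that rule before you run eval generate. `ends_with_user_message` is a hypothetical helper that relies only on the `agent_data.turns[].events[].author` fields shown above.

```python
def ends_with_user_message(eval_case: dict) -> bool:
    """True if the last event of the case's last turn is authored by "user"."""
    turns = (eval_case.get("agent_data") or {}).get("turns") or []
    if not turns:
        return False
    events = turns[-1].get("events") or []
    return bool(events) and events[-1].get("author") == "user"
```

A case that returns False is either a complete trace (grading input) or has no `agent_data` turns at all.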

Grading input format (traces)

Complete trace — agent responses, tool calls, and tool responses all present. Normally produced by eval generate or eval dataset synthesize; shown here so you can recognize the shape when debugging.

{
  "eval_cases": [
    {
      "eval_case_id": "weather_query",
      "agent_data": {
        "agents": {
          "weather_agent": {
            "agent_id": "weather_agent",
            "instruction": "You are a helpful weather assistant."
          }
        },
        "turns": [
          {
            "turn_index": 0,
            "events": [
              {"author": "user", "content": {"parts": [{"text": "What's the weather in San Francisco?"}]}},
              {"author": "weather_agent", "content": {"parts": [{"function_call": {"name": "get_weather", "args": {"city": "San Francisco"}}}]}},
              {"author": "weather_agent", "content": {"parts": [{"function_response": {"name": "get_weather", "response": {"temp_f": 62, "conditions": "foggy"}}}]}},
              {"author": "weather_agent", "content": {"parts": [{"text": "It's currently 62°F and foggy in San Francisco."}]}}
            ]
          }
        ]
      }
    }
  ]
}

Key conventions: authors are "user", agent IDs from the agents map, or "tool"; tool calls use function_call parts and tool results use function_response parts. See references/dataset_schema.md for multi-agent examples and the full type reference.
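When debugging a trace, it often helps to list which tools were called and in what order. A minimal sketch that walks the structure shown above; `tool_call_names` is a hypothetical helper and reads only the `function_call` parts.

```python
def tool_call_names(eval_case: dict) -> list[str]:
    """List the names of function_call parts in trace order."""
    names = []
    turns = (eval_case.get("agent_data") or {}).get("turns") or []
    for turn in turns:
        for event in turn.get("events", []):
            parts = (event.get("content") or {}).get("parts") or []
            for part in parts:
                call = part.get("function_call")
                if call:
                    names.append(call["name"])
    return names
```

For the weather_query trace above this yields a single entry, `get_weather`.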

---

Evaluation Configuration Schema

agents-cli eval grade --config <path> accepts a single configuration file in either YAML (.yaml / .yml) or JSON (.json). The file declares two parts:

metrics_to_run — the selection list of metric names to execute on this run. Names resolve to built-in metrics first, then to entries in custom_metrics.
custom_metrics — a definition pool of custom metrics available to this project. Defining a metric here does not run it; it must also appear in metrics_to_run (or be passed via --metrics name1,name2 on the CLI, which is equivalent to overriding metrics_to_run for that invocation).

Minimal example (YAML preferred — human-readable, no JSON escaping for prompts and Python):

metrics_to_run:
  - multi_turn_task_success     # built-in
  - example_llm_metric          # selected from custom_metrics pool below
  - agent_turn_count            # selected from custom_metrics pool below

custom_metrics:
  - name: example_llm_metric
    prompt_template: |
      Rate the agent's response 1-5 for helpfulness and accuracy.
      Prompt: {prompt}
      Final response: {response}
      Full trace (for tool-call and reasoning context): {agent_data}
      Return JSON: {"score": <1|2|3|4|5>, "explanation": "<reason>"}

  - name: agent_turn_count
    custom_function: |
      def evaluate(instance):
          turns = (instance.get("agent_data") or {}).get("turns", [])
          return {'score': len(turns)}

JSON is also accepted (same field names, with prompt_template and custom_function as escaped strings) — but always prefer YAML for human-readable configs.
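Two mistakes follow from the selection-versus-definition split: selecting a name that resolves to nothing, and defining a custom metric that is never selected. The sketch below checks both on an already-parsed config dictionary. The helper names are hypothetical, and `builtin_names` is assumed to be a set you fill from the output of agents-cli eval metric list.

```python
def unresolved_metrics(config: dict, builtin_names: set[str]) -> list[str]:
    """Names in metrics_to_run that match neither a built-in nor a custom metric."""
    custom_names = {m["name"] for m in config.get("custom_metrics", [])}
    return [
        name
        for name in config.get("metrics_to_run", [])
        if name not in builtin_names and name not in custom_names
    ]


def unselected_custom_metrics(config: dict) -> list[str]:
    """Custom metrics that are defined but will not run."""
    selected = set(config.get("metrics_to_run", []))
    return [
        m["name"]
        for m in config.get("custom_metrics", [])
        if m["name"] not in selected
    ]
```

Metrics reported by `unselected_custom_metrics` can still be run by passing them through --metrics on the command line.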

Each entry in custom_metrics is dispatched by field: presence of custom_function makes it a CodeExecutionMetric (deterministic Python); otherwise it's an LLMMetric (LLM-as-judge with prompt_template). Run agents-cli eval metric list to see available built-ins. For full custom-metric field reference (judge model options, sampling counts), see references/metrics-guide.md.

Agent trace field model. For datasets produced by agents-cli eval generate (or eval dataset synthesize), each eval case exposes three standard fields to a metric:

{prompt} — the user message (or first user turn).
{response} — the agent's final text response, extracted from the last text-bearing event. In custom_function callbacks this is instance['response'] with shape {"role": "model", "parts": [{"text": "..."}]}.
{agent_data} — the full structured turns/events trace, useful when the judge needs to reason about tool calls or intermediate reasoning.

{reference} and {context} resolve only when the eval case has reference / context fields populated (e.g., golden-answer datasets); they are not populated by eval generate / eval dataset synthesize.

Code-based metrics default to local in-process execution (no GCP project or region required, but the evaluate(instance) function runs with the CLI's privileges). Set execution: "remote" on the metric to run it server-side in Vertex AI's CodeExecutionMetric sandbox instead — that path requires a configured GCP project + region.
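As an illustration of a local code-based metric that reads `instance['response']`, the sketch below scores whether the final response contains text and stays within a word budget. The 150-word budget is an arbitrary assumption chosen for the example, and the function follows the `evaluate(instance)` shape and `{'score': ...}` return value shown earlier.

```python
def evaluate(instance):
    """Score 1 if the final response has text within a word budget, else 0."""
    max_words = 150  # hypothetical budget; tune per project
    response = instance.get("response") or {}
    # Join all text parts; non-text parts contribute nothing.
    text = " ".join(
        part.get("text", "") for part in response.get("parts", [])
    ).strip()
    word_count = len(text.split())
    return {"score": 1 if 0 < word_count <= max_words else 0}
```

To use it, paste the function body under `custom_function: |` in a `custom_metrics` entry and add the metric's name to `metrics_to_run`.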

---

Common Gotchas

Use Rubric-Based Tool Evaluation instead of Hardcoded Sequences

Evaluating agent tool usage with strict sequence matching is fragile because agents may call helper tools (like searches or geocoding) in different orders or perform extra proactive steps.

Instead, use `multi_turn_tool_use_quality` / `multi_turn_trajectory_quality`. These metrics automatically generate content-based and intent-based adaptive rubrics, and an LLM judge assesses the correctness and sequencing of tool calls semantically rather than forcing a rigid match.

App name must match directory name

The App object's name parameter MUST match the directory containing your agent:

# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")

Cross-session memory can't be tested in eval

Each eval case runs in its own fresh in-memory session (eval generate creates a new InMemorySessionService and session id per case). Multi-turn within a case works via agent_data.turns, but behavior that depends on a separate prior session — e.g. Memory Bank recall across sessions — can't be exercised by eval. Validate cross-session continuity with pytest integration tests instead.

Vertex eval region

eval grade, eval submit, and eval dataset synthesize default to the `global` endpoint — they don't inherit the manifest region (the eval services support only a subset of regions). eval analyze is global-only; eval generate runs locally and follows the project region. So you normally don't configure anything for eval.

Override per run with --region <REGION> (e.g. data residency); the service rejects an unsupported one:

400 FAILED_PRECONDITION: Unsupported region for Vertex Evaluation Service: <region>

No eval region fits your data-residency rules? Fall back to local custom metrics — a custom_metrics entry with a custom_function (execution: local, the default) grades in-process with no GCP region required. You lose the managed built-in metrics, but your custom_function can still call an LLM judge in a compliant region itself — so LLM-as-judge grading stays available anywhere.

The `before_agent_callback` Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents KeyError crashes on the first turn:

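from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext


async def initialize_state(callback_context: CallbackContext) -> None:
    # Seed a default for every state key the instruction template references.
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)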

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling.

---

Common Eval Failure Causes

Symptom	Cause	Fix
Agent mentions data not in tool output	Hallucination	Tighten agent instructions; add `hallucination` metric
"Session not found" error	App name mismatch	Ensure App `name` matches directory name
Score fluctuates between runs	Non-deterministic model	Set `temperature=0` or use rubric-based eval with multiple samples
`tool_use_quality` score low	Wrong tool selected or invalid arguments passed	Refine tool descriptions, instructions, or parameter documentation
LLM judge ignores image/audio in eval	`get_text_from_content()` skips non-text parts	Use custom metric with vision-capable judge (see `references/multimodal-eval.md`)

---

Debugging Example

User says: "tool_use_quality is low, what's wrong?"

1. Open the latest artifacts/grade_results/results_<timestamp>.html (or read the .json) and find the rubric verdicts the adaptive metric generated for the failing case.
2. Verify whether the agent selected the wrong tool or called it with wrong arguments (the trace lives in artifacts/traces/).
3. Refine the tool's parameters, Python docstring description, or the agent's tool selection instructions to guide the model better.
4. Rerun agents-cli eval generate && agents-cli eval grade.
5. Run agents-cli eval compare <prev>.json <new>.json to confirm the score improved.

---

Proving Your Work

Don't assert that eval passes — show the evidence. Concrete output prevents false confidence and catches issues early.

After running eval: Paste the scores table output so the user can see exactly what passed and failed.
After fixing a failure: Show before/after scores for the specific case you fixed, and confirm no other cases regressed.
Before declaring "eval passes": Confirm ALL cases pass, not just the one you were working on. Run agents-cli eval generate and agents-cli eval grade one final time.
Before moving to deploy: Show the final agents-cli eval grade output with all cases above threshold. This is the gate — no exceptions.

---

Related Skills

/google-agents-cli-workflow — Development workflow and the spec-driven build-evaluate-deploy lifecycle
/google-agents-cli-adk-code — ADK Python API quick reference for writing agent code
/google-agents-cli-scaffold — Project creation and enhancement with agents-cli scaffold create / scaffold enhance
/google-agents-cli-deploy — Deployment targets, CI/CD pipelines, and production workflows
/google-agents-cli-observability — Cloud Trace, logging, and monitoring for debugging agent behavior

Evaluating Agents with `google_search` and Built-in Tools

google_search Behavior (IMPORTANT)

google_search is NOT a regular tool — it's a model-internal grounding feature.

Key behavior:

Custom tools (save_preferences, save_feedback) → appear as function_call in trajectory
google_search → NEVER appears in trajectory (happens inside the model)

How google_search works internally:

llm_request.config.tools.append(
    types.Tool(google_search=types.GoogleSearch())  # Injected into model config
)

Search results come back as grounding_metadata, not function call/response events. But the evaluator STILL detects it at the session level:

{
  "error_code": "UNEXPECTED_TOOL_CALL",
  "error_message": "Unexpected tool call: google_search"
}

This causes multi_turn_tool_use_quality to ALWAYS fail for agents using google_search.

Metric compatibility for `google_search` agents:

Metric	Usable?	Why
`multi_turn_tool_use_quality`	NO	Always fails due to unexpected google_search (the `google_search` invocation is detected by the evaluator but never appears as a `function_call` / `function_response` event)
`final_response_quality`	YES	Adaptive rubric-based evaluation; works without a reference answer
`final_response_match`	NO	Search results vary across runs, so the agent's response rarely matches a fixed reference

Dataset best practices for `google_search` agents: give each case a prompt only, with no trajectory criteria for google_search, because it won't appear in the trace anyway.

{
  "eval_cases": [
    {
      "eval_case_id": "news_digest_test",
      "prompt": {
        "role": "user",
        "parts": [{"text": "Give me my news digest."}]
      }
    }
  ]
}

For agents that mix google_search with custom function tools, grade the custom tool usage with multi_turn_tool_use_quality — it judges the tool calls in the generated trace, so you don't hand-author expected calls. Optionally add a reference response for reference-based matching:

{
  "eval_case_id": "news_digest_feedback",
  "prompt": {
    "role": "user",
    "parts": [{"text": "Great, save my positive feedback."}]
  },
  "reference": {
    "response": {
      "role": "model",
      "parts": [{"text": "Feedback saved!"}]
    }
  }
}

The google_search invocation still won't appear in the trace, so multi_turn_tool_use_quality only assesses the function-tool calls (e.g., save_feedback).

Config for `google_search` agents (`eval_config.yaml`):

metrics_to_run:
  - final_response_quality

The built-in final_response_quality is sufficient for most google_search agents; it auto-generates a content-based rubric. Define a custom override in custom_metrics only if you need project-specific judge instructions — see SKILL.md's Evaluation Configuration Schema for the override pattern.

Bottom line: google_search is a model feature, not a function tool. You cannot test it with trajectory matching. Use final_response_quality to verify the agent produces grounded, cited responses.

---

ADK Built-in Tools: Trajectory Behavior Reference

Model-Internal Tools (DON'T appear in trajectory):

Tool	In Trajectory?	Eval Strategy
`google_search`	No	Rubric-based
`google_search_retrieval`	No	Rubric-based
`BuiltInCodeExecutor`	No	Check output
`VertexAiSearchTool`	No	Rubric-based
`url_context`	No	Rubric-based

These inject into llm_request.config.tools as model capabilities:

types.Tool(google_search=types.GoogleSearch())
types.Tool(code_execution=types.ToolCodeExecution())
types.Tool(retrieval=types.Retrieval(...))

Function-Based Tools (DO appear in trajectory):

Tool	In Trajectory?	Eval Strategy
`load_web_page`	Yes	`multi_turn_tool_use_quality` works
Custom tools	Yes	`multi_turn_tool_use_quality` works
AgentTool	Yes	`multi_turn_tool_use_quality` works

These generate function_call and function_response events:

types.Tool(function_declarations=[...])

Quick Reference — Can I use `multi_turn_tool_use_quality`?

google_search → NO (model-internal)
code_executor → NO (model-internal)
VertexAiSearchTool → NO (model-internal)
url_context → NO (model-internal)
load_web_page → YES (FunctionTool)
Custom functions → YES (FunctionTool)

When mixing both types (e.g., google_search + save_preferences), either:

1. Rely on final_response_quality for overall quality, or
2. Keep multi_turn_tool_use_quality, which assesses the function-tool calls that do appear in the trace, accepting that the google_search step is invisible to it.

Rule of Thumb:

If a tool provides grounding/retrieval/execution capabilities built into Gemini → model-internal, won't appear in trajectory
If it's a Python function you can call → appears in trajectory
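
The reference above can be condensed into a small lookup. The sketch below is a hypothetical helper (`suggest_metrics` is not part of agents-cli or ADK) that takes the tool names an agent uses and suggests metric IDs. It assumes any tool not listed as model-internal is a function tool, and for mixed agents it returns both metrics rather than choosing between the two options.

```python
# Tool names the reference above lists as model-internal (never in the trajectory).
MODEL_INTERNAL_TOOLS = {
    "google_search",
    "google_search_retrieval",
    "BuiltInCodeExecutor",
    "VertexAiSearchTool",
    "url_context",
}


def suggest_metrics(tool_names: list) -> list:
    """Hypothetical helper: suggest metric IDs from the tools an agent uses."""
    # Assumption: anything not in the model-internal set is a function tool.
    has_function_tools = any(name not in MODEL_INTERNAL_TOOLS for name in tool_names)
    # final_response_quality applies whichever kind of tool the agent uses.
    metrics = ["final_response_quality"]
    # Tool-use scoring only has something to look at when function calls can appear.
    if has_function_tools:
        metrics.append("multi_turn_tool_use_quality")
    return metrics


print(suggest_metrics(["google_search"]))
print(suggest_metrics(["google_search", "save_preferences"]))
```

For a google_search-only agent the helper suggests just final_response_quality; adding save_preferences brings multi_turn_tool_use_quality back in.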

Model thinking mode may bypass tools

Models with "thinking" enabled may decide they have sufficient information and skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling.

Mock mode for external APIs

When your agent calls external APIs, add mock mode so evals can run without real credentials:

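import os

def call_external_api(query: str) -> dict:
    api_key = os.environ.get("EXTERNAL_API_KEY", "")
    # Mock mode: a missing or placeholder key returns a canned response,
    # so evals can run without real credentials.
    if not api_key or api_key == "dummy_key":
        return {"status": "success", "data": "mock_response"}
    # Real API call here
    raise NotImplementedError("Replace with the real API call")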

Evaluation Dataset Schema

Canonical formats for evaluation datasets in the Agent Platform Evaluation SDK. The summary below covers the type tree as of the version this skill targets — for the live, authoritative definitions see the public SDK source: `types/evals.py` and `types/common.py`.

Core Types

EvaluationDataset
└── eval_cases: list[EvalCase]       # List of evaluation cases

EvalCase
├── prompt: Content                          # Single-turn: the user query
├── responses: list[ResponseCandidate]       # Single-turn: model response(s); list to support multi-candidate eval
├── reference: ResponseCandidate             # Ground truth (for reference-based metrics)
├── agent_data: AgentData                    # Multi-turn: full conversation trajectory
├── rubric_groups: dict[str, RubricGroup]    # Per-case rubrics; key is referenced from LLMMetric.rubric_group_name
└── (extra fields allowed)                   # Custom fields for custom metrics

ResponseCandidate
└── response: Content                # The actual Content (role + parts)

AgentData
├── agents: dict[str, AgentConfig]   # Agent definitions
└── turns: list[ConversationTurn]    # Ordered conversation turns

ConversationTurn
├── turn_index: int                  # 0-based turn number
└── events: list[AgentEvent]         # Events within this turn

AgentEvent
├── author: str                      # "user", agent_id, or "tool"
└── content: Content                 # Content with role and parts

Note on `responses` and `reference`. Both wrap a Content inside a ResponseCandidate object. So a single-turn case writes "responses": [{"response": {"role": "model", "parts": [...]}}] and "reference": {"response": {"role": "model", "parts": [...]}} — NOT a bare Content. prompt and agent_data.turns[].events[].content are bare Content (not wrapped).
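
To make the wrapping rule concrete, here is a minimal sketch that builds a single-turn case as plain dicts. The helper names `content` and `single_turn_case` are hypothetical and not part of the SDK; only the field layout comes from the type tree above.

```python
import json
from typing import Optional


def content(role: str, text: str) -> dict:
    """Build a bare Content dict (role + parts)."""
    return {"role": role, "parts": [{"text": text}]}


def single_turn_case(
    case_id: str, prompt: str, response: str, reference: Optional[str] = None
) -> dict:
    """Build a single-turn eval case following the wrapping rules above."""
    case = {
        "eval_case_id": case_id,
        "prompt": content("user", prompt),  # bare Content
        "responses": [{"response": content("model", response)}],  # wrapped, in a list
    }
    if reference is not None:
        # reference is one wrapped ResponseCandidate, not a list
        case["reference"] = {"response": content("model", reference)}
    return case


dataset = {
    "eval_cases": [
        single_turn_case(
            "capital_of_france",
            "What is the capital of France?",
            "The capital of France is Paris.",
            reference="Paris",
        )
    ]
}
print(json.dumps(dataset, indent=2))
```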

Single-Turn Dataset

For simple prompt-response evaluation (e.g., QA, summarization).

{
  "eval_cases": [
    {
      "eval_case_id": "capital_of_france",
      "prompt": {
        "role": "user",
        "parts": [{"text": "What is the capital of France?"}]
      },
      "responses": [
        {
          "response": {
            "role": "model",
            "parts": [{"text": "The capital of France is Paris."}]
          }
        }
      ],
      "reference": {
        "response": {
          "role": "model",
          "parts": [{"text": "Paris"}]
        }
      }
    },
    {
      "eval_case_id": "summarize_article",
      "prompt": {
        "role": "user",
        "parts": [{"text": "Summarize this article: ..."}]
      },
      "responses": [
        {
          "response": {
            "role": "model",
            "parts": [{"text": "The article discusses..."}]
          }
        }
      ]
    }
  ]
}

Required fields by metric type

Metric category	Required fields
Predefined (single-turn)	`prompt`, `responses`
Computation-based	`responses`, `reference`
Translation	`prompt` (source), `responses`, `reference`
Custom LLM/code	Fields referenced in your template/function
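
A pre-flight check along these lines can catch missing fields before grading. This is a sketch only: `missing_fields` and the category labels used as dictionary keys are hypothetical, and the custom LLM/code row is left out because its requirements depend on your own template or function.

```python
# Required fields per metric category, copied from the table above.
# The dictionary keys are hypothetical labels, not SDK identifiers.
REQUIRED_FIELDS = {
    "predefined_single_turn": ("prompt", "responses"),
    "computation_based": ("responses", "reference"),
    "translation": ("prompt", "responses", "reference"),
}


def missing_fields(case: dict, category: str) -> list:
    """Return required fields that are absent or empty in an eval case."""
    return [name for name in REQUIRED_FIELDS[category] if not case.get(name)]


case = {
    "eval_case_id": "summarize_article",
    "prompt": {"role": "user", "parts": [{"text": "Summarize this article: ..."}]},
    "responses": [
        {"response": {"role": "model", "parts": [{"text": "The article discusses..."}]}}
    ],
}
print(missing_fields(case, "predefined_single_turn"))
print(missing_fields(case, "computation_based"))
```

The second call reports `reference` as missing, which is why the summarize_article case above cannot be scored by a computation-based metric.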

Multi-Turn / Multi-Agent Dataset

For evaluating multi-turn agent conversations, including systems with multiple collaborating agents and tool calls. The agents map declares all participating agents; turns is the chronological conversation, where each event author is "user", an agent ID from the agents map, or "tool".

{
  "eval_cases": [
    {
      "eval_case_id": "flight_booking_via_specialist",
      "agent_data": {
        "agents": {
          "router": {
            "agent_id": "router",
            "agent_type": "RouterAgent",
            "instruction": "Route requests to the appropriate specialist."
          },
          "flight_bot": {
            "agent_id": "flight_bot",
            "agent_type": "SpecialistAgent",
            "instruction": "Search and book flights.",
            "tools": [{
              "function_declarations": [{
                "name": "search_flights",
                "description": "Search flights by destination",
                "parameters": {
                  "type": "OBJECT",
                  "properties": {
                    "destination": {"type": "STRING"}
                  }
                }
              }]
            }]
          }
        },
        "turns": [
          {
            "turn_index": 0,
            "events": [
              {
                "author": "user",
                "content": {
                  "parts": [{"text": "Book a flight to NYC"}]
                }
              },
              {
                "author": "router",
                "content": {
                  "parts": [{"text": "Routing to flight_bot."}]
                }
              }
            ]
          },
          {
            "turn_index": 1,
            "events": [
              {
                "author": "flight_bot",
                "content": {
                  "parts": [{
                    "function_call": {
                      "name": "search_flights",
                      "args": {"destination": "NYC"}
                    }
                  }]
                }
              },
              {
                "author": "flight_bot",
                "content": {
                  "parts": [{
                    "function_response": {
                      "name": "search_flights",
                      "response": {"flights": [{"id": "AA123", "price": 320}]}
                    }
                  }]
                }
              },
              {
                "author": "flight_bot",
                "content": {
                  "parts": [{"text": "Found AA123 to NYC for $320."}]
                }
              }
            ]
          }
        ]
      }
    }
  ]
}

For a single-agent multi-turn case, omit the extra agent definitions and use one entry in agents.

Per-Case Rubrics (`rubric_groups`)

EvalCase.rubric_groups is a dict of named rubric groups attached to a specific eval case. Use this when you want a rubric-based LLMMetric to evaluate against case-specific criteria rather than a globally-defined rubric. The metric references a group by name via LLMMetric.rubric_group_name; the name on the metric must match a key under `rubric_groups` in the matching EvalCase.

{
  "eval_cases": [
    {
      "eval_case_id": "booking_confirmation",
      "prompt": {"role": "user", "parts": [{"text": "Book my flight to Paris."}]},
      "rubric_groups": {
        "booking_rubrics": {
          "rubrics": [
            {"rubric_id": "confirmation_check", "content": {"property": {"description": "The model must confirm the booking and provide a reference number."}}},
            {"rubric_id": "no_speculation",     "content": {"property": {"description": "The response must not speculate about prices or availability beyond what tools returned."}}}
          ]
        }
      }
    }
  ]
}

In eval_config.yaml, an LLMMetric then references the group by name:

custom_metrics:
  - name: booking_quality
    rubric_group_name: booking_rubrics    # must match the key above
    prompt_template: |
      ...

If rubric_group_name is omitted on the metric, the metric runs without per-case rubrics. If the name is set but doesn't match any key in the case's rubric_groups, the rubrics for that case are simply not applied (no error).
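
Because a mismatched name fails silently, it can be worth checking the dataset yourself before grading. The following is a minimal sketch; `cases_missing_group` is a hypothetical helper, not an agents-cli command, and it assumes the dataset has already been loaded as a dict.

```python
def cases_missing_group(dataset: dict, rubric_group_name: str) -> list:
    """Return IDs of eval cases whose rubric_groups lack the given key."""
    return [
        case.get("eval_case_id", "<no id>")
        for case in dataset.get("eval_cases", [])
        if rubric_group_name not in (case.get("rubric_groups") or {})
    ]


dataset = {
    "eval_cases": [
        {
            "eval_case_id": "booking_confirmation",
            "rubric_groups": {"booking_rubrics": {"rubrics": []}},
        },
        {"eval_case_id": "refund_request"},  # no rubric_groups at all
    ]
}
print(cases_missing_group(dataset, "booking_rubrics"))
print(cases_missing_group(dataset, "booking_rubric"))  # typo in the metric's group name
```

Any ID in the returned list is a case where the metric would run without its per-case rubrics.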

Common Mistakes

Mistake	Fix
Using `role="assistant"`	Use `role="model"` (Vertex convention)
Missing `turn_index`	Always set sequential 0-based indices
Tool response without `function_response`	Wrap in a `function_response` part
Using `prompt` field for multi-turn	Use `agent_data` with the full trajectory
Mixing `prompt` and `agent_data` in one case	Use one or the other per `EvalCase`
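
Several of these mistakes can be detected mechanically. The sketch below is a hypothetical linter (`lint_case` is not part of agents-cli) covering three rows of the table: the `assistant` role, non-sequential `turn_index` values, and mixing `prompt` with `agent_data`. It assumes eval cases are plain dicts in the shapes shown earlier.

```python
def lint_case(case: dict) -> list:
    """Return a list of problems found in one eval case (empty if none)."""
    problems = []
    if "prompt" in case and "agent_data" in case:
        problems.append("both prompt and agent_data are set")

    # Collect every Content dict so roles can be checked in one pass.
    contents = [case.get("prompt") or {}]
    contents += [(c.get("response") or {}) for c in case.get("responses") or []]

    turns = (case.get("agent_data") or {}).get("turns") or []
    for expected, turn in enumerate(turns):
        if turn.get("turn_index") != expected:
            problems.append(f"turn {expected}: turn_index is {turn.get('turn_index')!r}")
        contents += [(e.get("content") or {}) for e in turn.get("events") or []]

    if any(c.get("role") == "assistant" for c in contents):
        problems.append('role "assistant" found; use "model"')
    return problems


bad = {
    "prompt": {"role": "user", "parts": [{"text": "hi"}]},
    "agent_data": {"turns": [{"events": [
        {"author": "bot", "content": {"role": "assistant", "parts": [{"text": "hello"}]}}
    ]}]},
}
good = {
    "agent_data": {"turns": [{"turn_index": 0, "events": [
        {"author": "user", "content": {"parts": [{"text": "hi"}]}}
    ]}]},
}
for problem in lint_case(bad):
    print(problem)
print(lint_case(good))
```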

Evaluation Metrics Reference

File paths below reference the scaffolded layout (tests/eval/eval_config.yaml or .json). Adjust for your project structure if not using google-agents-cli-scaffold.

Managed (Built-in) Metrics Reference

Run agents-cli eval metric list for the live set. The tables below summarize the predefined managed metrics in the Agent Platform Evaluation SDK, grouped by category.

Agent metrics (multi-turn / agent-aware, adaptive rubrics)

Metric Name	Metric ID	Description
Agent Multi-turn Task Success	`multi_turn_task_success`	Validates user goal/intent fulfillment across the full multi-turn conversation.
Agent Multi-turn Tool Use	`multi_turn_tool_use_quality`	Evaluates technical and semantic correctness of tool calls across multi-turn conversation.
Agent Multi-turn Trajectory	`multi_turn_trajectory_quality`	Evaluates sequential logic, efficiency, and error-recovery robustness across turns.
Agent Final Response Quality	`final_response_quality`	Comprehensive evaluation of final response and intermediate tool usage correctness.
Agent Final Response Reference-Free	`final_response_reference_free`	Evaluates agent response quality without a reference answer (requires custom rubrics).
Agent Tool Use Quality	`tool_use_quality`	Evaluates tool selection, parameter accuracy, and step sequence correctness (single-turn).
Multi-turn General Quality	`multi_turn_general_quality`	Evaluates overall response quality within a multi-turn dialogue.
Multi-turn Text Quality	`multi_turn_text_quality`	Evaluates linguistic text quality within a multi-turn dialogue.

General quality metrics (single-turn, adaptive rubrics)

Metric Name	Metric ID	Description
General Quality	`general_quality`	Overall response quality with auto-generated content-based criteria. Recommended starting point for non-agent eval.
Text Quality	`text_quality`	Linguistic aspects: fluency, coherence, grammar.
Instruction Following	`instruction_following`	How well the response adheres to specific constraints and instructions.

Static rubric metrics (fixed criteria)

Metric Name	Metric ID	Description
Agent Hallucination	`hallucination`	Segments response into atomic claims; verifies grounding in intermediate tool outputs.
Agent Final Response Match	`final_response_match`	Compares agent response to a provided golden reference answer.
Grounding	`grounding`	Checks factuality and consistency against provided context.
Safety	`safety`	Compliance against policies (PII, hate speech, dangerous content, harassment, sexual).

---

Custom Metrics

Custom metrics are declared in eval_config.yaml (or .json) under custom_metrics. See SKILL.md's Evaluation Configuration Schema section for how metrics_to_run selects from the pool. The schema below defines the per-entry fields.

Code-based metrics default to local in-process execution (no GCP project or region required); opt into the Vertex AI sandbox with execution: "remote".

Example

metrics_to_run:
  - multi_turn_trajectory_quality
  - project_response_rubric
  - agent_turn_count

custom_metrics:
  - name: project_response_rubric
    prompt_template: |
      Rate the agent's response 1-5 for helpfulness and accuracy.
      Prompt: {prompt}
      Final response: {response}
      Full trace (for tool-call and reasoning context): {agent_data}
      Return JSON: {"score": <1|2|3|4|5>, "explanation": "<reason>"}
    judge_model_sampling_count: 3

  - name: agent_turn_count
    custom_function: |
      def evaluate(instance):
          turns = (instance.get("agent_data") or {}).get("turns", [])
          return {'score': len(turns)}

  - name: tool_call_count
    execution: remote
    custom_function: |
      def evaluate(instance):
          n = 0
          for turn in (instance.get("agent_data") or {}).get("turns", []):
              for event in turn.get("events", []):
                  for part in (event.get("content") or {}).get("parts", []):
                      if "function_call" in part:
                          n += 1
          return {'score': n}

For datasets produced by agents-cli eval generate / eval dataset synthesize, each eval case exposes three standard fields to a metric: {prompt} (user message), {response} (final agent text, populated from the last text-bearing event), and {agent_data} (the full turns/events trace, used when the judge or custom function needs to reason about tool calls or intermediate steps). {reference} and {context} resolve only when the eval case has reference / context fields populated (e.g., golden-answer datasets); they are not populated by eval generate / eval dataset synthesize.
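
As an illustration of how {response} relates to {agent_data}, the sketch below picks the text of the last text-bearing event in a trace. It mirrors the description above rather than the CLI's actual implementation; `final_response_text` is a hypothetical name, and skipping user-authored events is an assumption.

```python
def final_response_text(agent_data: dict) -> str:
    """Return the text of the last non-user event that has a text part."""
    for turn in reversed(agent_data.get("turns") or []):
        for event in reversed(turn.get("events") or []):
            if event.get("author") == "user":
                continue  # assumption: user messages never count as the response
            parts = (event.get("content") or {}).get("parts") or []
            texts = [part["text"] for part in parts if part.get("text")]
            if texts:
                return "".join(texts)
    return ""


agent_data = {
    "turns": [
        {"turn_index": 0, "events": [
            {"author": "user", "content": {"parts": [{"text": "Book a flight to NYC"}]}},
            {"author": "flight_bot", "content": {"parts": [
                {"function_call": {"name": "search_flights", "args": {"destination": "NYC"}}}
            ]}},
            {"author": "flight_bot", "content": {"parts": [{"text": "Found AA123 to NYC for $320."}]}},
        ]}
    ]
}
print(final_response_text(agent_data))  # Found AA123 to NYC for $320.
```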

Schema reference

Each entry in custom_metrics must conform to one of two Agent Platform evaluation metric schemas. The presence of custom_function selects CodeExecutionMetric; otherwise it's LLMMetric.

Code Execution Metric (`CodeExecutionMetric`)

Evaluates responses using custom Python code.

Field	Required	Description
`name`	yes	Unique identifier for the metric.
`custom_function`	yes	Python source containing `def evaluate(instance):`. Receives an evaluation instance, returns a numeric score or a `{'score', 'explanation'}` dict.
`execution`	no	Where the function runs. `"local"` (default) — executed in the CLI process; no GCP project or region required; runs with the CLI's privileges, so only use trusted code. `"remote"` — uploaded and executed inside Vertex AI's `CodeExecutionMetric` sandbox; requires a configured GCP project + region.
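
Since local execution runs `evaluate` as ordinary Python, the function can be smoke-tested on a hand-built instance before it is pasted into `custom_function`. The sketch below shows the `{'score', 'explanation'}` return form; the sample instance is illustrative and assumes `agent_data` arrives as a dict, as in the examples above.

```python
def evaluate(instance):
    """Score 1.0 if the trace contains at least one function_call, else 0.0."""
    calls = [
        part["function_call"].get("name", "?")
        for turn in (instance.get("agent_data") or {}).get("turns", [])
        for event in turn.get("events", [])
        for part in (event.get("content") or {}).get("parts", [])
        if "function_call" in part
    ]
    if calls:
        return {"score": 1.0, "explanation": "Called: " + ", ".join(calls)}
    return {"score": 0.0, "explanation": "No function_call parts in the trace."}


# Hand-built instance for a quick local check (illustrative data only).
sample = {
    "agent_data": {"turns": [{"turn_index": 0, "events": [
        {"author": "flight_bot", "content": {"parts": [
            {"function_call": {"name": "search_flights", "args": {"destination": "NYC"}}}
        ]}}
    ]}]}
}
result = evaluate(sample)
print(result)
```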

LLM-as-a-Judge Metric (`LLMMetric`)

Evaluates responses using an LLM judge driven by a prompt template.

Field	Required	Description
`name`	yes	Unique identifier for the metric.
`prompt_template`	yes	Prompt template used by the judge model. With agents-cli's file-based `EvaluationDataset` use `{prompt}`, `{response}`, and `{agent_data}` (the full trajectory). `{reference}` and `{context}` resolve only when the eval case has those fields populated.
`rubric_group_name`	no	Name of the rubric group containing rubrics this metric uses. Must match a key under `rubric_groups` in your dataset's `EvalCase` entries (see `dataset_schema.md`). When set, the judge prompt is augmented with the rubrics from the matching group; when omitted, the metric runs without per-case rubrics.
`judge_model`	no	Judge model (e.g., `gemini-flash-latest`).
`judge_model_sampling_count`	no	Number of judge samples to compute the score (1–32).
`judge_model_system_instruction`	no	System instruction for the judge model.
`judge_model_generation_config`	no	Generation config for the judge LLM (e.g., `temperature`).

Multimodal Evaluation

Two distinct cases are covered here:

1. Evaluate generated image / video quality against a text prompt.
2. Evaluate an agent that consumes multimodal input and produces text (e.g., the agent describes an image and we want to verify the description).

Both cases use a custom LLMMetric with a vision-capable judge model. The built-in adaptive metrics only inspect text parts, so they can't reason about media content directly — a custom metric is required for true multimodal grading.

Multimodal field-model note. agents-cli eval generate populates {response} by extracting the text parts of the agent's final event. If your agent returns non-text parts (e.g., inline_data images, file_data URIs), those parts are not copied into {response} automatically. To grade with the full multimodal Content, either hand-author the eval case with a responses[0].response Content containing the media parts, or post-process the generated trace file to copy the media parts into responses.
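
The post-processing option could look roughly like the sketch below. It is an assumption-laden illustration, not an agents-cli feature: `copy_final_parts_into_responses` is a hypothetical helper, it assumes each case in the trace file follows the dataset schema (`agent_data.turns[].events[].content.parts`), and it treats the last non-user event carrying text or media as the final answer.

```python
# Part keys that count as answer content (assumption for this sketch).
MEDIA_OR_TEXT = ("text", "inline_data", "file_data")


def copy_final_parts_into_responses(case: dict) -> dict:
    """Copy the parts of the last agent event into responses[0].response."""
    turns = (case.get("agent_data") or {}).get("turns") or []
    for turn in reversed(turns):
        for event in reversed(turn.get("events") or []):
            parts = (event.get("content") or {}).get("parts") or []
            is_answer = any(key in part for part in parts for key in MEDIA_OR_TEXT)
            if event.get("author") != "user" and is_answer:
                case["responses"] = [{"response": {"role": "model", "parts": parts}}]
                return case
    return case  # nothing to copy; the case is left unchanged


case = {
    "eval_case_id": "coffee_image",
    "agent_data": {"turns": [{"turn_index": 0, "events": [
        {"author": "user", "content": {"parts": [{"text": "Draw a cup of coffee"}]}},
        {"author": "image_agent", "content": {"parts": [
            {"text": "Here is your image."},
            {"file_data": {"mime_type": "image/png", "file_uri": "gs://my-bucket/out/coffee.png"}},
        ]}},
    ]}]},
}
copy_final_parts_into_responses(case)
print(case["responses"][0]["response"]["parts"])
```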

File paths below reference the scaffolded layout (tests/eval/). Adjust for your project structure if not using google-agents-cli-scaffold.

---

Dataset shape for multimodal parts

Multimodal content lives inside parts as either inline_data (base64-encoded bytes with a mime type) or file_data (GCS URI reference). Use whichever fits — file_data is preferred for anything larger than a few KB.

{ "inline_data": { "mime_type": "image/png", "data": "<base64>" } }

{ "file_data": { "mime_type": "image/jpeg", "file_uri": "gs://my-bucket/photos/test.jpg" } }
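
Either part type can be produced with a few lines of standard-library Python. This is a minimal sketch; `media_part` is a hypothetical helper name, and the choice between inline bytes and a URI is left to the caller.

```python
import base64
from typing import Optional


def media_part(
    mime_type: str, *, data: Optional[bytes] = None, file_uri: Optional[str] = None
) -> dict:
    """Build an inline_data part from bytes, or a file_data part from a URI."""
    if (data is None) == (file_uri is None):
        raise ValueError("Pass exactly one of data or file_uri")
    if file_uri is not None:
        return {"file_data": {"mime_type": mime_type, "file_uri": file_uri}}
    # inline_data carries the raw bytes as base64 text
    encoded = base64.b64encode(data).decode("ascii")
    return {"inline_data": {"mime_type": mime_type, "data": encoded}}


print(media_part("image/jpeg", file_uri="gs://my-bucket/photos/test.jpg"))
print(media_part("image/png", data=b"abc"))
```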

---

Case 1: Evaluate generated image / video against a text prompt

The eval case has the user prompt as text and the model response as a Content with a media file_data (or inline_data) part.

{
  "eval_cases": [
    {
      "eval_case_id": "coffee_image",
      "prompt": {
        "role": "user",
        "parts": [{"text": "steaming cup of coffee and a croissant on a table"}]
      },
      "responses": [
        {
          "response": {
            "role": "model",
            "parts": [
              {"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}
            ]
          }
        }
      ]
    }
  ]
}

For video, swap mime_type to video/mp4 (or appropriate) and point at a video URI.

Custom metric (`eval_config.yaml`)

custom_metrics:
  - name: image_prompt_alignment
    prompt_template: |
      You are evaluating whether the generated image (in {response}) matches
      the user's text prompt. Consider object presence, attributes, actions,
      composition, and style.

      Prompt: {prompt}
      Image: {response}

      Return JSON: {"score": <0.0-1.0>, "explanation": "<reason>"}
    judge_model: gemini-flash-latest
    judge_model_sampling_count: 3

Run with agents-cli eval grade --config tests/eval/eval_config.yaml. For video evaluation, use the same pattern with a video-capable judge model and rubric criteria (motion consistency, temporal coherence, scene transitions).

---

Case 2: Agent consumes multimodal input, produces text

The user input contains an image / audio / file; the agent produces a text response. To verify the text against the original media (e.g., "did the agent correctly describe this image?"), use a custom LLMMetric with a vision-capable judge.

Dataset shape

The multimodal input lives in the prompt field for single-turn, or inside the user-authored event in agent_data for multi-turn:

{
  "eval_cases": [
    {
      "eval_case_id": "describe_chart",
      "prompt": {
        "role": "user",
        "parts": [
          {"text": "Describe this image"},
          {"inline_data": {"mime_type": "image/png", "data": "<base64>"}}
        ]
      },
      "responses": [
        {
          "response": {
            "role": "model",
            "parts": [{"text": "The image shows a bar chart..."}]
          }
        }
      ]
    }
  ]
}

Custom metric (`eval_config.yaml`)

custom_metrics:
  - name: multimodal_response_quality
    prompt_template: |
      You are evaluating whether the agent's text response accurately reflects
      the user's multimodal input. Inspect the user input parts (which may
      include images, audio, or files) and the agent response, then return JSON:
      {"score": <0.0-1.0>, "explanation": "<reason>"}.

      User input: {prompt}
      Agent response: {response}
    judge_model: gemini-flash-latest
    judge_model_sampling_count: 3

Run with agents-cli eval grade --config tests/eval/eval_config.yaml.

---

Notes

Built-in adaptive metrics (`final_response_quality`, etc.) skip media parts. They extract only .text parts when constructing the judge prompt. Use a custom LLMMetric for true multimodal grading.
Choose a vision-capable `judge_model`. gemini-flash-latest and gemini-pro-latest both handle images and video; verify capability before relying on it.
Sampling count (judge_model_sampling_count) of 3–5 reduces variance for multimodal judges, which can be noisier than text-only.

For the full custom-metric field reference, see references/metrics-guide.md. For dataset schema and the inline_data / file_data part types, see references/dataset_schema.md.

User Simulation for Dynamic Evaluation

File paths below reference the scaffolded layout. Adjust for your project structure if not using google-agents-cli-scaffold.

When to Use

Use user simulation when fixed prompts are impractical — the agent may ask for information in different orders or respond in unexpected ways. Instead of hand-recording every user/agent turn, let agents-cli eval dataset synthesize ask the Vertex AI evaluation service to generate user scenarios for your agent and then play each scenario against an LLM-backed user simulator. The resulting traces (with full agent_data.turns populated) drop straight into agents-cli eval grade.

A user scenario is a starting_prompt (the user's opening message) plus a free-text conversation_plan (how the simulated user should behave for the rest of the conversation). You don't author these yourself in the agents-cli flow — eval dataset synthesize generates them from your agent's tools and instructions.

For deterministic, hand-authored eval cases (e.g., regression coverage), use the recorded-turns format instead: write agent_data.turns directly in your dataset and run agents-cli eval generate to play it back. See references/dataset_schema.md. agents-cli eval generate requires either a top-level prompt or agent_data on every case; it does not play hand-authored user_scenario cases.

---

Running `eval dataset synthesize`

# Synthesize 3 scenarios (default), simulate them, write traces to artifacts/traces/traces_<ts>.json
agents-cli eval dataset synthesize

# Steer scenario generation with an instruction and environment context
agents-cli eval dataset synthesize \
  -n 5 \
  --max-turns 8 \
  --instruction "Customer asking about refunds" \
  --environment-context "E-commerce support; orders are visible by order_id"

# Use a custom model for scenario generation (default: service default)
agents-cli eval dataset synthesize --model gemini-2.5-pro

CLI flags exposed by agents-cli eval dataset synthesize:

Flag	What it controls
`-n / --count`	Number of scenarios to generate (default 3)
`--instruction`	Natural-language steering for scenario generation
`--environment-context`	World context the simulator can rely on (e.g., available data)
`--model`	Model used for scenario generation (server-side; not the simulated user model)
`--max-turns`	Cap on user↔agent turns per scenario (default 5)
`-o / --output`	Output path; defaults to `artifacts/traces/traces_<ts>.json`
`--project` / `--region`	GCP project / region overrides. `synthesize` defaults to the `global` eval endpoint (ignores the manifest `region`); pass `--region` only for data residency — the service rejects an unsupported one.

Simulator internals are NOT user-configurable from agents-cli. The LLM-backed user simulator that plays the user side runs inside _synthesize_runner.py with hardcoded ADK defaults (gemini-2.5-flash for the user voice, default thinking config, no custom_instructions). Only --max-turns reaches it (as LlmBackedUserSimulatorConfig.max_allowed_invocations). There is no eval_config.yaml key, no --simulator-model flag, and no way to override custom_instructions or model_configuration short of editing _synthesize_runner.py directly.

---

What `synthesize` writes

A single JSON EvaluationDataset file at the output path. Each case has:

eval_case_id — server-generated UUID
user_scenario — the generated {starting_prompt, conversation_plan} (preserved for traceability)
agent_data.turns — the full simulated conversation: user events, agent responses, tool calls, tool responses

Because agent_data.turns is fully populated, the file is already a grade-ready trace. Skip eval generate and go straight to eval grade:

agents-cli eval dataset synthesize
agents-cli eval grade   # reads artifacts/traces/ by default

If synthesize fails for some scenarios, the failing cases land in the output with empty agent_data.turns and a stderr warning; the rest still pass through to eval grade.
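
To see which scenarios failed without scanning stderr, the traces can be inspected directly. The sketch below is a hypothetical helper (`failed_case_ids` is not an agents-cli command) and assumes the output file is an EvaluationDataset with a top-level `eval_cases` list, as described above.

```python
def failed_case_ids(dataset: dict) -> list:
    """Return IDs of cases whose agent_data.turns is missing or empty."""
    return [
        case.get("eval_case_id", "<no id>")
        for case in dataset.get("eval_cases", [])
        if not (case.get("agent_data") or {}).get("turns")
    ]


# In-memory stand-in for a synthesized traces file (illustrative data only).
traces = {
    "eval_cases": [
        {"eval_case_id": "a1", "agent_data": {"turns": [{"turn_index": 0, "events": []}]}},
        {"eval_case_id": "b2", "agent_data": {"turns": []}},  # simulation failed
    ]
}
print(failed_case_ids(traces))
```

For a real run, load the dataset with `json.load` from the file under artifacts/traces/ and pass the resulting dict to the helper.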

---

Compatible Metrics

Simulated conversations have no ground-truth response, so only reference-free metrics work:

Metric	Why it works
`hallucination`	Reference-free; checks claims against tool output
`safety`	Reference-free; static-rubric policy check
`final_response_reference_free`	Reference-free by design
`tool_use_quality`	Adaptive rubric — no expected trajectory needed
`multi_turn_task_success`	Adaptive rubric judges whether the simulated user's goal was met
`multi_turn_trajectory_quality`	Adaptive rubric on agent reasoning across turns
`multi_turn_tool_use_quality`	Adaptive rubric on tool calls across turns

Reference-required metrics (e.g., final_response_match) cannot be used: simulated conversations have no ground-truth response to match against.

Example tests/eval/eval_config.yaml for grading synthesized traces:

metrics_to_run:
  - hallucination
  - safety
  - multi_turn_task_success

Run with:

agents-cli eval grade --config tests/eval/eval_config.yaml

The eval_config.yaml file is read by eval grade only — eval dataset synthesize ignores it.

---

Notes

Scenario quality depends entirely on agent metadata. generate_conversation_scenarios reads your agent's instructions and tool descriptions to generate plausible user behaviors. Vague tool descriptions produce vague scenarios. Tighten tool docstrings before running synthesize on a new agent.
`--max-turns` is a hard cap. The simulated user can stop earlier (when its goal is met or it gives up); --max-turns only prevents runaway loops.
Re-running synthesize generates new scenarios. There is no seed flag — each invocation produces fresh scenarios. For repeatable regression coverage, write agent_data.turns directly (see references/dataset_schema.md) instead of relying on synthesize.

How it compares

Use google-agents-cli-eval when evals must account for google_search grounding; use generic testing skills for standard function_call-only agents.

FAQ

Why does google_search not appear in ADK trajectories?

google-agents-cli-eval explains that google_search is a model-internal grounding feature injected into llm_request.config.tools, not a regular function tool. It never appears as a function_call event; results come back as grounding_metadata instead.

How do evaluators detect google_search usage?

According to google-agents-cli-eval, evaluators detect google_search at the session level by inspecting grounding_metadata and session records rather than tool call/response pairs, because trajectories contain no function_call events for it.

How do custom ADK tools differ from google_search in evals?

google-agents-cli-eval notes custom tools like save_preferences or save_feedback appear as function_call in trajectories, while google_search grounding bypasses that pattern and requires grounding_metadata-aware assertions.

Is Google Agents Cli Eval safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
