
Google Agents CLI Eval
Run agent evals on Google ADK agents that use google_search without picking metrics that always fail.
Overview
Google Agents CLI Eval is an agent skill most often used in Ship (also Build, Operate) that explains how to evaluate Google ADK agents that use google_search without choosing metrics that always fail.
Install
npx skills add https://github.com/google/agents-cli --skill google-agents-cli-eval
What is this skill?
- Explains google_search as model-internal grounding—not a function_call in the trajectory
- Documents which built-in eval metrics work: final_response_quality yes; multi_turn_tool_use_quality and final_response_m
- Clarifies why evaluators emit UNEXPECTED_TOOL_CALL for google_search despite invisible tool events
- Gives dataset best practices tailored to agents that rely on live Google Search grounding
- Helps you avoid false failures when judging multi-turn tool-use quality
- 3 eval metrics explicitly rated for google_search agent compatibility in the compatibility table
- google_search never appears as function_call in the trajectory but can still trigger UNEXPECTED_TOOL_CALL at session lev
Adoption & trust: 12.4k installs on skills.sh; 2.7k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your ADK agent evals fail with UNEXPECTED_TOOL_CALL for google_search even though search never shows up as a normal tool call in the trajectory.
Who is it for?
Indie builders running Vertex or ADK agent evaluations who use google_search and need rubric-based quality checks that still pass.
Skip if: You only need generic unit tests that string-match a single fixed reference answer on search-backed agents.
When should I use this skill?
You are evaluating a Google ADK agent that uses google_search or built-in grounding and your tool-use or match metrics fail unexpectedly.
What do I get? / Deliverables
You pick compatible eval metrics and dataset patterns so google_search agents get meaningful quality signals instead of guaranteed multi-turn tool-use failures.
- Metric selection aligned with google_search behavior
- Dataset design that avoids unstable reference-match assumptions
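The metric selection described above can be sketched as a small lookup. This is only an illustration, not part of the skill or of ADK: the helper `pick_metrics` and its `grade_function_tools` flag are hypothetical, and the tool and metric names are copied from the SKILL.md excerpt on this page. By default it follows the compatibility table, which rates `multi_turn_tool_use_quality` as unusable whenever a model-internal tool such as `google_search` is present.

```python
# Hypothetical helper for illustration; not an ADK or agents-cli API.
# Tool names come from the "Model-Internal Tools" table in SKILL.md.
MODEL_INTERNAL_TOOLS = {
    "google_search",
    "google_search_retrieval",
    "BuiltInCodeExecutor",
    "VertexAiSearchTool",
    "url_context",
}


def pick_metrics(tool_names, grade_function_tools=False):
    """Return built-in metric names that suit an agent's tool set.

    grade_function_tools is an assumed opt-in for agents that mix
    model-internal tools with custom function tools.
    """
    names = set(tool_names)
    internal = names & MODEL_INTERNAL_TOOLS
    function_tools = names - MODEL_INTERNAL_TOOLS

    # Rubric-based quality works with or without search grounding.
    metrics = ["final_response_quality"]

    # Tool-use grading only makes sense when function tools exist, and is
    # skipped by default when a model-internal tool is also in use.
    if function_tools and (not internal or grade_function_tools):
        metrics.append("multi_turn_tool_use_quality")
    return metrics


print(pick_metrics(["google_search", "save_feedback"]))
print(pick_metrics(["save_feedback"]))
```

`final_response_match` is left out on purpose, since the excerpt only discusses it as unreliable for search-grounded replies.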
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship because the skill is about evaluation metrics, datasets, and pass/fail behavior after you have an agent to test. Testing is where trajectory inspection, rubric scores, and metric choice belong in the solo-builder journey.
Where it fits
You add google_search to an ADK agent and want to know which eval hooks will still be honest before you wire CI.
You run multi-turn evals and need to swap final_response_match for rubric-based final_response_quality.
Production feedback diverges from eval scores and you suspect metric mismatch on search-grounded replies.
How it compares
Use for Google ADK eval semantics and metric compatibility—not as a generic LLM benchmark runner.
Common Questions / FAQ
Who is google-agents-cli-eval for?
Solo and indie builders shipping Google ADK agents who run official or custom evaluators and hit confusing failures around google_search and tool trajectories.
When should I use google-agents-cli-eval?
Use it in Ship when designing test suites; in Build when choosing built-in tools for your agent; and in Operate when iteratively tuning prompts after production-like eval runs.
Is google-agents-cli-eval safe to install?
It is documentation-style procedural guidance with no special runtime hooks; review the Security Audits panel on this Prism page before trusting any third-party skill in your agent workflow.
Workflow Chain
Then invoke: google agents cli observability
SKILL.md
# Evaluating Agents with `google_search` and Built-in Tools

## google_search Behavior (IMPORTANT)

`google_search` is NOT a regular tool; it's a **model-internal grounding feature**.

**Key behavior:**

- Custom tools (`save_preferences`, `save_feedback`) → appear as `function_call` in trajectory
- `google_search` → NEVER appears in trajectory (happens inside the model)

**How google_search works internally:**

```python
from google.genai import types

# `llm_request` is the request object supplied by the framework at call time.
llm_request.config.tools.append(
    types.Tool(google_search=types.GoogleSearch())  # Injected into model config
)
```

Search results come back as `grounding_metadata`, not function call/response events. But the evaluator STILL detects it at the session level:

```json
{
  "error_code": "UNEXPECTED_TOOL_CALL",
  "error_message": "Unexpected tool call: google_search"
}
```

This causes `multi_turn_tool_use_quality` to ALWAYS fail for agents using `google_search`.

**Metric compatibility for `google_search` agents:**

| Metric | Usable? | Why |
|--------|---------|-----|
| `multi_turn_tool_use_quality` | NO | Always fails due to unexpected google_search (the `google_search` invocation is detected by the evaluator but never appears as a `function_call` / `function_response` event) |
| `final_response_quality` | YES | Adaptive rubric-based evaluation; works without a reference answer |
| `final_response_match` | NO | Search results vary across runs, so the agent's response rarely matches a fixed reference |

**Dataset best practices for `google_search` agents:**

```json
{
  "eval_cases": [
    {
      "eval_case_id": "news_digest_test",
      "prompt": {
        "role": "user",
        "parts": [{"text": "Give me my news digest."}]
      }
    }
  ]
}
```

Add NO trajectory criteria for `google_search`; it won't appear in the trace anyway.

For agents that mix `google_search` with custom function tools, grade the custom tool usage with `multi_turn_tool_use_quality`. It judges the tool calls in the generated trace, so you don't hand-author expected calls.
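The prompt-only dataset shape recommended above can also be generated from a plain mapping of case IDs to prompts. The following is a minimal Python sketch, assuming only the field names shown in the example JSON (`eval_cases`, `eval_case_id`, `prompt`, `role`, `parts`, `text`). The helper names `make_eval_case` and `make_dataset` are hypothetical and are not part of ADK or agents-cli.

```python
import json


def make_eval_case(case_id: str, user_text: str) -> dict:
    """Build one eval case with a user prompt and no trajectory criteria."""
    return {
        "eval_case_id": case_id,
        "prompt": {"role": "user", "parts": [{"text": user_text}]},
    }


def make_dataset(prompts: dict) -> dict:
    """Wrap prompt-only eval cases in the top-level `eval_cases` list."""
    return {
        "eval_cases": [
            make_eval_case(case_id, text) for case_id, text in prompts.items()
        ]
    }


# Hypothetical usage: one case mirroring the example in this section.
dataset = make_dataset({"news_digest_test": "Give me my news digest."})

# json.dumps emits strict JSON, so no comments can slip into the file.
print(json.dumps(dataset, indent=2))
```

Generating the file this way keeps it valid JSON and makes it harder to add expected tool calls for `google_search` by accident.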
Optionally add a `reference` response for reference-based matching:

```json
{
  "eval_case_id": "news_digest_feedback",
  "prompt": {
    "role": "user",
    "parts": [{"text": "Great, save my positive feedback."}]
  },
  "reference": {
    "response": {
      "role": "model",
      "parts": [{"text": "Feedback saved!"}]
    }
  }
}
```

The `google_search` invocation still won't appear in the trace, so `multi_turn_tool_use_quality` only assesses the function-tool calls (e.g., `save_feedback`).

**Config for `google_search` agents (`eval_config.yaml`):**

```yaml
metrics_to_run:
  - final_response_quality
```

The built-in `final_response_quality` is sufficient for most `google_search` agents; it auto-generates a content-based rubric. Define a custom override in `custom_metrics` only if you need project-specific judge instructions; see SKILL.md's *Evaluation Configuration Schema* for the override pattern.

**Bottom line:** `google_search` is a model feature, not a function tool. You cannot test it with trajectory matching. Use `final_response_quality` to verify the agent produces grounded, cited responses.

---

## ADK Built-in Tools: Trajectory Behavior Reference

**Model-Internal Tools (DON'T appear in trajectory):**

| Tool | In Trajectory? | Eval Strategy |
|------|----------------|---------------|
| `google_search` | No | Rubric-based |
| `google_search_retrieval` | No | Rubric-based |
| `BuiltInCodeExecutor` | No | Check output |
| `VertexAiSearchTool` | No | Rubric-based |
| `url_context` | No | Rubric-based |

These inject into `llm_request.config.tools` as model capabilities:

```python
types.Tool(google_search=types.GoogleSearch())
types.Tool(code_execution=types.ToolCodeExecution())
types.Tool(retrieval=types.Retrieval(...))
```

**Function-Based Tools (DO appear in trajectory):**

| Tool | In Trajectory? | Eval Strategy |
|------|----------------|---------------|
| `load_web_page` | Yes | `multi_turn_tool_use_quality` works |
| Custom tools | Yes |