ADK Evaluation Guide

Scaffolded project? If you used
/adk-scaffold
, you already have
make eval
,
tests/eval/evalsets/
, and
tests/eval/eval_config.json
. Start with
make eval
and iterate from there.
Non-scaffolded? Use
adk eval
directly — see Running Evaluations below.

Reference Files

File	Contents
`references/criteria-guide.md`	Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config
`references/user-simulation.md`	Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics
`references/builtin-tools-eval.md`	google_search and model-internal tools — trajectory behavior, metric compatibility
`references/multimodal-eval.md`	Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern

The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

How to iterate

Start small: Begin with 1-2 eval cases, not the full suite
Run eval:
```
make eval
```
(or
```
adk eval
```
if no Makefile)
Read the scores — identify what failed and why
Fix the code — adjust prompts, tool logic, instructions, or the evalset
Rerun eval — verify the fix worked
Repeat steps 3-5 until the case passes
Only then add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal — each iteration makes the agent better.

What to fix when scores fail

Failure	What to change
`tool_trajectory_avg_score` low	Fix agent instructions (tool ordering), update evalset `tool_uses` , or switch to `IN_ORDER` / `ANY_ORDER` match type
`response_match_score` low	Adjust agent instruction wording, or relax the expected response
`final_response_match_v2` low	Refine agent instructions, or adjust expected response — this is semantic, not lexical
`rubric_based` score low	Refine agent instructions to address the specific rubric that failed
`hallucinations_v1` low	Tighten agent instructions to stay grounded in tool output
Agent calls wrong tools	Fix tool descriptions, agent instructions, or tool_config
Agent calls extra tools	Use `IN_ORDER` / `ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1`

Choosing the Right Criteria

Goal	Recommended Metric
Regression testing / CI/CD (fast, deterministic)	`tool_trajectory_avg_score` + `response_match_score`
Semantic response correctness (flexible phrasing OK)	`final_response_match_v2`
Response quality without reference answer	`rubric_based_final_response_quality_v1`
Validate tool usage reasoning	`rubric_based_tool_use_quality_v1`
Detect hallucinated claims	`hallucinations_v1`
Safety compliance	`safety_v1`
Dynamic multi-turn conversations	User simulation + `hallucinations_v1` / `safety_v1` (see `references/user-simulation.md` )
Multimodal input (image, audio, file)	`tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md` )

For the complete metrics reference with config examples, match types, and custom metrics, see

references/criteria-guide.md

Running Evaluations

bash

# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals

CLI options:

--config_file_path

--print_detailed_results

--eval_storage_uri

--log_level

Eval set management:

bash

adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>

Configuration Schema (

test_config.json

)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

Full example

json

{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}

Simple threshold shorthand is also valid:

"response_match_score": 0.8

For custom metrics,

judge_model_options

details, and

user_simulator_config

, see

references/criteria-guide.md

EvalSet Schema (

evalset.json

)

json

{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}

Key fields:

```
intermediate_data.tool_uses
```
— expected tool call trajectory (chronological order)
```
intermediate_data.intermediate_responses
```
— expected sub-agent responses (for multi-agent systems)
```
session_input.state
```
— initial session state (overrides Python-level initialization)

conversation_scenario

— alternative to

conversation

for user simulation (see

references/user-simulation.md

)

Common Gotchas

The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g.,

google_search

after

save_preferences

). This causes

tool_trajectory_avg_score

failures with

EXACT

match. Solutions:

Use
IN_ORDER
or
ANY_ORDER
match type — tolerates extra tool calls between expected ones
Include ALL tools the agent might call in your expected trajectory
Use
```
rubric_based_tool_use_quality_v1
```
instead of trajectory matching
Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."

Multi-turn conversations require tool_uses for ALL turns

The

tool_trajectory_avg_score

evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

json

{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}

App name must match directory name

The

App

object's

name

parameter MUST match the directory containing your agent:

python

# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")

The

before_agent_callback

Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents

KeyError

crashes on the first turn:

python

async def initialize_state(callback_context: CallbackContext) -> None:
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)

Eval-State Overrides (Type Mismatch Danger)

Be careful with

session_input.state

in your evalset. It overrides Python-level initialization:

json

// WRONG - initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }

// CORRECT - matches the Python type (list)
"state": { "feedback_history": [] }

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use

tool_config

with

mode="ANY"

to force tool usage, or switch to a non-thinking model for predictable tool calling.

Common Eval Failure Causes

Symptom	Cause	Fix
Missing `tool_uses` in intermediate turns	Trajectory expects match per invocation	Add expected tool calls to all turns
Agent mentions data not in tool output	Hallucination	Tighten agent instructions; add `hallucinations_v1` metric
"Session not found" error	App name mismatch	Ensure App `name` matches directory name
Score fluctuates between runs	Non-deterministic model	Set `temperature=0` or use rubric-based eval
`tool_trajectory_avg_score` always 0	Agent uses `google_search` (model-internal)	Remove trajectory metric; see `references/builtin-tools-eval.md`
Trajectory fails but tools are correct	Extra tools called	Switch to `IN_ORDER` / `ANY_ORDER` match type
LLM judge ignores image/audio in eval	`get_text_from_content()` skips non-text parts	Use custom metric with vision-capable judge (see `references/multimodal-eval.md` )

Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages using WebFetch:

Evaluation overview:

https://google.github.io/adk-docs/evaluate/index.md

Criteria reference:

https://google.github.io/adk-docs/evaluate/criteria/index.md

User simulation:

https://google.github.io/adk-docs/evaluate/user-sim/index.md

Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"

Check if agent uses

google_search

— if so, see

references/builtin-tools-eval.md

Check if using
```
EXACT
```
match and agent calls extra tools — try
```
IN_ORDER
```
Compare expected
```
tool_uses
```
in evalset with actual agent behavior
Fix mismatch (update evalset or agent instructions)

adk-eval-guide

NPX Install

SKILL.md Content

ADK Evaluation Guide

Reference Files

The Eval-Fix Loop

How to iterate

What to fix when scores fail

Choosing the Right Criteria

Running Evaluations

Configuration Schema (
`test_config.json`
)

Full example

EvalSet Schema (
`evalset.json`
)

Common Gotchas

The Proactivity Trajectory Gap

Multi-turn conversations require tool_uses for ALL turns

App name must match directory name

The
`before_agent_callback`
Pattern (State Initialization)

Eval-State Overrides (Type Mismatch Danger)

Model thinking mode may bypass tools

Common Eval Failure Causes

Deep Dive: ADK Docs

Debugging Example

adk-eval-guide

NPX Install

SKILL.md Content

ADK Evaluation Guide

Reference Files

The Eval-Fix Loop

How to iterate

What to fix when scores fail

Choosing the Right Criteria

Running Evaluations

Configuration Schema (test_config.json)

Full example

EvalSet Schema (evalset.json)

Common Gotchas

The Proactivity Trajectory Gap

Multi-turn conversations require tool_uses for ALL turns

App name must match directory name

The before_agent_callback Pattern (State Initialization)

Eval-State Overrides (Type Mismatch Danger)

Model thinking mode may bypass tools

Common Eval Failure Causes

Deep Dive: ADK Docs

Debugging Example

Configuration Schema (
`test_config.json`
)

EvalSet Schema (
`evalset.json`
)

The
`before_agent_callback`
Pattern (State Initialization)