adk-eval-guide

MUST READ before running any ADK evaluation. ADK evaluation methodology — eval metrics, evalset schema, LLM-as-judge, tool trajectory scoring, and common failure causes. Use when evaluating agent quality, running `adk eval`, or debugging eval results. Do NOT use for API code patterns (use adk-cheatsheet), deployment (use adk-deploy-guide), or project scaffolding (use adk-scaffold).

Source: eliasecchig/adk-docs

Install:

```bash
npx skill4agent add eliasecchig/adk-docs adk-eval-guide
```
ADK Evaluation Guide
Scaffolded project? If you used `adk-scaffold`, you already have `make eval` and `tests/eval/evalsets/`. Start with `tests/eval/eval_config.json` and iterate from there with `make eval`. Non-scaffolded? Use `adk eval` directly — see Running Evaluations below.
Reference Files
| File | Contents |
|---|---|
| `references/criteria-guide.md` | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools — trajectory behavior, metric compatibility |
| | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |
The Eval-Fix Loop
Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.
How to iterate
1. Start small: begin with 1-2 eval cases, not the full suite
2. Run eval: `make eval` (or `adk eval` directly if no Makefile)
3. Read the scores — identify what failed and why
4. Fix the code — adjust prompts, tool logic, instructions, or the evalset
5. Rerun eval — verify the fix worked
6. Repeat steps 3-5 until the case passes
7. Only then add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal — each iteration makes the agent better.
What to fix when scores fail
| Failure | What to change |
|---|---|
| `tool_trajectory_avg_score` below threshold | Fix agent instructions (tool ordering), update evalset |
| `response_match_score` low (lexical, ROUGE-based) | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| Rubric-based metric low | Refine agent instructions to address the specific rubric that failed |
| Hallucinated claims flagged | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER` or `ANY_ORDER` match type |
Choosing the Right Criteria
| Goal | Recommended Metric |
|---|---|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + compatible metrics (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | Custom metric with a vision-capable judge |
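For example, a deterministic CI-oriented config could combine the two fast metrics (a sketch — thresholds are illustrative, and the shorthand form for `response_match_score` follows the configuration section below):

```json
{
  "criteria": {
    "tool_trajectory_avg_score": { "threshold": 1.0, "match_type": "IN_ORDER" },
    "response_match_score": 0.8
  }
}
```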
For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.

Running Evaluations
```bash
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
```

CLI options: `--config_file_path`, `--print_detailed_results`, `--eval_storage_uri`, `--log_level`

Eval set management:

```bash
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
```

Configuration Schema (`test_config.json`)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.
Full example

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}
```

Simple threshold shorthand is also valid:

```json
"response_match_score": 0.8
```

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.

EvalSet Schema (`evalset.json`)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}
```

Key fields:
- `intermediate_data.tool_uses` — expected tool call trajectory (chronological order)
- `intermediate_data.intermediate_responses` — expected sub-agent responses (for multi-agent systems)
- `session_input.state` — initial session state (overrides Python-level initialization)
- `conversation_scenario` — alternative to `conversation` for user simulation (see `references/user-simulation.md`)
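Since eval cases are plain JSON, they can be generated programmatically. A minimal stdlib-only sketch (the helper name and field values are illustrative, not part of ADK) that assembles one case matching the schema above:

```python
import json

def make_eval_case(eval_id: str, user_text: str, expected_text: str,
                   tool_uses: list[dict]) -> dict:
    """Build one single-turn eval case dict matching the evalset schema."""
    return {
        "eval_id": eval_id,
        "conversation": [
            {
                "invocation_id": "inv_1",
                "user_content": {"parts": [{"text": user_text}]},
                "final_response": {"role": "model",
                                   "parts": [{"text": expected_text}]},
                "intermediate_data": {"tool_uses": tool_uses},
            }
        ],
        "session_input": {"app_name": "my_app", "user_id": "user_1", "state": {}},
    }

evalset = {
    "eval_set_id": "my_eval_set",
    "eval_cases": [
        make_eval_case(
            "search_test",
            "Find a flight to NYC",
            "I found a flight for $500. Want to book?",
            [{"name": "search_flights", "args": {"destination": "NYC"}}],
        )
    ],
}
evalset_json = json.dumps(evalset, indent=2)  # ready to write to a .json file
```

This makes it easy to keep evalsets in sync with agent changes, e.g. regenerating expected trajectories from a reviewed transcript.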
Common Gotchas
The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., `google_search` after `save_preferences`). This causes `tool_trajectory_avg_score` failures with `EXACT` match. Solutions:
- Use `IN_ORDER` or `ANY_ORDER` match type — tolerates extra tool calls between expected ones
- Include ALL tools the agent might call in your expected trajectory
- Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
- Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."
Multi-turn conversations require tool_uses for ALL turns

The `tool_trajectory_avg_score` metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}
```

App name must match directory name
The `App` object's `name` parameter MUST match the directory containing your agent:

```python
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```

The `before_agent_callback` Pattern (State Initialization)

Always use a `before_agent_callback` to initialize session state variables used in your instruction template. This prevents `KeyError` crashes on the first turn:

```python
async def initialize_state(callback_context: CallbackContext) -> None:
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```

Eval-State Overrides (Type Mismatch Danger)
Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

```json
// WRONG - initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }

// CORRECT - matches the Python type (list)
"state": { "feedback_history": [] }
```

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.

Common Eval Failure Causes
| Symptom | Cause | Fix |
|---|---|---|
| Multi-turn trajectory score fails | Missing `tool_uses` for some turns; trajectory expects a match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add a hallucination/grounding metric |
| "Session not found" error | App name mismatch | Ensure `App` name matches the agent directory |
| Score fluctuates between runs | Non-deterministic model or judge | Lower generation temperature; increase judge `num_samples` |
| Trajectory fails with built-in tools | Agent uses `google_search` or other model-internal tools | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER` or `ANY_ORDER` match |
| LLM judge ignores image/audio in eval | Built-in metrics are text-only | Use custom metric with vision-capable judge (see the multimodal reference file) |
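The eval-state type mismatch from the gotchas above is easy to reproduce in plain Python (the helper and the `feedback_history` key mirror the earlier example; this is illustrative, not ADK code):

```python
def record_feedback(state: dict, item: str) -> None:
    # Agent-side code that assumes feedback_history is a list.
    state.setdefault("feedback_history", [])  # no-op if the key already exists
    state["feedback_history"].append(item)

good_state = {"feedback_history": []}   # matches the Python type
record_feedback(good_state, "great")
assert good_state["feedback_history"] == ["great"]

bad_state = {"feedback_history": ""}    # string seeded by an evalset override
try:
    record_feedback(bad_state, "great")
    crashed = False
except AttributeError:                   # str has no .append
    crashed = True
assert crashed
```

The `setdefault` does not help: the evalset override already set the key, so the string survives and the first `.append` call crashes.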
Deep Dive: ADK Docs
For the official evaluation documentation, fetch these pages using WebFetch:
- Evaluation overview: https://google.github.io/adk-docs/evaluate/index.md
- Criteria reference: https://google.github.io/adk-docs/evaluate/criteria/index.md
- User simulation: https://google.github.io/adk-docs/evaluate/user-sim/index.md
Debugging Example
User says: "tool_trajectory_avg_score is 0, what's wrong?"
1. Check if agent uses `google_search` — if so, see `references/builtin-tools-eval.md`
2. Check if using `EXACT` match and agent calls extra tools — try `IN_ORDER`
3. Compare expected `tool_uses` in evalset with actual agent behavior
4. Fix mismatch (update evalset or agent instructions)