# ADK Evaluation Guide
**Scaffolded project?** If you used `/adk-scaffold`, you already have `make eval` and `tests/eval/evalsets/`. Start with `tests/eval/eval_config.json` and iterate from there with `make eval`.

**Non-scaffolded?** Use `adk eval` directly — see Running Evaluations below.
## Reference Files
| File | Contents |
|---|---|
| `references/criteria-guide.md` | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools — trajectory behavior, metric compatibility |
| | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |
## The Eval-Fix Loop
Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.
### How to iterate
1. **Start small:** Begin with 1-2 eval cases, not the full suite
2. **Run eval:** `make eval` (or `adk eval` directly if no Makefile)
3. **Read the scores** — identify what failed and why
4. **Fix the code** — adjust prompts, tool logic, instructions, or the evalset
5. **Rerun eval** — verify the fix worked
6. **Repeat steps 3-5** until the case passes
7. **Only then** add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal — each iteration makes the agent better.
### What to fix when scores fail
| Failure | What to change |
|---|---|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset |
| `response_match_score` low (lexical) | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| Rubric-based score low | Refine agent instructions to address the specific rubric that failed |
| Hallucination score low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER` or `ANY_ORDER` match type |
## Choosing the Right Criteria
| Goal | Recommended Metric |
|---|---|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score`, `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + compatible metrics (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | Custom evaluator with a vision-capable judge |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.

## Running Evaluations
Scaffolded projects:

```bash
make eval EVALSET=tests/eval/evalsets/my_evalset.json
```

Or directly via ADK CLI:

```bash
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results
```

Run specific eval cases from a set:

```bash
adk eval ./app my_evalset.json:eval_1,eval_2
```

With GCS storage:

```bash
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
```

**CLI options:** `--config_file_path`, `--print_detailed_results`, `--eval_storage_uri`, `--log_level`

**Eval set management:**

```bash
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
```
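For CI scripts, the same invocation can be assembled programmatically. A minimal sketch that only builds the argv list (paths reuse the examples above; the command is not executed here):

```python
def build_adk_eval_argv(app_dir, evalset, config=None, detailed=False):
    """Assemble an `adk eval` command line (construction only, not executed)."""
    argv = ["adk", "eval", app_dir, evalset]
    if config:
        argv.append(f"--config_file_path={config}")
    if detailed:
        argv.append("--print_detailed_results")
    return argv

argv = build_adk_eval_argv("./app", "tests/eval/evalsets/my_evalset.json",
                           config="tests/eval/eval_config.json", detailed=True)
```

In CI you could hand the list to `subprocess.run(argv, check=True)` so a failing eval fails the build.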
## Configuration Schema (`test_config.json`)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.
### Full example
```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}
```

Simple threshold shorthand is also valid: `"response_match_score": 0.8`

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.
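Because the config is plain JSON, it can also be generated from a setup script. A minimal sketch (structure only, reusing the criteria shown above):

```python
import json

# Build the criteria dict in Python, then serialize it to test_config.json
config = {
    "criteria": {
        "tool_trajectory_avg_score": {"threshold": 1.0, "match_type": "IN_ORDER"},
        "response_match_score": 0.8,  # shorthand threshold form
    }
}
serialized = json.dumps(config, indent=2)
```

Writing `serialized` to disk with `pathlib.Path("test_config.json").write_text(serialized)` keeps the config and the code that produces it in one place.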
## EvalSet Schema (`evalset.json`)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}
```

Key fields:

- `intermediate_data.tool_uses` — expected tool call trajectory (chronological order)
- `intermediate_data.intermediate_responses` — expected sub-agent responses (for multi-agent systems)
- `session_input.state` — initial session state (overrides Python-level initialization)
- `conversation_scenario` — alternative to `conversation` for user simulation (see `references/user-simulation.md`)
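A lightweight pre-flight check can catch missing fields before a run. A minimal sketch over the fields above (an illustration, not the official schema validation):

```python
def check_evalset(es):
    """Return a list of problems found in an evalset dict (empty = looks OK)."""
    problems = []
    for key in ("eval_set_id", "eval_cases"):
        if key not in es:
            problems.append(f"missing {key}")
    for case in es.get("eval_cases", []):
        if "eval_id" not in case:
            problems.append("case missing eval_id")
        # each case needs either a fixed conversation or a simulation scenario
        if "conversation" not in case and "conversation_scenario" not in case:
            cid = case.get("eval_id", "?")
            problems.append(f"{cid}: needs conversation or conversation_scenario")
    return problems

ok = {"eval_set_id": "s", "eval_cases": [{"eval_id": "e1", "conversation": []}]}
assert check_evalset(ok) == []
```

Run this over `json.load(open(path))` for each file in `tests/eval/evalsets/` as a cheap unit test.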
## Common Gotchas
### The Proactivity Trajectory Gap
LLMs often perform extra actions not asked for (e.g., calling `google_search` after `save_preferences`). This causes failures with `tool_trajectory_avg_score` in `EXACT` match mode. Solutions:

- Use `IN_ORDER` or `ANY_ORDER` match type — tolerates extra tool calls between expected ones
- Include ALL tools the agent might call in your expected trajectory
- Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
- Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."
### Multi-turn conversations require tool_uses for ALL turns

`tool_trajectory_avg_score` evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}
```
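The averaging behavior can be illustrated with a toy score function (illustrative only, not ADK's exact scoring):

```python
def avg_trajectory_score(turns):
    # One score per invocation, averaged across the conversation. A turn whose
    # expected tool calls don't match what was observed contributes 0.
    scores = [1.0 if expected == actual else 0.0 for expected, actual in turns]
    return sum(scores) / len(scores)

# Turn 2's expected tool_uses were omitted from the evalset, so the agent's
# correct book_flight call still scores 0 and drags the average to 0.5
turns = [
    (["search_flights"], ["search_flights"]),
    ([], ["book_flight"]),
]
assert avg_trajectory_score(turns) == 0.5
```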
### App name must match directory name
The `App` object's `name` parameter MUST match the directory containing your agent:

```python
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```
undefinedThe before_agent_callback
Pattern (State Initialization)
before_agent_callbackbefore_agent_callback
模式(状态初始化)
before_agent_callbackAlways use a callback to initialize session state variables used in your instruction template. This prevents crashes on the first turn:
KeyErrorpython
```python
async def initialize_state(callback_context: CallbackContext) -> None:
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```
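The first-turn crash is analogous to formatting a template against an incomplete state dict. A plain-Python illustration (not ADK internals):

```python
template = "Based on preferences: {user_preferences}..."
state = {}

# Without the callback, the template references a key that doesn't exist yet
try:
    template.format(**state)
    crashed = False
except KeyError:
    crashed = True
assert crashed

state.setdefault("user_preferences", {})  # what the callback guarantees
assert template.format(**state) == "Based on preferences: {}..."
```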
### Eval-State Overrides (Type Mismatch Danger)
Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

```json
// WRONG - initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }

// CORRECT - matches the Python type (list)
"state": { "feedback_history": [] }
```
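The mismatch is easy to reproduce in plain Python:

```python
# Evalset override made feedback_history a str; tool code expecting a list breaks
state = {"feedback_history": ""}
try:
    state["feedback_history"].append("loved it")
    broke = False
except AttributeError:  # str has no .append
    broke = True
assert broke

# Matching the Python-level type works
state = {"feedback_history": []}
state["feedback_history"].append("loved it")
assert state["feedback_history"] == ["loved it"]
```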
### Model thinking mode may bypass tools
Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.

## Common Eval Failure Causes
| Symptom | Cause | Fix |
|---|---|---|
| Trajectory score 0 on multi-turn cases | Missing `tool_uses` for intermediate turns; trajectory expects a match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions to stay grounded in tool output |
| "Session not found" error | App name mismatch | Ensure `App` `name` matches the agent directory |
| Score fluctuates between runs | Non-deterministic model | Increase `num_samples` in `judge_model_options`; pin model settings |
| Agent uses `google_search` or other built-in tools | Model-internal tools behave differently in trajectories | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER` or `ANY_ORDER` |
| LLM judge ignores image/audio in eval | Built-in metrics are text-only | Use custom metric with vision-capable judge (see Reference Files) |
## Deep Dive: ADK Docs
For the official evaluation documentation, fetch these pages using WebFetch:

- Evaluation overview: https://google.github.io/adk-docs/evaluate/index.md
- Criteria reference: https://google.github.io/adk-docs/evaluate/criteria/index.md
- User simulation: https://google.github.io/adk-docs/evaluate/user-sim/index.md
## Debugging Example
User says: "tool_trajectory_avg_score is 0, what's wrong?"

- Check if the agent uses `google_search` — if so, see `references/builtin-tools-eval.md`
- Check if using `EXACT` match and the agent calls extra tools — try `IN_ORDER`
- Compare expected `tool_uses` in the evalset with actual agent behavior
- Fix the mismatch (update the evalset or agent instructions)
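The expected-vs-actual comparison can be mechanized with a small diff helper (a hypothetical sketch that compares tool names only and ignores args):

```python
def diff_tool_uses(expected, actual):
    """Return (missing, extra) tool names between evalset and observed calls."""
    missing = [t for t in expected if t not in actual]
    extra = [t for t in actual if t not in expected]
    return missing, extra

expected = ["search_flights", "book_flight"]                  # from the evalset
actual = ["google_search", "search_flights", "book_flight"]   # observed run
missing, extra = diff_tool_uses(expected, actual)
assert missing == []
assert extra == ["google_search"]  # points straight at the proactivity gap
```

A non-empty `extra` list suggests switching to `IN_ORDER`; a non-empty `missing` list suggests the evalset or agent instructions are out of sync.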