google-agents-cli-eval

ADK Evaluation Guide
Scaffolded project? If you used `/google-agents-cli-scaffold`, you already have `agents-cli eval run` and `tests/eval/evalsets/`. Start with `tests/eval/eval_config.json` and iterate from there.
Reference Files
| File | Contents |
|---|---|
| `references/criteria-guide.md` | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools — trajectory behavior, metric compatibility |
|  | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |
The Eval-Fix Loop
Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.
How to iterate
- Start small: Begin with 1-2 eval cases, not the full suite
- Run eval: `agents-cli eval run`
- Read the scores — identify what failed and why
- Fix the code — adjust prompts, tool logic, instructions, or the evalset
- Rerun eval — verify the fix worked
- Repeat steps 3-5 until the case passes
- Only then add more eval cases and expand coverage
Expect 5-10+ iterations. This is normal — each iteration makes the agent better.
Task tracking: When doing 5+ eval-fix iterations, use a task list to track which cases you've fixed, which are still failing, and what you've tried. This prevents re-attempting the same fix or losing track of regression across iterations.
Shortcuts That Waste Time
Recognize these rationalizations and push back — they always cost more time than they save:
| Shortcut | Why it fails |
|---|---|
| "I'll tune the eval thresholds down to make it pass" | Lowering thresholds hides real failures. If the agent can't meet the bar, fix the agent — don't move the bar. |
| "This eval case is flaky, I'll skip it" | Flaky evals reveal non-determinism in your agent. Fix with |
| "I just need to fix the evalset, not the agent" | If you're always adjusting expected outputs, your agent has a behavior problem. Fix the instructions or tool logic first. |
What to fix when scores fail
| Failure | What to change |
|---|---|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| Rubric-based metrics low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER` match type, or include the extra tools in the expected trajectory |
Choosing the Right Criteria
| Goal | Recommended Metric |
|---|---|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + compatible metrics (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | Custom metric with a vision-capable judge |
For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.
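As an illustration of the first row, a CI-oriented config might combine the two fast, deterministic metrics. A minimal sketch (threshold values are examples, not recommendations):

```python
import json

# Illustrative eval_config for regression testing / CI: deterministic
# trajectory matching plus fast lexical response matching.
ci_config = {
    "criteria": {
        "tool_trajectory_avg_score": {
            "threshold": 1.0,
            "match_type": "IN_ORDER",
        },
        # Shorthand form: a bare number is treated as the threshold.
        "response_match_score": 0.8,
    }
}

print(json.dumps(ci_config, indent=2))
```

Save the dumped JSON as `tests/eval/eval_config.json` to use it with `agents-cli eval run`.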
Running Evaluations
Scaffolded projects — agents-cli:

```bash
agents-cli eval run --evalset tests/eval/evalsets/my_evalset.json
```

With explicit config file:

```bash
agents-cli eval run --evalset tests/eval/evalsets/my_evalset.json --config tests/eval/eval_config.json
```

Run all evalsets in tests/eval/evalsets/:

```bash
agents-cli eval run --all
```

**`agents-cli eval run` options:** `--evalset PATH`, `--config PATH`, `--all`

**Compare two result files:**

```bash
agents-cli eval compare baseline.json candidate.json
```

Configuration Schema (eval_config.json)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.
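Because both key styles are accepted, two configs that differ only in casing describe the same criteria. A sketch of the normalization idea (an illustration, not ADK's actual parsing code):

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a camelCase key to snake_case (e.g. matchType -> match_type)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def normalize_keys(obj):
    """Recursively rewrite dict keys so camelCase and snake_case configs compare equal."""
    if isinstance(obj, dict):
        return {to_snake_case(k): normalize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(v) for v in obj]
    return obj

camel = {"criteria": {"toolTrajectoryAvgScore": {"threshold": 1.0, "matchType": "IN_ORDER"}}}
snake = {"criteria": {"tool_trajectory_avg_score": {"threshold": 1.0, "match_type": "IN_ORDER"}}}
assert normalize_keys(camel) == normalize_keys(snake)
```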
Full example
```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-flash-latest",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}
```

Simple threshold shorthand is also valid:

```json
"response_match_score": 0.8
```

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.
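The shorthand and full forms can be normalized to one shape before comparing or tooling over configs. A sketch of just that one rule (not ADK's full parser):

```python
def expand_criteria(criteria: dict) -> dict:
    """Expand bare-number shorthand into the full {'threshold': ...} form,
    leaving already-expanded entries untouched."""
    expanded = {}
    for name, value in criteria.items():
        if isinstance(value, (int, float)):
            expanded[name] = {"threshold": float(value)}
        else:
            expanded[name] = value
    return expanded

criteria = {
    "response_match_score": 0.8,  # shorthand
    "tool_trajectory_avg_score": {"threshold": 1.0, "match_type": "IN_ORDER"},  # full form
}
print(expand_criteria(criteria))
```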
EvalSet Schema (evalset.json)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}
```

Key fields:

- `intermediate_data.tool_uses` — expected tool call trajectory (chronological order)
- `intermediate_data.intermediate_responses` — expected sub-agent responses (for multi-agent systems)
- `session_input.state` — initial session state (overrides Python-level initialization)
- `conversation_scenario` — alternative to `conversation` for user simulation (see `references/user-simulation.md`)
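When an evalset grows past a few cases, generating it programmatically keeps the schema consistent. A sketch using the field names above (the `make_eval_case` helper is hypothetical, not part of ADK):

```python
import json

def make_eval_case(eval_id, user_text, expected_response, tool_uses=None):
    """Build one single-turn eval case matching the evalset schema above."""
    return {
        "eval_id": eval_id,
        "conversation": [{
            "invocation_id": "inv_1",
            "user_content": {"parts": [{"text": user_text}]},
            "final_response": {"role": "model", "parts": [{"text": expected_response}]},
            "intermediate_data": {"tool_uses": tool_uses or []},
        }],
        "session_input": {"app_name": "my_app", "user_id": "user_1", "state": {}},
    }

evalset = {
    "eval_set_id": "my_eval_set",
    "name": "My Eval Set",
    "description": "Tests core capabilities",
    "eval_cases": [
        make_eval_case(
            "search_test",
            "Find a flight to NYC",
            "I found a flight for $500. Want to book?",
            tool_uses=[{"name": "search_flights", "args": {"destination": "NYC"}}],
        )
    ],
}
print(json.dumps(evalset, indent=2))
```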
Common Gotchas
The Proactivity Trajectory Gap
LLMs often perform extra actions not asked for (e.g., `google_search` after `save_preferences`). This causes failures with `tool_trajectory_avg_score` `EXACT` match. Solutions:

- Use `IN_ORDER` or `ANY_ORDER` match type — tolerates extra tool calls between expected ones
- Include ALL tools the agent might call in your expected trajectory
- Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
- Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."
Multi-turn conversations require tool_uses for ALL turns
The `tool_trajectory_avg_score` metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}
```
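The "avg" in the metric name is why intermediate turns matter: each invocation is scored separately and the per-turn scores are averaged, so a turn with no expected `tool_uses` drags the mean down. A sketch of that scoring shape (illustrative, not ADK's implementation):

```python
def trajectory_avg(expected_by_turn, actual_by_turn):
    """Average per-invocation trajectory score: each turn is compared
    independently, then the per-turn scores are averaged."""
    scores = []
    for expected, actual in zip(expected_by_turn, actual_by_turn):
        scores.append(1.0 if expected == actual else 0.0)
    return sum(scores) / len(scores)

# Turn 1 omits expected tool_uses even though the agent called the right tool:
expected_by_turn = [[], [{"name": "book_flight", "args": {"flight_id": "1"}}]]
actual_by_turn = [
    [{"name": "search_flights", "args": {"origin": "NYC", "destination": "LON"}}],
    [{"name": "book_flight", "args": {"flight_id": "1"}}],
]
print(trajectory_avg(expected_by_turn, actual_by_turn))  # 0.5, not 1.0
```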
App name must match directory name
The `App` object's `name` parameter MUST match the directory containing your agent:

```python
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```

The before_agent_callback Pattern (State Initialization)
Always use a `before_agent_callback` to initialize session state variables used in your instruction template. This prevents `KeyError` crashes on the first turn:

```python
async def initialize_state(callback_context: CallbackContext) -> None:
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```
Eval-State Overrides (Type Mismatch Danger)
Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

WRONG — initializes feedback_history as a string, breaks `.append()`:

```json
"state": { "feedback_history": "" }
```

CORRECT — matches the Python type (list):

```json
"state": { "feedback_history": [] }
```
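A quick way to catch this class of bug before running eval is to compare the types in `session_input.state` against the Python-side defaults. A hypothetical check (not part of ADK or agents-cli):

```python
def check_state_types(python_defaults: dict, eval_state: dict) -> list[str]:
    """Return the keys whose evalset override has a different type
    than the corresponding Python-level default."""
    mismatches = []
    for key, override in eval_state.items():
        default = python_defaults.get(key)
        if default is not None and type(override) is not type(default):
            mismatches.append(f"{key}: evalset has {type(override).__name__}, "
                              f"code expects {type(default).__name__}")
    return mismatches

python_defaults = {"feedback_history": []}  # initialized as a list in code
bad_eval_state = {"feedback_history": ""}   # evalset overrides with a string
print(check_state_types(python_defaults, bad_eval_state))
```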
Model thinking mode may bypass tools
Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.

Common Eval Failure Causes
| Symptom | Cause | Fix |
|---|---|---|
| Missing `tool_uses` for intermediate turns | Trajectory expects match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add `hallucinations_v1` |
| "Session not found" error | App name mismatch | Ensure App `name` matches the agent directory |
| Score fluctuates between runs | Non-deterministic model | Set |
| Agent uses `google_search` | Model-internal tool; calls don't appear in the trajectory | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER` match |
| LLM judge ignores image/audio in eval | Built-in judges evaluate text only | Use custom metric with vision-capable judge |
Deep Dive: ADK Docs
For the official evaluation documentation, fetch these pages:
- Evaluation overview: https://adk.dev/evaluate/index.md
- Criteria reference: https://adk.dev/evaluate/criteria/index.md
- User simulation: https://adk.dev/evaluate/user-sim/index.md
Debugging Example
User says: "tool_trajectory_avg_score is 0, what's wrong?"
- Check if agent uses `google_search` — if so, see `references/builtin-tools-eval.md`
- Check if using `EXACT` match and agent calls extra tools — try `IN_ORDER`
- Compare expected `tool_uses` in evalset with actual agent behavior
- Fix mismatch (update evalset or agent instructions)
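The third step — comparing expected `tool_uses` with actual behavior — can be mechanized. A sketch of a diff helper (hypothetical, not part of agents-cli):

```python
def diff_tool_uses(expected: list[dict], actual: list[dict]) -> list[str]:
    """Report name/args differences between expected and actual tool calls,
    position by position, plus any trailing extra calls."""
    problems = []
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp["name"] != act["name"]:
            problems.append(f"call {i}: expected {exp['name']}, got {act['name']}")
        elif exp.get("args") != act.get("args"):
            problems.append(f"call {i}: args differ for {exp['name']}: "
                            f"{exp.get('args')} != {act.get('args')}")
    if len(actual) > len(expected):
        extra = [c["name"] for c in actual[len(expected):]]
        problems.append(f"extra calls: {extra} (consider IN_ORDER match)")
    return problems

expected = [{"name": "search_flights", "args": {"destination": "NYC"}}]
actual = [{"name": "search_flights", "args": {"destination": "NYC"}},
          {"name": "google_search", "args": {"query": "NYC flights"}}]
print(diff_tool_uses(expected, actual))
```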
Proving Your Work
Don't assert that eval passes — show the evidence. Concrete output prevents false confidence and catches issues early.
- After running eval: Paste the scores table output so the user can see exactly what passed and failed.
- After fixing a failure: Show before/after scores for the specific case you fixed, and confirm no other cases regressed.
- Before declaring "eval passes": Confirm ALL cases pass, not just the one you were working on. Run `agents-cli eval run` (or `agents-cli eval run --all`) one final time.
- Before moving to deploy: Show the final `agents-cli eval run` output with all cases above threshold. This is the gate — no exceptions.
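Checking for regressions across runs is what `agents-cli eval compare` is for; the idea can be sketched as follows (the result-file shape here — case id mapped to score — is an assumption, not the actual output schema):

```python
def find_regressions(baseline: dict, candidate: dict) -> list[str]:
    """Flag eval cases whose score dropped from baseline to candidate."""
    regressions = []
    for case_id, base_score in baseline.items():
        cand_score = candidate.get(case_id)
        if cand_score is not None and cand_score < base_score:
            regressions.append(f"{case_id}: {base_score} -> {cand_score}")
    return regressions

baseline = {"search_test": 1.0, "booking_test": 0.9}
candidate = {"search_test": 1.0, "booking_test": 0.7}
print(find_regressions(baseline, candidate))  # booking_test regressed
```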
Related Skills
- /google-agents-cli-workflow — Development workflow and the spec-driven build-evaluate-deploy lifecycle
- /google-agents-cli-adk-code — ADK Python API quick reference for writing agent code
- /google-agents-cli-scaffold — Project creation and enhancement with `agents-cli scaffold create` and `scaffold enhance`
- /google-agents-cli-deploy — Deployment targets, CI/CD pipelines, and production workflows
- /google-agents-cli-observability — Cloud Trace, logging, and monitoring for debugging agent behavior