google-agents-cli-eval


ADK Evaluation Guide

Requires: `agents-cli` (`uv tool install google-agents-cli`) — install uv first if needed.

Scaffolded project? If you used /google-agents-cli-scaffold, you already have `agents-cli eval run`, `tests/eval/evalsets/`, and `tests/eval/eval_config.json`. Start with `agents-cli eval run` and iterate from there.

Reference Files

| File | Contents |
|------|----------|
| `references/criteria-guide.md` | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools — trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |

The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

How to iterate

  1. Start small: Begin with 1-2 eval cases, not the full suite
  2. Run eval: `agents-cli eval run`
  3. Read the scores — identify what failed and why
  4. Fix the code — adjust prompts, tool logic, instructions, or the evalset
  5. Rerun eval — verify the fix worked
  6. Repeat steps 3-5 until the case passes
  7. Only then add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal — each iteration makes the agent better.

Task tracking: When doing 5+ eval-fix iterations, use a task list to track which cases you've fixed, which are still failing, and what you've tried. This prevents re-attempting the same fix or losing track of regressions across iterations.
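
The task list can be as simple as a dict keyed by eval case. This sketch is illustrative (the case names and the `record_attempt` helper are hypothetical, not part of agents-cli):

```python
# Illustrative eval-fix iteration tracker (not part of agents-cli).
cases = {
    "search_test": {"status": "failing", "attempts": []},
    "booking_test": {"status": "passing", "attempts": []},
}

def record_attempt(case_id: str, fix: str, passed: bool) -> None:
    """Log a fix attempt and update the case status; refuse repeats."""
    case = cases[case_id]
    if fix in case["attempts"]:
        raise ValueError(f"Already tried {fix!r} on {case_id}; try a different fix")
    case["attempts"].append(fix)
    case["status"] = "passing" if passed else "failing"

record_attempt("search_test", "tighten tool-ordering instruction", passed=False)
record_attempt("search_test", "switch to IN_ORDER match", passed=True)
print([c for c, v in cases.items() if v["status"] == "failing"])  # cases still to fix
```

Keeping the attempts list per case is what prevents silently re-trying a fix that already failed.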

Shortcuts That Waste Time

Recognize these rationalizations and push back — they always cost more time than they save:

| Shortcut | Why it fails |
|----------|--------------|
| "I'll tune the eval thresholds down to make it pass" | Lowering thresholds hides real failures. If the agent can't meet the bar, fix the agent — don't move the bar. |
| "This eval case is flaky, I'll skip it" | Flaky evals reveal non-determinism in your agent. Fix with `temperature=0`, rubric-based metrics, or more specific instructions — don't delete the signal. |
| "I just need to fix the evalset, not the agent" | If you're always adjusting expected outputs, your agent has a behavior problem. Fix the instructions or tool logic first. |

What to fix when scores fail

| Failure | What to change |
|---------|----------------|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset `tool_uses`, or switch to `IN_ORDER`/`ANY_ORDER` match type |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| `rubric_based` score low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER`/`ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1` |

Choosing the Right Criteria

| Goal | Recommended Metric |
|------|--------------------|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + `hallucinations_v1`/`safety_v1` (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | `tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md`) |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.
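
For the regression-testing row, a minimal deterministic `eval_config.json` might look like this (a sketch built only from criteria named above, using the plain threshold-shorthand form):

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```

Both metrics here are computed without an LLM judge, which keeps CI runs fast and reproducible.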

Running Evaluations

Scaffolded projects — agents-cli:

```bash
agents-cli eval run --evalset tests/eval/evalsets/my_evalset.json
```

With explicit config file:

```bash
agents-cli eval run --evalset tests/eval/evalsets/my_evalset.json --config tests/eval/eval_config.json
```

Run all evalsets in `tests/eval/evalsets/`:

```bash
agents-cli eval run --all
```

**`agents-cli eval run` options:** `--evalset PATH`, `--config PATH`, `--all`

**Compare two result files:**
```bash
agents-cli eval compare baseline.json candidate.json
```

Configuration Schema (`eval_config.json`)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

Full example

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-flash-latest",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}
```
Simple threshold shorthand is also valid: `"response_match_score": 0.8`

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.
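
Because both spellings map to the same fields, a config written in camelCase is equivalent to its snake_case form. A toy normalizer shows the mapping (a sketch of the idea, not the actual Pydantic alias machinery):

```python
import re

def to_snake(name: str) -> str:
    """Convert a camelCase key to snake_case, e.g. judgeModelOptions -> judge_model_options."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def normalize(config):
    """Recursively rewrite camelCase keys to snake_case in a config dict."""
    if isinstance(config, dict):
        return {to_snake(k): normalize(v) for k, v in config.items()}
    if isinstance(config, list):
        return [normalize(v) for v in config]
    return config

camel = {"criteria": {"toolTrajectoryAvgScore": {"threshold": 1.0, "matchType": "IN_ORDER"}}}
# normalize(camel) yields the snake_case form used throughout this guide
```

Keys already in snake_case pass through unchanged, so mixed configs also normalize cleanly.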

EvalSet Schema (`evalset.json`)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}
```
Key fields:

  • `intermediate_data.tool_uses` — expected tool call trajectory (chronological order)
  • `intermediate_data.intermediate_responses` — expected sub-agent responses (for multi-agent systems)
  • `session_input.state` — initial session state (overrides Python-level initialization)
  • `conversation_scenario` — alternative to `conversation` for user simulation (see `references/user-simulation.md`)
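
A quick structural check of an evalset case can catch missing fields before a run. This sketch validates only the fields listed above; it is an illustration, not the ADK loader:

```python
def check_eval_case(case: dict) -> list[str]:
    """Return a list of structural problems in one eval case (illustrative, not the ADK validator)."""
    problems = []
    if "eval_id" not in case:
        problems.append("missing eval_id")
    for turn in case.get("conversation", []):
        tid = turn.get("invocation_id", "?")
        if "user_content" not in turn:
            problems.append(f"{tid}: missing user_content")
        # Every turn needs expected tool_uses, or tool_trajectory_avg_score fails on it
        if "tool_uses" not in turn.get("intermediate_data", {}):
            problems.append(f"{tid}: missing intermediate_data.tool_uses")
    return problems

case = {
    "eval_id": "search_test",
    "conversation": [
        {"invocation_id": "inv_1", "user_content": {"parts": [{"text": "Find a flight"}]}}
    ],
}
print(check_eval_case(case))  # -> ['inv_1: missing intermediate_data.tool_uses']
```

Running a check like this per case before `agents-cli eval run` turns schema mistakes into immediate, readable errors.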

Common Gotchas

The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., `google_search` after `save_preferences`). This causes `tool_trajectory_avg_score` failures with `EXACT` match. Solutions:

  1. Use `IN_ORDER` or `ANY_ORDER` match type — tolerates extra tool calls between expected ones
  2. Include ALL tools the agent might call in your expected trajectory
  3. Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
  4. Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."
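
The difference between `EXACT` and `IN_ORDER` matching can be illustrated with a toy subsequence check (a conceptual sketch, not ADK's implementation):

```python
def in_order_match(expected: list[str], actual: list[str]) -> bool:
    """True if expected tool names appear in actual in the same order,
    with extra calls allowed in between (toy model of IN_ORDER matching)."""
    it = iter(actual)
    # 'name in it' advances the iterator, so order is enforced across checks
    return all(name in it for name in expected)

expected = ["save_preferences"]
actual = ["save_preferences", "google_search"]   # proactive extra call
assert in_order_match(expected, actual)          # IN_ORDER-style: passes
assert actual != expected                        # EXACT-style comparison would fail
```

This is why switching match type, rather than deleting the eval case, is usually the right response to proactive extras.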

Multi-turn conversations require tool_uses for ALL turns

The `tool_trajectory_avg_score` metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.
```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}
```
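
The per-invocation averaging can be sketched as a toy model (illustrative only, not ADK's actual scorer) to show why an undeclared turn drags the score down even when the agent behaved correctly:

```python
def toy_trajectory_avg(conversation: list[dict], actual_calls: list[list[str]]) -> float:
    """Toy per-invocation trajectory score: each turn scores 1.0 only if the
    expected tool names equal the actual calls (illustrative, not the ADK metric)."""
    scores = []
    for turn, actual in zip(conversation, actual_calls):
        expected = [t["name"] for t in turn.get("intermediate_data", {}).get("tool_uses", [])]
        scores.append(1.0 if expected == actual else 0.0)
    return sum(scores) / len(scores)

conversation = [
    {"intermediate_data": {"tool_uses": [{"name": "search_flights"}]}},
    {},  # intermediate turn with no expected tool_uses declared
]
actual = [["search_flights"], ["book_flight"]]  # agent behaved correctly in both turns
print(toy_trajectory_avg(conversation, actual))  # 0.5, because turn 2 expected no calls
```

The second turn expected an empty trajectory, so the correct `book_flight` call counts against it.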

App name must match directory name

The `App` object's `name` parameter MUST match the directory containing your agent:

```python
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```

The `before_agent_callback` Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents `KeyError` crashes on the first turn:

```python
from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext

async def initialize_state(callback_context: CallbackContext) -> None:
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```

Eval-State Overrides (Type Mismatch Danger)

Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

WRONG — initializes feedback_history as a string, breaks `.append()`:

```json
"state": { "feedback_history": "" }
```

CORRECT — matches the Python type (list):

```json
"state": { "feedback_history": [] }
```
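
The failure mode is easy to reproduce in plain Python (a toy reproduction with no ADK involved):

```python
# Toy reproduction of the eval-state type mismatch (no ADK involved).
state_from_evalset = {"feedback_history": ""}   # WRONG: string overrides the list
state_from_python = {"feedback_history": []}    # CORRECT: matches what agent code expects

try:
    state_from_evalset["feedback_history"].append("great flight!")  # agent code expects a list
except AttributeError as e:
    print(f"crash on first append: {e}")  # 'str' object has no attribute 'append'

state_from_python["feedback_history"].append("great flight!")       # works
```

Because the evalset state wins over Python-level initialization, the mismatch only surfaces during eval runs, which makes it easy to misdiagnose as an agent bug.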

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.

Common Eval Failure Causes

| Symptom | Cause | Fix |
|---------|-------|-----|
| Missing `tool_uses` in intermediate turns | Trajectory expects match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add `hallucinations_v1` metric |
| "Session not found" error | App name mismatch | Ensure App `name` matches directory name |
| Score fluctuates between runs | Non-deterministic model | Set `temperature=0` or use rubric-based eval |
| `tool_trajectory_avg_score` always 0 | Agent uses `google_search` (model-internal) | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER`/`ANY_ORDER` match type |
| LLM judge ignores image/audio in eval | `get_text_from_content()` skips non-text parts | Use custom metric with vision-capable judge (see `references/multimodal-eval.md`) |

Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages:

  • Evaluation overview: https://adk.dev/evaluate/index.md
  • Criteria reference: https://adk.dev/evaluate/criteria/index.md
  • User simulation: https://adk.dev/evaluate/user-sim/index.md

Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"

  1. Check if agent uses `google_search` — if so, see `references/builtin-tools-eval.md`
  2. Check if using `EXACT` match and agent calls extra tools — try `IN_ORDER`
  3. Compare expected `tool_uses` in evalset with actual agent behavior
  4. Fix mismatch (update evalset or agent instructions)
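
Step 3 can be mechanized with a small comparison helper. This is an illustrative debugging aid, not part of agents-cli; it assumes both trajectories are lists of `{name, args}` dicts as in the evalset schema:

```python
def diff_trajectories(expected: list[dict], actual: list[dict]) -> list[str]:
    """Report position-by-position mismatches between expected and actual tool calls."""
    report = []
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp["name"] != act["name"]:
            report.append(f"turn {i}: expected {exp['name']}, agent called {act['name']}")
        elif exp.get("args") != act.get("args"):
            report.append(f"turn {i}: {exp['name']} args differ: {exp.get('args')} vs {act.get('args')}")
    if len(actual) > len(expected):
        report.append(f"{len(actual) - len(expected)} extra call(s): consider IN_ORDER/ANY_ORDER")
    elif len(expected) > len(actual):
        report.append(f"{len(expected) - len(actual)} expected call(s) never made")
    return report

expected = [{"name": "search_flights", "args": {"destination": "NYC"}}]
actual = [{"name": "search_flights", "args": {"destination": "NYC"}},
          {"name": "google_search", "args": {"query": "NYC flights"}}]
print(diff_trajectories(expected, actual))  # -> ['1 extra call(s): consider IN_ORDER/ANY_ORDER']
```

A readable diff like this makes it obvious whether to update the evalset, the agent instructions, or the match type.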

Proving Your Work

Don't assert that eval passes — show the evidence. Concrete output prevents false confidence and catches issues early.

  • After running eval: Paste the scores table output so the user can see exactly what passed and failed.
  • After fixing a failure: Show before/after scores for the specific case you fixed, and confirm no other cases regressed.
  • Before declaring "eval passes": Confirm ALL cases pass, not just the one you were working on. Run `agents-cli eval run` (or `agents-cli eval run --all`) one final time.
  • Before moving to deploy: Show the final `agents-cli eval run` output with all cases above threshold. This is the gate — no exceptions.

Related Skills

  • /google-agents-cli-workflow — Development workflow and the spec-driven build-evaluate-deploy lifecycle
  • /google-agents-cli-adk-code — ADK Python API quick reference for writing agent code
  • /google-agents-cli-scaffold — Project creation and enhancement with `agents-cli scaffold create`/`scaffold enhance`
  • /google-agents-cli-deploy — Deployment targets, CI/CD pipelines, and production workflows
  • /google-agents-cli-observability — Cloud Trace, logging, and monitoring for debugging agent behavior