agent-platform-eval-flywheel

Agent Platform Eval Flywheel Skill

Help users evaluate and iteratively improve GenAI models and agents using the Agent Platform GenAI Evaluation SDK (`google.genai`, `agentplatform`).

When to use this skill

Evaluating GenAI agents or models with the Agent Platform GenAI Evaluation SDK (`client.evals.evaluate()`).
Creating evaluation datasets from session traces, pandas DataFrames, or synthetic generation.
Selecting, configuring, or writing custom evaluation metrics.
Analyzing rubric verdicts, loss patterns, and clustering failures.
Suggesting concrete code/prompt improvements based on eval results.

Setup

Install the SDK:

bash

pip install "google-cloud-aiplatform[evaluation]>=1.154.0" "google-genai>=1.0.0"

Both `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` must be set. Check the environment variables first; if either is missing, ask the user. Newer Gemini models often need `location="global"`.
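
The environment check described above can be sketched with the standard library alone. The helper name `resolve_cloud_settings` and the placeholder project value are assumptions made for illustration; they are not part of the SDK.

```python
import os

REQUIRED_VARS = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")


def resolve_cloud_settings(env=None):
    """Return (settings, missing) for the two required variables.

    `env` defaults to os.environ; pass a dict to exercise the logic
    without touching the real environment.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    settings = {name: env[name] for name in REQUIRED_VARS if env.get(name)}
    missing = [name for name in REQUIRED_VARS if name not in settings]
    return settings, missing


# Placeholder value, used only to demonstrate the missing-variable path.
settings, missing = resolve_cloud_settings({"GOOGLE_CLOUD_PROJECT": "my-project"})
if missing:
    # This is the point where the user should be asked rather than guessed for.
    print("Ask the user for:", ", ".join(missing))
```

Returning the missing names instead of raising keeps the decision (ask the user, or fall back to a default such as `global`) with the caller.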

The Quality Flywheel

Five stages, run in order on the first pass, then loop 2 → 5 until quality targets are met.

Shortcuts that waste time

Shortcut	Why it fails
"I'll tune the metric threshold down so it passes."	Hides real failures. Fix the agent, not the bar.
"This case is flaky, I'll skip it."	Flakiness reveals non-determinism in the agent. Fix with `temperature=0` or stricter instructions.
"I just need to fix the eval dataset, not the agent."	If expected outputs keep moving, the agent has a behavior problem.
"I can tell from the trace it works — skip Stage 3."	Self-grading doesn't generalize. Always run `evaluate()` and read scores.
"One iteration is enough."	Expect 5–10+ iterations. Stopping early leaves regressions on other metrics undetected.

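One way to make the flakiness row in the table above concrete is to grade the same dataset several times and flag cases whose verdict flips between runs. This is a minimal sketch on plain dictionaries; the function name, the 0.5 pass bar, and the sample scores are all hypothetical and do not come from the SDK or from a real evaluation.

```python
def find_flaky_cases(runs, threshold=0.5):
    """Return case IDs whose pass/fail verdict differs across runs.

    `runs` is a list of dicts mapping case ID -> score, one dict per
    repeated evaluation of the same dataset. `threshold` is a
    hypothetical pass bar, not an SDK default.
    """
    flaky = []
    for case_id in sorted(runs[0]):
        verdicts = {run[case_id] >= threshold for run in runs if case_id in run}
        if len(verdicts) > 1:  # both a pass and a fail were observed
            flaky.append(case_id)
    return flaky


# Made-up scores for three repeated runs of two cases.
runs = [
    {"case_1": 0.9, "case_2": 0.4},
    {"case_1": 0.8, "case_2": 0.7},
    {"case_1": 0.9, "case_2": 0.3},
]
flaky = find_flaky_cases(runs)
print(flaky)
```

Cases flagged this way are candidates for `temperature=0` or stricter instructions rather than for removal from the dataset.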

1. Prepare Data

Produce an `EvaluationDataset`. There are three input shapes; pick the one that matches the data the user already has:

`EvalCase` list (single-turn or multi-turn):

python

from agentplatform import types
dataset = types.EvaluationDataset(eval_cases=[
    types.EvalCase(prompt="What is 2+2?", response="4", reference="4"),
    # For multi-turn agent traces, set agent_data instead of prompt/response.
])

Multi-turn agent traces wrap each conversation in `AgentData` → `ConversationTurn` → `AgentEvent`. See references/dataset_schema.md for the full type hierarchy.

Pandas DataFrame (tabular sources — CSV, BigQuery, Sheets):

python

import pandas as pd
from agentplatform import types

df = pd.DataFrame({
    "prompt":    ["What is 2+2?", "Capital of France?"],
    "response":  ["4",            "Paris"],
    "reference": ["4",            "Paris"],
})
dataset = types.EvaluationDataset(eval_dataset_df=df)

Column names must match the fields the chosen metrics expect (see references/dataset_schema.md for the per-metric requirements table).

Cold start (no data at all): synthesize scenarios server-side with `client.evals.generate_user_scenarios(...)` and a `UserScenarioGenerationConfig` (`user_scenario_count`, `simulation_instruction`, `environment_data`). Stage 2 plays them out.

For ADK session dumps, use `scripts/parse_adk_traces.py` instead of writing the conversion by hand.
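
The column-name requirement for DataFrame inputs can be checked before any evaluation call. The sketch below works on a plain list of column names (for example `list(df.columns)`). The `REQUIRED_COLUMNS` mapping and its metric names are placeholders invented for illustration; the real per-metric fields are the ones listed in references/dataset_schema.md.

```python
# Hypothetical requirements table for illustration only; the real
# per-metric fields live in references/dataset_schema.md.
REQUIRED_COLUMNS = {
    "example_reference_free_metric": {"prompt", "response"},
    "example_reference_based_metric": {"prompt", "response", "reference"},
}


def missing_columns(columns, metric_names, requirements=REQUIRED_COLUMNS):
    """Map each metric to the sorted list of columns it still needs."""
    present = set(columns)
    gaps = {}
    for name in metric_names:
        missing = sorted(requirements[name] - present)
        if missing:
            gaps[name] = missing
    return gaps


columns = ["prompt", "response"]  # e.g. list(df.columns)
gaps = missing_columns(columns, list(REQUIRED_COLUMNS))
print(gaps)
```

An empty result means every selected metric has the columns it needs; anything else names the exact gap to fix in the dataset.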

2. Run Inference

Populate responses/traces on the dataset. Skip this stage if traces are already complete (e.g., production logs or replay).

python

# Agent eval: pass a callable wrapping the user's ADK Agent/App.
client.evals.run_inference(model=agent_callable, src=dataset)

# Model eval: pass a model ID directly.
client.evals.run_inference(model="gemini-2.5-flash", src=dataset)

# Synthesized scenarios: let the simulator drive.
client.evals.run_inference(
    model=agent_callable,
    src=dataset,
    user_simulator_config=UserSimulatorConfig(max_turn=10),
)

# DataFrame also works as src=; no EvalCase wrapping needed.
client.evals.run_inference(model="gemini-2.5-flash", src=df)

3. Grade (always run)

python

result = client.evals.evaluate(dataset=dataset, metrics=[...])

Pick metrics by what you want to measure. Full catalog in references/metric_registry.md.

Agent metrics (multi-turn, adaptive rubrics) — start here for agent eval.

Goal	Metric
Did the agent achieve the user's goal?	`multi_turn_task_success`
Was the reasoning path logical and efficient?	`multi_turn_trajectory_quality`
Tool/function calling quality across turns	`multi_turn_tool_use_quality`
Overall conversational quality	`multi_turn_general_quality`
Final response quality (no reference needed)	`final_response_quality`
Final response vs. a golden reference	`final_response_match`
Single-turn tool use	`tool_use_quality`

General quality metrics (single-turn, adaptive rubrics) — for model eval.

Goal	Metric
Overall response quality (recommended starting point)	`general_quality`
Linguistic quality (fluency, coherence, grammar)	`text_quality`
Adherence to specific constraints / instructions	`instruction_following`

Static rubric metrics (fixed criteria) — apply alongside the above.

Goal	Metric
Catch hallucinated claims (RAG, factual answers)	`hallucination`
Factuality / consistency against provided context	`grounding`
Safety policy compliance	`safety`

Domain-specific check no built-in covers: write a custom metric.

Predefined: `types.RubricMetric.<NAME>` (server-side AutoRater, no judge model needed).
Custom LLM-as-a-judge: `types.LLMMetric` with `prompt_template`, or `types.MetricPromptBuilder` for structured rubrics.
Custom code: `types.CodeExecutionMetric` with a `custom_function` string containing `def evaluate(instance: dict)` for remote sandboxed execution; or `types.Metric` with `custom_function=<callable>` for local execution.

Always persist the result so Stage 4 and 5 can read it. Save both JSON (machine-readable, diffable) and HTML (human-readable, linkable):

python

import datetime
from pathlib import Path

from agentplatform._genai import _evals_visualization

out_dir = Path("artifacts/grade_results")
out_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

result_json = result.model_dump_json()
(out_dir / f"results_{ts}.json").write_text(result_json)

html = _evals_visualization.get_evaluation_html(result_json)
(out_dir / f"results_{ts}.html").write_text(str(html))

Or after the fact: `scripts/render_html_report.py --type evaluation` or `scripts/inspect_results.py --save-html`.
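
Because the saved files carry a `%Y%m%d_%H%M%S` timestamp, later stages can locate the most recent result by name alone. The following is a minimal sketch assuming the `results_<timestamp>.json` naming used above; `latest_result_json` is a hypothetical helper, and the demo writes empty stand-in files to a temporary directory instead of real results.

```python
import tempfile
from pathlib import Path


def latest_result_json(out_dir):
    """Return the newest results_*.json in out_dir, or None if there is none.

    The %Y%m%d_%H%M%S timestamp sorts lexicographically in chronological
    order, so a plain name sort is enough.
    """
    candidates = sorted(Path(out_dir).glob("results_*.json"))
    return candidates[-1] if candidates else None


# Demo on a throwaway directory with two stand-in result files.
with tempfile.TemporaryDirectory() as tmp:
    for ts in ("20250101_090000", "20250102_083000"):
        (Path(tmp) / f"results_{ts}.json").write_text("{}")
    newest_name = latest_result_json(tmp).name

    empty_dir = Path(tmp) / "empty"
    empty_dir.mkdir()
    nothing = latest_result_json(empty_dir)

print(newest_name)
```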

4. Analyze Failures

Read `summary_metrics` and `eval_case_results`; never fabricate scores. Use `scripts/inspect_results.py --failing-only` to filter to failures.
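
For readers who want the filtering logic without the bundled script, here is a rough sketch over a saved result loaded as a dictionary. It assumes the JSON dump mirrors the attribute names used elsewhere in this document (`eval_case_results`, `response_candidate_results`, `metric_results`, `score`), which is not confirmed here. The 0.5 threshold and the sample dictionary are hypothetical stand-ins, not real evaluation output.

```python
from collections import Counter


def failing_cases(result, threshold=0.5):
    """Yield (case_index, metric_name, score) for every score below threshold."""
    for i, case in enumerate(result.get("eval_case_results", [])):
        for cand in case.get("response_candidate_results", []):
            for name, metric in cand.get("metric_results", {}).items():
                score = metric.get("score")
                if score is not None and score < threshold:
                    yield i, name, score


# Hand-written stand-in for json.loads(result_json); not real eval output.
sample = {
    "eval_case_results": [
        {"response_candidate_results": [{"metric_results": {
            "multi_turn_task_success": {"score": 0.2},
            "final_response_quality": {"score": 0.9},
        }}]},
        {"response_candidate_results": [{"metric_results": {
            "multi_turn_task_success": {"score": 0.8},
            "final_response_quality": {"score": 0.4},
        }}]},
    ]
}

failures = list(failing_cases(sample))
# Per-metric counts help decide when reading traces one by one stops scaling.
failures_per_metric = Counter(name for _, name, _ in failures)
print(failures, dict(failures_per_metric))
```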

For each failed metric, see references/failure_patterns.md for deeper diagnoses. The compact mapping:

Failing metric	What to change
`multi_turn_task_success` low	The agent isn't completing the goal — fix orchestration, missing tool calls, premature termination, wrong tool selection.
`multi_turn_trajectory_quality` low	The agent reaches the goal inefficiently — refine planning prompts, remove redundant tool calls.
`multi_turn_tool_use_quality` low	Fix tool descriptions, parameter docstrings, or agent instructions for tool selection.
`final_response_quality` low	Read auto-generated rubric verdicts; refine instructions to address the worst-scoring criterion.
`final_response_match` low	The agent's final answer doesn't match the golden reference — adjust response format or update the reference.
`hallucination` low	Tighten instructions to stay grounded in tool output; verify the tool actually returned the claimed data.
`grounding` low	The response contradicts the provided context — add explicit "cite only from context" instructions.
`safety` low	Add safety guardrails; review the violating content category in the rubric verdict.
`general_quality` / `text_quality` low	Adjust system instruction wording; the model's default phrasing is too generic for the task.
`instruction_following` low	The agent is ignoring constraints — restate them in the system instruction or use stricter wording.
Agent calls wrong tools	Fix tool descriptions, agent instructions, or `tool_config`.
Agent calls extra tools	Add explicit stop instructions, or switch to `multi_turn_tool_use_quality` to surface the extra calls in the rubric.

For 10+ failures on the same metric, use the Error Analysis service to cluster failures into themes (L1/L2 taxonomy categories) instead of reading every trace:

python

# Only supports multi_turn_task_success and multi_turn_tool_use_quality.
# Service runs in the global region.
analysis_client = agentplatform.Client(project="PROJECT_ID", location="global")
response = analysis_client.evals.generate_loss_clusters(
    eval_result=result,
    metric="multi_turn_task_success",
    config={"max_top_cluster_count": 5},
)
for r in response.results:
    for cluster in r.clusters:
        print(
            f"[{cluster.taxonomy_entry.l1_category}/"
            f"{cluster.taxonomy_entry.l2_category}] "
            f"{cluster.item_count} cases - {cluster.taxonomy_entry.description}"
        )

Save `response.model_dump_json()` and render with `scripts/render_html_report.py --type loss-analysis`.

5. Optimize & Iterate

Apply a fix targeting the failing metric. Re-run Stage 3. Compare with `scripts/compare_results.py --baseline <prev> --candidate <new>` to confirm the target improved AND no other metric regressed.

Track progress across iterations:

Iteration	Metric A	Metric B	Change made
Baseline	0.62	0.55	—
v2	0.78	0.68	Added grounding prompt
v3	0.81	0.72	Fixed tool selection

Expect 5–10+ iterations per failing case. Only after a case passes should you expand coverage with more eval cases.
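
The acceptance rule for an iteration (target improved, nothing else regressed) can be written down in a few lines. This sketch is not the logic of `scripts/compare_results.py`, which is not shown here; `check_iteration` and its `tolerance` parameter are hypothetical. The scores reuse the Baseline and v2 rows of the tracking table above.

```python
def check_iteration(baseline, candidate, target, tolerance=0.0):
    """Return (target_improved, regressions) for two metric -> score dicts.

    A non-target metric counts as regressed when its candidate score drops
    more than `tolerance` below the baseline. `tolerance` is a hypothetical
    knob for absorbing run-to-run noise.
    """
    target_improved = candidate[target] > baseline[target]
    regressions = {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name != target
        and name in candidate
        and candidate[name] < baseline[name] - tolerance
    }
    return target_improved, regressions


# Baseline and v2 rows from the tracking table.
baseline = {"metric_a": 0.62, "metric_b": 0.55}
v2 = {"metric_a": 0.78, "metric_b": 0.68}

improved, regressions = check_iteration(baseline, v2, target="metric_a")
accept = improved and not regressions
print(accept, regressions)
```

Keeping the regressions as a dictionary of (before, after) pairs makes it easy to show the evidence alongside the decision.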

Proving your work

Never claim eval results you didn't read from an actual `result` object.

After running eval, print the `summary_metrics` table (`scripts/inspect_results.py`).
After a fix, show before/after via `scripts/compare_results.py`.
Before declaring success, confirm ALL cases pass, not just the one you were working on.

If you can't produce the evidence (SDK call failed, result truncated, metric unsupported), say so explicitly. Don't paper over gaps.

Rules of Engagement

Always Plan First: Before writing a script, output a `<plan>` block detailing the steps you are about to take.
Step-by-Step Execution: Write the script, execute it, wait for output, then analyze. Don't do everything in one response.
Standard Python: Use standard Python imports (`import agentplatform`, `from google.genai import types`). Don't use internal import paths.
Verify Before Guessing: When unsure about SDK types or metrics, check the SDK source code rather than guessing or hallucinating.

SDK Quick Reference

python

import agentplatform
from agentplatform import types
from google.genai import types as genai_types
import pandas as pd

# Initialize client
client = agentplatform.Client(project="PROJECT_ID", location="LOCATION")

# --- SINGLE-TURN EVAL (EvalCase list) ---
dataset = types.EvaluationDataset(eval_cases=[
    types.EvalCase(prompt="Query here", response="Model response here"),
])

# --- SINGLE-TURN EVAL (pandas DataFrame) ---
df = pd.DataFrame({
    "prompt": ["Q1", "Q2"],
    "response": ["A1", "A2"],
})
dataset = types.EvaluationDataset(eval_dataset_df=df)

# --- MULTI-TURN AGENT EVAL ---
agent_data = types.evals.AgentData(
    agents={"my_agent": types.evals.AgentConfig(
        agent_id="my_agent", instruction="You are helpful.")},
    turns=[types.evals.ConversationTurn(turn_index=0, events=[
        types.evals.AgentEvent(
            author="user",
            content=genai_types.Content(
                role="user", parts=[genai_types.Part(text="Hello")])),
        types.evals.AgentEvent(
            author="my_agent",
            content=genai_types.Content(
                role="model", parts=[genai_types.Part(text="Hi! How can I help?")])),
    ])],
)
dataset = types.EvaluationDataset(
    eval_cases=[types.EvalCase(agent_data=agent_data)])

# --- METRICS ---
predefined = types.RubricMetric.MULTI_TURN_TRAJECTORY_QUALITY
custom_llm = types.LLMMetric(
    name="tone", prompt_template="Is this polite? Response: {response}")
custom_code = types.CodeExecutionMetric(
    name="check",
    custom_function='def evaluate(instance): return {"score": 1.0}')

# --- EVALUATE ---
result = client.evals.evaluate(dataset=dataset, metrics=[predefined])

# --- RESULTS ---
for s in result.summary_metrics:
    print(f"{s.metric_name}: mean={s.mean_score}, pass_rate={s.pass_rate}")
for case in result.eval_case_results:
    for cand in case.response_candidate_results:
        for name, r in cand.metric_results.items():
            print(f"  {name}: score={r.score}, explanation={r.explanation}")


See [references/sdk_patterns.md](references/sdk_patterns.md) for advanced
patterns: synthetic data generation, pairwise comparison,
`MetricPromptBuilder`, multi-agent evaluation.


Bundled scripts

Script	When to use
`validate_dataset.py`	Before Stage 3 — catch malformed `EvaluationDataset` JSON.
`parse_adk_traces.py`	Stage 1 — convert ADK session dumps to the canonical dataset shape.
`inspect_results.py`	Stages 3/4 — render summary + per-case scores. `--save-html` for a browsable report.
`compare_results.py`	Stage 5 — diff baseline vs. candidate, detect regressions.
`render_html_report.py`	Render HTML from a saved result JSON or loss-clusters JSON.
