agent-platform-eval-flywheel
Agent Platform Eval Flywheel Skill
Help users evaluate and iteratively improve GenAI models and agents using
the Agent Platform GenAI Evaluation SDK (`google.genai` / `agentplatform`).
When to use this skill
- Evaluating GenAI agents or models with the Agent Platform GenAI Evaluation SDK (`client.evals.evaluate()`).
- Creating evaluation datasets from session traces, pandas DataFrames, or synthetic generation.
- Selecting, configuring, or writing custom evaluation metrics.
- Analyzing rubric verdicts, loss patterns, and clustering failures.
- Suggesting concrete code/prompt improvements based on eval results.
Setup
Install the SDK (the requirement specifiers are quoted so the shell does not treat `>=` as a redirect):
```bash
pip install "google-cloud-aiplatform[evaluation]>=1.154.0" "google-genai>=1.0.0"
```
The SDK needs `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION`. Check the environment variables first; if either is missing, ask the user. Newer Gemini models often need `location="global"`.
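The environment check described above could be scripted along these lines. This is a minimal sketch: the helper name `missing_env_vars` is hypothetical and not part of the SDK; only the two variable names come from the setup notes.

```python
import os

# Variable names taken from the setup notes above.
REQUIRED_VARS = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Ask the user for:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to test without touching the real environment.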
The Quality Flywheel
Five stages, run in order on the first pass, then loop 2 → 5 until quality
targets are met.
Shortcuts that waste time
| Shortcut | Why it fails |
|---|---|
| "I'll tune the metric threshold down so it passes." | Hides real failures. Fix the agent, not the bar. |
| "This case is flaky, I'll skip it." | Flakiness reveals non-determinism in the agent. Fix with |
| "I just need to fix the eval dataset, not the agent." | If expected outputs keep moving, the agent has a behavior problem. |
| "I can tell from the trace it works — skip Stage 3." | Self-grading doesn't generalize. Always run |
| "One iteration is enough." | Expect 5–10+ iterations. Stopping early leaves regressions on other metrics undetected. |
1. Prepare Data
Produce an `EvaluationDataset`. There are three input shapes; pick the one
that matches the data the user already has:
- `EvalCase` list (single-turn or multi-turn):
  ```python
  from agentplatform import types

  dataset = types.EvaluationDataset(eval_cases=[
      types.EvalCase(prompt="What is 2+2?", response="4", reference="4"),
      # For multi-turn agent traces, set agent_data instead of prompt/response.
  ])
  ```
  Multi-turn agent traces wrap each conversation in `AgentData` → `ConversationTurn` → `AgentEvent`. See references/dataset_schema.md for the full type hierarchy.
- Pandas DataFrame (tabular sources: CSV, BigQuery, Sheets):
  ```python
  import pandas as pd
  from agentplatform import types

  df = pd.DataFrame({
      "prompt": ["What is 2+2?", "Capital of France?"],
      "response": ["4", "Paris"],
      "reference": ["4", "Paris"],
  })
  dataset = types.EvaluationDataset(eval_dataset_df=df)
  ```
  Column names must match the fields the chosen metrics expect (see references/dataset_schema.md for the per-metric requirements table).
- Cold start (no data at all): synthesize scenarios server-side with `client.evals.generate_user_scenarios(...)` and a `UserScenarioGenerationConfig` (`user_scenario_count`, `simulation_instruction`, `environment_data`). Stage 2 plays them out.

For ADK session dumps, use `scripts/parse_adk_traces.py` instead of writing
the conversion by hand.
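Before building a dataset from a tabular source, it may help to confirm that the expected columns are present. The sketch below uses only the standard library. `REQUIRED_COLUMNS` is an assumption copied from the DataFrame example (the actual per-metric requirements live in references/dataset_schema.md), and `missing_columns` is a hypothetical helper, not an SDK function.

```python
import csv
import io

# Assumed column set, copied from the DataFrame example; adjust per metric.
REQUIRED_COLUMNS = ("prompt", "response", "reference")


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return required column names absent from the CSV header (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []  # fieldnames is None for empty input
    return [name for name in required if name not in header]


sample = "prompt,response\nWhat is 2+2?,4\n"
print(missing_columns(sample))  # prints ['reference']
```

Running a check like this first gives a clearer error than a metric failing later on a missing field.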
2. Run Inference
Populate responses/traces on the dataset. Skip this stage if traces are
already complete (e.g., production logs or replay).
```python
# Agent eval: pass a callable wrapping the user's ADK Agent/App.
client.evals.run_inference(model=agent_callable, src=dataset)

# Model eval: pass a model ID directly.
client.evals.run_inference(model="gemini-2.5-flash", src=dataset)

# Synthesized scenarios: let the simulator drive.
client.evals.run_inference(
    model=agent_callable,
    src=dataset,
    user_simulator_config=UserSimulatorConfig(max_turn=10),
)

# DataFrame also works as src=; no EvalCase wrapping needed.
client.evals.run_inference(model="gemini-2.5-flash", src=df)
```
3. Grade (always run)
```python
result = client.evals.evaluate(dataset=dataset, metrics=[...])
```
Pick metrics by what you want to measure. Full catalog in
references/metric_registry.md.
Agent metrics (multi-turn, adaptive rubrics) — start here for agent eval.
| Goal | Metric |
|---|---|
| Did the agent achieve the user's goal? | |
| Was the reasoning path logical and efficient? | |
| Tool/function calling quality across turns | |
| Overall conversational quality | |
| Final response quality (no reference needed) | |
| Final response vs. a golden reference | |
| Single-turn tool use | |
General quality metrics (single-turn, adaptive rubrics) — for model eval.
| Goal | Metric |
|---|---|
| Overall response quality (recommended starting point) | |
| Linguistic quality (fluency, coherence, grammar) | |
| Adherence to specific constraints / instructions | |
Static rubric metrics (fixed criteria) — apply alongside the above.
| Goal | Metric |
|---|---|
| Catch hallucinated claims (RAG, factual answers) | |
| Factuality / consistency against provided context | |
| Safety policy compliance | |
For a domain-specific check that no built-in metric covers, write a custom metric.
- Predefined: `types.RubricMetric.<NAME>` (server-side AutoRater, no judge model needed).
- Custom LLM-as-a-judge: `types.LLMMetric` with `prompt_template` or `types.MetricPromptBuilder` for structured rubrics.
- Custom code: `types.CodeExecutionMetric` with a `custom_function` string containing `def evaluate(instance: dict)` for remote sandboxed execution; or `types.Metric` with `custom_function=<callable>` for local execution.
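As an illustration of the custom code option, the function below is the kind of body that could be supplied as `custom_function`. It is a sketch under assumptions: the `response` and `reference` keys on `instance` are assumed from the dataset fields used earlier, and the `{"score": ...}` return shape follows the quick-reference example in this document. Check the SDK source for the exact contract before relying on it.

```python
def evaluate(instance: dict) -> dict:
    """Exact-match check, ignoring case and surrounding whitespace.

    Assumes `instance` exposes "response" and "reference" keys.
    """
    response = str(instance.get("response", "")).strip().lower()
    reference = str(instance.get("reference", "")).strip().lower()
    # An empty response never counts as a match.
    return {"score": 1.0 if response and response == reference else 0.0}


print(evaluate({"response": " Paris ", "reference": "paris"}))  # {'score': 1.0}
print(evaluate({"response": "Lyon", "reference": "Paris"}))     # {'score': 0.0}
```

A deterministic check like this pairs well with rubric metrics: it is cheap and catches regressions that need no judgment call.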
Always persist the result so Stages 4 and 5 can read it. Save both JSON
(machine-readable, diffable) and HTML (human-readable, linkable):
```python
import datetime
from pathlib import Path
from agentplatform._genai import _evals_visualization

out_dir = Path("artifacts/grade_results")
out_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
result_json = result.model_dump_json()
(out_dir / f"results_{ts}.json").write_text(result_json)
html = _evals_visualization.get_evaluation_html(result_json)
(out_dir / f"results_{ts}.html").write_text(str(html))
```
Or after the fact: `scripts/render_html_report.py --type evaluation` or
`scripts/inspect_results.py --save-html`.
4. Analyze Failures
Read `summary_metrics` and `eval_case_results`; never fabricate scores.
Use `scripts/inspect_results.py --failing-only` to filter to failures.
For each failed metric, see references/failure_patterns.md for deeper
diagnoses. The compact mapping:
| Failing metric | What to change |
|---|---|
| The agent isn't completing the goal — fix orchestration, missing tool calls, premature termination, wrong tool selection. |
| The agent reaches the goal inefficiently — refine planning prompts, remove redundant tool calls. |
| Fix tool descriptions, parameter docstrings, or agent instructions for tool selection. |
| Read auto-generated rubric verdicts; refine instructions to address the worst-scoring criterion. |
| The agent's final answer doesn't match the golden reference — adjust response format or update the reference. |
| Tighten instructions to stay grounded in tool output; verify the tool actually returned the claimed data. |
| The response contradicts the provided context — add explicit "cite only from context" instructions. |
| Add safety guardrails; review the violating content category in the rubric verdict. |
| Adjust system instruction wording; the model's default phrasing is too generic for the task. |
| The agent is ignoring constraints — restate them in the system instruction or use stricter wording. |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or |
| Agent calls extra tools | Add explicit stop instructions, or switch to |
For 10+ failures on the same metric, use the Error Analysis service
to cluster failures into themes (L1/L2 taxonomy categories) instead of
reading every trace:
```python
# Only supports multi_turn_task_success and multi_turn_tool_use_quality.
# Service runs in the global region.
analysis_client = agentplatform.Client(project="PROJECT_ID", location="global")
response = analysis_client.evals.generate_loss_clusters(
    eval_result=result,
    metric="multi_turn_task_success",
    config={"max_top_cluster_count": 5},
)
for r in response.results:
    for cluster in r.clusters:
        print(
            f"[{cluster.taxonomy_entry.l1_category}/"
            f"{cluster.taxonomy_entry.l2_category}] "
            f"{cluster.item_count} cases — {cluster.taxonomy_entry.description}"
        )
```
Save `response.model_dump_json()` and render with
`scripts/render_html_report.py --type loss-analysis`.
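Filtering to failing cases can also be done directly on a saved result. The sketch below works on plain dictionaries and assumes the dumped JSON mirrors the attribute names used in this document (`eval_case_results`, `response_candidate_results`, `metric_results`, `score`). The `threshold` parameter and the `failing_cases` helper are hypothetical, and the sample data is hand-written rather than real SDK output.

```python
def failing_cases(result: dict, metric: str, threshold: float = 0.5) -> list:
    """Return indices of eval cases where any candidate scores below threshold."""
    failures = []
    for index, case in enumerate(result.get("eval_case_results", [])):
        for cand in case.get("response_candidate_results", []):
            entry = cand.get("metric_results", {}).get(metric)
            if entry is None or entry.get("score") is None:
                continue  # metric missing or unscored: not counted as a failure here
            if entry["score"] < threshold:
                failures.append(index)
                break  # one failing candidate is enough to flag the case
    return failures


# Hand-written stand-in for a saved result, not real SDK output.
example = {
    "eval_case_results": [
        {"response_candidate_results": [
            {"metric_results": {"multi_turn_task_success": {"score": 0.9}}}]},
        {"response_candidate_results": [
            {"metric_results": {"multi_turn_task_success": {"score": 0.2}}}]},
        {"response_candidate_results": [
            {"metric_results": {"multi_turn_task_success": {"score": None}}}]},
    ]
}
print(failing_cases(example, "multi_turn_task_success"))  # prints [1]
```

Unscored cases are skipped rather than flagged, so report them separately instead of treating them as passes.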
5. Optimize & Iterate
Apply a fix targeting the failing metric. Re-run Stage 3. Compare with
`scripts/compare_results.py --baseline <prev> --candidate <new>` to confirm
the target improved AND no other metric regressed.
Track progress across iterations:
| Iteration | Metric A | Metric B | Change made |
|---|---|---|---|
| Baseline | 0.62 | 0.55 | — |
| v2 | 0.78 | 0.68 | Added grounding prompt |
| v3 | 0.81 | 0.72 | Fixed tool selection |
Expect 5–10+ iterations per failing case. Only after a case passes should
you expand coverage with more eval cases.
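The "target improved and nothing else regressed" check can be expressed as a small comparison. This sketch is not the bundled `scripts/compare_results.py`; it assumes each iteration is summarized as a mapping from metric name to mean score, and the `compare_iterations` helper and its `tolerance` parameter are hypothetical. The numbers reuse the Baseline and v2 rows from the table above.

```python
def compare_iterations(baseline, candidate, target, tolerance=0.0):
    """Report whether the target metric improved and which other metrics dropped."""
    improved = candidate[target] > baseline[target]
    regressed = sorted(
        name
        for name, score in baseline.items()
        # A metric missing from the candidate is treated as regressed.
        if name != target and candidate.get(name, float("-inf")) < score - tolerance
    )
    return {"target_improved": improved, "regressed": regressed}


baseline = {"metric_a": 0.62, "metric_b": 0.55}
v2 = {"metric_a": 0.78, "metric_b": 0.68}
print(compare_iterations(baseline, v2, target="metric_a"))
# prints {'target_improved': True, 'regressed': []}
```

A non-empty `regressed` list means the fix should be revisited before moving to the next iteration.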
Proving your work
Never claim eval results you didn't read from an actual `result` object.
- After running eval, print the `summary_metrics` table (`scripts/inspect_results.py`).
- After a fix, show before/after via `scripts/compare_results.py`.
- Before declaring success, confirm ALL cases pass, not just the one you were working on.

If you can't produce the evidence (SDK call failed, result truncated,
metric unsupported), say so explicitly. Don't paper over gaps.
Rules of Engagement
- Always Plan First: Before writing a script, output a `<plan>` block detailing the steps you are about to take.
- Step-by-Step Execution: Write the script, execute it, wait for output, then analyze. Don't do everything in one response.
- Standard Python: Use standard Python imports (`import agentplatform`, `from google.genai import types`). Don't use internal import paths.
- Verify Before Guessing: When unsure about SDK types or metrics, check the SDK source code rather than guessing or hallucinating.
SDK Quick Reference
```python
import agentplatform
from agentplatform import types
from google.genai import types as genai_types
import pandas as pd

# Initialize client
client = agentplatform.Client(project="PROJECT_ID", location="LOCATION")

# --- SINGLE-TURN EVAL (EvalCase list) ---
dataset = types.EvaluationDataset(eval_cases=[
    types.EvalCase(prompt="Query here", response="Model response here"),
])

# --- SINGLE-TURN EVAL (pandas DataFrame) ---
df = pd.DataFrame({
    "prompt": ["Q1", "Q2"],
    "response": ["A1", "A2"],
})
dataset = types.EvaluationDataset(eval_dataset_df=df)

# --- MULTI-TURN AGENT EVAL ---
agent_data = types.evals.AgentData(
    agents={"my_agent": types.evals.AgentConfig(
        agent_id="my_agent", instruction="You are helpful.")},
    turns=[types.evals.ConversationTurn(turn_index=0, events=[
        types.evals.AgentEvent(author="user",
            content=genai_types.Content(role="user",
                parts=[genai_types.Part(text="Hello")])),
        types.evals.AgentEvent(author="my_agent",
            content=genai_types.Content(role="model",
                parts=[genai_types.Part(text="Hi! How can I help?")])),
    ])],
)
dataset = types.EvaluationDataset(
    eval_cases=[types.EvalCase(agent_data=agent_data)])

# --- METRICS ---
predefined = types.RubricMetric.MULTI_TURN_TRAJECTORY_QUALITY
custom_llm = types.LLMMetric(name="tone",
    prompt_template="Is this polite? Response: {response}")
custom_code = types.CodeExecutionMetric(name="check",
    custom_function='def evaluate(instance): return {"score": 1.0}')

# --- EVALUATE ---
result = client.evals.evaluate(dataset=dataset, metrics=[predefined])

# --- RESULTS ---
for s in result.summary_metrics:
    print(f"{s.metric_name}: mean={s.mean_score}, pass_rate={s.pass_rate}")
for case in result.eval_case_results:
    for cand in case.response_candidate_results:
        for name, r in cand.metric_results.items():
            print(f"  {name}: score={r.score}, explanation={r.explanation}")
```
See [references/sdk_patterns.md](references/sdk_patterns.md) for advanced
patterns: synthetic data generation, pairwise comparison,
`MetricPromptBuilder`, multi-agent evaluation.
Bundled scripts
| Script | When to use |
|---|---|
| | Before Stage 3: catch malformed |
| `scripts/parse_adk_traces.py` | Stage 1: convert ADK session dumps to the canonical dataset shape. |
| `scripts/inspect_results.py` | Stages 3/4: render summary + per-case scores. |
| `scripts/compare_results.py` | Stage 5: diff baseline vs. candidate, detect regressions. |
| `scripts/render_html_report.py` | Render HTML from a saved result JSON or loss-clusters JSON. |