agent-platform-eval-flywheel

Agent Platform Eval Flywheel Skill

Help users evaluate and iteratively improve GenAI models and agents using the Agent Platform GenAI Evaluation SDK (`google.genai` / `agentplatform`).

When to use this skill

  • Evaluating GenAI agents or models with the Agent Platform GenAI Evaluation SDK (`client.evals.evaluate()`).
  • Creating evaluation datasets from session traces, pandas DataFrames, or synthetic generation.
  • Selecting, configuring, or writing custom evaluation metrics.
  • Analyzing rubric verdicts, loss patterns, and clustering failures.
  • Suggesting concrete code/prompt improvements based on eval results.

Setup

Install the SDK (quote the requirements so the shell does not treat `>=` as a redirect):
```bash
pip install "google-cloud-aiplatform[evaluation]>=1.154.0" "google-genai>=1.0.0"
```
The SDK needs `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION`. Check env vars first; if either is missing, ask the user. Newer Gemini models often need `location="global"`.
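The env-var check above can be sketched in a few lines of standard-library Python. The helper name `missing_env_vars` is hypothetical and not part of the SDK; it only reports which of the two variables still need to be requested from the user.

```python
import os

# The two variables the Setup section asks for.
REQUIRED_VARS = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the real process environment; pass a dict to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars({"GOOGLE_CLOUD_PROJECT": "my-project"}))
```

An empty list means both values are available; otherwise, ask the user for the names returned.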

The Quality Flywheel

Five stages, run in order on the first pass, then loop 2 → 5 until quality targets are met.

Shortcuts that waste time

| Shortcut | Why it fails |
| --- | --- |
| "I'll tune the metric threshold down so it passes." | Hides real failures. Fix the agent, not the bar. |
| "This case is flaky, I'll skip it." | Flakiness reveals non-determinism in the agent. Fix with `temperature=0` or stricter instructions. |
| "I just need to fix the eval dataset, not the agent." | If expected outputs keep moving, the agent has a behavior problem. |
| "I can tell from the trace it works; skip Stage 3." | Self-grading doesn't generalize. Always run `evaluate()` and read scores. |
| "One iteration is enough." | Expect 5–10+ iterations. Stopping early leaves regressions on other metrics undetected. |

1. Prepare Data

Produce an `EvaluationDataset`. There are three input shapes; pick the one that matches the data the user already has:
  • `EvalCase` list (single-turn or multi-turn):
    ```python
    from agentplatform import types

    dataset = types.EvaluationDataset(eval_cases=[
        types.EvalCase(prompt="What is 2+2?", response="4", reference="4"),
        # For multi-turn agent traces, set agent_data instead of prompt/response.
    ])
    ```
    Multi-turn agent traces wrap each conversation in `AgentData` → `ConversationTurn` → `AgentEvent`. See references/dataset_schema.md for the full type hierarchy.
  • Pandas DataFrame (tabular sources such as CSV, BigQuery, Sheets):
    ```python
    import pandas as pd
    from agentplatform import types

    df = pd.DataFrame({
        "prompt":    ["What is 2+2?", "Capital of France?"],
        "response":  ["4",            "Paris"],
        "reference": ["4",            "Paris"],
    })
    dataset = types.EvaluationDataset(eval_dataset_df=df)
    ```
    Column names must match the fields the chosen metrics expect (see references/dataset_schema.md for the per-metric requirements table).
  • Cold start (no data at all): synthesize scenarios server-side with `client.evals.generate_user_scenarios(...)` and a `UserScenarioGenerationConfig` (`user_scenario_count`, `simulation_instruction`, `environment_data`). Stage 2 plays them out.
For ADK session dumps, use `scripts/parse_adk_traces.py` instead of writing the conversion by hand.
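Because column names have to line up with what the chosen metrics expect, it can help to check raw rows before building the dataset. The sketch below is a hypothetical pre-flight helper, not part of the SDK; the required field list is an assumption for illustration, and the authoritative per-metric requirements live in references/dataset_schema.md.

```python
def find_incomplete_rows(rows, required_fields):
    """Return (row_index, missing_fields) for every row lacking a required field.

    A field counts as missing when the key is absent, None, or blank.
    """
    problems = []
    for index, row in enumerate(rows):
        missing = []
        for field in required_fields:
            value = row.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            problems.append((index, missing))
    return problems


# Assumed field set for a reference-based metric (illustrative only).
rows = [
    {"prompt": "What is 2+2?", "response": "4", "reference": "4"},
    {"prompt": "Capital of France?", "response": "", "reference": "Paris"},
]
print(find_incomplete_rows(rows, ["prompt", "response", "reference"]))
```

Rows flagged here usually either need Stage 2 inference to fill in `response`, or should be dropped before grading.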

2. Run Inference

Populate responses/traces on the dataset. Skip this stage if traces are already complete (e.g., production logs or replay).
```python
# Agent eval: pass a callable wrapping the user's ADK Agent/App.
client.evals.run_inference(model=agent_callable, src=dataset)

# Model eval: pass a model ID directly.
client.evals.run_inference(model="gemini-2.5-flash", src=dataset)

# Synthesized scenarios: let the simulator drive.
client.evals.run_inference(
    model=agent_callable,
    src=dataset,
    user_simulator_config=UserSimulatorConfig(max_turn=10),
)

# DataFrame also works as src=, so no EvalCase wrapping is needed.
client.evals.run_inference(model="gemini-2.5-flash", src=df)
```

3. Grade (always run)

```python
result = client.evals.evaluate(dataset=dataset, metrics=[...])
```
Pick metrics by what you want to measure. Full catalog in references/metric_registry.md.
Agent metrics (multi-turn, adaptive rubrics): start here for agent eval.

| Goal | Metric |
| --- | --- |
| Did the agent achieve the user's goal? | `multi_turn_task_success` |
| Was the reasoning path logical and efficient? | `multi_turn_trajectory_quality` |
| Tool/function calling quality across turns | `multi_turn_tool_use_quality` |
| Overall conversational quality | `multi_turn_general_quality` |
| Final response quality (no reference needed) | `final_response_quality` |
| Final response vs. a golden reference | `final_response_match` |
| Single-turn tool use | `tool_use_quality` |

General quality metrics (single-turn, adaptive rubrics): for model eval.

| Goal | Metric |
| --- | --- |
| Overall response quality (recommended starting point) | `general_quality` |
| Linguistic quality (fluency, coherence, grammar) | `text_quality` |
| Adherence to specific constraints / instructions | `instruction_following` |

Static rubric metrics (fixed criteria): apply alongside the above.

| Goal | Metric |
| --- | --- |
| Catch hallucinated claims (RAG, factual answers) | `hallucination` |
| Factuality / consistency against provided context | `grounding` |
| Safety policy compliance | `safety` |

For a domain-specific check that no built-in metric covers, write a custom metric.
  • Predefined: `types.RubricMetric.<NAME>`, a server-side AutoRater; no judge model needed.
  • Custom LLM-as-a-judge: `types.LLMMetric` with `prompt_template`, or `types.MetricPromptBuilder` for structured rubrics.
  • Custom code: `types.CodeExecutionMetric` with a `custom_function` string containing `def evaluate(instance: dict)` for remote sandboxed execution; or `types.Metric` with `custom_function=<callable>` for local execution.
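Since a `CodeExecutionMetric` takes its function as a source string, it is cheap to sanity-check that string locally before sending it for remote execution. The sketch below is only an illustration: the `"response"` key on `instance` and the `{"score": ...}` return shape are assumptions modeled on the one-line example in the SDK Quick Reference, and `load_evaluate` is a hypothetical helper rather than an SDK call.

```python
# Source string in the shape described above: it defines evaluate(instance: dict).
# The "response" key and the {"score": ...} return value are assumptions.
CUSTOM_FUNCTION = '''
import json

def evaluate(instance: dict) -> dict:
    """Score 1.0 when the response parses as a JSON object, else 0.0."""
    try:
        parsed = json.loads(instance.get("response", ""))
    except (TypeError, ValueError):
        return {"score": 0.0}
    return {"score": 1.0 if isinstance(parsed, dict) else 0.0}
'''


def load_evaluate(source):
    """Execute the source in a fresh namespace and return its evaluate function."""
    namespace = {}
    exec(source, namespace)
    return namespace["evaluate"]


# Local dry run on two hand-written instances.
evaluate = load_evaluate(CUSTOM_FUNCTION)
print(evaluate({"response": '{"answer": 4}'}))
print(evaluate({"response": "not json"}))
```

Once the dry run behaves as expected, the same `CUSTOM_FUNCTION` string is what would be passed as `custom_function`.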
Always persist the result so Stages 4 and 5 can read it. Save both JSON (machine-readable, diffable) and HTML (human-readable, linkable):
```python
import datetime
from pathlib import Path

# Private module (leading underscore); its path may change between SDK releases.
from agentplatform._genai import _evals_visualization

# Timestamped file names keep every iteration's results side by side.
out_dir = Path("artifacts/grade_results")
out_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# JSON copy: machine-readable and diffable.
result_json = result.model_dump_json()
(out_dir / f"results_{ts}.json").write_text(result_json, encoding="utf-8")

# HTML copy: human-readable report rendered from the same JSON.
html = _evals_visualization.get_evaluation_html(result_json)
(out_dir / f"results_{ts}.html").write_text(str(html), encoding="utf-8")
```
Or render after the fact with `scripts/render_html_report.py --type evaluation` or `scripts/inspect_results.py --save-html`.
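With results saved under timestamped names, the two most recent files can be picked up mechanically for a later comparison. This is a minimal sketch assuming the `artifacts/grade_results/results_<timestamp>.json` naming used above; `latest_two_results` and `compare_command` are hypothetical helpers, and only the `--baseline` / `--candidate` flags come from this document.

```python
from pathlib import Path


def latest_two_results(out_dir):
    """Return (baseline, candidate): the two newest results_*.json files.

    Relies on the %Y%m%d_%H%M%S timestamp sorting lexicographically in
    chronological order. Returns None when fewer than two files exist.
    """
    files = sorted(Path(out_dir).glob("results_*.json"))
    if len(files) < 2:
        return None
    return files[-2], files[-1]


def compare_command(baseline, candidate):
    """Build the argument list for the bundled comparison script."""
    return [
        "python", "scripts/compare_results.py",
        "--baseline", str(baseline),
        "--candidate", str(candidate),
    ]
```

Calling `latest_two_results("artifacts/grade_results")` after a second grading run would give the pair to hand to `compare_command`.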

4. Analyze Failures

Read `summary_metrics` and `eval_case_results`; never fabricate scores. Use `scripts/inspect_results.py --failing-only` to filter to failures.
For each failed metric, see references/failure_patterns.md for deeper diagnoses. The compact mapping:

| Failing metric | What to change |
| --- | --- |
| `multi_turn_task_success` low | The agent isn't completing the goal: fix orchestration, missing tool calls, premature termination, wrong tool selection. |
| `multi_turn_trajectory_quality` low | The agent reaches the goal inefficiently: refine planning prompts, remove redundant tool calls. |
| `multi_turn_tool_use_quality` low | Fix tool descriptions, parameter docstrings, or agent instructions for tool selection. |
| `final_response_quality` low | Read auto-generated rubric verdicts; refine instructions to address the worst-scoring criterion. |
| `final_response_match` low | The agent's final answer doesn't match the golden reference: adjust response format or update the reference. |
| `hallucination` low | Tighten instructions to stay grounded in tool output; verify the tool actually returned the claimed data. |
| `grounding` low | The response contradicts the provided context: add explicit "cite only from context" instructions. |
| `safety` low | Add safety guardrails; review the violating content category in the rubric verdict. |
| `general_quality` / `text_quality` low | Adjust system instruction wording; the model's default phrasing is too generic for the task. |
| `instruction_following` low | The agent is ignoring constraints: restate them in the system instruction or use stricter wording. |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or `tool_config`. |
| Agent calls extra tools | Add explicit stop instructions, or switch to `multi_turn_tool_use_quality` to surface the extra calls in the rubric. |

For 10+ failures on the same metric, use the Error Analysis service to cluster failures into themes (L1/L2 taxonomy categories) instead of reading every trace:

```python
# Only supports multi_turn_task_success and multi_turn_tool_use_quality.
# Service runs in the global region.
analysis_client = agentplatform.Client(project="PROJECT_ID", location="global")
response = analysis_client.evals.generate_loss_clusters(
    eval_result=result,
    metric="multi_turn_task_success",
    config={"max_top_cluster_count": 5},
)
for r in response.results:
    for cluster in r.clusters:
        print(
            f"[{cluster.taxonomy_entry.l1_category}/"
            f"{cluster.taxonomy_entry.l2_category}] "
            f"{cluster.item_count} cases - {cluster.taxonomy_entry.description}"
        )
```

Save `response.model_dump_json()` and render with `scripts/render_html_report.py --type loss-analysis`.
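To decide which metrics have crossed the 10-failure mark before calling the clustering service, a plain tally over per-case scores is enough. The sketch below assumes the scores have already been extracted into one `{metric_name: score}` dict per case; that shape, the helper name, and the `pass_threshold` default are all illustrative assumptions, not SDK behavior.

```python
from collections import Counter


def metrics_needing_clustering(case_scores, pass_threshold=0.5, min_failures=10):
    """Return {metric: failure_count} for metrics with at least min_failures failures.

    case_scores: iterable of {metric_name: score} dicts, one per eval case.
    A case fails a metric when its score is below pass_threshold.
    """
    failures = Counter()
    for scores in case_scores:
        for metric, score in scores.items():
            if score < pass_threshold:
                failures[metric] += 1
    return {metric: count for metric, count in failures.items() if count >= min_failures}


# Twelve hypothetical cases that all fail task success but pass tool use.
cases = [
    {"multi_turn_task_success": 0.0, "multi_turn_tool_use_quality": 1.0}
    for _ in range(12)
]
print(metrics_needing_clustering(cases))
```

Metrics returned here are candidates for clustering, provided they are among the two the service supports.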

5. Optimize & Iterate

Apply a fix targeting the failing metric. Re-run Stage 3. Compare with `scripts/compare_results.py --baseline <prev> --candidate <new>` to confirm the target improved AND no other metric regressed.
Track progress across iterations:

| Iteration | Metric A | Metric B | Change made |
| --- | --- | --- | --- |
| Baseline | 0.62 | 0.55 | |
| v2 | 0.78 | 0.68 | Added grounding prompt |
| v3 | 0.81 | 0.72 | Fixed tool selection |

Expect 5–10+ iterations per failing case. Only after a case passes should you expand coverage with more eval cases.
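The "target improved AND no other metric regressed" check can be expressed directly over two score tables. This is a minimal sketch using the illustrative numbers from the tracking table; `find_regressions` is a hypothetical helper and is not a substitute for the bundled `scripts/compare_results.py`.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {metric: (old, new)} for metrics that dropped by more than tolerance.

    Metrics missing from the candidate are ignored rather than treated as drops.
    """
    return {
        name: (old, candidate[name])
        for name, old in baseline.items()
        if name in candidate and old - candidate[name] > tolerance
    }


# Scores taken from the Baseline and v2 rows of the tracking table.
baseline = {"Metric A": 0.62, "Metric B": 0.55}
v2 = {"Metric A": 0.78, "Metric B": 0.68}
print(find_regressions(baseline, v2))
```

An empty result means the iteration is safe to keep; any entry is a metric to investigate before moving on.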

Proving your work

Never claim eval results you didn't read from an actual `result` object.
  • After running eval, print the `summary_metrics` table (`scripts/inspect_results.py`).
  • After a fix, show before/after via `scripts/compare_results.py`.
  • Before declaring success, confirm ALL cases pass, not just the one you were working on.
If you can't produce the evidence (SDK call failed, result truncated, metric unsupported), say so explicitly. Don't paper over gaps.

Rules of Engagement

  1. Always Plan First: Before writing a script, output a `<plan>` block detailing the steps you are about to take.
  2. Step-by-Step Execution: Write the script, execute it, wait for output, then analyze. Don't do everything in one response.
  3. Standard Python: Use standard Python imports (`import agentplatform`, `from google.genai import types`). Don't use internal import paths.
  4. Verify Before Guessing: When unsure about SDK types or metrics, check the SDK source code rather than guessing or hallucinating.

SDK Quick Reference

```python
import agentplatform
from agentplatform import types
from google.genai import types as genai_types
import pandas as pd

# Initialize client
client = agentplatform.Client(project="PROJECT_ID", location="LOCATION")

# --- SINGLE-TURN EVAL (EvalCase list) ---
dataset = types.EvaluationDataset(eval_cases=[
    types.EvalCase(prompt="Query here", response="Model response here"),
])

# --- SINGLE-TURN EVAL (pandas DataFrame) ---
df = pd.DataFrame({
    "prompt": ["Q1", "Q2"],
    "response": ["A1", "A2"],
})
dataset = types.EvaluationDataset(eval_dataset_df=df)

# --- MULTI-TURN AGENT EVAL ---
agent_data = types.evals.AgentData(
    agents={"my_agent": types.evals.AgentConfig(
        agent_id="my_agent", instruction="You are helpful.")},
    turns=[types.evals.ConversationTurn(turn_index=0, events=[
        types.evals.AgentEvent(
            author="user",
            content=genai_types.Content(
                role="user", parts=[genai_types.Part(text="Hello")])),
        types.evals.AgentEvent(
            author="my_agent",
            content=genai_types.Content(
                role="model",
                parts=[genai_types.Part(text="Hi! How can I help?")])),
    ])],
)
dataset = types.EvaluationDataset(
    eval_cases=[types.EvalCase(agent_data=agent_data)])

# --- METRICS ---
predefined = types.RubricMetric.MULTI_TURN_TRAJECTORY_QUALITY
custom_llm = types.LLMMetric(
    name="tone",
    prompt_template="Is this polite? Response: {response}")
custom_code = types.CodeExecutionMetric(
    name="check",
    custom_function='def evaluate(instance): return {"score": 1.0}')

# --- EVALUATE ---
result = client.evals.evaluate(dataset=dataset, metrics=[predefined])

# --- RESULTS ---
for s in result.summary_metrics:
    print(f"{s.metric_name}: mean={s.mean_score}, pass_rate={s.pass_rate}")
for case in result.eval_case_results:
    for cand in case.response_candidate_results:
        for name, r in cand.metric_results.items():
            print(f"  {name}: score={r.score}, explanation={r.explanation}")
```

See [references/sdk_patterns.md](references/sdk_patterns.md) for advanced patterns: synthetic data generation, pairwise comparison, `MetricPromptBuilder`, multi-agent evaluation.

Bundled scripts

| Script | When to use |
| --- | --- |
| `validate_dataset.py` | Before Stage 3: catch malformed `EvaluationDataset` JSON. |
| `parse_adk_traces.py` | Stage 1: convert ADK session dumps to the canonical dataset shape. |
| `inspect_results.py` | Stages 3/4: render summary + per-case scores. `--save-html` for a browsable report. |
| `compare_results.py` | Stage 5: diff baseline vs. candidate, detect regressions. |
| `render_html_report.py` | Render HTML from a saved result JSON or loss-clusters JSON. |