llm-obs-experiment-analyzer

Backend

Detection — At the start of every invocation, before taking any action, determine which backend to use:

1. If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__get_llmobs_experiment_summary` appears in your available tools.
   - If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
   - If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
3. If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
4. If neither is available → stop and tell the user: "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."

--backend pup

is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.
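
The detection order above can be sketched as a small decision function. This is only an illustration under stated assumptions: `strip_backend_flag` and `choose_backend` are hypothetical helper names, the arguments are assumed to arrive as a token list, and the result of `pup --version` is assumed to be summarized as a boolean by the caller.

```python
from typing import Iterable, List, Tuple

# Canonical signal named in the detection steps.
MCP_SIGNAL_TOOL = "mcp__datadog-llmo-mcp__get_llmobs_experiment_summary"


def strip_backend_flag(args: List[str]) -> Tuple[List[str], bool]:
    """Remove every `--backend pup` pair and report whether one was present."""
    remaining: List[str] = []
    forced = False
    i = 0
    while i < len(args):
        if args[i] == "--backend" and i + 1 < len(args) and args[i + 1] == "pup":
            forced = True
            i += 2  # skip the flag and its value
        else:
            remaining.append(args[i])
            i += 1
    return remaining, forced


def choose_backend(forced_pup: bool, available_tools: Iterable[str], pup_responds: bool) -> str:
    """Return "pup", "mcp", or "none" following the documented precedence."""
    if forced_pup:  # an explicit flag wins even when MCP tools are present
        return "pup"
    if MCP_SIGNAL_TOOL in set(available_tools):
        return "mcp"
    if pup_responds:  # hypothetical: caller ran `pup --version` and saw "version"
        return "pup"
    return "none"  # caller should stop and show the setup message


if __name__ == "__main__":
    rest, forced = strip_backend_flag(["abc", "--backend", "pup", "why", "slow?"])
    print(rest, choose_backend(forced, [MCP_SIGNAL_TOOL], pup_responds=False))
```

Keeping the flag stripping separate from the decision means the remaining tokens can be handed to argument parsing unchanged.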

pup invocation rules:

Invoke via Bash: `pup llm-obs <subcommand> [flags]`
pup always outputs JSON. Parse it directly; no content-block unwrapping is needed (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": "<json>"}]`).
If pup returns an auth error, tell the user to run `pup auth login` and stop.
Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
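
The difference between the two result shapes can be handled by one small normalizer. A minimal sketch, assuming pup output arrives as a JSON string from stdout and MCP output arrives as a list of content blocks; `parse_tool_result` is a hypothetical helper name, not part of either tool.

```python
import json
from typing import Any


def parse_tool_result(raw: Any) -> Any:
    """Return the decoded JSON payload regardless of which backend produced it."""
    # pup mode: stdout is already a plain JSON document.
    if isinstance(raw, str):
        return json.loads(raw)
    # MCP mode: JSON may be wrapped as [{"type": "text", "text": "<json>"}].
    if isinstance(raw, list) and raw and isinstance(raw[0], dict) and raw[0].get("type") == "text":
        return json.loads(raw[0]["text"])
    # Anything else is assumed to be decoded already.
    return raw


if __name__ == "__main__":
    print(parse_tool_result('{"version": "1.0"}'))
    print(parse_tool_result([{"type": "text", "text": '{"total": 3}'}]))
```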

Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.

Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-experiment-analyzer[<inv_id>] —` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-experiment-analyzer:start[<inv_id>] —` instead (note the `:start` suffix). Example first call:

skill:llm-obs-experiment-analyzer:start[3a9f1c2b] — Phase 1: get experiment summary to orient analysis
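
As an illustration of the ID and prefix format, the following sketch generates an 8-character hex ID and builds the intent string. The helper names `new_invocation_id` and `intent` are hypothetical; only the string format comes from the rules above.

```python
import secrets

SKILL = "llm-obs-experiment-analyzer"


def new_invocation_id() -> str:
    """Generate an 8-character lowercase hex ID (4 random bytes)."""
    return secrets.token_hex(4)


def intent(inv_id: str, reason: str, first_call: bool = False) -> str:
    """Build the telemetry.intent value; only the first call carries `:start`."""
    suffix = ":start" if first_call else ""
    return f"skill:{SKILL}{suffix}[{inv_id}] — {reason}"


if __name__ == "__main__":
    inv_id = new_invocation_id()  # generate once, reuse for every call
    print(intent(inv_id, "Phase 1: get experiment summary to orient analysis", first_call=True))
    print(intent(inv_id, "Phase 2: segment metric by dimension"))
```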

Unified Experiment Analyzer

Analyzes one or two LLM experiments. Supports four modes based on inputs:

Inputs	Mode
2 IDs, no question	Comparative Exploratory
2 IDs + question	Comparative Q&A
1 ID, no question	Single Exploratory
1 ID + question	Single Q&A

Usage

/llm-obs-experiment-analyzer <experiment_id_1> [experiment_id_2] [question text] [--output agent|file|notebook]

Arguments: $ARGUMENTS

Available Tools

Tool	Purpose
`mcp__datadog-llmo-mcp__get_llmobs_experiment_summary`	Get total events, error count, metrics stats, available dimensions
`mcp__datadog-llmo-mcp__list_llmobs_experiment_events`	Query events with filters, sorting, pagination
`mcp__datadog-llmo-mcp__get_llmobs_experiment_event`	Get full event details (input, output, expected_output, metrics)
`mcp__datadog-llmo-mcp__get_llmobs_experiment_metric_values`	Get metric stats overall and segmented by dimension. Use `segment_by_dimension` (not `segment_dimension` ) to segment; optionally `segment_dimension_value` to filter to a specific value.
`mcp__datadog-llmo-mcp__get_llmobs_experiment_dimension_values`	List unique values for a dimension with counts
`mcp__datadog-mcp-core__create_datadog_notebook`	Export report as a Datadog notebook

Phase 0 — Mode & Output Resolution

Parse $ARGUMENTS:

Extract one or two UUID-format strings as experiment IDs (first = baseline/primary, second = candidate).
Extract the `--output agent|file|notebook` flag if present.
The remaining text (after IDs and flags) is the question, if any.

Mode determination:

2 IDs + question → Comparative Q&A
2 IDs, no question → Comparative Exploratory
1 ID + question → Single Q&A
1 ID, no question → Single Exploratory
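
The parsing and mode rules can be sketched as follows. This is a simplified illustration, not the skill's actual parser: it assumes whitespace-separated tokens, treats at most the first two UUID-shaped tokens as experiment IDs, and uses hypothetical helper names `parse_arguments` and `resolve_mode`.

```python
import re
from typing import List, Optional, Tuple

UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
OUTPUT_MODES = {"agent", "file", "notebook"}


def parse_arguments(raw: str) -> Tuple[List[str], Optional[str], str]:
    """Split raw arguments into (experiment IDs, output mode, question text)."""
    tokens = raw.split()
    ids: List[str] = []
    rest: List[str] = []
    output: Optional[str] = None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--output" and i + 1 < len(tokens) and tokens[i + 1] in OUTPUT_MODES:
            output = tokens[i + 1]
            i += 2
            continue
        if UUID_RE.match(tok) and len(ids) < 2:
            ids.append(tok)  # first = baseline/primary, second = candidate
        else:
            rest.append(tok)  # whatever is left over forms the question
        i += 1
    return ids, output, " ".join(rest)


def resolve_mode(ids: List[str], question: str) -> str:
    """Map (number of IDs, question present) to one of the four modes."""
    if len(ids) not in (1, 2):
        raise ValueError("expected one or two experiment IDs")
    scope = "Comparative" if len(ids) == 2 else "Single"
    kind = "Q&A" if question else "Exploratory"
    return f"{scope} {kind}"


if __name__ == "__main__":
    a = "11111111-2222-3333-4444-555555555555"
    b = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
    ids, output, question = parse_arguments(f"{a} {b} why did recall drop? --output file")
    print(resolve_mode(ids, question), output, repr(question))
```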

Output mode determination:

If `--output` was provided in arguments, use that mode and skip asking.

Otherwise, make two separate sequential `AskUserQuestion` calls before proceeding, never combined into a single call:

Analysis type: If no question text was provided in the arguments, ask whether the user wants exploratory analysis or has a specific question. Skip this call only if the user's intent is already clear from context (e.g. they typed a question alongside the IDs).
Output destination: If `--output` was not specified, ask where to deliver the report (chat, file, or Datadog notebook). Always ask this as its own standalone call.

Output modes:

Agent (default): Display the full report in the conversation.
File: Before starting, propose a path: `evals/reports/YYYY-MM-DD-<experiment-slug>-analysis.md`. Present it to the user and let them confirm or adjust. Then proceed.

Notebook: Use `mcp__datadog-mcp-core__create_datadog_notebook` at the end. In pup mode, use `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` instead (see Tool Reference). If neither MCP nor pup is available, output these setup instructions instead of failing:

To enable Datadog notebook export, add the MCP server:
  claude mcp add --transport http datadog-mcp https://mcp.datadoghq.com/api/unstable/mcp-server
See: https://docs.datadoghq.com/bits_ai/mcp_server/setup/

Then ask: "Would you like to fall back to file or agent output instead?" See Phase 5 for full notebook call details.

After resolving mode and output, proceed to Phase 1. There will be one additional `AskUserQuestion` interaction at Phase 1.5 before the deep analysis begins.

Phase 1 — Orient

Comparative: Call `get_llmobs_experiment_summary` for both experiments. Produce a side-by-side comparison:

Scale: total samples and error count for each
Metrics: which metrics exist in each; which are shared
Dimensions: which dimensions exist in each; which are shared
Immediate red flags (errors present, missing metrics, sparse data)
Obvious improvements or regressions visible at the summary level

When `error_count > 0`, call `get_llmobs_experiment_dimension_values` for `error_type` and report the breakdown by exception class (e.g. "2 errors: `asyncio.exceptions.cancellederror`"). Errors mean the executor threw an unhandled exception, so no eval scores were produced for those samples. Do not report a percentage; report the count and type(s).

Single: Call `get_llmobs_experiment_summary` for the experiment. Determine:

Total samples, and error count (with `error_type` breakdown if non-zero)
Available metrics grouped by `metric_type` as returned by the summary (`score`, `boolean`, `categorical`). Do not infer semantic groupings or categories from label name patterns or prefixes; the label string is not a reliable signal for what a metric measures.
Classify each metric using the statistics already returned by the summary (mean, min, max). Do not infer metric meaning from label names or prefixes. Use the classifications defined in Phase 1.5 when referencing metrics throughout the report.
Available dimensions for segmentation
Any immediate red flags

Phase 1.5 — Metrics Selection

After completing Phase 1, run the following three steps before any `AskUserQuestion` call.

Step 1 — Classify every metric using summary statistics only (no additional tool calls):

Class	Condition	Meaning
`always_zero`	`max == 0`	Feature disabled or not implemented — no signal
`perfect`	`min == 1`	Always passes — no diagnostic signal
`saturated`	`mean ≥ 0.99` and `min < 1`	Rarely fails — low diagnostic value
`struggling`	`mean < 0.70`	Meaningful failure rate — highest diagnostic value
`interesting`	`0.70 ≤ mean < 0.99` and `min < max`	Partial failures — moderate diagnostic value
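
The table's conditions overlap (an always-zero metric also has `mean < 0.70`), so a classifier needs an evaluation order. The sketch below assumes the rows are checked top to bottom and that metrics are on a 0 to 1 scale; both are assumptions, as are the function name `classify_metric` and the `unclassified` fallback for metrics that match no row.

```python
def classify_metric(mean: float, min_value: float, max_value: float) -> str:
    """Classify a metric from summary statistics, checking table rows in order."""
    if max_value == 0:
        return "always_zero"  # feature disabled or not implemented
    if min_value == 1:
        return "perfect"  # always passes
    if mean >= 0.99 and min_value < 1:
        return "saturated"  # rarely fails
    if mean < 0.70:
        return "struggling"  # meaningful failure rate
    if 0.70 <= mean < 0.99 and min_value < max_value:
        return "interesting"  # partial failures
    return "unclassified"  # hypothetical fallback, e.g. a constant 0.8 metric


if __name__ == "__main__":
    for label, stats in {
        "open_answer": (0.33, 0.0, 1.0),
        "c_permanence": (0.68, 0.0, 1.0),
        "format_ok": (1.0, 1.0, 1.0),
    }.items():
        print(label, classify_metric(*stats))
```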

Step 2 — Print the full metric table to chat before asking any question. This gives the user complete visibility — never truncated by option limits. Format:

Found N metrics. Full breakdown:

| Metric | Mean | Class |
|--------|------|-------|
| <label> | <mean> | ⚠️ Struggling |
| <label> | <mean> | Interesting |
| <label> | <mean> | Saturated |
| <label> | 1.000 | Perfect (no signal) |
| <label> | 0.000 | Always zero (disabled?) |

Flag any `always_zero` metrics with a note, e.g. "N metrics always score 0 and appear to be disabled features; they will be excluded from suggested groupings."

Step 3 — AskUserQuestion with options built entirely from the computed classes:

Generate options dynamically based on what is actually present in the data. Do not invent option names from label prefixes.

"Struggling metrics (N) — Recommended": only shown if N ≥ 1. Description explicitly lists each metric label and its mean (e.g. "`open_answer` 0.33, `c_permanence` 0.68"). This is the grounded suggestion, based on observed pass rates rather than label names. If there are no struggling metrics, replace this option with "Lowest-performing metrics (N)" covering the bottom N by mean.
"Interesting + struggling (N)": shown only if there are interesting-class metrics in addition to struggling ones. Description lists them with means.
"All metrics (N)": always shown. Note in the description that always-zero and perfect metrics add noise but are included.
"A specific metric": always shown. Description says: "Choose one from the table printed above."

If the user selects "A specific metric", make a second `AskUserQuestion` call that shows the 4 metrics with the lowest mean as labeled options (label = metric name, description = `mean: X.XX — class`). In the question text, explicitly say: "Or type any metric name from the table above into 'Other'." The `always_zero` and `perfect` metrics must not appear in the 4 options (they have no diagnostic value); pick the 4 from the `struggling` and `interesting` classes only. After the user picks one, restrict all analysis in Phases 2–4 to that single metric only.

Scope enforcement:

If the user accepts "all", proceed with all metrics (including constant ones, but note their low signal value).
If the user selects a grouping or a specific metric, restrict all analysis in Phases 2–4 strictly to that selection. Do not call `get_llmobs_experiment_metric_values` for any metric outside the selection.

Phase 2 — Signal Discovery + UI Links

Comparative: Using only the metrics selected in Phase 1.5 (intersected with shared metrics) and shared dimensions, identify:

Segments where the candidate outperforms the baseline
Segments where the candidate regresses
Error types present in one but rare in the other
Distribution shifts or coverage gaps
Tradeoffs (e.g., higher recall, lower precision)

Generate Datadog comparison UI links:

Base URL: `https://app.datadoghq.com/llm/experiment-comparison`
Required params: `baselineExperimentId`, `experimentIds` (candidate%2Cbaseline), `tableView=all`
Optional (include if discoverable): `project`, `compareDatasetId`, `selectedEvaluation`
`selectedEvaluation` priority: overall/overall_score/rubric metric → primary metric → first shared metric
Generate 2–4 links: primary comparison, regression view, calibration view (if applicable), worst-segment view (only if supported — never fabricate filters)
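
A comparison link with the required and optional parameters can be assembled with the standard library. This is a sketch: `comparison_link` is a hypothetical helper name, and optional parameters are included only when a value was actually discovered, in keeping with the rule against fabricated filters.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://app.datadoghq.com/llm/experiment-comparison"


def comparison_link(
    baseline_id: str,
    candidate_id: str,
    selected_evaluation: Optional[str] = None,
    project: Optional[str] = None,
    compare_dataset_id: Optional[str] = None,
) -> str:
    """Build a comparison URL; the candidate comes first in experimentIds."""
    params = {
        "baselineExperimentId": baseline_id,
        "experimentIds": f"{candidate_id},{baseline_id}",  # comma is encoded as %2C
        "tableView": "all",
    }
    optional = {
        "project": project,
        "compareDatasetId": compare_dataset_id,
        "selectedEvaluation": selected_evaluation,
    }
    # Only include optional params that were actually discovered.
    params.update({key: value for key, value in optional.items() if value is not None})
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(comparison_link("base-id", "cand-id", selected_evaluation="overall_score"))
```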

Single: Measure per-metric performance across all dimensions for only the metrics selected in Phase 1.5. Identify:

Worst-performing segments (by metric × dimension)
Any segments with surprising pass rates
Overall pass rates and variance

Generate Datadog experiment UI link:

https://app.datadoghq.com/llm/experiments/{experiment_id}

Phase 3 — Deep Dives

Run all necessary deep dives automatically. Do not ask for approval or pause. Scope all deep dives strictly to the metrics selected in Phase 1.5; do not call `get_llmobs_experiment_metric_values` for any metric outside the selection.

Q&A modes: Focus deep dives on what is needed to answer the question directly. Pull specific samples, segment by relevant dimensions, inspect examples.

Exploratory modes: Investigate the most interesting signals broadly:

Per-segment and per-class delta analysis (comparative) or pass-rate analysis (single)
Error overlap vs. unique failure mode analysis
Sampling and qualitative inspection of representative failures (2–5 per issue)
Clustered error theme analysis

Rules:

Prefer cheap, high-signal analyses first; do not stop early.
Mask or redact PII in all outputs.
Avoid destructive actions.

For each sampled event, generate a direct span link:

https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&sp=[{"p":{"experimentId":"{experiment_id}","spanId":"{span_id}"},"i":"experiment-details"}]&spanId={span_id}

For each Deep Dive segment, generate a direct link to view those samples in the (candidate) experiment:

https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&filter[{dimension}]={value}

If you are not confident the filter URL format works for this dimension, omit the filter params and link to the experiment root instead. Never fabricate filter URLs.

Phase 4 — Synthesis

Comparative Exploratory:

Clear wins where the candidate improves on the baseline
Clear regressions or risks the candidate introduces
Neutral or unchanged areas
Root-cause hypotheses (1–4), tied to evidence
Prioritized recommendations: ship as-is / block / gate by segment / combine behaviors

Comparative Q&A:

Direct answer to the question with a clear verdict
Supporting evidence (metrics, percentages, event examples)
Relevant context (e.g., caveats, data limitations)

Single Exploratory:

Overall performance assessment
Worst-performing segments and root causes
Hypotheses for why failures occur
Recommended next experiments

Single Q&A:

Direct answer to the question with a clear verdict
Supporting evidence from the experiment data

All modes: open with a one-line issue type tally — e.g. "3 agent issues, 1 evaluator/dataset issue, 1 ambiguous" — before the detailed findings. Use quantified deltas/rates wherever possible. Redact PII.

Always produce both the `## Summary & Recommendations` and `## Synthesis` sections regardless of experiment complexity, how many metrics exist, or how quickly the answer is apparent. Do not skip Summary because the findings are simple or obvious. Do not skip Synthesis because you've already covered the findings in Deep Dives. These two sections are the most portable output of the analysis: they are what a reader encounters first and last.

Phase 5 — Output Delivery

Agent: Present the full report in the conversation using the report format below.

File: Write the report to the pre-confirmed path. Confirm with: "Report saved to `<path>`."

Notebook: Call `mcp__datadog-mcp-core__create_datadog_notebook` with the following parameters:

`name` (by mode):

Mode	Name
Comparative Exploratory	`Experiment Analysis: {baseline_short} (Baseline) vs {candidate_short} (Candidate) — YYYY-MM-DD`
Comparative Q&A	`Experiment Q&A: {baseline_short} vs {candidate_short} — YYYY-MM-DD`
Single Exploratory	`Experiment Analysis: {experiment_short} — YYYY-MM-DD`
Single Q&A	`Experiment Q&A: {experiment_short} — YYYY-MM-DD`
where `short` = first 8 characters of the UUID.
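
The naming table can be expressed as a single function. A minimal sketch, assuming the mode is passed as one of the four mode names used in this document and that `ids` holds the baseline first and the candidate second; `notebook_name` and `short` are hypothetical helper names.

```python
from datetime import date
from typing import Sequence


def short(experiment_id: str) -> str:
    """First 8 characters of the UUID."""
    return experiment_id[:8]


def notebook_name(mode: str, ids: Sequence[str], today: date) -> str:
    """Build the notebook title for one of the four analysis modes."""
    is_qa = mode.endswith("Q&A")
    prefix = "Experiment Q&A" if is_qa else "Experiment Analysis"
    if mode.startswith("Comparative"):
        baseline, candidate = short(ids[0]), short(ids[1])
        if is_qa:
            subject = f"{baseline} vs {candidate}"
        else:
            subject = f"{baseline} (Baseline) vs {candidate} (Candidate)"
    else:
        subject = short(ids[0])
    return f"{prefix}: {subject} — {today.isoformat()}"


if __name__ == "__main__":
    ids = ["11111111-2222-3333-4444-555555555555", "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"]
    print(notebook_name("Comparative Exploratory", ids, date.today()))
```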

`cells`: one cell per report section; do NOT put the entire report in a single cell. Structure:
- Cell 1 — Summary & Recommendations containing three `###` subheaders: Experiment (link + executive summary), Key Findings (bullets), Recommendations (numbered list). Always present, always first, never skipped regardless of experiment complexity
- Cell 2 — Orientation table
- Cell 3 — What Changed (comparative modes only; omit for single)
- Cell 4 — Signals / Answer to Question
- Cells 5…N — one cell per Deep Dive Finding
- Cell N+1 — Synthesis (issue tally, Overall Performance Assessment, Worst-Performing Segments, Root Cause Hypothesis, Recommended Next Experiments). Always present, always second-to-last
- Cell N+2 — UI Links
Omit the `# Experiment Analysis Report` top-level heading from all cells; it is already shown as the notebook title.
`time`: `{ "live_span": "1h" }`

After the notebook is created, output the URL in chat:

"Report exported to notebook: <url>"

If the tool is unavailable, follow the fallback instructions in Phase 0.

Phase 6 — Conversational Follow-up

After delivering the report, append a follow-up section:

---

Want to explore further?

Here are a few directions based on the findings:

[Specific question derived from actual findings — e.g., "Want me to dig deeper into why the SQL scenarios regressed in the candidate?"]
[Another specific follow-up — e.g., "Should I compare error patterns between the two failing clusters?"]
[A third option if relevant]

Do you have any other questions about this analysis?


Stay active after the report. Answer follow-up questions using the same MCP tools, referencing findings already gathered. Do not re-run analyses you've already performed unless new questions require it.

---

Report Format

Link rules:

Experiment IDs: Wherever a full experiment UUID appears, render it as a Markdown link to `https://app.datadoghq.com/llm/experiments/{full_uuid}`.
Comparative table column headers: In the Orientation table and in every subsequent table that has Baseline/Candidate columns, wrap the entire column header as a link, not just the short ID. Format: ``[Baseline `{short_id}`]({baseline_url})`` and ``[Candidate `{short_id}`]({candidate_url})``. This makes the full header cell clickable, not just the ID portion.

Experiment Analysis Report

Question: {original question text} (Q&A modes only — omit for Exploratory modes)

Summary & Recommendations

Experiment

[Comparative: `{baseline_short}` (Baseline) vs `{candidate_short}` (Candidate) — Compare — Single: `{experiment_short}`]

[2–3 sentence executive summary. Open with "This is a {Mode} analysis..." where {Mode} is one of: Comparative Exploratory, Comparative Q&A, Single Exploratory, Single Q&A. Include experiment(s) purpose, scale, and the headline finding with specific numbers.]

[If the report uses opaque dimension values (e.g. category labels like b1/b2/b3/bx), add a `#### Dataset Categories` sub-subsection here, with one bullet per value: name bolded plus a brief description. Omit if all dimension values are self-explanatory.]

Key Findings

{Finding 1}: one-line description with numbers (e.g. "+4.2pp on `tool_accuracy` across all segments")
{Finding 2}: one-line description
{Finding 3} (if present): one-line description [For Q&A modes: one-line verdict bullet + one-line rationale bullet]

Recommendations

{Recommendation 1}: specific, actionable next step tied to a finding
{Recommendation 2}: specific, actionable next step
{Recommendation 3} (if present): specific, actionable next step [Omit this subsection for Q&A modes unless a clear action follows from the answer.]

Orientation

[Side-by-side table for comparative; summary table for single. Include: samples, errors (count + `error_type` breakdown if non-zero, otherwise "none"), metrics, dimensions. Experiment IDs in column headers must be Markdown links.]

What Changed

[Comparative modes only. Table of differences between baseline and candidate: model, toolset/skill profile, dataset, evaluator schema, and any other metadata differences detectable from the summary data. If no differences are detectable, write: "No configuration differences detected between experiments."]

[Signals | Answer to Question]

[For exploratory: ranked table of signals/segments with metric deltas and impact counts.] [For Q&A: direct answer with verdict, then supporting evidence.]

Deep Dive Findings

[Issue/Finding Title]

Segment: `[dimension=value]` | Impact: N samples | Severity: metric pass rate = X% | View samples

Issue type: `Agent` — the evaluator is sound; the agent output is the problem. | `Evaluator/Dataset` — the agent output may be correct; the rubric, ground truth labels, or scoring logic is suspect. | `Ambiguous` — cannot determine from available evidence whether the agent or evaluator is at fault; flag for manual inspection.

What's happening: [1–2 sentences: key observation and metric impact only]

Representative examples:

[Span link]: [input → output → expected, what went wrong]

Root cause hypothesis: [Category]: [Explanation tied to evidence]

Recommendation: [Specific, actionable next step]

[Repeat for each major issue]

Synthesis

[Required in all modes. Comes after all Deep Dive Findings, before UI Links.]

Issue tally: [N agent issues, N evaluator/dataset issues, N ambiguous]

Overall Performance Assessment

[2–4 sentences on overall quality: what the experiment shows, whether the app/model is production-ready on this task, key numbers.]

Worst-Performing Segments

[Bullet list: which dimension values or conditions most reliably predict failure. Include metric values.]

Root Cause Hypothesis

[The single most likely root cause across all findings. If multiple independent root causes, list them ranked by impact. Each hypothesis must be tied to specific evidence, not to label names or general reasoning.]

Recommended Next Experiments

UI Links

[All generated Datadog UI links with labels]

---

Operating Rules

Do not assume anything about the experiment (model, task, metrics, schema, dimensions). Infer everything by inspecting the data.
Ground all conclusions in specific evidence: event IDs, counts, percentages.
Show math: include counts and rates, not just qualitative claims.
Avoid speculative explanations not supported by observed evidence.
Mask or redact PII in all user-visible output.

Tool Reference

This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.

Experiments

MCP Tool	pup Command
`get_llmobs_experiment_summary(experiment_id)`	`pup llm-obs experiments summary EXPERIMENT_ID`
`list_llmobs_experiment_events(experiment_id, ...)`	`pup llm-obs experiments events list EXPERIMENT_ID [--filter-metric-label L] [--sort-by-metric M] [--sort-direction asc\|desc] [--limit N]` — confirm filter/sort flag names with `pup llm-obs experiments events list --help` before use
`get_llmobs_experiment_event(experiment_id, event_id)`	`pup llm-obs experiments events get EXPERIMENT_ID EVENT_ID`
`get_llmobs_experiment_metric_values(experiment_id, metric_label, ...)`	`pup llm-obs experiments metric-values EXPERIMENT_ID --metric-label L [--segment-by-dimension D] [--segment-dimension-value V]`
`get_llmobs_experiment_dimension_values(experiment_id, dimension_key)`	`pup llm-obs experiments dimension-values EXPERIMENT_ID --dimension-key K`

Notebooks

MCP Tool	pup Command
`create_datadog_notebook(name, cells, ...)`	`pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` — confirm exact flags with `pup notebooks create --help`

The cells file is a JSON array of cell objects:

[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]
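
A cells file in this shape can be produced from a list of report sections. This is a sketch only: `markdown_cell`, `build_cells`, and `cells_json` are hypothetical helper names, each section is assumed to be a `(title, body)` pair rendered under a `##` heading, and the exact flags accepted by `pup notebooks create` should still be confirmed with `--help`.

```python
import json
from typing import Dict, List, Sequence, Tuple


def markdown_cell(text: str) -> Dict:
    """Wrap Markdown text in the cell object shape shown above."""
    return {
        "attributes": {"definition": {"type": "markdown", "text": text}},
        "type": "notebook_cells",
    }


def build_cells(sections: Sequence[Tuple[str, str]]) -> List[Dict]:
    """One cell per report section, never the whole report in a single cell."""
    return [markdown_cell(f"## {title}\n\n{body}") for title, body in sections]


def cells_json(sections: Sequence[Tuple[str, str]]) -> str:
    """Serialize the cells as a JSON array, ready to write to the cells file."""
    return json.dumps(build_cells(sections), ensure_ascii=False)


if __name__ == "__main__":
    print(cells_json([
        ("Summary & Recommendations", "Headline finding goes here."),
        ("Synthesis", "Issue tally and assessment go here."),
    ]))
```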

Never show internal tool calls, schemas, or implementation details to the user.
