llm-obs-experiment-analyzer
Backend
Detection: At the start of every invocation, before taking any action, determine which backend to use:
1. If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__get_llmobs_experiment_summary` appears in your available tools.
3. If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
4. If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
5. If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
6. If neither is available → stop and tell the user:
"Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."
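The detection order can be sketched as a small decision function. This is only an illustration: `choose_backend` and its parameters are hypothetical names, and the sketch assumes the caller has already collected the active tool list and the raw output of `pup --version`.

```python
import json

# Canonical MCP signal named in the detection steps.
MCP_SIGNAL_TOOL = "mcp__datadog-llmo-mcp__get_llmobs_experiment_summary"


def choose_backend(invocation_args, available_tools, pup_version_output=None):
    """Return "pup", "mcp", or None (neither backend is available).

    pup_version_output is the raw stdout of `pup --version`, or None when
    the command could not be run (hypothetical parameter for this sketch).
    """
    # An explicit `--backend pup` wins, even if MCP tools are present.
    tokens = invocation_args.split()
    for i, token in enumerate(tokens[:-1]):
        if token == "--backend" and tokens[i + 1] == "pup":
            return "pup"

    # Otherwise prefer MCP when the canonical tool is in the tool list.
    if MCP_SIGNAL_TOOL in available_tools:
        return "mcp"

    # Fall back to pup if `pup --version` returned JSON with a "version" key.
    if pup_version_output is not None:
        try:
            payload = json.loads(pup_version_output)
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and "version" in payload:
            return "pup"

    # Neither backend: the caller should stop and show the setup message.
    return None
```

A `None` result corresponds to the last step: stop and tell the user how to connect the MCP server or install pup.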
pup invocation rules:
- Invoke via Bash: `pup llm-obs <subcommand> [flags]`
- pup always outputs JSON. Parse it directly, with no content-block unwrapping (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": "<json>"}]`).
- If pup returns an auth error, tell the user to run `pup auth login` and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
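The difference between the two result shapes can be illustrated with a short sketch. `parse_tool_result` is a hypothetical helper, and the assumption that only the first text block carries the JSON payload is mine, not something the skill states.

```python
import json


def parse_tool_result(raw, backend):
    """Decode a tool result into a Python object for either backend."""
    if backend == "pup":
        # pup prints plain JSON on stdout, so it can be parsed directly.
        return json.loads(raw)

    # MCP results may arrive wrapped as [{"type": "text", "text": "<json>"}].
    if isinstance(raw, list) and raw and isinstance(raw[0], dict):
        if raw[0].get("type") == "text":
            return json.loads(raw[0]["text"])

    # Some clients hand back a JSON string, others an already-decoded object.
    if isinstance(raw, str):
        return json.loads(raw)
    return raw
```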
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.
Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-experiment-analyzer[<inv_id>] — ` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-experiment-analyzer:start[<inv_id>] — ` instead (note the `:start` suffix). Example first call:
`skill:llm-obs-experiment-analyzer:start[3a9f1c2b] — Phase 1: get experiment summary to orient analysis`
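As a rough sketch of the ID and prefix rules above (the function names are illustrative, not part of the skill), the invocation ID and the `telemetry.intent` string could be produced like this:

```python
import secrets

SKILL = "llm-obs-experiment-analyzer"


def new_invocation_id():
    """Generate an 8-character hex ID (4 random bytes)."""
    return secrets.token_hex(4)


def intent(inv_id, reason, first_call=False):
    """Build the telemetry.intent value; only the first call gets ':start'."""
    suffix = ":start" if first_call else ""
    return f"skill:{SKILL}{suffix}[{inv_id}] — {reason}"
```

Generate the ID once per invocation and pass the same value to every `intent(...)` call so all tool calls can be correlated.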
Unified Experiment Analyzer
Analyzes one or two LLM experiments. Supports four modes based on inputs:
| Inputs | Mode |
|---|---|
| 2 IDs, no question | Comparative Exploratory |
| 2 IDs + question | Comparative Q&A |
| 1 ID, no question | Single Exploratory |
| 1 ID + question | Single Q&A |
Usage
`/llm-obs-experiment-analyzer <experiment_id_1> [experiment_id_2] [question text] [--output agent|file|notebook]`
Arguments: $ARGUMENTS
Available Tools
| Tool | Purpose |
|---|---|
| `get_llmobs_experiment_summary` | Get total events, error count, metrics stats, available dimensions |
| | Query events with filters, sorting, pagination |
| | Get full event details (input, output, expected_output, metrics) |
| `get_llmobs_experiment_metric_values` | Get metric stats overall and segmented by dimension. Use |
| `get_llmobs_experiment_dimension_values` | List unique values for a dimension with counts |
| `mcp__datadog-mcp-core__create_datadog_notebook` | Export report as a Datadog notebook |
Phase 0 — Mode & Output Resolution
Parse $ARGUMENTS:
- Extract one or two UUID-format strings as experiment IDs (first = baseline/primary, second = candidate).
- Extract the `--output agent|file|notebook` flag if present.
- The remaining text (after IDs and flags) is the question, if any.
Mode determination:
- 2 IDs + question → Comparative Q&A
- 2 IDs, no question → Comparative Exploratory
- 1 ID + question → Single Q&A
- 1 ID, no question → Single Exploratory
Output mode determination:
If `--output` was provided in arguments, use that mode and skip asking.
Otherwise, make two separate, sequential `AskUserQuestion` calls before proceeding, never combined into a single call:
- Analysis type: If no question text was provided in the arguments, ask whether the user wants exploratory analysis or has a specific question. Skip this call only if the user's intent is already clear from context (e.g. they typed a question alongside the IDs).
- Output destination: If `--output` was not specified, ask where to deliver the report (chat, file, or Datadog notebook). Always ask this as its own standalone call.
Output modes:
- Agent (default): Display the full report in the conversation.
- File: Before starting, propose a path: `evals/reports/YYYY-MM-DD-<experiment-slug>-analysis.md`. Present it to the user and let them confirm or adjust. Then proceed.
- Notebook: Use `mcp__datadog-mcp-core__create_datadog_notebook` at the end. In pup mode, use `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` instead (see Tool Reference). If neither MCP nor pup is available, output these setup instructions instead of failing:
    To enable Datadog notebook export, add the MCP server:
    claude mcp add --transport http datadog-mcp https://mcp.datadoghq.com/api/unstable/mcp-server
    See: https://docs.datadoghq.com/bits_ai/mcp_server/setup/
  Then ask: "Would you like to fall back to file or agent output instead?" See Phase 5 for full notebook call details.
After resolving mode and output, proceed to Phase 1. There will be one additional `AskUserQuestion` interaction at Phase 1.5 before the deep analysis begins.
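The parsing and mode rules above can be sketched roughly as follows. The regular expressions and the `parse_arguments` helper are assumptions made for illustration; in particular, the sketch accepts both `--output file` and `--output=file`, and it does not strip other flags such as `--backend pup` from the question text.

```python
import re

# Assumed UUID shape: 8-4-4-4-12 hex digits.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
OUTPUT_RE = re.compile(r"--output[ =](agent|file|notebook)\b")


def parse_arguments(arguments):
    """Split the raw argument string into IDs, output mode, question, mode."""
    ids = UUID_RE.findall(arguments)[:2]
    if not ids:
        raise ValueError("at least one experiment ID is required")

    match = OUTPUT_RE.search(arguments)
    output = match.group(1) if match else None

    # Whatever remains after removing the flag and the IDs is the question.
    rest = OUTPUT_RE.sub(" ", arguments)
    for exp_id in ids:
        rest = rest.replace(exp_id, " ", 1)
    question = " ".join(rest.split()) or None

    scope = "Comparative" if len(ids) == 2 else "Single"
    kind = "Q&A" if question else "Exploratory"
    return {"ids": ids, "output": output, "question": question,
            "mode": f"{scope} {kind}"}
```

When `output` comes back as `None`, the output-destination question still has to be asked as its own call.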
Phase 1 — Orient
Comparative: Call `get_llmobs_experiment_summary` for both experiments. Produce a side-by-side comparison:
- Scale: total samples and error count for each
- Metrics: which metrics exist in each; which are shared
- Dimensions: which dimensions exist in each; which are shared
- Immediate red flags (errors present, missing metrics, sparse data)
- Obvious improvements or regressions visible at the summary level
When `error_count > 0`, call `get_llmobs_experiment_dimension_values` for `error_type` and report the breakdown by exception class (e.g. "2 errors: `asyncio.exceptions.cancellederror`"). Errors mean the executor threw an unhandled exception, so no eval scores were produced for those samples. Do not report a percentage; report the count and type(s).
Single: Call `get_llmobs_experiment_summary` for the experiment. Determine:
- Total samples, and error count (with `error_type` breakdown if non-zero)
- Available metrics grouped by `metric_type` as returned by the summary (`score`, `boolean`, `categorical`). Do not infer semantic groupings or categories from label name patterns or prefixes; the label string is not a reliable signal for what a metric measures.
- Classify each metric using the statistics already returned by the summary (mean, min, max). Do not infer metric meaning from label names or prefixes. Use the classifications defined in Phase 1.5 when referencing metrics throughout the report.
- Available dimensions for segmentation
- Any immediate red flags
Phase 1.5 — Metrics Selection
After completing Phase 1, run the following three steps before any `AskUserQuestion`.
Step 1: Classify every metric using summary statistics only (no additional tool calls):
| Class | Condition | Meaning |
|---|---|---|
| `always_zero` | | Feature disabled or not implemented: no signal |
| `perfect` | | Always passes: no diagnostic signal |
| `saturated` | | Rarely fails: low diagnostic value |
| `struggling` | | Meaningful failure rate: highest diagnostic value |
| `interesting` | | Partial failures: moderate diagnostic value |
Step 2: Print the full metric table to chat before asking any question. This gives the user complete visibility, never truncated by option limits. Format:
Found N metrics. Full breakdown:
| Metric | Mean | Class |
|--------|------|-------|
| <label> | <mean> | ⚠️ Struggling |
| <label> | <mean> | Interesting |
| <label> | <mean> | Saturated |
| <label> | 1.000 | Perfect (no signal) |
| <label> | 0.000 | Always zero (disabled?) |
Flag any `always_zero` metrics with a note, e.g. "N metrics always score 0 and appear to be disabled features; they will be excluded from suggested groupings."
Step 3: `AskUserQuestion` with options built entirely from the computed classes:
Generate options dynamically based on what is actually present in the data. Do not invent option names from label prefixes.
- "Struggling metrics (N) — Recommended": only shown if N ≥ 1. Description explicitly lists each metric label and its mean (e.g. "`open_answer` 0.33, `c_permanence` 0.68"). This is the grounded suggestion, based on observed pass rates rather than label names. If there are no struggling metrics, replace this option with "Lowest-performing metrics (N)" covering the bottom N by mean.
- "Interesting + struggling (N)": shown only if there are interesting-class metrics in addition to struggling ones. Description lists them with means.
- "All metrics (N)": always shown. Note in the description that always-zero and perfect metrics add noise but are included.
- "A specific metric": always shown. Description says: "Choose one from the table printed above."
If the user selects "A specific metric", ask a second `AskUserQuestion` that shows the 4 metrics with the lowest mean as labeled options (label = metric name, description = `mean: X.XX — class`). In the question text, explicitly say: "Or type any metric name from the table above into 'Other'." The `always_zero` and `perfect` metrics must not appear in the 4 options (they have no diagnostic value); restrict the 4 to `struggling` and `interesting` classes only. After the user picks one, restrict all analysis in Phases 2–4 to that single metric only.
Scope enforcement:
- If the user accepts "all", proceed with all metrics (including constant ones, but note their low signal value).
- If the user selects a grouping or a specific metric, restrict all analysis in Phases 2–4 strictly to that selection. Do not call `get_llmobs_experiment_metric_values` for any metric outside the selection.
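The classification and the "A specific metric" option list might look roughly like the sketch below. The numeric cutoffs are purely hypothetical: the conditions column of the class table is not available here, so `struggling_below=0.70` and `saturated_from=0.95` are placeholder assumptions chosen only so that the example means 0.33 and 0.68 land in `struggling`.

```python
def classify_metric(mean, struggling_below=0.70, saturated_from=0.95):
    """Assign a class from the mean alone (thresholds are assumptions)."""
    if mean == 0.0:
        return "always_zero"
    if mean == 1.0:
        return "perfect"
    if mean >= saturated_from:
        return "saturated"
    if mean < struggling_below:
        return "struggling"
    return "interesting"


def specific_metric_options(means, limit=4):
    """Lowest-mean metrics, restricted to struggling/interesting classes."""
    rows = [(label, mean, classify_metric(mean)) for label, mean in means.items()]
    rows = [row for row in rows if row[2] in ("struggling", "interesting")]
    rows.sort(key=lambda row: row[1])  # lowest mean first
    return [
        {"label": label, "description": f"mean: {mean:.2f} — {cls}"}
        for label, mean, cls in rows[:limit]
    ]
```

Because the classes depend only on summary statistics, no additional tool calls are needed to build either the printed table or the option list.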
Phase 2 — Signal Discovery + UI Links
Comparative: Using only the metrics selected in Phase 1.5 (intersected with shared metrics) and shared dimensions, identify:
- Segments where the candidate outperforms the baseline
- Segments where the candidate regresses
- Error types present in one but rare in the other
- Distribution shifts or coverage gaps
- Tradeoffs (e.g., higher recall, lower precision)
Generate Datadog comparison UI links:
- Base URL: `https://app.datadoghq.com/llm/experiment-comparison`
- Required params: `baselineExperimentId`, `experimentIds` (candidate%2Cbaseline), `tableView=all`
- Optional (include if discoverable): `project`, `compareDatasetId`, `selectedEvaluation`
- `selectedEvaluation` priority: overall/overall_score/rubric metric → primary metric → first shared metric
- Generate 2–4 links: primary comparison, regression view, calibration view (if applicable), worst-segment view (only if supported; never fabricate filters)
Single: Measure per-metric performance across all dimensions for only the metrics selected in Phase 1.5. Identify:
- Worst-performing segments (by metric × dimension)
- Any segments with surprising pass rates
- Overall pass rates and variance
Generate Datadog experiment UI link: `https://app.datadoghq.com/llm/experiments/{experiment_id}`
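A minimal sketch of building the comparison link from the required parameters listed above. The helper names are illustrative, and matching `overall`, `overall_score`, and `rubric` by exact name is an assumption about how the priority rule is applied.

```python
from urllib.parse import urlencode

COMPARISON_BASE = "https://app.datadoghq.com/llm/experiment-comparison"


def pick_selected_evaluation(shared_metrics, primary=None):
    """Apply the priority: overall-style metric, then primary, then first shared."""
    for name in ("overall", "overall_score", "rubric"):
        if name in shared_metrics:
            return name
    if primary in shared_metrics:
        return primary
    return shared_metrics[0] if shared_metrics else None


def comparison_link(baseline_id, candidate_id, selected_evaluation=None):
    """Build the comparison URL; urlencode turns the comma into %2C."""
    params = {
        "baselineExperimentId": baseline_id,
        "experimentIds": f"{candidate_id},{baseline_id}",  # candidate first
        "tableView": "all",
    }
    if selected_evaluation:
        params["selectedEvaluation"] = selected_evaluation
    return f"{COMPARISON_BASE}?{urlencode(params)}"
```

Only parameters known from the data are added; nothing here invents filter parameters.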
Phase 3 — Deep Dives
Run all necessary deep dives automatically. Do not ask for approval or pause. Scope all deep dives strictly to the metrics selected in Phase 1.5; do not call `get_llmobs_experiment_metric_values` for any metric outside the selection.
Q&A modes: Focus deep dives on what is needed to answer the question directly. Pull specific samples, segment by relevant dimensions, inspect examples.
Exploratory modes: Investigate the most interesting signals broadly:
- Per-segment and per-class delta analysis (comparative) or pass-rate analysis (single)
- Error overlap vs. unique failure mode analysis
- Sampling and qualitative inspection of representative failures (2–5 per issue)
- Clustered error theme analysis
Rules:
- Prefer cheap, high-signal analyses first; do not stop early.
- Mask or redact PII in all outputs.
- Avoid destructive actions.
For each sampled event, generate a direct span link:
`https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&sp=[{"p":{"experimentId":"{experiment_id}","spanId":"{span_id}"},"i":"experiment-details"}]&spanId={span_id}`
For each Deep Dive segment, generate a direct link to view those samples in the (candidate) experiment:
`https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&filter[{dimension}]={value}`
If you are not confident the filter URL format works for this dimension, omit the filter params and link to the experiment root instead. Never fabricate filter URLs.
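The span link template can be filled in with a small helper such as the hypothetical `span_link` below. The sketch mirrors the template literally and leaves the `sp` JSON unencoded; whether the UI also needs it percent-encoded is not something the template states.

```python
import json


def span_link(experiment_id, span_id):
    """Fill the span link template for one sampled event."""
    sp = json.dumps(
        [{"p": {"experimentId": experiment_id, "spanId": span_id},
          "i": "experiment-details"}],
        separators=(",", ":"),  # compact form, matching the template
    )
    return (
        f"https://app.datadoghq.com/llm/experiments/{experiment_id}"
        f"?selectedTab=overview&sp={sp}&spanId={span_id}"
    )
```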
Phase 4 — Synthesis
Comparative Exploratory:
- Clear wins where the candidate improves on the baseline
- Clear regressions or risks the candidate introduces
- Neutral or unchanged areas
- Root-cause hypotheses (1–4), tied to evidence
- Prioritized recommendations: ship as-is / block / gate by segment / combine behaviors
Comparative Q&A:
- Direct answer to the question with a clear verdict
- Supporting evidence (metrics, percentages, event examples)
- Relevant context (e.g., caveats, data limitations)
Single Exploratory:
- Overall performance assessment
- Worst-performing segments and root causes
- Hypotheses for why failures occur
- Recommended next experiments
Single Q&A:
- Direct answer to the question with a clear verdict
- Supporting evidence from the experiment data
All modes: open with a one-line issue type tally — e.g. "3 agent issues, 1 evaluator/dataset issue, 1 ambiguous" — before the detailed findings. Use quantified deltas/rates wherever possible. Redact PII.
Always produce both `## Summary & Recommendations` and `## Synthesis` sections regardless of experiment complexity, how many metrics exist, or how quickly the answer is apparent. Do not skip Summary because the findings are simple or obvious. Do not skip Synthesis because you've already covered the findings in Deep Dives. These two sections are the most portable output of the analysis: they are what a reader encounters first and last.
Phase 5 — Output Delivery
Agent: Present the full report in the conversation using the report format below.
File: Write the report to the pre-confirmed path. Confirm with: "Report saved to `<path>`."
Notebook: Call `mcp__datadog-mcp-core__create_datadog_notebook` with the following parameters:
- `name` (by mode):
| Mode | Name |
|---|---|
| Comparative Exploratory | `Experiment Analysis: {baseline_short} (Baseline) vs {candidate_short} (Candidate) — YYYY-MM-DD` |
| Comparative Q&A | `Experiment Q&A: {baseline_short} vs {candidate_short} — YYYY-MM-DD` |
| Single Exploratory | `Experiment Analysis: {experiment_short} — YYYY-MM-DD` |
| Single Q&A | `Experiment Q&A: {experiment_short} — YYYY-MM-DD` |
  where `short` = first 8 characters of the UUID.
- `cells`: one cell per report section; do NOT put the entire report in a single cell. Structure:
  - Cell 1: Summary & Recommendations containing three `###` subheaders: Experiment (link + executive summary), Key Findings (bullets), Recommendations (numbered list). Always present, always first, never skipped regardless of experiment complexity.
  - Cell 2: Orientation table
  - Cell 3: What Changed (comparative modes only; omit for single)
  - Cell 4: Signals / Answer to Question
  - Cells 5…N: one cell per Deep Dive Finding
  - Cell N+1: Synthesis (issue tally, Overall Performance Assessment, Worst-Performing Segments, Root Cause Hypothesis, Recommended Next Experiments). Always present, always second-to-last.
  - Cell N+2: UI Links
  Omit the `# Experiment Analysis Report` top-level heading from all cells; it is already shown as the notebook title.
- `time`: `{ "live_span": "1h" }`
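The naming scheme for the `name` parameter can be sketched as a lookup over the four modes. `notebook_name` and the `today` parameter are illustrative; the templates themselves are the ones given for each mode above.

```python
from datetime import date

NAME_TEMPLATES = {
    "Comparative Exploratory":
        "Experiment Analysis: {baseline_short} (Baseline) vs "
        "{candidate_short} (Candidate) — {day}",
    "Comparative Q&A":
        "Experiment Q&A: {baseline_short} vs {candidate_short} — {day}",
    "Single Exploratory": "Experiment Analysis: {experiment_short} — {day}",
    "Single Q&A": "Experiment Q&A: {experiment_short} — {day}",
}


def notebook_name(mode, experiment_ids, today=None):
    """Build the notebook title; short IDs are the first 8 UUID characters."""
    day = (today or date.today()).isoformat()  # YYYY-MM-DD
    shorts = [exp_id[:8] for exp_id in experiment_ids]
    template = NAME_TEMPLATES[mode]
    if mode.startswith("Comparative"):
        # First ID is the baseline, second is the candidate.
        return template.format(
            baseline_short=shorts[0], candidate_short=shorts[1], day=day
        )
    return template.format(experiment_short=shorts[0], day=day)
```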
After the notebook is created, output the URL in chat:
"Report exported to notebook: <url>"
If the tool is unavailable, follow the fallback instructions in Phase 0.
Phase 6 — Conversational Follow-up
After delivering the report, append a follow-up section:
---
Want to explore further?
Here are a few directions based on the findings:
- [Specific question derived from actual findings, e.g., "Want me to dig deeper into why the SQL scenarios regressed in the candidate?"]
- [Another specific follow-up, e.g., "Should I compare error patterns between the two failing clusters?"]
- [A third option if relevant]
Do you have any other questions about this analysis?
Stay active after the report. Answer follow-up questions using the same MCP tools, referencing findings already gathered. Do not re-run analyses you've already performed unless new questions require it.
---
Report Format
Link rules:
- Experiment IDs: Wherever a full experiment UUID appears, render it as a Markdown link to `https://app.datadoghq.com/llm/experiments/{full_uuid}`.
- Comparative table column headers: In the Orientation table and in every subsequent table that has Baseline/Candidate columns, wrap the entire column header as a link, not just the short ID. Format: ``[Baseline `{short_id}`]({baseline_url})`` and ``[Candidate `{short_id}`]({candidate_url})``. This makes the full header cell clickable, not just the ID portion.
Experiment Analysis Report
Question: {original question text} (Q&A modes only — omit for Exploratory modes)
Summary & Recommendations
Experiment
[Comparative: `{baseline_short}` (Baseline) vs `{candidate_short}` (Candidate) — Compare — Single: `{experiment_short}`]
[2–3 sentence executive summary. Open with "This is a {Mode} analysis..." where {Mode} is one of: Comparative Exploratory, Comparative Q&A, Single Exploratory, Single Q&A. Include experiment(s) purpose, scale, and the headline finding with specific numbers.]
[If the report uses opaque dimension values (e.g. category labels like b1/b2/b3/bx), add a `#### Dataset Categories` sub-subsection here: one bullet per value with name bolded and a brief description. Omit if all dimension values are self-explanatory.]
Key Findings
- {Finding 1}: one-line description with numbers (e.g. "+4.2pp on `tool_accuracy` across all segments")
- {Finding 2}: one-line description
- {Finding 3} (if present): one-line description
[For Q&A modes: one-line verdict bullet + one-line rationale bullet]
Recommendations
- {Recommendation 1}: specific, actionable next step tied to a finding
- {Recommendation 2}: specific, actionable next step
- {Recommendation 3} (if present): specific, actionable next step
[Omit this subsection for Q&A modes unless a clear action follows from the answer.]
Orientation
[Side-by-side table for comparative; summary table for single. Include: samples, errors (count + `error_type` breakdown if non-zero, otherwise "none"), metrics, dimensions. Experiment IDs in column headers must be Markdown links.]
What Changed
[Comparative modes only. Table of differences between baseline and candidate: model, toolset/skill profile,
dataset, evaluator schema, and any other metadata differences detectable from the summary data.
If no differences are detectable, write: "No configuration differences detected between experiments."]
[Signals | Answer to Question]
[For exploratory: ranked table of signals/segments with metric deltas and impact counts.]
[For Q&A: direct answer with verdict, then supporting evidence.]
Deep Dive Findings
[Issue/Finding Title]
Segment: `[dimension=value]` | Impact: N samples | Severity: metric pass rate = X% | View samples
Issue type: `Agent` — the evaluator is sound; the agent output is the problem. | `Evaluator/Dataset` — the agent output may be correct; the rubric, ground truth labels, or scoring logic is suspect. | `Ambiguous` — cannot determine from available evidence whether the agent or evaluator is at fault; flag for manual inspection.
What's happening: [1–2 sentences: key observation and metric impact only]
Representative examples:
- [Span link]: [input → output → expected, what went wrong]
Root cause hypothesis: [Category]: [Explanation tied to evidence]
Recommendation: [Specific, actionable next step]
[Repeat for each major issue]
Synthesis
[Required in all modes. Comes after all Deep Dive Findings, before UI Links.]
Issue tally: [N agent issues, N evaluator/dataset issues, N ambiguous]
Overall Performance Assessment
[2–4 sentences on overall quality: what the experiment shows, whether the app/model is production-ready on this task, key numbers.]
Worst-Performing Segments
[Bullet list: which dimension values or conditions most reliably predict failure. Include metric values.]
Root Cause Hypothesis
[The single most likely root cause across all findings. If multiple independent root causes, list them ranked by impact. Each hypothesis must be tied to specific evidence, not to label names or general reasoning.]
Recommended Next Experiments
[2–4 concrete, specific follow-up experiments. Each should be actionable: e.g. "Re-run with `max_turns=40` to test whether turn exhaustion is the primary driver, not model quality" not "Investigate turn limits further."]
UI Links
[All generated Datadog UI links with labels]
---
Operating Rules
- Do not assume anything about the experiment (model, task, metrics, schema, dimensions). Infer everything by inspecting the data.
- Ground all conclusions in specific evidence: event IDs, counts, percentages.
- Show math: include counts and rates, not just qualitative claims.
- Avoid speculative explanations not supported by observed evidence.
- Mask or redact PII in all user-visible output.
Tool Reference
This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.
Experiments
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
| |
Notebooks
| MCP Tool | pup Command |
|---|---|
| `mcp__datadog-mcp-core__create_datadog_notebook` | `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` |
The cells file is a JSON array of cell objects:
`[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]`
- Never show internal tool calls, schemas, or implementation details to the user.
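For the cells file format shown in the Tool Reference, a rough sketch of producing the JSON array from report sections follows. `build_cells` and `dump_cells` are hypothetical helpers, and rendering each section title as a `##` heading is an assumption based on the single example cell.

```python
import json


def build_cells(sections):
    """Turn (title, body) pairs into one markdown cell object per section."""
    return [
        {
            "attributes": {
                "definition": {
                    "type": "markdown",
                    "text": f"## {title}\n\n{body}",
                }
            },
            "type": "notebook_cells",
        }
        for title, body in sections
    ]


def dump_cells(cells):
    """Serialize the cell list; write this string to the --file path."""
    return json.dumps(cells, ensure_ascii=False)
```

The resulting string would be written to the path passed to `pup notebooks create --file`, one cell per report section rather than a single cell holding the whole report.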