llm-obs-trace-rca


Backend


Detection: At the start of every invocation, before taking any action, determine which backend to use:

1. If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__list_llmobs_evals` appears in your available tools.
3. If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
4. If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
   - If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
   - If neither is available → stop and tell the user: "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."

`--backend pup` is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.
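
The detection order can be sketched as a small pure function. This is only an illustration under stated assumptions: `select_backend`, its parameters, and the returned labels are hypothetical names, and `pup_responds` stands in for the outcome of running `pup --version`. Only the `--backend pup` flag and the canonical MCP tool name come from the text.

```python
from typing import Iterable, List, Tuple

# Canonical MCP signal named in the detection rules.
MCP_SIGNAL_TOOL = "mcp__datadog-llmo-mcp__list_llmobs_evals"


def select_backend(args: List[str], available_tools: Iterable[str],
                   pup_responds: bool) -> Tuple[str, List[str]]:
    """Return (backend, remaining_args) following the detection order.

    Hypothetical helper: `pup_responds` represents the result of `pup --version`.
    """
    remaining: List[str] = []
    forced_pup = False
    i = 0
    while i < len(args):
        # Strip "--backend pup" (two tokens) wherever it appears.
        if args[i] == "--backend" and i + 1 < len(args) and args[i + 1] == "pup":
            forced_pup = True
            i += 2
            continue
        remaining.append(args[i])
        i += 1

    if forced_pup:
        return "pup", remaining
    if MCP_SIGNAL_TOOL in set(available_tools):
        return "mcp", remaining
    if pup_responds:
        return "pup", remaining
    return "none", remaining  # caller stops and shows the setup message
```

The explicit flag is checked first so that it wins even when MCP tools are present, which mirrors the "regardless of whether MCP tools are present" rule.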

pup invocation rules:

- Invoke via Bash: `pup llm-obs <subcommand> [flags]`
- pup always outputs JSON. Parse it directly, with no content-block unwrapping (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": "<json>"}]`).
- If pup returns an auth error, tell the user to run `pup auth login` and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
- Time flags: pup accepts bare duration strings (`1h`, `7d`, `30m`) and RFC3339 timestamps. Do not use `now-`-prefixed strings; strip the prefix when converting from a skill `--timeframe` argument: `now-7d` → `7d`, `now-24h` → `24h`, `now-30d` → `30d`.
- `--summary` on `pup llm-obs spans search` strips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.
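
Two of these rules are mechanical enough to sketch in code: converting a `now-` timeframe into a bare duration, and parsing output that may or may not be wrapped in MCP content blocks. The helper names below are hypothetical, and the unwrapping logic assumes the JSON payload sits in the first text block.

```python
import json
from typing import Any


def to_pup_duration(timeframe: str) -> str:
    """Strip a leading 'now-' so 'now-7d' becomes '7d'; other values pass through."""
    prefix = "now-"
    return timeframe[len(prefix):] if timeframe.startswith(prefix) else timeframe


def parse_tool_output(raw: str) -> Any:
    """Parse JSON from either backend.

    pup output is plain JSON. An MCP result may instead be a list of
    content blocks shaped like [{"type": "text", "text": "<json>"}].
    """
    data = json.loads(raw)
    looks_wrapped = (
        isinstance(data, list)
        and len(data) > 0
        and all(
            isinstance(b, dict)
            and b.get("type") == "text"
            and isinstance(b.get("text"), str)
            for b in data
        )
    )
    if looks_wrapped:
        # Assumption: the first text block carries the JSON payload.
        return json.loads(data[0]["text"])
    return data
```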

Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.

Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-trace-rca[<inv_id>] —` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-trace-rca:start[<inv_id>] —` instead (note the `:start` suffix). Example first call:

`skill:llm-obs-trace-rca:start[3a9f1c2b] — Phase 0: discover configured evals for task-cruncher to infer analysis mode`
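
As a rough sketch of the ID and intent format, the following helpers generate an 8-character hex ID and build the prefix string. The function names are hypothetical; only the prefix layout and the `:start` suffix are taken from the text.

```python
import secrets


def new_invocation_id() -> str:
    """Return 8 lowercase hex characters (4 random bytes)."""
    return secrets.token_hex(4)


def intent(inv_id: str, description: str, first_call: bool = False) -> str:
    """Build the telemetry.intent value; only the first call carries ':start'."""
    suffix = ":start" if first_call else ""
    return f"skill:llm-obs-trace-rca{suffix}[{inv_id}] — {description}"
```

Generate the ID once and pass the same value to every `intent(...)` call so it stays constant for the whole invocation.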


LLM Obs Trace RCA — Root Cause Analysis from Production LLM Traces


Diagnose why an LLM application is failing by searching production traces and walking the span tree from symptom to root cause. The skill automatically selects the best analysis mode based on available signals:

| Mode | Signal | When auto-selected |
|---|---|---|
| Eval Signal | LLM judge verdicts and reasoning (pass/fail rates, scoring) | Evaluators are configured for the app |
| Error Signal | Runtime errors (`@status:error`, error types, stack traces) | No evals configured, or user explicitly asks about errors/crashes |
| Generic | Structural anomalies (latency, agent loops, retrieval misses) | Explicit `mode=generic` override, or no strong signal found in Phase 1 |

The mode is announced (never asked) in the first user-facing output with a one-line override hint.


Methodology


Resolve → Search → Observe → Open Coding → Axial Coding → Root Cause Navigation → Recommendations


Usage


What's wrong with <ml_app> over the last <timeframe>
Why is <ml_app> failing today
Analyze eval failures for <eval_name> on <ml_app>
Look at the errors on <ml_app> over the last <timeframe>
Root-cause low scores on <eval_name>


Inputs


| Input | Required | Default | Description |
|---|---|---|---|
| `ml_app` | Yes (or `eval_name`) | none | The application to analyze. |
| `eval_name` | No | none | One or more evaluators to focus on. Implies Eval Signal mode. Pass a list for multi-eval analysis. |
| `timeframe` | No | `now-24h` | How far back to look. |
| `mode` | No | inferred | Explicit mode override: `eval`, `errors`, `generic`. Skips inference entirely. |
| `failure_filter` | No | none | Narrowing scope: `"errors"` (routes to Error Signal path), `"high latency"` (post-fetch duration sort), `"low scores on <eval>"` (promotes to `eval_name`), a tool name or span name (`@name:<x>` query). |

If neither `ml_app` nor `eval_name` is provided, ask the user.


Available Tools


Eval discovery & overview


| Tool | Purpose |
|---|---|
| `list_llmobs_evals` | Discover all configured evals for an `ml_app`. Used in Phase 0 mode inference. |
| `get_llmobs_eval_aggregate_stats` | Pass/fail rate or score distribution for an eval over a time window. |
| `get_llmobs_evaluator` | Full evaluator config: prompt template, assessment criteria, span filter, sampling, provider. Use instead of the deprecated `get_llmobs_eval_config`. |


Trace & span exploration


| Tool | Purpose |
|---|---|
| `search_llmobs_spans` | Find spans by tags, span kind, status, query syntax. Paginate with cursor. Entry point for Phase 1. |
| `get_llmobs_span_details` | Metadata, evaluations (scores, labels, reasoning), `status`, error fields, duration, and `content_info` map showing available fields + sizes. |
| `get_llmobs_span_content` | Actual content for a span field. Supports JSONPath via `path` param for targeted extraction. |
| `get_llmobs_trace` | Full trace hierarchy as span tree with span counts by kind. |
| `find_llmobs_error_spans` | All error spans in a trace with error type, message, stack, and propagation context. |
| `expand_llmobs_spans` | Load children of collapsed trace nodes. |
| `get_llmobs_agent_loop` | Chronological agent execution timeline (LLM calls, tool invocations, decisions). May return empty; see Phase 4b fallback. |


Key `get_llmobs_span_content` patterns

| Field | Path | What you get |
|---|---|---|
| `messages` | `$.messages[0]` | System prompt (first message, usually `system` role) |
| `messages` | `$.messages[-1]` | Last assistant response |
| `messages` | (no path) | Full conversation including tool calls |
| `input` / `output` | n/a | Span I/O |
| `documents` | n/a | Retrieved documents (RAG apps) |
| `metadata` | n/a | Custom metadata (prompt versions, feature flags, user segments) |


How to use `search_llmobs_spans`

Always include `@ml_app:"<ml_app>"` in the `query` string. The structured `ml_app` parameter is unreliable and can return spans from other apps, so do not rely on it alone.

Useful query fragments — combine with space (AND):

| Goal | Query |
|---|---|
| Errors only | `@status:error` |
| Eval is present on the span | `@evaluations.custom.<eval_name>:*` (presence only; pass/fail is read from `get_llmobs_span_details`, not the query) |
| A specific tool by name | `@name:<tool_name>` |

Dedicated params (`span_kind`, `root_spans_only`, `ml_app`) work alongside `query`, but `query` takes precedence over `tags`.
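
A small query builder makes it harder to forget the `@ml_app` fragment. This is a sketch with hypothetical names (`build_span_query` and its parameters); the fragments themselves are the ones listed in the table above, joined with spaces so they are ANDed.

```python
from typing import Optional


def build_span_query(ml_app: str, errors_only: bool = False,
                     eval_name: Optional[str] = None,
                     span_name: Optional[str] = None) -> str:
    """Compose a query string; fragments joined by spaces are ANDed."""
    # Always scope to the app inside the query string itself.
    parts = [f'@ml_app:"{ml_app}"']
    if errors_only:
        parts.append("@status:error")
    if eval_name:
        parts.append(f"@evaluations.custom.{eval_name}:*")
    if span_name:
        parts.append(f"@name:{span_name}")
    return " ".join(parts)
```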


Parallelization rules


- `get_llmobs_span_details`: Group span_ids by trace_id, chunk each trace's span_ids into batches of at most 20. Issue ALL chunks for a page in a single message.
- `get_llmobs_span_content`: Each call is independent; always issue ALL in a single message.
- `get_llmobs_trace` / `find_llmobs_error_spans` / `get_llmobs_agent_loop`: Parallelize across different traces in a single message.
- Pipeline parallelism: Start `get_llmobs_span_details` for page 1 results immediately; don't wait to collect all pages.
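
The batching rule for `get_llmobs_span_details` (group by trace, then chunk to at most 20) might look like the sketch below. `plan_detail_batches` is a hypothetical helper, and it assumes the search results have already been reduced to `(trace_id, span_id)` pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_BATCH = 20  # batch size limit stated in the parallelization rules


def plan_detail_batches(
    spans: Iterable[Tuple[str, str]],
) -> List[Tuple[str, List[str]]]:
    """Turn (trace_id, span_id) pairs into (trace_id, [span_ids]) batches of <= 20."""
    by_trace: Dict[str, List[str]] = defaultdict(list)
    for trace_id, span_id in spans:
        by_trace[trace_id].append(span_id)

    batches: List[Tuple[str, List[str]]] = []
    for trace_id, span_ids in by_trace.items():
        for start in range(0, len(span_ids), MAX_BATCH):
            batches.append((trace_id, span_ids[start:start + MAX_BATCH]))
    return batches
```

Each returned batch corresponds to one tool call; all batches for a page would be issued together in a single message.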


Analysis Workflow


Output discipline: Phases 0–5 are internal analysis. The only user-facing outputs during these phases are the Phase 1 Signal Summary and the mandatory checkpoints at Phases 2 and 3. Do NOT narrate reasoning, summarize intermediate findings, or output Phase 4 deep-dive results as prose. All detailed findings go exclusively into the Phase 6 report.


Phase 0: Resolve Inputs & Infer Mode


First: check for classification context. Scan the conversation for a `# Session Classification Summary` header. If found → enter Step 0S below and skip all remaining Phase 0 steps and Phase 1 entirely.


Step 0S — Extract Failure Bucket from Classification Output


The canonical handoff format is the Per-Unit Details table inside the `# Session Classification Summary` section. Extract one row per unit:

| Field | Source |
|---|---|
| `trace_id` | Link URL in the ID column: parse the `trace_id=` or `session_id=` query parameter from the link href |
| `verdict` | Verdict column |
| `failure_mode` | Failure Mode column (`none` for passing rows) |
| `detail` | Reason column; use as the Phase 2 reasoning input (same role as eval judge reasoning or error messages) |
| `app_type` | From the `# Session Classification Summary` header line (e.g. `Root span kind: agent`); defaults to `LLM` if absent |

Failure bucket = all rows where verdict is `no` or `partial`.

- Fewer than 5 entries → note low confidence, proceed anyway.
- Empty → report "No failures found in the classification output" and stop.
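
The ID extraction and bucket filter can be sketched with the standard library. Everything here is illustrative: the helper names and the row dictionary shape are assumptions, and matching verdicts case-insensitively is a choice made for the sketch rather than something the handoff format specifies.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse


def id_from_href(href: str) -> Optional[str]:
    """Return the trace_id (or, failing that, session_id) query parameter."""
    params = parse_qs(urlparse(href).query)
    for key in ("trace_id", "session_id"):
        if params.get(key):
            return params[key][0]
    return None


def failure_bucket(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows whose verdict is 'no' or 'partial' (case-insensitive by assumption)."""
    return [
        r for r in rows
        if r.get("verdict", "").strip().lower() in {"no", "partial"}
    ]
```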

Present this overview before proceeding:


Classification Overview (from llm-obs-session-classify)


ml_app: <from summary header> | Classified: N | Failures (no+partial): F | Pass rate: X%

Failure Mode	Count
...

Proceeding to Phase 2 using F failure traces. Mode inference bypassed — classification verdict is the signal.


Then **skip Phase 1 and jump directly to Phase 2**. Carry forward:
- Phase 2 reasoning input: `(trace_id, span_id, detail)` tuples — same structure as eval reasoning or error messages
- Phase 4 navigation: use `app_type` from each trace block to choose the span navigation strategy
- Phases 2–7: run completely unchanged — the failure bucket structure is identical regardless of source

---

**Standard resolution (no classification context):**

1. If neither `ml_app` nor `eval_name` provided → ask the user. If `eval_name` is provided but `ml_app` is not → also ask for `ml_app` (eval names are not globally unique; without it, span searches return results from all apps sharing the eval name).
2. If `timeframe` not provided → default to `now-24h`.
3. **Resolve `failure_filter`** (before mode inference):
   - `"errors"` → force **Error Signal** mode
   - `"low scores on <eval>"` → treat as `eval_name=<eval>`, then continue inference
   - `"high latency"` → note for Phase 1 (sort by duration post-fetch); continue inference
   - Tool/span name → note as `@name:<x>` query fragment for Phase 1; continue inference
4. **Resolve mode** (skip if `mode` was explicitly provided):
   - `eval_name` given → **Eval Signal**
   - User explicitly mentioned errors/exceptions/crashes → **Error Signal**
   - Otherwise → call `list_llmobs_evals` for the `ml_app`:
     - Evals returned → **Eval Signal**
     - No evals → **Error Signal** (announce auto-selection in Phase 1)
5. When `eval_name` is multi-valued, note for Phase 1: run parallel per-eval searches and merge+dedup by `(trace_id, span_id)`.
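
The resolution order in steps 3 and 4 can be condensed into one function. This is a sketch under assumptions: `resolve_mode` and its parameters are hypothetical, `configured_evals` stands in for the result of the eval discovery call, and checking the explicit `mode` before `failure_filter` is one reading of "skips inference entirely". Promoting `"low scores on <eval>"` to `eval_name` is left to the caller.

```python
from typing import List, Optional


def resolve_mode(mode: Optional[str], eval_names: List[str],
                 failure_filter: Optional[str], user_mentions_errors: bool,
                 configured_evals: List[str]) -> str:
    """Return the analysis mode label: 'eval', 'errors', or an explicit override."""
    if mode:  # explicit override skips inference entirely
        return mode
    if failure_filter == "errors":
        return "errors"
    if eval_names:
        return "eval"
    if user_mentions_errors:
        return "errors"
    # No explicit hints: fall back to whether any evals are configured.
    return "eval" if configured_evals else "errors"
```

Note that inference never yields `generic` on its own; that mode is reached through an explicit override or the Phase 1 auto-pivot.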

---


Phase 1: Find Problematic Spans


Three mode-specific paths. All end with a Signal Summary that labels the mode and includes a one-line override hint.

Mode switch handling: At any checkpoint, if the user says "switch to [error|eval|generic] mode", re-enter Phase 1 with the new mode. Phase 0 inputs do not re-resolve.

Auto-pivot: If the selected mode finds no data (0 evals configured, 0 error spans in timeframe), announce the pivot to Generic and proceed — do not stop and ask.


Eval Signal path


Step 1a: Eval overview (parallel)


For each eval, call both in a single parallel batch:

- `get_llmobs_eval_aggregate_stats(eval_name, from, to)`
- `get_llmobs_evaluator(eval_name)`

Interpret aggregate stats:

- `total_count == 0` → Note "no data." Skip this eval (or pivot to Generic if it's the only one).
- Boolean, `pass_rate == 1.0` → Note "100% pass." Skip unless it's the only eval.
- Boolean with failures → Note counts and pass_rate. Continue.
- Score with assessment criteria → Note distribution and pass/fail counts. Continue.
- Score WITHOUT assessment criteria → Infer failures: bottom quartile, or below median if bimodal. Label as "inferred failures" in report.
- Categorical with assessment criteria → Note top_values and pass/fail. Continue.
- Categorical WITHOUT assessment criteria → Infer from context (e.g., "error", "incomplete", "off_topic" are likely failures). Ask user if genuinely ambiguous.
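
For score-type evals without assessment criteria, the bottom-quartile rule can be sketched with `statistics.quantiles`. The helper name is hypothetical, treating scores at or below the first quartile as failures is an assumption, and the bimodal case (below median) is deliberately not handled here.

```python
import statistics
from typing import List, Tuple


def infer_failures_bottom_quartile(scores: List[float]) -> Tuple[float, List[float]]:
    """Return (threshold, scores at or below it) using the first quartile."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute quartiles")
    # quantiles(..., n=4) returns the three quartile cut points; take Q1.
    q1 = statistics.quantiles(scores, n=4, method="inclusive")[0]
    return q1, [s for s in scores if s <= q1]
```

Rows selected this way should carry the "inferred failures" label in the report, since the threshold comes from the data rather than from the evaluator.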

Interpret eval config:

- Config returned (custom) → Store `prompt_template`, `assessment_criteria`, `parsing_type`, `output_schema`.
- Config nil (OOTB) → Note prompt is not inspectable.

Calibration cross-check: When two evals share a name prefix but differ in type (e.g. `foo-boolean` and `foo-score`), compare their pass rates on overlapping spans. A discrepancy >20% is an Evaluator Calibration Discrepancy; flag it in the report.


Step 1b: Collect failure spans


For each eval:

1. Call `search_llmobs_spans(query="@evaluations.custom.<eval_name>:*", from, limit=50)`. When multi-valued, issue one search per eval in parallel, then merge result sets and dedup by `(trace_id, span_id)`.
2. Paginate until ≥15–20 failures OR no more pages. Cap at 200 spans total.
3. Call `get_llmobs_span_details` per trace_id batch (follow Parallelization Rules).
4. Extract per row: assessment, value, reasoning, span_id, trace_id, span_kind, content_info.
5. Separate into pass/fail buckets using thresholds from Step 1a.
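
The merge-and-dedup step for multi-eval searches might be sketched as follows. `merge_dedup` is a hypothetical helper that assumes each span is a dictionary with `trace_id` and `span_id` keys; it keeps the first occurrence of each pair and applies the 200-span cap.

```python
from typing import Dict, Iterable, List, Set, Tuple


def merge_dedup(result_sets: Iterable[List[Dict[str, str]]],
                cap: int = 200) -> List[Dict[str, str]]:
    """Merge per-eval results, keeping the first span per (trace_id, span_id)."""
    seen: Set[Tuple[str, str]] = set()
    merged: List[Dict[str, str]] = []
    for results in result_sets:
        for span in results:
            key = (span["trace_id"], span["span_id"])
            if key in seen:
                continue
            seen.add(key)
            merged.append(span)
            if len(merged) >= cap:  # overall cap on collected spans
                return merged
    return merged
```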

JSON-type eval fallback: If `@evaluations.custom.<eval_name>:*` returns 0 spans but `get_llmobs_eval_aggregate_stats` confirmed `total_count > 0`, the eval is JSON-type and scores are not indexed on this field. Fall back to: search by the span name or span kind that the eval targets (check `get_llmobs_evaluator` for the span filter), then inspect output payloads for JSON verdict fields via `get_llmobs_span_content(field="output")`.


Step 1c: Signal Summary (Eval Signal)

Signal Summary: `{ml_app}` · Eval Signal

(Inferred from {N} configured eval(s). Say `switch to error mode` or `switch to generic mode` to change.)

Timeframe: {from} → {to}

| Eval | Type | Total | Pass Rate | Status |
|---|---|---|---|---|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |
| eval_2 | score | 1,200 | n/a (inferred threshold) | ⚠ Investigating |
| eval_3 | boolean | 500 | 99.2% | ✓ Healthy |

Collected: {pass_count} passing, {fail_count} failing.


For a single eval, collapse to a single-line header instead of a table.

---


Error Signal path


Step 1a: Sample error spans

Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:error", from=timeframe, limit=50)`. Paginate until ≥30 error spans or no more pages.

Step 1a.5: Soft error scan

MCP tool spans sometimes report `@status:ok` but carry `"isError": true` in their output payload. These are invisible to `@status:error` queries and can outnumber hard errors.

Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", span_kind="tool", from=timeframe, limit=20)`. For a sample of 5–10 results, call `get_llmobs_span_content(field="output")` in parallel. If any payloads contain `"isError": true`, add MCP soft errors as a separate row in the error frequency table with the note: (status:ok but isError:true in payload — not queryable via @status:error).
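
Checking a fetched output payload for the soft-error marker can be sketched like this. The helper name is hypothetical, and it assumes the payload arrives as a JSON string; searching nested objects as well as the top level is a choice made for the sketch.

```python
import json
from typing import Any


def has_soft_error(output_payload: str) -> bool:
    """Return True if the payload is JSON containing "isError": true at any depth."""
    try:
        data = json.loads(output_payload)
    except (TypeError, ValueError):
        return False  # not JSON, so no structured soft-error marker

    def walk(node: Any) -> bool:
        if isinstance(node, dict):
            if node.get("isError") is True:
                return True
            return any(walk(v) for v in node.values())
        if isinstance(node, list):
            return any(walk(v) for v in node)
        return False

    return walk(data)
```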


Step 1b: Group by error type

Group spans by the `error_type` tag → frequency table. If the `error_type` tag is absent on some spans, supplement with the `error.type` field from `get_llmobs_span_details` (fetched in Step 1d).
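
The frequency table with its fallback can be sketched using `collections.Counter`. The function name and the flat dictionary shape are assumptions; in particular, representing the details field under the key `"error.type"` and using an `"unknown"` placeholder are illustrative choices.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def error_frequency(spans: List[Dict[str, Optional[str]]]) -> List[Tuple[str, int]]:
    """Count spans per error type, most frequent first."""
    counts: Counter = Counter()
    for span in spans:
        # Prefer the tag; fall back to the details field, then to a placeholder.
        etype = span.get("error_type") or span.get("error.type") or "unknown"
        counts[etype] += 1
    return counts.most_common()
```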


Step 1c: Fetch stack traces (parallel)

For the top 3–4 error types by count, pick 2–3 representative trace IDs each. Call `find_llmobs_error_spans(trace_id)` in parallel across all selected traces. Extract:

- Error message and stack trace
- Origin span kind and name
- Whether errors propagate from children to parents (cascade) or are isolated


Step 1d: Fetch span details

Call `get_llmobs_span_details` on representative spans for each error type (follow Parallelization Rules). Extract `content_info`, `span_kind`, and duration.

Step 1e: Signal Summary (Error Signal)

Signal Summary: `{ml_app}` · Error Signal

(No evals configured; analyzing runtime errors. Say `switch to eval mode` or `switch to generic mode` to change.)

Timeframe: {from} → {to} | Total error spans sampled: {N}

| Error Type | Spans | Cascade? | Origin Span Kind |
|---|---|---|---|
| TimeoutError | 42 | Yes | tool |
| APIError 429 | 18 | No | tool |
| ValueError | 7 | No | llm |
| MCP soft errors (isError:true) | 23 | No | tool |

---


Generic path


Step 1a: Eval health check (when evals are configured)

If `list_llmobs_evals` returned evals in Phase 0, call `get_llmobs_eval_aggregate_stats` for each enabled eval in parallel. Flag any enabled eval with `total_count: 0` as Broken Eval Configuration and include it in the Signal Summary anomaly table as a High severity row.

Step 1b: Broad span search

Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\"", root_spans_only=true, from=timeframe, limit=50)`. Apply `failure_filter` narrowing if present (tool/span name → `@name:<x>` query; `"high latency"` → sort result set by `duration` after Step 1c). Paginate until ≥30 spans.

Step 1c: Fetch span details

Call `get_llmobs_span_details` per trace_id batch.

Step 1d: Rank by structural anomalies

Partition spans using heuristics:

- Top decile by `duration` (latency outliers)
- Agent spans with >N tool/LLM iterations (long-running loops)
- Retrieval spans returning 0 documents (RAG miss)
- Workflow spans whose child set is missing an expected step (compare against median child layout)
- Token efficiency: Check if `non_cached_input_tokens ≈ input_tokens` across LLM spans. If the app has stable system prompts (>1k tokens) and cache hit rate is 0%, flag as High severity: enabling `cache_control: ephemeral` on the system prompt could cut input token costs by an estimated 60–90%.
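
Two of these heuristics reduce to simple arithmetic and can be sketched directly. Both helpers are hypothetical: the span dictionaries are assumed to expose `duration`, `input_tokens`, and `non_cached_input_tokens` as plain numbers, and the 1% tolerance used to interpret "≈" is an arbitrary illustrative value.

```python
import statistics
from typing import Dict, List


def latency_outliers(spans: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return spans whose duration exceeds the 90th percentile of the sample."""
    durations = [s["duration"] for s in spans]
    if len(durations) < 2:
        return []
    # quantiles(..., n=10) returns nine decile cut points; the last is p90.
    p90 = statistics.quantiles(durations, n=10, method="inclusive")[-1]
    return [s for s in spans if s["duration"] > p90]


def zero_cache_use(llm_spans: List[Dict[str, int]], tolerance: float = 0.01) -> bool:
    """True when non-cached input tokens roughly equal input tokens on every span."""
    checked = [s for s in llm_spans if s.get("input_tokens", 0) > 0]
    return bool(checked) and all(
        abs(s["input_tokens"] - s["non_cached_input_tokens"])
        <= tolerance * s["input_tokens"]
        for s in checked
    )
```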


Step 1e: Signal Summary (Generic)

Signal Summary: `{ml_app}` · Generic

(Analyzing structural anomalies. Say `switch to eval mode` or `switch to error mode` to change.)

Timeframe: {from} → {to} | Sampled: {N} root spans

| Anomaly Type | Count |
|---|---|
| Latency outliers (>p90) | 12 |
| Long agent loops (>8 iterations) | 5 |
| RAG retrieval misses | 3 |
| Zero prompt cache utilization | All LLM spans |
| Broken eval configurations | 2 |

---


Phase 1.5: Determine App Profile & Where the Root Cause Lives

Inspect `content_info` and `span_kind` across collected spans. This drives the Phase 4 strategy.

App profile (from content_info):

| Signal | App profile | Phase 4 strategy |
|---|---|---|
| `content_info` has `messages` | LLM/chat app | Extract system prompt via `messages[0]`, check conversation flow |
| `content_info` has `documents` | RAG app | Check retrieval quality alongside LLM output |
| Trace contains `agent` span kind | Agent app | Try `get_llmobs_agent_loop` first; if it returns empty use child-span reconstruction (see Phase 4b) |
| `messages.count > 10` | Long conversation | Check for context overflow |
| `content_info` has `metadata` | Has custom metadata | Check for clustering by metadata values (prompt version, feature flags) |

LLM Experiments traces: If root spans have `span_kind: experiment` and carry `input`, `output`, and `expected_output` structured fields, you are looking at a Datadog LLM Experiments trace. Each span represents one dataset record run. Read quality signal from the root span's `input` / `output` / `expected_output` fields via `get_llmobs_span_content`, not from LLM sub-span messages, which may contain stub or placeholder content. Evaluations attached to experiment spans are computed by the Experiments framework at run time and may not be registered as online Datadog evaluators (`get_llmobs_evaluator` will return 404 for them).

Where the root cause likely lives — by symptom span kind:

| Symptom span kind | Symptom looks like | But root cause is often in... |
|---|---|---|
| `llm` | Bad LLM response (eval flagged, wrong output) | Parent agent (bad instructions), sibling retrieval (bad context), sibling tool (bad data) |
| `agent` | Bad orchestration | Child spans (wrong tool calls, bad routing), full agent loop |
| `tool` | Bad tool result | Parent LLM (passed wrong parameters), tool implementation |
| `workflow` | Bad overall output | Child sub-spans (which step first deviated?) |
| `retrieval` | Bad retrieval | Query construction (parent), index/embedding config (outside trace) |

Key insight: The signal — eval verdict, error message, latency outlier — flags one span in isolation. It's a symptom report, not a diagnosis. The root cause often lives in a different span: a parent that gave bad instructions, a sibling that provided bad context, or a child that made a wrong decision. Phase 4 navigates the tree to find it.


Phase 2: Open Coding — Initial Failure Categorization

Phase 2：开放式编码 — 初始故障分类

Goal: Read per-row evidence and propose initial, concrete failure categories. Pool all problematic rows together — categories should describe app behaviors, not which signal flagged them.

Per-row "reasoning input" by mode:

- Eval Signal: judge assessment + reasoning from `get_llmobs_span_details`
- Error Signal: error message + stack trace excerpt from `find_llmobs_error_spans`
- Generic: one-line description of the structural anomaly that flagged the row

Shortcuts:

< 15 problematic rows: Combine Phases 2 and 3 into one pass. Still produce the checkpoint.
> 80% share the same reasoning/error/symptom: Skip to Phase 4 with the dominant pattern. Still output checkpoint.
> 50 problematic rows: Sample ~50, build taxonomy, then spot-check 10–15 more.

Use the per-row signal from Phase 1; do NOT re-fetch. Only call `get_llmobs_span_content(field="input"/"output")` for spans where the reasoning is insufficient (generic, empty, or just a stack trace with no app context).
If eval config is loaded (Eval Signal), distinguish early:
- App failures: Output genuinely violates the eval's criteria
- Eval failures: Output seems reasonable but eval criteria are too strict/ambiguous
Each pattern must be specific: "Agent called search instead of calculator for price computation" — NOT "tool issue."
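
The shortcut thresholds above can be sketched as a small decision function. This is only an illustration: the function name, the returned labels, and the choice to let the dominant-pattern shortcut win when several conditions hold are assumptions, not part of the skill.

```python
def choose_open_coding_strategy(n_rows: int, dominant_share: float) -> str:
    """Pick a Phase 2 shortcut from the row count and the dominant-pattern share.

    dominant_share is the fraction (0.0 to 1.0) of problematic rows that share
    the same reasoning, error, or symptom.
    """
    if n_rows <= 0:
        return "nothing-to-code"
    # Assumption: when several shortcuts apply, the dominant pattern wins.
    if dominant_share > 0.80:
        return "skip-to-phase-4"
    if n_rows < 15:
        return "combine-phases-2-and-3"
    if n_rows > 50:
        return "sample-50-then-spot-check"
    return "full-open-coding"
```

Whatever branch is taken, the mandatory checkpoint is still produced.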

目标：读取每行证据并提出初始、具体的故障类别。将所有问题行合并——类别应描述应用行为，而非标记它们的信号类型。

每行“推理输入” 按模式分类：

评估信号模式：
```
get_llmobs_span_details
```
中的评估判定+推理
错误信号模式：
```
find_llmobs_error_spans
```
中的错误消息+堆栈跟踪片段
通用模式：标记该行的结构异常的单行描述

快捷方式：

问题行 <15个：将Phase 2和Phase 3合并为一次处理。仍需生成检查点。
>80%的行具有相同推理/错误/症状：跳过直接进入Phase 4，聚焦主导模式。仍需输出检查点。
问题行 >50个：采样约50个，构建分类法，然后抽查10–15个更多样本。

使用Phase 1的每行信号 — 不要重新获取。仅当推理信息不足（通用、空值或仅包含无应用上下文的堆栈跟踪）时，调用
```
get_llmobs_span_content(field="input"/"output")
```
获取Span内容。
如果已加载评估器配置（评估信号模式），提前区分：
- 应用故障：输出确实违反了评估器的标准
- 评估器故障：输出看似合理但评估器标准过于严格/模糊
每个模式必须具体：例如“Agent在计算价格时调用了搜索工具而非计算器工具”——而非“工具问题”。

MANDATORY CHECKPOINT

强制检查点

**Open coding**: {N} problematic rows → {K} initial categories: {Category1} ({count}), {Category2} ({count}), ...

**开放式编码**: {N}个问题行 → {K}个初始类别: {Category1} ({count}), {Category2} ({count}), ...

Phase 3: Axial Coding — Refine Failure Taxonomy

Phase 3：主轴编码 — 优化故障分类法

Goal: 3–8 final categories, ranked by impact.

Merge: Categories with < 3 occurrences → parent category or drop as noise.
Split: Categories with > 30% of failures → more specific sub-categories. Pull additional span content if needed.
Validate: 2–3 representative examples per category confirm the label fits.
Rank: `priority = count × severity` (severity: high / medium / low).
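
A minimal sketch of the merge / split / rank rules above. The skill does not say how high / medium / low map to numbers, so the weights 3 / 2 / 1 below are an assumption, as are the function name and the record layout.

```python
# Hypothetical numeric weights; the skill only names the three levels.
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def refine_taxonomy(counts, severity):
    """Flag each category for merge/split and rank by count x severity weight.

    counts:   {category name: number of occurrences}
    severity: {category name: "high" | "medium" | "low"}
    """
    total = sum(counts.values())
    report = []
    for name, count in counts.items():
        if count < 3:
            action = "merge-or-drop"      # fewer than 3 occurrences
        elif total and count / total > 0.30:
            action = "split"              # more than 30% of failures
        else:
            action = "keep"
        weight = SEVERITY_WEIGHT[severity.get(name, "low")]
        report.append({"category": name, "count": count,
                       "action": action, "priority": count * weight})
    # Highest priority first.
    return sorted(report, key=lambda row: row["priority"], reverse=True)
```

The "validate" step (2–3 representative examples per category) stays manual; the sketch only covers the arithmetic.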

目标：最终得到3–8个类别，按影响排序。

合并：出现次数<3的类别 → 合并到父类别或作为噪声丢弃。
拆分：占故障数>30%的类别 → 拆分为更具体的子类别。必要时获取更多Span内容。
验证：每个类别选取2–3个代表性示例确认标签合适。
排序：
```
优先级 = 数量 × 严重度
```
（严重度：高/中/低）。

MANDATORY CHECKPOINT

强制检查点

**Axial coding**: {merges/splits/drops}. Final categories:
1. {Category} ({count}, {pct}%) — {severity}
2. ...

**主轴编码**: {合并/拆分/丢弃操作说明}。最终类别:
1. {Category} ({count}, {pct}%) — {severity}
2. ...

Phase 4: Root Cause Analysis — Navigate from Symptom to Root Cause

Phase 4：根因分析 — 从症状到根因的定位

Goal: The signal flagged a span. That's the symptom. Navigate the trace tree to find the actual root cause — it's often in a different span.

For each of the top 3 categories, pick 2–3 representative traces:

目标：信号标记了一个Span，这是症状。遍历跟踪树找到实际根因——通常位于另一个Span。

对前3个类别，每个选取2–3个代表性跟踪：

Step 4a: Trace structure + errors (parallel)

Step 4a：跟踪结构 + 错误（并行）

For each representative trace, call in a single message:

- `get_llmobs_trace(trace_id)`: span hierarchy; locate the symptom span and its parent/siblings/children
- `find_llmobs_error_spans(trace_id)`: check for runtime errors anywhere in the trace

Runtime vs behavioral: If errors exist on or near the symptom span, the root cause may be a runtime failure rather than a behavioral one. Check this first.

Distributed trace fallback: If `get_llmobs_trace` returns "cannot find parent" or an empty span list (common in Ray-based or multi-process execution), reconstruct the trace manually using `get_llmobs_span_details` on the span_ids collected in Phase 1, sorted by `start_ms`.

对每个代表性跟踪，在单条消息中调用：

```
get_llmobs_trace(trace_id)
```
— Span层级结构；定位症状Span及其父/同级/子Span
```
find_llmobs_error_spans(trace_id)
```
— 检查跟踪中是否存在运行时错误

运行时vs行为：如果症状Span上或附近存在错误，根因可能是运行时故障而非行为故障。优先检查这一点。

分布式跟踪 fallback：如果

get_llmobs_trace

返回“无法找到父Span”或空Span列表（在基于Ray或多进程执行的应用中常见），使用Phase 1中收集的span_id调用

get_llmobs_span_details

，按

start_ms

排序手动重建跟踪。

Step 4b: Navigate to the root cause (parallel)

Step 4b：定位根因（并行）

Use the symptom span kind (from Phase 1.5). Issue ALL calls in a single message.

If symptom is on an `llm` span (most common):

- `get_llmobs_span_content(field="messages", path="$.messages[0]")` on symptom span: system prompt
- `get_llmobs_span_content(field="messages")` on symptom span: full context received
- `get_llmobs_span_content(field="documents")` on sibling retrieval spans (if any)
- `get_llmobs_span_content(field="input")` on sibling tool spans (if any)
- `get_llmobs_span_content(field="messages", path="$.messages[0]")` on parent agent/workflow span

If symptom is on an `agent` span:

- `get_llmobs_agent_loop(trace_id, span_id)`: full decision timeline (try first; if it returns 0 iterations, use the fallback below)
- `get_llmobs_span_details` on child spans: sort by `start_ms` to reconstruct the execution timeline
- `get_llmobs_span_content(field="input"/"output")` on child spans that look wrong

Agent loop fallback (when `get_llmobs_agent_loop` returns 0 iterations): Reconstruct the timeline from `get_llmobs_span_details` results sorted by `start_ms`. Group by `span_kind` to identify LLM → tool → LLM sequences. This fallback is frequently needed: `get_llmobs_agent_loop` returns empty for many apps.
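
The fallback can be sketched as a sort followed by a run-length grouping on span kind. The input shape is an assumption: each span is taken to be a dict carrying `span_id`, `span_kind`, and `start_ms`, which may not match the exact structure `get_llmobs_span_details` returns.

```python
from itertools import groupby


def kind_sequence(spans):
    """Order spans by start_ms and collapse consecutive spans of the same kind.

    Returns a list of (span_kind, run_length) tuples, e.g. an
    LLM -> tool -> LLM pattern shows up as [("llm", 1), ("tool", 2), ("llm", 1)].
    """
    ordered = sorted(spans, key=lambda span: span["start_ms"])
    return [(kind, len(list(run)))
            for kind, run in groupby(ordered, key=lambda span: span["span_kind"])]


spans = [
    {"span_id": "c", "span_kind": "llm", "start_ms": 300},
    {"span_id": "a", "span_kind": "llm", "start_ms": 100},
    {"span_id": "b1", "span_kind": "tool", "start_ms": 200},
    {"span_id": "b2", "span_kind": "tool", "start_ms": 250},
]
sequence = kind_sequence(spans)
```

The same sort on `start_ms` also serves the distributed trace fallback in Step 4a, where the tree itself is unavailable.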

If symptom is on a `tool` span:

- `get_llmobs_span_content(field="input")` on symptom span: what parameters was it called with?
- `get_llmobs_span_content(field="messages")` on parent LLM span: did the LLM construct the call correctly?

If symptom is on a `workflow` span:

- `get_llmobs_span_details` on all child spans: find which step first deviated
- `get_llmobs_span_content(field="input"/"output")` on the deviating child

Always also fetch:

- `get_llmobs_span_content(field="metadata")` on the symptom span: clustering signals (prompt version, feature flags)
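
One way to keep the per-kind call lists straight is a lookup table that always appends the metadata fetch. The table below only restates the calls listed above; the tuple layout, the target labels, and the function name are illustrative assumptions.

```python
# (tool name, arguments, target span) for each symptom span kind.
CALL_PLAN = {
    "llm": [
        ("get_llmobs_span_content", {"field": "messages", "path": "$.messages[0]"}, "symptom"),
        ("get_llmobs_span_content", {"field": "messages"}, "symptom"),
        ("get_llmobs_span_content", {"field": "documents"}, "sibling retrieval"),
        ("get_llmobs_span_content", {"field": "input"}, "sibling tool"),
        ("get_llmobs_span_content", {"field": "messages", "path": "$.messages[0]"}, "parent"),
    ],
    "agent": [
        ("get_llmobs_agent_loop", {}, "symptom"),
        ("get_llmobs_span_details", {}, "children"),
    ],
    "tool": [
        ("get_llmobs_span_content", {"field": "input"}, "symptom"),
        ("get_llmobs_span_content", {"field": "messages"}, "parent llm"),
    ],
    "workflow": [
        ("get_llmobs_span_details", {}, "children"),
    ],
}

# Fetched for every symptom span regardless of kind.
ALWAYS = ("get_llmobs_span_content", {"field": "metadata"}, "symptom")


def plan_calls(span_kind):
    """Return the calls to issue in a single message for this span kind."""
    return [*CALL_PLAN.get(span_kind, []), ALWAYS]
```

Span kinds with no dedicated list (for example `retrieval`) fall through to the metadata fetch only, which is a choice made for this sketch rather than a rule from the skill.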

根据Phase 1.5中的症状Span类型操作。将所有调用放入单条消息中。

如果症状位于
llm
Span上（最常见）：

在症状Span上调用

get_llmobs_span_content(field="messages", path="$.messages[0]")

— 获取系统提示

在症状Span上调用
```
get_llmobs_span_content(field="messages")
```
— 获取完整上下文
在同级检索Span（如果存在）上调用
```
get_llmobs_span_content(field="documents")
```
在同级工具Span（如果存在）上调用
```
get_llmobs_span_content(field="input")
```

在父级Agent/工作流Span上调用

get_llmobs_span_content(field="messages", path="$.messages[0]")

如果症状位于
agent
Span上：

调用
```
get_llmobs_agent_loop(trace_id, span_id)
```
— 获取完整决策时间线 (优先尝试；如果返回0次迭代，使用下方的fallback方案)
在子Span上调用
```
get_llmobs_span_details
```
— 按
```
start_ms
```
排序重建执行时间线

在看似异常的子Span上调用

get_llmobs_span_content(field="input"/"output")

Agent循环 fallback（当

get_llmobs_agent_loop

返回0次迭代时）：从按

start_ms

排序的

get_llmobs_span_details

结果中重建时间线。按

span_kind

分组识别LLM → 工具 → LLM序列。该fallback方案经常需要使用——

get_llmobs_agent_loop

对许多应用会返回空结果。

如果症状位于
tool
Span上：

在症状Span上调用
```
get_llmobs_span_content(field="input")
```
— 获取调用参数
在父级LLM Span上调用
```
get_llmobs_span_content(field="messages")
```
— 检查LLM是否正确构建调用

如果症状位于
workflow
Span上：

在所有子Span上调用
```
get_llmobs_span_details
```
— 找到首先偏离预期的步骤

在偏离步骤的子Span上调用

get_llmobs_span_content(field="input"/"output")

始终还要获取：

在症状Span上调用
```
get_llmobs_span_content(field="metadata")
```
— 获取聚类信号（提示版本、功能开关）

Step 4c: Diagnose — from symptom to root cause

Step 4c：诊断 — 从症状到根因

For each category, trace the causal chain:

Symptom — what the signal flagged (eval reasoning, error message, anomaly note). The signal only saw one span in isolation — its reasoning may be shallow.
Trace context — what surrounding spans reveal (parent instructions, sibling data, child decisions).
Root cause — the specific span and decision point where the failure originated. Often NOT the symptom span itself.

For suspected eval issues (Eval Signal, if config loaded): Compare eval criteria against evidence. Is the prompt ambiguous? Criteria too strict?

Root cause categories:

Category	Description
System Prompt Deficiency	Instructions unclear, missing, or contradictory — in symptom span OR its parent
Tool Gap	Needed tool doesn't exist or parameters too coarse
Tool Misuse	Wrong tool called or wrong parameters — often visible in agent loop or parent LLM
Routing/Handoff Error	Wrong sub-agent selected (multi-agent systems)
Retrieval Failure	RAG returned irrelevant or missing context — check sibling retrieval spans
Context Overflow	Critical info lost due to context length
Upstream Data Issue	A sibling or parent span provided bad data that cascaded to the symptom span
Runtime Error	Tool/API failure, timeout, exception — from `find_llmobs_error_spans`
Evaluator Miscalibration	Eval criteria produce false positives/negatives (Eval Signal mode only)

对每个类别，跟踪因果链：

症状 — 信号标记的内容（评估推理、错误消息、异常说明）。信号仅孤立地看到一个Span——其推理可能较为表面。
跟踪上下文 — 周边Span揭示的信息（父级指令、同级数据、子级决策）。
根因 — 故障起源的具体Span和决策点。通常不是症状Span本身。

对于疑似评估器问题（评估信号模式，已加载配置）：将评估器标准与证据对比。提示是否模糊？标准是否过于严格？

根因类别：

类别	描述
系统提示缺陷	指令不清晰、缺失或矛盾 — 位于症状Span或其父Span
工具缺失	需要的工具不存在或参数过于粗糙
工具误用	调用了错误的工具或传递了错误的参数 — 通常在Agent循环或父级LLM中可见
路由/交接错误	选择了错误的子Agent（多Agent系统）
检索失败	RAG返回了无关或缺失的上下文 — 检查同级检索Span
上下文溢出	因上下文长度限制导致关键信息丢失
上游数据问题	同级或父级Span提供了错误数据并传导至症状Span
运行时错误	工具/API故障、超时、异常 — 来自 `find_llmobs_error_spans`
评估器校准错误	评估器标准产生了误报/漏报（仅评估信号模式）

Phase 5: Generate Recommendations

Phase 5：生成建议

Goal: Concrete, actionable recommendations grounded in trace evidence. Actual text/code changes with before/after quotes from the trace — not generic advice.

Recommendation types: System Prompt Edit (quote actual prompt, provide before/after), Tool Gap/Misuse (reference agent loop steps), Routing/Handoff Fix, Retrieval Fix (show retrieved vs needed), Evaluator Prompt Edit (flag that eval changes need re-validation; Eval Signal only), Other.

When run in Claude Code with codebase access: Search the codebase for system prompt, tool definitions, or routing logic. Propose specific diffs. Always ask before modifying files.

目标：基于跟踪证据的具体、可执行建议。包含来自跟踪的实际文本/代码变更的前后对比——而非通用建议。

建议类型：系统提示修改（引用实际提示，提供前后对比）、工具缺失/误用（参考Agent循环步骤）、路由/交接修复、检索修复（展示检索结果与所需内容的对比）、评估器提示修改（标记评估器变更需重新验证；仅评估信号模式）、其他。

当在Claude Code中运行且可访问代码库时：搜索代码库中的系统提示、工具定义或路由逻辑。提出具体的diff。修改文件前始终询问用户。

Phase 6: Compile RCA Report

Phase 6：编译RCA报告

Write the full report following the Output Format below. This is the primary deliverable — output it directly in the chat.

按照下方的输出格式编写完整报告。这是主要交付成果——直接在聊天中输出。

Phase 7: Post-Analysis Actions

Phase 7：分析后操作

Do NOT take any action automatically. After presenting the report, ask the user what they'd like to do next:

1. Save the report to `llm-obs-rca-{ml_app}-{date}.md`
2. Apply fixes (if codebase is available)
3. Deeper investigation of remaining categories
4. Export to a Datadog notebook. In pup mode, use `pup notebooks create` to create the notebook and `pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` to append sections (see Tool Reference)
5. Re-run on an expanded time range (e.g. `now-7d` if current window was `now-24h`)

If the user chooses option 4, follow the notebook creation fallback pattern:

Call `mcp__datadog-mcp-core__create_datadog_notebook` with:

- `name`: `LLM Obs RCA: {ml_app} ({mode}) — YYYY-MM-DD`
- `type`: `report`
- `time_span`: `1w`
- `cells`: one cell per section (see Notebook Cell Structure below)

If the MCP call fails, inspect the error before giving up:
- Auth / permission error (401, 403) → stop and tell the user.
- Field validation error (error message names a specific field) → fix that field and retry the MCP call once.
- Any other error (binding, serialization, unexpected response) → fall back to pup:
  - Write the notebook payload to `/tmp/nb_rca_{ml_app}.json` as a full API envelope: `{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}`
  - Run `pup notebooks create --file /tmp/nb_rca_{ml_app}.json`
  - If pup is not available either, render the full notebook content as markdown in chat so the user has it.
After successful creation by either method, output the URL on its own line:
```
RCA report exported to notebook: <url>
```

Print the URL prominently: if `/eval-bootstrap` runs next in the same session, it will detect this URL and offer to append the evaluator suite to the same notebook.
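
The error-handling branches for a failed notebook call can be sketched as a triage function. Everything here is hypothetical scaffolding: the function name, the returned labels, and the way a "field validation error" is detected (a whole-word match against the four parameter names listed above). The skill also does not say what to do if the single retry fails; the sketch assumes it falls back to pup.

```python
import re

# Parameter names of the notebook creation call, as listed above.
NOTEBOOK_FIELDS = ("name", "type", "time_span", "cells")


def triage_notebook_error(status_code, message, already_retried=False):
    """Map a failed MCP notebook call to the next action."""
    if status_code in (401, 403):
        return "stop-and-tell-user"
    # Assumption: a validation error is one whose message names a known field.
    names_field = any(re.search(rf"\b{re.escape(field)}\b", message)
                      for field in NOTEBOOK_FIELDS)
    if names_field and not already_retried:
        return "fix-field-and-retry-mcp"
    # Binding, serialization, unexpected response, or a second validation failure.
    return "fallback-to-pup"
```

If pup is unavailable as well, the final step is still to render the notebook content as markdown in chat.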

不要自动执行任何操作。 展示报告后，询问用户下一步想要执行的操作：

将报告保存为
```
llm-obs-rca-{ml_app}-{date}.md
```
应用修复（如果可访问代码库）
对剩余类别进行更深入的调查
导出到Datadog笔记本 — 在pup模式下，使用
```
pup notebooks create
```
创建笔记本，使用
```
pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json
```
添加章节（参考工具参考）
在扩大的时间范围内重新运行分析（例如当前窗口为
```
now-24h
```
，则使用
```
now-7d
```
）

如果用户选择选项4，遵循笔记本创建fallback模式：

调用

mcp__datadog-mcp-core__create_datadog_notebook

，参数如下：

name
:

LLM Obs RCA: {ml_app} ({mode}) — YYYY-MM-DD

type
:
```
report
```
time_span
:
```
1w
```
cells
: 每个章节对应一个单元格（参考下方的笔记本单元格结构）

如果MCP调用失败，检查错误后再放弃：
- 认证/权限错误（401、403） → 停止操作并告知用户。
- 字段验证错误（错误消息提及特定字段） → 修复该字段并重试MCP调用一次。
- 其他错误（绑定、序列化、意外响应） → fallback到pup：
  - 将笔记本负载写入
```
/tmp/nb_rca_{ml_app}.json
```
    ，作为完整API包：
```
{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
```
  - 运行
```
pup notebooks create --file /tmp/nb_rca_{ml_app}.json
```
  - 如果pup也不可用，将完整笔记本内容以markdown格式在聊天中展示，确保用户获取到内容。
无论通过哪种方式成功创建后，单独一行输出URL：
```
RCA报告已导出到笔记本: <url>
```

确保URL醒目显示——如果

/eval-bootstrap

在同一会话中运行，它会检测到该URL并提议将评估器套件附加到同一笔记本。

Notebook Cell Structure

笔记本单元格结构

Cell	Content
1 — Overview	Structured header (see Overview cell format below — follow it exactly)
2 — Signal Summary	Mode-specific health table
3 — Failure Taxonomy	Taxonomy table
4…N — Failure Modes	One cell per failure mode
N+1 — Action Plan + Limitations	Action plan table + bullet list

Notebook formatting rules (apply to every cell):

No triple-backtick code blocks — use blockquotes (
```
>
```
) for prompts/rubrics, inline code (
```
`
```
) for short values
Evidence as tables — not bullet lists
Tool inputs as tables — Argument | Wrong value passed | Correct approach
Action plan as a table — Priority | Action | Confidence | Impact
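
Since evidence, tool inputs, and the action plan all go into tables, a small rendering helper may be convenient. This is a sketch only; the helper name and the pipe-escaping rule are assumptions, and it deliberately emits no code fences so the output stays within the notebook formatting rules.

```python
def markdown_table(headers, rows):
    """Render headers and rows as a markdown pipe table (no code fences)."""
    def fmt(cells):
        # Escape literal pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(headers), "| " + " | ".join("---" for _ in headers) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


action_plan = markdown_table(
    ["Priority", "Action", "Confidence", "Impact"],
    [[1, "Fix `monitor_groups_search` schema", "High", "Eliminates ~21 spans/7d"]],
)
```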

单元格	内容
1 — 概览	结构化标题（参考下方的概览单元格格式——严格遵循）
2 — 信号摘要	模式专属健康表格
3 — 故障分类法	分类法表格
4…N — 故障模式	每个故障模式对应一个单元格
N+1 — 行动计划 + 局限性	行动计划表格 + 项目符号列表

笔记本格式规则（适用于所有单元格）：

不要使用三重反引号代码块 — 使用块引用（
```
>
```
）展示提示/评估标准，使用行内代码（
```
`
```
）展示短值
证据以表格形式展示 — 不要使用项目符号列表
工具输入以表格形式展示 — 参数 | 传递的错误值 | 正确方法
行动计划以表格形式展示 — 优先级 | 操作 | 置信度 | 影响

Output Format

输出格式

Overview cell (notebook Cell 1 / report header)

概览单元格（笔记本单元格1 / 报告标题）

The Overview cell must follow this exact structure. No prose paragraphs. No inline-numbered findings. App description is one sentence maximum.


概览单元格必须严格遵循以下结构。不要使用散文段落。不要使用行内编号的发现。应用描述最多一句话。


`{ml_app}` · {Eval Signal | Error Signal | Generic} · {timeframe}

{ml_app}

· {评估信号模式 | 错误信号模式 | 通用模式} · {timeframe}

Date: {YYYY-MM-DD} | Profile: {short app profile} | Model: `{model(s)}`

{One sentence: what does this app do?}

Metric	Value
{mode-appropriate rows — see below}

日期: {YYYY-MM-DD} | 配置文件: {简短应用配置文件} | 模型:

{model(s)}

{一句话：该应用的功能是什么？}

指标	值
{模式专属行 — 参考下方}

Findings

发现

{Finding 1} (~{pct}%): one-line root cause description
{Finding 2} (~{pct}%): one-line root cause description
{Finding 3} (if present): one-line root cause description

{发现1} (~{pct}%): 一行根因描述
{发现2} (~{pct}%): 一行根因描述
{发现3}（如果存在）: 一行根因描述

Recommendations

建议

{Recommendation 1}: specific next step tied to Finding 1
{Recommendation 2}: specific next step tied to Finding 2

Sample: {N} spans analyzed. Confidence: High | Medium | Low — {one-line reason if Medium or Low}.


**Mode-appropriate metric rows:**

Eval Signal:

| Eval | `{eval_name}` ({type}) |
| Spans evaluated | {total_count} |
| Pass rate | {pass_rate}% ({pass_count} pass / {fail_count} fail) |
| Top failure mode | {name} (~{pct}%) |
| Evals configured | {N} |


Error Signal:

| Error spans | {N} confirmed |
| Top error type | `{type}` ({pct}%) |
| Affected operation | `{span_name}` |
| Cascade pattern | Isolated / Cascading |
| Evals configured | {N} (none = no quality signal) |


Generic:

| Spans sampled | {N} root spans |
| Top anomaly | {type}: {count} spans |
| Error spans | {N} (0 = structurally healthy) |
| Evals configured | {N} (none = no quality signal) |

---

{建议1}: 与发现1关联的具体下一步操作
{建议2}: 与发现2关联的具体下一步操作

样本: {N}个Span已分析。置信度: 高 | 中 | 低 — {如果是中或低，给出一行原因}.


**模式专属指标行：**

评估信号模式:

| 评估器 |

{eval_name}

({type}) | | 已评估Span数 | {total_count} | | 通过率 | {pass_rate}% ({pass_count}个通过 / {fail_count}个故障) | | 主要故障模式 | {name} (~{pct}%) | | 已配置评估器数 | {N} |


错误信号模式:

| 错误Span数 | {N}个已确认 | | 主要错误类型 |

{type}

({pct}%) | | 受影响操作 |

{span_name}

| | 级联模式 | 孤立 / 级联 | | 已配置评估器数 | {N}（无 = 无质量信号） |


通用模式:

| 已采样Span数 | {N}个根Span | | 主要异常 | {type}: {count}个Span | | 错误Span数 | {N}（0 = 结构健康） | | 已配置评估器数 | {N}（无 = 无质量信号） |

---

Signal Summary Table

信号摘要表格

When entering from Step 0S (classification context), replace the Signal Summary table with:


当从Step 0S（分类上下文）进入时，将信号摘要表格替换为：


Classification Signal Summary

分类信号摘要

Source: llm-obs-session-classify | ml_app: {app} | Signal: content-only | content+evals

Metric	Value
Traces classified	N
Failures in corpus (no+partial)	F
Pass rate	X%
Failure modes	list

Root cause analysis is based on per-trace classification verdicts, not automated eval judge reasoning.


**Otherwise**, mode-specific — pick the appropriate variant:

**Eval Signal** — one row per eval:

| Eval | Type | Total | Pass Rate | Status |
|------|------|------:|:---------:|--------|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |

**Error Signal** — one row per error type:

| Error Type | Spans | Cascade? | Origin Span Kind |
|------------|------:|:--------:|-----------------|
| TimeoutError | 42 | Yes | tool |

**Generic** — one row per anomaly type:

| Anomaly Type | Count |
|---|:---:|
| Latency outliers (>p90) | 12 |

---

来源: llm-obs-session-classify | ml_app: {app} | 信号: content-only | content+evals

指标	值
已分类跟踪数	N
语料库中的故障数（no+partial）	F
通过率	X%
故障模式	列表

根因分析基于每个跟踪的分类判定，而非自动化评估器推理。


**其他情况**，使用模式专属表格——选择合适的变体：

**评估信号模式** — 每个评估器一行：

| 评估器 | 类型 | 总数 | 通过率 | 状态 |
|------|------|------:|:---------:|--------|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ 正在分析 |

**错误信号模式** — 每个错误类型一行：

| 错误类型 | Span数量 | 是否级联 | 来源Span类型 |
|------------|------:|:--------:|-----------------|
| TimeoutError | 42 | 是 | tool |

**通用模式** — 每个异常类型一行：

| 异常类型 | 数量 |
|---|:---:|
| 延迟异常值（>p90） | 12 |

---

Failure Taxonomy

故障分类法

#	Failure Mode	Traces	%	Severity	Root Cause
1	...	...	...%	High	Tool Misuse

#	故障模式	跟踪数	%	严重度	根因
1	...	...	...%	高	工具误用

Failure Mode Sections (one per top 3–5 modes)

故障模式章节（前3–5个模式各一个）


Failure Mode N: [Name]

故障模式N: [名称]

Count: {n} spans, {t} traces | Severity: High/Medium/Low | Root Cause: [Category]

[3–5 sentences: what goes wrong, when, what triggers it, causal chain.]

Evidence

{Use the mode-appropriate column set:}

Eval Signal — Trace | Judge verdict | What the trace revealed:

Trace	Judge verdict	What the trace revealed
69de86a7...	fail	Parent agent has no date format instruction

Error Signal — Trace | Behavior | Version:

Trace	Behavior	Version
69de86a7...	7 parallel calls, all 400	v107624932

Generic — Trace | Anomaly | Signal:

Trace	Anomaly	Signal
69de86a7...	94s, 12 tool calls	Latency outlier

{For tool misuse — add a tool inputs table:} Tool inputs (100% of sampled calls)

Argument	Value passed (wrong)	Correct approach
`query`	`"monitor_id:123 group_status:alert"`	`"monitor_id:123"` (name/tag only)

{For Eval Signal — add judge reasoning as a blockquote:}

"{quoted judge reasoning}"

Root cause: [WHY this happens — specific span, parameter, or prompt.]

Fix: BEFORE: [actual text from trace] AFTER: [proposed replacement]

Impact: Eliminates ~{n} spans / {timeframe}.

---

数量: {n}个Span, {t}个跟踪 | 严重度: 高/中/低 | 根因: [类别]

[3–5句话：问题是什么，何时发生，触发条件是什么，因果链是什么。]

证据

{使用模式专属列集：}

评估信号模式 — 跟踪 | 评估判定 | 跟踪揭示的信息:

跟踪	评估判定	跟踪揭示的信息
69de86a7...	fail	父级Agent没有日期格式指令

错误信号模式 — 跟踪 | 行为 | 版本:

跟踪	行为	版本
69de86a7...	7次并行调用，全部返回400	v107624932

通用模式 — 跟踪 | 异常 | 信号:

跟踪	异常	信号
69de86a7...	94秒，12次工具调用	延迟异常值

{对于工具误用 — 添加工具输入表格：} 工具输入（100%已采样调用）

参数	传递的错误值	正确方法
`query`	`"monitor_id:123 group_status:alert"`	`"monitor_id:123"` （仅名称/标签）

{对于评估信号模式 — 添加评估器推理作为块引用：}

"{引用的评估器推理}"

根因: [问题发生的原因 — 具体Span、参数或提示。]

修复方案: 修复前: [来自跟踪的实际文本] 修复后: [建议的替换文本]

影响: 消除~{n}个Span / {timeframe}。

---

Prioritized Action Plan

优先级行动计划

Priority	Action	Confidence	Impact
1	Fix `monitor_groups_search` schema — add `group_states` param	High	Eliminates ~21 spans/7d

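When mode is Generic and no evals are configured, always append as the final action plan row:

| N | Configure at least one evaluator | High | Enables Eval Signal mode for future RCAs; the app currently has no ongoing quality signal |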

优先级	操作	置信度	影响
1	修复 `monitor_groups_search` schema — 添加 `group_states` 参数	高	消除~21个Span/7天

当模式为通用模式且未配置评估器时，始终在行动计划最后添加一行：

| N | 配置至少一个评估器 | 高 | 为未来RCA启用评估信号模式 — 当前应用无持续质量信号 |

Limitations & Follow-ups

局限性与后续工作

Bullet list of what needs more data or follow-up action.

项目符号列表，说明需要更多数据或后续操作的内容。

Operating Rules

操作规则

- Ground in evidence: Every claim references span IDs with clickable trace links: `[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})`
- Root cause over symptom: "System prompt doesn't specify date format" not "model gave wrong answer."
- Show your math: "47 failures (34%)" not "many failures."
- Honest about uncertainty: < 5 examples = tentative. Flag it.
- Anonymize PII: No emails or names. User/org IDs are fine.
- MCP result parsing safety: Before writing any script that iterates over MCP tool results, inspect the raw structure first: check top-level keys and whether the payload is nested inside a content block (e.g. `[{'type': 'text', 'text': '<json>'}]`). Extract and `json.loads()` the inner payload if needed. Never assume MCP results are bare dicts or lists.
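
The parsing-safety and trace-link rules can be sketched as two helpers. Both function names are hypothetical, and the unwrapping logic assumes the only wrapper shape is the list of text content blocks shown above; any other structure is returned untouched for manual inspection.

```python
import json


def unwrap_mcp_result(result):
    """Return the decoded payload, unwrapping text content blocks if present."""
    is_block_list = (
        isinstance(result, list)
        and len(result) > 0
        and all(isinstance(block, dict)
                and block.get("type") == "text"
                and "text" in block
                for block in result)
    )
    if is_block_list:
        decoded = [json.loads(block["text"]) for block in result]
        # A single block is the common case; return its payload directly.
        return decoded[0] if len(decoded) == 1 else decoded
    return result  # already a bare dict/list (or something to inspect by hand)


def trace_link(trace_id):
    """Build the clickable markdown link format required for evidence."""
    return (f"[Trace {trace_id[:8]}...]"
            f"(https://app.datadoghq.com/llm/traces?query=trace_id:{trace_id})")
```

In pup mode the unwrapping step is unnecessary, because pup output is parsed directly as JSON.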

基于证据: 每个声明都引用带可点击跟踪链接的Span ID:

[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})

。

聚焦根因而非症状: 例如“系统提示未指定日期格式”而非“模型给出错误答案”。
展示数据: 例如“47个故障（34%）”而非“大量故障”。
坦诚不确定性: 示例数<5 = 暂定结论。标记出来。
匿名化PII: 不要包含邮箱或姓名。用户/组织ID可以保留。
MCP结果解析安全: 在编写任何遍历MCP工具结果的脚本前，先检查原始结构——检查顶级键以及负载是否嵌套在内容块中（例如
```
[{'type': 'text', 'text': '<json>'}]
```
）。如果需要，提取并
```
json.loads()
```
内部负载。永远不要假设MCP结果是裸字典或列表。

Tool Reference

工具参考

This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.

本附录仅适用于 pup模式。在MCP模式下，直接使用工作流章节中的工具名称。

Spans and traces

Spans和跟踪

MCP Tool	pup Command
`search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)`	`pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]` — always use `--query "@ml_app:A"` to filter by ml_app; the `--ml-app A` flag is unreliable and silently returns spans from other apps.
`get_llmobs_span_details(trace_id, span_ids, from, to)`	`pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...`
`get_llmobs_span_content(trace_id, span_id, field, path)`	`pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]`
`get_llmobs_trace(trace_id, include_tree)`	`pup llm-obs spans get-trace --trace-id T [--include-tree]`
`get_llmobs_agent_loop(trace_id, span_id)`	`pup llm-obs spans get-agent-loop --trace-id T [--span-id S]`
`find_llmobs_error_spans(trace_id)`	`pup llm-obs spans find-errors --trace-id T`
`expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)`	`pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]`
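
As an illustration of the `spans search` mapping, the sketch below builds the argument list for a pup call while always filtering by `@ml_app:` inside `--query` instead of using `--ml-app`. The function name is hypothetical, only flags from the table above are used, and the `@status:error` filter in the example is an assumed query term.

```python
def pup_spans_search_argv(ml_app, other_filters="", time_from=None, time_to=None,
                          limit=None, root_spans_only=False, summary=False):
    """Build argv for `pup llm-obs spans search`, filtering ml_app via --query."""
    query = f"@ml_app:{ml_app} {other_filters}".strip()
    argv = ["pup", "llm-obs", "spans", "search", "--query", query]
    if time_from:
        argv += ["--from", time_from]
    if time_to:
        argv += ["--to", time_to]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if root_spans_only:
        argv.append("--root-spans-only")
    if summary:
        argv.append("--summary")
    return argv


argv = pup_spans_search_argv("checkout-bot", "@status:error",
                             time_from="7d", limit=50, root_spans_only=True)
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`, with `json.loads` applied to stdout since pup outputs JSON.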

MCP工具	pup命令
`search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)`	`pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]` — 必须使用 `--query "@ml_app:A"` 过滤ml_app； `--ml-app A` 标志不可靠，会静默返回其他应用的Span。
`get_llmobs_span_details(trace_id, span_ids, from, to)`	`pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...`
`get_llmobs_span_content(trace_id, span_id, field, path)`	`pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]`
`get_llmobs_trace(trace_id, include_tree)`	`pup llm-obs spans get-trace --trace-id T [--include-tree]`
`get_llmobs_agent_loop(trace_id, span_id)`	`pup llm-obs spans get-agent-loop --trace-id T [--span-id S]`
`find_llmobs_error_spans(trace_id)`	`pup llm-obs spans find-errors --trace-id T`
`expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)`	`pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]`

Evaluators

评估器

MCP Tool	pup Command
`list_llmobs_evals()`	`pup llm-obs evals list` (filter by `ml_app` client-side)
`list_llmobs_evals_by_ml_app(ml_app)`	`pup llm-obs evals list-by-ml-app --ml-app A`
`get_llmobs_evaluator(eval_name)`	`pup llm-obs evals get-evaluator EVAL_NAME`
`get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)`	`pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]`

MCP工具	pup命令
`list_llmobs_evals()`	`pup llm-obs evals list` （客户端侧按 `ml_app` 过滤）
`list_llmobs_evals_by_ml_app(ml_app)`	`pup llm-obs evals list-by-ml-app --ml-app A`
`get_llmobs_evaluator(eval_name)`	`pup llm-obs evals get-evaluator EVAL_NAME`
`get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)`	`pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]`

Notebooks

笔记本

MCP Tool	pup Command
`create_datadog_notebook(name, cells, ...)`	`pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` (confirm exact flags with `pup notebooks create --help`)
`edit_datadog_notebook(id, cells, append_only=true)`	`pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` (fetches current notebook, appends provided cells, writes back)

The cells file is a JSON array of cell objects:

[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]
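
A cells file in this shape can be produced with a few lines of Python. The helper names and the (title, body) input format are assumptions made for this sketch; the cell structure itself is copied from the example above.

```python
import json


def markdown_cell(text):
    """Wrap markdown text in the notebook cell object shown above."""
    return {"attributes": {"definition": {"type": "markdown", "text": text}},
            "type": "notebook_cells"}


def write_cells_file(sections, path):
    """Write one markdown cell per (title, body) section as a JSON array."""
    cells = [markdown_cell(f"## {title}\n\n{body}") for title, body in sections]
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cells, handle, ensure_ascii=False)
    return cells
```

For example, `write_cells_file([("Section", "Content.")], "/tmp/nb_cells.json")` writes a file equivalent to the one-cell array above.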

MCP工具 pup命令

create_datadog_notebook(name, cells, ...)

pup notebooks create --title "TITLE" --file /tmp/nb_cells.json

— 使用

pup notebooks create --help

确认准确标志

edit_datadog_notebook(id, cells, append_only=true)

pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json

（获取当前笔记本，附加提供的单元格，写回）

单元格文件是JSON数组，包含单元格对象：

json

[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]