llm-obs-trace-rca

Backend

Detection — At the start of every invocation, before taking any action, determine which backend to use:
  1. If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
  2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__list_llmobs_evals` appears in your available tools.
  3. If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
  4. If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
  5. If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
  6. If neither is available → stop and tell the user: "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."
`--backend pup` is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.
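The detection order above can be sketched as a small pure function. This is only an illustration: `split_backend_flag`, `choose_backend`, and the `pup_available` argument (standing in for the result of running `pup --version`) are hypothetical names, not part of the skill.

```python
from typing import List, Optional, Tuple

# Canonical MCP signal named in the detection steps.
MCP_SIGNAL = "mcp__datadog-llmo-mcp__list_llmobs_evals"


def split_backend_flag(args: List[str]) -> Tuple[bool, List[str]]:
    """Return (pup_forced, remaining_args) with `--backend pup` stripped."""
    remaining: List[str] = []
    forced = False
    i = 0
    while i < len(args):
        if args[i] == "--backend" and i + 1 < len(args) and args[i + 1] == "pup":
            forced = True
            i += 2  # drop both the flag and its value
            continue
        remaining.append(args[i])
        i += 1
    return forced, remaining


def choose_backend(
    args: List[str], available_tools: List[str], pup_available: bool
) -> Tuple[Optional[str], List[str]]:
    """Apply the detection steps; None means neither backend is usable."""
    forced, remaining = split_backend_flag(args)
    if forced:
        return "pup", remaining   # step 1: explicit flag wins
    if MCP_SIGNAL in available_tools:
        return "mcp", remaining   # steps 2-3
    if pup_available:
        return "pup", remaining   # steps 4-5
    return None, remaining        # step 6: stop and tell the user
```

A `None` backend corresponds to step 6, where the skill stops and prints the setup message.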
pup invocation rules:
  • Invoke via Bash: `pup llm-obs <subcommand> [flags]`
  • pup always outputs JSON. Parse directly — no content-block unwrapping (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": "<json>"}]`).
  • If pup returns an auth error, tell the user to run `pup auth login` and stop.
  • Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
  • Time flags: pup accepts bare duration strings (`1h`, `7d`, `30m`) and RFC3339 timestamps. Do not use `now-`-prefixed strings — strip the prefix when converting from a skill `--timeframe` argument: `now-7d` → `7d`, `now-24h` → `24h`, `now-30d` → `30d`.
  • `--summary` on `pup llm-obs spans search` strips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.
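The time-flag rule amounts to dropping a leading `now-` and passing everything else through. A minimal sketch, with `to_pup_time` as a hypothetical helper name:

```python
def to_pup_time(timeframe: str) -> str:
    """Convert a skill --timeframe value to the form pup accepts.

    Strips a leading "now-" prefix; bare durations and RFC3339
    timestamps are returned unchanged.
    """
    prefix = "now-"
    if timeframe.startswith(prefix):
        return timeframe[len(prefix):]
    return timeframe
```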
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.
Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-trace-rca[<inv_id>] — ` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-trace-rca:start[<inv_id>] — ` instead (note the `:start` suffix). Example first call:
`skill:llm-obs-trace-rca:start[3a9f1c2b] — Phase 0: discover configured evals for task-cruncher to infer analysis mode`
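One way to produce the ID and the tagged intent string is sketched below. The helper names `new_invocation_id` and `intent_tag` are hypothetical; only the string formats come from the rules above.

```python
import secrets


def new_invocation_id() -> str:
    """Generate an 8-character lowercase hex ID (4 random bytes)."""
    return secrets.token_hex(4)


def intent_tag(inv_id: str, reason: str, first_call: bool = False) -> str:
    """Build the telemetry.intent value; only the first call gets `:start`."""
    name = "skill:llm-obs-trace-rca:start" if first_call else "skill:llm-obs-trace-rca"
    return f"{name}[{inv_id}] — {reason}"
```

The ID is generated once and reused for every call in the invocation; only `first_call` changes between the first tool call and the rest.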

LLM Obs Trace RCA — Root Cause Analysis from Production LLM Traces

Diagnose why an LLM application is failing by searching production traces and walking the span tree from symptom to root cause. The skill automatically selects the best analysis mode based on available signals:
| Mode | Signal | When auto-selected |
|---|---|---|
| Eval Signal | LLM judge verdicts and reasoning (pass/fail rates, scoring) | Evaluators are configured for the app |
| Error Signal | Runtime errors (`@status:error`, error types, stack traces) | No evals configured, or user explicitly asks about errors/crashes |
| Generic | Structural anomalies (latency, agent loops, retrieval misses) | Explicit `mode=generic` override, or no strong signal found in Phase 1 |
The mode is announced (never asked) in the first user-facing output with a one-line override hint.

Methodology

Resolve → Search → Observe → Open Coding → Axial Coding → Root Cause Navigation → Recommendations

Usage

What's wrong with <ml_app> over the last <timeframe>
Why is <ml_app> failing today
Analyze eval failures for <eval_name> on <ml_app>
Look at the errors on <ml_app> over the last <timeframe>
Root-cause low scores on <eval_name>

Inputs

| Input | Required | Default | Description |
|---|---|---|---|
| `ml_app` | Yes (or `eval_name`) | | The application to analyze. |
| `eval_name` | No | | One or more evaluators to focus on. Implies Eval Signal mode. Pass a list for multi-eval analysis. |
| `timeframe` | No | `now-24h` | How far back to look. |
| `mode` | No | inferred | Explicit mode override: `eval`, `errors`, `generic`. Skips inference entirely. |
| `failure_filter` | No | | Narrowing scope: `"errors"` (routes to Error Signal path), `"high latency"` (post-fetch duration sort), `"low scores on <eval>"` (promotes to `eval_name`), a tool name or span name (`@name:<x>` query). |

If neither `ml_app` nor `eval_name` is provided, ask the user.

Available Tools

Eval discovery & overview

| Tool | Purpose |
|---|---|
| `list_llmobs_evals` | Discover all configured evals for an `ml_app`. Used in Phase 0 mode inference. |
| `get_llmobs_eval_aggregate_stats` | Pass/fail rate or score distribution for an eval over a time window. |
| `get_llmobs_evaluator` | Full evaluator config: prompt template, assessment criteria, span filter, sampling, provider. Use instead of the deprecated `get_llmobs_eval_config`. |

Trace & span exploration

| Tool | Purpose |
|---|---|
| `search_llmobs_spans` | Find spans by tags, span kind, status, query syntax. Paginate with cursor. Entry point for Phase 1. |
| `get_llmobs_span_details` | Metadata, evaluations (scores, labels, reasoning), `status`, error fields, duration, and `content_info` map showing available fields + sizes. |
| `get_llmobs_span_content` | Actual content for a span field. Supports JSONPath via `path` param for targeted extraction. |
| `get_llmobs_trace` | Full trace hierarchy as span tree with span counts by kind. |
| `find_llmobs_error_spans` | All error spans in a trace with error type, message, stack, and propagation context. |
| `expand_llmobs_spans` | Load children of collapsed trace nodes. |
| `get_llmobs_agent_loop` | Chronological agent execution timeline (LLM calls, tool invocations, decisions). May return empty — see Phase 4b fallback. |

Key `get_llmobs_span_content` patterns

| Field | Path | What you get |
|---|---|---|
| `messages` | `$.messages[0]` | System prompt (first message, usually `system` role) |
| `messages` | `$.messages[-1]` | Last assistant response |
| `messages` | (no path) | Full conversation including tool calls |
| `input` / `output` | | Span I/O |
| `documents` | | Retrieved documents (RAG apps) |
| `metadata` | | Custom metadata (prompt versions, feature flags, user segments) |

How to use `search_llmobs_spans`

Always include `@ml_app:"<ml_app>"` in the `query` string — the structured `ml_app` parameter is unreliable and can return spans from other apps. Do not rely on the structured parameter alone.
Useful query fragments — combine with space (AND):

| Goal | Query |
|---|---|
| Errors only | `@status:error` |
| Eval is present on the span | `@evaluations.custom.<eval_name>:*` (presence only — pass/fail is read from `get_llmobs_span_details`, not the query) |
| A specific tool by name | `@name:<tool_name>` |

Dedicated params (`span_kind`, `root_spans_only`, `ml_app`) work alongside `query`, but `query` takes precedence over `tags`.
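Because the `@ml_app` term must always be in the query string, it can help to build queries through one helper that always prepends it. A minimal sketch; `build_query` is a hypothetical name, and the fragments are the ones listed above:

```python
def build_query(ml_app: str, *fragments: str) -> str:
    """Join query fragments with spaces (AND), always leading with @ml_app."""
    parts = [f'@ml_app:"{ml_app}"']
    parts.extend(fragment for fragment in fragments if fragment)
    return " ".join(parts)
```

For example, `build_query("task-cruncher", "@status:error")` yields `@ml_app:"task-cruncher" @status:error`.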

Parallelization rules

  1. `get_llmobs_span_details`: Group span_ids by trace_id, chunk each trace's span_ids into batches of at most 20. Issue ALL chunks for a page in a single message.
  2. `get_llmobs_span_content`: Each call is independent — always issue ALL in a single message.
  3. `get_llmobs_trace` / `find_llmobs_error_spans` / `get_llmobs_agent_loop`: Parallelize across different traces in a single message.
  4. Pipeline parallelism: Start `get_llmobs_span_details` for page 1 results immediately — don't wait to collect all pages.
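Rule 1 is a group-then-chunk operation. A minimal sketch, assuming the search results have already been reduced to `(trace_id, span_id)` pairs; `batch_span_ids` is a hypothetical name:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def batch_span_ids(
    spans: Iterable[Tuple[str, str]], max_batch: int = 20
) -> List[Tuple[str, List[str]]]:
    """Group span_ids by trace_id, then chunk each group into batches.

    Each returned (trace_id, span_ids) pair maps to one
    get_llmobs_span_details call; all of them go out in one message.
    """
    by_trace: Dict[str, List[str]] = defaultdict(list)
    for trace_id, span_id in spans:
        by_trace[trace_id].append(span_id)

    batches: List[Tuple[str, List[str]]] = []
    for trace_id, span_ids in by_trace.items():
        for start in range(0, len(span_ids), max_batch):
            batches.append((trace_id, span_ids[start:start + max_batch]))
    return batches
```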

Analysis Workflow

Output discipline: Phases 0–5 are internal analysis. The only user-facing outputs during these phases are the Phase 1 Signal Summary and the mandatory checkpoints at Phases 2 and 3. Do NOT narrate reasoning, summarize intermediate findings, or output Phase 4 deep-dive results as prose. All detailed findings go exclusively into the Phase 6 report.

Phase 0: Resolve Inputs & Infer Mode

First: check for classification context. Scan the conversation for a `# Session Classification Summary` header. If found → enter Step 0S below and skip all remaining Phase 0 steps and Phase 1 entirely.

Step 0S — Extract Failure Bucket from Classification Output

The canonical handoff format is the Per-Unit Details table inside the `# Session Classification Summary` section. Extract one row per unit:

| Field | Source |
|---|---|
| `trace_id` | Link URL in the ID column: parse the `trace_id=` or `session_id=` query parameter from the link href |
| `verdict` | Verdict column |
| `failure_mode` | Failure Mode column (`none` for passing rows) |
| `detail` | Reason column — use as the Phase 2 reasoning input (same role as eval judge reasoning or error messages) |
| `app_type` | From the `# Session Classification Summary` header line (e.g. `Root span kind: agent`) — default `LLM` if absent |

Failure bucket = all rows where verdict is `no` or `partial`.
  • < 5 entries → note low confidence, proceed anyway.
  • Empty → report "No failures found in the classification output" and stop.
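The extraction and bucketing rules can be sketched as follows. The row shape (a dict with a `verdict` key) and the helper names are assumptions made for illustration; the real input is the rendered Per-Unit Details table.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse


def unit_id_from_href(href: str) -> Optional[str]:
    """Read trace_id= (or, failing that, session_id=) from a link's query string."""
    params = parse_qs(urlparse(href).query)
    for key in ("trace_id", "session_id"):
        values = params.get(key)
        if values:
            return values[0]
    return None


def failure_bucket(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows whose verdict is `no` or `partial`."""
    return [row for row in rows if row["verdict"] in ("no", "partial")]


def bucket_status(bucket: List[Dict[str, str]]) -> str:
    """Classify the bucket: empty -> stop, fewer than 5 -> low confidence."""
    if not bucket:
        return "empty"
    if len(bucket) < 5:
        return "low_confidence"
    return "ok"
```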
Present this overview before proceeding:

Classification Overview (from llm-obs-session-classify)

ml_app: <from summary header> | Classified: N | Failures (no+partial): F | Pass rate: X%
| Failure Mode | Count |
|---|---|
| ... | ... |
Proceeding to Phase 2 using F failure traces. Mode inference bypassed — classification verdict is the signal.

Then **skip Phase 1 and jump directly to Phase 2**. Carry forward:
- Phase 2 reasoning input: `(trace_id, span_id, detail)` tuples — same structure as eval reasoning or error messages
- Phase 4 navigation: use `app_type` from each trace block to choose the span navigation strategy
- Phases 2–7: run completely unchanged — the failure bucket structure is identical regardless of source

---

**Standard resolution (no classification context):**

1. If neither `ml_app` nor `eval_name` provided → ask the user. If `eval_name` is provided but `ml_app` is not → also ask for `ml_app` (eval names are not globally unique; without it, span searches return results from all apps sharing the eval name).
2. If `timeframe` not provided → default to `now-24h`.
3. **Resolve `failure_filter`** (before mode inference):
   - `"errors"` → force **Error Signal** mode
   - `"low scores on <eval>"` → treat as `eval_name=<eval>`, then continue inference
   - `"high latency"` → note for Phase 1 (sort by duration post-fetch); continue inference
   - Tool/span name → note as `@name:<x>` query fragment for Phase 1; continue inference
4. **Resolve mode** (skip if `mode` was explicitly provided):
   - `eval_name` given → **Eval Signal**
   - User explicitly mentioned errors/exceptions/crashes → **Error Signal**
   - Otherwise → call `list_llmobs_evals(ml_app)`:
     - Evals returned → **Eval Signal**
     - No evals → **Error Signal** (announce auto-selection in Phase 1)
5. When `eval_name` is multi-valued, note for Phase 1: run parallel per-eval searches and merge+dedup by `(trace_id, span_id)`.
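
The resolution order in steps 3 and 4 can be sketched as a pure function. Everything here is illustrative: `resolve_mode` and its parameters are hypothetical names, `configured_evals` stands in for the result of the eval-listing call, and the sketch assumes an explicit `mode` wins over `failure_filter`, which the steps above do not state outright.

```python
from typing import List, Optional

LOW_SCORES_PREFIX = "low scores on "


def resolve_mode(
    mode: Optional[str],
    eval_names: List[str],
    failure_filter: Optional[str],
    user_mentions_errors: bool,
    configured_evals: List[str],
) -> str:
    """Return "eval" or "errors" (or the explicit override) per steps 3-4."""
    if mode:
        return mode  # assumption: an explicit override skips everything else
    if failure_filter == "errors":
        return "errors"  # step 3: forced Error Signal
    names = list(eval_names)
    if failure_filter and failure_filter.startswith(LOW_SCORES_PREFIX):
        # step 3: "low scores on <eval>" is treated as eval_name=<eval>
        names.append(failure_filter[len(LOW_SCORES_PREFIX):])
    if names:
        return "eval"
    if user_mentions_errors:
        return "errors"
    # step 4 fallback: evals configured -> Eval Signal, otherwise Error Signal
    return "eval" if configured_evals else "errors"
```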

---

Phase 1: Find Problematic Spans

Three mode-specific paths. All end with a Signal Summary that labels the mode and includes a one-line override hint.
Mode switch handling: At any checkpoint, if the user says "switch to [error|eval|generic] mode", re-enter Phase 1 with the new mode. Phase 0 inputs do not re-resolve.
Auto-pivot: If the selected mode finds no data (0 evals configured, 0 error spans in timeframe), announce the pivot to Generic and proceed — do not stop and ask.

Eval Signal path

Step 1a: Eval overview (parallel)
For each eval, call both in a single parallel batch:
  • `get_llmobs_eval_aggregate_stats(eval_name, from, to)`
  • `get_llmobs_evaluator(eval_name)`
Interpret aggregate stats:
  • `total_count == 0` → Note "no data." Skip this eval (or pivot to Generic if it's the only one).
  • Boolean `pass_rate == 1.0` → Note "100% pass." Skip unless it's the only eval.
  • Boolean with failures → Note counts and pass_rate. Continue.
  • Score with assessment criteria → Note distribution and pass/fail counts. Continue.
  • Score WITHOUT assessment criteria → Infer failures: bottom quartile, or below median if bimodal. Label as "inferred failures" in report.
  • Categorical with assessment criteria → Note top_values and pass/fail. Continue.
  • Categorical WITHOUT assessment criteria → Infer from context (e.g., "error", "incomplete", "off_topic" are likely failures). Ask user if genuinely ambiguous.
Interpret eval config:
  • Config returned (custom) → Store `prompt_template`, `assessment_criteria`, `parsing_type`, `output_schema`.
  • Config nil (OOTB) → Note prompt is not inspectable.
Calibration cross-check: When two evals share a name prefix but differ in type (e.g. `foo-boolean` and `foo-score`), compare their pass rates on overlapping spans. A discrepancy >20% is an Evaluator Calibration Discrepancy — flag it in the report.
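Two of the rules above are numeric and can be sketched directly: the bottom-quartile inference for score evals without assessment criteria, and the calibration cross-check. The function names and input shapes are hypothetical, and the sketch reads ">20%" as a gap of more than 20 percentage points between pass rates, which is an interpretation rather than something the text specifies.

```python
import statistics
from typing import Dict, List, Optional


def inferred_failures(scores: List[float]) -> List[float]:
    """Treat scores at or below the first quartile as inferred failures."""
    if len(scores) < 2:
        return []  # too few points to estimate a quartile
    q1 = statistics.quantiles(scores, n=4)[0]
    return [score for score in scores if score <= q1]


def calibration_gap(
    passed_a: Dict[str, bool], passed_b: Dict[str, bool], threshold: float = 0.20
) -> Optional[float]:
    """Compare pass rates of two evals on the spans they both scored.

    Inputs map span_id -> passed. Returns the absolute gap when it
    exceeds the threshold, otherwise None.
    """
    shared = passed_a.keys() & passed_b.keys()
    if not shared:
        return None
    rate_a = sum(passed_a[s] for s in shared) / len(shared)
    rate_b = sum(passed_b[s] for s in shared) / len(shared)
    gap = abs(rate_a - rate_b)
    return gap if gap > threshold else None
```

The bimodal case ("below median") is left out; detecting bimodality needs a judgment call the sketch does not try to automate.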
Step 1b: Collect failure spans
For each eval:
  1. `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @evaluations.custom.<eval_name>:*", from, limit=50)`. When multi-valued, issue one search per eval in parallel — merge result sets, dedup by `(trace_id, span_id)`.
  2. Paginate until ≥15–20 failures OR no more pages. Cap at 200 spans total.
  3. `get_llmobs_span_details` per trace_id batch (follow Parallelization Rules).
  4. Extract per row: assessment, value, reasoning, span_id, trace_id, span_kind, content_info.
  5. Separate into pass/fail buckets using thresholds from Step 1a.
JSON-type eval fallback: If `@evaluations.custom.<eval_name>:*` returns 0 spans but `get_llmobs_eval_aggregate_stats` confirmed `total_count > 0`, the eval is JSON-type and scores are not indexed on this field. Fall back to: search by the span name or span kind that the eval targets (check `get_llmobs_evaluator` for the span filter), then inspect output payloads for JSON verdict fields via `get_llmobs_span_content(field="output")`.
Step 1c: Signal Summary (Eval Signal)

Signal Summary: {ml_app} · Eval Signal

(Inferred from {N} configured eval(s). Say `switch to error mode` or `switch to generic mode` to change.)
Timeframe: {from} → {to}

| Eval | Type | Total | Pass Rate | Status |
|---|---|---|---|---|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |
| eval_2 | score | 1,200 | — (inferred threshold) | ⚠ Investigating |
| eval_3 | boolean | 500 | 99.2% | ✓ Healthy |

Collected: {pass_count} passing, {fail_count} failing.

For a single eval, collapse to a single-line header instead of a table.

---

Error Signal path

Step 1a: Sample error spans
Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:error", from=timeframe, limit=50)`. Paginate until ≥30 error spans or no more pages.
Step 1a.5: Soft error scan
MCP tool spans sometimes report `@status:ok` but carry `"isError": true` in their output payload — these are invisible to `@status:error` queries and can outnumber hard errors.
Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", span_kind="tool", from=timeframe, limit=20)`. For a sample of 5–10 results, call `get_llmobs_span_content(field="output")` in parallel. If any payloads contain `"isError": true`, add MCP soft errors as a separate row in the error frequency table with the note: (status:ok but isError:true in payload — not queryable via @status:error).
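Checking a sampled payload for the soft-error marker can be sketched as below. It assumes the `output` field arrives as a JSON string and that `isError` may sit at any nesting depth; both are assumptions, and `is_soft_error` is a hypothetical name.

```python
import json
from typing import Any


def _has_is_error(node: Any) -> bool:
    """Recursively look for an `isError: true` entry in parsed JSON."""
    if isinstance(node, dict):
        if node.get("isError") is True:
            return True
        return any(_has_is_error(value) for value in node.values())
    if isinstance(node, list):
        return any(_has_is_error(item) for item in node)
    return False


def is_soft_error(output_payload: str) -> bool:
    """True when a status:ok tool span's output carries isError: true."""
    try:
        parsed = json.loads(output_payload)
    except (TypeError, ValueError):
        return False  # non-JSON output cannot carry the marker
    return _has_is_error(parsed)
```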
Step 1b: Group by error type
Group spans by `error_type` tag → frequency table. If `error_type` tag is absent on some spans, supplement with the `error.type` field from `get_llmobs_span_details` (fetched in Step 1d).
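The grouping with its fallback can be sketched with a counter. The span shape used here (a dict with an optional `error_type` key and an optional nested `error.type`) is an assumption for illustration, as is the `"unknown"` bucket for spans that have neither.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def error_frequency(spans: List[Dict[str, Any]]) -> List[Tuple[str, int]]:
    """Count spans per error type, most frequent first.

    Prefers the `error_type` tag and falls back to `error.type` from the
    span details when the tag is missing.
    """
    counts: Counter = Counter()
    for span in spans:
        error_type = (
            span.get("error_type")
            or (span.get("error") or {}).get("type")
            or "unknown"
        )
        counts[error_type] += 1
    return counts.most_common()
```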
Step 1c: Fetch stack traces (parallel)
For the top 3–4 error types by count, pick 2–3 representative trace IDs each. Call `find_llmobs_error_spans(trace_id)` in parallel across all selected traces. Extract:
  • Error message and stack trace
  • Origin span kind and name
  • Whether errors propagate from children to parents (cascade) or are isolated
Step 1d: Fetch span details
Call `get_llmobs_span_details` on representative spans for each error type (follow Parallelization Rules). Extract `content_info`, `span_kind`, duration.
Step 1e: Signal Summary (Error Signal)

Signal Summary: {ml_app} · Error Signal

(No evals configured — analyzing runtime errors. Say `switch to eval mode` or `switch to generic mode` to change.)
Timeframe: {from} → {to} | Total error spans sampled: {N}

| Error Type | Spans | Cascade? | Origin Span Kind |
|---|---|---|---|
| TimeoutError | 42 | Yes | tool |
| APIError 429 | 18 | No | tool |
| ValueError | 7 | No | llm |
| MCP soft errors (isError:true) | 23 | No | tool |

---

Generic path

Step 1a: Eval health check (when evals are configured)
If `list_llmobs_evals` returned evals in Phase 0, call `get_llmobs_eval_aggregate_stats` for each enabled eval in parallel. Flag any enabled eval with `total_count: 0` as Broken Eval Configuration — include in the Signal Summary anomaly table as a High severity row.
Step 1b: Broad span search
Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\"", root_spans_only=true, from=timeframe, limit=50)`. Apply `failure_filter` narrowing if present (tool/span name → `@name:<x>` query; `"high latency"` → sort result set by `duration` after Step 1c). Paginate until ≥30 spans.
Step 1c: Fetch span details
Call `get_llmobs_span_details` per trace_id batch.
Step 1d: Rank by structural anomalies
Partition spans using heuristics:
  • Top decile by `duration` (latency outliers)
  • Agent spans with >N tool/LLM iterations (long-running loops)
  • Retrieval spans returning 0 documents (RAG miss)
  • Workflow spans whose child set is missing an expected step (compare against median child layout)
  • Token efficiency: Check if `non_cached_input_tokens ≈ input_tokens` across LLM spans. If the app has stable system prompts (>1k tokens) and cache hit rate is 0%, flag as High severity — enabling `cache_control: ephemeral` on the system prompt would cut input token costs by 60–90%
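Two of these heuristics reduce to short computations. The sketch below assumes each span is a dict carrying `duration`, `input_tokens`, and `non_cached_input_tokens`; the helper names and the 1% tolerance used to decide that the two token counts are approximately equal are illustrative choices, not values from the skill.

```python
import math
from typing import Any, Dict, List


def latency_outliers(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the top decile of spans by duration (at least one span)."""
    if not spans:
        return []
    ranked = sorted(spans, key=lambda span: span["duration"], reverse=True)
    keep = max(1, math.ceil(len(ranked) / 10))
    return ranked[:keep]


def zero_cache_utilization(
    llm_spans: List[Dict[str, Any]], tolerance: float = 0.01
) -> bool:
    """True when non-cached input tokens are (nearly) all input tokens."""
    total_input = sum(span["input_tokens"] for span in llm_spans)
    total_non_cached = sum(span["non_cached_input_tokens"] for span in llm_spans)
    if total_input == 0:
        return False  # nothing to measure
    cached_share = (total_input - total_non_cached) / total_input
    return cached_share <= tolerance
```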
Step 1e: Signal Summary (Generic)

Signal Summary: {ml_app} · Generic

(Analyzing structural anomalies. Say `switch to eval mode` or `switch to error mode` to change.)
Timeframe: {from} → {to} | Sampled: {N} root spans

| Anomaly Type | Count |
|---|---|
| Latency outliers (>p90) | 12 |
| Long agent loops (>8 iterations) | 5 |
| RAG retrieval misses | 3 |
| Zero prompt cache utilization | All LLM spans |
| Broken eval configurations | 2 |

---

Phase 1.5: Determine App Profile & Where the Root Cause Lives

Inspect `content_info` and `span_kind` across collected spans. Drives Phase 4 strategy.
App profile (from content_info):

| Signal | App profile | Phase 4 strategy |
|---|---|---|
| `content_info` has `messages` | LLM/chat app | Extract system prompt via `messages[0]`, check conversation flow |
| `content_info` has `documents` | RAG app | Check retrieval quality alongside LLM output |
| Trace contains `agent` span kind | Agent app | Try `get_llmobs_agent_loop` first; if it returns empty use child-span reconstruction (see Phase 4b) |
| `messages.count > 10` | Long conversation | Check for context overflow |
| `content_info` has `metadata` | Has custom metadata | Check for clustering by metadata values (prompt version, feature flags) |
LLM Experiments traces: If root spans have `span_kind: experiment` and carry `input`, `output`, and `expected_output` structured fields, you are looking at a Datadog LLM Experiments trace. Each span represents one dataset record run. Read quality signal from the root span's `input` / `output` / `expected_output` fields via `get_llmobs_span_content` — not from LLM sub-span messages, which may contain stub or placeholder content. Evaluations attached to experiment spans are computed by the Experiments framework at run time and may not be registered as online Datadog evaluators (`get_llmobs_evaluator` will return 404 for them).
Where the root cause likely lives — by symptom span kind:

| Symptom span kind | Symptom looks like | But root cause is often in... |
|---|---|---|
| `llm` | Bad LLM response (eval flagged, wrong output) | Parent agent (bad instructions), sibling retrieval (bad context), sibling tool (bad data) |
| `agent` | Bad orchestration | Child spans (wrong tool calls, bad routing), full agent loop |
| `tool` | Bad tool result | Parent LLM (passed wrong parameters), tool implementation |
| `workflow` | Bad overall output | Child sub-spans (which step first deviated?) |
| `retrieval` | Bad retrieval | Query construction (parent), index/embedding config (outside trace) |
Key insight: The signal — eval verdict, error message, latency outlier — flags one span in isolation. It's a symptom report, not a diagnosis. The root cause often lives in a different span: a parent that gave bad instructions, a sibling that provided bad context, or a child that made a wrong decision. Phase 4 navigates the tree to find it.

Phase 2: Open Coding — Initial Failure Categorization

Goal: Read per-row evidence and propose initial, concrete failure categories. Pool all problematic rows together — categories should describe app behaviors, not which signal flagged them.
Per-row "reasoning input" by mode:
  • Eval Signal: judge assessment + reasoning from `get_llmobs_span_details`
  • Error Signal: error message + stack trace excerpt from `find_llmobs_error_spans`
  • Generic: one-line description of the structural anomaly that flagged the row
Shortcuts:
  • < 15 problematic rows: Combine Phases 2 and 3 into one pass. Still produce the checkpoint.
  • > 80% share the same reasoning/error/symptom: Skip to Phase 4 with the dominant pattern. Still output checkpoint.
  • > 50 problematic rows: Sample ~50, build taxonomy, then spot-check 10–15 more.
  1. Use per-row signal from Phase 1 — do NOT re-fetch. Only call `get_llmobs_span_content(field="input"/"output")` for spans where the reasoning is insufficient (generic, empty, or just a stack trace with no app context).
  2. If eval config is loaded (Eval Signal), distinguish early:
    • App failures: Output genuinely violates the eval's criteria
    • Eval failures: Output seems reasonable but eval criteria are too strict/ambiguous
  3. Each pattern must be specific: "Agent called search instead of calculator for price computation" — NOT "tool issue."
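The shortcut rules and the checkpoint line above can be sketched as a small helper. This is only an illustration: the function names, the returned labels, the precedence between overlapping rules (the >80% rule is checked first here), and the use of exact string equality to decide that rows "share the same reasoning" are all assumptions, not part of the skill.

```python
from collections import Counter


def choose_strategy(reasons):
    """Pick a Phase 2 shortcut from the per-row reasoning strings (hypothetical helper)."""
    n = len(reasons)
    if n == 0:
        return "nothing-to-code"
    # Share of rows carrying the single most common reasoning/error/symptom.
    # Exact string matching is an assumption; real rows usually need fuzzy grouping.
    dominant_share = Counter(reasons).most_common(1)[0][1] / n
    if dominant_share > 0.80:
        return "skip-to-phase-4"
    if n < 15:
        return "combine-phases-2-and-3"
    if n > 50:
        return "sample-50-then-spot-check"
    return "full-open-coding"


def open_coding_checkpoint(categories):
    """Render the mandatory checkpoint line from a Counter of category -> row count."""
    total = sum(categories.values())
    parts = ", ".join(f"{name} ({count})" for name, count in categories.most_common())
    return (
        f"**Open coding**: {total} problematic rows → "
        f"{len(categories)} initial categories: {parts}"
    )
```

Every branch still ends with the checkpoint, so `open_coding_checkpoint` would be called regardless of which strategy is returned.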

MANDATORY CHECKPOINT

**Open coding**: {N} problematic rows → {K} initial categories: {Category1} ({count}), {Category2} ({count}), ...

Phase 3: Axial Coding — Refine Failure Taxonomy

Goal: 3–8 final categories, ranked by impact.
  1. Merge: Categories with < 3 occurrences → parent category or drop as noise.
  2. Split: Categories with > 30% of failures → more specific sub-categories. Pull additional span content if needed.
  3. Validate: 2–3 representative examples per category confirm the label fits.
  4. Rank:
    priority = count × severity
    (severity: high / medium / low).
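Because severity is a label rather than a number, `count × severity` needs a numeric mapping before it can be computed. The sketch below assumes weights of 3/2/1 for high/medium/low; the weights, the dictionary shape, and the function name are illustrative assumptions, and the helper only flags merge and split candidates rather than deciding them.

```python
# Assumed numeric weights; the skill only names the three severity levels.
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def refine_taxonomy(categories, min_count=3, split_share=0.30):
    """Apply the merge/split thresholds and rank by count x severity.

    `categories` is a list of dicts with "name", "count", and "severity" keys
    (a hypothetical shape chosen for this example).
    """
    total = sum(c["count"] for c in categories)
    ranked, merge_or_drop, split_candidates = [], [], []
    for cat in categories:
        if cat["count"] < min_count:
            # Fewer than 3 occurrences: fold into a parent category or drop as noise.
            merge_or_drop.append(cat["name"])
            continue
        if total and cat["count"] / total > split_share:
            # More than 30% of failures: candidate for more specific sub-categories.
            split_candidates.append(cat["name"])
        priority = cat["count"] * SEVERITY_WEIGHT[cat["severity"]]
        ranked.append({**cat, "priority": priority})
    ranked.sort(key=lambda c: c["priority"], reverse=True)
    return ranked, merge_or_drop, split_candidates
```

With these weights a small high-severity category can outrank a larger low-severity one, which is the point of multiplying rather than sorting by count alone.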

MANDATORY CHECKPOINT

**Axial coding**: {merges/splits/drops}. Final categories:
1. {Category} ({count}, {pct}%) — {severity}
2. ...

Phase 4: Root Cause Analysis — Navigate from Symptom to Root Cause

Goal: The signal flagged a span. That's the symptom. Navigate the trace tree to find the actual root cause — it's often in a different span.
For each of the top 3 categories, pick 2–3 representative traces:

Step 4a: Trace structure + errors (parallel)

For each representative trace, call in a single message:
  • get_llmobs_trace(trace_id)
    — span hierarchy; locate the symptom span and its parent/siblings/children
  • find_llmobs_error_spans(trace_id)
    — check for runtime errors anywhere in the trace
Runtime vs behavioral: If errors exist on or near the symptom span, the root cause may be a runtime failure rather than a behavioral one. Check this first.
Distributed trace fallback: If
get_llmobs_trace
returns "cannot find parent" or an empty span list (common in Ray-based or multi-process execution), reconstruct the trace manually using
get_llmobs_span_details
on the span_ids collected in Phase 1, sorted by
start_ms
.
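The distributed trace fallback amounts to ordering the span details by start time. A minimal sketch follows, assuming each span detail has already been parsed into a dict with `span_id`, `span_kind`, and `start_ms` keys; the exact response shape of `get_llmobs_span_details` is an assumption here and should be checked against a real result.

```python
def reconstruct_timeline(span_details):
    """Order spans by start time and report offsets from the earliest span.

    Returns (offset_ms, span_kind, span_id) tuples. Parent/child links are not
    recovered; this only restores execution order when the tree is unavailable.
    """
    ordered = sorted(span_details, key=lambda s: s["start_ms"])
    if not ordered:
        return []
    t0 = ordered[0]["start_ms"]
    return [(s["start_ms"] - t0, s["span_kind"], s["span_id"]) for s in ordered]
```

Offsets relative to the first span make gaps and overlaps easier to read than raw epoch milliseconds.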

Step 4b: Navigate to the root cause (parallel)

Use the symptom span kind (from Phase 1.5). Issue ALL calls in a single message.
If symptom is on an
llm
span
(most common):
  • get_llmobs_span_content(field="messages", path="$.messages[0]")
    on symptom span — system prompt
  • get_llmobs_span_content(field="messages")
    on symptom span — full context received
  • get_llmobs_span_content(field="documents")
    on sibling retrieval spans (if any)
  • get_llmobs_span_content(field="input")
    on sibling tool spans (if any)
  • get_llmobs_span_content(field="messages", path="$.messages[0]")
    on parent agent/workflow span
If symptom is on an
agent
span
:
  • get_llmobs_agent_loop(trace_id, span_id)
    — full decision timeline (try first; if it returns 0 iterations, use the fallback below)
  • get_llmobs_span_details
    on child spans — sort by
    start_ms
    to reconstruct the execution timeline
  • get_llmobs_span_content(field="input"/"output")
    on child spans that look wrong
Agent loop fallback (when
get_llmobs_agent_loop
returns 0 iterations): Reconstruct the timeline from
get_llmobs_span_details
results sorted by
start_ms
. Group by
span_kind
to identify LLM → tool → LLM sequences. This fallback is frequently needed —
get_llmobs_agent_loop
returns empty for many apps.
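One way to picture the agent loop fallback: sort the child spans by `start_ms`, then start a new iteration at every `llm` span so that each LLM call is grouped with the tool calls that follow it. The dict keys and the "new iteration at each LLM call" rule are assumptions made for this sketch, not documented behavior of `get_llmobs_agent_loop`.

```python
from itertools import groupby


def agent_iterations(child_spans):
    """Group child span ids into iterations, each starting at an LLM call."""
    ordered = sorted(child_spans, key=lambda s: s["start_ms"])
    iterations = []
    for span in ordered:
        if span["span_kind"] == "llm" or not iterations:
            iterations.append([])  # a new iteration begins at each LLM span
        iterations[-1].append(span["span_id"])
    return iterations


def kind_sequence(child_spans):
    """Collapse consecutive spans of the same kind into a readable sequence."""
    ordered = sorted(child_spans, key=lambda s: s["start_ms"])
    kinds = (s["span_kind"] for s in ordered)
    return " → ".join(kind for kind, _ in groupby(kinds))
```

`kind_sequence` gives a quick shape check (for example, a missing final `llm` step), while `agent_iterations` tells you which spans to fetch content for in each step.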
If symptom is on a
tool
span
:
  • get_llmobs_span_content(field="input")
    on symptom span — what parameters was it called with?
  • get_llmobs_span_content(field="messages")
    on parent LLM span — did the LLM construct the call correctly?
If symptom is on a
workflow
span
:
  • get_llmobs_span_details
    on all child spans — find which step first deviated
  • get_llmobs_span_content(field="input"/"output")
    on the deviating child
Always also fetch:
  • get_llmobs_span_content(field="metadata")
    on the symptom span — clustering signals (prompt version, feature flags)
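The per-kind call lists above can be written down as a lookup table, which makes it easier to issue everything in one message. This is a sketch only: the target labels ("symptom", "parent", and so on), the tuple layout, and `plan_calls` are hypothetical; the tool names and field arguments are the ones listed in this step.

```python
# Each entry: (target span, MCP tool name, arguments). Target labels are informal.
CALL_PLAN = {
    "llm": [
        ("symptom", "get_llmobs_span_content", {"field": "messages", "path": "$.messages[0]"}),
        ("symptom", "get_llmobs_span_content", {"field": "messages"}),
        ("sibling retrieval", "get_llmobs_span_content", {"field": "documents"}),
        ("sibling tool", "get_llmobs_span_content", {"field": "input"}),
        ("parent", "get_llmobs_span_content", {"field": "messages", "path": "$.messages[0]"}),
    ],
    "agent": [
        ("symptom", "get_llmobs_agent_loop", {}),
        ("children", "get_llmobs_span_details", {}),
    ],
    "tool": [
        ("symptom", "get_llmobs_span_content", {"field": "input"}),
        ("parent llm", "get_llmobs_span_content", {"field": "messages"}),
    ],
    "workflow": [
        ("children", "get_llmobs_span_details", {}),
    ],
}

# Fetched for every symptom span regardless of kind.
METADATA_CALL = ("symptom", "get_llmobs_span_content", {"field": "metadata"})


def plan_calls(symptom_kind):
    """Return the first batch of calls for a symptom span kind."""
    return [*CALL_PLAN.get(symptom_kind, []), METADATA_CALL]
```

The input/output fetches on children that "look wrong" are left out on purpose, since they depend on what the first batch returns.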

Step 4c: Diagnose — from symptom to root cause

For each category, trace the causal chain:
  1. Symptom — what the signal flagged (eval reasoning, error message, anomaly note). The signal only saw one span in isolation — its reasoning may be shallow.
  2. Trace context — what surrounding spans reveal (parent instructions, sibling data, child decisions).
  3. Root cause — the specific span and decision point where the failure originated. Often NOT the symptom span itself.
For suspected eval issues (Eval Signal, if config loaded): Compare eval criteria against evidence. Is the prompt ambiguous? Criteria too strict?
Root cause categories:
| Category | Description |
|---|---|
| System Prompt Deficiency | Instructions unclear, missing, or contradictory, in the symptom span OR its parent |
| Tool Gap | Needed tool doesn't exist or parameters too coarse |
| Tool Misuse | Wrong tool called or wrong parameters; often visible in agent loop or parent LLM |
| Routing/Handoff Error | Wrong sub-agent selected (multi-agent systems) |
| Retrieval Failure | RAG returned irrelevant or missing context; check sibling retrieval spans |
| Context Overflow | Critical info lost due to context length |
| Upstream Data Issue | A sibling or parent span provided bad data that cascaded to the symptom span |
| Runtime Error | Tool/API failure, timeout, exception (from `find_llmobs_error_spans`) |
| Evaluator Miscalibration | Eval criteria produce false positives/negatives (Eval Signal mode only) |

Phase 5: Generate Recommendations

Goal: Concrete, actionable recommendations grounded in trace evidence. Actual text/code changes with before/after quotes from the trace — not generic advice.
Recommendation types: System Prompt Edit (quote actual prompt, provide before/after), Tool Gap/Misuse (reference agent loop steps), Routing/Handoff Fix, Retrieval Fix (show retrieved vs needed), Evaluator Prompt Edit (flag that eval changes need re-validation; Eval Signal only), Other.
When run in Claude Code with codebase access: Search the codebase for system prompt, tool definitions, or routing logic. Propose specific diffs. Always ask before modifying files.

Phase 6: Compile RCA Report

Write the full report following the Output Format below. This is the primary deliverable — output it directly in the chat.

Phase 7: Post-Analysis Actions

Do NOT take any action automatically. After presenting the report, ask the user what they'd like to do next:
  1. Save the report to
    llm-obs-rca-{ml_app}-{date}.md
  2. Apply fixes (if codebase is available)
  3. Deeper investigation of remaining categories
  4. Export to a Datadog notebook — in pup mode, use
    pup notebooks create
    to create the notebook and
    pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json
    to append sections (see Tool Reference)
  5. Re-run on an expanded time range (e.g.
    now-7d
    if current window was
    now-24h
    )
If the user chooses option 4, follow the notebook creation fallback pattern:
  1. Call
    mcp__datadog-mcp-core__create_datadog_notebook
    with:
    • `name`: `LLM Obs RCA: {ml_app} ({mode}) — YYYY-MM-DD`
    • `type`: `report`
    • `time_span`: `1w`
    • `cells`: one cell per section (see Notebook Cell Structure below)
  2. If the MCP call fails, inspect the error before giving up:
    • Auth / permission error (401, 403) → stop and tell the user.
    • Field validation error (error message names a specific field) → fix that field and retry the MCP call once.
    • Any other error (binding, serialization, unexpected response) → fall back to pup:
      • Write the notebook payload to
        /tmp/nb_rca_{ml_app}.json
        as a full API envelope:
        {"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
      • Run
        pup notebooks create --file /tmp/nb_rca_{ml_app}.json
      • If pup is not available either, render the full notebook content as markdown in chat so the user has it.
  3. After successful creation by either method, output the URL on its own line:
    RCA report exported to notebook: <url>
Print the URL prominently — if
/eval-bootstrap
runs next in the same session, it will detect this URL and offer to append the evaluator suite to the same notebook.
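The fallback pattern above is essentially a three-way decision plus a payload builder. The sketch below is a rough illustration: `next_step`, its return labels, and the regex test for "the error message names a specific field" are assumptions, and the `time` object is taken as a parameter because its contents are not spelled out in this document. Only the envelope and cell shapes come from the text.

```python
import json
import re

# Field names taken from the MCP call described above.
KNOWN_FIELDS = ("name", "type", "time_span", "cells")


def next_step(status_code, message):
    """Decide what to do after a failed MCP notebook call (hypothetical helper)."""
    if status_code in (401, 403):
        return "stop"  # auth / permission error: tell the user
    # Crude heuristic: treat the error as field validation if it names a known field.
    if any(re.search(rf"\b{re.escape(f)}\b", message) for f in KNOWN_FIELDS):
        return "retry-mcp-once"
    return "fallback-to-pup"


def markdown_cell(text):
    """One notebook cell in the shape shown in the Tool Reference."""
    return {
        "attributes": {"definition": {"type": "markdown", "text": text}},
        "type": "notebook_cells",
    }


def build_envelope(name, time, cells):
    """Full API envelope for `pup notebooks create --file ...`.

    `time` is supplied by the caller; its structure is not specified here.
    """
    return {
        "data": {
            "attributes": {"name": name, "time": time, "cells": cells},
            "type": "notebooks",
        }
    }


def write_payload(path, envelope):
    """Write the envelope as JSON so pup can read it from disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh, ensure_ascii=False)
```

A caller would write the result to `/tmp/nb_rca_{ml_app}.json` and then run the `pup notebooks create --file` command given above; if that also fails, the same cell texts can be printed as markdown in chat.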

Notebook Cell Structure

| Cell | Content |
|---|---|
| 1: Overview | Structured header (see Overview cell format below; follow it exactly) |
| 2: Signal Summary | Mode-specific health table |
| 3: Failure Taxonomy | Taxonomy table |
| 4…N: Failure Modes | One cell per failure mode |
| N+1: Action Plan + Limitations | Action plan table + bullet list |
Notebook formatting rules (apply to every cell):
  • No triple-backtick code blocks — use blockquotes (
    >
    ) for prompts/rubrics, inline code (
    `
    ) for short values
  • Evidence as tables — not bullet lists
  • Tool inputs as tables — Argument | Wrong value passed | Correct approach
  • Action plan as a table — Priority | Action | Confidence | Impact
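Since evidence, tool inputs, and the action plan are all expected as tables, a small renderer keeps cells consistent. This is a generic markdown-table sketch, not part of the skill; `markdown_table` is a hypothetical name, and escaping literal pipes inside cells is a choice made here.

```python
def markdown_table(headers, rows):
    """Render a pipe table; literal pipes inside cells are escaped."""
    def line(cells):
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    separator = "|" + "|".join("---" for _ in headers) + "|"
    return "\n".join([line(headers), separator, *(line(r) for r in rows)])
```

For example, `markdown_table(["Priority", "Action", "Confidence", "Impact"], rows)` produces the action plan table from a list of four-item rows.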

Output Format

Overview cell (notebook Cell 1 / report header)

The Overview cell must follow this exact structure. No prose paragraphs. No inline-numbered findings. App description is one sentence maximum.

{ml_app}
· {Eval Signal | Error Signal | Generic} · {timeframe}

Date: {YYYY-MM-DD} | Profile: {short app profile} | Model:
{model(s)}
{One sentence: what does this app do?}
| Metric | Value |
|---|---|
{mode-appropriate rows, see below}

Findings

  • {Finding 1} (~{pct}%): one-line root cause description
  • {Finding 2} (~{pct}%): one-line root cause description
  • {Finding 3} (if present): one-line root cause description

Recommendations

  • {Recommendation 1}: specific next step tied to Finding 1
  • {Recommendation 2}: specific next step tied to Finding 2
Sample: {N} spans analyzed. Confidence: High | Medium | Low — {one-line reason if Medium or Low}.

**Mode-appropriate metric rows:**

Eval Signal:
| Eval | `{eval_name}` ({type}) |
| Spans evaluated | {total_count} |
| Pass rate | {pass_rate}% ({pass_count} pass / {fail_count} fail) |
| Top failure mode | {name} (~{pct}%) |
| Evals configured | {N} |

Error Signal:
| Error spans | {N} confirmed |
| Top error type | `{type}` ({pct}%) |
| Affected operation | `{span_name}` |
| Cascade pattern | Isolated / Cascading |
| Evals configured | {N} (none = no quality signal) |

Generic:
| Spans sampled | {N} root spans |
| Top anomaly | {type}: {count} spans |
| Error spans | {N} (0 = structurally healthy) |
| Evals configured | {N} (none = no quality signal) |

---

Signal Summary Table

When entering from Step 0S (classification context), replace the Signal Summary table with:

Classification Signal Summary

Source: llm-obs-session-classify | ml_app: {app} | Signal: content-only | content+evals
| Metric | Value |
|---|---|
| Traces classified | N |
| Failures in corpus (no+partial) | F |
| Pass rate | X% |
| Failure modes | list |
Root cause analysis is based on per-trace classification verdicts, not automated eval judge reasoning.

**Otherwise**, mode-specific — pick the appropriate variant:

**Eval Signal** — one row per eval:

| Eval | Type | Total | Pass Rate | Status |
|------|------|------:|:---------:|--------|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |

**Error Signal** — one row per error type:

| Error Type | Spans | Cascade? | Origin Span Kind |
|------------|------:|:--------:|-----------------|
| TimeoutError | 42 | Yes | tool |

**Generic** — one row per anomaly type:

| Anomaly Type | Count |
|---|:---:|
| Latency outliers (>p90) | 12 |

---

Failure Taxonomy

| # | Failure Mode | Traces | % | Severity | Root Cause |
|---|---|---:|---:|---|---|
| 1 | ... | ... | ...% | High | Tool Misuse |

Failure Mode Sections (one per top 3–5 modes)

Failure Mode N: [Name]

Count: {n} spans, {t} traces | Severity: High/Medium/Low | Root Cause: [Category]
[3–5 sentences: what goes wrong, when, what triggers it, causal chain.]
Evidence
{Use the mode-appropriate column set:}
Eval Signal — Trace | Judge verdict | What the trace revealed:
| Trace | Judge verdict | What the trace revealed |
|---|---|---|
| 69de86a7... | fail | Parent agent has no date format instruction |
Error Signal — Trace | Behavior | Version:
| Trace | Behavior | Version |
|---|---|---|
| 69de86a7... | 7 parallel calls, all 400 | v107624932 |
Generic — Trace | Anomaly | Signal:
| Trace | Anomaly | Signal |
|---|---|---|
| 69de86a7... | 94s, 12 tool calls | Latency outlier |
{For tool misuse — add a tool inputs table:} Tool inputs (100% of sampled calls)
| Argument | Value passed (wrong) | Correct approach |
|---|---|---|
| `query` | `"monitor_id:123 group_status:alert"` | `"monitor_id:123"` (name/tag only) |
{For Eval Signal — add judge reasoning as a blockquote:}
"{quoted judge reasoning}"
Root cause: [WHY this happens — specific span, parameter, or prompt.]
Fix: BEFORE: [actual text from trace] AFTER: [proposed replacement]
Impact: Eliminates ~{n} spans / {timeframe}.

---

Prioritized Action Plan

| Priority | Action | Confidence | Impact |
|---|---|---|---|
| 1 | Fix `monitor_groups_search` schema: add `group_states` param | High | Eliminates ~21 spans/7d |
When mode is Generic and no evals are configured, always append as the final action plan row:
| N | Configure at least one evaluator | High | Enables Eval Signal mode for future RCAs — app currently has no ongoing quality signal |

Limitations & Follow-ups

Bullet list of what needs more data or follow-up action.

Operating Rules

  • Ground in evidence: Every claim references span IDs with clickable trace links:
    [Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})
    .
  • Root cause over symptom: "System prompt doesn't specify date format" not "model gave wrong answer."
  • Show your math: "47 failures (34%)" not "many failures."
  • Honest about uncertainty: < 5 examples = tentative. Flag it.
  • Anonymize PII: No emails or names. User/org IDs are fine.
  • MCP result parsing safety: Before writing any script that iterates over MCP tool results, inspect the raw structure first — check top-level keys and whether the payload is nested inside a content block (e.g.
    [{'type': 'text', 'text': '<json>'}]
    ). Extract and
    json.loads()
    the inner payload if needed. Never assume MCP results are bare dicts or lists.
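The parsing-safety rule can be captured in one defensive helper. The sketch below assumes the wrapped form is a list of `{"type": "text", "text": ...}` blocks as described above; `unwrap_mcp_result` is a hypothetical name, and returning the raw text when a block is not valid JSON is a choice made for this example.

```python
import json


def _maybe_json(text):
    """Parse text as JSON, falling back to the raw string."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return text


def unwrap_mcp_result(result):
    """Return the inner payload whether or not it arrived inside content blocks."""
    is_block_list = (
        isinstance(result, list)
        and bool(result)
        and all(isinstance(b, dict) and b.get("type") == "text" for b in result)
    )
    if not is_block_list:
        return result  # already a bare dict/list (or something else entirely)
    parsed = [_maybe_json(b.get("text", "")) for b in result]
    # A single block is the common case; several blocks are returned as a list.
    return parsed[0] if len(parsed) == 1 else parsed
```

Calling this once at the boundary, and inspecting its output's top-level keys before iterating, avoids scripts that silently loop over content blocks instead of spans.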

Tool Reference

This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.

Spans and traces

| MCP Tool | pup Command |
|---|---|
| `search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)` | `pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]`. Always use `--query "@ml_app:A"` to filter by ml_app; the `--ml-app A` flag is unreliable and silently returns spans from other apps. |
| `get_llmobs_span_details(trace_id, span_ids, from, to)` | `pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...` |
| `get_llmobs_span_content(trace_id, span_id, field, path)` | `pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]` |
| `get_llmobs_trace(trace_id, include_tree)` | `pup llm-obs spans get-trace --trace-id T [--include-tree]` |
| `get_llmobs_agent_loop(trace_id, span_id)` | `pup llm-obs spans get-agent-loop --trace-id T [--span-id S]` |
| `find_llmobs_error_spans(trace_id)` | `pup llm-obs spans find-errors --trace-id T` |
| `expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)` | `pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]` |
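When translating `search_llmobs_spans` to pup, the important detail is that the app filter goes into `--query` rather than `--ml-app`. The helper below is a sketch that only builds the argument list from the flags listed in this table; the function name and parameters are hypothetical, and the `@status:error` filter in the example is an illustrative query term, not something this document defines.

```python
import shlex


def spans_search_argv(ml_app, extra_query="", time_from=None, time_to=None,
                      limit=None, root_spans_only=False):
    """Build argv for `pup llm-obs spans search`, filtering ml_app via --query."""
    # The app filter is folded into the query string; --ml-app is deliberately not used.
    query = f"@ml_app:{ml_app} {extra_query}".strip()
    argv = ["pup", "llm-obs", "spans", "search", "--query", query]
    if time_from:
        argv += ["--from", time_from]
    if time_to:
        argv += ["--to", time_to]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if root_spans_only:
        argv.append("--root-spans-only")
    return argv


if __name__ == "__main__":
    # Print a shell-safe command line for a hypothetical app name.
    print(shlex.join(spans_search_argv("shop-bot", "@status:error",
                                       time_from="7d", limit=50)))
```

Returning a list instead of a string means the command can be handed to a process runner without shell quoting problems, and `shlex.join` is only needed for display.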

Evaluators

| MCP Tool | pup Command |
|---|---|
| `list_llmobs_evals()` | `pup llm-obs evals list` (filter by `ml_app` client-side) |
| `list_llmobs_evals_by_ml_app(ml_app)` | `pup llm-obs evals list-by-ml-app --ml-app A` |
| `get_llmobs_evaluator(eval_name)` | `pup llm-obs evals get-evaluator EVAL_NAME` |
| `get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)` | `pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]` |

Notebooks

| MCP Tool | pup Command |
|---|---|
| `create_datadog_notebook(name, cells, ...)` | `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` (confirm exact flags with `pup notebooks create --help`) |
| `edit_datadog_notebook(id, cells, append_only=true)` | `pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` (fetches current notebook, appends provided cells, writes back) |
The cells file is a JSON array of cell objects:
json
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]