llm-obs-trace-rca
Backend
Detection: At the start of every invocation, before taking any action, determine which backend to use:
- If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
- Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__list_llmobs_evals` appears in your available tools.
- If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
- If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
- If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
- If neither is available → stop and tell the user:
"Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."

pup invocation rules (`--backend pup`):
- Invoke via Bash: `pup llm-obs <subcommand> [flags]`
- pup always outputs JSON. Parse it directly, with no content-block unwrapping (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": "<json>"}]`).
- If pup returns an auth error, tell the user to run `pup auth login` and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
- Time flags: pup accepts bare duration strings (`1h`, `7d`, `30m`) and RFC3339 timestamps. Do not use `now-`-prefixed strings; strip the prefix when converting from a skill `--timeframe` argument: `now-7d` → `7d`, `now-24h` → `24h`, `now-30d` → `30d`.
- `--summary` on `pup llm-obs spans search` strips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.

Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.

Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-trace-rca[<inv_id>] — ` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-trace-rca:start[<inv_id>] — ` instead (note the `:start` suffix). Example first call:
`skill:llm-obs-trace-rca:start[3a9f1c2b] — Phase 0: discover configured evals for task-cruncher to infer analysis mode`
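The backend rules above can be summarized as a small decision sketch. This is an illustration only: the function names (`select_backend`, `to_pup_time`, `new_invocation_id`, `intent`) and their argument shapes are hypothetical, and `pup_version_output` is assumed to be the raw stdout of `pup --version`.

```python
import secrets

# Canonical MCP signal named in the detection steps above.
MCP_SIGNAL_TOOL = "mcp__datadog-llmo-mcp__list_llmobs_evals"


def select_backend(invocation, available_tools, pup_version_output=None):
    """Return "pup", "mcp", or None, following the detection order."""
    if "--backend pup" in invocation:
        return "pup"  # explicit override wins even when MCP tools exist
    if MCP_SIGNAL_TOOL in available_tools:
        return "mcp"
    # Assumption: pup_version_output is the raw stdout of `pup --version`.
    if pup_version_output and '"version"' in pup_version_output:
        return "pup"
    return None  # neither backend: stop and tell the user


def to_pup_time(timeframe):
    """Strip the now- prefix (now-7d -> 7d); other values pass through."""
    prefix = "now-"
    return timeframe[len(prefix):] if timeframe.startswith(prefix) else timeframe


def new_invocation_id():
    """8-character hex ID, kept constant for the whole invocation."""
    return secrets.token_hex(4)


def intent(inv_id, reason, first=False):
    """Build the telemetry.intent value; only the first call gets :start."""
    tag = "skill:llm-obs-trace-rca:start" if first else "skill:llm-obs-trace-rca"
    return f"{tag}[{inv_id}] — {reason}"
```

Keeping the override check first mirrors the rule that `--backend pup` applies regardless of whether MCP tools are present.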
LLM Obs Trace RCA — Root Cause Analysis from Production LLM Traces
Diagnose why an LLM application is failing by searching production traces and walking the span tree from symptom to root cause. The skill automatically selects the best analysis mode based on available signals:
| Mode | Signal | When auto-selected |
|---|---|---|
| Eval Signal | LLM judge verdicts and reasoning (pass/fail rates, scoring) | Evaluators are configured for the app |
| Error Signal | Runtime errors ( | No evals configured, or user explicitly asks about errors/crashes |
| Generic | Structural anomalies (latency, agent loops, retrieval misses) | Explicit |
The mode is announced (never asked) in the first user-facing output with a one-line override hint.
Methodology
Resolve → Search → Observe → Open Coding → Axial Coding → Root Cause Navigation → Recommendations
Usage
What's wrong with <ml_app> over the last <timeframe>
Why is <ml_app> failing today
Analyze eval failures for <eval_name> on <ml_app>
Look at the errors on <ml_app> over the last <timeframe>
Root-cause low scores on <eval_name>
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| `ml_app` | Yes (or `eval_name`) | — | The application to analyze. |
| `eval_name` | No | — | One or more evaluators to focus on. Implies Eval Signal mode. Pass a list for multi-eval analysis. |
| `timeframe` | No | `now-24h` | How far back to look. |
| `mode` | No | inferred | Explicit mode override (Eval Signal, Error Signal, or Generic). |
| `failure_filter` | No | — | Narrowing scope: `"errors"`, `"low scores on <eval>"`, `"high latency"`, or a tool/span name. |

If neither `ml_app` nor `eval_name` is provided, ask the user.
Available Tools
Eval discovery & overview
| Tool | Purpose |
|---|---|
| `list_llmobs_evals` | Discover all configured evals for an `ml_app`. |
| `get_llmobs_eval_aggregate_stats` | Pass/fail rate or score distribution for an eval over a time window. |
| `get_llmobs_evaluator` | Full evaluator config: prompt template, assessment criteria, span filter, sampling, provider. Supersedes an older, deprecated tool. |

Trace & span exploration
| Tool | Purpose |
|---|---|
| `search_llmobs_spans` | Find spans by tags, span kind, status, query syntax. Paginate with cursor. Entry point for Phase 1. |
| `get_llmobs_span_details` | Metadata, evaluations (scores, labels, reasoning), and `content_info`. |
| `get_llmobs_span_content` | Actual content for a span field. Supports JSONPath via `path`. |
| `get_llmobs_trace` | Full trace hierarchy as span tree with span counts by kind. |
| `find_llmobs_error_spans` | All error spans in a trace with error type, message, stack, and propagation context. |
| | Load children of collapsed trace nodes. |
| `get_llmobs_agent_loop` | Chronological agent execution timeline (LLM calls, tool invocations, decisions). May return empty; see Phase 4b fallback. |
Key `get_llmobs_span_content` patterns
| Field | Path | What you get |
|---|---|---|
| `messages` | `$.messages[0]` | System prompt (first message) |
| `messages` | | Last assistant response |
| `messages` | (no path) | Full conversation including tool calls |
| `input` / `output` | — | Span I/O |
| `documents` | — | Retrieved documents (RAG apps) |
| | — | Custom metadata (prompt versions, feature flags, user segments) |
How to use `search_llmobs_spans`
Always include `@ml_app:"<ml_app>"` in the `query` string: the structured `ml_app` parameter is unreliable and can return spans from other apps. Do not rely on the structured parameter alone.
Useful query fragments (combine with a space for AND):
| Goal | Query |
|---|---|
| Errors only | `@status:error` |
| Eval is present on the span | `@evaluations.custom.<eval_name>:*` |
| A specific tool by name | `@name:<x>` |

Dedicated params (`span_kind`, `root_spans_only`, `ml_app`) work alongside `query`, but `query` takes precedence over `tags`.
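A minimal sketch of how a query string could be assembled so that the `@ml_app` filter is never forgotten. The helper name `build_query` is hypothetical, and `my_eval` is a placeholder evaluator name rather than one taken from this skill.

```python
def build_query(ml_app, *fragments):
    """Join fragments with spaces (AND); the @ml_app filter always comes first."""
    parts = [f'@ml_app:"{ml_app}"']
    parts.extend(fragment for fragment in fragments if fragment)
    return " ".join(parts)


# Example strings that would be passed as the `query` argument.
errors_only = build_query("task-cruncher", "@status:error")
eval_present = build_query("task-cruncher", "@evaluations.custom.my_eval:*")
```

Building every query through one helper is one way to avoid relying on the structured `ml_app` parameter alone.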
Parallelization rules
- `get_llmobs_span_details`: Group span_ids by trace_id, chunk each trace's span_ids into batches of at most 20. Issue ALL chunks for a page in a single message.
- `get_llmobs_span_content`: Each call is independent; always issue ALL in a single message.
- `get_llmobs_trace` / `find_llmobs_error_spans` / `get_llmobs_agent_loop`: Parallelize across different traces in a single message.
- Pipeline parallelism: Start `get_llmobs_span_details` for page 1 results immediately; don't wait to collect all pages.
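The batching rule for `get_llmobs_span_details` can be sketched as follows. The span dictionaries with `trace_id` and `span_id` keys are an assumed shape for search results, and `batch_span_ids` is a hypothetical helper.

```python
from collections import defaultdict


def batch_span_ids(spans, batch_size=20):
    """Group span_ids by trace_id, then chunk each group into batches of <= batch_size."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span["span_id"])

    batches = []
    for trace_id, span_ids in by_trace.items():
        for start in range(0, len(span_ids), batch_size):
            batches.append((trace_id, span_ids[start:start + batch_size]))
    return batches
```

Each returned `(trace_id, span_ids)` pair corresponds to one call, and all pairs for a page would be issued in a single message.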
Analysis Workflow
Output discipline: Phases 0–5 are internal analysis. The only user-facing outputs during these phases are the Phase 1 Signal Summary and the mandatory checkpoints at Phases 2 and 3. Do NOT narrate reasoning, summarize intermediate findings, or output Phase 4 deep-dive results as prose. All detailed findings go exclusively into the Phase 6 report.
Phase 0: Resolve Inputs & Infer Mode
First: check for classification context. Scan the conversation for a `# Session Classification Summary` header. If found → enter Step 0S below and skip all remaining Phase 0 steps and Phase 1 entirely.
Step 0S — Extract Failure Bucket from Classification Output
The canonical handoff format is the Per-Unit Details table inside the `# Session Classification Summary` section. Extract one row per unit:
| Field | Source |
|---|---|
| `trace_id`, `span_id` | Link URL in the ID column: parse the IDs from the link href |
| Verdict | Verdict column |
| Failure mode | Failure Mode column |
| `detail` | Reason column; use as the Phase 2 reasoning input (same role as eval judge reasoning or error messages) |
| `app_type` | From each trace block |

Failure bucket = all rows where verdict is `no` or `partial`.
- < 5 entries → note low confidence, proceed anyway.
- Empty → report "No failures found in the classification output" and stop.

Present this overview before proceeding:
Classification Overview (from llm-obs-session-classify)
ml_app: <from summary header> | Classified: N | Failures (no+partial): F | Pass rate: X%
| Failure Mode | Count |
|---|---|
| ... |
Proceeding to Phase 2 using F failure traces. Mode inference bypassed — classification verdict is the signal.
Then **skip Phase 1 and jump directly to Phase 2**. Carry forward:
- Phase 2 reasoning input: `(trace_id, span_id, detail)` tuples — same structure as eval reasoning or error messages
- Phase 4 navigation: use `app_type` from each trace block to choose the span navigation strategy
- Phases 2–7: run completely unchanged — the failure bucket structure is identical regardless of source
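The Step 0S failure-bucket rules can be sketched like this. The row shape (a dict with a `verdict` key) and the helper name `build_failure_bucket` are assumptions, as is computing the pass rate as non-failing rows divided by classified rows.

```python
def build_failure_bucket(rows):
    """Apply the Step 0S rules to parsed Per-Unit Details rows."""
    failures = [row for row in rows if row["verdict"] in ("no", "partial")]
    if not failures:
        # Empty bucket: report and stop.
        return {"status": "stop",
                "message": "No failures found in the classification output"}
    classified = len(rows)
    return {
        "status": "proceed",
        "failures": failures,
        "low_confidence": len(failures) < 5,  # noted, but analysis proceeds
        # Assumption: pass rate = rows that are neither "no" nor "partial".
        "pass_rate": (classified - len(failures)) / classified,
    }
```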
---
**Standard resolution (no classification context):**
1. If neither `ml_app` nor `eval_name` provided → ask the user. If `eval_name` is provided but `ml_app` is not → also ask for `ml_app` (eval names are not globally unique; without it, span searches return results from all apps sharing the eval name).
2. If `timeframe` not provided → default to `now-24h`.
3. **Resolve `failure_filter`** (before mode inference):
- `"errors"` → force **Error Signal** mode
- `"low scores on <eval>"` → treat as `eval_name=<eval>`, then continue inference
- `"high latency"` → note for Phase 1 (sort by duration post-fetch); continue inference
- Tool/span name → note as `@name:<x>` query fragment for Phase 1; continue inference
4. **Resolve mode** (skip if `mode` was explicitly provided):
- `eval_name` given → **Eval Signal**
- User explicitly mentioned errors/exceptions/crashes → **Error Signal**
- Otherwise → call `list_llmobs_evals_by_ml_app(ml_app)`:
- Evals returned → **Eval Signal**
- No evals → **Error Signal** (announce auto-selection in Phase 1)
5. When `eval_name` is multi-valued, note for Phase 1: run parallel per-eval searches and merge+dedup by `(trace_id, span_id)`.
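A rough sketch of the resolution order in steps 3 and 4. The function `resolve_mode`, the short mode labels (`"eval"`, `"error"`), and the `list_evals` callable standing in for the eval-listing tool are all hypothetical. The sketch also assumes that `failure_filter="errors"` takes effect before an explicit `mode` is considered, since step 3 runs first.

```python
import re


def resolve_mode(ml_app, list_evals, mode=None, eval_name=None,
                 failure_filter=None, user_mentions_errors=False):
    """Return the analysis mode following the Phase 0 resolution order."""
    # Step 3: resolve failure_filter before mode inference.
    if failure_filter == "errors":
        return "error"
    match = re.fullmatch(r"low scores on (.+)", failure_filter or "")
    if match:
        eval_name = match.group(1)  # treated as eval_name=<eval>

    # Step 4: skipped when mode was explicitly provided.
    if mode:
        return mode
    if eval_name:
        return "eval"
    if user_mentions_errors:
        return "error"
    # Otherwise ask the backend whether any evals are configured.
    return "eval" if list_evals(ml_app) else "error"
```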
---
Phase 1: Find Problematic Spans
Three mode-specific paths. All end with a Signal Summary that labels the mode and includes a one-line override hint.
Mode switch handling: At any checkpoint, if the user says "switch to [error|eval|generic] mode", re-enter Phase 1 with the new mode. Phase 0 inputs do not re-resolve.
Auto-pivot: If the selected mode finds no data (0 evals configured, 0 error spans in timeframe), announce the pivot to Generic and proceed — do not stop and ask.
Eval Signal path
Step 1a: Eval overview (parallel)
For each eval, call both in a single parallel batch:
- `get_llmobs_eval_aggregate_stats(eval_name, from, to)`
- `get_llmobs_evaluator(eval_name)`

Interpret aggregate stats:
- `total_count == 0` → Note "no data." Skip this eval (or pivot to Generic if it's the only one).
- Boolean `pass_rate == 1.0` → Note "100% pass." Skip unless it's the only eval.
- Boolean with failures → Note counts and pass_rate. Continue.
- Score with assessment criteria → Note distribution and pass/fail counts. Continue.
- Score WITHOUT assessment criteria → Infer failures: bottom quartile, or below median if bimodal. Label as "inferred failures" in report.
- Categorical with assessment criteria → Note top_values and pass/fail. Continue.
- Categorical WITHOUT assessment criteria → Infer from context (e.g., "error", "incomplete", "off_topic" are likely failures). Ask user if genuinely ambiguous.

Interpret eval config:
- Config returned (custom) → Store `prompt_template`, `assessment_criteria`, `parsing_type`, `output_schema`.
- Config nil (OOTB) → Note prompt is not inspectable.

Calibration cross-check: When two evals share a name prefix but differ in type (e.g. `foo-boolean` and `foo-score`), compare their pass rates on overlapping spans. A discrepancy >20% is an Evaluator Calibration Discrepancy; flag it in the report.
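For score evals without assessment criteria, the inference rule could look roughly like the sketch below. How bimodality is detected is not specified here, so it is passed in as a flag. Reading the 20% calibration threshold as an absolute difference in pass rate is also an assumption, and the function names are hypothetical.

```python
import statistics


def inferred_failures(scores, bimodal=False):
    """Bottom quartile by default; everything below the median if bimodal."""
    if bimodal:
        cutoff = statistics.median(scores)
        return [s for s in scores if s < cutoff]
    first_quartile = statistics.quantiles(scores, n=4)[0]
    return [s for s in scores if s <= first_quartile]


def calibration_discrepancy(pass_rate_a, pass_rate_b, threshold=0.20):
    """Assumption: the >20% rule is an absolute pass-rate difference."""
    return abs(pass_rate_a - pass_rate_b) > threshold
```

Whatever rule is used, the resulting rows would still be labeled "inferred failures" in the report.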
Step 1b: Collect failure spans
For each eval:
- `search_llmobs_spans(query="@evaluations.custom.<eval_name>:*", from, limit=50)`. When multi-valued, issue one search per eval in parallel, then merge result sets and dedup by `(trace_id, span_id)`.
- Paginate until ≥15–20 failures OR no more pages. Cap at 200 spans total.
- `get_llmobs_span_details` per trace_id batch (follow Parallelization Rules).
- Extract per row: assessment, value, reasoning, span_id, trace_id, span_kind, content_info.
- Separate into pass/fail buckets using thresholds from Step 1a.

JSON-type eval fallback: If `@evaluations.custom.<eval_name>:*` returns 0 spans but `get_llmobs_eval_aggregate_stats` confirmed `total_count > 0`, the eval is JSON-type and scores are not indexed on this field. Fall back to: search by the span name or span kind that the eval targets (check `get_llmobs_evaluator` for the span filter), then inspect output payloads for JSON verdict fields via `get_llmobs_span_content(field="output")`.
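The pagination stop conditions can be sketched as a loop. Here `search_page` is a hypothetical callable that takes a cursor and returns `(spans, next_cursor)`, and `is_failure` is a hypothetical predicate. In the real workflow the pass/fail decision depends on details fetched afterwards, so this only illustrates the loop shape.

```python
def collect_failures(search_page, is_failure, min_failures=15, max_spans=200):
    """Paginate until enough failures are seen, pages run out, or the cap is hit."""
    seen, spans, failures = set(), [], []
    cursor = None
    while len(spans) < max_spans:
        page, cursor = search_page(cursor)
        for span in page:
            key = (span["trace_id"], span["span_id"])
            if key in seen:
                continue  # dedup by (trace_id, span_id)
            seen.add(key)
            spans.append(span)
            if is_failure(span):
                failures.append(span)
            if len(spans) >= max_spans:
                break
        if len(failures) >= min_failures or cursor is None:
            break
    return spans, failures
```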
Step 1c: Signal Summary (Eval Signal)
Signal Summary: {ml_app} · Eval Signal
(Inferred from {N} configured eval(s). Say `switch to error mode` or `switch to generic mode` to change.)
Timeframe: {from} → {to}
| Eval | Type | Total | Pass Rate | Status |
|---|---|---|---|---|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |
| eval_2 | score | 1,200 | — (inferred threshold) | ⚠ Investigating |
| eval_3 | boolean | 500 | 99.2% | ✓ Healthy |
Collected: {pass_count} passing, {fail_count} failing.
For a single eval, collapse to a single-line header instead of a table.
---
Error Signal path
Step 1a: Sample error spans
Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:error", from=timeframe, limit=50)`. Paginate until ≥30 error spans are collected or no more pages remain.
Step 1a.5: Soft error scan
MCP tool spans sometimes report `@status:ok` but carry `"isError": true` in their output payload. These are invisible to `@status:error` queries and can outnumber hard errors.
Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", span_kind="tool", from=timeframe, limit=20)`. For a sample of 5–10 results, call `get_llmobs_span_content(field="output")` in parallel. If any payloads contain `"isError": true`, add MCP soft errors as a separate row in the error frequency table with the note: (status:ok but isError:true in payload — not queryable via @status:error).
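One way to check a fetched output payload for a soft error is sketched below. It assumes the payload is either a JSON string or an already parsed structure, and `has_soft_error` is a hypothetical helper. JSON that is itself embedded as a string inside the payload is not unwrapped.

```python
import json


def has_soft_error(payload):
    """Return True if any object in the payload carries "isError": true."""
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError:
            return False  # not JSON, so nothing to inspect

    stack = [payload]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if node.get("isError") is True:
                return True
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return False
```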
Step 1b: Group by error type
Group spans by `error_type` tag → frequency table. If the `error_type` tag is absent on some spans, supplement with the `error.type` field from `get_llmobs_span_details` (fetched in Step 1d).
Step 1c: Fetch stack traces (parallel)
For the top 3–4 error types by count, pick 2–3 representative trace IDs each. Call `find_llmobs_error_spans(trace_id)` in parallel across all selected traces. Extract:
- Error message and stack trace
- Origin span kind and name
- Whether errors propagate from children to parents (cascade) or are isolated

Step 1d: Fetch span details
Call `get_llmobs_span_details` on representative spans for each error type (follow Parallelization Rules). Extract `content_info`, `span_kind`, and duration.
Step 1e: Signal Summary (Error Signal)
Signal Summary: {ml_app} · Error Signal
(No evals configured: analyzing runtime errors. Say `switch to eval mode` or `switch to generic mode` to change.)
Timeframe: {from} → {to} | Total error spans sampled: {N}
| Error Type | Spans | Cascade? | Origin Span Kind |
|---|---|---|---|
| TimeoutError | 42 | Yes | tool |
| APIError 429 | 18 | No | tool |
| ValueError | 7 | No | llm |
| MCP soft errors (isError:true) | 23 | No | tool |
---
Generic path
Step 1a: Eval health check (when evals are configured)
If `list_llmobs_evals` returned evals in Phase 0, call `get_llmobs_eval_aggregate_stats` for each enabled eval in parallel. Flag any enabled eval with `total_count: 0` as Broken Eval Configuration and include it in the Signal Summary anomaly table as a High severity row.
Step 1b: Broad span search
Call `search_llmobs_spans(query="@ml_app:\"<ml_app>\"", root_spans_only=true, from=timeframe, limit=50)`. Apply `failure_filter` narrowing if present (tool/span name → `@name:<x>` query; `"high latency"` → sort the result set by `duration` after Step 1c). Paginate until ≥30 spans are collected.
Step 1c: Fetch span details
Call `get_llmobs_span_details` per trace_id batch.
Step 1d: Rank by structural anomalies
Partition spans using heuristics:
- Top decile by `duration` (latency outliers)
- Agent spans with >N tool/LLM iterations (long-running loops)
- Retrieval spans returning 0 documents (RAG miss)
- Workflow spans whose child set is missing an expected step (compare against median child layout)
- Token efficiency: Check if `non_cached_input_tokens ≈ input_tokens` across LLM spans. If the app has stable system prompts (>1k tokens) and cache hit rate is 0%, flag as High severity: enabling `cache_control: ephemeral` on the system prompt could cut input token costs by an estimated 60–90%.
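Two of these heuristics lend themselves to a short sketch. The span dictionaries and their keys (`duration`, `input_tokens`, `non_cached_input_tokens`) are assumed shapes, the 1% tolerance used for "approximately equal" is an arbitrary illustrative choice, and both function names are hypothetical.

```python
import statistics


def latency_outliers(spans):
    """Spans above the 90th percentile of duration (needs at least 2 spans)."""
    durations = [span["duration"] for span in spans]
    p90 = statistics.quantiles(durations, n=10)[-1]
    return [span for span in spans if span["duration"] > p90]


def zero_cache_utilization(llm_spans, tolerance=0.01):
    """True when non_cached_input_tokens ≈ input_tokens on every LLM span."""
    return bool(llm_spans) and all(
        abs(span["input_tokens"] - span["non_cached_input_tokens"])
        <= tolerance * span["input_tokens"]
        for span in llm_spans
    )
```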
Step 1e: Signal Summary (Generic)
Signal Summary: {ml_app} · Generic
(Analyzing structural anomalies. Say `switch to eval mode` or `switch to error mode` to change.)
Timeframe: {from} → {to} | Sampled: {N} root spans
| Anomaly Type | Count |
|---|---|
| Latency outliers (>p90) | 12 |
| Long agent loops (>8 iterations) | 5 |
| RAG retrieval misses | 3 |
| Zero prompt cache utilization | All LLM spans |
| Broken eval configurations | 2 |
---
Phase 1.5: Determine App Profile & Where the Root Cause Lives
Inspect `content_info` and `span_kind` across collected spans. This drives the Phase 4 strategy.
App profile (from `content_info`):
| Signal | App profile | Phase 4 strategy |
|---|---|---|
| | LLM/chat app | Extract system prompt via `get_llmobs_span_content` |
| | RAG app | Check retrieval quality alongside LLM output |
| Trace contains `agent` spans | Agent app | Try `get_llmobs_agent_loop` first |
| | Long conversation | Check for context overflow |
| | Has custom metadata | Check for clustering by metadata values (prompt version, feature flags) |

LLM Experiments traces: If root spans have `span_kind: experiment` and carry `input`, `output`, and `expected_output` structured fields, you are looking at a Datadog LLM Experiments trace. Each span represents one dataset record run. Read quality signal from the root span's `input`/`output`/`expected_output` fields via `get_llmobs_span_content`, not from LLM sub-span messages, which may contain stub or placeholder content. Evaluations attached to experiment spans are computed by the Experiments framework at run time and may not be registered as online Datadog evaluators (`get_llmobs_evaluator` will return 404 for them).

Where the root cause likely lives, by symptom span kind:
| Symptom span kind | Symptom looks like | But root cause is often in... |
|---|---|---|
| `llm` | Bad LLM response (eval flagged, wrong output) | Parent agent (bad instructions), sibling retrieval (bad context), sibling tool (bad data) |
| `agent` | Bad orchestration | Child spans (wrong tool calls, bad routing), full agent loop |
| `tool` | Bad tool result | Parent LLM (passed wrong parameters), tool implementation |
| `workflow` | Bad overall output | Child sub-spans (which step first deviated?) |
| `retrieval` | Bad retrieval | Query construction (parent), index/embedding config (outside trace) |

Key insight: The signal (eval verdict, error message, or latency outlier) flags one span in isolation. It is a symptom report, not a diagnosis. The root cause often lives in a different span: a parent that gave bad instructions, a sibling that provided bad context, or a child that made a wrong decision. Phase 4 navigates the tree to find it.
Phase 2: Open Coding — Initial Failure Categorization
Goal: Read per-row evidence and propose initial, concrete failure categories. Pool all problematic rows together: categories should describe app behaviors, not which signal flagged them.
Per-row "reasoning input" by mode:
- Eval Signal: judge assessment + reasoning from `get_llmobs_span_details`
- Error Signal: error message + stack trace excerpt from `find_llmobs_error_spans`
- Generic: one-line description of the structural anomaly that flagged the row

Shortcuts:
- < 15 problematic rows: Combine Phases 2 and 3 into one pass. Still produce the checkpoint.
- > 80% share the same reasoning/error/symptom: Skip to Phase 4 with the dominant pattern. Still output checkpoint.
- > 50 problematic rows: Sample ~50, build taxonomy, then spot-check 10–15 more.

Rules:
- Use per-row signal from Phase 1; do NOT re-fetch. Only call `get_llmobs_span_content(field="input"/"output")` for spans where the reasoning is insufficient (generic, empty, or just a stack trace with no app context).
- If eval config is loaded (Eval Signal), distinguish early:
  - App failures: Output genuinely violates the eval's criteria
  - Eval failures: Output seems reasonable but eval criteria are too strict/ambiguous
- Each pattern must be specific: "Agent called search instead of calculator for price computation", NOT "tool issue."
MANDATORY CHECKPOINT
**Open coding**: {N} problematic rows → {K} initial categories: {Category1} ({count}), {Category2} ({count}), ...
Phase 3: Axial Coding — Refine Failure Taxonomy
Goal: 3–8 final categories, ranked by impact.
- Merge: Categories with < 3 occurrences → parent category or drop as noise.
- Split: Categories with > 30% of failures → more specific sub-categories. Pull additional span content if needed.
- Validate: 2–3 representative examples per category confirm the label fits.
- Rank: `priority = count × severity` (severity: high / medium / low).
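The ranking rule needs a numeric value for each severity level. The weights below (high = 3, medium = 2, low = 1) are an assumption made for illustration, as are the category names in the example and the helper `rank_categories`.

```python
# Assumed numeric weights; no specific numbers are prescribed for severity.
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def rank_categories(categories):
    """Sort categories by priority = count x severity weight, highest first."""
    return sorted(
        categories,
        key=lambda c: c["count"] * SEVERITY_WEIGHT[c["severity"]],
        reverse=True,
    )


ranked = rank_categories([
    {"name": "Wrong tool selected", "count": 12, "severity": "medium"},
    {"name": "Hallucinated citation", "count": 9, "severity": "high"},
    {"name": "Verbose answer", "count": 20, "severity": "low"},
])
```

With these weights a frequent but low-severity category can rank below a rarer high-severity one, which is the point of multiplying rather than sorting by count alone.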
MANDATORY CHECKPOINT
**Axial coding**: {merges/splits/drops}. Final categories:
1. {Category} ({count}, {pct}%) — {severity}
2. ...
Phase 4: Root Cause Analysis — Navigate from Symptom to Root Cause
Goal: The signal flagged a span. That's the symptom. Navigate the trace tree to find the actual root cause — it's often in a different span.
For each of the top 3 categories, pick 2–3 representative traces:
Step 4a: Trace structure + errors (parallel)
For each representative trace, call in a single message:
- `get_llmobs_trace(trace_id)` — span hierarchy; locate the symptom span and its parent/siblings/children
- `find_llmobs_error_spans(trace_id)` — check for runtime errors anywhere in the trace
Runtime vs behavioral: If errors exist on or near the symptom span, the root cause may be a runtime failure rather than a behavioral one. Check this first.
Distributed trace fallback: If `get_llmobs_trace` returns "cannot find parent" or an empty span list (common in Ray-based or multi-process execution), reconstruct the trace manually using `get_llmobs_span_details` on the span_ids collected in Phase 1, sorted by `start_ms`.
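A minimal sketch of the distributed trace fallback, assuming each `get_llmobs_span_details` result has already been parsed into a dict. Only `start_ms` is named in the text; the `span_id` and `parent_id` keys and the `reconstruct_trace` helper are hypothetical and should be adjusted to the actual response shape.

```python
def reconstruct_trace(span_details):
    """Order spans by start time and attach children to known parents.

    span_details: list of dicts, one per get_llmobs_span_details result.
    Assumed keys (hypothetical): span_id, parent_id, start_ms.
    Spans whose parent is not in the collected set are treated as roots,
    which is the situation when the trace tree could not be resolved.
    """
    ordered = sorted(span_details, key=lambda s: s["start_ms"])
    known_ids = {s["span_id"] for s in ordered}
    children = {}
    roots = []
    for span in ordered:
        parent = span.get("parent_id")
        if parent in known_ids:
            children.setdefault(parent, []).append(span["span_id"])
        else:
            roots.append(span["span_id"])
    return ordered, roots, children
```

More than one root in the result suggests the trace was split across processes, so the symptom span's true parent may be missing from the Phase 1 sample.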
Step 4b: Navigate to the root cause (parallel)
Use the symptom span kind (from Phase 1.5). Issue ALL calls in a single message.
If symptom is on an `llm` span (most common):
- `get_llmobs_span_content(field="messages", path="$.messages[0]")` on symptom span — system prompt
- `get_llmobs_span_content(field="messages")` on symptom span — full context received
- `get_llmobs_span_content(field="documents")` on sibling retrieval spans (if any)
- `get_llmobs_span_content(field="input")` on sibling tool spans (if any)
- `get_llmobs_span_content(field="messages", path="$.messages[0]")` on parent agent/workflow span
If symptom is on an `agent` span:
- `get_llmobs_agent_loop(trace_id, span_id)` — full decision timeline (try first; if it returns 0 iterations, use the fallback below)
- `get_llmobs_span_details` on child spans — sort by `start_ms` to reconstruct the execution timeline
- `get_llmobs_span_content(field="input"/"output")` on child spans that look wrong
Agent loop fallback (when `get_llmobs_agent_loop` returns 0 iterations): Reconstruct the timeline from `get_llmobs_span_details` results sorted by `start_ms`. Group by `span_kind` to identify LLM → tool → LLM sequences. This fallback is frequently needed — `get_llmobs_agent_loop` returns empty for many apps.
If symptom is on a `tool` span:
- `get_llmobs_span_content(field="input")` on symptom span — what parameters was it called with?
- `get_llmobs_span_content(field="messages")` on parent LLM span — did the LLM construct the call correctly?
If symptom is on a `workflow` span:
- `get_llmobs_span_details` on all child spans — find which step first deviated
- `get_llmobs_span_content(field="input"/"output")` on the deviating child
Always also fetch:
- `get_llmobs_span_content(field="metadata")` on the symptom span — clustering signals (prompt version, feature flags)
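The agent loop fallback described above can be sketched as a small grouping routine. It assumes the child spans have been parsed into dicts carrying `span_kind` and `start_ms` (both named in the text) plus a hypothetical `name` key used only for display; `rebuild_agent_loop` is an illustrative helper, not a tool in the skill.

```python
def rebuild_agent_loop(child_spans):
    """Group child spans into LLM -> tool iterations, ordered by start_ms.

    A new iteration opens at every llm span. Tool spans, and spans of any
    other kind, are attached to the iteration that is currently open.
    """
    iterations = []
    for span in sorted(child_spans, key=lambda s: s["start_ms"]):
        if span["span_kind"] == "llm" or not iterations:
            iterations.append([])
        iterations[-1].append((span["span_kind"], span["name"]))
    return iterations
```

Each inner list then reads as one decision step: the LLM call followed by the tool calls it triggered, which is the LLM → tool → LLM sequence the fallback is meant to recover.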
Step 4c: Diagnose — from symptom to root cause
For each category, trace the causal chain:
- Symptom — what the signal flagged (eval reasoning, error message, anomaly note). The signal only saw one span in isolation — its reasoning may be shallow.
- Trace context — what surrounding spans reveal (parent instructions, sibling data, child decisions).
- Root cause — the specific span and decision point where the failure originated. Often NOT the symptom span itself.
For suspected eval issues (Eval Signal, if config loaded): Compare eval criteria against evidence. Is the prompt ambiguous? Criteria too strict?
Root cause categories:
| Category | Description |
|---|---|
| System Prompt Deficiency | Instructions unclear, missing, or contradictory — in symptom span OR its parent |
| Tool Gap | Needed tool doesn't exist or parameters too coarse |
| Tool Misuse | Wrong tool called or wrong parameters — often visible in agent loop or parent LLM |
| Routing/Handoff Error | Wrong sub-agent selected (multi-agent systems) |
| Retrieval Failure | RAG returned irrelevant or missing context — check sibling retrieval spans |
| Context Overflow | Critical info lost due to context length |
| Upstream Data Issue | A sibling or parent span provided bad data that cascaded to the symptom span |
| Runtime Error | Tool/API failure, timeout, exception — from |
| Evaluator Miscalibration | Eval criteria produce false positives/negatives (Eval Signal mode only) |
Phase 5: Generate Recommendations
Goal: Concrete, actionable recommendations grounded in trace evidence. Actual text/code changes with before/after quotes from the trace — not generic advice.
Recommendation types: System Prompt Edit (quote actual prompt, provide before/after), Tool Gap/Misuse (reference agent loop steps), Routing/Handoff Fix, Retrieval Fix (show retrieved vs needed), Evaluator Prompt Edit (flag that eval changes need re-validation; Eval Signal only), Other.
When run in Claude Code with codebase access: Search the codebase for system prompt, tool definitions, or routing logic. Propose specific diffs. Always ask before modifying files.
Phase 6: Compile RCA Report
Write the full report following the Output Format below. This is the primary deliverable — output it directly in the chat.
Phase 7: Post-Analysis Actions
Do NOT take any action automatically. After presenting the report, ask the user what they'd like to do next:
1. Save the report to `llm-obs-rca-{ml_app}-{date}.md`
2. Apply fixes (if codebase is available)
3. Deeper investigation of remaining categories
4. Export to a Datadog notebook — in pup mode, use `pup notebooks create` to create the notebook and `pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` to append sections (see Tool Reference)
5. Re-run on an expanded time range (e.g. `now-7d` if current window was `now-24h`)
If the user chooses option 4, follow the notebook creation fallback pattern:
1. Call `mcp__datadog-mcp-core__create_datadog_notebook` with:
   - `name`: `LLM Obs RCA: {ml_app} ({mode}) — YYYY-MM-DD`
   - `type`: `report`
   - `time_span`: `1w`
   - `cells`: one cell per section (see Notebook Cell Structure below)
2. If the MCP call fails, inspect the error before giving up:
   - Auth / permission error (401, 403) → stop and tell the user.
   - Field validation error (error message names a specific field) → fix that field and retry the MCP call once.
   - Any other error (binding, serialization, unexpected response) → fall back to pup:
     - Write the notebook payload to `/tmp/nb_rca_{ml_app}.json` as a full API envelope: `{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}`
     - Run `pup notebooks create --file /tmp/nb_rca_{ml_app}.json`
     - If pup is not available either, render the full notebook content as markdown in chat so the user has it.
3. After successful creation by either method, output the URL on its own line:
   RCA report exported to notebook: <url>
Print the URL prominently — if `/eval-bootstrap` runs next in the same session, it will detect this URL and offer to append the evaluator suite to the same notebook.
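The error handling in the notebook fallback pattern can be sketched as a small decision function plus a payload builder. Several details here are assumptions: detecting a field validation error by looking for the word "field" in the message, falling back to pup when the single retry also fails (the text does not say what happens then), and the `next_action` and `build_envelope` helper names. The contents of `time` are elided in the text, so the caller supplies them.

```python
import json


def next_action(status_code, error_message, already_retried=False):
    """Map an MCP notebook-creation failure to the next step.

    Returns one of "stop", "retry_mcp", "fallback_pup".
    """
    if status_code in (401, 403):
        return "stop"  # auth / permission error: tell the user
    # Assumed heuristic: a validation error names a field in its message.
    if "field" in error_message.lower() and not already_retried:
        return "retry_mcp"  # fix the field and retry once
    return "fallback_pup"


def build_envelope(name, time, cells):
    """Wrap notebook attributes in the full API envelope quoted above."""
    return {
        "data": {
            "attributes": {"name": name, "time": time, "cells": cells},
            "type": "notebooks",
        }
    }


def write_payload(path, envelope):
    """Write the envelope as JSON for `pup notebooks create --file <path>`."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(envelope, handle, ensure_ascii=False)
```

Keeping the decision separate from the payload builder means the same envelope can be reused whether the notebook is created through MCP or through pup.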
Notebook Cell Structure
| Cell | Content |
|---|---|
| 1 — Overview | Structured header (see Overview cell format below — follow it exactly) |
| 2 — Signal Summary | Mode-specific health table |
| 3 — Failure Taxonomy | Taxonomy table |
| 4…N — Failure Modes | One cell per failure mode |
| N+1 — Action Plan + Limitations | Action plan table + bullet list |
Notebook formatting rules (apply to every cell):
- No triple-backtick code blocks — use blockquotes (`>`) for prompts/rubrics, inline code (`` ` ``) for short values
- Evidence as tables — not bullet lists
- Tool inputs as tables — Argument | Wrong value passed | Correct approach
- Action plan as a table — Priority | Action | Confidence | Impact
Output Format
Overview cell (notebook Cell 1 / report header)
The Overview cell must follow this exact structure. No prose paragraphs. No inline-numbered findings. App description is one sentence maximum.
`{ml_app}` · {Eval Signal | Error Signal | Generic} · {timeframe}
Date: {YYYY-MM-DD} | Profile: {short app profile} | Model: `{model(s)}`
{One sentence: what does this app do?}
| Metric | Value |
|---|---|
| {mode-appropriate rows — see below} |
Findings
- {Finding 1} (~{pct}%): one-line root cause description
- {Finding 2} (~{pct}%): one-line root cause description
- {Finding 3} (if present): one-line root cause description
Recommendations
- {Recommendation 1}: specific next step tied to Finding 1
- {Recommendation 2}: specific next step tied to Finding 2
Sample: {N} spans analyzed. Confidence: High | Medium | Low — {one-line reason if Medium or Low}.
**Mode-appropriate metric rows:**
Eval Signal:
| Eval | `{eval_name}` ({type}) |
| Spans evaluated | {total_count} |
| Pass rate | {pass_rate}% ({pass_count} pass / {fail_count} fail) |
| Top failure mode | {name} (~{pct}%) |
| Evals configured | {N} |
Error Signal:
| Error spans | {N} confirmed |
| Top error type | `{type}` ({pct}%) |
| Affected operation | `{span_name}` |
| Cascade pattern | Isolated / Cascading |
| Evals configured | {N} (none = no quality signal) |
Generic:
| Spans sampled | {N} root spans |
| Top anomaly | {type}: {count} spans |
| Error spans | {N} (0 = structurally healthy) |
| Evals configured | {N} (none = no quality signal) |
---
Signal Summary Table
When entering from Step 0S (classification context), replace the Signal Summary table with:
Classification Signal Summary
Source: llm-obs-session-classify | ml_app: {app} | Signal: content-only | content+evals
| Metric | Value |
|---|---|
| Traces classified | N |
| Failures in corpus (no+partial) | F |
| Pass rate | X% |
| Failure modes | list |
Root cause analysis is based on per-trace classification verdicts, not automated eval judge reasoning.
**Otherwise**, mode-specific — pick the appropriate variant:
**Eval Signal** — one row per eval:
| Eval | Type | Total | Pass Rate | Status |
|------|------|------:|:---------:|--------|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |
**Error Signal** — one row per error type:
| Error Type | Spans | Cascade? | Origin Span Kind |
|------------|------:|:--------:|-----------------|
| TimeoutError | 42 | Yes | tool |
**Generic** — one row per anomaly type:
| Anomaly Type | Count |
|---|:---:|
| Latency outliers (>p90) | 12 |
---
Failure Taxonomy
| # | Failure Mode | Traces | % | Severity | Root Cause |
|---|---|---|---|---|---|
| 1 | ... | ... | ...% | High | Tool Misuse |
Failure Mode Sections (one per top 3–5 modes)
Failure Mode N: [Name]
Count: {n} spans, {t} traces | Severity: High/Medium/Low | Root Cause: [Category]
[3–5 sentences: what goes wrong, when, what triggers it, causal chain.]
Evidence
{Use the mode-appropriate column set:}
Eval Signal — Trace | Judge verdict | What the trace revealed:
| Trace | Judge verdict | What the trace revealed |
|---|---|---|
| 69de86a7... | fail | Parent agent has no date format instruction |
Error Signal — Trace | Behavior | Version:
| Trace | Behavior | Version |
|---|---|---|
| 69de86a7... | 7 parallel calls, all 400 | v107624932 |
Generic — Trace | Anomaly | Signal:
| Trace | Anomaly | Signal |
|---|---|---|
| 69de86a7... | 94s, 12 tool calls | Latency outlier |
{For tool misuse — add a tool inputs table:}
Tool inputs (100% of sampled calls)
| Argument | Value passed (wrong) | Correct approach |
|---|---|---|
| | |
{For Eval Signal — add judge reasoning as a blockquote:}
"{quoted judge reasoning}"
Root cause: [WHY this happens — specific span, parameter, or prompt.]
Fix:
BEFORE: [actual text from trace]
AFTER: [proposed replacement]
Impact: Eliminates ~{n} spans / {timeframe}.
---
Prioritized Action Plan
| Priority | Action | Confidence | Impact |
|---|---|---|---|
| 1 | Fix | High | Eliminates ~21 spans/7d |
When mode is Generic and no evals are configured, always append as the final action plan row:
| N | Configure at least one evaluator | High | Enables Eval Signal mode for future RCAs — app currently has no ongoing quality signal |
Limitations & Follow-ups
Bullet list of what needs more data or follow-up action.
Operating Rules
- Ground in evidence: Every claim references span IDs with clickable trace links: `[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})`.
- Root cause over symptom: "System prompt doesn't specify date format" not "model gave wrong answer."
- Show your math: "47 failures (34%)" not "many failures."
- Honest about uncertainty: < 5 examples = tentative. Flag it.
- Anonymize PII: No emails or names. User/org IDs are fine.
- MCP result parsing safety: Before writing any script that iterates over MCP tool results, inspect the raw structure first — check top-level keys and whether the payload is nested inside a content block (e.g. `[{'type': 'text', 'text': '<json>'}]`). Extract and `json.loads()` the inner payload if needed. Never assume MCP results are bare dicts or lists.
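The MCP result parsing rule can be illustrated with a defensive unwrapping helper. This is a sketch, not part of the skill: `unwrap_mcp_result` is a hypothetical name, and it handles only the content-block shape quoted above plus results that are already bare dicts or lists.

```python
import json


def unwrap_mcp_result(result):
    """Return the inner payload of an MCP tool result.

    Handles the content-block shape [{"type": "text", "text": "<json>"}]
    and passes through results that are already bare dicts or lists.
    """
    is_content_blocks = (
        isinstance(result, list)
        and bool(result)
        and all(isinstance(b, dict) and b.get("type") == "text" for b in result)
    )
    if not is_content_blocks:
        return result
    parsed = []
    for block in result:
        text = block.get("text", "")
        try:
            parsed.append(json.loads(text))
        except (TypeError, ValueError):
            parsed.append(text)  # leave non-JSON text untouched
    # A single block is the common case; return its payload directly.
    return parsed[0] if len(parsed) == 1 else parsed
```

Calling this once at the boundary, before any loop over spans or evals, keeps the rest of an analysis script independent of how the result was wrapped.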
Tool Reference
This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.
Spans and traces
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
| |
| |
| |
Evaluators
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
Notebooks
| MCP Tool | pup Command |
|---|---|
| |
| |
The cells file is a JSON array of cell objects:
```json
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]
```
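A cells file in that shape could be generated rather than hand-written. The sketch below assumes each report section is available as a (title, body) pair of markdown strings; `markdown_cell` and `cells_json` are hypothetical helper names, and only the cell object layout comes from the example above.

```python
import json


def markdown_cell(text):
    """Build one notebook cell object in the shape shown above."""
    return {
        "attributes": {"definition": {"type": "markdown", "text": text}},
        "type": "notebook_cells",
    }


def cells_json(sections):
    """Serialize (title, body) pairs as the JSON array the cells file holds."""
    cells = [markdown_cell(f"## {title}\n\n{body}") for title, body in sections]
    return json.dumps(cells, ensure_ascii=False)
```

The returned string can be written to a path such as `/tmp/nb_cells.json` and passed to `pup notebooks edit NOTEBOOK_ID --file`, as described in Phase 7.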