llm-obs-eval-pipeline

Backend

Detection: At the start of every invocation, before taking any action, determine which backend to use:

1. If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__search_llmobs_spans` appears in your available tools.
3. If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in the sub-skill workflow sections.
4. If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
   - If pup responds → use pup mode throughout. Each sub-skill carries its own Tool Reference appendix with the full MCP→pup mapping.
   - If neither is available → stop and tell the user: "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."

`--backend pup` is accepted anywhere in the invocation arguments. Strip it from args before passing to sub-skills, but carry the pup-mode decision forward: sub-skills must also operate in pup mode for the entire pipeline run.
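
The detection order and flag stripping described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function names are hypothetical, the tool list and the `pup --version` output are passed in as plain values rather than probed, and `"version"` is assumed to be a top-level key of the JSON response.

```python
import json

MCP_SIGNAL = "mcp__datadog-llmo-mcp__search_llmobs_spans"


def strip_backend_flag(args):
    """Remove every `--backend pup` pair; return (remaining_args, forced_pup)."""
    remaining, forced, i = [], False, 0
    while i < len(args):
        if args[i] == "--backend" and i + 1 < len(args) and args[i + 1] == "pup":
            forced = True
            i += 2  # skip both the flag and its value
        else:
            remaining.append(args[i])
            i += 1
    return remaining, forced


def pup_available(version_output):
    """Assumption: availability means JSON output with a top-level "version" key."""
    try:
        parsed = json.loads(version_output)
    except (TypeError, ValueError):
        return False
    return isinstance(parsed, dict) and "version" in parsed


def select_backend(args, available_tools, pup_version_output=None):
    """Return (backend, args_for_sub_skills); backend is "pup", "mcp", or None."""
    remaining, forced = strip_backend_flag(args)
    if forced:                               # explicit flag wins over everything
        return "pup", remaining
    if MCP_SIGNAL in available_tools:        # canonical MCP signal
        return "mcp", remaining
    if pup_available(pup_version_output):    # fallback: pup CLI responded
        return "pup", remaining
    return None, remaining                   # neither: caller stops and tells the user
```

A `None` backend corresponds to the "neither is available" branch, where the pipeline stops and shows the connection instructions.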

Sub-skill backend propagation: The backend detected at pipeline startup applies to all three sub-skills (session-classify → trace-rca → eval-bootstrap). Do not re-detect per phase. Announce once at startup:

MCP mode: "(Running in MCP mode — all features available.)"
pup mode: "(Running in pup mode — pup commands used throughout. RUM signals use `pup rum aggregate`. Notebooks use `pup notebooks create/edit`. All features available.)"

pup invocation rules:

- Invoke via Bash: `pup llm-obs <subcommand> [flags]`
- pup always outputs JSON. Parse it directly; no content-block unwrapping is needed (unlike MCP results).
- If pup returns an auth error, tell the user to run `pup auth login` and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
- Time flags: pup accepts bare duration strings (`1h`, `7d`, `30m`) and RFC3339 timestamps. Do not use `now-`-prefixed strings; strip the prefix when converting from a skill `--timeframe` argument: `now-7d` → `7d`, `now-24h` → `24h`, `now-30d` → `30d`.
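
The time-flag conversion can be sketched as below. The helper names are hypothetical. The command layout follows the `pup llm-obs spans search` row of this document's Tool Reference, and passing the bare duration to `--from` is an assumption drawn from that row rather than from pup's own documentation.

```python
def to_pup_duration(timeframe):
    """Strip a leading "now-" so "now-7d" becomes "7d"; other values pass through."""
    prefix = "now-"
    if timeframe.startswith(prefix):
        return timeframe[len(prefix):]
    return timeframe


def span_search_argv(ml_app, timeframe, limit):
    """Build an argv list for one span search (one pup command per Bash call)."""
    return [
        "pup", "llm-obs", "spans", "search",
        "--query", f"@ml_app:{ml_app}",        # --query is preferred over --ml-app
        "--from", to_pup_duration(timeframe),  # assumption: --from takes the bare duration
        "--limit", str(limit),
    ]
```

Bare durations and RFC3339 timestamps are returned unchanged, so the helper is safe to apply to any `--timeframe` value.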

Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.

Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-eval-pipeline[<inv_id>] —` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-eval-pipeline:start[<inv_id>] —` instead (note the `:start` suffix). Example first call:

skill:llm-obs-eval-pipeline:start[3a9f1c2b] — Phase 1: sample root spans for ml_app to begin classification
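
As a rough sketch of the ID and tagging rules, the snippet below generates an 8-character hex ID once and builds the intent string, adding the `:start` suffix only on the first call. The class and function names are hypothetical; only the string format comes from the text above.

```python
import secrets

SKILL = "llm-obs-eval-pipeline"


def new_invocation_id():
    """4 random bytes rendered as 8 lowercase hex characters."""
    return secrets.token_hex(4)


class IntentTagger:
    """Builds telemetry.intent values for one invocation."""

    def __init__(self, inv_id):
        self.inv_id = inv_id
        self._first = True  # only the first MCP call carries ":start"

    def tag(self, reason):
        suffix = ":start" if self._first else ""
        self._first = False
        return f"skill:{SKILL}{suffix}[{self.inv_id}] — {reason}"
```

Creating one `IntentTagger` per invocation keeps the ID constant and makes the first-call rule hard to get wrong.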

LLM Obs Eval Pipeline — Classify → RCA → Bootstrap

Walks from unlabeled production LLM trace data to a ready-to-use evaluator suite in three phases, with user checkpoints between each. No pre-existing evals or labeled data required.

llm-obs-session-classify  (ml_app mode)
           ↓
  llm-obs-trace-rca  (from classifications)
           ↓
  llm-obs-eval-bootstrap  (from RCA output)

Usage

/llm-obs-eval-pipeline <ml_app> [--timeframe <window>] [--trace-limit <N>] [--data-only] [--publish]

Arguments: $ARGUMENTS

Inputs

Input	Required	Default	Description
`ml_app`	Yes	—	LLM app to analyze end to end
`--timeframe`	No	`now-7d`	Lookback window for trace sampling and RCA
`--trace-limit`	No	`20`	Max traces to classify in Phase 1
`--data-only`	No	off	Pass through to llm-obs-eval-bootstrap: emit JSON spec instead of Python SDK code
`--publish`	No	off	Pass through to llm-obs-eval-bootstrap: publish online LLM-judge evaluators to Datadog

If `ml_app` is not provided, ask the user before proceeding.
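
For orientation, the inputs table can be mirrored with `argparse`. In the skill itself the arguments arrive as `$ARGUMENTS` and are interpreted by the agent, so this is only an illustrative sketch; the function name is hypothetical and the defaults are copied from the table.

```python
import argparse


def parse_pipeline_args(argv):
    """Parse pipeline arguments using the defaults from the inputs table."""
    parser = argparse.ArgumentParser(prog="/llm-obs-eval-pipeline")
    # ml_app is required by the skill, but it is optional here so the
    # caller can ask the user for it instead of exiting with an error.
    parser.add_argument("ml_app", nargs="?", default=None)
    parser.add_argument("--timeframe", default="now-7d")
    parser.add_argument("--trace-limit", type=int, default=20)
    parser.add_argument("--data-only", action="store_true")
    parser.add_argument("--publish", action="store_true")
    return parser.parse_args(argv)
```

A result whose `ml_app` is `None` corresponds to the case where the user must be asked before proceeding.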

Phase 1: Trace Classification

Follow the llm-obs-session-classify skill in ml_app mode, using:

- `ml_app` = the provided ml_app
- `timeframe` = the provided timeframe
- `sample_limit` = the provided trace-limit

Run the complete ml_app mode workflow as defined in that skill (Steps M1 through M3): sample spans → classify each → emit per-unit compact blocks → emit summary.

Output the full classification output, including all compact per-unit blocks and the final `# Session Classification Summary` section. Do not summarize or truncate this output: downstream phase detection depends on the full text being present in context.

CHECKPOINT 1

After the `# Session Classification Summary` is output, pause and present:

Phase 1 Complete

[verdict distribution table] [failure mode frequency table]

Before I continue to root cause analysis:

Do these failure patterns look right?
Any traces you'd like to exclude from the RCA?
Any failure modes to focus on or ignore?

Type "continue" to proceed, or give me adjustments.


**Wait for explicit user confirmation before proceeding.**

- If the user excludes specific traces: mark them as "excluded by user" and drop them from the failure bucket. Do NOT re-classify.
- If the user asks to re-run with different parameters: re-run Phase 1 with the new parameters.
- If Phase 1 yielded zero failures: surface this explicitly and offer to retry with a wider timeframe or stop.

---

Phase 2: Root Cause Analysis

Follow the llm-obs-trace-rca skill.

The `# Session Classification Summary` from Phase 1 is in context. The skill detects it automatically via its Phase 0 Step 0S check and enters the "from classifications" path: it extracts the failure bucket, presents the Classification Overview, and proceeds directly to the skill's own Phase 2 (open coding) without running the skill's Phase 1 span search.

Run the full llm-obs-trace-rca workflow through its Phase 6 (the compiled RCA report). Output the full RCA report; do not summarize it. The full report must be in context for the pipeline's Phase 3 detection to work.

CHECKPOINT 2

After the RCA report is output, pause and present:

Phase 2 Complete

[the Phase 6 RCA report is above]

Before I generate evaluators:

Do these root causes look accurate?
Any failure modes to add, remove, or reframe?
Which root causes should the evaluators target?

Type "continue" to proceed, or give me adjustments.


**Wait for explicit user confirmation before proceeding.**

If the user adjusts the taxonomy: incorporate the changes before continuing to Phase 3.

---

Phase 3: Eval Bootstrap

Follow the llm-obs-eval-bootstrap skill.

The RCA report from Phase 2 is in context. The skill detects the `## Failure Taxonomy` heading automatically and enters its "from RCA" path in Phase 0.

Pass through any flags:

- `--data-only` → emit a JSON spec instead of Python SDK code
- `--publish` → publish online LLM-judge evaluators directly to Datadog

The llm-obs-eval-bootstrap skill has its own mandatory proposal checkpoint (the evaluator suite proposal before code generation). Honor it: do not skip or auto-confirm it.

Final Summary

After Phase 3 completes, present:

LLM Obs Eval Pipeline Complete

App: `<ml_app>` | Timeframe: <timeframe>

Phase	Output
1. Classification	<N> traces sampled, <F> failures identified
2. Root Cause Analysis	<K> failure modes, <M> root causes diagnosed
3. Eval Bootstrap	<J> evaluators → `<output_path>`

Key findings

[3–5 bullets: most important failure patterns and what the evaluators capture]

Next steps

- Review the generated evaluators at `<output_path>`
- Run an offline experiment to validate eval quality
- Once validated, configure as production evals in Datadog

---

Orchestration Rules

- Always checkpoint before advancing. Never auto-proceed between phases.
- Never truncate sub-skill outputs. Downstream phase detection depends on the full text being in context.
- Phase isolation: if the user wants to re-run a single phase, re-run only that phase and its downstream phases.
- Carry context forward: the output of each phase is the input for the next. Present the full output of each sub-skill before showing the checkpoint prompt.
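
The phase-isolation rule amounts to re-running a suffix of the phase order. A minimal sketch, using the sub-skill names from this document and a hypothetical helper name:

```python
# Pipeline order as given in this document.
PHASES = ["session-classify", "trace-rca", "eval-bootstrap"]


def phases_to_rerun(phase):
    """Return the requested phase plus every phase downstream of it."""
    start = PHASES.index(phase)  # raises ValueError for an unknown phase name
    return PHASES[start:]
```

Each re-run phase still ends at its checkpoint, so the user confirms again before the next phase starts.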

Tool Reference

This appendix applies only in pup mode. Each sub-skill also carries its own Tool Reference with the same mappings; consult those for full parameter details. The tables below are a quick reference for pipeline-level orientation.

Spans and traces

MCP Tool	pup Command
`search_llmobs_spans(...)`	`pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--summary]` (always use `--query "@ml_app:A"`; `--ml-app A` is unreliable)
`get_llmobs_span_details(...)`	`pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...`
`get_llmobs_span_content(...)`	`pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]`
`get_llmobs_trace(...)`	`pup llm-obs spans get-trace --trace-id T [--include-tree]`
`get_llmobs_agent_loop(...)`	`pup llm-obs spans get-agent-loop --trace-id T [--span-id S]`
`find_llmobs_error_spans(...)`	`pup llm-obs spans find-errors --trace-id T`
`expand_llmobs_spans(...)`	`pup llm-obs spans expand --trace-id T --span-ids S1,S2,...`

Evaluators

MCP Tool	pup Command
`list_llmobs_evals()`	`pup llm-obs evals list`
`get_llmobs_evaluator(eval_name)`	`pup llm-obs evals get-evaluator EVAL_NAME`
`get_llmobs_eval_aggregate_stats(...)`	`pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]`
`create_or_update_llmobs_evaluator(...)`	`pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json` (see eval-bootstrap Tool Reference for flat schema details)

RUM

MCP Tool	pup Command
`analyze_rum_events(event_type="view", filter="@usr.email:EMAIL", ...)`	`pup rum aggregate --user-email EMAIL --from F --to T --compute count --group-by @session.id`
`analyze_rum_events(event_type="action", filter="@action.type:custom ...", ...)`	`pup rum aggregate --user-email EMAIL --query "@action.type:custom" --from F --to T --compute count --group-by @evt.name`

Notebooks

MCP Tool	pup Command
`create_datadog_notebook(name, cells, ...)`	`pup notebooks create --title "TITLE" --file /tmp/nb_cells.json`
`edit_datadog_notebook(id, cells, append_only=true)`	`pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json`

MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first: check `type(result)`, top-level keys, and whether the payload is nested inside a content block (e.g. `[{'type': 'text', 'text': '<json>'}]`). Extract and `json.loads()` the inner payload if needed. Never assume MCP results are bare dicts or lists.
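
One defensive way to apply this rule is sketched below. The helper names are hypothetical, and the only content-block shape handled is the one shown above; other shapes are returned untouched so they can be inspected by hand.

```python
import json


def _loads_or_raw(text):
    """Decode JSON text; fall back to the raw string if it is not JSON."""
    try:
        return json.loads(text)
    except ValueError:
        return text


def unwrap_mcp_result(result):
    """Return the decoded payload from a raw MCP tool result."""
    # Content-block form: [{'type': 'text', 'text': '<json>'}]
    if (
        isinstance(result, list)
        and result
        and isinstance(result[0], dict)
        and result[0].get("type") == "text"
        and isinstance(result[0].get("text"), str)
    ):
        return _loads_or_raw(result[0]["text"])
    if isinstance(result, str):  # bare JSON string
        return _loads_or_raw(result)
    return result                # already a dict/list, or an unknown shape
```

Printing `type(result)` and the top-level keys before calling a helper like this remains the first step; the function only covers the shapes it was written for.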
