llm-obs-eval-pipeline

Backend

Detection: At the start of every invocation, before taking any action, determine which backend to use:
  1. If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
  2. Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__search_llmobs_spans` appears in your available tools.
  3. If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in the sub-skill workflow sections.
  4. If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
  5. If pup responds → use pup mode throughout. Each sub-skill carries its own Tool Reference appendix with the full MCP→pup mapping.
  6. If neither is available → stop and tell the user: "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."
`--backend pup` is accepted anywhere in the invocation arguments. Strip it from args before passing to sub-skills, but carry the pup-mode decision forward: sub-skills must also operate in pup mode for the entire pipeline run.
Sub-skill backend propagation: The backend detected at pipeline startup applies to all three sub-skills (session-classify → trace-rca → eval-bootstrap). Do not re-detect per phase. Announce once at startup:
  • MCP mode: "(Running in MCP mode — all features available.)"
  • pup mode: "(Running in pup mode — pup commands used throughout. RUM signals use `pup rum aggregate`. Notebooks use `pup notebooks create/edit`. All features available.)"
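
The detection order and the `--backend pup` stripping rule above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the skill's actual implementation: `Backend`, `strip_backend_flag`, `pup_is_available`, and `choose_backend` are hypothetical names, and the pup check is reduced to inspecting whatever text `pup --version` printed.

```python
import json
from enum import Enum
from typing import Iterable, List, Optional, Tuple

# Canonical MCP signal named in the detection steps.
MCP_SIGNAL_TOOL = "mcp__datadog-llmo-mcp__search_llmobs_spans"


class Backend(Enum):
    MCP = "mcp"
    PUP = "pup"


def strip_backend_flag(args: List[str]) -> Tuple[List[str], bool]:
    """Remove every `--backend pup` pair from args; report whether one was present."""
    remaining: List[str] = []
    forced = False
    i = 0
    while i < len(args):
        if args[i] == "--backend" and i + 1 < len(args) and args[i + 1] == "pup":
            forced = True
            i += 2  # skip both the flag and its value
            continue
        remaining.append(args[i])
        i += 1
    return remaining, forced


def pup_is_available(version_output: Optional[str]) -> bool:
    """True when the `pup --version` output is JSON containing a "version" key."""
    if not version_output:
        return False
    try:
        payload = json.loads(version_output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and "version" in payload


def choose_backend(
    args: Iterable[str],
    available_tools: Iterable[str],
    pup_version_output: Optional[str] = None,
) -> Tuple[Optional[Backend], List[str]]:
    """Return (backend, args without --backend pup); backend is None if nothing is usable."""
    remaining, forced_pup = strip_backend_flag(list(args))
    if forced_pup:                                   # step 1: explicit override wins
        return Backend.PUP, remaining
    if MCP_SIGNAL_TOOL in set(available_tools):      # steps 2-3: MCP tools present
        return Backend.MCP, remaining
    if pup_is_available(pup_version_output):         # steps 4-5: pup responds
        return Backend.PUP, remaining
    return None, remaining                           # step 6: caller stops and tells the user
```

The function runs once at pipeline startup; the returned backend is then passed unchanged to all three sub-skills rather than re-detected per phase.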
pup invocation rules:
  • Invoke via Bash: `pup llm-obs <subcommand> [flags]`
  • pup always outputs JSON. Parse directly, with no content-block unwrapping (unlike MCP results).
  • If pup returns an auth error, tell the user to run `pup auth login` and stop.
  • Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
  • Time flags: pup accepts bare duration strings (`1h`, `7d`, `30m`) and RFC3339 timestamps. Do not use `now-`-prefixed strings; strip the prefix when converting from a skill `--timeframe` argument: `now-7d` → `7d`, `now-24h` → `24h`, `now-30d` → `30d`.
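
The time-flag conversion in the last rule can be sketched as follows. `to_pup_duration` is a hypothetical helper name; it only strips the `now-` prefix and leaves every other value (bare durations, RFC3339 timestamps) untouched.

```python
def to_pup_duration(timeframe: str) -> str:
    """Convert a skill --timeframe value (e.g. "now-7d") into the form pup accepts."""
    value = timeframe.strip()
    prefix = "now-"
    if value.startswith(prefix):
        value = value[len(prefix):]  # "now-7d" -> "7d"
    return value
```

For example, `to_pup_duration("now-24h")` returns `"24h"`, while `to_pup_duration("30m")` is returned unchanged.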
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.
Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-eval-pipeline[<inv_id>] — ` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-eval-pipeline:start[<inv_id>] — ` instead (note the `:start` suffix). Example first call:
`skill:llm-obs-eval-pipeline:start[3a9f1c2b] — Phase 1: sample root spans for ml_app to begin classification`
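
One way to keep the invocation ID constant and apply the `:start` suffix exactly once is a small stateful helper. This is a sketch under assumptions: the `IntentTagger` class is a hypothetical name, and the prefix format is copied from the rule above.

```python
import secrets

SKILL_NAME = "llm-obs-eval-pipeline"


class IntentTagger:
    """Builds telemetry.intent strings for one pipeline invocation."""

    def __init__(self) -> None:
        # 4 random bytes render as 8 hex characters, e.g. "3a9f1c2b".
        self.inv_id = secrets.token_hex(4)
        self._first_call = True

    def tag(self, reason: str) -> str:
        """Return the intent string; only the first call carries ':start'."""
        suffix = ":start" if self._first_call else ""
        self._first_call = False
        return f"skill:{SKILL_NAME}{suffix}[{self.inv_id}] — {reason}"
```

Create one `IntentTagger` at the start of the invocation and call `tag(...)` for every MCP tool call, so the ID never changes mid-run.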

LLM Obs Eval Pipeline — Classify → RCA → Bootstrap

Walks from unlabeled production LLM trace data to a ready-to-use evaluator suite in three phases, with user checkpoints between each. No pre-existing evals or labeled data required.
llm-obs-session-classify  (ml_app mode)
  llm-obs-trace-rca  (from classifications)
  llm-obs-eval-bootstrap  (from RCA output)

Usage

/llm-obs-eval-pipeline <ml_app> [--timeframe <window>] [--trace-limit <N>] [--data-only] [--publish]
Arguments: $ARGUMENTS

Inputs

| Input | Required | Default | Description |
|---|---|---|---|
| `ml_app` | Yes | | LLM app to analyze end to end |
| `--timeframe` | No | `now-7d` | Lookback window for trace sampling and RCA |
| `--trace-limit` | No | `20` | Max traces to classify in Phase 1 |
| `--data-only` | No | off | Pass through to llm-obs-eval-bootstrap: emit JSON spec instead of Python SDK code |
| `--publish` | No | off | Pass through to llm-obs-eval-bootstrap: publish online LLM-judge evaluators to Datadog |

If `ml_app` is not provided, ask the user before proceeding.
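
The inputs and defaults above could be parsed as in the following sketch. It is illustrative only: the skill receives its arguments as `$ARGUMENTS`, and `parse_pipeline_args` is a hypothetical helper that assumes `--backend pup` has already been stripped.

```python
import argparse
import shlex


def parse_pipeline_args(raw: str) -> argparse.Namespace:
    """Parse the raw argument string using the documented defaults."""
    parser = argparse.ArgumentParser(prog="/llm-obs-eval-pipeline")
    # ml_app is required by the skill, but parsed as optional here so the
    # caller can ask the user for it instead of exiting with an error.
    parser.add_argument("ml_app", nargs="?", default=None)
    parser.add_argument("--timeframe", default="now-7d")
    parser.add_argument("--trace-limit", type=int, default=20)
    parser.add_argument("--data-only", action="store_true")
    parser.add_argument("--publish", action="store_true")
    return parser.parse_args(shlex.split(raw))
```

A result whose `ml_app` is `None` is the signal to ask the user for the app name before proceeding.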

Phase 1: Trace Classification

Follow the llm-obs-session-classify skill in ml_app mode, using:
  • `ml_app` = the provided ml_app
  • `timeframe` = the provided timeframe
  • `sample_limit` = the provided trace-limit
Run the complete ml_app mode workflow as defined in that skill (Steps M1 through M3): sample spans → classify each → emit per-unit compact blocks → emit summary.
Output the full classification output, including all compact per-unit blocks and the final `# Session Classification Summary` section. Do not summarize or truncate this output: downstream phase detection depends on the full text being present in context.

CHECKPOINT 1

After the `# Session Classification Summary` is output, pause and present:

Phase 1 Complete

[verdict distribution table] [failure mode frequency table]
Before I continue to root cause analysis:
  • Do these failure patterns look right?
  • Any traces you'd like to exclude from the RCA?
  • Any failure modes to focus on or ignore?
Type "continue" to proceed, or give me adjustments.

**Wait for explicit user confirmation before proceeding.**

- If the user excludes specific traces: mark them as "excluded by user" and drop them from the failure bucket. Do NOT re-classify.
- If the user asks to re-run with different parameters: re-run Phase 1 with the new parameters.
- If Phase 1 yielded zero failures: surface this explicitly and offer to retry with a wider timeframe or stop.
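
The exclusion rule above (mark, drop from the failure bucket, never re-classify) can be sketched like this. The record shape is an assumption made for illustration: the `trace_id`, `verdict`, and `status` keys, the `"fail"` verdict value, and the `apply_exclusions` name are all hypothetical, not the sub-skill's actual output format.

```python
from typing import Dict, Iterable, List


def apply_exclusions(units: List[Dict[str, str]], excluded_ids: Iterable[str]) -> List[Dict[str, str]]:
    """Mark user-excluded traces and return the remaining failure bucket."""
    excluded = set(excluded_ids)
    failure_bucket = []
    for unit in units:
        if unit["trace_id"] in excluded:
            # Keep the original verdict untouched: no re-classification.
            unit["status"] = "excluded by user"
            continue
        if unit["verdict"] == "fail":
            failure_bucket.append(unit)
    return failure_bucket
```

An empty return value corresponds to the zero-failures case, where the pipeline offers a wider timeframe or stops.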

---

Phase 2: Root Cause Analysis

Follow the llm-obs-trace-rca skill.
The `# Session Classification Summary` from Phase 1 is in context. The skill detects it automatically via its Phase 0 Step 0S check and enters the "from classifications" path: it extracts the failure bucket, presents the Classification Overview, and proceeds directly to Phase 2 (open coding) without running its own Phase 1 span search.
Run the full workflow through Phase 6 (the compiled RCA report). Output the full RCA report — do not summarize. The full report must be in context for Phase 3's detection to work.

CHECKPOINT 2

After the RCA report is output, pause and present:

Phase 2 Complete

[the Phase 6 RCA report is above]
Before I generate evaluators:
  • Do these root causes look accurate?
  • Any failure modes to add, remove, or reframe?
  • Which root causes should the evaluators target?
Type "continue" to proceed, or give me adjustments.

**Wait for explicit user confirmation before proceeding.**

If the user adjusts the taxonomy: incorporate the changes before continuing to Phase 3.

---

Phase 3: Eval Bootstrap

Follow the llm-obs-eval-bootstrap skill.
The RCA report from Phase 2 is in context. The skill detects the `## Failure Taxonomy` heading automatically and enters its "from RCA" path in Phase 0.
Pass through any flags:
  • `--data-only` → emit a JSON spec instead of Python SDK code
  • `--publish` → publish online LLM-judge evaluators directly to Datadog
The llm-obs-eval-bootstrap skill has its own mandatory proposal checkpoint (the evaluator suite proposal before code generation). Honor it — do not skip or auto-confirm it.

Final Summary

After Phase 3 completes, present:

LLM Obs Eval Pipeline Complete

App: `<ml_app>` | Timeframe: <timeframe>

| Phase | Output |
|---|---|
| 1. Classification | <N> traces sampled, <F> failures identified |
| 2. Root Cause Analysis | <K> failure modes, <M> root causes diagnosed |
| 3. Eval Bootstrap | <J> evaluators → `<output_path>` |

Key findings

[3–5 bullets: most important failure patterns and what the evaluators capture]

Next steps

  1. Review the generated evaluators at `<output_path>`
  2. Run an offline experiment to validate eval quality
  3. Once validated, configure as production evals in Datadog

---

Orchestration Rules

  • Always checkpoint before advancing. Never auto-proceed between phases.
  • Never truncate sub-skill outputs. Downstream phase detection depends on the full text being in context.
  • Phase isolation: if the user wants to re-run a single phase, re-run only that phase and its downstream phases.
  • Carry context forward: the output of each phase is the input for the next. Present the full output of each sub-skill before showing the checkpoint prompt.
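
The phase-isolation rule above amounts to re-running a suffix of the phase order. A minimal sketch, with `PHASES` and `phases_to_rerun` as hypothetical names:

```python
from typing import List

# Sub-skill order used by the pipeline.
PHASES = ["session-classify", "trace-rca", "eval-bootstrap"]


def phases_to_rerun(phase: str) -> List[str]:
    """Return the requested phase plus every phase downstream of it."""
    start = PHASES.index(phase)  # raises ValueError for an unknown phase name
    return PHASES[start:]
```

Re-running `trace-rca`, for instance, also invalidates `eval-bootstrap`, because its input is the RCA report. Each re-run phase still ends in its checkpoint.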

Tool Reference

This appendix applies only in pup mode. Each sub-skill also carries its own Tool Reference with the same mappings — consult them for full parameter details. The tables below are a quick reference for pipeline-level orientation.

Spans and traces

| MCP Tool | pup Command |
|---|---|
| `search_llmobs_spans(...)` | `pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--summary]`. Always use `--query "@ml_app:A"`; `--ml-app A` is unreliable. |
| `get_llmobs_span_details(...)` | `pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...` |
| `get_llmobs_span_content(...)` | `pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]` |
| `get_llmobs_trace(...)` | `pup llm-obs spans get-trace --trace-id T [--include-tree]` |
| `get_llmobs_agent_loop(...)` | `pup llm-obs spans get-agent-loop --trace-id T [--span-id S]` |
| `find_llmobs_error_spans(...)` | `pup llm-obs spans find-errors --trace-id T` |
| `expand_llmobs_spans(...)` | `pup llm-obs spans expand --trace-id T --span-ids S1,S2,...` |
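
A minimal sketch of assembling the span-search command from the first row of the table above and parsing its JSON output directly. The flags are taken from that row; the function and parameter names are hypothetical, and auth-error handling is omitted.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_span_search(
    ml_app: str,
    extra_filters: str = "",
    time_from: Optional[str] = None,
    time_to: Optional[str] = None,
    limit: Optional[int] = None,
    summary: bool = False,
) -> List[str]:
    """Build argv for `pup llm-obs spans search`, filtering by ml_app via --query."""
    query = f"@ml_app:{ml_app}"  # --ml-app is documented above as unreliable
    if extra_filters:
        query += f" {extra_filters}"
    argv = ["pup", "llm-obs", "spans", "search", "--query", query]
    if time_from:
        argv += ["--from", time_from]
    if time_to:
        argv += ["--to", time_to]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if summary:
        argv.append("--summary")
    return argv


def run_pup(argv: List[str]) -> Any:
    """Run a pup command and parse stdout as JSON (no content-block unwrapping)."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Passing an argv list instead of a shell string avoids quoting problems when the query contains spaces. A call such as `run_pup(build_span_search("my-app", time_from="7d", limit=20))` would then return the decoded JSON.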

Evaluators

| MCP Tool | pup Command |
|---|---|
| `list_llmobs_evals()` | `pup llm-obs evals list` |
| `get_llmobs_evaluator(eval_name)` | `pup llm-obs evals get-evaluator EVAL_NAME` |
| `get_llmobs_eval_aggregate_stats(...)` | `pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]` |
| `create_or_update_llmobs_evaluator(...)` | `pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json` (see eval-bootstrap Tool Reference for flat schema details) |

RUM

| MCP Tool | pup Command |
|---|---|
| `analyze_rum_events(event_type="view", filter="@usr.email:EMAIL", ...)` | `pup rum aggregate --user-email EMAIL --from F --to T --compute count --group-by @session.id` |
| `analyze_rum_events(event_type="action", filter="@action.type:custom ...", ...)` | `pup rum aggregate --user-email EMAIL --query "@action.type:custom" --from F --to T --compute count --group-by @evt.name` |

Notebooks

| MCP Tool | pup Command |
|---|---|
| `create_datadog_notebook(name, cells, ...)` | `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` |
| `edit_datadog_notebook(id, cells, append_only=true)` | `pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` |

  • MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first: check `type(result)`, top-level keys, and whether the payload is nested inside a content block (e.g. `[{'type': 'text', 'text': '<json>'}]`). Extract and `json.loads()` the inner payload if needed. Never assume MCP results are bare dicts or lists.
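
The parsing-safety rule above can be sketched as a defensive unwrapping helper. `unwrap_mcp_result` is a hypothetical name, and the content-block shape it handles is the one shown in the example above; other shapes are returned unchanged so they can be inspected first.

```python
import json
from typing import Any


def unwrap_mcp_result(result: Any) -> Any:
    """Return the decoded payload whether or not it is wrapped in text content blocks."""
    if isinstance(result, str):
        try:
            return json.loads(result)
        except json.JSONDecodeError:
            return result  # plain text, not JSON
    is_text_blocks = (
        isinstance(result, list)
        and len(result) > 0
        and all(
            isinstance(block, dict) and block.get("type") == "text" and "text" in block
            for block in result
        )
    )
    if is_text_blocks:
        decoded = [unwrap_mcp_result(block["text"]) for block in result]
        return decoded[0] if len(decoded) == 1 else decoded
    return result  # already a bare dict/list or an unknown shape
```

Printing `type(result)` and the top-level keys before calling the helper remains the first step. The helper only removes the most common wrapper.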