llm-obs-eval-pipeline
Backend
Detection: At the start of every invocation, before taking any action, determine which backend to use:
- If the user passed `--backend pup` anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
- Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__search_llmobs_spans` appears in your available tools.
- If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in the sub-skill workflow sections.
- If MCP tools are absent → check whether `pup` is executable: run `pup --version` via Bash. A JSON response containing `"version"` confirms pup is available.
- If pup responds → use pup mode throughout. Each sub-skill carries its own Tool Reference appendix with the full MCP→pup mapping.
- If neither is available → stop and tell the user:
"Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."
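The detection order above can be sketched as a small pure function. This is only an illustration under stated assumptions: the function name, its parameters, and the convention of passing the raw `pup --version` output as a string (or `None` when the command could not be run) are hypothetical and not part of the skill.

```python
from typing import Iterable, Optional

# Canonical MCP signal named in the detection steps.
MCP_SIGNAL_TOOL = "mcp__datadog-llmo-mcp__search_llmobs_spans"


def detect_backend(
    invocation_args: Iterable[str],
    available_tools: Iterable[str],
    pup_version_output: Optional[str],
) -> Optional[str]:
    """Return "pup", "mcp", or None when neither backend is usable.

    pup_version_output is the raw stdout of `pup --version`, or None if the
    command could not be run (hypothetical calling convention).
    """
    args = list(invocation_args)
    # An explicit `--backend pup` wins, even when MCP tools are present.
    for i, arg in enumerate(args[:-1]):
        if arg == "--backend" and args[i + 1] == "pup":
            return "pup"
    # Otherwise prefer MCP when the canonical tool is in the tool list.
    if MCP_SIGNAL_TOOL in set(available_tools):
        return "mcp"
    # Fall back to pup when its version output contains "version".
    if pup_version_output and '"version"' in pup_version_output:
        return "pup"
    # Neither backend is available: the caller should stop and tell the user.
    return None
```

A `None` result corresponds to the final step: stop and show the "Neither the Datadog MCP server nor the pup CLI is available" message.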
Sub-skill backend propagation: The backend detected at pipeline startup applies to all three sub-skills (session-classify → trace-rca → eval-bootstrap). Do not re-detect per phase. Announce once at startup:
- MCP mode: "(Running in MCP mode — all features available.)"
- pup mode: "(Running in pup mode — pup commands used throughout. RUM signals use `pup rum aggregate`. Notebooks use `pup notebooks create/edit`. All features available.)"
pup invocation rules:
- Invoke via Bash: `pup llm-obs <subcommand> [flags]`
- pup always outputs JSON. Parse it directly; no content-block unwrapping is needed (unlike MCP results).
- If pup returns an auth error, tell the user to run `pup auth login` and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
- Time flags: pup accepts bare duration strings (`1h`, `7d`, `30m`) and RFC3339 timestamps. Do not use `now-`-prefixed strings; strip the prefix when converting from a skill `--timeframe` argument: `now-7d` → `7d`, `now-24h` → `24h`, `now-30d` → `30d`.
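The time-flag conversion in the last rule is a plain prefix strip. A minimal sketch, assuming a hypothetical helper name `to_pup_time`; it does not validate units, since the source does not list which units pup accepts.

```python
def to_pup_time(timeframe: str) -> str:
    """Convert a skill --timeframe value into the form pup accepts.

    Strips a leading "now-" so "now-7d" becomes "7d". Bare durations and
    RFC3339 timestamps are returned unchanged.
    """
    value = timeframe.strip()
    prefix = "now-"
    if value.startswith(prefix):
        value = value[len(prefix):]
    return value
```

For example, `to_pup_time("now-24h")` yields `"24h"`, while an RFC3339 timestamp passes through untouched.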
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., `3a9f1c2b`). Keep it constant for the entire invocation.
Intent tagging: On every MCP tool call, prefix `telemetry.intent` with `skill:llm-obs-eval-pipeline[<inv_id>] — ` followed by a description of why the tool is being called. On the first MCP tool call only, use `skill:llm-obs-eval-pipeline:start[<inv_id>] — ` instead (note the `:start` suffix). Example first call:
`skill:llm-obs-eval-pipeline:start[3a9f1c2b] — Phase 1: sample root spans for ml_app to begin classification`
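As an illustration of the invocation ID and intent prefix, here is a short sketch. The helper names `new_invocation_id` and `intent` are hypothetical; only the prefix format and the 8-character hex ID come from the text.

```python
import secrets

SKILL = "llm-obs-eval-pipeline"


def new_invocation_id() -> str:
    """Generate an 8-character hex ID (4 random bytes)."""
    return secrets.token_hex(4)


def intent(inv_id: str, reason: str, first_call: bool = False) -> str:
    """Build the telemetry.intent value for one MCP tool call.

    The first call of an invocation carries the ":start" suffix on the
    skill name; every later call omits it.
    """
    tag = f"skill:{SKILL}:start" if first_call else f"skill:{SKILL}"
    return f"{tag}[{inv_id}] — {reason}"
```

Generating the ID once and passing it to every `intent` call keeps it constant for the whole invocation.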
LLM Obs Eval Pipeline: Classify → RCA → Bootstrap
Walks from unlabeled production LLM trace data to a ready-to-use evaluator suite in three phases, with user checkpoints between each. No pre-existing evals or labeled data required.
llm-obs-session-classify (ml_app mode)
↓
llm-obs-trace-rca (from classifications)
↓
llm-obs-eval-bootstrap (from RCA output)
Usage
/llm-obs-eval-pipeline <ml_app> [--timeframe <window>] [--trace-limit <N>] [--data-only] [--publish]
Arguments: $ARGUMENTS
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| `ml_app` | Yes | — | LLM app to analyze end to end |
| `--timeframe` | No | | Lookback window for trace sampling and RCA |
| `--trace-limit` | No | | Max traces to classify in Phase 1 |
| `--data-only` | No | off | Pass through to llm-obs-eval-bootstrap: emit JSON spec instead of Python SDK code |
| `--publish` | No | off | Pass through to llm-obs-eval-bootstrap: publish online LLM-judge evaluators to Datadog |
If `ml_app` is not provided, ask the user before proceeding.
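The argument shape in the usage line can be mirrored with `argparse`. This is a sketch only: the skill itself is prompt-driven and does not ship a parser, the function name is hypothetical, and the defaults for `--timeframe` and `--trace-limit` are left as `None` because they are not shown in the table above.

```python
import argparse
from typing import List


def parse_pipeline_args(argv: List[str]) -> argparse.Namespace:
    """Parse the pipeline's arguments as listed in the Inputs table."""
    parser = argparse.ArgumentParser(prog="/llm-obs-eval-pipeline")
    # ml_app is required by the skill, but kept optional here so the caller
    # can ask the user for it instead of exiting with a parser error.
    parser.add_argument("ml_app", nargs="?", default=None)
    parser.add_argument("--timeframe", default=None)
    parser.add_argument("--trace-limit", type=int, default=None)
    parser.add_argument("--data-only", action="store_true")
    parser.add_argument("--publish", action="store_true")
    return parser.parse_args(argv)
```

When `ml_app` comes back as `None`, the caller should ask the user before proceeding rather than guess.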
Phase 1: Trace Classification
Follow the llm-obs-session-classify skill in ml_app mode, using:
- `ml_app` = the provided ml_app
- `timeframe` = the provided timeframe
- `sample_limit` = the provided trace-limit
Run the complete ml_app mode workflow as defined in that skill (Steps M1 through M3):
sample spans → classify each → emit per-unit compact blocks → emit summary.
Output the full classification output, including all compact per-unit blocks and the final `# Session Classification Summary` section. Do not summarize or truncate this output; downstream phase detection depends on the full text being present in context.
CHECKPOINT 1
After the `# Session Classification Summary` is output, pause and present:
Phase 1 Complete
[verdict distribution table]
[failure mode frequency table]
Before I continue to root cause analysis:
- Do these failure patterns look right?
- Any traces you'd like to exclude from the RCA?
- Any failure modes to focus on or ignore?
Type "continue" to proceed, or give me adjustments.
**Wait for explicit user confirmation before proceeding.**
- If the user excludes specific traces: mark them as "excluded by user" and drop them from the failure bucket. Do NOT re-classify.
- If the user asks to re-run with different parameters: re-run Phase 1 with the new parameters.
- If Phase 1 yielded zero failures: surface this explicitly and offer to retry with a wider timeframe or stop.
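The exclusion rule above (mark, drop, do not re-classify) can be sketched as a simple filter. The record shape with `trace_id`, `verdict`, and `status` keys is hypothetical; the source does not define how the failure bucket is stored.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def apply_exclusions(
    failures: List[Record], excluded_ids: Iterable[str]
) -> Tuple[List[Record], List[Record]]:
    """Split the failure bucket into kept and user-excluded records.

    Existing verdicts are left untouched: nothing is re-classified.
    """
    excluded = set(excluded_ids)
    kept: List[Record] = []
    dropped: List[Record] = []
    for item in failures:
        if item["trace_id"] in excluded:
            # Copy the record so the original classification is not mutated.
            dropped.append({**item, "status": "excluded by user"})
        else:
            kept.append(item)
    return kept, dropped
```

Only the `kept` list would flow into root cause analysis; `dropped` preserves an audit trail of what the user removed.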
---
Phase 2: Root Cause Analysis
Follow the llm-obs-trace-rca skill.
The `# Session Classification Summary` from Phase 1 is in context. The skill detects it automatically via its Phase 0 Step 0S check and enters the "from classifications" path: it extracts the failure bucket, presents the Classification Overview, and proceeds directly to Phase 2 (open coding) without running its own Phase 1 span search.
Run the full workflow through Phase 6 (the compiled RCA report). Output the full RCA report; do not summarize. The full report must be in context for Phase 3's detection to work.
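Heading-based detection is the reason truncated output breaks the hand-off between phases. The sketch below shows the idea as a plain text check. It assumes detection amounts to finding an exact heading line, which the source implies but does not specify; `has_heading` is a hypothetical helper.

```python
import re


def has_heading(context: str, heading: str) -> bool:
    """Return True if `heading` appears on a line of its own in `context`.

    The match is anchored at the start of the line, so "## Foo" does not
    satisfy a search for "# Foo".
    """
    pattern = r"^" + re.escape(heading) + r"[ \t]*$"
    return re.search(pattern, context, flags=re.MULTILINE) is not None
```

If Phase 1 output were summarized and the `# Session Classification Summary` line lost, a check like this would fail and the next skill would fall back to its own span search.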
CHECKPOINT 2
After the RCA report is output, pause and present:
Phase 2 Complete
[the Phase 6 RCA report is above]
Before I generate evaluators:
- Do these root causes look accurate?
- Any failure modes to add, remove, or reframe?
- Which root causes should the evaluators target?
Type "continue" to proceed, or give me adjustments.
**Wait for explicit user confirmation before proceeding.**
If the user adjusts the taxonomy: incorporate the changes before continuing to Phase 3.
---
Phase 3: Eval Bootstrap
Follow the llm-obs-eval-bootstrap skill.
The RCA report from Phase 2 is in context. The skill detects the `## Failure Taxonomy` heading automatically and enters its "from RCA" path in Phase 0.
Pass through any flags:
- `--data-only` → emit a JSON spec instead of Python SDK code
- `--publish` → publish online LLM-judge evaluators directly to Datadog
The llm-obs-eval-bootstrap skill has its own mandatory proposal checkpoint (the evaluator suite proposal before code generation). Honor it — do not skip or auto-confirm it.
Final Summary
After Phase 3 completes, present:
LLM Obs Eval Pipeline Complete
App: `<ml_app>` | Timeframe: <timeframe>
| Phase | Output |
|---|---|
| 1. Classification | <N> traces sampled, <F> failures identified |
| 2. Root Cause Analysis | <K> failure modes, <M> root causes diagnosed |
| 3. Eval Bootstrap | <J> evaluators → `<output_path>` |
Key findings
[3–5 bullets: most important failure patterns and what the evaluators capture]
Next steps
- Review the generated evaluators at `<output_path>`
- Run an offline experiment to validate eval quality
- Once validated, configure as production evals in Datadog
---
Orchestration Rules
- Always checkpoint before advancing. Never auto-proceed between phases.
- Never truncate sub-skill outputs. Downstream phase detection depends on the full text being in context.
- Phase isolation: if the user wants to re-run a single phase, re-run only that phase and its downstream phases.
- Carry context forward: the output of each phase is the input for the next. Present the full output of each sub-skill before showing the checkpoint prompt.
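The phase isolation rule can be read as "re-run from the chosen phase to the end". A minimal sketch, assuming the three sub-skill short names used in the backend propagation note; `phases_to_rerun` is a hypothetical helper.

```python
from typing import List

# Pipeline order, using the short names from the backend propagation note.
PHASES: List[str] = ["session-classify", "trace-rca", "eval-bootstrap"]


def phases_to_rerun(start: str) -> List[str]:
    """Return the requested phase plus every downstream phase, in order."""
    idx = PHASES.index(start)  # raises ValueError for an unknown phase name
    return PHASES[idx:]
```

Each returned phase still ends with its own checkpoint, since the pipeline never auto-proceeds between phases.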
Tool Reference
This appendix applies only in pup mode. Each sub-skill also carries its own Tool Reference with the same mappings — consult them for full parameter details. The tables below are a quick reference for pipeline-level orientation.
Spans and traces
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
| |
| |
| |
Evaluators
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
RUM
| MCP Tool | pup Command |
|---|---|
| |
| |
Notebooks
| MCP Tool | pup Command |
|---|---|
| |
| |
- MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first: check `type(result)`, top-level keys, and whether the payload is nested inside a content block (e.g. `[{'type': 'text', 'text': '<json>'}]`). Extract and `json.loads()` the inner payload if needed. Never assume MCP results are bare dicts or lists.
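One way to apply the parsing-safety rule is a small unwrapping helper that checks the shape before touching any fields. This is a sketch under the assumption that content blocks look like the example above. `unwrap_mcp_result` is a hypothetical name, and real results may carry other block types that need separate handling.

```python
import json
from typing import Any


def _maybe_json(text: str) -> Any:
    """Parse text as JSON, falling back to the raw string if it is not JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def unwrap_mcp_result(result: Any) -> Any:
    """Return the inner payload of an MCP result, whatever its wrapping."""
    # Inspect the shape first instead of assuming a bare dict or list.
    is_content_blocks = (
        isinstance(result, list)
        and len(result) > 0
        and all(
            isinstance(block, dict)
            and block.get("type") == "text"
            and isinstance(block.get("text"), str)
            for block in result
        )
    )
    if is_content_blocks:
        payloads = [_maybe_json(block["text"]) for block in result]
        return payloads[0] if len(payloads) == 1 else payloads
    if isinstance(result, str):
        return _maybe_json(result)
    # Already a bare dict, list, or scalar: return it unchanged.
    return result
```

pup output needs no such step, since pup prints JSON directly.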