Backend
Detection — At the start of every invocation, before taking any action, determine which backend to use:
- If the user passed anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
- Check whether MCP tools are present in your active tool list. The canonical signal is whether `mcp__datadog-llmo-mcp__list_llmobs_evals` appears in your available tools.
- If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
- If MCP tools are absent → check whether is executable: run via Bash. A JSON response containing confirms pup is available.
- If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
- If neither is available → stop and tell the user: "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (`claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'`) or install pup."
is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.
pup invocation rules:
- Invoke via Bash: `pup llm-obs <subcommand> [flags]`
- pup always outputs JSON. Parse it directly; no content-block unwrapping is needed (unlike MCP results, which may wrap JSON in `[{"type": "text", "text": "<json>"}]`).
- If pup returns an auth error, tell the user to run and stop.
- Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
- Time flags: pup accepts bare duration strings (, , ) and RFC3339 timestamps. Do not use -prefixed strings — strip the prefix when converting from a skill argument: → , → , → .
- on strips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g.,
). Keep it constant for the entire invocation.
Intent tagging: On every MCP tool call, prefix
with
skill:llm-obs-trace-rca[<inv_id>] —
followed by a description of why the tool is being called. On the
first MCP tool call only, use
skill:llm-obs-trace-rca:start[<inv_id>] —
instead (note the
suffix). Example first call:
skill:llm-obs-trace-rca:start[3a9f1c2b] — Phase 0: discover configured evals for task-cruncher to infer analysis mode
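A minimal sketch of the invocation ID and intent-tag rules above, assuming Python is at hand to generate the ID. `new_invocation_id` and `intent_prefix` are hypothetical helper names used for illustration only; they are not part of the skill or of any Datadog API.

```python
import secrets

SKILL = "skill:llm-obs-trace-rca"


def new_invocation_id() -> str:
    """Return an 8-character lowercase hex string (4 random bytes)."""
    return secrets.token_hex(4)


def intent_prefix(inv_id: str, first_call: bool) -> str:
    """Build the tag that precedes the free-text reason on an MCP call.

    Only the first MCP call of an invocation carries the ":start" marker.
    """
    marker = ":start" if first_call else ""
    # \u2014 is the dash separator used in the examples above.
    return f"{SKILL}{marker}[{inv_id}] \u2014 "


if __name__ == "__main__":
    inv_id = new_invocation_id()  # generate once, reuse for every call
    print(intent_prefix(inv_id, first_call=True) + "Phase 0: discover configured evals")
    print(intent_prefix(inv_id, first_call=False) + "Phase 1: collect failing spans")
```

Generating the ID once and passing it to every call keeps all tool calls of one run correlated under the same tag.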
LLM Obs Trace RCA — Root Cause Analysis from Production LLM Traces
Diagnose why an LLM application is failing by searching production traces and walking the span tree from symptom to root cause. The skill automatically selects the best analysis mode based on available signals:
| Mode | Signal | When auto-selected |
|---|---|---|
| Eval Signal | LLM judge verdicts and reasoning (pass/fail rates, scoring) | Evaluators are configured for the app |
| Error Signal | Runtime errors (, error types, stack traces) | No evals configured, or user explicitly asks about errors/crashes |
| Generic | Structural anomalies (latency, agent loops, retrieval misses) | Explicit override, or no strong signal found in Phase 1 |
The mode is announced (never asked) in the first user-facing output with a one-line override hint.
Methodology
Resolve → Search → Observe → Open Coding → Axial Coding → Root Cause Navigation → Recommendations
Usage
What's wrong with <ml_app> over the last <timeframe>
Why is <ml_app> failing today
Analyze eval failures for <eval_name> on <ml_app>
Look at the errors on <ml_app> over the last <timeframe>
Root-cause low scores on <eval_name>
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| Yes (or ) | — | The application to analyze. |
| No | — | One or more evaluators to focus on. Implies Eval Signal mode. Pass a list for multi-eval analysis. |
| No | | How far back to look. |
| No | inferred | Explicit mode override: , , . Skips inference entirely. |
| No | — | Narrowing scope: (routes to Error Signal path), (post-fetch duration sort), (promotes to ), a tool name or span name ( query). |
If neither
nor
is provided, ask the user.
Available Tools
Eval discovery & overview
| Tool | Purpose |
|---|---|
| Discover all configured evals for an . Used in Phase 0 mode inference. |
| `get_llmobs_eval_aggregate_stats` | Pass/fail rate or score distribution for an eval over a time window. |
| Full evaluator config: prompt template, assessment criteria, span filter, sampling, provider. Use instead of the deprecated . |
Trace & span exploration
| Tool | Purpose |
|---|---|
| Find spans by tags, span kind, status, query syntax. Paginate with cursor. Entry point for Phase 1. |
| Metadata, evaluations (scores, labels, reasoning), , error fields, duration, and map showing available fields + sizes. |
| Actual content for a span field. Supports JSONPath via param for targeted extraction. |
| Full trace hierarchy as span tree with span counts by kind. |
| All error spans in a trace with error type, message, stack, and propagation context. |
| Load children of collapsed trace nodes. |
| Chronological agent execution timeline (LLM calls, tool invocations, decisions). May return empty — see Phase 4b fallback. |
Key patterns
| Field | Path | What you get |
|---|---|---|
| | System prompt (first message, usually role) |
| | Last assistant response |
| (no path) | Full conversation including tool calls |
| / | — | Span I/O |
| — | Retrieved documents (RAG apps) |
| — | Custom metadata (prompt versions, feature flags, user segments) |
How to use
Always include in the string — the structured parameter is unreliable and can return spans from other apps. Do not rely on the structured parameter alone.
Useful query fragments — combine with space (AND):
| Goal | Query |
|---|---|
| Errors only | |
| Eval is present on the span | @evaluations.custom.<eval_name>:*
(presence only — pass/fail is read from , not the query) |
| A specific tool by name | |
Dedicated params (
,
,
) work alongside
, but
takes precedence over
.
Parallelization rules
- : Group span_ids by trace_id, chunk each trace's span_ids into batches of at most 20. Issue ALL chunks for a page in a single message.
- : Each call is independent — always issue ALL in a single message.
- / / : Parallelize across different traces in a single message.
- Pipeline parallelism: Start for page 1 results immediately — don't wait to collect all pages.
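The batching rule in the first bullet (group by trace, then chunk to at most 20 span IDs) can be sketched as follows. The input shape, dictionaries with `trace_id` and `span_id` keys, is an assumption about what a search page looks like, and `plan_detail_batches` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Iterable

MAX_SPANS_PER_CALL = 20  # batch size stated in the rule above


def plan_detail_batches(spans: Iterable[dict]) -> list[tuple[str, list[str]]]:
    """Group span IDs by trace ID, then split each group into chunks of <= 20.

    Each returned (trace_id, span_ids) pair corresponds to one detail call;
    all pairs for a page would be issued together in a single message.
    """
    by_trace: dict[str, list[str]] = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span["span_id"])

    batches: list[tuple[str, list[str]]] = []
    for trace_id, span_ids in by_trace.items():
        for start in range(0, len(span_ids), MAX_SPANS_PER_CALL):
            batches.append((trace_id, span_ids[start:start + MAX_SPANS_PER_CALL]))
    return batches
```

A trace with 45 sampled spans yields three batches (20, 20, 5), while a trace with a single span yields one.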
Analysis Workflow
Output discipline: Phases 0–5 are internal analysis. The only user-facing outputs during these phases are the Phase 1 Signal Summary and the mandatory checkpoints at Phases 2 and 3. Do NOT narrate reasoning, summarize intermediate findings, or output Phase 4 deep-dive results as prose. All detailed findings go exclusively into the Phase 6 report.
Phase 0: Resolve Inputs & Infer Mode
First: check for classification context. Scan the conversation for a `# Session Classification Summary` header. If found → enter Step 0S below and skip all remaining Phase 0 steps and Phase 1 entirely.
Step 0S — Extract Failure Bucket from Classification Output
The canonical handoff format is the Per-Unit Details table inside the `# Session Classification Summary` section. Extract one row per unit:
| Field | Source |
|---|---|
| Link URL in the ID column: parse the or query parameter from the link href |
| Verdict column |
| Failure Mode column ( for passing rows) |
| Reason column — use as the Phase 2 reasoning input (same role as eval judge reasoning or error messages) |
| From the # Session Classification Summary
header line (e.g. ) — default if absent |
Failure bucket = all rows where verdict is
or
.
- < 5 entries → note low confidence, proceed anyway.
- Empty → report "No failures found in the classification output" and stop.
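As a rough sketch of the failure-bucket rule, assuming the extracted rows are dictionaries with a `verdict` key and that the failing verdict values are `no` and `partial` (inferred from the "no+partial" label in the overview template, not confirmed elsewhere). `build_failure_bucket` is a hypothetical name.

```python
# Assumed verdict values, taken from the "Failures (no+partial)" label.
FAILING_VERDICTS = {"no", "partial"}


def build_failure_bucket(rows: list[dict]) -> tuple[list[dict], str]:
    """Keep failing rows and report how much confidence the bucket supports.

    Status is "empty" (stop and report), "low_confidence" (fewer than 5
    entries, proceed with a note), or "ok".
    """
    failures = [row for row in rows if row["verdict"] in FAILING_VERDICTS]
    if not failures:
        return failures, "empty"
    if len(failures) < 5:
        return failures, "low_confidence"
    return failures, "ok"
```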
Present this overview before proceeding:
## Classification Overview (from llm-obs-session-classify)
**ml_app**: <from summary header> | **Classified**: N | **Failures (no+partial)**: F | **Pass rate**: X%
| Failure Mode | Count |
|---|---|
...
Proceeding to Phase 2 using F failure traces. Mode inference bypassed — classification verdict is the signal.
Then skip Phase 1 and jump directly to Phase 2. Carry forward:
- Phase 2 reasoning input: `(trace_id, span_id, detail)` tuples, the same structure as eval reasoning or error messages
- Phase 4 navigation: use from each trace block to choose the span navigation strategy
- Phases 2–7: run completely unchanged — the failure bucket structure is identical regardless of source
Standard resolution (no classification context):
- If neither nor provided → ask the user. If is provided but is not → also ask for (eval names are not globally unique; without it, span searches return results from all apps sharing the eval name).
- If not provided → default to .
- Resolve (before mode inference):
- → force Error Signal mode
- → treat as , then continue inference
- → note for Phase 1 (sort by duration post-fetch); continue inference
- Tool/span name → note as query fragment for Phase 1; continue inference
- Resolve mode (skip if was explicitly provided):
- given → Eval Signal
- User explicitly mentioned errors/exceptions/crashes → Error Signal
- Otherwise → call `list_llmobs_evals_by_ml_app(ml_app)`:
- Evals returned → Eval Signal
- No evals → Error Signal (announce auto-selection in Phase 1)
- When is multi-valued, note for Phase 1: run parallel per-eval searches and merge+dedup by .
Phase 1: Find Problematic Spans
Three mode-specific paths. All end with a Signal Summary that labels the mode and includes a one-line override hint.
Mode switch handling: At any checkpoint, if the user says "switch to [error|eval|generic] mode", re-enter Phase 1 with the new mode. Do not re-resolve the Phase 0 inputs.
Auto-pivot: If the selected mode finds no data (0 evals configured, 0 error spans in timeframe), announce the pivot to Generic and proceed — do not stop and ask.
Eval Signal path
Step 1a: Eval overview (parallel)
For each eval, call both in a single parallel batch:
get_llmobs_eval_aggregate_stats(eval_name, from, to)
get_llmobs_evaluator(eval_name)
Interpret aggregate stats:
- → Note "no data." Skip this eval (or pivot to Generic if it's the only one).
- Boolean → Note "100% pass." Skip unless it's the only eval.
- Boolean with failures → Note counts and pass_rate. Continue.
- Score with assessment criteria → Note distribution and pass/fail counts. Continue.
- Score WITHOUT assessment criteria → Infer failures: bottom quartile, or below median if bimodal. Label as "inferred failures" in report.
- Categorical with assessment criteria → Note top_values and pass/fail. Continue.
- Categorical WITHOUT assessment criteria → Infer from context (e.g., "error", "incomplete", "off_topic" are likely failures). Ask user if genuinely ambiguous.
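For score evals without assessment criteria, the bottom-quartile rule might look like the sketch below. It covers only the quartile branch (the bimodal, below-median case is not handled), and the mapping of span ID to numeric score is an assumed input shape; `inferred_failures` is a hypothetical name.

```python
import statistics


def inferred_failures(scores: dict[str, float]) -> list[str]:
    """Return span IDs whose score falls at or below the first quartile.

    These rows should be labeled "inferred failures" in the report, because
    the threshold comes from the distribution and not from the evaluator.
    """
    if len(scores) < 2:
        return []  # a quartile cut is meaningless on fewer than two points
    q1 = statistics.quantiles(list(scores.values()), n=4)[0]
    return sorted(span_id for span_id, score in scores.items() if score <= q1)
```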
Interpret eval config:
- Config returned (custom) → Store , , , .
- Config nil (OOTB) → Note prompt is not inspectable.
Calibration cross-check: When two evals share a name prefix but differ in type (e.g.
and
), compare their pass rates on overlapping spans. A discrepancy >20% is an
Evaluator Calibration Discrepancy — flag it in the report.
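One way to express the calibration cross-check, assuming each eval's verdicts are available as a mapping from span ID to pass/fail and that ">20%" means an absolute gap of more than 20 percentage points between the two pass rates. Both the data shape and that reading of the threshold are assumptions; `calibration_gap` is a hypothetical name.

```python
from typing import Optional


def calibration_gap(
    verdicts_a: dict[str, bool],
    verdicts_b: dict[str, bool],
) -> Optional[tuple[float, bool]]:
    """Compare the pass rates of two evals on the spans both of them scored.

    Returns (gap_in_percentage_points, flagged), or None when the evals share
    no spans and therefore cannot be compared.
    """
    shared = verdicts_a.keys() & verdicts_b.keys()
    if not shared:
        return None
    rate_a = 100 * sum(verdicts_a[s] for s in shared) / len(shared)
    rate_b = 100 * sum(verdicts_b[s] for s in shared) / len(shared)
    gap = abs(rate_a - rate_b)
    return gap, gap > 20
```

Restricting the comparison to overlapping spans avoids flagging a discrepancy that only reflects different sampling between the two evaluators.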
Step 1b: Collect failure spans
For each eval:
search_llmobs_spans(query="@evaluations.custom.<eval_name>:*", from, limit=50)
. When multi-valued, issue one search per eval in parallel — merge result sets, dedup by .
- Paginate until ≥15–20 failures OR no more pages. Cap at 200 spans total.
- per trace_id batch (follow Parallelization Rules).
- Extract per row: assessment, value, reasoning, span_id, trace_id, span_kind, content_info.
- Separate into pass/fail buckets using thresholds from Step 1a.
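The pagination and bucketing loop above can be sketched like this. `fetch_page` stands in for whatever performs the span search and returns `(rows, next_cursor)`; it and `is_failure` are hypothetical callables supplied by the caller, and the default of 15 failures is the lower end of the stated 15–20 target.

```python
from typing import Callable, Optional

Page = tuple[list[dict], Optional[str]]


def collect_failures(
    fetch_page: Callable[[Optional[str]], Page],
    is_failure: Callable[[dict], bool],
    min_failures: int = 15,
    max_spans: int = 200,
) -> tuple[list[dict], list[dict]]:
    """Page through results until enough failures are found or a cap is hit.

    Returns (passes, failures). Stops when the failure target is reached,
    when there are no more pages, or after max_spans rows were inspected.
    """
    passes: list[dict] = []
    failures: list[dict] = []
    cursor: Optional[str] = None
    seen = 0
    while seen < max_spans:
        rows, cursor = fetch_page(cursor)
        usable = rows[: max_spans - seen]  # never exceed the overall cap
        for row in usable:
            (failures if is_failure(row) else passes).append(row)
        seen += len(usable)
        if len(failures) >= min_failures or cursor is None or not rows:
            break
    return passes, failures
```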
JSON-type eval fallback: If
@evaluations.custom.<eval_name>:*
returns 0 spans but
get_llmobs_eval_aggregate_stats
confirmed
, the eval is JSON-type and scores are not indexed on this field. Fall back to: search by the span name or span kind that the eval targets (check
for the span filter), then inspect output payloads for JSON verdict fields via
get_llmobs_span_content(field="output")
.
Step 1c: Signal Summary (Eval Signal)
## Signal Summary: `{ml_app}` · Eval Signal
(Inferred from {N} configured eval(s). Say `switch to error mode` or `switch to generic mode` to change.)
**Timeframe**: {from} → {to}
| Eval | Type | Total | Pass Rate | Status |
|------|------|------:|:---------:|--------|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |
| eval_2 | score | 1,200 | — (inferred threshold) | ⚠ Investigating |
| eval_3 | boolean | 500 | 99.2% | ✓ Healthy |
**Collected**: {pass_count} passing, {fail_count} failing.
For a single eval, collapse to a single-line header instead of a table.
Error Signal path
Step 1a: Sample error spans
search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:error", from=timeframe, limit=50)
. Paginate until ≥30 error spans or no more pages.
Step 1a.5: Soft error scan
MCP tool spans sometimes report
but carry
in their output payload — these are invisible to
queries and can outnumber hard errors.
Call
search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", span_kind="tool", from=timeframe, limit=20)
. For a sample of 5–10 results, call
get_llmobs_span_content(field="output")
in parallel. If any payloads contain
, add
MCP soft errors as a separate row in the error frequency table with the note:
(status:ok but isError:true in payload — not queryable via @status:error).
Step 1b: Group by error type
Group spans by
tag → frequency table. If
tag is absent on some spans, supplement with the
field from
(fetched in Step 1d).
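A small sketch of the frequency table, assuming each span has already been reduced to a dictionary whose `error_type` key holds the tag value, or the supplemented value from span details when the tag is absent. The key name and `error_frequency` are placeholders for illustration.

```python
from collections import Counter


def error_frequency(spans: list[dict]) -> list[tuple[str, int]]:
    """Count spans per error type, most frequent first.

    Spans with no resolvable type are kept under "(untyped)" so that they
    stay visible in the table instead of being silently dropped.
    """
    counts = Counter(span.get("error_type") or "(untyped)" for span in spans)
    return counts.most_common()
```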
Step 1c: Fetch stack traces (parallel)
For the top 3–4 error types by count, pick 2–3 representative trace IDs each. Call
find_llmobs_error_spans(trace_id)
in parallel across all selected traces. Extract:
- Error message and stack trace
- Origin span kind and name
- Whether errors propagate from children to parents (cascade) or are isolated
Step 1d: Fetch span details
on representative spans for each error type (follow Parallelization Rules). Extract
,
, duration.
Step 1e: Signal Summary (Error Signal)
## Signal Summary: `{ml_app}` · Error Signal
(No evals configured — analyzing runtime errors. Say `switch to eval mode` or `switch to generic mode` to change.)
**Timeframe**: {from} → {to} | **Total error spans sampled**: {N}
| Error Type | Spans | Cascade? | Origin Span Kind |
|------------|------:|:--------:|-----------------|
| TimeoutError | 42 | Yes | tool |
| APIError 429 | 18 | No | tool |
| ValueError | 7 | No | llm |
| MCP soft errors (isError:true) | 23 | No | tool |
Generic path
Step 1a: Eval health check (when evals are configured)
If
returned evals in Phase 0, call
get_llmobs_eval_aggregate_stats
for each enabled eval in parallel. Flag any enabled eval with
as
Broken Eval Configuration — include in the Signal Summary anomaly table as a High severity row.
Step 1b: Broad span search
search_llmobs_spans(query="@ml_app:\"<ml_app>\"", root_spans_only=true, from=timeframe, limit=50)
. Apply
narrowing if present (tool/span name →
query;
→ sort result set by
after Step 1c). Paginate until ≥30 spans.
Step 1c: Fetch span details
Step 1d: Rank by structural anomalies
Partition spans using heuristics:
- Top decile by (latency outliers)
- Agent spans with >N tool/LLM iterations (long-running loops)
- Retrieval spans returning 0 documents (RAG miss)
- Workflow spans whose child set is missing an expected step (compare against median child layout)
- Token efficiency: Check if
non_cached_input_tokens ≈ input_tokens
across LLM spans. If the app has stable system prompts (>1k tokens) and cache hit rate is 0%, flag as High severity — enabling on the system prompt would cut input token costs by 60–90%
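The first heuristic (top decile by latency) could be computed as below. The `duration` key is a placeholder for whichever duration field the span details expose, and `latency_outliers` is a hypothetical name.

```python
import statistics


def latency_outliers(spans: list[dict]) -> list[dict]:
    """Return spans whose duration exceeds the sample's 90th percentile."""
    durations = [span["duration"] for span in spans]
    if len(durations) < 2:
        return []  # a percentile cut needs at least two data points
    # quantiles(n=10) returns nine cut points; the last one is p90.
    p90 = statistics.quantiles(durations, n=10)[-1]
    return [span for span in spans if span["duration"] > p90]
```

Because the threshold is relative to the sample, it always flags roughly a tenth of the spans; whether those are real problems still has to be judged against the other heuristics.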
Step 1e: Signal Summary (Generic)
## Signal Summary: `{ml_app}` · Generic
(Analyzing structural anomalies. Say `switch to eval mode` or `switch to error mode` to change.)
**Timeframe**: {from} → {to} | **Sampled**: {N} root spans
| Anomaly Type | Count |
|---|:---:|
| Latency outliers (>p90) | 12 |
| Long agent loops (>8 iterations) | 5 |
| RAG retrieval misses | 3 |
| Zero prompt cache utilization | All LLM spans |
| Broken eval configurations | 2 |
Phase 1.5: Determine App Profile & Where the Root Cause Lives
Inspect
and
across collected spans. Drives Phase 4 strategy.
App profile (from content_info):
| Signal | App profile | Phase 4 strategy |
|---|---|---|
| has | LLM/chat app | Extract system prompt via , check conversation flow |
| has | RAG app | Check retrieval quality alongside LLM output |
| Trace contains span kind | Agent app | Try first; if it returns empty use child-span reconstruction (see Phase 4b) |
| Long conversation | Check for context overflow |
| has | Has custom metadata | Check for clustering by metadata values (prompt version, feature flags) |
LLM Experiments traces: If root spans have
and carry
,
, and
structured fields, you are looking at a
Datadog LLM Experiments trace. Each span represents one dataset record run. Read quality signal from the root span's
/
/
fields via
— not from LLM sub-span messages, which may contain stub or placeholder content. Evaluations attached to experiment spans are computed by the Experiments framework at run time and may not be registered as online Datadog evaluators (
will return 404 for them).
Where the root cause likely lives — by symptom span kind:
| Symptom span kind | Symptom looks like | But root cause is often in... |
|---|---|---|
| Bad LLM response (eval flagged, wrong output) | Parent agent (bad instructions), sibling retrieval (bad context), sibling tool (bad data) |
| Bad orchestration | Child spans (wrong tool calls, bad routing), full agent loop |
| Bad tool result | Parent LLM (passed wrong parameters), tool implementation |
| Bad overall output | Child sub-spans (which step first deviated?) |
| Bad retrieval | Query construction (parent), index/embedding config (outside trace) |
Key insight: The signal — eval verdict, error message, latency outlier — flags one span in isolation. It's a symptom report, not a diagnosis. The root cause often lives in a different span: a parent that gave bad instructions, a sibling that provided bad context, or a child that made a wrong decision. Phase 4 navigates the tree to find it.
Phase 2: Open Coding — Initial Failure Categorization
Goal: Read per-row evidence and propose initial, concrete failure categories. Pool all problematic rows together — categories should describe app behaviors, not which signal flagged them.
Per-row "reasoning input" by mode:
- Eval Signal: judge assessment + reasoning from
- Error Signal: error message + stack trace excerpt from
- Generic: one-line description of the structural anomaly that flagged the row
Shortcuts:
- < 15 problematic rows: Combine Phases 2 and 3 into one pass. Still produce the checkpoint.
- > 80% share the same reasoning/error/symptom: Skip to Phase 4 with the dominant pattern. Still output checkpoint.
- > 50 problematic rows: Sample ~50, build taxonomy, then spot-check 10–15 more.
- Use the per-row signal from Phase 1; do NOT re-fetch. Only call `get_llmobs_span_content(field="input"/"output")` for spans where the reasoning is insufficient (generic, empty, or just a stack trace with no app context).
- If eval config is loaded (Eval Signal), distinguish early:
  - App failures: Output genuinely violates the eval's criteria
  - Eval failures: Output seems reasonable but eval criteria are too strict/ambiguous
- Each pattern must be specific: "Agent called search instead of calculator for price computation", NOT "tool issue."
MANDATORY CHECKPOINT
**Open coding**: {N} problematic rows → {K} initial categories: {Category1} ({count}), {Category2} ({count}), ...
Phase 3: Axial Coding — Refine Failure Taxonomy
Goal: 3–8 final categories, ranked by impact.
- Merge: Categories with < 3 occurrences → parent category or drop as noise.
- Split: Categories with > 30% of failures → more specific sub-categories. Pull additional span content if needed.
- Validate: 2–3 representative examples per category confirm the label fits.
- Rank:
priority = count × severity
(severity: high / medium / low).
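To make the ranking formula concrete, here is a sketch that maps the three severity labels to numeric weights. The weights 3/2/1 are an assumption chosen for illustration; the text only says that priority is count times severity. `rank_categories` is a hypothetical name.

```python
# Assumed numeric weights for the three severity labels.
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def rank_categories(categories: list[dict]) -> list[dict]:
    """Sort categories by count x severity weight, highest priority first."""
    return sorted(
        categories,
        key=lambda cat: cat["count"] * SEVERITY_WEIGHT[cat["severity"]],
        reverse=True,
    )
```

With these weights, a high-severity category with 5 occurrences (priority 15) outranks a low-severity one with 10 occurrences (priority 10).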
MANDATORY CHECKPOINT
**Axial coding**: {merges/splits/drops}. Final categories:
1. {Category} ({count}, {pct}%) — {severity}
2. ...
Phase 4: Root Cause Analysis — Navigate from Symptom to Root Cause
Goal: The signal flagged a span. That's the symptom. Navigate the trace tree to find the actual root cause — it's often in a different span.
For each of the top 3 categories, pick 2–3 representative traces:
Step 4a: Trace structure + errors (parallel)
For each representative trace, call in a single message:
- `get_llmobs_trace(trace_id)`: span hierarchy; locate the symptom span and its parent/siblings/children
- `find_llmobs_error_spans(trace_id)`: check for runtime errors anywhere in the trace
Runtime vs behavioral: If errors exist on or near the symptom span, the root cause may be a runtime failure rather than a behavioral one. Check this first.
Distributed trace fallback: If
returns "cannot find parent" or an empty span list (common in Ray-based or multi-process execution), reconstruct the trace manually using
on the span_ids collected in Phase 1, sorted by
.
Step 4b: Navigate to the root cause (parallel)
Use the symptom span kind (from Phase 1.5). Issue ALL calls in a single message.
If symptom is on an span (most common):
- `get_llmobs_span_content(field="messages", path="$.messages[0]")` on symptom span: system prompt
- `get_llmobs_span_content(field="messages")` on symptom span: full context received
- `get_llmobs_span_content(field="documents")` on sibling retrieval spans (if any)
- `get_llmobs_span_content(field="input")` on sibling tool spans (if any)
- `get_llmobs_span_content(field="messages", path="$.messages[0]")` on parent agent/workflow span
If symptom is on an span:
get_llmobs_agent_loop(trace_id, span_id)
— full decision timeline (try first; if it returns 0 iterations, use the fallback below)
- on child spans — sort by to reconstruct the execution timeline
get_llmobs_span_content(field="input"/"output")
on child spans that look wrong
Agent loop fallback (when
returns 0 iterations): Reconstruct the timeline from
results sorted by
. Group by
to identify LLM → tool → LLM sequences. This fallback is frequently needed —
returns empty for many apps.
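The fallback reconstruction can be sketched as a sort followed by a run-length grouping. `start_ns` and `span_kind` are placeholder field names for the start timestamp and span kind in the span details, and `reconstruct_timeline` is a hypothetical helper.

```python
from itertools import groupby


def reconstruct_timeline(child_spans: list[dict]) -> list[tuple[str, int]]:
    """Order child spans by start time and collapse consecutive same-kind runs.

    The result reads as the agent's step sequence, for example
    [("llm", 1), ("tool", 2), ("llm", 1)] for LLM -> two tool calls -> LLM.
    """
    ordered = sorted(child_spans, key=lambda span: span["start_ns"])
    return [
        (kind, len(list(run)))
        for kind, run in groupby(ordered, key=lambda span: span["span_kind"])
    ]
```

Long runs of the same kind, or many alternations between `llm` and `tool`, are the patterns worth inspecting first when looking for loops.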
get_llmobs_span_content(field="input")
on symptom span — what parameters was it called with?
get_llmobs_span_content(field="messages")
on parent LLM span — did the LLM construct the call correctly?
- on all child spans — find which step first deviated
get_llmobs_span_content(field="input"/"output")
on the deviating child
Always also fetch:
get_llmobs_span_content(field="metadata")
on the symptom span — clustering signals (prompt version, feature flags)
Step 4c: Diagnose — from symptom to root cause
For each category, trace the causal chain:
- Symptom — what the signal flagged (eval reasoning, error message, anomaly note). The signal only saw one span in isolation — its reasoning may be shallow.
- Trace context — what surrounding spans reveal (parent instructions, sibling data, child decisions).
- Root cause — the specific span and decision point where the failure originated. Often NOT the symptom span itself.
For suspected eval issues (Eval Signal, if config loaded): Compare eval criteria against evidence. Is the prompt ambiguous? Criteria too strict?
Root cause categories:
| Category | Description |
|---|---|
| System Prompt Deficiency | Instructions unclear, missing, or contradictory — in symptom span OR its parent |
| Tool Gap | Needed tool doesn't exist or parameters too coarse |
| Tool Misuse | Wrong tool called or wrong parameters — often visible in agent loop or parent LLM |
| Routing/Handoff Error | Wrong sub-agent selected (multi-agent systems) |
| Retrieval Failure | RAG returned irrelevant or missing context — check sibling retrieval spans |
| Context Overflow | Critical info lost due to context length |
| Upstream Data Issue | A sibling or parent span provided bad data that cascaded to the symptom span |
| Runtime Error | Tool/API failure, timeout, exception — from |
| Evaluator Miscalibration | Eval criteria produce false positives/negatives (Eval Signal mode only) |
Phase 5: Generate Recommendations
Goal: Concrete, actionable recommendations grounded in trace evidence. Actual text/code changes with before/after quotes from the trace — not generic advice.
Recommendation types: System Prompt Edit (quote actual prompt, provide before/after), Tool Gap/Misuse (reference agent loop steps), Routing/Handoff Fix, Retrieval Fix (show retrieved vs needed), Evaluator Prompt Edit (flag that eval changes need re-validation; Eval Signal only), Other.
When run in Claude Code with codebase access: Search the codebase for the system prompt, tool definitions, or routing logic. Propose specific diffs. Always ask before modifying files.
Phase 6: Compile RCA Report
Write the full report following the Output Format below. This is the primary deliverable — output it directly in the chat.
Phase 7: Post-Analysis Actions
Do NOT take any action automatically. After presenting the report, ask the user what they'd like to do next:
- Save the report to
llm-obs-rca-{ml_app}-{date}.md
- Apply fixes (if codebase is available)
- Deeper investigation of remaining categories
- Export to a Datadog notebook — in pup mode, use to create the notebook and
pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json
to append sections (see Tool Reference)
- Re-run on an expanded time range (e.g. if current window was )
If the user chooses the notebook export (the fourth option above), follow the notebook creation fallback pattern:
- Call `mcp__datadog-mcp-core__create_datadog_notebook` with:
- :
LLM Obs RCA: {ml_app} ({mode}) — YYYY-MM-DD
- :
- :
- : one cell per section (see Notebook Cell Structure below)
- If the MCP call fails, inspect the error before giving up:
- Auth / permission error (401, 403) → stop and tell the user.
- Field validation error (error message names a specific field) → fix that field and retry the MCP call once.
- Any other error (binding, serialization, unexpected response) → fall back to pup:
- Write the notebook payload to
/tmp/nb_rca_{ml_app}.json
as a full API envelope: {"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
- Run
pup notebooks create --file /tmp/nb_rca_{ml_app}.json
- If pup is not available either, render the full notebook content as markdown in chat so the user has it.
- After successful creation by either method, output the URL on its own line: `RCA report exported to notebook: <url>`
Print the URL prominently — if
runs next in the same session, it will detect this URL and offer to append the evaluator suite to the same notebook.
Notebook Cell Structure
| Cell | Content |
|---|---|
| 1 — Overview | Structured header (see Overview cell format below — follow it exactly) |
| 2 — Signal Summary | Mode-specific health table |
| 3 — Failure Taxonomy | Taxonomy table |
| 4…N — Failure Modes | One cell per failure mode |
| N+1 — Action Plan + Limitations | Action plan table + bullet list |
Notebook formatting rules (apply to every cell):
- No triple-backtick code blocks — use blockquotes () for prompts/rubrics, inline code () for short values
- Evidence as tables — not bullet lists
- Tool inputs as tables — Argument | Wrong value passed | Correct approach
- Action plan as a table — Priority | Action | Confidence | Impact
Output Format
Overview cell (notebook Cell 1 / report header)
The Overview cell must follow this exact structure. No prose paragraphs. No inline-numbered findings. App description is one sentence maximum.
## `{ml_app}` · {Eval Signal | Error Signal | Generic} · {timeframe}
**Date**: {YYYY-MM-DD} | **Profile**: {short app profile} | **Model**: `{model(s)}`
{One sentence: what does this app do?}
| Metric | Value |
|--------|-------|
{mode-appropriate rows — see below}
### Findings
- **{Finding 1}** (~{pct}%): one-line root cause description
- **{Finding 2}** (~{pct}%): one-line root cause description
- **{Finding 3}** (if present): one-line root cause description
### Recommendations
- **{Recommendation 1}**: specific next step tied to Finding 1
- **{Recommendation 2}**: specific next step tied to Finding 2
*Sample: {N} spans analyzed. Confidence: High | Medium | Low — {one-line reason if Medium or Low}.*
Mode-appropriate metric rows:
Eval Signal:
| Eval | `{eval_name}` ({type}) |
| Spans evaluated | {total_count} |
| Pass rate | {pass_rate}% ({pass_count} pass / {fail_count} fail) |
| Top failure mode | {name} (~{pct}%) |
| Evals configured | {N} |
Error Signal:
| Error spans | {N} confirmed |
| Top error type | `{type}` ({pct}%) |
| Affected operation | `{span_name}` |
| Cascade pattern | Isolated / Cascading |
| Evals configured | {N} (none = no quality signal) |
Generic:
| Spans sampled | {N} root spans |
| Top anomaly | {type}: {count} spans |
| Error spans | {N} (0 = structurally healthy) |
| Evals configured | {N} (none = no quality signal) |
Signal Summary Table
When entering from Step 0S (classification context), replace the Signal Summary table with:
## Classification Signal Summary
**Source**: llm-obs-session-classify | **ml_app**: {app} | **Signal**: content-only | content+evals
| Metric | Value |
|--------|-------|
| Traces classified | N |
| Failures in corpus (no+partial) | F |
| Pass rate | X% |
| Failure modes | list |
*Root cause analysis is based on per-trace classification verdicts, not automated eval judge reasoning.*
Otherwise, mode-specific — pick the appropriate variant:
Eval Signal — one row per eval:
| Eval | Type | Total | Pass Rate | Status |
|---|---|---|---|---|
| eval_1 | boolean | 4,891 | 37.3% | ⚠ Investigating |
Error Signal — one row per error type:
| Error Type | Spans | Cascade? | Origin Span Kind |
|---|---|---|---|
| TimeoutError | 42 | Yes | tool |
Generic — one row per anomaly type:
| Anomaly Type | Count |
|---|---|
| Latency outliers (>p90) | 12 |
Failure Taxonomy
| # | Failure Mode | Traces | % | Severity | Root Cause |
|---|---|---|---|---|---|
| 1 | ... | ... | ...% | High | Tool Misuse |
Failure Mode Sections (one per top 3–5 modes)
## Failure Mode N: [Name]
**Count**: {n} spans, {t} traces | **Severity**: High/Medium/Low | **Root Cause**: [Category]
[3–5 sentences: what goes wrong, when, what triggers it, causal chain.]
**Evidence**
{Use the mode-appropriate column set:}
Eval Signal — Trace | Judge verdict | What the trace revealed:
| Trace | Judge verdict | What the trace revealed |
|---|---|---|
| [69de86a7...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}) | fail | Parent agent has no date format instruction |
Error Signal — Trace | Behavior | Version:
| Trace | Behavior | Version |
|---|---|---|
| [69de86a7...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}) | 7 parallel calls, all 400 | v107624932 |
Generic — Trace | Anomaly | Signal:
| Trace | Anomaly | Signal |
|---|---|---|
| [69de86a7...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}) | 94s, 12 tool calls | Latency outlier |
{For tool misuse — add a tool inputs table:}
**Tool inputs (100% of sampled calls)**
| Argument | Value passed (wrong) | Correct approach |
|---|---|---|
| `query` | `"monitor_id:123 group_status:alert"` | `"monitor_id:123"` (name/tag only) |
{For Eval Signal — add judge reasoning as a blockquote:}
> "{quoted judge reasoning}"
**Root cause**: [WHY this happens — specific span, parameter, or prompt.]
**Fix**:
BEFORE: [actual text from trace]
AFTER: [proposed replacement]
**Impact**: Eliminates ~{n} spans / {timeframe}.
Prioritized Action Plan
| Priority | Action | Confidence | Impact |
|---|---|---|---|
| 1 | Fix schema — add param | High | Eliminates ~21 spans/7d |
When mode is Generic and no evals are configured, always append as the final action plan row:
| N | Configure at least one evaluator | High | Enables Eval Signal mode for future RCAs — app currently has no ongoing quality signal |
Limitations & Follow-ups
Bullet list of what needs more data or follow-up action.
Operating Rules
- Ground in evidence: Every claim references span IDs with clickable trace links:
[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})
.
- Root cause over symptom: "System prompt doesn't specify date format" not "model gave wrong answer."
- Show your math: "47 failures (34%)" not "many failures."
- Honest about uncertainty: < 5 examples = tentative. Flag it.
- Anonymize PII: No emails or names. User/org IDs are fine.
- MCP result parsing safety: Before writing any script that iterates over MCP tool results, inspect the raw structure first — check top-level keys and whether the payload is nested inside a content block (e.g.
[{'type': 'text', 'text': '<json>'}]
). Extract and the inner payload if needed. Never assume MCP results are bare dicts or lists.
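A defensive unwrapping helper along the lines of the parsing-safety rule might look like this. It assumes the only wrapper shape is the list of `{"type": "text", "text": ...}` blocks shown above; any other structure is returned untouched so it can be inspected. `unwrap_mcp_result` is a hypothetical name.

```python
import json
from typing import Any


def _parse_or_keep(text: str) -> Any:
    """Parse text as JSON; return it unchanged if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def unwrap_mcp_result(result: Any) -> Any:
    """Return the inner payload whether or not it arrives in content blocks."""
    is_block_list = (
        isinstance(result, list)
        and bool(result)
        and all(
            isinstance(block, dict)
            and block.get("type") == "text"
            and "text" in block
            for block in result
        )
    )
    if is_block_list:
        parsed = [_parse_or_keep(block["text"]) for block in result]
        return parsed[0] if len(parsed) == 1 else parsed
    if isinstance(result, str):
        return _parse_or_keep(result)
    return result  # already a bare dict, list, or other value
```

Routing every tool result through one such function before iterating keeps scripts from failing on the wrapped form and the bare form alike.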
Tool Reference
This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.
Spans and traces
| MCP Tool | pup Command |
|---|---|
search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)
| pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]
— always use to filter by ml_app; the flag is unreliable and silently returns spans from other apps. |
| `get_llmobs_span_details(trace_id, span_ids, from, to)` | `pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...` |
| `get_llmobs_span_content(trace_id, span_id, field, path)` | `pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]` |
| `get_llmobs_trace(trace_id, include_tree)` | `pup llm-obs spans get-trace --trace-id T [--include-tree]` |
| `get_llmobs_agent_loop(trace_id, span_id)` | `pup llm-obs spans get-agent-loop --trace-id T [--span-id S]` |
| `find_llmobs_error_spans(trace_id)` | `pup llm-obs spans find-errors --trace-id T` |
| `expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)` | `pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]` |
Evaluators
| MCP Tool | pup Command |
|---|---|
| (filter by client-side) |
| `list_llmobs_evals_by_ml_app(ml_app)` | `pup llm-obs evals list-by-ml-app --ml-app A` |
| `get_llmobs_evaluator(eval_name)` | `pup llm-obs evals get-evaluator EVAL_NAME` |
| `get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)` | `pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]` |
Notebooks
| MCP Tool | pup Command |
|---|---|
| `create_datadog_notebook(name, cells, ...)` | `pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` (confirm exact flags with `pup notebooks create --help`) |
| `edit_datadog_notebook(id, cells, append_only=true)` | `pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` (fetches current notebook, appends provided cells, writes back) |
The cells file is a JSON array of cell objects:
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]
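A minimal sketch for producing that cells array from a list of markdown sections, mirroring the cell shape shown above. `markdown_cell` and `cells_json` are hypothetical helper names; only the JSON structure itself comes from this document.

```python
import json


def markdown_cell(text: str) -> dict:
    """Wrap one markdown section in the notebook cell shape shown above."""
    return {
        "attributes": {"definition": {"type": "markdown", "text": text}},
        "type": "notebook_cells",
    }


def cells_json(sections: list[str]) -> str:
    """Serialize the sections as a JSON array of markdown cells."""
    return json.dumps([markdown_cell(section) for section in sections], indent=2)


if __name__ == "__main__":
    # One string per notebook cell: overview, signal summary, taxonomy, ...
    print(cells_json(["## Overview\n\nContent.", "## Signal Summary\n\nContent."]))
```

The resulting string can be written to the file passed via `--file`, for example `/tmp/nb_cells.json`.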