production-investigation

Original：🇺🇸 English

Translated

Structured workflows for investigating production issues in Honeycomb — the sequence of tool calls (context priming, broad query, BubbleUp, trace analysis, verification) and how to chain results between steps to reach root causes. Trigger phrases: "investigate production issue", "debug latency spike", "find root cause", "use BubbleUp", "analyze traces", "debug an outage", "why is my API slow", "errors are increasing", "health check", "SLO burning", or any request to investigate or debug production problems.

9installs

Sourcehoneycombio/agent-skill

Added on2026-05-23

NPX Install

npx skill4agent add honeycombio/agent-skill production-investigation

SKILL.md Content

View Translation Comparison →

Honeycomb Production Investigation

Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.

The Core Analysis Loop

This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.

Investigation Workflow

Step 1: Orient

```
get_workspace_context
```
→ environments and datasets
```
get_slos
```
→ any SLOs in violation? (frames severity)
```
get_triggers
```
→ any alerts firing? (narrows scope)
```
find_queries
```
→ has anyone investigated this before?

Step 2: Characterize the Problem

Run a broad query to see the shape of the issue:

Latency spike: P99(duration_ms), HEATMAP(duration_ms) grouped by service or route
Error surge: COUNT filtered on error=true, grouped by exception.message or service
Unknown: COUNT grouped by service.name to find which service has anomalous volume

Also call

get_service_map

— it shows P95 durations between services and can immediately reveal which dependency is slow.

Step 3: BubbleUp to Find Differentiators

This is the highest-value step. Once you have a query showing the anomaly:

Run
```
run_bubbleup
```
on the query result, selecting the outlier region
BubbleUp compares outlier vs baseline distributions across all columns automatically
Look for fields where the distributions differ significantly

How to interpret BubbleUp results:

Categorical fields (dimensions): A value overrepresented in outliers points to a cause (e.g.,
```
deployment.version=v2.3.1
```
is 90% of slow requests but only 20% of baseline)
Numeric fields (measures): A shifted distribution shows correlated metrics (e.g.,
```
db.query_duration
```
is much higher in outliers)
Typical root causes surfaced: deployment version, region, user cohort, specific endpoint, feature flag

Step 4: Drill Into Traces

After BubbleUp identifies suspects:

Add BubbleUp findings as WHERE filters to narrow results
Pick a representative trace ID
Call
```
get_trace
```
to fetch the full trace

What to look for in the trace waterfall:

Spans with disproportionate duration vs parent (the bottleneck)
Sequential spans that could be parallelized (N+1 query patterns)
Error spans — check span events for stack traces
Gaps between child spans (missing instrumentation or idle wait)
Service boundaries (where the trace crosses services)

Step 5: Verify Hypothesis

Form a hypothesis from BubbleUp + trace analysis, then confirm:

Query WITH the suspected cause filtered in
Query WITHOUT it (as a control)
If the metrics diverge, you've found it

Step 6: Record Findings

Call

create_board

with:

A text panel summarizing the root cause (Markdown)
The key query run PKs that identified the problem
Related SLOs if applicable

Investigation Patterns

Latency Spike

HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries

Error Surge

COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify

Deployment Regression

P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify

Dependency Failure

get_service_map

→ P99 on the slow dependency → relational query (

any.service.name

) to measure user impact → trace an affected request

Stay on the Path

If you find yourself reasoning any of these, follow the workflow anyway:

"The cause is obvious, I can skip BubbleUp" — BubbleUp routinely surfaces causes that seem obvious in hindsight but weren't the first guess. It also catches secondary causes you'd miss entirely.
"I already know it's a deployment issue" — verify with Step 5. Confirmation bias is strongest during incidents. Query with and without the suspected cause.
"Traces confirmed it, no need to verify" — a single trace is an anecdote. The verification query proves the pattern holds across all traffic, not just one request.
"This is a simple issue, the full workflow is overkill" — the workflow takes minutes; a wrong diagnosis during an incident costs hours.

When Results Are Empty or Unclear

No results: Check field names with
```
find_columns
```
, expand time range, verify environment/dataset
BubbleUp shows no signal: Try a different time selection, add filters to isolate the anomaly more clearly, or select a different calculation
Trace missing spans: Sampling, instrumentation gaps, or cross-environment trace split

Additional Resources

Reference Files

${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md
— Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md
— Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md
— Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links

Cross-References

For the conceptual foundations of the core analysis loop, see the observability-fundamentals skill
For query construction patterns, see the query-patterns skill
For SLO/trigger context during investigations, see the slos-and-triggers skill