Honeycomb Production Investigation
Structured workflows for debugging production issues. The MCP tools document their
own parameters — this skill focuses on the sequence of tool calls and how to
interpret results to reach a root cause.
The Core Analysis Loop
This workflow implements the core analysis loop (Define → Visualize → Investigate →
Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing
useful, the issue is often an instrumentation gap — add the missing attributes (see the
otel-instrumentation skill) and try again.
Investigation Workflow
Step 1: Orient
- → environments and datasets
- → any SLOs in violation? (frames severity)
- → any alerts firing? (narrows scope)
- → has anyone investigated this before?
Step 2: Characterize the Problem
Run a broad query to see the shape of the issue:
- Latency spike: P99(duration_ms), HEATMAP(duration_ms) grouped by service or route
- Error surge: COUNT filtered on error=true, grouped by exception.message or service
- Unknown: COUNT grouped by service.name to find which service has anomalous volume
Also call
— it shows P95 durations between services and can immediately reveal which dependency is slow.
Step 3: BubbleUp to Find Differentiators
This is the highest-value step. Once you have a query showing the anomaly:
- Run on the query result, selecting the outlier region
- BubbleUp compares outlier vs baseline distributions across all columns automatically
- Look for fields where the distributions differ significantly
How to interpret BubbleUp results:
- Categorical fields (dimensions): A value overrepresented in outliers points to a cause (e.g.,
deployment.version=v2.3.1
is 90% of slow requests but only 20% of baseline)
- Numeric fields (measures): A shifted distribution shows correlated metrics (e.g., is much higher in outliers)
- Typical root causes surfaced: deployment version, region, user cohort, specific endpoint, feature flag
Step 4: Drill Into Traces
After BubbleUp identifies suspects:
- Add BubbleUp findings as WHERE filters to narrow results
- Pick a representative trace ID
- Call to fetch the full trace
What to look for in the trace waterfall:
- Spans with disproportionate duration vs parent (the bottleneck)
- Sequential spans that could be parallelized (N+1 query patterns)
- Error spans — check span events for stack traces
- Gaps between child spans (missing instrumentation or idle wait)
- Service boundaries (where the trace crosses services)
Step 5: Verify Hypothesis
Form a hypothesis from BubbleUp + trace analysis, then confirm:
- Query WITH the suspected cause filtered in
- Query WITHOUT it (as a control)
- If the metrics diverge, you've found it
Step 6: Record Findings
- A text panel summarizing the root cause (Markdown)
- The key query run PKs that identified the problem
- Related SLOs if applicable
Investigation Patterns
Latency Spike
HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries
Error Surge
COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify
Deployment Regression
P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify
Dependency Failure
→ P99 on the slow dependency → relational query (
) to measure user impact → trace an affected request
Stay on the Path
If you find yourself reasoning any of these, follow the workflow anyway:
- "The cause is obvious, I can skip BubbleUp" — BubbleUp routinely surfaces causes that seem obvious in hindsight but weren't the first guess. It also catches secondary causes you'd miss entirely.
- "I already know it's a deployment issue" — verify with Step 5. Confirmation bias is strongest during incidents. Query with and without the suspected cause.
- "Traces confirmed it, no need to verify" — a single trace is an anecdote. The verification query proves the pattern holds across all traffic, not just one request.
- "This is a simple issue, the full workflow is overkill" — the workflow takes minutes; a wrong diagnosis during an incident costs hours.
When Results Are Empty or Unclear
- No results: Check field names with , expand time range, verify environment/dataset
- BubbleUp shows no signal: Try a different time selection, add filters to isolate the anomaly more clearly, or select a different calculation
- Trace missing spans: Sampling, instrumentation gaps, or cross-environment trace split
Additional Resources
Reference Files
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md
— Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md
— Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md
— Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links
Cross-References
- For the conceptual foundations of the core analysis loop, see the observability-fundamentals skill
- For query construction patterns, see the query-patterns skill
- For SLO/trigger context during investigations, see the slos-and-triggers skill