production-investigation

Honeycomb Production Investigation

Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.

The Core Analysis Loop

This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.

Investigation Workflow

Step 1: Orient

`get_workspace_context` → environments and datasets
`get_slos` → any SLOs in violation? (frames severity)
`get_triggers` → any alerts firing? (narrows scope)
`find_queries` → has anyone investigated this before?

Step 2: Characterize the Problem

Run a broad query to see the shape of the issue:

Latency spike: P99(duration_ms), HEATMAP(duration_ms) grouped by service or route
Error surge: COUNT filtered on error=true, grouped by exception.message or service
Unknown: COUNT grouped by service.name to find which service has anomalous volume

Also call `get_service_map`: it shows P95 durations between services and can immediately reveal which dependency is slow.
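
The three broad queries above can be written down as data. The sketch below assumes a query shape modelled on Honeycomb's Query Specification JSON (`calculations`, `breakdowns`, `filters`, `time_range`). `build_query` is a hypothetical helper, and the MCP query tool may expect different parameter names, so check the tool's own documentation before reusing it.

```python
# Hypothetical helper: builds a query dict shaped like Honeycomb's Query
# Specification JSON. The MCP query tool's real parameters may differ.
def build_query(calculations, breakdowns, filters=None, time_range=3600):
    query = {
        "calculations": [
            {"op": op} if column is None else {"op": op, "column": column}
            for op, column in calculations
        ],
        "breakdowns": list(breakdowns),
        "time_range": time_range,  # seconds of history to query
    }
    if filters:
        query["filters"] = [
            {"column": column, "op": op, "value": value}
            for column, op, value in filters
        ]
    return query


# Latency spike: P99 and HEATMAP of duration_ms, grouped by service
latency_spike = build_query(
    calculations=[("P99", "duration_ms"), ("HEATMAP", "duration_ms")],
    breakdowns=["service.name"],
)

# Error surge: COUNT where error=true, grouped by exception.message
error_surge = build_query(
    calculations=[("COUNT", None)],
    breakdowns=["exception.message"],
    filters=[("error", "=", True)],
)

# Unknown: COUNT grouped by service.name to spot anomalous volume
unknown = build_query(
    calculations=[("COUNT", None)],
    breakdowns=["service.name"],
)
```

Keeping the three shapes side by side makes it easier to start from the broadest query and add breakdowns or filters only once the anomaly is visible.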

Step 3: BubbleUp to Find Differentiators

This is the highest-value step. Once you have a query showing the anomaly:

Run `run_bubbleup` on the query result, selecting the outlier region
BubbleUp compares outlier vs baseline distributions across all columns automatically
Look for fields where the distributions differ significantly

How to interpret BubbleUp results:

Categorical fields (dimensions): A value overrepresented in outliers points to a cause (e.g., `deployment.version=v2.3.1` is 90% of slow requests but only 20% of baseline)
Numeric fields (measures): A shifted distribution shows correlated metrics (e.g., `db.query_duration` is much higher in outliers)
Typical root causes surfaced: deployment version, region, user cohort, specific endpoint, feature flag
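
To make the categorical case concrete, here is a minimal sketch of the comparison being read off a BubbleUp result: for one field, how much more common is each value among outliers than in the baseline? This is an illustration, not BubbleUp's actual algorithm. The function name, the `min_lift` threshold, and the sample events are all made up for the example.

```python
from collections import Counter


def overrepresented_values(outliers, baseline, field, min_lift=2.0):
    """Return (value, outlier_share, baseline_share) for values of `field`
    that are at least `min_lift` times more common in outliers."""
    out_counts = Counter(e[field] for e in outliers if field in e)
    base_counts = Counter(e[field] for e in baseline if field in e)
    out_total = sum(out_counts.values())
    base_total = sum(base_counts.values())

    results = []
    for value, count in out_counts.items():
        out_share = count / out_total
        base_share = base_counts.get(value, 0) / base_total if base_total else 0.0
        # A value absent from the baseline has unbounded lift.
        lift = out_share / base_share if base_share else float("inf")
        if lift >= min_lift:
            results.append((value, out_share, base_share))
    # Largest gap between outlier share and baseline share first.
    return sorted(results, key=lambda r: r[1] - r[2], reverse=True)


# Made-up events mirroring the example above: v2.3.1 is 90% of slow
# requests but only 20% of the baseline.
outliers = [{"deployment.version": "v2.3.1"}] * 9 + [{"deployment.version": "v2.3.0"}]
baseline = [{"deployment.version": "v2.3.1"}] * 2 + [{"deployment.version": "v2.3.0"}] * 8

suspects = overrepresented_values(outliers, baseline, "deployment.version")
print(suspects)  # [('v2.3.1', 0.9, 0.2)]
```

A value that dominates the outliers but is equally dominant in the baseline is not a differentiator, which is why the sketch compares shares rather than raw counts.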

Step 4: Drill Into Traces

After BubbleUp identifies suspects:

Add BubbleUp findings as WHERE filters to narrow results
Pick a representative trace ID
Call `get_trace` to fetch the full trace

What to look for in the trace waterfall:

Spans with disproportionate duration vs parent (the bottleneck)
Sequential spans that could be parallelized (N+1 query patterns)
Error spans — check span events for stack traces
Gaps between child spans (missing instrumentation or idle wait)
Service boundaries (where the trace crosses services)
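
Two of the waterfall checks above (a child span that dominates its parent, and idle gaps between sibling spans) are mechanical enough to sketch in code. The span layout used here (`span_id`, `parent_id`, `name`, `start_ms`, `duration_ms`) is a hypothetical simplification, not the actual `get_trace` output format, and the thresholds are arbitrary.

```python
def find_bottlenecks(spans, min_share=0.5):
    """Return (child, parent, share) where a child takes at least
    `min_share` of its parent's duration."""
    by_id = {s["span_id"]: s for s in spans}
    hits = []
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent and parent["duration_ms"] > 0:
            share = span["duration_ms"] / parent["duration_ms"]
            if share >= min_share:
                hits.append((span["name"], parent["name"], round(share, 2)))
    return hits


def find_gaps(spans, parent_id, min_gap_ms=50):
    """Return (previous, next, gap_ms) for gaps between consecutive children
    of one parent, ordered by start time. Overlapping children are not
    handled specially, so treat this as a rough check."""
    children = sorted(
        (s for s in spans if s["parent_id"] == parent_id),
        key=lambda s: s["start_ms"],
    )
    gaps = []
    for prev, nxt in zip(children, children[1:]):
        gap = nxt["start_ms"] - (prev["start_ms"] + prev["duration_ms"])
        if gap >= min_gap_ms:
            gaps.append((prev["name"], nxt["name"], gap))
    return gaps


# Made-up trace: one root span with three children.
trace = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout", "start_ms": 0, "duration_ms": 1000},
    {"span_id": "b", "parent_id": "a", "name": "auth.check", "start_ms": 10, "duration_ms": 40},
    {"span_id": "c", "parent_id": "a", "name": "db.query", "start_ms": 200, "duration_ms": 700},
    {"span_id": "d", "parent_id": "a", "name": "render", "start_ms": 910, "duration_ms": 60},
]

print(find_bottlenecks(trace))  # [('db.query', 'GET /checkout', 0.7)]
print(find_gaps(trace, "a"))    # [('auth.check', 'db.query', 150)]
```

In this made-up trace, `db.query` accounts for 70% of the request, and the 150 ms gap before it starts is the kind of unexplained wait that suggests missing instrumentation or idle time.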

Step 5: Verify Hypothesis

Form a hypothesis from BubbleUp + trace analysis, then confirm:

Query WITH the suspected cause filtered in
Query WITHOUT it (as a control)
If the metrics diverge, you've found it
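
The with/without comparison can be sketched offline as follows. This assumes the events have already been fetched as dictionaries with a `duration_ms` field. The helper names and sample data are hypothetical, and in practice both sides would be Honeycomb queries rather than in-memory lists.

```python
import math


def p99(durations):
    """Nearest-rank 99th percentile of a non-empty list."""
    if not durations:
        raise ValueError("p99 needs at least one value")
    ordered = sorted(durations)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def compare_with_control(events, field, suspect_value):
    """Return (P99 with the suspected cause, P99 without it)."""
    with_cause = [e["duration_ms"] for e in events if e.get(field) == suspect_value]
    without_cause = [e["duration_ms"] for e in events if e.get(field) != suspect_value]
    return p99(with_cause), p99(without_cause)


# Made-up events: requests served by the suspect version vs. the previous one.
events = (
    [{"deployment.version": "v2.3.1", "duration_ms": d} for d in (900, 1200, 1500)]
    + [{"deployment.version": "v2.3.0", "duration_ms": d} for d in (80, 95, 110)]
)

suspect_p99, control_p99 = compare_with_control(events, "deployment.version", "v2.3.1")
print(suspect_p99, control_p99)  # 1500 110
```

If the two numbers were close, the suspected cause would not explain the anomaly and the hypothesis should go back to Step 3.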

Step 6: Record Findings

Call `create_board` with:

A text panel summarizing the root cause (Markdown)
The key query run PKs that identified the problem
Related SLOs if applicable

Investigation Patterns

Latency Spike

HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries

Error Surge

COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify

Deployment Regression

P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify

Dependency Failure

`get_service_map` → P99 on the slow dependency → relational query (`any.service.name`) to measure user impact → trace an affected request

Stay on the Path

If you catch yourself reasoning along any of these lines, follow the workflow anyway:

"The cause is obvious, I can skip BubbleUp" — BubbleUp routinely surfaces causes that seem obvious in hindsight but weren't the first guess. It also catches secondary causes you'd miss entirely.
"I already know it's a deployment issue" — verify with Step 5. Confirmation bias is strongest during incidents. Query with and without the suspected cause.
"Traces confirmed it, no need to verify" — a single trace is an anecdote. The verification query proves the pattern holds across all traffic, not just one request.
"This is a simple issue, the full workflow is overkill" — the workflow takes minutes; a wrong diagnosis during an incident costs hours.

When Results Are Empty or Unclear

No results: Check field names with `find_columns`, expand time range, verify environment/dataset
BubbleUp shows no signal: Try a different time selection, add filters to isolate the anomaly more clearly, or select a different calculation
Trace missing spans: Possible causes are sampling, instrumentation gaps, or a trace split across environments

Additional Resources

Reference Files

${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md
— Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md
— Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md
— Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links

Cross-References

For the conceptual foundations of the core analysis loop, see the observability-fundamentals skill
For query construction patterns, see the query-patterns skill
For SLO/trigger context during investigations, see the slos-and-triggers skill
