production-investigation
Honeycomb Production Investigation
Structured workflows for debugging production issues. The MCP tools document their
own parameters — this skill focuses on the sequence of tool calls and how to
interpret results to reach a root cause.
The Core Analysis Loop
This workflow implements the core analysis loop (Define → Visualize → Investigate →
Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing
useful, the issue is often an instrumentation gap — add the missing attributes (see the
otel-instrumentation skill) and try again.
Investigation Workflow
Step 1: Orient
- get_workspace_context → environments and datasets
- get_slos → any SLOs in violation? (frames severity)
- get_triggers → any alerts firing? (narrows scope)
- find_queries → has anyone investigated this before?
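The orientation order can be sketched as a small driver. This is only an illustration: `call_tool` is a hypothetical callable standing in for whatever MCP client is in use, and passing empty arguments is an assumption, since each tool documents its own parameters and some may require arguments.

```python
from typing import Any, Callable, Dict, List

# Tool names are taken from the step above, in the order the step lists them.
ORIENT_TOOLS: List[str] = [
    "get_workspace_context",  # environments and datasets
    "get_slos",               # any SLOs in violation?
    "get_triggers",           # any alerts firing?
    "find_queries",           # has anyone investigated this before?
]


def orient(call_tool: Callable[[str, Dict[str, Any]], Any]) -> Dict[str, Any]:
    """Call each orientation tool in order and collect the raw results.

    `call_tool` is a hypothetical (name, arguments) -> result function.
    The empty argument dict is a placeholder, not the tools' real schema.
    """
    results: Dict[str, Any] = {}
    for name in ORIENT_TOOLS:
        results[name] = call_tool(name, {})
    return results
```

Keeping the order in one list makes it harder to jump straight to querying before severity and scope are known.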
Step 2: Characterize the Problem
Run a broad query to see the shape of the issue:
- Latency spike: P99(duration_ms), HEATMAP(duration_ms) grouped by service or route
- Error surge: COUNT filtered on error=true, grouped by exception.message or service
- Unknown: COUNT grouped by service.name to find which service has anomalous volume
Also call get_service_map: it shows P95 durations between services and can immediately reveal which dependency is slow.
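The three broad query shapes above could be written down roughly as follows. The dictionary layout is modeled on the Honeycomb Query Specification (`calculations`, `filters`, `breakdowns`, `time_range`); whether the MCP query tool accepts exactly this shape is an assumption, and `build_query` and the two-hour default window are illustrative choices.

```python
from typing import Any, Dict


def build_query(symptom: str, time_range_s: int = 7200) -> Dict[str, Any]:
    """Return a broad, Honeycomb-style query spec for a symptom.

    symptom: "latency", "errors", or "unknown".
    time_range_s: lookback window in seconds (assumed default: 2 hours).
    """
    if symptom == "latency":
        return {
            "calculations": [
                {"op": "P99", "column": "duration_ms"},
                {"op": "HEATMAP", "column": "duration_ms"},
            ],
            "breakdowns": ["service.name"],
            "time_range": time_range_s,
        }
    if symptom == "errors":
        return {
            "calculations": [{"op": "COUNT"}],
            "filters": [{"column": "error", "op": "=", "value": True}],
            "breakdowns": ["exception.message"],
            "time_range": time_range_s,
        }
    if symptom == "unknown":
        # No hypothesis yet: look for the service with anomalous volume.
        return {
            "calculations": [{"op": "COUNT"}],
            "breakdowns": ["service.name"],
            "time_range": time_range_s,
        }
    raise ValueError(f"unknown symptom: {symptom!r}")
```

For a latency spike, swapping the breakdown to a route column gives the "grouped by route" variant; that column's name depends on the dataset, so it is left out here.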
Step 3: BubbleUp to Find Differentiators
This is the highest-value step. Once you have a query showing the anomaly:
- Run run_bubbleup on the query result, selecting the outlier region
- BubbleUp compares outlier vs baseline distributions across all columns automatically
- Look for fields where the distributions differ significantly
How to interpret BubbleUp results:
- Categorical fields (dimensions): A value overrepresented in outliers points to a cause (e.g., deployment.version=v2.3.1 is 90% of slow requests but only 20% of baseline)
- Numeric fields (measures): A shifted distribution shows correlated metrics (e.g., db.query_duration is much higher in outliers)
- Typical root causes surfaced: deployment version, region, user cohort, specific endpoint, feature flag
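To make the categorical case concrete, here is a minimal sketch of the comparison for a single field. It is not BubbleUp's actual algorithm, only the "share in outliers vs share in baseline" idea; the function name and sample data are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def overrepresentation(
    outliers: List[str], baseline: List[str]
) -> List[Tuple[str, float, float, float]]:
    """Rank values of one categorical field by outlier share minus baseline share.

    Returns (value, outlier_share, baseline_share, difference) tuples,
    largest difference first.
    """
    out_counts, base_counts = Counter(outliers), Counter(baseline)
    out_total, base_total = len(outliers), len(baseline)
    rows = []
    for value in set(out_counts) | set(base_counts):
        out_share = out_counts[value] / out_total if out_total else 0.0
        base_share = base_counts[value] / base_total if base_total else 0.0
        rows.append((value, out_share, base_share, out_share - base_share))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Illustrative data mirroring the example above: v2.3.1 is 90% of the
# slow requests but only 20% of the baseline.
slow_versions = ["v2.3.1"] * 9 + ["v2.3.0"] * 1
baseline_versions = ["v2.3.1"] * 2 + ["v2.3.0"] * 8
top_value, top_out, top_base, top_diff = overrepresentation(
    slow_versions, baseline_versions
)[0]
print(f"{top_value}: {top_out:.0%} of outliers vs {top_base:.0%} of baseline")
```

A value that tops this ranking is a suspect, not a verdict; it still needs the trace and verification steps.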
Step 4: Drill Into Traces
After BubbleUp identifies suspects:
- Add BubbleUp findings as WHERE filters to narrow results
- Pick a representative trace ID
- Call get_trace to fetch the full trace
What to look for in the trace waterfall:
- Spans with disproportionate duration vs parent (the bottleneck)
- Sequential spans that could be parallelized (N+1 query patterns)
- Error spans — check span events for stack traces
- Gaps between child spans (missing instrumentation or idle wait)
- Service boundaries (where the trace crosses services)
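Two of the waterfall checks above (a child span that dominates its parent, and gaps between sibling spans) can be sketched over a flat span list. The `Span` shape, the 50% ratio, and the 50 ms gap threshold are assumptions for illustration and do not reflect the actual get_trace output format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float      # offset from trace start
    duration_ms: float


def bottlenecks(spans: List[Span], ratio: float = 0.5) -> List[Span]:
    """Spans that take at least `ratio` of their parent's duration."""
    by_id = {s.span_id: s for s in spans}
    hits = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent and parent.duration_ms > 0:
            if s.duration_ms / parent.duration_ms >= ratio:
                hits.append(s)
    return hits


def child_gaps(
    spans: List[Span], parent_id: str, min_gap_ms: float = 50.0
) -> List[Tuple[float, float]]:
    """(gap_start, gap_end) intervals where no child of the parent is running."""
    children = sorted(
        (s for s in spans if s.parent_id == parent_id), key=lambda s: s.start_ms
    )
    gaps = []
    covered_until: Optional[float] = None
    for s in children:
        if covered_until is not None and s.start_ms - covered_until >= min_gap_ms:
            gaps.append((covered_until, s.start_ms))
        end = s.start_ms + s.duration_ms
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps


# Illustrative trace: the db span dominates the root, and nothing is
# recorded between 100 ms and 400 ms.
example_trace = [
    Span("root", None, "GET /checkout", 0, 1000),
    Span("a", "root", "auth", 0, 100),
    Span("b", "root", "db.query", 400, 550),
]
```

Here `bottlenecks(example_trace)` flags the `db.query` span, and `child_gaps(example_trace, "root")` reports the 100 to 400 ms window, which would point to either missing instrumentation or idle wait.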
Step 5: Verify Hypothesis
Form a hypothesis from BubbleUp + trace analysis, then confirm:
- Query WITH the suspected cause filtered in
- Query WITHOUT it (as a control)
- If the metrics diverge, you've found it
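In practice the comparison is two Honeycomb queries, one with the suspected cause filtered in and one with it filtered out. The local sketch below only shows the logic of that comparison on in-memory events; the event shape, the nearest-rank percentile, and the sample numbers are assumptions for illustration.

```python
import math
from typing import Any, Dict, List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compare(
    events: List[Dict[str, Any]], field: str, suspect: Any, pct: int = 99
) -> Tuple[float, float]:
    """Return (percentile WITH the suspected cause, percentile WITHOUT it)."""
    with_cause = [e["duration_ms"] for e in events if e.get(field) == suspect]
    control = [e["duration_ms"] for e in events if e.get(field) != suspect]
    return percentile(with_cause, pct), percentile(control, pct)


# Illustrative events: the suspected version is consistently slower.
events = [
    {"deployment.version": "v2.3.0", "duration_ms": d} for d in range(1, 101)
] + [
    {"deployment.version": "v2.3.1", "duration_ms": 500 + d} for d in range(50)
]
p99_with, p99_without = compare(events, "deployment.version", "v2.3.1")
```

If the two numbers are close, the hypothesis does not explain the anomaly, and it is worth returning to BubbleUp with a different selection.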
Step 6: Record Findings
Call create_board with:
- A text panel summarizing the root cause (Markdown)
- The key query run PKs that identified the problem
- Related SLOs if applicable
Investigation Patterns
Latency Spike
HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries
Error Surge
COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify
Deployment Regression
P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify
Dependency Failure
get_service_map, any.service.name
Stay on the Path
If you catch yourself reasoning along any of these lines, follow the workflow anyway:
- "The cause is obvious, I can skip BubbleUp" — BubbleUp routinely surfaces causes that seem obvious in hindsight but weren't the first guess. It also catches secondary causes you'd miss entirely.
- "I already know it's a deployment issue" — verify with Step 5. Confirmation bias is strongest during incidents. Query with and without the suspected cause.
- "Traces confirmed it, no need to verify" — a single trace is an anecdote. The verification query proves the pattern holds across all traffic, not just one request.
- "This is a simple issue, the full workflow is overkill" — the workflow takes minutes; a wrong diagnosis during an incident costs hours.
When Results Are Empty or Unclear
- No results: Check field names with find_columns, expand time range, verify environment/dataset
- BubbleUp shows no signal: Try a different time selection, add filters to isolate the anomaly more clearly, or select a different calculation
- Trace missing spans: Sampling, instrumentation gaps, or cross-environment trace split
Additional Resources
Reference Files
- ${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md: Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
- ${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md: Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
- ${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md: Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links
Cross-References
- For the conceptual foundations of the core analysis loop, see the observability-fundamentals skill
- For query construction patterns, see the query-patterns skill
- For SLO/trigger context during investigations, see the slos-and-triggers skill