production-investigation

Honeycomb Production Investigation

Structured workflows for debugging production issues. The MCP tools document their own parameters — this skill focuses on the sequence of tool calls and how to interpret results to reach a root cause.

The Core Analysis Loop

This workflow implements the core analysis loop (Define → Visualize → Investigate → Evaluate) from the observability-fundamentals skill. If BubbleUp returns nothing useful, the issue is often an instrumentation gap — add the missing attributes (see the otel-instrumentation skill) and try again.

Investigation Workflow

Step 1: Orient

  1. get_workspace_context
    → environments and datasets
  2. get_slos
    → any SLOs in violation? (frames severity)
  3. get_triggers
    → any alerts firing? (narrows scope)
  4. find_queries
    → has anyone investigated this before?

Step 2: Characterize the Problem

Run a broad query to see the shape of the issue:
  • Latency spike: P99(duration_ms), HEATMAP(duration_ms) grouped by service or route
  • Error surge: COUNT filtered on error=true, grouped by exception.message or service
  • Unknown: COUNT grouped by service.name to find which service has anomalous volume
Also call get_service_map: it shows P95 durations between services and can immediately reveal which dependency is slow.
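The three starting queries above could be written roughly as follows. This is a sketch that assumes the field layout of Honeycomb's Query Specification (`calculations`, `breakdowns`, `filters`, `time_range`); the MCP query tool may expect a different shape, and the column names and the two-hour window are assumptions that depend on your instrumentation.

```python
# Sketch only: dictionaries shaped like Honeycomb's Query Specification.
# Column names and the time window are assumptions; adjust to your dataset.

TWO_HOURS = 2 * 60 * 60  # time_range is expressed in seconds

# Latency spike: P99 and a heatmap of duration, grouped by service.
latency_spike = {
    "time_range": TWO_HOURS,
    "calculations": [
        {"op": "P99", "column": "duration_ms"},
        {"op": "HEATMAP", "column": "duration_ms"},
    ],
    "breakdowns": ["service.name"],
}

# Error surge: count only errored events, grouped by exception message.
error_surge = {
    "time_range": TWO_HOURS,
    "calculations": [{"op": "COUNT"}],
    "filters": [{"column": "error", "op": "=", "value": True}],
    "breakdowns": ["exception.message"],
}

# Unknown problem: event volume per service to spot the anomalous one.
unknown_problem = {
    "time_range": TWO_HOURS,
    "calculations": [{"op": "COUNT"}],
    "breakdowns": ["service.name"],
}
```

Starting with one breakdown keeps the result readable; swap `service.name` for a route column when the affected service is already known.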

Step 3: BubbleUp to Find Differentiators

This is the highest-value step. Once you have a query showing the anomaly:
  1. Run run_bubbleup on the query result, selecting the outlier region
  2. BubbleUp compares outlier vs baseline distributions across all columns automatically
  3. Look for fields where the distributions differ significantly
How to interpret BubbleUp results:
  • Categorical fields (dimensions): A value overrepresented in outliers points to a cause (e.g., deployment.version=v2.3.1 is 90% of slow requests but only 20% of baseline)
  • Numeric fields (measures): A shifted distribution shows correlated metrics (e.g., db.query_duration is much higher in outliers)
  • Typical root causes surfaced: deployment version, region, user cohort, specific endpoint, feature flag
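To make the categorical comparison concrete, here is a minimal sketch of the arithmetic behind "overrepresented in outliers". It is not how BubbleUp is implemented; the function name `overrepresented_values`, the `min_lift` threshold, and the sample data are all hypothetical and exist only to illustrate the outlier-versus-baseline comparison.

```python
from collections import Counter


def overrepresented_values(outliers, baseline, min_lift=2.0):
    """Rank values of one categorical field by how much more common
    they are among outliers than in the baseline."""
    out_counts, base_counts = Counter(outliers), Counter(baseline)
    findings = []
    for value, count in out_counts.items():
        out_share = count / len(outliers)
        base_share = base_counts[value] / len(baseline) if baseline else 0.0
        # A value absent from the baseline has unbounded lift.
        lift = out_share / base_share if base_share else float("inf")
        if lift >= min_lift:
            findings.append((value, out_share, base_share, lift))
    return sorted(findings, key=lambda f: f[3], reverse=True)


# Hypothetical deployment.version values: 10 slow requests, 10 baseline requests.
slow = ["v2.3.1"] * 9 + ["v2.3.0"]
normal = ["v2.3.1"] * 2 + ["v2.3.0"] * 8

for value, out_share, base_share, lift in overrepresented_values(slow, normal):
    print(f"{value}: {out_share:.0%} of outliers vs {base_share:.0%} of baseline ({lift:.1f}x)")
# prints: v2.3.1: 90% of outliers vs 20% of baseline (4.5x)
```

A high lift is a lead, not a verdict: a value can be overrepresented simply because it correlates with the real cause, which is why the later verification step exists.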

Step 4: Drill Into Traces

After BubbleUp identifies suspects:
  1. Add BubbleUp findings as WHERE filters to narrow results
  2. Pick a representative trace ID
  3. Call get_trace to fetch the full trace
What to look for in the trace waterfall:
  • Spans with disproportionate duration vs parent (the bottleneck)
  • Sequential spans that could be parallelized (N+1 query patterns)
  • Error spans — check span events for stack traces
  • Gaps between child spans (missing instrumentation or idle wait)
  • Service boundaries (where the trace crosses services)
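Two of the checks above (the disproportionately slow child and gaps between child spans) can be sketched in a few lines. The span shape used here (`id`, `parent_id`, `name`, `start_ms`, `duration_ms`) is a hypothetical simplification, not the actual get_trace output, and the 50 ms gap threshold is an arbitrary assumption.

```python
def slowest_child(spans, parent_id):
    """Return (span, share_of_parent) for the child that takes the
    largest share of its parent's duration, or None if there is none."""
    parent = next(s for s in spans if s["id"] == parent_id)
    children = [s for s in spans if s["parent_id"] == parent_id]
    if not children or parent["duration_ms"] <= 0:
        return None
    worst = max(children, key=lambda s: s["duration_ms"])
    return worst, worst["duration_ms"] / parent["duration_ms"]


def gaps_between_children(spans, parent_id, min_gap_ms=50):
    """Return (previous_name, next_name, gap_ms) for idle stretches
    between consecutive child spans of one parent."""
    children = sorted(
        (s for s in spans if s["parent_id"] == parent_id),
        key=lambda s: s["start_ms"],
    )
    gaps, covered_until, prev_name = [], None, None
    for span in children:
        if covered_until is not None and span["start_ms"] - covered_until >= min_gap_ms:
            gaps.append((prev_name, span["name"], span["start_ms"] - covered_until))
        end = span["start_ms"] + span["duration_ms"]
        # Track the furthest point covered so overlapping spans are not counted as gaps.
        if covered_until is None or end > covered_until:
            covered_until, prev_name = end, span["name"]
    return gaps


# Hypothetical trace: one root span with three children.
trace = [
    {"id": "a", "parent_id": None, "name": "GET /checkout", "start_ms": 0, "duration_ms": 1000},
    {"id": "b", "parent_id": "a", "name": "auth", "start_ms": 10, "duration_ms": 40},
    {"id": "c", "parent_id": "a", "name": "db.query", "start_ms": 60, "duration_ms": 700},
    {"id": "d", "parent_id": "a", "name": "render", "start_ms": 900, "duration_ms": 90},
]

bottleneck, share = slowest_child(trace, "a")
print(bottleneck["name"], f"{share:.0%}")
print(gaps_between_children(trace, "a"))
```

In this sample, `db.query` accounts for most of the root span, and the stretch between `db.query` and `render` is uninstrumented time worth a closer look.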

Step 5: Verify Hypothesis

Form a hypothesis from BubbleUp + trace analysis, then confirm:
  • Query WITH the suspected cause filtered in
  • Query WITHOUT it (as a control)
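  • If the metrics diverge (the anomaly shows up with the suspected cause and the control looks like baseline), you've found it; if both queries look the same, keep investigating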
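The with/without comparison can be illustrated on raw events. This is a sketch under stated assumptions: the event dictionaries, the `compare_with_control` helper, and the sample durations are hypothetical, and `p99` uses a simple nearest-rank percentile, whereas Honeycomb computes its percentiles server-side and may use an approximation.

```python
import math


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("p99 needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def compare_with_control(events, field, suspect, metric="duration_ms"):
    """Return (P99 with the suspected cause, P99 without it)."""
    with_cause = [e[metric] for e in events if e.get(field) == suspect]
    without = [e[metric] for e in events if e.get(field) != suspect]
    return p99(with_cause), p99(without)


# Hypothetical events: the suspected cause is deployment.version = v2.3.1.
events = (
    [{"deployment.version": "v2.3.1", "duration_ms": d} for d in (120, 900, 1500, 2100)]
    + [{"deployment.version": "v2.3.0", "duration_ms": d} for d in (80, 95, 110, 130)]
)

suspect_p99, control_p99 = compare_with_control(events, "deployment.version", "v2.3.1")
print(suspect_p99, control_p99)
```

A large difference between the two numbers supports the hypothesis; similar numbers mean the suspected attribute does not separate slow traffic from normal traffic.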

Step 6: Record Findings

Call create_board with:
  • A text panel summarizing the root cause (Markdown)
  • The key query run PKs that identified the problem
  • Related SLOs if applicable

Investigation Patterns

Latency Spike

HEATMAP first → BubbleUp the slow region → trace a slow request → verify with filtered queries

Error Surge

COUNT errors grouped by exception.message → BubbleUp the error spike → trace an errored request → verify

Deployment Regression

P99 grouped by deployment.version → BubbleUp comparing new vs old → trace from new version → verify

Dependency Failure

get_service_map → P99 on the slow dependency → relational query (any.service.name) to measure user impact → trace an affected request

Stay on the Path

If you catch yourself reasoning along any of these lines, follow the workflow anyway:
  • "The cause is obvious, I can skip BubbleUp" — BubbleUp routinely surfaces causes that seem obvious in hindsight but weren't the first guess. It also catches secondary causes you'd miss entirely.
  • "I already know it's a deployment issue" — verify with Step 5. Confirmation bias is strongest during incidents. Query with and without the suspected cause.
  • "Traces confirmed it, no need to verify" — a single trace is an anecdote. The verification query proves the pattern holds across all traffic, not just one request.
  • "This is a simple issue, the full workflow is overkill" — the workflow takes minutes; a wrong diagnosis during an incident costs hours.

When Results Are Empty or Unclear

  • No results: Check field names with find_columns, expand time range, verify environment/dataset
  • BubbleUp shows no signal: Try a different time selection, add filters to isolate the anomaly more clearly, or select a different calculation
  • Trace missing spans: Likely causes are sampling, instrumentation gaps, or a trace split across environments

Additional Resources

Reference Files

  • ${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/investigation-playbooks.md: Step-by-step playbooks for latency spikes, error surges, deployment regressions, dependency failures, SLO budget burn, and health checks
  • ${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/bubbleup-guide.md: Detailed BubbleUp usage: selection types, time specifications, pagination, result interpretation
  • ${CLAUDE_PLUGIN_ROOT}/skills/production-investigation/references/trace-exploration.md: Trace structure, get_trace parameters and view modes, waterfall analysis, span events and links

Cross-References

  • For the conceptual foundations of the core analysis loop, see the observability-fundamentals skill
  • For query construction patterns, see the query-patterns skill
  • For SLO/trigger context during investigations, see the slos-and-triggers skill