Post-mortem

The canonical engineering record of a bug fix. Written after debugging lands a real fix, for other engineers (and future-you, who will have forgotten everything in 6 months). Code identifiers are welcome here — this is the artifact that lets the next person recover the mental model fast.
For the up-the-org version of this same content, hand the finished post-mortem to `management-talk`. They compose: post-mortem owns the engineering truth, management-talk reframes it for leadership.

When to invoke

  • "/post-mortem"
  • "write the post-mortem / postmortem / RCA / root-cause analysis"
  • "document this fix" / "write up the root cause" / "close out this bug with a writeup"
  • After a debug session has clearly landed a fix, proactively offer to draft one.

When NOT to use

  • Bug not fixed yet, or fix not validated. A post-mortem of a hypothesis is misleading. Refuse and tell the user what's missing.
  • Customer-visible outage / incident. Those need a separate incident report (timeline, blast radius, paging history, comms). This skill is bug-fix scope. Flag and confirm before producing one.
  • Trivial fix (typo, obvious one-liner). The PR description is the record. Don't manufacture ceremony.

Required inputs — refuse to draft without these

Before writing a single line, confirm all four. If any are missing, list what's missing and stop:
  • Reliable repro exists (not "happens sometimes" — a deterministic or high-rate-flake repro the next person can run).
  • Root cause is known (the mechanism is identified, not a hypothesis).
  • Fix is identified (PR / commit / branch pointer).
  • Fix is validated (the original repro now passes; the customer workload / failing test now succeeds).
These map directly to `debug-mantra` steps 1–4. If you came in via `debug-mantra`, the breadcrumb ledger from step 4 is your raw material; pull from it.
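The four-input gate above can be sketched as a small check. This is only an illustration: the class, field, and function names are hypothetical and not part of the skill itself, and in practice each flag would be set by a human judgment call, not computed.

```python
from dataclasses import dataclass, fields


@dataclass
class RequiredInputs:
    """The four gates from the checklist; field names are illustrative."""
    reliable_repro: bool
    root_cause_known: bool
    fix_identified: bool
    fix_validated: bool


def missing_inputs(inputs: RequiredInputs) -> list[str]:
    """Return the names of unmet gates, in checklist order."""
    return [f.name for f in fields(inputs) if not getattr(inputs, f.name)]


def may_draft(inputs: RequiredInputs) -> bool:
    """Drafting is allowed only when no gate is missing."""
    return not missing_inputs(inputs)
```

With a fix that has not been validated yet, `missing_inputs` would report `fix_validated` and `may_draft` would return `False`, which corresponds to "list what's missing and stop."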

Structure

Use these blocks in this order. Summary, Root cause, Fix, and Validation are mandatory. The rest are conditional but usually present.
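As a rough illustration of the ordering and mandatory-section rule, the following sketch checks a draft's list of section headings. The section names are taken from this document's own numbered headings; the function name and the idea of passing headings as a plain list are assumptions made for the example.

```python
# (section name, mandatory?) in the required order.
SECTIONS = [
    ("Summary", True),
    ("Symptom", False),
    ("Root cause", True),
    ("Why it produced the symptom", False),
    ("Fix", True),
    ("How it was found", False),
    ("Why it slipped through", False),
    ("Validation", True),
    ("Action items / follow-ups", False),
]


def check_structure(headings: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the draft conforms."""
    order = [name for name, _ in SECTIONS]
    problems = [
        f"missing mandatory section: {name}"
        for name, mandatory in SECTIONS
        if mandatory and name not in headings
    ]
    # Compare the known headings against the same headings in canonical order.
    known = [h for h in headings if h in order]
    if known != sorted(known, key=order.index):
        problems.append("sections are out of order")
    problems += [f"unknown section: {h}" for h in headings if h not in order]
    return problems
```

A draft containing only Summary, Root cause, Fix, and Validation, in that order, would pass; swapping Fix and Root cause would be reported as out of order.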

1. Summary (mandatory)

One paragraph. What broke, in user/workload terms. What fixed it, in one sentence. JIRA key, PR number, owner. A reader who stops here should have the right answer.

2. Symptom

What was actually observed. Test output, error message, log line, perf number, customer report. Concrete identifiers — don't paraphrase the failure mode.

3. Root cause (mandatory)

The actual bug mechanism. Code identifiers welcome and expected — function names, file paths, struct fields, branch conditions, commit SHAs of the offending change. Walk the cause chain end-to-end. This is the most expensive section and the reason the post-mortem exists at all. Future-you will live or die by how clearly you write this.

4. Why it produced the symptom

Link the root cause to the symptom. Often non-obvious: the bug is in `tadaLaunchPrepare` but the visible failure is a customer training run hanging hours later. Walk the chain so a reader who only knows the symptom can connect it back to the cause without re-deriving it.

5. Fix (mandatory)

What changed and why this change addresses the root cause rather than hiding the symptom. Link to PR / commit. If a previous fix attempt papered over the symptom, name it and explain what was wrong with it — that history is part of the cause.

6. How it was found

Short. The debugging path:
  • What repro made it deterministic.
  • What tools cracked it (debugger, source tracing, knob enumeration, in-code instrumentation: the `debug-mantra` step 2 cascade).
  • Hypotheses tried and rejected, with the one-line reason each was rejected. (Pull from the breadcrumb ledger.)
  • The single experiment that confirmed the cause.
This section is for the next debugger — make it learnable.

7. Why it slipped through

What allowed this bug to reach the branch / release / customer. Pick the real reason:
  • CI gap (no test exercises this path / configuration).
  • Latent code (correct when written, broken by a later change in a different file).
  • Workload gap (no real workload reached this code path until now).
  • Incomplete prior fix (defensive check hid the symptom; root cause untouched).
  • Review miss (the change was reviewable; the implication wasn't).
If the honest answer is "no good reason — we should have caught this," say so. Blameless — describe the gap, not the person.

8. Validation (mandatory)

How we know the fix works. Concrete:
  • Original failing test now passes (test name, link).
  • Customer workload now completes (workload identifier, run link).
  • Perf regression resolved (number before, number after).
  • Stress / soak / fuzz run completed clean (duration, scale).
  • Other affected configurations / workloads also tested.
If you only validated one configuration, say so explicitly — "validated on Llama-2-70B / 8 GPUs / DeepSpeed; not retested on other workloads." Don't imply broader coverage than you actually have.

9. Action items / follow-ups

Concrete next-steps that aren't in the fix PR itself. Each item: what + owner + tracking artifact.
  • Regression test added at <seam>. (Owner, test name.)
  • Refactor to prevent class of bug. (Owner, ticket.)
  • CI gap closed: <new check>. (Owner, PR.)
  • Doc / runbook updated. (Owner, link.)
  • Related ticket filed for <adjacent issue you noticed>. (Owner, key.)
If there are no action items, write "None — the fix is sufficient and no class-of-bug follow-up is warranted." Don't manufacture action items to look thorough.
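The "what + owner + tracking artifact" rule can be made concrete with a small record type. This is a sketch under assumptions: `ActionItem` and `render_action_items` are hypothetical names, and the output format simply mirrors the bullet style used in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionItem:
    what: str      # the concrete next step
    owner: str     # who carries it
    tracking: str  # ticket key, PR number, test name, or link


def render_action_items(items: list[ActionItem]) -> list[str]:
    """Render items as bullets; reject any item missing one of its three parts."""
    lines = []
    for item in items:
        parts = (item.what, item.owner, item.tracking)
        if not all(p.strip() for p in parts):
            raise ValueError(f"incomplete action item: {item!r}")
        what = item.what.strip().rstrip(".")
        lines.append(f"  • {what}. ({item.owner.strip()}, {item.tracking.strip()}.)")
    return lines
```

Rejecting incomplete items, instead of filling in a placeholder owner, keeps the sketch in line with the rule that missing facts are asked for, not invented.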

Tone

This is engineer-to-engineer. Different from `management-talk`:
  • Code identifiers are first-class. `tadaLaunchPrepare`, `tada/prim.h::syncWaitPeer`, `scratchBuf`, commit SHAs, line numbers: keep them. The whole point is that future engineers can grep their way back to the change.
  • Mechanism over narrative. Walk the actual cause chain. Don't soften it into "a synchronization issue" — say which function skipped which event under which gate.
  • Active voice, concrete subjects, short paragraphs. Same rule as everywhere else.
  • No hedging. "We believe" / "appears to" / "may have" — drop. State it or don't write it.
  • Blameless. Describe the bug, the gap, and the fix. Never "X should have caught this." The CI gap is the failure mode, not the person.
  • No advocacy. A post-mortem records what happened and what's next. If you want to argue for a refactor, that's a separate proposal — link to it from the action items.
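The "No hedging" rule in the list above lends itself to a mechanical first pass. The sketch below flags the three phrases that bullet names; the function name is hypothetical, the phrase list is deliberately limited to what the bullet quotes, and a real draft would still need a human read.

```python
import re

# Phrases quoted in the "No hedging" bullet; extend with care.
HEDGES = ["we believe", "appears to", "may have"]
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(h) for h in HEDGES) + r")\b",
    re.IGNORECASE,
)


def find_hedges(draft: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_phrase) pairs, with 1-indexed line numbers."""
    hits = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits
```

Each hit is a prompt to either state the claim outright or delete the sentence, which is the choice the rule asks for.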

Output flow

  1. Confirm all four required inputs are satisfied. If any are missing, list them and stop. Do not draft.
  2. Confirm where it goes (default: JIRA comment on the source ticket). Other valid destinations: PR description, `docs/postmortems/<ticket>.md`, internal wiki page. The shape is the same; only the wrapping changes.
  3. Produce the draft as a single chat block.
  4. Sign-off before posting. If posting back to JIRA, show the exact ADF payload, wait for explicit "post it" / "go ahead" / "yes," then `POST /rest/api/3/issue/<KEY>/comment`. Print-only output needs no approval.
  5. Offer the management-talk handoff: "Want a leadership-flavored version? I can hand this to `management-talk`." Don't do it automatically.
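Step 4 of the flow above can be sketched as follows. The payload shape assumes the Jira Cloud REST API v3 comment body, which takes an Atlassian Document Format (ADF) document under a `body` key. The helper names are hypothetical, only plain paragraphs are handled (no headings, lists, or code marks), and no HTTP request is made here.

```python
import json

# Replies step 4 treats as explicit approval.
APPROVALS = {"post it", "go ahead", "yes"}


def adf_comment(paragraphs: list[str]) -> dict:
    """Wrap plain-text paragraphs in a minimal ADF document."""
    return {
        "body": {
            "type": "doc",
            "version": 1,
            "content": [
                {"type": "paragraph", "content": [{"type": "text", "text": p}]}
                for p in paragraphs
                if p  # skip empty paragraphs; ADF text nodes must be non-empty
            ],
        }
    }


def approved(reply: str) -> bool:
    """True only for an explicit go-ahead; anything else means keep waiting."""
    return reply.strip().lower().rstrip(".!") in APPROVALS


def comment_path(issue_key: str) -> str:
    """Path for the comment POST named in step 4."""
    return f"/rest/api/3/issue/{issue_key}/comment"


def preview(paragraphs: list[str]) -> str:
    """The exact payload to show the user before asking for sign-off."""
    return json.dumps(adf_comment(paragraphs), indent=2, ensure_ascii=False)
```

The intended order is `preview`, then wait for a reply that satisfies `approved`, and only then send the payload to `comment_path(...)`. A reply such as "looks fine" is not approval under this sketch.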

Worked example — Tada hang in dumbModel (JIRA-12345)

Summary. Tada's single-stream fast-path skipped a required cross-stream synchronization, causing kernels to launch before scratch-buffer writes were visible. Triggered reliably by dumbModel on LLM-7B fine-tuning, hanging the workload at the first eval step on every run. Fixed by removing the unsafe fast-path and adding a device-side null check on `scratchBuf`. JIRA-12345, PR org/platform#5751, owner Alex (Tada team).
Symptom. 8-GPU LLM-7B fine-tuning under dumbModel hung indefinitely at the first eval step. No error, no timeout, only a busy-spin in `tadaKernel_AllReduce_f32_RING`. Reproduced on every run.
Root cause. The single-stream fast-path in `tadaLaunchPrepare` / `tadaLaunchKernel` / `tadaLaunchFinish` (gated on `scheduler->numStreams == 1 && !plan->persistent`) skipped the cross-stream event between `launchStream` and `handle->shared->deviceStream`. dumbModel hits this gate exactly. The kernel was launched before the IPC publish / scratch-buffer writes on `deviceStream` (which populate `scratchBuf`) were visible to `launchStream`. In the kernel: `scratchBuf == NULL` → stray pointer dereference → ring ready-flag read from garbage memory → thread spins forever waiting for a ready signal that will never arrive.
Why it produced the symptom. The hang lives in the all-reduce ring waitloop, which is the last visible thing in the call stack — but the actual bug is at launch-prep, several frames earlier. The skipped sync is silent until a workload triggers the exact gate (single-stream, non-persistent), and dumbModel's reduce-scatter pattern hits it at every eval step.
Fix. PR #5751 removes the single-stream fast-path entirely (the saving was negligible vs. the safety it bypassed) and adds a device-side null check on `scratchBuf` before dereference, so the same class of bug fails loudly instead of silently spinning. A previous attempt (PR #5612) added a host-side defensive check after IPC publish that hid the symptom in some paths but left the underlying race in place; that change is also reverted.
How it was found. Reproducer narrowed from "8-GPU LLM-7B hangs sometimes" to a deterministic 30s repro by pinning to a single eval step on a 2-GPU subset. Initial hypothesis: kernel launch ordering on `launchStream`. Disproved by the debugger: the kernel was correctly enqueued. Second hypothesis: scratch-buffer init race. Confirmed by adding `[DBG-7af3]` instrumentation in `tadaLaunchPrepare` printing `scratchBuf` and a `deviceStream` event-record timestamp; the launch happened before the publish completed. Single experiment that nailed it: forcing `numStreams = 2` made the bug disappear, isolating the gate.
Why it slipped through. Latent code path. The single-stream fast-path was added in March under the assumption that dumbModel paths always took the multi-stream route. That assumption was true at the time. A May change to dumbModel's launcher began collapsing eval steps to a single stream — at which point the gate flipped. Tada's CI did not exercise the single-stream + IPC + scratch-buffer combination; the customer workload was the first to hit it.
Validation. Original LLM-7B / 8-GPU / dumbModel workload now completes a full eval pass cleanly (3 consecutive 2-hour runs). `tada-tests` `all_reduce_perf` regression suite green. Soak run: 6 hours on 8 GPUs, no hang. Not retested on other model sizes or non-dumbModel workloads; both go through the multi-stream path and were never affected.
Action items.
  • Regression test added: `tests/single_stream_ipc_publish_test.cpp` exercising the previously-uncovered gate. (Alex, merged in PR #5751.)
  • CI gap: add a single-stream + IPC matrix entry to nightly. (Alex, JIRA-12346.)
  • Doc update: Tada launch-fast-path invariants documented in `docs/launch_synchronization.md`. (Alex, PR #5752.)
  • Related: audit other `numStreams == 1` fast-paths for the same class of bug. (Filed as JIRA-12347.)
What this post-mortem does that the management-talk version didn't:
  • Names every code identifier (`tadaLaunchPrepare`, `scratchBuf`, `numStreams`, `handle->shared->deviceStream`).
  • Walks the cause chain end-to-end so the reader can grep their way to the offending lines.
  • Names the prior fix attempt (PR #5612) and what was wrong with it.
  • Documents the exact experiment that nailed the cause (`numStreams = 2` made it disappear).
  • States validation coverage honestly — "not retested on other model sizes" is information, not a hole.
  • Action items have owners and tracking artifacts.

Rules

  • Refuse to draft without all four required inputs. A post-mortem of a hypothesis is worse than no post-mortem.
  • Never invent root cause, owner, validation runs, or action items. If a section's facts aren't there, ask. Don't fill the gap with plausible prose.
  • Never strip code identifiers in the engineering record. They are the index. The leadership reframe is `management-talk`'s job, not yours.
  • Blameless. Describe gaps and bugs, never people.
  • State validation coverage honestly. If you only tested one config, say so. Implying broader coverage is the failure mode that breeds repeat regressions.
  • Get sign-off before posting to JIRA. Print-only output needs no approval. Never post to non-JIRA destinations from this skill.
  • One iteration is normal, three is a smell. If the user is still revising on the third pass, ask what specific section is wrong — don't keep tweaking blindly.