post-mortem

Post-mortem

The canonical engineering record of a bug fix. Written after debugging lands a real fix, for other engineers (and future-you, who will have forgotten everything in 6 months). Code identifiers are welcome here — this is the artifact that lets the next person recover the mental model fast.

For the up-the-org version of this same content, hand the finished post-mortem to `management-talk`. They compose: post-mortem owns the engineering truth, management-talk reframes it for leadership.

When to invoke

"/post-mortem"
"write the post-mortem / postmortem / RCA / root-cause analysis"
"document this fix" / "write up the root cause" / "close out this bug with a writeup"
After a debug session has clearly landed a fix, proactively offer to draft one.

When NOT to use

Bug not fixed yet, or fix not validated. A post-mortem of a hypothesis is misleading. Refuse and tell the user what's missing.
Customer-visible outage / incident. Those need a separate incident report (timeline, blast radius, paging history, comms). This skill is bug-fix scope. Flag and confirm before producing one.
Trivial fix (typo, obvious one-liner). The PR description is the record. Don't manufacture ceremony.

Required inputs — refuse to draft without these

Before writing a single line, confirm all four. If any are missing, list what's missing and stop:

Reliable repro exists (not "happens sometimes" — a deterministic or high-rate-flake repro the next person can run).
Root cause is known (the mechanism is identified, not a hypothesis).
Fix is identified (PR / commit / branch pointer).
Fix is validated (the original repro now passes; the customer workload / failing test now succeeds).

These map directly to `debug-mantra` steps 1–4. If you came in via `debug-mantra`, the breadcrumb ledger from step 4 is your raw material: pull from it.
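
The four-input gate above can be sketched as a small check that runs before any drafting. This is a minimal illustration, not part of the skill itself: the `RequiredInputs` class, its field names, and `missing_inputs` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequiredInputs:
    """The four inputs a post-mortem needs; None or False means 'not yet'."""
    repro: Optional[str] = None        # command or steps the next person can run
    root_cause: Optional[str] = None   # identified mechanism, not a hypothesis
    fix_pointer: Optional[str] = None  # PR / commit / branch pointer
    fix_validated: bool = False        # the original repro now passes


def missing_inputs(inputs: RequiredInputs) -> list[str]:
    """Return the names of missing inputs; an empty list means drafting may start."""
    missing = []
    if not inputs.repro:
        missing.append("reliable repro")
    if not inputs.root_cause:
        missing.append("known root cause")
    if not inputs.fix_pointer:
        missing.append("identified fix")
    if not inputs.fix_validated:
        missing.append("validated fix")
    return missing
```

If the returned list is non-empty, the caller would print it and stop rather than draft.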

Structure

Use these blocks in this order. Summary, Root cause, Fix, and Validation are mandatory. The rest are conditional but usually present.
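
One way to make the ordering and the mandatory/conditional split checkable is a small table plus a validator. The sketch below assumes a draft is held as a dict mapping section title to body text; `SECTIONS` and `check_draft` are illustrative names, and the titles are copied from the headings in this document.

```python
# (title, mandatory) in the order given in this document.
SECTIONS = [
    ("Summary", True),
    ("Symptom", False),
    ("Root cause", True),
    ("Why it produced the symptom", False),
    ("Fix", True),
    ("How it was found", False),
    ("Why it slipped through", False),
    ("Validation", True),
    ("Action items / follow-ups", False),
]


def check_draft(draft: dict[str, str]) -> list[str]:
    """Return problems: missing mandatory sections, unknown titles, wrong order."""
    order = [title for title, _ in SECTIONS]
    problems = [
        f"missing mandatory section: {title}"
        for title, mandatory in SECTIONS
        if mandatory and not draft.get(title, "").strip()
    ]
    problems += [f"unknown section: {title}" for title in draft if title not in order]
    # dicts keep insertion order, so this is the order the draft was written in.
    known = [title for title in draft if title in order]
    if known != sorted(known, key=order.index):
        problems.append("sections out of order")
    return problems
```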

1. Summary (mandatory)

One paragraph. What broke, in user/workload terms. What fixed it, in one sentence. JIRA key, PR number, owner. A reader who stops here should have the right answer.

2. Symptom

What was actually observed. Test output, error message, log line, perf number, customer report. Concrete identifiers — don't paraphrase the failure mode.

3. Root cause (mandatory)

The actual bug mechanism. Code identifiers welcome and expected — function names, file paths, struct fields, branch conditions, commit SHAs of the offending change. Walk the cause chain end-to-end. This is the most expensive section and the reason the post-mortem exists at all. Future-you will live or die by how clearly you write this.

4. Why it produced the symptom

Link the root cause to the symptom. Often non-obvious: the bug is in `tadaLaunchPrepare` but the visible failure is a customer training run hanging hours later. Walk the chain so a reader who only knows the symptom can connect it back to the cause without re-deriving it.

5. Fix (mandatory)

What changed and why this change addresses the root cause rather than hiding the symptom. Link to PR / commit. If a previous fix attempt papered over the symptom, name it and explain what was wrong with it — that history is part of the cause.

6. How it was found

Short. The debugging path:

What repro made it deterministic.
What tools cracked it (debugger, source tracing, knob enumeration, in-code instrumentation: the `debug-mantra` step 2 cascade).
Hypotheses tried and rejected, with the one-line reason each was rejected. (Pull from the breadcrumb ledger.)
The single experiment that confirmed the cause.

This section is for the next debugger — make it learnable.

7. Why it slipped through

What allowed this bug to reach the branch / release / customer. Pick the real reason:

CI gap (no test exercises this path / configuration).
Latent code (correct when written, broken by a later change in a different file).
Workload gap (no real workload reached this code path until now).
Incomplete prior fix (defensive check hid the symptom; root cause untouched).
Review miss (the change was reviewable; the implication wasn't).

If the honest answer is "no good reason — we should have caught this," say so. Blameless — describe the gap, not the person.

8. Validation (mandatory)

How we know the fix works. Concrete:

Original failing test now passes (test name, link).
Customer workload now completes (workload identifier, run link).
Perf regression resolved (number before, number after).
Stress / soak / fuzz run completed clean (duration, scale).
Other affected configurations / workloads also tested.

If you only validated one configuration, say so explicitly — "validated on Llama-2-70B / 8 GPUs / DeepSpeed; not retested on other workloads." Don't imply broader coverage than you actually have.

9. Action items / follow-ups

Concrete next-steps that aren't in the fix PR itself. Each item: what + owner + tracking artifact.

Regression test added at <seam>. (Owner, test name.)
Refactor to prevent class of bug. (Owner, ticket.)
CI gap closed: <new check>. (Owner, PR.)
Doc / runbook updated. (Owner, link.)
Related ticket filed for <adjacent issue you noticed>. (Owner, key.)

If there are no action items, write "None — the fix is sufficient and no class-of-bug follow-up is warranted." Don't manufacture action items to look thorough.
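
The "what + owner + tracking artifact" rule can be enforced mechanically. The sketch below is one possible shape, with `ActionItem` and `render` as hypothetical names; it refuses to format an item that lacks any of the three parts, which is a design choice of this example rather than something the skill prescribes.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    what: str      # the concrete next step
    owner: str     # who is responsible for it
    tracking: str  # ticket key, PR number, test name, or link


def render(item: ActionItem) -> str:
    """Format one item as 'what. (owner, tracking.)'; refuse incomplete items."""
    for field in ("what", "owner", "tracking"):
        if not getattr(item, field).strip():
            raise ValueError(f"action item is missing its {field}")
    return f"{item.what.rstrip('.')}. ({item.owner}, {item.tracking}.)"
```

An empty list of items is a valid outcome; in that case the caller writes the "None" sentence quoted above instead of calling `render`.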

Tone

This is engineer-to-engineer. Different from `management-talk`:

Code identifiers are first-class. `tadaLaunchPrepare`, `tada/prim.h::syncWaitPeer`, `scratchBuf`, commit SHAs, line numbers: keep them. The whole point is that future engineers can grep their way back to the change.
Mechanism over narrative. Walk the actual cause chain. Don't soften it into "a synchronization issue" — say which function skipped which event under which gate.
Active voice, concrete subjects, short paragraphs. Same rule as everywhere else.
No hedging. "We believe" / "appears to" / "may have" — drop. State it or don't write it.
Blameless. Describe the bug, the gap, and the fix. Never "X should have caught this." The CI gap is the failure mode, not the person.
No advocacy. A post-mortem records what happened and what's next. If you want to argue for a refactor, that's a separate proposal — link to it from the action items.
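
The "No hedging" rule lends itself to a quick lint pass over a draft. The following sketch flags only the three phrases named above; `HEDGES` and `find_hedges` are illustrative names, and a real pass would likely need a longer phrase list.

```python
import re

# Phrases the Tone rules say to drop; extend the tuple as needed.
HEDGES = ("we believe", "appears to", "may have")


def find_hedges(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for each hedging phrase, with 1-indexed lines."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for phrase in HEDGES:
            # Whole-phrase, case-insensitive match.
            if re.search(r"\b" + re.escape(phrase) + r"\b", line, re.IGNORECASE):
                hits.append((lineno, phrase))
    return hits
```

Each hit is a prompt to either state the claim outright or delete the sentence.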

Output flow

Confirm all four required inputs are satisfied. If any are missing, list them and stop. Do not draft.
Confirm where it goes (default: JIRA comment on the source ticket). Other valid destinations: PR description, `docs/postmortems/<ticket>.md`, internal wiki page. The shape is the same; only the wrapping changes.
Produce the draft as a single chat block.
Sign-off before posting. If posting back to JIRA, show the exact ADF payload, wait for explicit "post it" / "go ahead" / "yes," then `POST /rest/api/3/issue/<KEY>/comment`. Print-only output needs no approval.
Offer the management-talk handoff: "Want a leadership-flavored version? I can hand this to `management-talk`." Don't do it automatically.
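
The sign-off step can be sketched as three small helpers: one that wraps the draft in an ADF document, one that builds the comment path, and one that recognizes an explicit approval. This is a rough illustration under stated assumptions: the helper names are hypothetical, the draft is treated as plain text with one ADF paragraph per non-empty line, and the HTTP call and authentication are deliberately left out.

```python
import json

# The explicit approvals named in the sign-off step.
APPROVALS = {"post it", "go ahead", "yes"}


def adf_comment(text: str) -> dict:
    """Wrap plain text as an ADF document, one paragraph node per non-empty line."""
    paragraphs = [
        {"type": "paragraph", "content": [{"type": "text", "text": line}]}
        for line in text.splitlines()
        if line.strip()
    ]
    return {"body": {"type": "doc", "version": 1, "content": paragraphs}}


def comment_path(issue_key: str) -> str:
    """Path of the comment endpoint for one issue."""
    return f"/rest/api/3/issue/{issue_key}/comment"


def approved(reply: str) -> bool:
    """True only for an explicit go-ahead; anything else means do not post."""
    return reply.strip().strip('."!,').lower() in APPROVALS


def preview(text: str) -> str:
    """The exact JSON that would be sent, for showing to the user first."""
    return json.dumps(adf_comment(text), indent=2)
```

The intended order is `preview`, then wait for a reply, then post only if `approved` returns true.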

Worked example — Tada hang in dumbModel (JIRA-12345)

Summary. Tada's single-stream fast-path skipped a required cross-stream synchronization, causing kernels to launch before scratch-buffer writes were visible. Triggered reliably by dumbModel on LLM-7B fine-tuning, hanging the workload at every eval step. Fixed by removing the unsafe fast-path and tightening a device-side check. JIRA-12345, PR org/platform#5751, owner Alex (Tada team).
Symptom. 8-GPU LLM-7B fine-tuning under dumbModel hung indefinitely at the first eval step. No error, no timeout, just a busy-spin in `tadaKernel_AllReduce_f32_RING`. Reproduced on every run.
Root cause. The single-stream fast-path in `tadaLaunchPrepare` / `tadaLaunchKernel` / `tadaLaunchFinish` (gated on `scheduler->numStreams == 1 && !plan->persistent`) skipped the cross-stream event between `launchStream` and `handle->shared->deviceStream`. dumbModel hits this gate exactly. The kernel was launched before the IPC publish / scratch-buffer writes on `deviceStream` (which populate `scratchBuf`) were visible to `launchStream`. In the kernel: `scratchBuf == NULL` → stray pointer dereference → ring ready-flag read from garbage memory → thread spins forever waiting for a ready signal that will never arrive.
Why it produced the symptom. The hang lives in the all-reduce ring waitloop, which is the last visible thing in the call stack — but the actual bug is at launch-prep, several frames earlier. The skipped sync is silent until a workload triggers the exact gate (single-stream, non-persistent), and dumbModel's reduce-scatter pattern hits it at every eval step.
Fix. PR #5751 removes the single-stream fast-path entirely (the saving was negligible vs. the safety it bypassed) and adds a device-side null check on `scratchBuf` before dereference, so the same class of bug fails loudly instead of silently spinning. A previous attempt (PR #5612) added a host-side defensive check after IPC publish that hid the symptom in some paths but left the underlying race in place; that change is also reverted.
How it was found. Reproducer narrowed from "8-GPU LLM-7B hangs sometimes" to a deterministic 30s repro by pinning to a single eval step on a 2-GPU subset. Initial hypothesis: kernel launch ordering on `launchStream`. Disproved by the debugger: the kernel was correctly enqueued. Second hypothesis: scratch-buffer init race. Confirmed by adding `[DBG-7af3]` instrumentation in `tadaLaunchPrepare` printing `scratchBuf` and a `deviceStream` event-record timestamp; the launch happened before the publish completed. Single experiment that nailed it: forcing `numStreams = 2` made the bug disappear, isolating the gate.
Why it slipped through. Latent code path. The single-stream fast-path was added in March under the assumption that dumbModel paths always took the multi-stream route. That assumption was true at the time. A May change to dumbModel's launcher began collapsing eval steps to a single stream — at which point the gate flipped. Tada's CI did not exercise the single-stream + IPC + scratch-buffer combination; the customer workload was the first to hit it.
Validation. Original LLM-7B / 8-GPU / dumbModel workload now completes a full eval pass cleanly (3 consecutive 2-hour runs). `tada-tests` `all_reduce_perf` regression suite green. Soak run: 6 hours on 8 GPUs, no hang. Not retested on other model sizes or non-dumbModel workloads; both go through the multi-stream path and were never affected.
Action items.
Regression test added: `tests/single_stream_ipc_publish_test.cpp` exercising the previously-uncovered gate. (Alex, merged in PR #5751.)
CI gap: add a single-stream + IPC matrix entry to nightly. (Alex, JIRA-12346.)
Doc update: Tada launch-fast-path invariants documented in `docs/launch_synchronization.md`. (Alex, PR #5752.)
Related: audit other `numStreams == 1` fast-paths for the same class of bug. (Filed as JIRA-12347.)

What this post-mortem does that the management-talk version didn't:

Names every code identifier (`tadaLaunchPrepare`, `scratchBuf`, `numStreams`, `handle->shared->deviceStream`).
Walks the cause chain end-to-end so the reader can grep their way to the offending lines.
Names the prior fix attempt (PR #5612) and what was wrong with it.
Documents the exact experiment that nailed the cause (`numStreams = 2` made it disappear).
States validation coverage honestly — "not retested on other model sizes" is information, not a hole.
Action items have owners and tracking artifacts.

Rules

Refuse to draft without all four required inputs. A post-mortem of a hypothesis is worse than no post-mortem.
Never invent root cause, owner, validation runs, or action items. If a section's facts aren't there, ask. Don't fill the gap with plausible prose.
Never strip code identifiers in the engineering record. They are the index. The leadership reframe is `management-talk`'s job, not yours.
Blameless. Describe gaps and bugs, never people.
State validation coverage honestly. If you only tested one config, say so. Implying broader coverage is the failure mode that breeds repeat regressions.
Get sign-off before posting to JIRA. Print-only output needs no approval. Never post to non-JIRA destinations from this skill.
One iteration is normal, three is a smell. If the user is still revising on the third pass, ask what specific section is wrong — don't keep tweaking blindly.
