post-mortem
The canonical engineering record of a bug fix. Written after debugging lands a real fix, for other engineers (and future-you, who will have forgotten everything in 6 months). Code identifiers are welcome here — this is the artifact that lets the next person recover the mental model fast.
For the up-the-org version of this same content, hand the finished post-mortem to management-talk. The two compose: post-mortem owns the engineering truth, management-talk reframes it for leadership.
When to invoke
- "/post-mortem"
- "write the post-mortem / postmortem / RCA / root-cause analysis"
- "document this fix" / "write up the root cause" / "close out this bug with a writeup"
- After a debug session has clearly landed a fix, proactively offer to draft one.
When NOT to use
- Bug not fixed yet, or fix not validated. A post-mortem of a hypothesis is misleading. Refuse and tell the user what's missing.
- Customer-visible outage / incident. Those need a separate incident report (timeline, blast radius, paging history, comms). This skill is bug-fix scope. Flag and confirm before producing one.
- Trivial fix (typo, obvious one-liner). The PR description is the record. Don't manufacture ceremony.
Required inputs — refuse to draft without these
Before writing a single line, confirm all four. If any are missing, list what's missing and stop:
- Reliable repro exists (not "happens sometimes" — a deterministic or high-rate-flake repro the next person can run).
- Root cause is known (the mechanism is identified, not a hypothesis).
- Fix is identified (PR / commit / branch pointer).
- Fix is validated (the original repro now passes; the customer workload / failing test now succeeds).
These map directly to debug-mantra steps 1–4. If you came in via debug-mantra, the breadcrumb ledger from step 4 is your raw material; pull from it.
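The four-input gate above can be sketched in a few lines. This is a minimal illustration, not part of the skill: the RequiredInputs fields and the gate helper are hypothetical names chosen to mirror the checklist.

```python
from dataclasses import dataclass, fields


@dataclass
class RequiredInputs:
    """The four preconditions from the checklist (field names are illustrative)."""
    reliable_repro: bool = False
    root_cause_known: bool = False
    fix_identified: bool = False
    fix_validated: bool = False


def missing_inputs(inputs: RequiredInputs) -> list[str]:
    """Return the names of unmet preconditions, in checklist order."""
    return [f.name for f in fields(inputs) if not getattr(inputs, f.name)]


def gate(inputs: RequiredInputs) -> str:
    """Refuse to draft unless all four inputs are confirmed."""
    missing = missing_inputs(inputs)
    if missing:
        return "Cannot draft yet. Missing: " + ", ".join(missing)
    return "All four inputs confirmed. Drafting."
```

The point of returning the missing names rather than a bare boolean is that the refusal has to tell the user what is missing.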
Structure
Use these blocks in this order. Summary, Root cause, Fix, and Validation are mandatory. The rest are conditional but usually present.
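One way the ordering and mandatory/conditional rule could be checked mechanically, assuming a draft has already been reduced to its list of section headings. The SECTIONS table mirrors the nine blocks of this structure; check_structure is a hypothetical helper, not something the skill defines.

```python
# (title, mandatory) pairs in the required order.
SECTIONS = [
    ("Summary", True),
    ("Symptom", False),
    ("Root cause", True),
    ("Why it produced the symptom", False),
    ("Fix", True),
    ("How it was found", False),
    ("Why it slipped through", False),
    ("Validation", True),
    ("Action items / follow-ups", False),
]


def check_structure(headings: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the headings pass."""
    order = {title: i for i, (title, _) in enumerate(SECTIONS)}
    problems = [f"unknown section: {h}" for h in headings if h not in order]
    known = [h for h in headings if h in order]
    positions = [order[h] for h in known]
    if positions != sorted(positions):
        problems.append("sections out of order")
    for title, mandatory in SECTIONS:
        if mandatory and title not in known:
            problems.append(f"missing mandatory section: {title}")
    return problems
```

A draft with only the four mandatory blocks passes; a draft that omits Validation or swaps Fix ahead of Root cause does not.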
1. Summary (mandatory)
One paragraph. What broke, in user/workload terms. What fixed it, in one sentence. JIRA key, PR number, owner. A reader who stops here should have the right answer.
2. Symptom
What was actually observed. Test output, error message, log line, perf number, customer report. Concrete identifiers — don't paraphrase the failure mode.
3. Root cause (mandatory)
The actual bug mechanism. Code identifiers welcome and expected — function names, file paths, struct fields, branch conditions, commit SHAs of the offending change. Walk the cause chain end-to-end. This is the most expensive section and the reason the post-mortem exists at all. Future-you will live or die by how clearly you write this.
4. Why it produced the symptom
Link the root cause to the symptom. Often non-obvious: the bug is in tadaLaunchPrepare but the visible failure is a customer training run hanging hours later. Walk the chain so a reader who only knows the symptom can connect it back to the cause without re-deriving it.
5. Fix (mandatory)
What changed and why this change addresses the root cause rather than hiding the symptom. Link to PR / commit. If a previous fix attempt papered over the symptom, name it and explain what was wrong with it — that history is part of the cause.
6. How it was found
Short. The debugging path:
- What repro made it deterministic.
- What tools cracked it (debugger, source tracing, knob enumeration, in-code instrumentation: the debug-mantra step 2 cascade).
- Hypotheses tried and rejected, with the one-line reason each was rejected. (Pull from the breadcrumb ledger.)
- The single experiment that confirmed the cause.
This section is for the next debugger — make it learnable.
7. Why it slipped through
What allowed this bug to reach the branch / release / customer. Pick the real reason:
- CI gap (no test exercises this path / configuration).
- Latent code (correct when written, broken by a later change in a different file).
- Workload gap (no real workload reached this code path until now).
- Incomplete prior fix (defensive check hid the symptom; root cause untouched).
- Review miss (the change was reviewable; the implication wasn't).
If the honest answer is "no good reason — we should have caught this," say so. Blameless — describe the gap, not the person.
8. Validation (mandatory)
How we know the fix works. Concrete:
- Original failing test now passes (test name, link).
- Customer workload now completes (workload identifier, run link).
- Perf regression resolved (number before, number after).
- Stress / soak / fuzz run completed clean (duration, scale).
- Other affected configurations / workloads also tested.
If you only validated one configuration, say so explicitly — "validated on Llama-2-70B / 8 GPUs / DeepSpeed; not retested on other workloads." Don't imply broader coverage than you actually have.
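A small sketch of how the coverage statement could be assembled so the untested scope is always spelled out. The coverage_statement helper and its two-list input are assumptions made for illustration.

```python
def coverage_statement(validated: list[str], not_retested: list[str]) -> str:
    """Build one honest coverage line from what was and was not tested."""
    if not validated:
        # Without any validated configuration the fix is not validated at all.
        raise ValueError("need at least one validated configuration")
    line = "Validated on " + "; ".join(validated) + "."
    if not_retested:
        line += " Not retested on " + "; ".join(not_retested) + "."
    return line
```

Keeping the untested scope as a separate argument forces the author to decide what goes there instead of leaving it implied.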
9. Action items / follow-ups
Concrete next-steps that aren't in the fix PR itself. Each item: what + owner + tracking artifact.
- Regression test added at <seam>. (Owner, test name.)
- Refactor to prevent class of bug. (Owner, ticket.)
- CI gap closed: <new check>. (Owner, PR.)
- Doc / runbook updated. (Owner, link.)
- Related ticket filed for <adjacent issue you noticed>. (Owner, key.)
If there are no action items, write "None — the fix is sufficient and no class-of-bug follow-up is warranted." Don't manufacture action items to look thorough.
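The "what + owner + tracking artifact" rule could be enforced with a small record type. This is a sketch under assumptions: ActionItem and render are hypothetical names, and the output format simply follows the bullet shape used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionItem:
    what: str
    owner: str
    tracking: str  # ticket key, PR, test name, or link

    def __post_init__(self):
        # Reject items that lack any of the three required parts.
        for name in ("what", "owner", "tracking"):
            if not getattr(self, name).strip():
                raise ValueError(f"action item is missing '{name}'")

    def render(self) -> str:
        """Format the item as one bullet: what, then (owner, tracking)."""
        return f"- {self.what}. ({self.owner}, {self.tracking}.)"
```

An item with no owner or no tracking artifact fails at construction, so it never reaches the draft. An empty list is still valid and gets the explicit "None" sentence rather than invented items.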
Tone
This is engineer-to-engineer. Different from management-talk:
- Code identifiers are first-class. tadaLaunchPrepare, scratchBuf, tada/prim.h::syncWaitPeer, commit SHAs, line numbers: keep them. The whole point is that future engineers can grep their way back to the change.
- Mechanism over narrative. Walk the actual cause chain. Don't soften it into "a synchronization issue"; say which function skipped which event under which gate.
- Active voice, concrete subjects, short paragraphs. Same rule as everywhere else.
- No hedging. "We believe" / "appears to" / "may have" — drop. State it or don't write it.
- Blameless. Describe the bug, the gap, and the fix. Never "X should have caught this." The CI gap is the failure mode, not the person.
- No advocacy. A post-mortem records what happened and what's next. If you want to argue for a refactor, that's a separate proposal — link to it from the action items.
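The "no hedging" rule lends itself to a quick lint pass over a draft. A minimal sketch, assuming the three phrases quoted above are the whole list; HEDGES and find_hedges are illustrative names and a real list would likely be longer.

```python
import re

# The hedge phrases named in the tone rules.
HEDGES = ("we believe", "appears to", "may have")
_PATTERN = re.compile("|".join(re.escape(h) for h in HEDGES), re.IGNORECASE)


def find_hedges(draft: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for each hedge phrase, 1-indexed."""
    hits = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits
```

Each hit is a prompt to either state the claim outright or delete it.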
Output flow
- Confirm all four required inputs are satisfied. If any are missing, list them and stop. Do not draft.
- Confirm where it goes (default: JIRA comment on the source ticket). Other valid destinations: PR description, docs/postmortems/<ticket>.md, internal wiki page. The shape is the same; only the wrapping changes.
- Produce the draft as a single chat block.
- Sign-off before posting. If posting back to JIRA, show the exact ADF payload, wait for explicit "post it" / "go ahead" / "yes," then POST /rest/api/3/issue/<KEY>/comment. Print-only output needs no approval.
- Offer the management-talk handoff: "Want a leadership-flavored version? I can hand this to management-talk." Don't do it automatically.
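The sign-off step can be sketched as three small pieces: build the ADF payload, show it, and post only on an explicit approval. This is an illustration under assumptions. It wraps each plain-text paragraph in a minimal ADF document (type "doc", version 1), performs no network call, and the helper names are hypothetical.

```python
import json

# Replies that count as explicit approval, per the sign-off step.
APPROVALS = {"post it", "go ahead", "yes"}


def adf_comment(paragraphs: list[str]) -> dict:
    """Wrap plain-text paragraphs in a minimal ADF comment body."""
    return {
        "body": {
            "type": "doc",
            "version": 1,
            "content": [
                {"type": "paragraph", "content": [{"type": "text", "text": p}]}
                for p in paragraphs
                if p.strip()  # skip blank paragraphs; empty text nodes are not useful
            ],
        }
    }


def comment_path(issue_key: str) -> str:
    """Path of the comment endpoint for the given ticket key."""
    return f"/rest/api/3/issue/{issue_key}/comment"


def preview(payload: dict) -> str:
    """The exact payload text to show the user before posting."""
    return json.dumps(payload, indent=2, ensure_ascii=False)


def approved(reply: str) -> bool:
    """True only for an explicit go-ahead; anything else means keep waiting."""
    return reply.strip().lower().rstrip(".!") in APPROVALS
```

A reply such as "looks fine" is deliberately not treated as approval; only the explicit phrases unlock the POST.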
Worked example — Tada hang in dumbModel (JIRA-12345)
Summary. Tada's single-stream fast-path skipped a required cross-stream synchronization, causing kernels to launch before scratch-buffer writes were visible. Triggered reliably by dumbModel on LLM-7B fine-tuning, hanging the workload at every eval step. Fixed by removing the unsafe fast-path and tightening a device-side check. JIRA-12345, PR org/platform#5751, owner Alex (Tada team).
Symptom. 8-GPU LLM-7B fine-tuning under dumbModel hung indefinitely at the first eval step. No error, no timeout: busy-spin in tadaKernel_AllReduce_f32_RING. Reproduced on every run.
Root cause. The single-stream fast-path in tadaLaunchPrepare / tadaLaunchKernel / tadaLaunchFinish (gated on scheduler->numStreams == 1 && !plan->persistent) skipped the cross-stream event between launchStream and handle->shared->deviceStream. dumbModel hits this gate exactly. The kernel was launched before the IPC publish / scratch-buffer writes on deviceStream (which populate scratchBuf) were visible to launchStream. In the kernel: scratchBuf == NULL → stray pointer dereference → ring ready-flag read from garbage memory → thread spins forever waiting for a ready signal that will never arrive.
Why it produced the symptom. The hang lives in the all-reduce ring wait loop, which is the last visible thing in the call stack, but the actual bug is at launch-prep, several frames earlier. The skipped sync is silent until a workload triggers the exact gate (single-stream, non-persistent), and dumbModel's reduce-scatter pattern hits it at every eval step.
Fix. PR #5751 removes the single-stream fast-path entirely (the saving was negligible vs. the safety it bypassed) and adds a device-side null check on scratchBuf before dereference, so the same class of bug fails loudly instead of silently spinning. A previous attempt (PR #5612) added a host-side defensive check after IPC publish that hid the symptom in some paths but left the underlying race in place; that change is also reverted.
How it was found. Reproducer narrowed from "8-GPU LLM-7B hangs sometimes" to a deterministic 30s repro by pinning to a single eval step on a 2-GPU subset. Initial hypothesis: kernel launch ordering on launchStream. Disproved by the debugger: the kernel was correctly enqueued. Second hypothesis: scratch-buffer init race. Confirmed by adding [DBG-7af3] instrumentation in tadaLaunchPrepare printing scratchBuf and a deviceStream event-record timestamp; the launch happened before the publish completed. Single experiment that nailed it: forcing numStreams = 2 made the bug disappear, isolating the gate.
Why it slipped through. Latent code path. The single-stream fast-path was added in March under the assumption that dumbModel paths always took the multi-stream route. That assumption was true at the time. A May change to dumbModel's launcher began collapsing eval steps to a single stream, at which point the gate flipped. Tada's CI did not exercise the single-stream + IPC + scratch-buffer combination; the customer workload was the first to hit it.
Validation. Original LLM-7B / 8-GPU / dumbModel workload now completes a full eval pass cleanly (3 consecutive 2-hour runs). tada-tests all_reduce_perf regression suite green. Soak run: 6 hours on 8 GPUs, no hang. Not retested on other model sizes or non-dumbModel workloads; both go through the multi-stream path and were never affected.
Action items.
- Regression test added: tests/single_stream_ipc_publish_test.cpp exercising the previously-uncovered gate. (Alex, merged in PR #5751.)
- CI gap: add a single-stream + IPC matrix entry to nightly. (Alex, JIRA-12346.)
- Doc update: Tada launch-fast-path invariants documented in docs/launch_synchronization.md. (Alex, PR #5752.)
- Related: audit other numStreams == 1 fast-paths for the same class of bug. (Filed as JIRA-12347.)
What this post-mortem does that the management-talk version didn't:
- Names every code identifier (tadaLaunchPrepare, scratchBuf, numStreams, handle->shared->deviceStream).
- Walks the cause chain end-to-end so the reader can grep their way to the offending lines.
- Names the prior fix attempt (PR #5612) and what was wrong with it.
- Documents the exact experiment that nailed the cause (numStreams = 2 made it disappear).
- States validation coverage honestly: "not retested on other model sizes" is information, not a hole.
- Action items have owners and tracking artifacts.
Rules
- Refuse to draft without all four required inputs. A post-mortem of a hypothesis is worse than no post-mortem.
- Never invent root cause, owner, validation runs, or action items. If a section's facts aren't there, ask. Don't fill the gap with plausible prose.
- Never strip code identifiers in the engineering record. They are the index. The leadership reframe is management-talk's job, not yours.
- Blameless. Describe gaps and bugs, never people.
- State validation coverage honestly. If you only tested one config, say so. Implying broader coverage is the failure mode that breeds repeat regressions.
- Get sign-off before posting to JIRA. Print-only output needs no approval. Never post to non-JIRA destinations from this skill.
- One iteration is normal, three is a smell. If the user is still revising on the third pass, ask what specific section is wrong — don't keep tweaking blindly.