Post-mortem

The canonical engineering record of a bug fix. Written after debugging lands a real fix, for other engineers (and future-you, who will have forgotten everything in 6 months). Code identifiers are welcome here — this is the artifact that lets the next person recover the mental model fast.

For the up-the-org version of this same content, hand the finished post-mortem to

management-talk

. They compose: post-mortem owns the engineering truth, management-talk reframes it for leadership.

When to invoke

"/post-mortem"
"write the post-mortem / postmortem / RCA / root-cause analysis"
"document this fix" / "write up the root cause" / "close out this bug with a writeup"
After a debug session has clearly landed a fix, proactively offer to draft one.

When NOT to use

Bug not fixed yet, or fix not validated. A post-mortem of a hypothesis is misleading. Refuse and tell the user what's missing.
Customer-visible outage / incident. Those need a separate incident report (timeline, blast radius, paging history, comms). This skill is bug-fix scope. Flag and confirm before producing one.
Trivial fix (typo, obvious one-liner). The PR description is the record. Don't manufacture ceremony.

Required inputs — refuse to draft without these

Before writing a single line, confirm all four. If any are missing, list what's missing and stop:

Reliable repro exists (not "happens sometimes" — a deterministic or high-rate-flake repro the next person can run).
Root cause is known (the mechanism is identified, not a hypothesis).
Fix is identified (PR / commit / branch pointer).
Fix is validated (the original repro now passes; the customer workload / failing test now succeeds).

These map directly to

debug-mantra

steps 1–4. If you came in via

debug-mantra

, the breadcrumb ledger from step 4 is your raw material — pull from it.

Structure

Use these blocks in this order. Summary, Root cause, Fix, and Validation are mandatory. The rest are conditional but usually present.

1. Summary (mandatory)

One paragraph. What broke, in user/workload terms. What fixed it, in one sentence. JIRA key, PR number, owner. A reader who stops here should have the right answer.

2. Symptom

What was actually observed. Test output, error message, log line, perf number, customer report. Concrete identifiers — don't paraphrase the failure mode.

3. Root cause (mandatory)

The actual bug mechanism. Code identifiers welcome and expected — function names, file paths, struct fields, branch conditions, commit SHAs of the offending change. Walk the cause chain end-to-end. This is the most expensive section and the reason the post-mortem exists at all. Future-you will live or die by how clearly you write this.

4. Why it produced the symptom

Link the root cause to the symptom. Often non-obvious — the bug is in

tadaLaunchPrepare

but the visible failure is a customer training run hanging hours later. Walk the chain so a reader who only knows the symptom can connect it back to the cause without re-deriving it.

5. Fix (mandatory)

What changed and why this change addresses the root cause rather than hiding the symptom. Link to PR / commit. If a previous fix attempt papered over the symptom, name it and explain what was wrong with it — that history is part of the cause.

6. How it was found

Short. The debugging path:

What repro made it deterministic.
What tools cracked it (debugger, source tracing, knob enumeration, in-code instrumentation — the
```
debug-mantra
```
step 2 cascade).
Hypotheses tried and rejected, with the one-line reason each was rejected. (Pull from the breadcrumb ledger.)
The single experiment that confirmed the cause.

This section is for the next debugger — make it learnable.

7. Why it slipped through

What allowed this bug to reach the branch / release / customer. Pick the real reason:

CI gap (no test exercises this path / configuration).
Latent code (correct when written, broken by a later change in a different file).
Workload gap (no real workload reached this code path until now).
Incomplete prior fix (defensive check hid the symptom; root cause untouched).
Review miss (the change was reviewable; the implication wasn't).

If the honest answer is "no good reason — we should have caught this," say so. Blameless — describe the gap, not the person.

8. Validation (mandatory)

How we know the fix works. Concrete:

Original failing test now passes (test name, link).
Customer workload now completes (workload identifier, run link).
Perf regression resolved (number before, number after).
Stress / soak / fuzz run completed clean (duration, scale).
Other affected configurations / workloads also tested.

If you only validated one configuration, say so explicitly — "validated on Llama-2-70B / 8 GPUs / DeepSpeed; not retested on other workloads." Don't imply broader coverage than you actually have.

9. Action items / follow-ups

Concrete next-steps that aren't in the fix PR itself. Each item: what + owner + tracking artifact.

Regression test added at <seam>. (Owner, test name.)
Refactor to prevent class of bug. (Owner, ticket.)
CI gap closed: <new check>. (Owner, PR.)
Doc / runbook updated. (Owner, link.)
Related ticket filed for <adjacent issue you noticed>. (Owner, key.)

If there are no action items, write "None — the fix is sufficient and no class-of-bug follow-up is warranted." Don't manufacture action items to look thorough.

Tone

This is engineer-to-engineer. Different from

management-talk

Code identifiers are first-class.
```
tadaLaunchPrepare
```
,
```
tada/prim.h::syncWaitPeer
```
,
```
scratchBuf
```
, commit SHAs, line numbers — keep them. The whole point is that future engineers can grep their way back to the change.
Mechanism over narrative. Walk the actual cause chain. Don't soften it into "a synchronization issue" — say which function skipped which event under which gate.
Active voice, concrete subjects, short paragraphs. Same rule as everywhere else.
No hedging. "We believe" / "appears to" / "may have" — drop. State it or don't write it.
Blameless. Describe the bug, the gap, and the fix. Never "X should have caught this." The CI gap is the failure mode, not the person.
No advocacy. A post-mortem records what happened and what's next. If you want to argue for a refactor, that's a separate proposal — link to it from the action items.

Output flow

Confirm all four required inputs are satisfied. If any are missing, list them and stop. Do not draft.
Confirm where it goes (default: JIRA comment on the source ticket). Other valid destinations: PR description,
```
docs/postmortems/<ticket>.md
```
, internal wiki page. The shape is the same — only the wrapping changes.
Produce the draft as a single chat block.
Sign-off before posting. If posting back to JIRA, show the exact ADF payload, wait for explicit "post it" / "go ahead" / "yes," then
```
POST /rest/api/3/issue/<KEY>/comment
```
. Print-only output needs no approval.
Offer the management-talk handoff: "Want a leadership-flavored version? I can hand this to
management-talk
." Don't do it automatically.

Worked example — Tada hang in dumbModel (JIRA-12345)

Summary. Tada's single-stream fast-path skipped a required cross-stream synchronization, causing kernels to launch before scratch-buffer writes were visible. Triggered reliably by dumbModel on LLM-7B fine-tuning, hanging the workload at every eval step. Fixed by removing the unsafe fast-path and tightening a device-side check. JIRA-12345, PR org/platform#5751, owner Alex (Tada team).
Symptom. 8-GPU LLM-7B fine-tuning under dumbModel hung indefinitely at the first eval step. No error, no timeout — busy-spin in
tadaKernel_AllReduce_f32_RING
. Reproduced on every run.
Root cause. The single-stream fast-path in
tadaLaunchPrepare
/
tadaLaunchKernel
/
tadaLaunchFinish
(gated on
scheduler->numStreams == 1 && !plan->persistent
) skipped the cross-stream event between
launchStream
and
handle->shared->deviceStream
. dumbModel hits this gate exactly. The kernel was launched before the IPC publish / scratch-buffer writes on
deviceStream
(which populate
scratchBuf
) were visible to
launchStream
. In the kernel:
scratchBuf == NULL
→ stray pointer dereference → ring ready-flag read from garbage memory → thread spins forever waiting for a ready signal that will never arrive.
Why it produced the symptom. The hang lives in the all-reduce ring waitloop, which is the last visible thing in the call stack — but the actual bug is at launch-prep, several frames earlier. The skipped sync is silent until a workload triggers the exact gate (single-stream, non-persistent), and dumbModel's reduce-scatter pattern hits it at every eval step.
Fix. PR #5751 removes the single-stream fast-path entirely (the saving was negligible vs. the safety it bypassed) and adds a device-side null check on
scratchBuf
before dereference, so the same class of bug fails loudly instead of silently spinning. A previous attempt (PR #5612) added a host-side defensive check after IPC publish that hid the symptom in some paths but left the underlying race in place — that change is also reverted.
How it was found. Reproducer narrowed from "8-GPU LLM-7B hangs sometimes" to a deterministic 30s repro by pinning to a single eval step on a 2-GPU subset. Initial hypothesis: kernel launch ordering on
launchStream
. Disproved by the debugger — the kernel was correctly enqueued. Second hypothesis: scratch-buffer init race. Confirmed by adding
[DBG-7af3]
instrumentation in
tadaLaunchPrepare
printing
scratchBuf
and a
deviceStream
event-record timestamp; the launch happened before the publish completed. Single experiment that nailed it: forcing
numStreams = 2
made the bug disappear, isolating the gate.
Why it slipped through. Latent code path. The single-stream fast-path was added in March under the assumption that dumbModel paths always took the multi-stream route. That assumption was true at the time. A May change to dumbModel's launcher began collapsing eval steps to a single stream — at which point the gate flipped. Tada's CI did not exercise the single-stream + IPC + scratch-buffer combination; the customer workload was the first to hit it.
Validation. Original LLM-7B / 8-GPU / dumbModel workload now completes a full eval pass cleanly (3 consecutive 2-hour runs).
tada-tests
all_reduce_perf
regression suite green. Soak run: 6 hours on 8 GPUs, no hang. Not retested on other model sizes or non-dumbModel workloads — both go through the multi-stream path and were never affected.
Action items.
Regression test added:
tests/single_stream_ipc_publish_test.cpp
exercising the previously-uncovered gate. (Alex, merged in PR #5751.)
CI gap: add a single-stream + IPC matrix entry to nightly. (Alex, JIRA-12346.)
Doc update: Tada launch-fast-path invariants documented in
docs/launch_synchronization.md
. (Alex, PR #5752.)
Related: audit other
numStreams == 1
fast-paths for the same class of bug. (Filed as JIRA-12347.)

What this post-mortem does that the management-talk version didn't:

Names every code identifier (

tadaLaunchPrepare

scratchBuf

numStreams

handle->shared->deviceStream

Walks the cause chain end-to-end so the reader can grep their way to the offending lines.
Names the prior fix attempt (PR #5612) and what was wrong with it.
Documents the exact experiment that nailed the cause (
```
numStreams = 2
```
made it disappear).
States validation coverage honestly — "not retested on other model sizes" is information, not a hole.
Action items have owners and tracking artifacts.

Rules

Refuse to draft without all four required inputs. A post-mortem of a hypothesis is worse than no post-mortem.
Never invent root cause, owner, validation runs, or action items. If a section's facts aren't there, ask. Don't fill the gap with plausible prose.
Never strip code identifiers in the engineering record. They are the index. The leadership reframe is
```
management-talk
```
's job, not yours.
Blameless. Describe gaps and bugs, never people.
State validation coverage honestly. If you only tested one config, say so. Implying broader coverage is the failure mode that breeds repeat regressions.
Get sign-off before posting to JIRA. Print-only output needs no approval. Never post to non-JIRA destinations from this skill.
One iteration is normal, three is a smell. If the user is still revising on the third pass, ask what specific section is wrong — don't keep tweaking blindly.

post-mortem

NPX Install

Tags

SKILL.md Content