post-mortems-retrospectives

Post-mortems & Retrospectives

Scope

Covers
  • Running blameless incident post-mortems and project/OKR retrospectives
  • Turning "what happened?" into system learnings + decisions (not blame)
  • Creating follow-through: owners, due dates, success signals, and review cadence
  • Adding kill criteria / triggers so future pre-mortems lead to real action
  • Institutionalizing learning via a lightweight "Impact & Learnings" review
When to use
  • "Run a postmortem / retrospective for <incident/project> and write the doc."
  • "We missed OKRs—lead a retro focused on learning and systemic blockers."
  • "Create an after-action review with action items and owners."
  • "Set up a weekly impact & learnings review so insights don't die in docs."
  • "Do a pre-mortem and define kill criteria / pivot triggers."
When NOT to use
  • The incident is still active (do incident response first; schedule the review after stabilization)
  • The goal is to assign blame or evaluate an individual's performance (use HR/management processes)
  • You need deep technical debugging without the right experts (this skill facilitates; it doesn't replace engineering investigation)
  • You need to decide what problem to solve (use a problem-definition / discovery process first)
  • You need to facilitate a meeting that is not a post-mortem or retrospective (use running-effective-meetings)
  • You need to improve the shipping process itself, not review a past launch (use shipping-products)
  • You need to change engineering culture or practices based on systemic patterns across retros (use engineering-culture)
  • You need to plan for future risks and uncertainties rather than review past events (use planning-under-uncertainty)

Inputs

Minimum required
  • What are we reviewing? (incident / project / OKR period) + 1–2 sentence summary
  • Time window and key dates (start/end; detection time; resolution time if incident)
  • Desired outcome (learning, prevention, speed, quality, alignment)
  • Participants/roles (facilitator, scribe, decision owner; key stakeholders)
  • Evidence available (timeline notes, metrics, dashboards, tickets, docs)
  • Constraints (privacy; what to anonymize; audience)
Missing-info strategy
  • Ask at most 5 questions per round from references/INTAKE.md (typically 3–5).
  • If details are unavailable, proceed with explicit assumptions and label unknowns.
  • Do not request secrets or personal data; use anonymized descriptions.

Outputs (deliverables)

Produce a Post-mortems & Retrospectives Pack in Markdown (in-chat; or as files if requested):
  1. Retro brief + agenda (purpose, attendees, roles, pre-reads, ground rules)
  2. Facts + timeline (what happened; impact; timestamps; links)
  3. Contributing factors + root cause hypotheses (systems lens; "why it made sense")
  4. Learnings + decisions (what changes; why; tradeoffs)
  5. Action tracker (owner, due date, success signal, follow-up date)
  6. Kill criteria / triggers (signals → committed action) for future work
  7. Learning dissemination plan (how to socialize + a recurring "Impact & Learnings" review)
  8. Risks / Open questions / Next steps (always)
Templates: references/TEMPLATES.md
Expanded guidance: references/WORKFLOW.md
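
As a rough illustration of the Pack's shape, the eight deliverables above can be laid out as a Markdown skeleton. This is a minimal sketch, not the template from references/TEMPLATES.md: the function name `render_pack_skeleton`, the `_TODO_` placeholder, and the sample title are assumptions made for the example.

```python
# Section names are copied from the deliverables list; everything else
# (function name, heading levels, placeholder text) is illustrative.
PACK_SECTIONS = [
    "Retro brief + agenda",
    "Facts + timeline",
    "Contributing factors + root cause hypotheses",
    "Learnings + decisions",
    "Action tracker",
    "Kill criteria / triggers",
    "Learning dissemination plan",
    "Risks / Open questions / Next steps",
]


def render_pack_skeleton(title: str) -> str:
    """Return an empty Markdown Pack with one numbered heading per deliverable."""
    lines = [f"# Post-mortems & Retrospectives Pack: {title}", ""]
    for number, section in enumerate(PACK_SECTIONS, start=1):
        lines.append(f"## {number}. {section}")
        lines.append("_TODO_")  # placeholder to be replaced with real content
        lines.append("")
    return "\n".join(lines)


skeleton = render_pack_skeleton("Payments API outage")  # hypothetical title
print(skeleton)
```

Keeping the section list in one place makes it harder to drop the final "Risks / Open questions / Next steps" section, which the Pack always requires.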

Workflow (7 steps)

1) Classify the review + set blameless ground rules

  • Inputs: request context; references/INTAKE.md.
  • Actions: Identify the review type (incident / project / OKR). Set a blameless norm ("fix systems, not people") and decide whether to reframe language as "retrospective" to signal learning. Confirm facilitator, scribe, and decision owner.
  • Outputs: Retro brief (draft) + attendee list + meeting invite outline.
  • Checks: Objective is explicit (learning + improvement). Roles are assigned.

2) Assemble facts and a shared timeline (separate facts from stories)

  • Inputs: artifacts (tickets, dashboards, logs, notes).
  • Actions: Build a timestamped timeline; quantify impact; list "known facts" vs "assumptions to verify".
  • Outputs: Facts + timeline section using references/TEMPLATES.md.
  • Checks: Timeline has timestamps and links/evidence where possible. Assumptions are labeled.
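
One way to keep facts and assumptions apart is to record each timeline entry with its evidence and derive the impact numbers from the timestamps. The sketch below is only an illustration: the `TimelineEntry` type, the `kind` tags, and all timestamps and descriptions are invented for the example and do not come from references/TEMPLATES.md.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    description: str
    kind: str = ""                  # hypothetical tag: "start", "detected", "resolved", or ""
    evidence: Optional[str] = None  # link or ticket reference; None marks an assumption


def summarize(entries: List[TimelineEntry]) -> Dict[str, object]:
    """Order the timeline, split facts from assumptions, and compute durations."""
    ordered = sorted(entries, key=lambda e: e.at)
    key_times = {e.kind: e.at for e in ordered if e.kind}
    return {
        "time_to_detect": key_times["detected"] - key_times["start"],
        "time_to_resolve": key_times["resolved"] - key_times["start"],
        "known_facts": [e for e in ordered if e.evidence is not None],
        "assumptions_to_verify": [e for e in ordered if e.evidence is None],
    }


# Invented sample data for a hypothetical incident.
entries = [
    TimelineEntry(datetime(2024, 1, 10, 14, 45), "Service restored", "resolved", "ticket"),
    TimelineEntry(datetime(2024, 1, 10, 14, 0), "Error rate starts rising", "start", "dashboard"),
    TimelineEntry(datetime(2024, 1, 10, 14, 20), "Config change suspected as trigger"),
    TimelineEntry(datetime(2024, 1, 10, 14, 12), "Alert pages on-call", "detected", "ticket"),
]

summary = summarize(entries)
print("Time to detect:", summary["time_to_detect"])
print("Time to resolve:", summary["time_to_resolve"])
for entry in summary["assumptions_to_verify"]:
    print("Assumption to verify:", entry.description)
```

Entries without evidence are not dropped; they are surfaced as "assumptions to verify" so the review can confirm or discard them.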

3) Diagnose contributing factors (systems lens)

  • Inputs: timeline + impact.
  • Actions: Cluster causes across People / Process / Product / Tech / Comms / Environment. Use a "make it reasonable" lens: what conditions made the outcome likely? Optionally run 5 Whys on the top 1–2 factors.
  • Outputs: Contributing factors map + root cause hypotheses.
  • Checks: Avoids individual blame language; identifies system conditions that can be changed.

4) Extract learnings and decide what to change

  • Inputs: contributing factors.
  • Actions: Write 3–7 crisp learnings ("we learned that…"). Convert learnings into decisions (fix, guardrail, instrumentation, runbook, training, scope change). Keep OKR/grade discussion secondary to "why" and "what changes next".
  • Outputs: Learnings + decisions section.
  • Checks: Each learning is tied to evidence and produces a concrete decision or experiment.

5) Build the action tracker (owners + dates + success signals)

  • Inputs: decisions.
  • Actions: Create action items with an owner, due date, and success signal. Add a follow-up review date (or a recurring review). Limit to what can realistically be executed; explicitly park "later ideas".
  • Outputs: Action tracker table + follow-up plan.
  • Checks: No orphan actions: every item has owner + date. Top actions address top factors.
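
The "no orphan actions" check can be automated if the tracker is kept as structured data. The following is a minimal sketch under that assumption; `ActionItem`, `find_orphans`, and the sample items are hypothetical, and the check is slightly stricter than the one above because it also flags a missing success signal.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    success_signal: Optional[str] = None
    follow_up: Optional[date] = None  # optional follow-up review date


def find_orphans(items: List[ActionItem]) -> List[Tuple[str, List[str]]]:
    """Return (title, missing fields) for every item lacking owner, due date, or success signal."""
    problems = []
    for item in items:
        required = (
            ("owner", item.owner),
            ("due", item.due),
            ("success_signal", item.success_signal),
        )
        missing = [name for name, value in required if not value]
        if missing:
            problems.append((item.title, missing))
    return problems


# Invented sample tracker.
tracker = [
    ActionItem(
        title="Add integration test for deploy script",
        owner="Platform team lead",
        due=date(2024, 2, 1),
        success_signal="Deploy script failures are caught in CI",
        follow_up=date(2024, 2, 15),
    ),
    ActionItem(
        title="Write deploy runbook",
        success_signal="On-call can follow the runbook unaided",
    ),
]

orphans = find_orphans(tracker)
for title, missing in orphans:
    print(f"Orphan action '{title}' is missing: {', '.join(missing)}")
```

Running a check like this before the review closes gives the facilitator a concrete list of items to assign or explicitly park.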

6) Add kill criteria / triggers (pre-commit to future action)

  • Inputs: learnings; "what would we do differently next time?"
  • Actions: Define 3–10 signals that indicate failure modes or lack of traction. For each signal, pre-commit to an action (pause, pivot, kill, escalate, add investment).
  • Outputs: Kill criteria / trigger list.
  • Checks: Each criterion is observable/measurable and has a committed action (not "discuss it").
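
To make "signal → committed action" concrete, each criterion can be written as a measurable condition paired with one pre-committed action. This is a sketch under stated assumptions: the `Trigger` type, the metric names, and the thresholds are invented for illustration; only the set of allowed actions comes from the step above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Actions named in the step above; "discuss it" is deliberately not one of them.
ALLOWED_ACTIONS = {"pause", "pivot", "kill", "escalate", "add investment"}


@dataclass
class Trigger:
    signal: str                               # human-readable description of the signal
    fired: Callable[[Dict[str, float]], bool]  # measurable condition over current metrics
    committed_action: str

    def __post_init__(self) -> None:
        if self.committed_action not in ALLOWED_ACTIONS:
            raise ValueError(f"Not a committed action: {self.committed_action!r}")


def evaluate(triggers: List[Trigger], metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (signal, committed action) for every trigger whose condition holds."""
    return [(t.signal, t.committed_action) for t in triggers if t.fired(metrics)]


# Hypothetical triggers and metric values.
triggers = [
    Trigger(
        signal="Weekly activation rate below 20%",
        fired=lambda m: m["weekly_activation_rate"] < 0.20,
        committed_action="pivot",
    ),
    Trigger(
        signal="p95 latency above 800 ms",
        fired=lambda m: m["p95_latency_ms"] > 800,
        committed_action="escalate",
    ),
]

metrics = {"weekly_activation_rate": 0.18, "p95_latency_ms": 420.0}
fired = evaluate(triggers, metrics)
for signal, action in fired:
    print(f"{signal} -> {action}")
```

Rejecting any action outside the allowed set at construction time mirrors the check that every criterion carries a committed action rather than a promise to talk about it later.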

7) Disseminate learning + quality gate + finalize

  • Inputs: full draft pack.
  • Actions: Create a 1-page shareout (TL;DR, top actions, decisions). Propose a lightweight weekly/biweekly "Impact & Learnings" review to socialize learnings beyond the team. Run references/CHECKLISTS.md and score with references/RUBRIC.md. Add Risks / Open questions / Next steps.
  • Outputs: Final Post-mortems & Retrospectives Pack.
  • Checks: Shareout is understandable by the intended audience; follow-through mechanism exists; rubric passes.

Quality gate (required)

  • Use references/CHECKLISTS.md and references/RUBRIC.md.
  • Always include: Risks, Open questions, Next steps.

Examples

Example 1 (incident postmortem): "We had a 45-minute outage in our payments API yesterday. Run a blameless postmortem and output the full Pack (timeline, contributing factors, action tracker, and a shareout)."
Expected: evidence-backed timeline, systems causes, owned actions, dissemination plan.
Example 2 (OKR retro): "We hit 0.8 on our Q4 activation OKR. Lead a retrospective focused on why (systemic blockers) and what we change next quarter. Output the full Pack and kill criteria for the next initiative."
Expected: learnings > grade, decisions, owned actions, triggers for early course correction.
Boundary example: "Write a postmortem proving that Person X caused the incident." Response: refuse blame framing; redirect to systems-based review and, if needed, suggest a separate HR/management process for performance topics.
Boundary example (neighbor redirect): "Our last three retros all surfaced the same 'deploy process is broken' theme. Fix the deploy process." Response: recurring themes across retros indicate a systemic engineering culture or process issue. Use engineering-culture to design the process improvement. This skill is for running the review itself, not implementing the changes it surfaces.

Anti-patterns

  1. Blame in disguise — Using blameless language ("the system failed") while structuring the timeline and contributing factors to point at a single person. Contributing factors must focus on system conditions, not individual actions.
  2. Action items without owners — Producing a list of "we should" recommendations with no owner, due date, or success signal. Every action must be owned, dated, and measurable.
  3. Shallow root cause — Stopping at the first "why" (e.g., "the deploy script failed") instead of investigating systemic conditions (e.g., "no integration test coverage for deploy scripts, no runbook, no alerting"). Use at least 3 levels of "why" on top contributing factors.
  4. Retro amnesia — Running retrospectives that produce insights but never feeding them into durable process changes. Every retro must include a dissemination plan and a follow-up review date.
  5. Grade over learning — Spending most of the retrospective debating the OKR score or incident severity instead of investigating systemic causes and deciding what changes. Keep grading secondary to "why" and "what changes next."