| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents

| Time | Event |
|---|---|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
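
The timeline above can be turned into response metrics with simple time arithmetic. The following is a minimal sketch, not part of the original postmortem: the class name `IncidentMetrics`, the helper `minutesBetween`, and the metric labels are hypothetical and chosen for illustration; the timestamps are taken from the table.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper (not from the original postmortem): derives
// response metrics from the "HH:mm" timestamps in the timeline table.
final class IncidentMetrics {

    // Minutes elapsed between two same-day "HH:mm" timestamps.
    static long minutesBetween(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMinutes();
    }

    public static void main(String[] args) {
        String deployed = "14:23";      // Deployment v2.3.4 completed
        String firstAlert = "14:31";    // First alert
        String acknowledged = "14:33";  // On-call acknowledges alert
        String resolved = "15:18";      // Service fully recovered

        System.out.println("Deploy -> first alert: "
                + minutesBetween(deployed, firstAlert) + " min");
        System.out.println("First alert -> acknowledged: "
                + minutesBetween(firstAlert, acknowledged) + " min");
        System.out.println("First alert -> resolved: "
                + minutesBetween(firstAlert, resolved) + " min");
        System.out.println("Deploy -> resolved: "
                + minutesBetween(deployed, resolved) + " min");
    }
}
```

For this incident the sketch reports 8 minutes from deployment to first alert, 2 minutes to acknowledgement, and 47 minutes from first alert to resolution. Recording these per incident gives the quarterly review comparable numbers to work with.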
PaymentRepository.java, DataSource, DriverManager.getConnection()
[Client] → [Load Balancer] → [Payment Service] → [Database]
↓
Connection Pool (broken)
↓
Direct connections (cause)
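
The diagram above can be illustrated with a small simulation, assuming the failure mode was request handlers opening direct connections that bypass the pool's limit. This is a sketch under that assumption, not the service's actual code: `FakeDatabase`, the connection limit of 100, the pool size of 20, and the request count of 150 are all hypothetical, and no real JDBC calls are made.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation: compares direct connections with a bounded pool.
final class ConnectionExhaustionSketch {

    // Stand-in for a database that refuses connections beyond a fixed limit.
    static final class FakeDatabase {
        private final int maxConnections;
        private final AtomicInteger open = new AtomicInteger();

        FakeDatabase(int maxConnections) {
            this.maxConnections = maxConnections;
        }

        void connect() {
            if (open.incrementAndGet() > maxConnections) {
                open.decrementAndGet();
                throw new IllegalStateException("too many connections");
            }
        }

        int openConnections() {
            return open.get();
        }
    }

    // Every in-flight request opens its own connection, with no upper bound.
    static int failuresWithDirectConnections(FakeDatabase db, int inFlightRequests) {
        int failures = 0;
        for (int i = 0; i < inFlightRequests; i++) {
            try {
                db.connect();
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        return failures;
    }

    // Requests need a pool permit first; those without one wait in the
    // application instead of reaching the database.
    static int failuresWithPool(FakeDatabase db, int poolSize, int inFlightRequests) {
        Semaphore permits = new Semaphore(poolSize);
        int failures = 0;
        for (int i = 0; i < inFlightRequests; i++) {
            if (!permits.tryAcquire()) {
                continue; // queued, never touches the database
            }
            try {
                db.connect();
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("Direct failures: "
                + failuresWithDirectConnections(new FakeDatabase(100), 150));
        System.out.println("Pooled failures: "
                + failuresWithPool(new FakeDatabase(100), 20, 150));
    }
}
```

With these illustrative numbers, the direct path is refused 50 times once the database limit is reached, while the pooled path never exceeds 20 open connections. The point of the pool is that back-pressure lands in the application, where it can be queued or timed out, rather than on the database.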
| Priority | Action | Owner | Due Date | Ticket |
|---|---|---|---|---|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
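
Two of the action items above, the 70% connection alert threshold and deployment-correlated alerting, reduce to small checks. The sketch below shows one possible shape for them; the class `ConnectionAlertSketch`, both method names, and the 15-minute correlation window are assumptions for illustration and do not describe any particular monitoring tool.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical alert rules sketched from the action items table.
final class ConnectionAlertSketch {

    // Alert once 70% of the available connections are in use (from OPS-567).
    static final int USAGE_THRESHOLD_PERCENT = 70;

    // Assumed window: alerts this soon after a deploy are flagged as related.
    static final Duration DEPLOY_WINDOW = Duration.ofMinutes(15);

    // Integer arithmetic avoids floating-point rounding at the boundary.
    static boolean shouldAlert(int activeConnections, int maxConnections) {
        return (long) activeConnections * 100
                >= (long) maxConnections * USAGE_THRESHOLD_PERCENT;
    }

    // True when the alert fired at or after the deploy and within the window.
    static boolean isDeploymentCorrelated(LocalTime deployedAt, LocalTime alertAt) {
        Duration gap = Duration.between(deployedAt, alertAt);
        return !gap.isNegative() && gap.compareTo(DEPLOY_WINDOW) <= 0;
    }

    public static void main(String[] args) {
        System.out.println("Alert at 70 of 100 connections: " + shouldAlert(70, 100));
        System.out.println("Alert at 69 of 100 connections: " + shouldAlert(69, 100));
        System.out.println("14:31 alert related to 14:23 deploy: "
                + isDeploymentCorrelated(LocalTime.of(14, 23), LocalTime.of(14, 31)));
    }
}
```

Applied to this incident, the 14:31 alert falls 8 minutes after the 14:23 deployment, so it would be flagged as deployment-correlated under the assumed window.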
| Root Cause | Improvement | Type |
|---|---|---|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |