rca-report

RCA Report

Produce post-mortems that are reproducible, layered, and operationally useful — not just narrative. A good RCA lets a future engineer (or future you) understand the incident, verify the fix held, and avoid repeating it. This skill covers both the investigation flow (what to gather while the incident is fresh) and the report itself.

When this skill applies

  • A production incident, outage, or near-miss has occurred and needs documenting
  • A pipeline, service, or system silently failed and the user just resolved it
  • The user wants a post-mortem, RCA, or incident write-up
  • Mid-investigation: the user is debugging something and wants help structuring evidence as it comes in
If the incident is still actively burning and the user just wants help fixing it, skip this skill — fix first, document after.

Output location

Save the report to `<topic>-rca-<YYYY-MM-DD>.md` in the current working directory, where:
  • `<topic>` is a short kebab-case identifier of the failing system (e.g. `debezium`, `auth-service`, `kafka-consumer-lag`)
  • `<YYYY-MM-DD>` is the date the incident occurred, not necessarily today
Examples: `debezium-rca-2026-05-05.md`, `auth-500s-rca-2026-04-12.md`
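The naming rule above can be sketched in a few lines of Python. This is an illustration only: the helper name `rca_filename` and the slug normalization (lowercase, runs of other characters collapsed to hyphens) are assumptions, not part of the skill.

```python
import re
from datetime import date


def rca_filename(topic: str, incident_date: date) -> str:
    """Build '<topic>-rca-<YYYY-MM-DD>.md' from a topic and the incident date."""
    # Assumed normalization: lowercase, then collapse anything that is not
    # a-z or 0-9 into a single hyphen so the topic ends up kebab-case.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    # Use the date the incident occurred, not the date the report is written.
    return f"{slug}-rca-{incident_date.isoformat()}.md"


print(rca_filename("debezium", date(2026, 5, 5)))  # debezium-rca-2026-05-05.md
```

Passing the incident date explicitly, rather than defaulting to today, keeps the "when it occurred" rule hard to get wrong.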

The two-phase workflow

Phase 1 — Investigation (gather evidence)

Before writing anything, walk through `references/investigation-checklist.md` with the user. The goal is to lock in concrete, reproducible facts (timestamps, version numbers, exact LSNs/IDs/error strings, command outputs) while the system state is still observable. Memory degrades fast; logs rotate; replication slots advance. Capture now, write later.
Do not skip this phase even if the user says "I already fixed it". Fixed-state evidence (the healthy `confirmed_flush_lsn` advancing, the test row flowing through Kafka, the new container log showing "streaming from latest xlogpos") is what proves the resolution actually held. That proof is what separates a real RCA from a story.
If the user already has notes, transcripts, or scrollback from the live incident, mine those first before asking questions. Don't make them re-type what's already in the conversation.
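One way to keep captured facts reproducible is to store each one with a timezone-aware timestamp and render the timeline in UTC. The sketch below is a minimal illustration: the `Evidence` record and `timeline` helper are hypothetical names, not something the checklist defines.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Evidence:
    observed_at: datetime  # must be timezone-aware
    source: str            # e.g. a log file, a SQL query, a dashboard
    detail: str            # the exact value or error string observed


def timeline(items):
    """Return one line per evidence item, sorted by time and rendered in UTC."""
    for item in items:
        if item.observed_at.tzinfo is None:
            # A naive timestamp cannot be placed on a UTC timeline reliably.
            raise ValueError(f"naive timestamp in evidence from {item.source}")
    ordered = sorted(items, key=lambda e: e.observed_at)
    return [
        f"{e.observed_at.astimezone(timezone.utc):%Y-%m-%d %H:%M:%S} UTC"
        f" | {e.source} | {e.detail}"
        for e in ordered
    ]
```

Rejecting naive timestamps up front is a deliberate choice here: a local time copied from scrollback with no offset is exactly the kind of ambiguity that makes a timeline unverifiable later.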

Phase 2 — Write the report

Use `templates/rca-report.md` as the structural skeleton. Fill it section by section using the evidence from Phase 1. Then validate against `references/quality-rubric.md` before declaring done.

What makes an RCA actually good

The Debezium RCA that this skill is modeled on worked because it had:
  1. A timeline with UTC timestamps for every observable event — "the connector was wedged for ~18h" is narrative; "2026-05-04 09:54:16 Postgres terminated replication connection" is evidence. Always prefer the precise version.
  2. An infrastructure table that fully identifies the system — versions, hostnames, zones, connector names, topic names, slot names. Someone reading this six months later should be able to find the exact resources without ambiguity.
  3. Quantified impact across user, system, data, and SLA dimensions — vague impact ("some customers were affected") is worthless for severity calibration. State user-visible effect, internal system degradation, data integrity status, and SLA / financial cost as concrete numbers. If a number is unknown, say so explicitly rather than skipping the dimension.
  4. Layered root cause analysis — not just what broke, but:
    • The primary cause (the trigger event)
    • Why it appeared healthy (what masked the failure — this is usually the most valuable part)
    • Secondary mechanisms (gates, retries, internal state that contributed)
  5. State snapshots with actual values: the contrast between expected state and observed state is what makes the diagnosis click. `confirmed_flush_lsn = 1/AD5B16C0` (pre-restore stale value) next to `pg_current_wal_lsn = 1/ADC4B740` (current) tells the whole story in two lines. Capture similar contrasts for whatever domain you're in (queue depth, error rate, version mismatch, schema drift).
  6. Workaround / temporary mitigation captured separately from the resolution — the fast, low-risk action that stopped the bleeding before the root cause was fully understood. Workarounds and resolutions answer different questions: workaround = what does on-call do at 3am next time this fires; resolution = what permanently closes the case. Document the workaround's effect, its risks, and the trigger condition for applying it.
  7. Resolution with ordering rationale — not just "I ran these commands", but why this order. If step 4 must come after step 3 because of in-memory state, say so. The next person hitting this will try the obvious order first and fail; document why obvious-order doesn't work.
  8. A Five Whys chain that lands on a systemic gap — the chain is only useful if it stops at a missing guardrail (alert / review / test / knowledge), not at the technical trigger. Each "why" should narrow on a different mechanism — synonyms across adjacent steps mean you're padding. The final answer should map directly to a Recommendation below it.
  9. A "What did NOT work" section — capture the dead ends. Future-you will be tempted to try the same thing. The Debezium RCA's "drop slot + recreate connector without offset reset" entry is gold — it's the most intuitive fix and it silently fails.
  10. Diagnostic commands as a copy-paste block — the next incident in this domain will reuse 80% of these. Make them runnable, not pseudocode.
  11. Verification evidence — proof the fix held. Test data flowing end-to-end. Slot lag stabilizing. Error rate returning to baseline. With actual values from the post-fix state.
  12. Recommendations binned by urgency — Immediate (alerting/monitoring), Process (runbooks, comms), Configuration (settings changes). Bins force the user to think about timeline, not just "stuff to do".
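The state-snapshot contrast in item 5 can be turned into a number. A PostgreSQL LSN is written as two hexadecimal halves (high 32 bits, then low 32 bits), so the gap between two LSNs is a plain subtraction once both are converted to byte positions. The helper names below are illustrative, and the sketch mirrors what `pg_wal_lsn_diff` computes server-side rather than replacing it.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '1/AD5B16C0' to an absolute byte position."""
    high, low = lsn.split("/")
    # High half is the upper 32 bits, low half is the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(current: str, confirmed: str) -> int:
    """Bytes of WAL between the confirmed position and the current position."""
    return lsn_to_bytes(current) - lsn_to_bytes(confirmed)


# Values taken from the snapshot example above.
gap = lsn_gap_bytes("1/ADC4B740", "1/AD5B16C0")
print(gap)                           # 6922368
print(f"{gap / (1024 * 1024):.1f}")  # 6.6 (MiB)
```

Recording the computed gap next to the two raw LSNs lets a reader check the arithmetic without access to the database.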

Anti-patterns to avoid

  • Vague timelines: "The next morning we noticed..." — when? What did "noticed" actually mean? Who saw what?
  • Single-layer root cause: stopping at the trigger event without explaining the masking mechanism. If the system appeared healthy, that masking is the root cause for the duration of the outage.
  • Resolution without rationale: a list of commands with no explanation of why this order or why this approach. That's a runbook, not an RCA.
  • Hand-wavy recommendations: "improve monitoring" is not actionable. "Alert when `pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) > 100MB` for 5+ minutes" is.
  • Skipping the failed approaches: every dead end you don't document is a trap for the next person.
  • No verification: closing the report without proof the fix actually worked. This is how RCAs become folklore.
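To show what makes the lag alert above actionable, here is a minimal sketch of the "above threshold for 5+ minutes" condition evaluated over periodic lag samples. Everything here is an assumption for illustration: the function name `should_alert`, the sample format, and reading "100MB" as 100 MiB. A real deployment would normally express this in the alerting system's own rule language.

```python
from datetime import datetime, timedelta, timezone

THRESHOLD_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB
HOLD = timedelta(minutes=5)


def should_alert(samples, threshold=THRESHOLD_BYTES, hold=HOLD) -> bool:
    """samples: list of (timestamp, lag_bytes) tuples, oldest first.

    True when lag has stayed above the threshold without interruption
    for at least `hold`, measured up to the most recent sample.
    """
    breach_start = None
    for ts, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current breach
        else:
            breach_start = None    # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold
```

The hold window is what separates a brief spike from a wedged consumer, which is why the recommendation names both a threshold and a duration.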

Workflow

  1. Confirm the incident is resolved or contained. If still actively firing, defer this skill.
  2. Mine existing context first. Conversation transcripts, scrollback, prior notes — extract everything you can before asking the user questions.
  3. Walk the investigation checklist (`references/investigation-checklist.md`). Fill gaps by asking targeted questions or running diagnostic commands. Do this even for "small" incidents; the structure forces depth.
  4. Capture state snapshots. If the system is reachable, grab the current healthy state values now (slot lag, queue depth, error rate, etc.) — these go in the verification section. Lose them and you can't prove the fix.
  5. Draft the report using `templates/rca-report.md`. Fill every section; if a section truly doesn't apply, write "N/A — [reason]" rather than deleting it.
  6. Validate against the quality rubric (`references/quality-rubric.md`). Fix any rubric failures before presenting.
  7. Save to `<topic>-rca-<YYYY-MM-DD>.md` in CWD.
  8. Show the user the file path and offer to walk through any section. Don't dump the full report into chat unless asked — they'll read the file.
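The "fill every section" rule in step 5 lends itself to a quick mechanical check before the rubric pass. The sketch below assumes the template marks sections with level-two markdown headings (`## `); that layout is a guess about `templates/rca-report.md`, and the function name `unfilled_sections` is hypothetical.

```python
import re


def unfilled_sections(report_text: str):
    """Return headings whose body is empty or a bare 'N/A' with no reason."""
    # Split on level-two headings; text before the first heading is ignored.
    sections = re.split(r"^## +", report_text, flags=re.MULTILINE)[1:]
    missing = []
    for section in sections:
        heading, _, body = section.partition("\n")
        body = body.strip()
        # Flag empty bodies and an 'N/A' that is not followed by a reason.
        if not body or re.fullmatch(r"N/A\W*", body):
            missing.append(heading.strip())
    return missing
```

This only catches structural gaps. Whether a filled section is actually good is still a question for the quality rubric.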

Tone

Match the operator's voice: technical, concise, evidence-led. Lead each section with the answer, then the reasoning. No corporate hedging ("there may have been some impact") — state what happened. No blame language — focus on system gaps, not individuals. The Debezium RCA is the reference; mirror its directness.