root-cause-analysis

Root Cause Analysis

Overview

Root cause analysis (RCA) identifies the underlying reasons for failures, enabling permanent solutions rather than temporary fixes.

When to Use

Production incidents
Customer-impacting issues
Repeated problems
Unexpected failures
Performance degradation

Instructions

1. The 5 Whys Technique

yaml

Example: Website Down

Symptom: Website returned 503 Service Unavailable

Why 1: Why was website down?
  Answer: Database connection pool exhausted

Why 2: Why was connection pool exhausted?
  Answer: Queries taking too long, connections not released

Why 3: Why were queries slow?
  Answer: Missing index on frequently queried column

Why 4: Why was index missing?
  Answer: Performance testing didn't use production-like data volume

Why 5: Why wasn't production-like data used?
  Answer: Load testing environment doesn't mirror production

Root Cause: Load testing environment does not mirror production data volume

Solution: Update load testing environment with production-like data

Prevention: Establish environment parity requirements
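
The chain above can also be kept as data, which makes it easy to store next to an incident record. The following is a minimal Python sketch rather than a prescribed tool: the names `WhyStep` and `root_cause` are hypothetical, and treating the last answer as the root cause is an assumption that only holds when the chain was followed to the end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhyStep:
    """One question/answer pair in a 5 Whys chain (hypothetical helper type)."""
    question: str
    answer: str


def root_cause(chain: list[WhyStep]) -> str:
    """Return the answer to the last 'why', treated here as the root cause."""
    if not chain:
        raise ValueError("a 5 Whys chain needs at least one step")
    return chain[-1].answer


# Questions and answers copied from the "Website Down" example above.
website_down = [
    WhyStep("Why was website down?",
            "Database connection pool exhausted"),
    WhyStep("Why was connection pool exhausted?",
            "Queries taking too long, connections not released"),
    WhyStep("Why were queries slow?",
            "Missing index on frequently queried column"),
    WhyStep("Why was index missing?",
            "Performance testing didn't use production-like data volume"),
    WhyStep("Why wasn't production-like data used?",
            "Load testing environment doesn't mirror production"),
]

print(root_cause(website_down))
```

Keeping every intermediate answer, not only the last one, preserves the reasoning so reviewers can challenge any single step.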

2. Systematic RCA Process

yaml

Step 1: Gather Facts
  - When did issue occur?
  - Who detected it?
  - How many users affected?
  - What error messages?
  - What system changes deployed?
  - Check logs, metrics, alerts
  - Determine impact scope

Step 2: Reproduce
  - Can we reproduce consistently?
  - What are the exact steps?
  - What environment (prod, staging)?
  - Can we isolate to component?
  - Set up test case

Step 3: Identify Contributing Factors
  - Direct cause
  - Indirect/enabling factors
  - System vulnerabilities
  - Procedural gaps
  - Knowledge gaps

Step 4: Determine Root Cause
  - Use 5 Whys technique
  - Ask "why did this control fail?"
  - Look for systemic issues
  - Separate root cause from symptoms

Step 5: Develop Solutions
  - Immediate: Fix the symptom
  - Short-term: Prevent recurrence
  - Long-term: Systemic fix
  - Prioritize by impact/effort

Step 6: Implement & Verify
  - Implement solutions
  - Test in staging
  - Deploy carefully
  - Verify improvement
  - Monitor metrics

Step 7: Document & Share
  - Write RCA report
  - Document lessons learned
  - Share with team
  - Update procedures
  - Training if needed
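
Step 5 asks for solutions to be prioritized by impact/effort. One simple way to do that is to score each candidate and sort by the ratio, as in the sketch below. The 1 to 5 scales, the scores, and the names `Solution` and `prioritize` are all illustrative assumptions, not values from a real incident.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    name: str
    impact: int  # assumed scale: 1 (low) to 5 (high)
    effort: int  # assumed scale: 1 (low) to 5 (high)


def prioritize(solutions: list[Solution]) -> list[Solution]:
    """Order solutions by impact/effort ratio, highest ratio first."""
    return sorted(solutions, key=lambda s: s.impact / s.effort, reverse=True)


# Hypothetical scores for three candidate fixes.
candidates = [
    Solution("Add missing index", impact=5, effort=1),
    Solution("Rebuild load testing environment", impact=5, effort=5),
    Solution("Add query duration alerts", impact=3, effort=2),
]

for s in prioritize(candidates):
    print(f"{s.name}: ratio {s.impact / s.effort:.1f}")
```

A ratio is only a starting point for discussion: a high-effort systemic fix can still be the right long-term choice even when it sorts last.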

3. RCA Report Template

yaml

RCA Report:

Incident: Database connection failure (2024-01-15, 14:30-15:15)

Impact:
  - Duration: 45 minutes
  - Users affected: 5,000 (10% of user base)
  - Revenue lost: ~$2,000
  - Severity: P1 (Critical)

Timeline:
  14:30: Automated monitoring alert: High error rate (20%)
  14:32: On-call engineer notified
  14:35: Identified database connection error in logs
  14:40: Restarted database connection pool
  14:42: Service recovered, error rate returned to 0.1%
  14:50: Incident declared resolved
  15:15: Full recovery verified

Root Cause:
  Poorly optimized query introduced in release 2.5.0 caused
  queries to take 10x longer. Connection pool exhausted as
  connections weren't released quickly.

Contributing Factors:
  1. No query performance testing pre-deployment
  2. Load testing environment doesn't match production volume
  3. No alerting on query duration
  4. Connection pool timeout set too high

Solutions:
  Immediate (Done):
    - Rolled back the problematic query from release 2.5.0

  Short-term (1 week):
    - Add query performance alerts (>1s)
    - Add index for slow query
    - Set query timeout to 5 seconds

  Long-term (1 month):
    - Update load testing with production-like data
    - Implement performance benchmarks in CI/CD
    - Improve monitoring for connection pool health
    - Provide training on query optimization

Prevention:
  - Query performance regression tests
  - Load testing with production data
  - Connection pool metrics monitoring
  - Code review of database changes
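
The Impact figures in a report like this can be derived from the timeline instead of being typed by hand, which keeps the two sections consistent. The sketch below uses the timestamps from the template; the dictionary keys and the helper `minutes_between` are hypothetical names chosen for illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on 2024-01-15).
timeline = {
    "alert": "14:30",
    "engineer_notified": "14:32",
    "service_recovered": "14:42",
    "recovery_verified": "15:15",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_notify = minutes_between(timeline["alert"], timeline["engineer_notified"])
time_to_recover = minutes_between(timeline["alert"], timeline["service_recovered"])
incident_duration = minutes_between(timeline["alert"], timeline["recovery_verified"])

# 5,000 affected users is reported as 10% of the user base.
implied_user_base = round(5_000 / 0.10)

print(time_to_notify, time_to_recover, incident_duration, implied_user_base)
```

With these inputs the reported 45-minute duration corresponds to the span from the first alert to verified recovery, while the service itself recovered 12 minutes after the alert. Stating which of the two a report means avoids confusion later.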

4. Root Cause Analysis Techniques

yaml

Fishbone Diagram:

Main problem: Slow API Response

Branches:

  Code:
    - Inefficient algorithm
    - Missing cache
    - Unnecessary queries

  Data:
    - Large dataset
    - Missing index
    - Slow database

  Infrastructure:
    - Low CPU capacity
    - Slow network
    - Disk I/O bottleneck

  Process:
    - No monitoring
    - No load testing
    - Manual deployments

  People:
    - Lack of knowledge
    - Lack of tools
    - No peer review

---

Systemic vs. Individual Causes:

Individual: "Developer used inefficient code"
  Fix: Training
  Risk: Happens again with a different person

Systemic: "No code review process"
  Fix: Implement mandatory code review
  Benefit: Prevents similar issues regardless of who writes the code

Prefer systemic solutions for prevention
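
A fishbone diagram is a two-level tree, so it maps naturally onto a nested dictionary. The sketch below stores the diagram above and flattens it into a list of candidate causes that a team could confirm or rule out one at a time. The function name `candidate_causes` is hypothetical and the structure is only one possible representation.

```python
# Branches and causes copied from the fishbone diagram above.
fishbone = {
    "problem": "Slow API Response",
    "branches": {
        "Code": ["Inefficient algorithm", "Missing cache", "Unnecessary queries"],
        "Data": ["Large dataset", "Missing index", "Slow database"],
        "Infrastructure": ["Low CPU capacity", "Slow network", "Disk I/O bottleneck"],
        "Process": ["No monitoring", "No load testing", "Manual deployments"],
        "People": ["Lack of knowledge", "Lack of tools", "No peer review"],
    },
}


def candidate_causes(diagram: dict) -> list[tuple[str, str]]:
    """Flatten the diagram into (branch, cause) pairs, in diagram order."""
    return [
        (branch, cause)
        for branch, causes in diagram["branches"].items()
        for cause in causes
    ]


for branch, cause in candidate_causes(fishbone):
    print(f"[{branch}] {cause}")
```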

5. Follow-Up & Prevention

yaml

After RCA:

1. Track Action Items
  - Assign owner
  - Set deadline
  - Follow up in retrospective

2. Prevent Recurrence
  - Automated tests
  - Monitoring/alerts
  - Procedural changes
  - Training

3. Monitor Metrics
  - Track similar incidents
  - Verify fix effectiveness
  - Monitor preventive measures
  - Catch early warnings

4. Share Learnings
  - Document incident
  - Share with team
  - Industry sharing if relevant
  - Update procedures

---

Checklist:

[ ] Incident details documented
[ ] Timeline established
[ ] Logs reviewed
[ ] Metrics analyzed
[ ] Root cause identified (via 5 Whys)
[ ] Contributing factors listed
[ ] Immediate actions completed
[ ] Short-term solutions planned
[ ] Long-term solutions identified
[ ] Solutions prioritized
[ ] RCA report written
[ ] Team debriefing scheduled
[ ] Action items assigned
[ ] Prevention measures planned
[ ] Follow-up scheduled
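
Tracking action items, with an owner and a deadline for each, is easy to automate so that follow-up does not depend on memory. The following is a minimal sketch under stated assumptions: the `ActionItem` type, the `overdue` helper, the owner names, and the dates are all hypothetical examples.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose deadline has passed as of `today`."""
    return [item for item in items if not item.done and item.deadline < today]


# Hypothetical action items with made-up owners and deadlines.
items = [
    ActionItem("Add query duration alerts", "alice", date(2024, 1, 22)),
    ActionItem("Add index for slow query", "bob", date(2024, 1, 22), done=True),
    ActionItem("Refresh load testing data", "carol", date(2024, 2, 15)),
]

for item in overdue(items, today=date(2024, 2, 1)):
    print(f"OVERDUE: {item.description} (owner: {item.owner})")
```

Running a check like this before each retrospective gives the team a concrete list to follow up on.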

Key Points

Distinguish symptom from root cause
Use 5 Whys technique systematically
Look for systemic issues, not individual blame
Focus on prevention, not just fixing
Document thoroughly for team learning
Assign clear ownership for solutions
Follow up to verify effectiveness
Use RCA to drive improvements
