root-cause-analysis
Root Cause Analysis
Overview
Root cause analysis (RCA) identifies underlying reasons for failures, enabling permanent solutions rather than temporary fixes.
When to Use
- Production incidents
- Customer-impacting issues
- Repeated problems
- Unexpected failures
- Performance degradation
Instructions
1. The 5 Whys Technique
yaml
Example: Website Down
Symptom: Website returned 503 Service Unavailable
Why 1: Why was the website down?
  Answer: Database connection pool exhausted
Why 2: Why was the connection pool exhausted?
  Answer: Queries took too long, so connections were not released
Why 3: Why were the queries slow?
  Answer: Missing index on a frequently queried column
Why 4: Why was the index missing?
  Answer: Performance testing didn't use production-like data volume
Why 5: Why wasn't production-like data used?
  Answer: Load testing environment doesn't mirror production
Root Cause: Load testing environment does not mirror production data volume
Solution: Update load testing environment with production-like data
Prevention: Establish environment parity requirements
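The chain above can also be kept as data, which makes it easy to attach to an incident record or render into a report. The following is a minimal sketch in Python, not a prescribed tool: the `Why` class and the `root_cause_candidate` and `render` helpers are hypothetical names introduced here for illustration. It treats the last answer in the chain as the root cause candidate, which still needs human confirmation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    """One question/answer step in a 5 Whys chain (hypothetical helper type)."""
    question: str
    answer: str


def root_cause_candidate(chain):
    """Return the last answer in the chain, the deepest cause reached so far."""
    if not chain:
        raise ValueError("a 5 Whys chain needs at least one question")
    return chain[-1].answer


def render(symptom, chain):
    """Format the chain as numbered Why/Answer lines under the symptom."""
    lines = [f"Symptom: {symptom}"]
    for number, why in enumerate(chain, start=1):
        lines.append(f"Why {number}: {why.question}")
        lines.append(f"  Answer: {why.answer}")
    lines.append(f"Root cause candidate: {root_cause_candidate(chain)}")
    return "\n".join(lines)


# The website-down example from this section, expressed as data.
chain = [
    Why("Why was the website down?",
        "Database connection pool exhausted"),
    Why("Why was the connection pool exhausted?",
        "Queries took too long, so connections were not released"),
    Why("Why were the queries slow?",
        "Missing index on a frequently queried column"),
    Why("Why was the index missing?",
        "Performance testing didn't use production-like data volume"),
    Why("Why wasn't production-like data used?",
        "Load testing environment doesn't mirror production"),
]

print(render("Website returned 503 Service Unavailable", chain))
```

Five questions is a convention, not a limit. The sketch accepts a chain of any length, so the analysis can stop earlier or go deeper when the answers call for it.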
2. Systematic RCA Process
yaml
Step 1: Gather Facts
  - When did the issue occur?
  - Who detected it?
  - How many users were affected?
  - What error messages appeared?
  - What system changes were deployed?
  - Check logs, metrics, alerts
  - Determine impact scope
Step 2: Reproduce
  - Can we reproduce it consistently?
  - What are the exact steps?
  - What environment (prod, staging)?
  - Can we isolate it to one component?
  - Set up a test case
Step 3: Identify Contributing Factors
  - Direct cause
  - Indirect/enabling factors
  - System vulnerabilities
  - Procedural gaps
  - Knowledge gaps
Step 4: Determine Root Cause
  - Use the 5 Whys technique
  - Ask "why did this control fail?"
  - Look for systemic issues
  - Separate root cause from symptoms
Step 5: Develop Solutions
  - Immediate: Fix the symptom
  - Short-term: Prevent recurrence
  - Long-term: Systemic fix
  - Prioritize by impact/effort
Step 6: Implement & Verify
  - Implement solutions
  - Test in staging
  - Deploy carefully
  - Verify improvement
  - Monitor metrics
Step 7: Document & Share
  - Write RCA report
  - Document lessons learned
  - Share with team
  - Update procedures
  - Provide training if needed
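Step 5 asks for solutions to be prioritized by impact/effort without saying how. One simple way, sketched below in Python, is to score each candidate on a small scale and sort by the ratio. The 1 to 5 scales, the `Solution` class, and the scores assigned to each candidate are assumptions made for illustration; a team would substitute its own estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """A candidate fix with rough estimates (hypothetical 1-5 scales)."""
    name: str
    impact: int  # 1 (low) to 5 (high)
    effort: int  # 1 (low) to 5 (high); must be at least 1

    def score(self):
        """Impact delivered per unit of effort."""
        return self.impact / self.effort


def prioritize(solutions):
    """Sort by impact/effort ratio, highest first; ties go to lower effort."""
    return sorted(solutions, key=lambda s: (-s.score(), s.effort))


# Illustrative estimates only; the numbers are not from a real incident.
candidates = [
    Solution("Mirror production data volume in load tests", impact=5, effort=5),
    Solution("Add index for the slow query", impact=4, effort=1),
    Solution("Alert on query duration", impact=3, effort=2),
]

for solution in prioritize(candidates):
    print(f"{solution.score():.2f}  {solution.name}")
```

A ratio favors quick wins, which suits the immediate and short-term horizons. Long-term systemic fixes often score lower on this measure and may need to be scheduled deliberately rather than left to the ranking.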
3. RCA Report Template
yaml
RCA Report:
  Incident: Database connection failure (2024-01-15, 14:30-15:15)
  Impact:
    - Duration: 45 minutes
    - Users affected: 5,000 (10% of user base)
    - Revenue lost: ~$2,000
    - Severity: P1 (Critical)
  Timeline:
    14:30: Automated monitoring alert: High error rate (20%)
    14:32: On-call engineer notified
    14:35: Identified database connection error in logs
    14:40: Restarted database connection pool
    14:42: Service recovered, error rate returned to 0.1%
    14:50: Incident declared resolved
    15:15: Full recovery verified
  Root Cause:
    Poorly optimized query introduced in release 2.5.0 caused
    queries to take 10x longer. Connection pool exhausted as
    connections weren't released quickly.
  Contributing Factors:
    1. No query performance testing pre-deployment
    2. Load testing environment doesn't match production volume
    3. No alerting on query duration
    4. Connection pool timeout set too high
  Solutions:
    Immediate (Done):
      - Rolled back the problematic query introduced in release 2.5.0
    Short-term (1 week):
      - Add query performance alerts (>1s)
      - Add index for slow query
      - Set query timeout to 5 seconds
    Long-term (1 month):
      - Update load testing with production-like data
      - Implement performance benchmarks in CI/CD
      - Improve monitoring for connection pool health
      - Provide training on query optimization
  Prevention:
    - Query performance regression tests
    - Load testing with production data
    - Connection pool metrics monitoring
    - Code review of database changes
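The timeline in a report like this one can be checked against the stated impact figures. The sketch below, in Python, derives a few intervals from the timestamps in the example report. The metric names (`time_to_notify`, `time_to_recover`, `total_duration`) and the `minutes_between` helper are illustrative choices, and the code assumes all events fall on the same day.

```python
from datetime import datetime

# Selected timeline entries from the example report (all on 2024-01-15).
timeline = {
    "alert": "14:30",
    "notified": "14:32",
    "service_recovered": "14:42",
    "recovery_verified": "15:15",
}


def minutes_between(start, end):
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_notify = minutes_between(timeline["alert"], timeline["notified"])
time_to_recover = minutes_between(timeline["alert"], timeline["service_recovered"])
total_duration = minutes_between(timeline["alert"], timeline["recovery_verified"])

print(f"Time to notify on-call: {time_to_notify} min")
print(f"Time to service recovery: {time_to_recover} min")
print(f"Total incident duration: {total_duration} min")
```

With these timestamps the total comes to 45 minutes, which matches the duration listed under Impact. A mismatch between the two would be a sign that the timeline or the impact summary needs correcting before the report is shared.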
4. Root Cause Analysis Techniques
yaml
Fishbone Diagram:
  Main problem: Slow API Response
  Branches:
    Code:
      - Inefficient algorithm
      - Missing cache
      - Unnecessary queries
    Data:
      - Large dataset
      - Missing index
      - Slow database
    Infrastructure:
      - Low CPU capacity
      - Slow network
      - Disk I/O bottleneck
    Process:
      - No monitoring
      - No load testing
      - Manual deployments
    People:
      - Lack of knowledge
      - Lack of tools
      - No peer review
---
Systemic vs. Individual Causes:
  Individual: "Developer used inefficient code"
    Fix: Training
    Risk: Happens again with a different person
  Systemic: "No code review process"
    Fix: Implement mandatory code review
    Benefit: Prevents similar issues
  Prefer systemic solutions for prevention
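A fishbone diagram is a two-level grouping of candidate causes, so it maps naturally onto a nested dictionary. The Python sketch below stores the example diagram that way and flattens it into (category, cause) pairs that can each be confirmed or ruled out with evidence. The `fishbone` structure and the `candidate_causes` and `render` helpers are hypothetical names used only for this illustration.

```python
# The example diagram as data: one problem, causes grouped by category.
fishbone = {
    "problem": "Slow API Response",
    "branches": {
        "Code": ["Inefficient algorithm", "Missing cache", "Unnecessary queries"],
        "Data": ["Large dataset", "Missing index", "Slow database"],
        "Infrastructure": ["Low CPU capacity", "Slow network", "Disk I/O bottleneck"],
        "Process": ["No monitoring", "No load testing", "Manual deployments"],
        "People": ["Lack of knowledge", "Lack of tools", "No peer review"],
    },
}


def candidate_causes(diagram):
    """Flatten the diagram into (category, cause) pairs for follow-up."""
    return [
        (category, cause)
        for category, causes in diagram["branches"].items()
        for cause in causes
    ]


def render(diagram):
    """Format the diagram as an indented text outline."""
    lines = [f"Problem: {diagram['problem']}"]
    for category, causes in diagram["branches"].items():
        lines.append(f"{category}:")
        lines.extend(f"  - {cause}" for cause in causes)
    return "\n".join(lines)


print(render(fishbone))
print(f"{len(candidate_causes(fishbone))} candidate causes to investigate")
```

Listing every candidate explicitly helps keep the investigation from stopping at the first plausible cause, and the Process and People branches are where systemic causes tend to show up.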
5. Follow-Up & Prevention
yaml
After RCA:
  1. Track Action Items
    - Assign owner
    - Set deadline
    - Follow up in retrospective
  2. Prevent Recurrence
    - Automated tests
    - Monitoring/alerts
    - Procedural changes
    - Training
  3. Monitor Metrics
    - Track similar incidents
    - Verify fix effectiveness
    - Monitor preventive measures
    - Catch early warnings
  4. Share Learnings
    - Document incident
    - Share with team
    - Industry sharing if relevant
    - Update procedures
---
Checklist:
  [ ] Incident details documented
  [ ] Timeline established
  [ ] Logs reviewed
  [ ] Metrics analyzed
  [ ] Root cause identified (via 5 Whys)
  [ ] Contributing factors listed
  [ ] Immediate actions completed
  [ ] Short-term solutions planned
  [ ] Long-term solutions identified
  [ ] Solutions prioritized
  [ ] RCA report written
  [ ] Team debriefing scheduled
  [ ] Action items assigned
  [ ] Prevention measures planned
  [ ] Follow-up scheduled
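Tracking action items comes down to three fields per item (owner, deadline, status) and a periodic check for anything that has slipped. Below is a minimal Python sketch under those assumptions. The `ActionItem` class, the `overdue` helper, and the owner names are hypothetical, and the deadlines are simply the example incident date (2024-01-15) plus the one-week and one-month horizons from the report template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A follow-up task from an RCA (hypothetical helper type)."""
    description: str
    owner: str
    deadline: date
    done: bool = False


def overdue(items, today):
    """Return open items whose deadline is strictly before `today`."""
    return [item for item in items if not item.done and item.deadline < today]


# Owner names are placeholders, not real people.
items = [
    ActionItem("Add query duration alert", "alice", date(2024, 1, 22), done=True),
    ActionItem("Add index for slow query", "bob", date(2024, 1, 22)),
    ActionItem("Load test with production-like data", "carol", date(2024, 2, 15)),
]

late = overdue(items, today=date(2024, 1, 29))
for item in late:
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.deadline})")
```

Running a check like this before each retrospective gives the follow-up step something concrete to review, instead of relying on people remembering what was promised.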
Key Points
- Distinguish symptom from root cause
- Use 5 Whys technique systematically
- Look for systemic issues, not individual blame
- Focus on prevention, not just fixing
- Document thoroughly for team learning
- Assign clear ownership for solutions
- Follow up to verify effectiveness
- Use RCA to drive improvements