bug-root-cause-finder

Bug Root Cause Finder

Systematic methods for finding the true source of bugs, not just symptoms.

Core Principle: Symptom ≠ Cause

The location where an error manifests is rarely where the bug originates.
Error Location:     OrderController::show() - NullPointerException
Symptom Location:   OrderRepository::find() - returns null
Root Cause:         OrderCreatedHandler - failed to persist order
True Root Cause:    RabbitMQ message lost due to missing ACK

Method 1: 5 Whys Technique

Ask "why" repeatedly until you reach the root cause.

Example Analysis

Bug: "Customer sees wrong order total"
  1. Why? → The total displayed is $0
  2. Why? → Order::getTotal() returns 0
  3. Why? → OrderItem collection is empty
  4. Why? → Items weren't loaded from database
  5. Why? → Lazy loading failed due to closed EntityManager
Root Cause: EntityManager closed before accessing lazy-loaded collection
Fix: Eager load items in repository query, not lazy load

5 Whys Template

5 Whys Analysis

Bug Description: [What user sees]
  1. Why does [symptom] occur? → [First-level cause]
  2. Why does [first-level cause] happen? → [Second-level cause]
  3. Why does [second-level cause] happen? → [Third-level cause]
  4. Why does [third-level cause] happen? → [Fourth-level cause]
  5. Why does [fourth-level cause] happen? → [ROOT CAUSE]
Fix Location: [File:line where fix should be applied]
Fix Type: [Category: logic/null/boundary/race/resource/exception/type/sql/infinite]

Method 2: Fault Tree Analysis

Build a tree of all possible causes for a failure.

Fault Tree Structure

[FAILURE: Order total is $0]
        ├── [OR] Order has no items
        │       ├── [OR] Items not added during creation
        │       │       ├── Cart was empty
        │       │       └── Cart-to-Order mapping failed
        │       │
        │       └── [OR] Items were deleted
        │               ├── Cascade delete triggered
        │               └── Manual deletion bug
        └── [OR] Items exist but total calculation wrong
                ├── Price is 0
                │       ├── Product price not set
                │       └── Currency conversion failed
                └── Quantity is 0
                        ├── Validation missing
                        └── Type coercion (string "0")

Fault Tree Investigation Order

  1. Start with most likely branches (based on code review)
  2. Add logging/debugging to verify each branch
  3. Eliminate branches systematically
  4. Focus on remaining possibilities
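
Steps 2 and 3 can be sketched as a small shell helper that runs one evidence check per branch and reports whether the branch is still open. This is only an illustration under stated assumptions: `check_branch`, the temporary log file, and its `key=value` fields are hypothetical; in a real investigation each check would be whatever command produces evidence for that branch.

```shell
#!/bin/sh
# check_branch NAME CMD...: run CMD as the evidence check for one fault-tree
# branch. Exit status 0 means the evidence supports the branch, so it stays open.
check_branch() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "open: $name"
  else
    echo "eliminated: $name"
  fi
}

# Hypothetical evidence: a log excerpt captured while reproducing the bug.
log=$(mktemp)
cat > "$log" <<'EOF'
order=42 cart_items=2
order=42 mapped_items=0
order=42 price=19.99 quantity=1
EOF

check_branch "Cart was empty"               grep -q 'cart_items=0' "$log"
check_branch "Cart-to-Order mapping failed" grep -q 'mapped_items=0' "$log"
check_branch "Price is 0"                   grep -q 'price=0 ' "$log"
check_branch "Quantity is 0"                grep -q 'quantity=0$' "$log"
rm -f "$log"
```

With this sample log only the "Cart-to-Order mapping failed" branch stays open, which is where the investigation would continue.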

Method 3: Git Bisect

Find the exact commit that introduced a bug.

Git Bisect Steps

# 1. Start bisect
git bisect start

# 2. Mark current (broken) as bad
git bisect bad

# 3. Mark known good commit (e.g., last release)
git bisect good v2.3.0

# 4. Git checks out middle commit - test it
# Run your reproduction test
php artisan test --filter=OrderTotalTest

# 5. Mark result
git bisect good  # if test passes
git bisect bad   # if test fails

# 6. Repeat until Git finds the culprit commit
# Git will output: "abc123 is the first bad commit"

# 7. Examine the commit
git show abc123

# 8. End bisect
git bisect reset

Automated Git Bisect

# Run automatically with test script
git bisect start HEAD v2.3.0
git bisect run php artisan test --filter=OrderTotalTest

Git Bisect Tips

  • Choose good boundaries: Bad = current, Good = last known working
  • Use automated testing: Faster and more reliable
  • Check for flaky tests: Bisect fails with inconsistent tests
  • Look at the diff: Focus on changed lines in culprit commit
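
The tips on automated testing and flaky tests can be combined in a wrapper script for `git bisect run`. `git bisect run` treats exit status 0 as good, 125 as "skip this commit", and any other status from 1 to 127 as bad. The sketch below reruns the test command several times and only reports good or bad when every run agrees. The function name `bisect_verdict` and the choice to skip on mixed results are assumptions made for illustration, not part of Git.

```shell
#!/bin/sh
# bisect_verdict RUNS CMD...: run CMD RUNS times and map the outcome to the
# exit codes that `git bisect run` understands.
bisect_verdict() {
  runs=$1
  shift
  passes=0
  fails=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    if "$@" >/dev/null 2>&1; then
      passes=$((passes + 1))
    else
      fails=$((fails + 1))
    fi
    i=$((i + 1))
  done
  if [ "$fails" -eq 0 ]; then
    return 0      # every run passed: commit is good
  elif [ "$passes" -eq 0 ]; then
    return 1      # every run failed: commit is bad
  else
    return 125    # mixed results: ask git bisect to skip this commit
  fi
}
```

One possible use: save the function in an untracked script (say, a hypothetical `bisect-check.sh`) whose last line is `bisect_verdict 3 php artisan test --filter=OrderTotalTest`, then start it with `git bisect run sh bisect-check.sh`. If many commits end up skipped, the test itself likely needs stabilizing before bisect can give a trustworthy answer.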

Method 4: Stack Trace Parsing

Extract actionable information from error traces.

PHP Stack Trace Structure

Fatal error: Uncaught TypeError: OrderService::calculateTotal():
Argument #1 ($items) must be of type array, null given,
called in /app/src/Application/UseCase/CreateOrderUseCase.php on line 45

Stack trace:
#0 /app/src/Application/UseCase/CreateOrderUseCase.php(45): OrderService->calculateTotal(NULL)
#1 /app/src/Presentation/Api/OrderController.php(32): CreateOrderUseCase->execute(Object(CreateOrderCommand))
#2 /app/vendor/symfony/http-kernel/HttpKernel.php(163): OrderController->create(Object(Request))
#3 /app/vendor/symfony/http-kernel/HttpKernel.php(75): HttpKernel->handleRaw(Object(Request))
#4 /app/public/index.php(25): HttpKernel->handle(Object(Request))
#5 {main}

thrown in /app/src/Domain/Service/OrderService.php on line 23

Key Information to Extract

| Element | Value | Meaning |
| --- | --- | --- |
| Error Type | TypeError | Type mismatch bug |
| Message | Argument #1 must be array, null given | Null pointer issue |
| Thrown Location | OrderService.php:23 | Where error detected |
| Call Location | CreateOrderUseCase.php:45 | Where bad value passed |
| Root Investigation | CreateOrderUseCase.php:45 | Start here |

Stack Trace Analysis Steps

  1. Read error message - What type of error?
  2. Find thrown location - Where was error detected?
  3. Find call location - Where was bad value passed?
  4. Trace backward - How did bad value get there?
  5. Find origin - Where was value first set to bad state?
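
Steps 1 to 3 are mechanical enough to script. The following is a rough sketch that assumes the trace follows the PHP fatal-error layout shown earlier ("Uncaught <Type>:", "called in ... on line N", "thrown in ... on line N", and frames starting with `#<n>`). The function name `trace_summary` and the output labels are made up for illustration; steps 4 and 5 still need a human reading the code.

```shell
#!/bin/sh
# trace_summary: read a PHP fatal-error trace on stdin and print the error
# type, the thrown location, the call location, and the first non-vendor frame.
trace_summary() {
  trace=$(cat)
  # Step 1: error type, e.g. "Uncaught TypeError: ..."
  printf '%s\n' "$trace" |
    sed -n 's/.*Uncaught \([^: ]*\):.*/type=\1/p' | head -n 1
  # Step 2: where the error was detected
  printf '%s\n' "$trace" |
    sed -n 's/.*thrown in \(.*\) on line \([0-9][0-9]*\).*/thrown=\1:\2/p' | head -n 1
  # Step 3: where the bad value was passed
  printf '%s\n' "$trace" |
    sed -n 's/.*called in \(.*\) on line \([0-9][0-9]*\).*/called=\1:\2/p' | head -n 1
  # Frames look like "#0 /path/File.php(45): Class->method(...)"; skip vendor code.
  printf '%s\n' "$trace" | grep '^#[0-9]' | grep -v '/vendor/' | head -n 1 |
    sed 's/^#[0-9]* \([^(]*\)(\([0-9]*\)).*/first_app_frame=\1:\2/'
}
```

Usage would be along the lines of `trace_summary < error.log`, where `error.log` is a placeholder for a file holding a single trace.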

Common Stack Trace Patterns

Pattern: Null from Repository
#0 Repository->find() returns null
#1 Service uses result without check
#2 Controller calls service
→ Fix in #1: Add null check

Pattern: Type Coercion
#0 Method expects int, gets string
#1 Request data not validated
#2 Controller passes raw input
→ Fix in #1 or #2: Add validation/casting

Pattern: Missing Dependency
#0 Service->method() called
#1 Container->get() fails
#2 Dependency not registered
→ Fix: Register dependency in container

Method 5: Dependency Graph Analysis

Trace data flow through the application.

Data Flow Tracing

// Trace the flow of $orderId
1. Controller receives $orderId from Request
2. UseCase receives $orderId as Command property
3. Repository uses $orderId in SQL query
4. Database returns null (ID doesn't exist)
5. UseCase passes null to Service
6. Service throws NullPointerException

Dependency Questions

  • Where does the value originate?
  • What transformations does it undergo?
  • Where is it validated?
  • Where could it become invalid?
  • Who else uses this value?

Call Graph Investigation

# Find all callers of a method (-e keeps grep from reading "->" as an option)
grep -rn -e '->calculateTotal(' src/

# Find all places where variable is set (single quotes stop the shell from expanding $items)
grep -rn '\$items[[:space:]]*=' src/

# Find all null assignments
grep -rn '= null' src/Domain/

Method 6: State Timeline Reconstruction

Rebuild the sequence of state changes.

Timeline Template

State Timeline

T0: Initial state
  - Order::status = DRAFT
  - Order::items = []
T1: AddItemToOrder executed
  - Order::items = [Item(id=1)]
  - Expected: status stays DRAFT ✓
T2: SubmitOrder executed
  - Order::status = SUBMITTED
  - Expected: items preserved ✓
T3: PaymentReceived event
  - Order::status = PAID
  - BUG: items cleared unexpectedly ✗
T4: GetOrder query
  - Returns Order with empty items
  - User sees $0 total

Root Cause: PaymentReceived handler incorrectly reinitializes Order
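
If the application logs state changes, a first draft of such a timeline can be pulled out of the logs. The sketch below assumes a hypothetical log format with an ISO 8601 UTC timestamp followed by `key=value` fields (shown in the comment); neither the format nor the function name `order_timeline` comes from a real system, so adapt both to whatever the application actually logs.

```shell
#!/bin/sh
# order_timeline ORDER_ID FILE...: print the logged state changes for one order
# in time order and flag events where the item count dropped.
# Assumed (hypothetical) log line format:
#   2024-05-01T10:00:03Z order=42 event=PaymentReceived status=PAID items=0
order_timeline() {
  order_id=$1
  shift
  grep -h " order=$order_id " "$@" | LC_ALL=C sort | awk '
    {
      items = -1
      for (i = 1; i <= NF; i++)
        if ($i ~ /^items=/) items = substr($i, 7) + 0
      flag = (seen && items >= 0 && items < prev) ? "  <-- items dropped" : ""
      printf "T%d: %s%s\n", NR - 1, $0, flag
      if (items >= 0) { prev = items; seen = 1 }
    }'
}
```

Called as `order_timeline 42 app.log worker.log` (placeholder file names), it merges entries from several logs, so a drop introduced by a background handler shows up next to the request that later exposed it.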

Investigation Checklist

Before Starting

  • Reproduce bug consistently
  • Identify exact error message
  • Note conditions when bug occurs
  • Check if bug is environment-specific

During Investigation

  • Parse stack trace for key locations
  • Apply 5 Whys technique
  • Build fault tree if multiple possible causes
  • Use git bisect if recent regression
  • Trace data flow from origin to error

After Finding Root Cause

  • Verify fix addresses root cause, not symptom
  • Check for similar bugs in related code
  • Document root cause for team knowledge
  • Consider if design change prevents recurrence

Quick Reference: Where to Look First

| Bug Type | First Investigation Point |
| --- | --- |
| Null pointer | Repository/Factory that creates the null |
| Wrong calculation | Input values, not calculation logic |
| Missing data | Event handler or background job |
| Intermittent failure | Shared state or race condition |
| After deployment | Git diff between versions |
| Only in production | Environment config or data |
| Only for some users | User-specific data or permissions |