systematic-debugging

Systematic Debugging

Overview

Random fixes waste time and introduce new bugs. Quick patches mask the root problem.
Core Principle: Always find the root cause before attempting a fix. Symptom-based fixes fail.
Violating the letter of this process violates the spirit of debugging.

Iron Rule

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
You cannot propose a fix until you have completed Phase 1.

When to Use

For any technical issue:
  • Test failures
  • Production errors
  • Unexpected behavior
  • Performance issues
  • Build failures
  • Integration problems
Use this especially when:
  • Under time pressure (urgency makes it easy to guess)
  • "Just a quick fix" seems obvious
  • You've already tried multiple fixes
  • Previous fixes didn't work
  • You don't fully understand the problem
Do NOT skip this when:
  • The problem seems simple (simple bugs also have root causes)
  • You're in a hurry (rushing guarantees rework)
  • A manager wants an immediate fix (a systematic approach is faster than thrashing)

Four Phases

You must complete each phase before moving to the next.

Phase 1: Root Cause Investigation

Before attempting any fix:
  1. Read error messages carefully
    • Don't skip past errors or warnings
    • They often contain precise solutions
    • Read the full stack trace
    • Note line numbers, file paths, error codes
  2. Reproduce consistently
    • Can you trigger it reliably?
    • What are the exact steps?
    • Does it happen every time?
    • If non-reproducible → collect more data, don't guess
  3. Check recent changes
    • What change might have caused this?
    • Git diff, recent commits
    • New dependencies, configuration changes
    • Environment differences
  4. Gather evidence in multi-component systems
When the system has multiple components (CI → Build → Signing, API → Service → Database):
Before proposing a fix, add diagnostic instrumentation:
For EACH component boundary:
  - Log what data enters component
  - Log what data exits component
  - Verify environment/config propagation
  - Check state at each layer

Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component
Example (multi-layer system):
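```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Print only SET or UNSET so the secret value never reaches the log
if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

# Layer 2: Build script
echo "=== Env vars in build script: ==="
# Check presence only; printing the matching line would expose the value
env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```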
This reveals: Which layer failed (Secrets → Workflow ✓, Workflow → Build ✗)
  5. Trace data flow
When the error is deep in the call stack, see root-cause-tracing.md in this directory for the full backward tracing technique.
Quick version:
  • Where does the bad value come from?
  • What called this with the bad value?
  • Keep tracing until you find the source
  • Fix at the source, not the symptom
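
Step 2 above asks whether the failure happens every time. A minimal sketch of one way to measure that, assuming the failure can be triggered by a single command; `repro_rate` is a hypothetical helper name, and `false` and `true` are stand-ins for the real command:

```bash
# repro_rate RUNS CMD...: run CMD repeatedly and report how often it fails.
# Hypothetical helper for illustration; not part of any existing tool.
repro_rate() {
  runs=$1; shift
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    # Count a failure whenever the command exits non-zero.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
}

# Demo with stand-in commands; substitute the real reproduction command.
repro_rate 5 false
repro_rate 5 true
```

A result such as 5/5 suggests a deterministic bug; a result between the extremes suggests timing, ordering, or environment, which is a cue to collect more data rather than guess.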

Phase 2: Pattern Analysis

Find patterns before fixing:
  1. Look for working examples
    • Find similar working code in the same codebase
    • What works that's similar to what's broken?
  2. Compare to references
    • If implementing a pattern, read the reference implementation fully
    • Don't skim - read every line
    • Understand the pattern fully before applying
  3. Identify differences
    • What's different between working and broken?
    • List every difference, no matter how small
    • Don't assume "that doesn't matter"
  4. Understand dependencies
    • What other components does this require?
    • What settings, configurations, environment?
    • What assumptions does it make?
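
"List every difference" can often be mechanized. The sketch below assumes the working and broken setups can each be dumped to a key=value file; `list_differences` and the sample settings are hypothetical, made up for the demo:

```bash
# list_differences WORKING BROKEN: print every line that differs between
# two key=value files, ignoring line order. Hypothetical helper.
list_differences() {
  tmpdir=$(mktemp -d)
  sort "$1" > "$tmpdir/working"
  sort "$2" > "$tmpdir/broken"
  # diff exits 1 when the files differ; that is the expected case here.
  diff "$tmpdir/working" "$tmpdir/broken" || true
  rm -rf "$tmpdir"
}

# Demo with made-up settings; in practice, capture these from each environment.
demo=$(mktemp -d)
printf 'TIMEOUT=30\nREGION=us-east-1\nRETRIES=3\n' > "$demo/working.env"
printf 'RETRIES=3\nTIMEOUT=5\nREGION=us-east-1\n' > "$demo/broken.env"
list_differences "$demo/working.env" "$demo/broken.env"
rm -rf "$demo"
```

Sorting first removes ordering noise, so the only lines left are real differences, each of which deserves a look before being dismissed.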

Phase 3: Hypothesis and Testing

Scientific Method:
  1. Form a single hypothesis
    • State clearly: "I think X is the root cause because Y"
    • Write it down
    • Be specific, not vague
  2. Test minimally
    • Make the smallest possible change to test the hypothesis
    • One variable at a time
    • Don't fix multiple issues at once
  3. Verify before proceeding
    • Did it work? Yes → Phase 4
    • Didn't work? Form a new hypothesis
    • Don't add more fixes on top
  4. When you don't know
    • Say "I don't understand X"
    • Don't pretend to know
    • Ask for help
    • Research more
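
"One variable at a time" might look like the following sketch. It assumes the hypothesis concerns environment settings; `check` is a stand-in for the real reproduction command (here it fails only when `CACHE` is `on`, purely for demonstration), and `try_one_change` is a hypothetical helper:

```bash
# Stand-in for the real reproduction command (assumption for the demo):
# it fails only when CACHE is "on".
check() {
  [ "${CACHE:-off}" != "on" ]
}

# try_one_change VAR VALUE: run check with exactly one variable overridden.
try_one_change() {
  # The subshell keeps the override from leaking into later experiments.
  if ( export "$1=$2"; check ); then
    echo "$1=$2 -> pass"
  else
    echo "$1=$2 -> fail"
  fi
}

# Each experiment changes a single variable from the known baseline.
try_one_change LOG_LEVEL debug
try_one_change CACHE on
```

Because each run starts from the same baseline, a failing line points at one variable rather than at a combination.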

Phase 4: Implementation

Fix the root cause, not the symptom:
  1. Create a failing test case
    • Simplest reproduction possible
    • Automated test if possible
    • One-off test script if no framework
    • Must have this before fixing
    • Use the superpowers:test-driven-development skill to write proper failing tests
  2. Implement a single fix
    • Address the identified root cause
    • One change at a time
    • No "while I'm here" improvements
    • No bundled refactoring
  3. Verify the fix
    • Do tests pass now?
    • No other tests broken?
    • Is the problem truly solved?
  4. If the fix doesn't work
    • Stop
    • Count: How many fixes have you tried?
    • If < 3: Return to Phase 1, re-analyze with new info
    • If ≥ 3: Stop and question the architecture (Step 5 below)
    • Don't attempt fix #4 without architectural discussion
  5. If 3+ fixes have failed: question the architecture
Patterns indicating architectural issues:
  • Each fix reveals new shared state/coupling/problems in different places
  • Fix requires "massive refactoring" to implement
  • Each fix creates new symptoms elsewhere
Stop and question the fundamentals:
  • Is this pattern fundamentally sound?
  • Are we "sticking with it purely out of inertia"?
  • Should we refactor the architecture or keep fixing symptoms?
Discuss with your human partner before attempting more fixes
This isn't a failed hypothesis - it's a flawed architecture.
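
Step 1 allows a one-off test script when no framework is available. A minimal sketch of what that could look like; `slugify` is a hypothetical function under test with a deliberate bug, and `assert_eq` is a hypothetical helper, not from any test library:

```bash
# Hypothetical function under test. Deliberate bug for the demo:
# it lowercases the input but leaves spaces in place.
slugify() {
  printf '%s' "$1" | tr 'A-Z' 'a-z'
}

# assert_eq NAME EXPECTED ACTUAL: minimal assertion helper.
assert_eq() {
  if [ "$2" = "$3" ]; then
    echo "PASS: $1"
  else
    echo "FAIL: $1 (expected '$2', got '$3')"
    return 1
  fi
}

# Written before the fix: this must FAIL now and pass once the root cause is fixed.
assert_eq "spaces become hyphens" "hello-world" "$(slugify "Hello World")" || true
```

Seeing the test fail for the expected reason is what shows it exercises the bug; a test that passes before the fix proves nothing.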

Red Flags - Stop and Follow the Process

If you catch yourself thinking:
  • "Quick fix now, investigate later"
  • "Try changing X to see if it works"
  • "Add multiple changes, run tests"
  • "Skip testing, I'll verify manually"
  • "Probably X, let me fix that"
  • "I don't fully understand, but this might work"
  • "The pattern says X, but I'll adapt it differently"
  • "Here are the main issues: [list of uninvestigated fixes]"
  • Proposing solutions before tracing data flow
  • "Just try one more fix" (after 2+ attempts)
  • Each fix reveals new problems in different places
All of these mean: Stop. Return to Phase 1.
If 3+ fixes have failed: question the architecture (see Phase 4, Step 5)

Signals from Your Human Partner That You're Doing It Wrong

Watch for these redirects:
  • "Is that what's happening?" - You assumed without verifying
  • "Would it tell us...?" - You should add evidence collection
  • "Stop guessing" - You're proposing fixes without understanding
  • "Ultrathink this" - Question fundamentals, not just symptoms
  • "Are we stuck?" (frustration) - Your approach isn't working
When you see these: Stop. Return to Phase 1.

Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "The problem is simple, no need for process" | Simple problems still have root causes. The process is fast for simple bugs. |
| "It's an emergency, no time for process" | Systematic debugging is faster than guess-and-check whack-a-mole. |
| "Try it first, then investigate" | The first fix sets the pattern. Do it right from the start. |
| "I'll write tests after confirming the fix works" | Untested fixes don't last. Writing the test first proves the fix. |
| "Multiple fixes at once saves time" | You can't isolate what worked, and you introduce new bugs. |
| "The reference is too long, I'll adapt the pattern" | Partial understanding guarantees mistakes. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more fix" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern; don't attempt another fix. |

Quick Reference

| Phase | Main Activities | Success Criteria |
|-------|-----------------|------------------|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand what and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal testing | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Error resolved, tests pass |

When the Process Shows "No Root Cause"

If systematic investigation shows the problem is truly environmental, time-dependent, or external:
  1. You've completed the process
  2. Document what you investigated
  3. Implement appropriate handling (retries, timeouts, error messages)
  4. Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigations.
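
For a cause that really is external, "appropriate handling" plus logging could be as small as the sketch below. `retry` is a hypothetical helper, and the attempt count and delay are illustrative values, not recommendations:

```bash
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between attempts, and log every failure to stderr for later investigation.
# Hypothetical helper for illustration.
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max failed: $*" >&2
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $max attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Demo with a stand-in command; substitute the flaky external call.
retry 3 0 true && echo "succeeded"
```

The bounded attempt count and the per-attempt log line matter more than the retry itself: they keep the failure visible instead of silently absorbing it.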

Companion Techniques

These techniques are part of systematic debugging and can be found in this directory:
  • root-cause-tracing.md - Trace errors backward through the call stack to find the original trigger
  • defense-in-depth.md - Add multiple layers of validation after finding the root cause
  • condition-based-waiting.md - Replace arbitrary timeouts with condition polling
Related Skills:
  • superpowers:test-driven-development - For creating failing test cases (Phase 4, Step 1)
  • superpowers:verify-before-complete - Verify fixes work before declaring success
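
As a rough illustration of the idea behind condition-based waiting (this is a sketch, not the content of condition-based-waiting.md), the loop below polls for a condition with a deadline instead of sleeping for a fixed time. `wait_for` and the marker file are hypothetical:

```bash
# wait_for TIMEOUT_SECONDS CMD...: poll until CMD succeeds or the deadline
# passes. Hypothetical helper for illustration.
wait_for() {
  timeout=$1; shift
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Demo: a background job creates a marker file after one second;
# wait for the file instead of guessing how long to sleep.
workdir=$(mktemp -d)
( sleep 1; touch "$workdir/ready" ) &
wait_for 10 test -e "$workdir/ready" && echo "condition met"
wait
rm -rf "$workdir"
```

The wait ends as soon as the condition holds, and a timeout produces an explicit error naming what never became true.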

Real-World Impact

From debugging sessions:
  • Systematic approach: 15-30 minutes to fix
  • Random fix approach: 2-3 hours of whack-a-mole
  • First-fix success rate: 95% vs 40%
  • New bugs introduced: Near zero vs common