# Systematic Debugging

## Overview

Random fixes are a waste of time and introduce new bugs, and hasty patches only mask underlying issues.

**Core principle:** always identify the root cause before attempting a fix. Fixing only the symptom is a failure; going through the motions defeats the purpose of debugging.
## Non-Negotiable Rule

**No fix proposals without root cause investigation.** You cannot propose a fix until you have completed Phase 1.
## When to Use
For any technical issue:
- Test failures
- Production bugs
- Abnormal behavior
- Performance issues
- Build failures
- Integration problems
Mandatory use in these situations:
- Time constraints (emergencies are where guesswork fixes happen most often)
- Thinking "a small change will fix it"
- Having tried multiple fixes already
- The last fix didn't work
- You don't fully understand the problem
Do NOT skip even if:
- The problem seems simple (simple bugs still have root causes)
- You're in a hurry (rushing leads to rework)
- A leader demands an immediate fix (systematic debugging is faster than repeated guesses)
## Four Phases
You must complete each phase before moving to the next.
### Phase 1: Root Cause Investigation

Before attempting any fixes:
1. **Read error messages carefully**
   - Don't skip errors or warnings; they often contain the solution directly
   - Read the full stack trace
   - Note line numbers, file paths, and error codes
2. **Reproduce the issue reliably**
   - Can you trigger it on demand?
   - What are the exact reproduction steps?
   - Does it reproduce every time?
   - If you can't reproduce it, collect more data; don't guess
3. **Check recent changes**
   - What changes could have caused this issue?
   - `git diff`, recent commits
   - New dependencies, configuration changes
   - Environment differences
4. **Collect evidence in multi-component systems**
   When the system has multiple components (CI → Build → Signing, API → Service → Database), add diagnostic instrumentation before proposing a fix. At each component boundary:
   - Record data entering the component
   - Record data leaving the component
   - Verify environment/configuration propagation
   - Check state at each layer
   Execute once to collect the evidence, analyze it to identify where the break occurs, then investigate the faulty component in depth.
   Example (multi-layer system):

```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Report presence without printing the secret itself
if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer has the issue (secrets → workflow ✓, workflow → build ✗).
5. **Trace data flow**
   When errors occur deep in the call stack, see the complete reverse tracing techniques in this directory. Brief version:
   - Where did the incorrect value originate?
   - Who called this with the incorrect value?
   - Keep tracing upward until you find the source
   - Fix at the source, not at the symptom
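As a minimal sketch of reverse tracing: suppose a bad `TIMEOUT=0` surfaces deep in a shell pipeline. Searching for every assignment walks you back up the data flow to the source. The file names, the variable, and the tiny "codebase" below are all hypothetical:

```shell
# Build a tiny hypothetical codebase: config.sh is where the bad
# value originates; worker.sh only consumes it.
mkdir -p demo/src
printf 'TIMEOUT=0   # source of the bad value\n' > demo/src/config.sh
printf '. ./config.sh\nrun_with_timeout "$TIMEOUT"\n' > demo/src/worker.sh

# Trace upward: which file actually assigns TIMEOUT?
origin=$(grep -rl "TIMEOUT=" demo/src)
echo "value originates in: $origin"
```

The fix then belongs in `config.sh` (the source), not in whatever consumer happened to crash.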
### Phase 2: Pattern Analysis

Identify patterns before fixing:
1. **Find working examples**
   - Look for similar working code in the same codebase
   - What working code resembles the problematic code?
2. **Compare with reference implementations**
   - If implementing a pattern, read the reference implementation fully
   - Don't skim; read line by line
   - Fully understand the pattern before applying it
3. **Identify differences**
   - What's different between the working code and the problematic code?
   - List every difference, no matter how small
   - Don't assume "that can't possibly matter"
4. **Understand dependencies**
   - What other components does this feature require?
   - What settings, configurations, and environments are needed?
   - What implicit assumptions does it have?
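To make "list every difference" concrete, here is a minimal sketch comparing a working setup against a broken one. The file names and keys are hypothetical stand-ins for whatever artifacts you are comparing (configs, env dumps, build flags):

```shell
# Two hypothetical configs: one captured from the working setup,
# one from the broken setup.
printf 'retries=3\ntimeout=30\ntls=on\n' > working.conf
printf 'retries=3\ntimeout=0\ntls=on\n'  > broken.conf

# diff exits non-zero when files differ, so guard it with || true
# and record every difference, no matter how small it looks.
diffs=$(diff working.conf broken.conf || true)
echo "$diffs"
```

Every line of that output is a candidate root cause; none of them gets dismissed as "that can't possibly matter" until verified.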
### Phase 3: Hypothesis & Verification

Apply the scientific method:
1. **Form a single hypothesis**
   - State it clearly: "I believe X is the root cause because Y"
   - Write it down
   - Be specific, not vague
2. **Test minimally**
   - Make the smallest change that verifies the hypothesis
   - Change only one variable at a time
   - Don't fix multiple issues simultaneously
3. **Verify before proceeding**
   - Did it work? Yes → move to Phase 4
   - Didn't work? Form a new hypothesis
   - Don't stack more fixes on top
4. **When you're unsure**
   - Say "I don't understand X"
   - Don't pretend you know
   - Ask for help
   - Do more research
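A single-variable hypothesis test can be sketched in shell. Here the hypothesis is "the job fails because LANG is unset"; `the_job` is a hypothetical stand-in for the real command, and exactly one variable changes between the two runs:

```shell
# Hypothesis: "the job fails because LANG is unset."
# the_job is a placeholder that fails exactly when LANG is empty,
# simulating the suspected behavior of the real command.
the_job() { [ -n "${LANG:-}" ]; }

unset LANG
if the_job; then before=pass; else before=fail; fi

# Change exactly one variable, nothing else.
LANG=C.UTF-8
if the_job; then after=pass; else after=fail; fi

echo "before=$before after=$after"
```

A fail → pass flip confirms the hypothesis; anything else means form a new one, not stack another change on top.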
### Phase 4: Implementation

Fix the root cause, not the symptom:
1. **Create a failing test case**
   - Write a minimal reproduction
   - Use automated tests whenever possible
   - Write a one-time test script if no test framework exists
   - You must have a test before fixing
   - Use the superpowers:test-driven-development skill to write standard failing tests
2. **Implement a single fix**
   - Fix the identified root cause
   - Change only one thing at a time
   - Don't make "while I'm at it" optimizations
   - Don't bundle refactoring
3. **Verify the fix**
   - Does the test pass now?
   - Do the other tests still pass?
   - Is the problem truly resolved?
4. **If the fix doesn't work**
   - Stop
   - Count how many fixes you have attempted
   - Fewer than 3: return to Phase 1 and re-analyze with the new information
   - 3 or more: stop and question the architecture (see Step 5 below)
   - Do not attempt a fourth fix without an architectural discussion
5. **If 3+ fixes fail: question the architecture**
   These patterns indicate architectural issues:
   - Each fix exposes new shared state, coupling, or issues elsewhere
   - The fix requires "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere
   Stop and question the fundamentals:
   - Is this pattern fundamentally sound?
   - Are we sticking with a wrong solution out of momentum?
   - Should we refactor the architecture or keep patching symptoms?
   Discuss with your partner before attempting more fixes. This is not hypothesis failure; it is architectural failure.
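Step 1's one-time test script (for when no test framework exists) can be as small as a function plus one executable assertion that fails while the bug exists and passes after the fix. Everything here is a hypothetical sketch, not an API from this document:

```shell
# One-time test script: turn the bug report into an executable assertion.
# parse_port is a hypothetical function under test; the report says it
# returns the host instead of the port.
parse_port() {
  # fixed implementation: keep everything after the last ':'
  printf '%s\n' "${1##*:}"
}

got=$(parse_port "db.internal:5432")
# Fails (exit 1) while the bug exists, passes after the root-cause fix.
[ "$got" = "5432" ] || { echo "FAIL: expected 5432, got $got"; exit 1; }
echo "PASS"
```

Run it once before the fix to confirm it fails, then again after; that before/after pair is the proof the root cause was actually addressed.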
## Red Lines: Stop and Follow the Process
If you catch yourself thinking:
- "Just patch it temporarily, I'll investigate later"
- "Try changing X to see if it works"
- "Change multiple things at once and run tests"
- "Skip testing, I'll verify manually"
- "It's probably X, let me fix it"
- "I don't fully understand, but this should work"
- "The pattern says X, but I'll use it differently"
- "The main issues are: [listing fixes without investigation]"
- Proposing solutions without tracing data flow
- "Try one more fix" (after 2+ failed attempts)
- Each fix exposes new issues in different places
All of these mean: Stop. Return to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)
## Signals from Your Partner That Your Approach Is Wrong
Watch for these reminders:
- "Isn't it this way?" – You're making assumptions without verification
- "Can it tell us...?" – You should collect evidence first
- "Stop guessing" – You're proposing fixes without understanding
- "Think deeper" – Question fundamental issues, not just symptoms
- "Are we stuck?" (frustrated tone) – Your approach isn't working
When you see these signals: Stop. Return to Phase 1.
## Common Excuses

| Excuse | Reality |
|---|---|
| "The problem is simple, no need to follow the process" | Even simple problems have root causes. For simple bugs, the process can be completed quickly. |
| "It's an emergency, no time for the process" | Systematic debugging is faster than repeated guesswork fixes. |
| "Try it first, then investigate" | The first fix sets the tone. Do it right from the start. |
| "Write tests after confirming the fix works" | Fixes without tests don't last. Write tests first to prove the fix works. |
| "Fixing multiple issues at once saves time" | You can't isolate what worked. It also introduces new bugs. |
| "The reference implementation is too long, I'll modify it myself" | Partial understanding inevitably leads to bugs. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Try one more time" (after 2+ failed attempts) | 3+ failures = architectural issue. Question the pattern, don't keep fixing. |
## Quick Reference Table

| Phase | Key Activities | Pass Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, collect evidence | Understand what went wrong and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theories, minimal verification | Hypothesis is verified or new hypothesis is formed |
| 4. Implementation | Create tests, fix, verify | Bug is fixed, tests pass |
## When the Process Says "No Root Cause Found"
If after systematic investigation the issue is truly environment-related, timing-related, or caused by external factors:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retries, timeouts, error messages)
- Add monitoring/logging for future investigation
But: 95% of "no root cause found" cases are due to insufficient investigation.
## Supporting Techniques

These techniques are part of systematic debugging and can be found in this directory:
- Trace bugs backward along the call stack to find the initial trigger point
- After finding the root cause, add validation at multiple levels
- condition-based-waiting.md: replace hard-coded wait times with conditional polling
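The condition-based-waiting idea can be sketched as a small polling helper: instead of `sleep 5` and hoping, poll the condition with a timeout. The helper and the marker-file condition below are illustrative, not taken from the referenced file:

```shell
# wait_for CONDITION_CMD TIMEOUT_SECONDS: poll until the condition
# command succeeds or the timeout expires, instead of a fixed sleep.
wait_for() {
  cmd=$1; timeout=$2; waited=0
  until eval "$cmd"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example condition: wait for a hypothetical readiness marker file.
touch service.ready
if wait_for '[ -f service.ready ]' 10; then
  echo "condition met"
else
  echo "timed out"
fi
```

The condition-based version returns as soon as the system is actually ready and fails loudly on timeout, whereas a fixed sleep is always either too short (flaky) or too long (slow).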
Related Skills:
- superpowers:test-driven-development - For creating failing test cases (Phase 4, Step 1)
- superpowers:verification-before-completion - Verify the fix actually works before declaring success
## Actual Results

Data from debugging practice:
- Systematic approach: 15-30 minutes to a fix
- Random-fix approach: 2-3 hours of trial and error
- First-fix success rate: 95% vs. 40%
- New bugs introduced: near zero vs. frequent