Systematic Debugging
Overview
Random fixes waste time and introduce new bugs. Hasty patches only mask the underlying issue.
Core Principle: Always find the root cause before attempting a fix. Fixing only the symptom is a failure.
Going through the motions defeats the purpose of debugging.
Non-Negotiable Rule
No root cause investigation, no fix proposal
You may not propose a fix until you have completed Phase 1.
When to Use
Use for any technical issue:
- Test failures
- Production bugs
- Abnormal behavior
- Performance issues
- Build failures
- Integration issues
Use it especially when:
- You are under time pressure (emergencies are when people are most likely to guess at fixes)
- You think "a small change will fix it"
- You have already tried multiple fixes
- The last fix didn't work
- You don't fully understand the problem
Do NOT skip in these cases either:
- The problem seems simple (even simple bugs have root causes)
- You're in a hurry (rushing leads to rework)
- Leadership demands an immediate fix (systematic debugging is faster than trial and error)
Four Phases
You must complete each phase before moving to the next.
Phase 1: Root Cause Investigation
Before attempting any fix:
1. Read error messages carefully
- Don't skip errors or warnings
- They often contain the solution directly
- Read the full stack trace
- Note line numbers, file paths, error codes
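One way to avoid losing part of an error is to capture everything the failing command prints and then search the capture. A minimal sketch, assuming a POSIX shell; `capture_run` is a hypothetical helper name, not part of any tool:

```shell
# capture_run LOGFILE COMMAND [ARGS...]
# Runs COMMAND, saves stdout and stderr to LOGFILE, lists error/warning
# lines with their line numbers, and returns COMMAND's exit status.
capture_run() {
    log=$1
    shift
    rc=0
    "$@" >"$log" 2>&1 || rc=$?
    echo "exit status: $rc (full output in $log)"
    # -n prints the line number within the log, -i ignores case.
    grep -n -i -E 'error|warning' "$log" || echo "no error/warning lines matched"
    return "$rc"
}
```

For example, `capture_run /tmp/build.log make test` (the command and path are placeholders) keeps the full stack trace on disk while surfacing the lines worth reading first.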
2. Stable Reproduction
- Can you reliably trigger it?
- What are the exact reproduction steps?
- Does it reproduce every time?
- If you can't reproduce → collect more data, don't guess
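To answer "does it reproduce every time?" with a number rather than an impression, the failing command can be run repeatedly and the failures counted. A minimal sketch, assuming a POSIX shell; `repro_rate` is a hypothetical helper name:

```shell
# repro_rate RUNS COMMAND [ARGS...]
# Runs COMMAND RUNS times and reports how many of the runs failed.
repro_rate() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

A result where every run fails means the reproduction is stable. A result where only some runs fail suggests an intermittent issue, which calls for more data collection before forming a hypothesis.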
3. Check Recent Changes
- What changes could have caused this issue?
- git diff, recent commits
- New dependencies, configuration changes
- Environment differences
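The checks above can be gathered in one pass with standard git commands. A minimal sketch, assuming git is installed and the project is a git repository; `recent_changes` is a hypothetical helper name:

```shell
# recent_changes REPO_DIR [COUNT]
# Prints the last COUNT commits with the files each one touched,
# followed by any uncommitted changes in the working tree.
recent_changes() {
    repo=$1
    count=${2:-5}
    echo "--- last $count commits ---"
    git -C "$repo" log --oneline --stat -n "$count"
    echo "--- uncommitted changes ---"
    git -C "$repo" status --short
}
```

Dependency manifests and configuration files that show up in this output are good first candidates to inspect.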
4. Collect Evidence in Multi-Component Systems
When the system has multiple components (CI → Build → Signing, API → Service → Database):
Before proposing a fix, add diagnostic instrumentation:
For each component boundary:
- Record data entering the component
- Record data leaving the component
- Verify environment/configuration propagation
- Check state at each layer
Run once to collect evidence and identify where the chain breaks
Then analyze evidence to locate the faulty component
Then investigate that component in depth
Example (multi-layer system):
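```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Report only whether the secret is set; never echo its value into the log.
if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

# Layer 2: Build script
echo "=== Env vars in build script: ==="
# Match on the variable name only, so the value is not printed.
env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```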
This reveals which layer has the issue (for example: secrets → workflow ✓, workflow → build ✗).
5. Trace Data Flow
When errors occur deep in the call stack:
See the call-stack tracing guide in this directory for full reverse-tracing techniques.
Short version:
- Where did the incorrect value originate?
- Who called this with the incorrect value?
- Keep tracing upward until you find the source
- Fix at the source, not at the symptom
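The upward trace can be made visible by logging what each level of a call chain received. A minimal sketch, assuming a POSIX shell; the three functions and the `JOB_DIR` variable are hypothetical and stand in for a real call chain:

```shell
# trace LEVEL VALUE: log to stderr what a level received, so stdout stays clean.
trace() {
    echo "TRACE $1: dir='$2'" >&2
}

# Hypothetical call chain: load_config -> prepare_job -> run_job.
load_config() {
    trace load_config "${JOB_DIR:-}"
    echo "${JOB_DIR:-}"        # the value originates here
}

prepare_job() {
    trace prepare_job "$1"
    run_job "$1"
}

run_job() {
    trace run_job "$1"
    if [ -z "$1" ]; then
        echo "run_job: empty dir" >&2   # the symptom appears here
        return 1
    fi
    echo "running in $1"
}
```

Running `prepare_job "$(load_config)"` with `JOB_DIR` unset prints an empty `dir=''` at every level. Reading the trace lines upward shows the value was already empty in `load_config`, so that is where the fix belongs, not in `run_job`.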
Phase 2: Pattern Analysis
Find patterns before fixing:
1. Find working examples
- Look for similar working code in the same codebase
- What working code resembles the problematic code?
2. Compare with reference implementations
- If implementing a pattern, read the reference implementation fully
- Don't skim — read line by line
- Fully understand the pattern before applying it
3. Identify differences
- What's different between the working code and the problematic code?
- List every difference, no matter how small
- Don't assume "that can't affect anything"
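For line-oriented settings files (for example `KEY=value` dumps from a working and a broken environment), listing every difference can be mechanical. A minimal sketch, assuming a POSIX shell with `mktemp` available; `list_differences` is a hypothetical helper name:

```shell
# list_differences WORKING_FILE BROKEN_FILE
# Sorts both files first so that line order alone is not reported,
# then prints a unified diff. Returns 0 if identical, 1 if they differ.
list_differences() {
    a=$(mktemp)
    b=$(mktemp)
    sort "$1" >"$a"
    sort "$2" >"$b"
    rc=0
    diff -u "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

Sorting is an assumption that suits settings files only. For source code, where order matters, run `diff -u` on the files directly.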
4. Understand dependencies
- What other components does this feature require?
- What settings, configurations, environments are needed?
- What implicit assumptions does it have?
Phase 3: Hypothesis and Verification
Scientific method:
1. Form a single hypothesis
- State clearly: "I believe X is the root cause because Y"
- Write it down
- Be specific, not vague
2. Minimal testing
- Make the smallest change to verify the hypothesis
- Change only one variable at a time
- Don't fix multiple issues at once
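When the hypothesis concerns a single setting, changing one variable at a time can be done literally: run the same command twice, once as-is and once with exactly one environment variable changed. A minimal sketch, assuming a POSIX shell; `one_variable_test` is a hypothetical helper name:

```shell
# one_variable_test NAME=VALUE COMMAND [ARGS...]
# Runs COMMAND unchanged, then again with only NAME=VALUE added to its
# environment, and reports both exit statuses side by side.
# COMMAND must be an external program (env cannot run shell functions).
one_variable_test() {
    assignment=$1
    shift
    baseline=0
    "$@" >/dev/null 2>&1 || baseline=$?
    changed=0
    env "$assignment" "$@" >/dev/null 2>&1 || changed=$?
    echo "baseline exit=$baseline, with $assignment exit=$changed"
}
```

If the two exit statuses differ, the variable matters and the hypothesis gains support. If they are the same, form a new hypothesis instead of adding a second change on top.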
3. Verify before proceeding
- Did it work? Yes → move to Phase 4
- Didn't work? Form a new hypothesis
- Don't stack more fixes on top
4. When you're unsure
- Say "I don't understand X"
- Don't pretend you know
- Ask for help
- Do more research
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create a failing test case
- Minimal reproduction
- Use automated tests whenever possible
- Write a one-time test script if no test framework is available
- Must have a test before fixing
- Use the superpowers:test-driven-development skill to write a proper failing test
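When no test framework is available, a one-time test script can be as small as one assertion helper and one test. A minimal sketch, assuming a POSIX shell; `trim_slashes` is a hypothetical function with a deliberate bug, used only to show a test that fails before the fix:

```shell
# Hypothetical function under test: should remove ALL trailing slashes.
# Bug: ${1%/} strips only one trailing slash.
trim_slashes() {
    echo "${1%/}"
}

# assert_eq EXPECTED ACTUAL LABEL
# Prints PASS or FAIL and returns nonzero on mismatch.
assert_eq() {
    if [ "$1" = "$2" ]; then
        echo "PASS: $3"
    else
        echo "FAIL: $3 (expected '$1', got '$2')"
        return 1
    fi
}

# Minimal reproduction of the reported bug.
# Expected to FAIL until the root cause in trim_slashes is fixed.
test_trailing_slashes() {
    assert_eq "/var/data" "$(trim_slashes "/var/data//")" "removes all trailing slashes"
}
```

Running `test_trailing_slashes` now reports a failure, which is the point: the test must fail first so that passing later proves the fix addressed the root cause.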
2. Implement a single fix
- Fix the identified root cause
- Change only one thing at a time
- Don't make "while I'm at it" optimizations
- Don't bundle refactoring
3. Verify the fix
- Does the test pass now?
- Do all other tests still pass?
- Is the problem truly resolved?
4. If the fix doesn't work
- Stop
- Count: How many fixes have you tried?
- Fewer than 3: Go back to Phase 1, re-analyze with new information
- 3 or more: Stop and question the architecture (see Step 5 below)
- Don't attempt a 4th fix without architectural discussion
5. If 3+ fixes fail: Question the architecture
These patterns indicate architectural issues:
- Each fix reveals new shared state/coupling/issues elsewhere
- Fix requires "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
Stop and question fundamental issues:
- Is this pattern fundamentally sound?
- Are we sticking to a bad solution out of inertia?
- Should we refactor the architecture or keep patching symptoms?
Discuss with your partner before attempting more fixes
This isn't hypothesis failure — this is architectural failure.
Red Lines — Stop and Follow the Process
If you catch yourself thinking:
- "I'll fix it temporarily and investigate later"
- "Let me try changing X to see if it works"
- "Change multiple things at once and run tests"
- "Skip tests, I'll verify manually"
- "It's probably X, let me fix it"
- "I don't fully understand, but this should work"
- "The pattern says X, but I'll use it differently"
- "The main issues are: [listing fixes without investigation]"
- Proposing solutions without tracing data flow
- "Just try one more fix" (after 2+ failed attempts)
- Each fix reveals new issues in different places
All of these mean: Stop. Go back to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)
Partner Signals — Your Approach Is Wrong
Watch for these reminders:
- "Isn't it...?" — You're making assumptions without verification
- "Can it tell us...?" — You should collect evidence first
- "Don't guess" — You're proposing fixes without understanding
- "Think deeper" — You need to question fundamental issues, not just symptoms
- "Are we stuck?" (frustrated tone) — Your approach isn't working
When you see these signals: Stop. Go back to Phase 1.
Common Excuses
| Excuse | Reality |
|---|---|
| "The problem is simple, no need to follow the process" | Even simple problems have root causes. For simple bugs, the process is quick to complete. |
| "It's an emergency, no time for the process" | Systematic debugging is faster than trial and error. |
| "Try it first, then investigate" | The first fix sets the tone. Do it right from the start. |
| "Write tests after confirming the fix works" | Fixes without tests don't last. Writing tests first proves the fix works. |
| "Fixing multiple issues at once saves time" | You can't isolate what worked. It also introduces new bugs. |
| "The reference implementation is too long, I'll modify it myself" | Partial understanding inevitably leads to bugs. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more time" (after 2+ failures) | 3+ failures = architectural issue. Question the pattern, don't keep fixing. |
Cheat Sheet
| Phase | Key Activities | Pass Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, collect evidence | Understand what went wrong and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal verification | Hypothesis is verified or new hypothesis formed |
| 4. Implementation | Create test, fix, verify | Bug is fixed, tests pass |
When the Process Says "No Root Cause Found"
If systematic investigation reveals the issue is indeed environment-related, timing-related, or caused by external factors:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retries, timeouts, error messages)
- Add monitoring/logging for future investigation
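For a failure that investigation has shown to be genuinely external, "appropriate handling" might look like a bounded retry that logs every failed attempt. A minimal sketch, assuming a POSIX shell; `retry` is a hypothetical helper name and the attempt count and delay are placeholders to tune:

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Every failure is logged to stderr so the pattern can be investigated later.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max failed: $*" >&2
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $max attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

The retry limit keeps a persistent failure from being hidden forever, and the logged attempts serve as the monitoring data mentioned above.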
However: 95% of "no root cause found" cases are due to insufficient investigation.
Supporting Techniques
The following techniques are part of systematic debugging and can be found in this directory:
- Trace bugs backward along the call stack to find the initial trigger
- After finding the root cause, add validation at multiple levels
- condition-based-waiting.md - Replace hardcoded wait times with conditional polling
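As an illustration of the last technique, a fixed `sleep` can be replaced by polling for the condition itself, with a timeout as a safety net. A minimal sketch, assuming a POSIX shell; `wait_for` is a hypothetical helper name and is not taken from condition-based-waiting.md:

```shell
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls COMMAND about once per second until it succeeds.
# Returns 1 if TIMEOUT_SECONDS pass without success.
wait_for() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For example, `wait_for 30 test -e /tmp/server.ready` (the path is a placeholder) returns as soon as the file exists instead of always waiting the full 30 seconds, and fails loudly if it never appears.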
Related Skills:
- superpowers:test-driven-development - Used to create failing test cases (Phase 4, Step 1)
- superpowers:verification-before-completion - Verify the fix actually works before declaring success
Practical Results
Data from debugging practices:
- Systematic approach: 15-30 minutes to fix
- Random fix approach: 2-3 hours of trial and error
- First-fix success rate: 95% (systematic) vs 40% (random fixes)
- New bugs introduced: almost none (systematic) vs frequent (random fixes)