System Debugging
Overview
Random fixes waste time and introduce new bugs. Quick patches mask the root problem.
Core Principle: Always find the root cause before attempting a fix. Symptom-based fixes fail.
Cutting corners on this process defeats the purpose of debugging.
Iron Rule
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
You cannot propose a fix if you haven't completed the first phase.
When to Use
For any technical issue:
- Test failures
- Production errors
- Unexpected behavior
- Performance issues
- Build failures
- Integration problems
Use this especially when:
- Under time pressure (urgency makes it easy to guess)
- "Just a quick fix" seems obvious
- You've already tried multiple fixes
- Previous fixes didn't work
- You don't fully understand the problem
Do NOT skip this when:
- The problem seems simple (simple bugs also have root causes)
- You're in a hurry (hurry guarantees rework)
- Managers want an immediate fix (systematic approach is faster than chaos)
Four Phases
You must complete each phase before moving to the next.
Phase 1: Root Cause Investigation
Before attempting any fix:
1. Read error messages carefully
- Don't skip past errors or warnings
- They often contain precise solutions
- Read the full stack trace
- Note line numbers, file paths, error codes
2. Reproduce consistently
- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If non-reproducible → collect more data, don't guess
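Reproducibility can be measured rather than guessed. A minimal sketch: run the reproduction repeatedly and count failures (`./repro.sh` is a hypothetical stand-in for your actual repro steps).

```shell
#!/usr/bin/env bash
# Measure how reliably the bug reproduces instead of eyeballing it.
repro() { ./repro.sh; }   # hypothetical stand-in: replace with your real repro steps

runs=10
failures=0
for i in $(seq 1 "$runs"); do
  repro >/dev/null 2>&1 || failures=$((failures + 1))
done
echo "failed $failures of $runs runs"
```

A 10/10 failure rate means you can debug directly; an intermittent rate means collect more data before forming any hypothesis.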
3. Check recent changes
- What change might have caused this?
- Git diff, recent commits
- New dependencies, configuration changes
- Environment differences
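A sketch of that git archaeology, run here against a throwaway repo so the commands work anywhere; in practice you would run them inside the affected repository.

```shell
#!/usr/bin/env bash
# Survey recent history for the change that introduced the regression.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com \
  commit -q --allow-empty -m "baseline: tests green"
git -c user.name=dev -c user.email=dev@example.com \
  commit -q --allow-empty -m "suspect: bump dependency"

git log --oneline -5        # what landed recently?
git diff --stat HEAD~1      # which files did the newest commit touch?
count=$(git rev-list --count HEAD)
echo "commits inspected: $count"
```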
4. Gather evidence in multi-component systems
When the system has multiple components (CI → Build → Signing, API → Service → Database):
Before proposing a fix, add diagnostic tools:
For EACH component boundary:
- Log what data enters component
- Log what data exits component
- Verify environment/config propagation
- Check state at each layer
Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component
Example (multi-layer system):
```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
[ -n "${IDENTITY:-}" ] && echo "IDENTITY: SET" || echo "IDENTITY: UNSET"

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep -q '^IDENTITY=' && echo "IDENTITY present" || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer failed (e.g. Secrets → Workflow ✓, Workflow → Build ✗). Note: log only SET/UNSET, never the secret's value.
5. Trace data flow
When errors are deep in the call stack:
See this directory for the full backward tracing techniques.
Quick version:
- Where does the bad value come from?
- What called this with the bad value?
- Keep tracing until you find the source
- Fix at the source, not the symptom
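The quick version above can be sketched with three hypothetical layers: log the suspect value at each boundary, then read the output bottom-up to find the first layer where it is already wrong.

```shell
#!/usr/bin/env bash
# Backward tracing sketch: instrument each boundary, then walk the log upward.
layer3_use()  { echo "layer3 received: '$1'"; }
layer2_pass() { echo "layer2 received: '$1'"; layer3_use "$1"; }
layer1_read() {
  CONFIG_PATH=""   # planted root cause: layer 1 emits an empty path
  echo "layer1 produced: '$CONFIG_PATH'"
  layer2_pass "$CONFIG_PATH"
}

layer1_read
# layer3 got '' <- layer2 got '' <- layer1 produced '': fix layer 1, not layer 3.
```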
Phase 2: Pattern Analysis
Find patterns before fixing:
1. Look for working examples
- Find similar working code in the same codebase
- What works that's similar to what's broken?
2. Compare to references
- If implementing a pattern, read the reference implementation fully
- Don't skim - read every line
- Understand the pattern fully before applying
3. Identify differences
- What's different between working and broken?
- List every difference, no matter how small
- Don't assume "that doesn't matter"
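Difference-hunting can be made mechanical instead of visual. A sketch with illustrative config files: diff the working and broken setups and treat every differing line as a suspect.

```shell
#!/usr/bin/env bash
# Compare a working and a broken configuration line by line.
work=$(mktemp); broke=$(mktemp)
cat > "$work" <<'EOF'
NODE_ENV=production
RETRIES=3
TIMEOUT=30
EOF
cat > "$broke" <<'EOF'
NODE_ENV=production
RETRIES=3
TIMEOUT=5
EOF

# Every differing line is a suspect -- including the ones that "don't matter".
diff "$work" "$broke" || true
```

The same trick works on live state via process substitution, e.g. `diff <(env | sort) <(ssh broken-host env | sort)` (hypothetical host name).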
4. Understand dependencies
- What other components does this require?
- What settings, configurations, environment?
- What assumptions does it make?
Phase 3: Hypothesis and Testing
Scientific Method:
1. Form a single hypothesis
- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague
2. Minimal testing
- Make the smallest possible change to test the hypothesis
- One variable at a time
- Don't fix multiple issues at once
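As a sketch, with a hypothetical `run_test` whose outcome depends on a single flag, one-variable-at-a-time testing looks like this:

```shell
#!/usr/bin/env bash
# Flip exactly one variable between runs; everything else stays fixed.
run_test() {   # hypothetical stand-in: fails only when caching is enabled
  [ "${FEATURE_CACHE:-off}" = "off" ]
}

FEATURE_CACHE=off
run_test && echo "baseline (cache off): pass"

FEATURE_CACHE=on    # the single change under test
run_test || echo "cache on: fail -> hypothesis supported"
```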
3. Verify before proceeding
- Did it work? Yes → Phase 4
- Didn't work? Form a new hypothesis
- Don't add more fixes on top
4. When you don't know
- Say "I don't understand X"
- Don't pretend to know
- Ask for help
- Research more
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create a failing test case
- Simplest reproduction possible
- Automated test if possible
- One-off test script if no framework
- Must have this before fixing
- Use the superpowers:test-driven-development skill to write correct failing tests
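When no test framework exists, a one-off script is enough. A minimal sketch with a hypothetical buggy `parse_port` helper:

```shell
#!/usr/bin/env bash
# One-off failing test: encode the expected behavior before fixing anything.
parse_port() { echo "${1##*:}"; }   # hypothetical buggy helper: with no ':'
                                    # it echoes the whole input, not a default

expected="80"    # desired default when no port is given
actual=$(parse_port "localhost")
if [ "$actual" = "$expected" ]; then
  echo "PASS"
else
  echo "FAIL: expected '$expected', got '$actual'"  # a real script would also exit 1
fi
```

The script prints FAIL today; after the fix it must print PASS, and it stays around as a regression check.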
2. Implement a single fix
- Address the identified root cause
- One change at a time
- No "while I'm here" improvements
- No bundled refactoring
3. Verify the fix
- Do tests pass now?
- No other tests broken?
- Is the problem truly solved?
4. If the fix doesn't work
- Stop
- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new info
- If ≥ 3: Stop and question the architecture (Step 5 below)
- Don't attempt fix #4 without architectural discussion
5. If 3+ fixes fail: Architectural Problem
Patterns indicating architectural issues:
- Each fix reveals new shared state/coupling/problems in different places
- Fix requires "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
Stop and ask fundamentals:
- Is this pattern fundamentally sound?
- Are we "sticking with it purely out of inertia"?
- Should we refactor the architecture or keep fixing symptoms?
Discuss with your human partner before attempting more fixes
This isn't a failed hypothesis - it's a flawed architecture.
Red Flags - Stop and Follow the Process
If you catch yourself thinking:
- "Quick fix now, investigate later"
- "Try changing X to see if it works"
- "Add multiple changes, run tests"
- "Skip testing, I'll verify manually"
- "Probably X, let me fix that"
- "I don't fully understand, but this might work"
- "The pattern says X, but I'll adapt it differently"
- "Here are the main issues: [list of uninvestigated fixes]"
- Propose solutions before tracing data flow
- "Just try one more fix" (after 2+ attempts)
- Each fix reveals new problems in different places
All of these mean: Stop. Return to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)
Signals from Your Human Partner That You're Doing It Wrong
Watch for these redirects:
- "Is that what's happening?" - You assumed without verifying
- "Would it tell us...?" - You should add evidence collection
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "Are we stuck?" (frustration) - Your approach isn't working
When you see these: Stop. Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "The problem is simple, no need for process" | Simple problems still have root causes. The process is fast for simple bugs. |
| "It's an emergency, no time for process" | Systematic debugging is faster than guess-and-check whack-a-mole. |
| "Try it first, then investigate" | The first fix sets the pattern. Do it right from the start. |
| "I'll write tests after confirming the fix works" | Untested fixes don't last. Tests prove it first. |
| "Multiple fixes at once saves time" | Can't isolate what works. Introduces new bugs. |
| "The reference is too long, I'll adapt the pattern" | Partial understanding guarantees mistakes. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more fix" (after 2+ failures) | 3+ failures = architectural problem. Stop patching and question the pattern. |
Quick Reference
| Phase | Main Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand what and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal testing | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Error resolved, tests pass |
When the Process Shows "No Root Cause"
If systematic investigation shows the problem is truly environmental, time-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retries, timeouts, error messages)
- Add monitoring/logging for future investigation
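For genuinely environmental failures, a bounded retry with backoff is the appropriate handling. A minimal sketch, where `flaky_call` is a stand-in that succeeds on its third attempt:

```shell
#!/usr/bin/env bash
# Bounded retry with exponential backoff for an environmental failure.
attempts_file=$(mktemp)
flaky_call() {   # stand-in: succeeds on the third attempt
  n=$(cat "$attempts_file"); n=$(( ${n:-0} + 1 ))
  echo "$n" > "$attempts_file"
  [ "$n" -ge 3 ]
}

max=5
delay=1
for i in $(seq 1 "$max"); do
  if flaky_call; then
    echo "succeeded on attempt $i"
    break
  fi
  echo "attempt $i failed; retrying in ${delay}s" >&2
  sleep "$delay"
  delay=$((delay * 2))
done
```

Note the bound: unbounded retries just hide the problem you documented above.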
But: 95% of "no root cause" cases are incomplete investigations.
Companion Techniques
These techniques are part of systematic debugging and can be found in this directory:
- Backward tracing - Trace errors backward through the call stack to find the original trigger
- Defense-in-depth validation - Add multiple layers of validation after finding the root cause
- condition-based-waiting.md - Replace arbitrary timeouts with conditional polling
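The condition-based-waiting idea in one sketch: poll a readiness condition against a deadline instead of an arbitrary `sleep 30` (the marker file stands in for a real health check such as an HTTP endpoint or a log line).

```shell
#!/usr/bin/env bash
# Poll for readiness with a deadline instead of a fixed sleep.
ready_marker=$(mktemp -u)              # -u: path only; the file does not exist yet
( sleep 1; touch "$ready_marker" ) &   # simulated service becoming ready

deadline=$(( $(date +%s) + 10 ))
ready=no
until [ -e "$ready_marker" ]; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "timed out waiting for readiness" >&2
    break
  fi
  sleep 0.2                            # short poll interval, long overall deadline
done
[ -e "$ready_marker" ] && ready=yes
echo "ready: $ready"
```

This waits only as long as needed and still fails loudly when the condition never holds.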
Related Skills:
- superpowers:test-driven-development - For creating failing test cases (Phase 4, Step 1)
- superpowers:verify-before-complete - Verify fixes work before declaring success
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fix approach: 2-3 hours of whack-a-mole
- First-fix success rate: 95% vs 40%
- New bugs introduced: Near zero vs common