# Systematic Debugging

## Overview

Random fixes are a waste of time and introduce new bugs, and hasty patches only mask underlying issues.

**Core principle:** always identify the root cause before attempting a fix. Fixing only the symptom is a failure; going through the motions defeats the purpose of debugging.
## Non-Negotiable Rule

**No fix proposals without root cause investigation.** You cannot propose a fix until you have completed Phase 1.
## When to Use
For any technical issue:
- Test failures
- Production bugs
- Abnormal behavior
- Performance issues
- Build failures
- Integration problems
Mandatory use in these situations:
- Time constraints (emergencies are where guesswork fixes happen most often)
- Thinking "a small change will fix it"
- Having tried multiple fixes already
- The last fix didn't work
- You don't fully understand the problem
Do NOT skip even if:
- The problem seems simple (simple bugs still have root causes)
- You're in a hurry (rushing leads to rework)
- A leader demands an immediate fix (systematic debugging is faster than repeated guesses)
## Four Phases
You must complete each phase before moving to the next.
### Phase 1: Root Cause Investigation

Before attempting any fixes:
1. **Read error messages carefully**
   - Don't skip errors or warnings; they often contain the solution directly
   - Read the full stack trace
   - Note line numbers, file paths, and error codes
2. **Reproduce the issue reliably**
   - Can you trigger it on demand?
   - What are the exact reproduction steps?
   - Does it reproduce every time?
   - If you can't reproduce it, collect more data; don't guess
3. **Check recent changes**
   - What changes could have caused this issue?
   - `git diff`, recent commits
   - New dependencies, configuration changes
   - Environment differences
4. **Collect evidence in multi-component systems**
   When the system has multiple components (CI → Build → Signing, API → Service → Database), add diagnostic instrumentation before proposing a fix. At each component boundary:
   - Record data entering the component
   - Record data leaving the component
   - Verify environment/configuration propagation
   - Check state at each layer
   Execute once to collect the evidence, analyze it to identify where the break occurs, then investigate the faulty component in depth.
   Example (multi-layer system):

```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Report presence without printing the secret itself
if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer has the issue (secrets → workflow ✓, workflow → build ✗).
5. **Trace data flow**
   When errors occur deep in the call stack, see the complete reverse tracing techniques in this directory. Brief version:
   - Where did the incorrect value originate?
   - Who called this with the incorrect value?
   - Keep tracing upward until you find the source
   - Fix at the source, not at the symptom
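As a minimal sketch of reverse tracing: suppose a bad `TIMEOUT=0` surfaces deep in a shell pipeline. Searching for every assignment walks you back up the data flow to the source. The file names, the variable, and the tiny "codebase" below are all hypothetical:

```shell
# Build a tiny hypothetical codebase: config.sh is where the bad
# value originates; worker.sh only consumes it.
mkdir -p demo/src
printf 'TIMEOUT=0   # source of the bad value\n' > demo/src/config.sh
printf '. ./config.sh\nrun_with_timeout "$TIMEOUT"\n' > demo/src/worker.sh

# Trace upward: which file actually assigns TIMEOUT?
origin=$(grep -rl "TIMEOUT=" demo/src)
echo "value originates in: $origin"
```

The fix then belongs in `config.sh` (the source), not in whatever consumer happened to crash.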
### Phase 2: Pattern Analysis

Identify patterns before fixing:
1. **Find working examples**
   - Look for similar working code in the same codebase
   - What working code resembles the problematic code?
2. **Compare with reference implementations**
   - If implementing a pattern, read the reference implementation fully
   - Don't skim; read line by line
   - Fully understand the pattern before applying it
3. **Identify differences**
   - What's different between the working code and the problematic code?
   - List every difference, no matter how small
   - Don't assume "that can't possibly matter"
4. **Understand dependencies**
   - What other components does this feature require?
   - What settings, configurations, and environments are needed?
   - What implicit assumptions does it have?
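To make "list every difference" concrete, here is a minimal sketch comparing a working setup against a broken one. The file names and keys are hypothetical stand-ins for whatever artifacts you are comparing (configs, env dumps, build flags):

```shell
# Two hypothetical configs: one captured from the working setup,
# one from the broken setup.
printf 'retries=3\ntimeout=30\ntls=on\n' > working.conf
printf 'retries=3\ntimeout=0\ntls=on\n'  > broken.conf

# diff exits non-zero when files differ, so guard it with || true
# and record every difference, no matter how small it looks.
diffs=$(diff working.conf broken.conf || true)
echo "$diffs"
```

Every line of that output is a candidate root cause; none of them gets dismissed as "that can't possibly matter" until verified.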
### Phase 3: Hypothesis & Verification

Apply the scientific method:
1. **Form a single hypothesis**
   - State it clearly: "I believe X is the root cause because Y"
   - Write it down
   - Be specific, not vague
2. **Test minimally**
   - Make the smallest change that verifies the hypothesis
   - Change only one variable at a time
   - Don't fix multiple issues simultaneously
3. **Verify before proceeding**
   - Did it work? Yes → move to Phase 4
   - Didn't work? Form a new hypothesis
   - Don't stack more fixes on top
4. **When you're unsure**
   - Say "I don't understand X"
   - Don't pretend you know
   - Ask for help
   - Do more research
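A single-variable hypothesis test can be sketched in shell. Here the hypothesis is "the job fails because LANG is unset"; `the_job` is a hypothetical stand-in for the real command, and exactly one variable changes between the two runs:

```shell
# Hypothesis: "the job fails because LANG is unset."
# the_job is a placeholder that fails exactly when LANG is empty,
# simulating the suspected behavior of the real command.
the_job() { [ -n "${LANG:-}" ]; }

unset LANG
if the_job; then before=pass; else before=fail; fi

# Change exactly one variable, nothing else.
LANG=C.UTF-8
if the_job; then after=pass; else after=fail; fi

echo "before=$before after=$after"
```

A fail → pass flip confirms the hypothesis; anything else means form a new one, not stack another change on top.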
### Phase 4: Implementation

Fix the root cause, not the symptom:
1. **Create a failing test case**
   - Write a minimal reproduction
   - Use automated tests whenever possible
   - Write a one-time test script if no test framework exists
   - You must have a test before fixing
   - Use the superpowers:test-driven-development skill to write standard failing tests
2. **Implement a single fix**
   - Fix the identified root cause
   - Change only one thing at a time
   - Don't make "while I'm at it" optimizations
   - Don't bundle refactoring
3. **Verify the fix**
   - Does the test pass now?
   - Do the other tests still pass?
   - Is the problem truly resolved?
4. **If the fix doesn't work**
   - Stop
   - Count how many fixes you have attempted
   - Fewer than 3: return to Phase 1 and re-analyze with the new information
   - 3 or more: stop and question the architecture (see Step 5 below)
   - Do not attempt a fourth fix without an architectural discussion
5. **If 3+ fixes fail: question the architecture**
   These patterns indicate architectural issues:
   - Each fix exposes new shared state, coupling, or issues elsewhere
   - The fix requires "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere
   Stop and question the fundamentals:
   - Is this pattern fundamentally sound?
   - Are we sticking with a wrong solution out of momentum?
   - Should we refactor the architecture or keep patching symptoms?
   Discuss with your partner before attempting more fixes. This is not hypothesis failure; it is architectural failure.
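Step 1's one-time test script (for when no test framework exists) can be as small as a function plus one executable assertion that fails while the bug exists and passes after the fix. Everything here is a hypothetical sketch, not an API from this document:

```shell
# One-time test script: turn the bug report into an executable assertion.
# parse_port is a hypothetical function under test; the report says it
# returns the host instead of the port.
parse_port() {
  # fixed implementation: keep everything after the last ':'
  printf '%s\n' "${1##*:}"
}

got=$(parse_port "db.internal:5432")
# Fails (exit 1) while the bug exists, passes after the root-cause fix.
[ "$got" = "5432" ] || { echo "FAIL: expected 5432, got $got"; exit 1; }
echo "PASS"
```

Run it once before the fix to confirm it fails, then again after; that before/after pair is the proof the root cause was actually addressed.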
## Red Lines: Stop and Follow the Process
If you catch yourself thinking:
- "Just patch it temporarily, I'll investigate later"
- "Try changing X to see if it works"
- "Change multiple things at once and run tests"
- "Skip testing, I'll verify manually"
- "It's probably X, let me fix it"
- "I don't fully understand, but this should work"
- "The pattern says X, but I'll use it differently"
- "The main issues are: [listing fixes without investigation]"
- Proposing solutions without tracing data flow
- "Try one more fix" (after 2+ failed attempts)
- Each fix exposes new issues in different places
All of these mean: Stop. Return to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)
## Signals from Your Partner That Your Approach Is Wrong
Watch for these reminders:
- "Isn't it this way?" – You're making assumptions without verification
- "Can it tell us...?" – You should collect evidence first
- "Stop guessing" – You're proposing fixes without understanding
- "Think deeper" – Question fundamental issues, not just symptoms
- "Are we stuck?" (frustrated tone) – Your approach isn't working
When you see these signals: Stop. Return to Phase 1.
## Common Excuses

| Excuse | Reality |
|---|---|
| "The problem is simple, no need to follow the process" | Even simple problems have root causes. For simple bugs, the process can be completed quickly. |
| "It's an emergency, no time for the process" | Systematic debugging is faster than repeated guesswork fixes. |
| "Try it first, then investigate" | The first fix sets the tone. Do it right from the start. |
| "Write tests after confirming the fix works" | Fixes without tests don't last. Write tests first to prove the fix works. |
| "Fixing multiple issues at once saves time" | You can't isolate what worked. It also introduces new bugs. |
| "The reference implementation is too long, I'll modify it myself" | Partial understanding inevitably leads to bugs. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Try one more time" (after 2+ failed attempts) | 3+ failures = architectural issue. Question the pattern, don't keep fixing. |
## Quick Reference Table

| Phase | Key Activities | Pass Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, collect evidence | Understand what went wrong and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theories, minimal verification | Hypothesis is verified or new hypothesis is formed |
| 4. Implementation | Create tests, fix, verify | Bug is fixed, tests pass |
## When the Process Says "No Root Cause Found"
If after systematic investigation the issue is truly environment-related, timing-related, or caused by external factors:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retries, timeouts, error messages)
- Add monitoring/logging for future investigation
But: 95% of "no root cause found" cases are due to insufficient investigation.
## Supporting Techniques

These techniques are part of systematic debugging and can be found in this directory:
- Trace bugs backward along the call stack to find the initial trigger point
- After finding the root cause, add validation at multiple levels
- condition-based-waiting.md: replace hard-coded wait times with conditional polling
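The condition-based-waiting idea can be sketched as a small polling helper: instead of `sleep 5` and hoping, poll the condition with a timeout. The helper and the marker-file condition below are illustrative, not taken from the referenced file:

```shell
# wait_for CONDITION_CMD TIMEOUT_SECONDS: poll until the condition
# command succeeds or the timeout expires, instead of a fixed sleep.
wait_for() {
  cmd=$1; timeout=$2; waited=0
  until eval "$cmd"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example condition: wait for a hypothetical readiness marker file.
touch service.ready
if wait_for '[ -f service.ready ]' 10; then
  echo "condition met"
else
  echo "timed out"
fi
```

The condition-based version returns as soon as the system is actually ready and fails loudly on timeout, whereas a fixed sleep is always either too short (flaky) or too long (slow).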
Related Skills:
- superpowers:test-driven-development - For creating failing test cases (Phase 4, Step 1)
- superpowers:verification-before-completion - Verify the fix actually works before declaring success
## Actual Results

Data from debugging practice:
- Systematic approach: 15-30 minutes to a fix
- Random-fix approach: 2-3 hours of trial and error
- First-fix success rate: 95% vs. 40%
- New bugs introduced: near zero vs. frequent