Systematic Debugging

Overview

Random fixes waste time and introduce new bugs. Hasty patches only mask the underlying issue.

Core Principle: Always find the root cause before attempting a fix. Fixing only the symptom is a failure.

Going through the motions defeats the purpose of debugging.

Non-Negotiable Rule

No root cause investigation, no fix proposal

You cannot propose a fix until you have completed Phase 1 (Root Cause Investigation).

When to Use

Use for any technical issue:

Test failures
Production bugs
Abnormal behavior
Performance issues
Build failures
Integration issues

Its use is mandatory when:

You're under time pressure (emergencies are when people are most likely to guess at fixes)
You're thinking "a small change will fix it"
You've already tried multiple fixes
The last fix didn't work
You don't fully understand the problem

Do NOT skip the process in these cases either:

The problem seems simple (even simple bugs have root causes)
You're in a hurry (rushing leads to rework)
Leadership demands an immediate fix (systematic debugging is faster than trial and error)

Four Phases

You must complete each phase before moving to the next.

Phase 1: Root Cause Investigation

Before attempting any fix:

Read error messages carefully
- Don't skip errors or warnings
- They often contain the solution directly
- Read the full stack trace
- Note line numbers, file paths, error codes
Reproduce reliably
- Can you reliably trigger it?
- What are the exact reproduction steps?
- Does it reproduce every time?
- If you can't reproduce → collect more data, don't guess
Check Recent Changes
- What changes could have caused this issue?
- git diff, recent commits
- New dependencies, configuration changes
- Environment differences
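
The question "Does it reproduce every time?" can be answered with a count rather than an impression. The following is a minimal sketch, not a prescribed tool: `repro_rate` is a hypothetical helper name, and it assumes the failure shows up as a non-zero exit status of the command being run.

```shell
# Hypothetical helper: run a command N times and report how often it fails.
# Usage: repro_rate <runs> <command> [args...]
# Note: plain POSIX sh has no local variables, so these names are global.
repro_rate() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "failed ${failures}/${runs} runs"
}
```

For example, `repro_rate 20 ./flaky_test.sh` (a hypothetical script) would show whether the failure is deterministic or intermittent before any fix is attempted.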

Collect Evidence in Multi-Component Systems

When the system has multiple components (CI → Build → Signing, API → Service → Database):

Before proposing a fix, add diagnostic instrumentation:

For each component boundary:
  - Record data entering the component
  - Record data leaving the component
  - Verify environment/configuration propagation
  - Check state at each layer

Execute once to collect evidence, identify where the break occurs
Then analyze evidence to locate the faulty component
Then investigate that component in depth

Example (multi-layer system):

```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Print only SET/UNSET so the secret value never reaches the logs
if [ -n "${IDENTITY:-}" ]; then echo "IDENTITY: SET"; else echo "IDENTITY: UNSET"; fi

# Layer 2: Build script
echo "=== Env vars in build script: ==="
# Match on the variable name only; do not echo the value
env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer has the issue (for example: secrets → workflow ✓, workflow → build ✗).
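
The set/unset check at each layer can be factored into a small helper so that every boundary reports in the same format. This is only a sketch: `report_var` is a hypothetical name, and it assumes that "set" means "exported into the process environment" (an exported but empty variable still counts as SET).

```shell
# Hypothetical helper: report whether a variable is exported into the
# environment, without ever printing its value (it may be a secret).
# Usage: report_var <NAME>
report_var() {
    # grep reads the whole listing and prints nothing; only its status is used.
    if env | grep "^$1=" >/dev/null; then
        echo "$1: SET"
    else
        echo "$1: UNSET"
    fi
}
```

Calling `report_var IDENTITY` at the top of each layer's script gives one comparable line per boundary, which makes the first layer that reports UNSET easy to spot.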

Trace Data Flow

When errors occur deep in the call stack:
See `root-cause-tracing.md` in this directory for full reverse-tracing techniques.
Short version:
- Where did the incorrect value originate?
- Who called this with the incorrect value?
- Keep tracing upward until you find the source
- Fix at the source, not at the symptom

Phase 2: Pattern Analysis

Find patterns before fixing:

Find working examples
- Look for similar working code in the same codebase
- What working code resembles the problematic code?
Compare with reference implementations
- If implementing a pattern, read the reference implementation fully
- Don't skim — read line by line
- Fully understand the pattern before applying it
Identify differences
- What's different between the working code and the problematic code?
- List every difference, no matter how small
- Don't assume "that can't affect anything"
Understand dependencies
- What other components does this feature require?
- What settings, configurations, environments are needed?
- What implicit assumptions does it have?
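
"List every difference" is easier to honor when a tool produces the list. As one possible sketch, assuming the working and failing cases can each be captured as a line-oriented text file (for example a config file or an `env` dump), the hypothetical helper `list_differences` below sorts both files first so that mere ordering does not hide or fake a difference.

```shell
# Hypothetical helper: list every line that differs between a known-good
# file and the failing one, ignoring line order.
# Returns 0 when the sorted contents are identical, 1 when they differ.
# Usage: list_differences <working_file> <broken_file>
list_differences() {
    good_sorted=$(mktemp)
    bad_sorted=$(mktemp)
    sort "$1" >"$good_sorted"
    sort "$2" >"$bad_sorted"
    # Lines starting with "<" exist only in the working file,
    # lines starting with ">" only in the broken one.
    status=0
    diff "$good_sorted" "$bad_sorted" || status=$?
    rm -f "$good_sorted" "$bad_sorted"
    return "$status"
}
```

Every line of the output is a candidate cause; none should be dismissed before it has been checked.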

Phase 3: Hypothesis and Verification

Scientific method:

Form a single hypothesis
- State clearly: "I believe X is the root cause because Y"
- Write it down
- Be specific, not vague
Minimal testing
- Make the smallest change to verify the hypothesis
- Change only one variable at a time
- Don't fix multiple issues at once
Verify before proceeding
- Did it work? Yes → move to Phase 4
- Didn't work? Form a new hypothesis
- Don't stack more fixes on top
When you're unsure
- Say "I don't understand X"
- Don't pretend you know
- Ask for help
- Do more research
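
"Change only one variable at a time" can be made mechanical. The sketch below assumes the hypothesis is about a single environment variable and that success or failure is visible in the exit status; `try_one_change` is a hypothetical name, not part of any existing tool.

```shell
# Hypothetical helper: run a command once unchanged and once with exactly
# one environment variable overridden, then report both exit statuses.
# Usage: try_one_change <VAR> <value> <command> [args...]
try_one_change() {
    var=$1
    value=$2
    shift 2
    baseline=0
    "$@" >/dev/null 2>&1 || baseline=$?
    variant=0
    # env applies the override to this single invocation only.
    env "$var=$value" "$@" >/dev/null 2>&1 || variant=$?
    echo "baseline=${baseline} variant=${variant}"
}
```

If the baseline fails and the variant succeeds, the hypothesis has support. If both results are the same, the hypothesis is discarded and a new one is formed, rather than stacking a second change on top.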

Phase 4: Implementation

Fix the root cause, not the symptom:

Create a failing test case
- Minimal reproduction
- Use automated tests whenever possible
- Write a one-time test script if no test framework is available
- Must have a test before fixing
- Use the `superpowers:test-driven-development` skill to write a proper failing test
Implement a single fix
- Fix the identified root cause
- Change only one thing at a time
- Don't make "while I'm at it" optimizations
- Don't bundle refactoring
Verify the fix
- Does the test pass now?
- Do all the other tests still pass?
- Is the problem truly resolved?
If the fix doesn't work
- Stop
- Count: How many fixes have you tried?
- Fewer than 3: Go back to Phase 1, re-analyze with new information
- 3 or more: Stop and question the architecture (see "If 3+ fixes fail" below)
- Don't attempt a 4th fix without architectural discussion
If 3+ fixes fail: Question the architecture

These patterns indicate architectural issues:
- Each fix reveals new shared state/coupling/issues elsewhere
- Fix requires "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
Stop and question fundamental issues:
- Is this pattern fundamentally sound?
- Are we sticking to a bad solution out of inertia?
- Should we refactor the architecture or keep patching symptoms?
Discuss with your partner before attempting more fixes

This isn't hypothesis failure — this is architectural failure.
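
For the "one-time test script" option in Phase 4, a few lines of shell are often enough when no test framework is available. This is a minimal sketch under that assumption; `assert_eq` is a hypothetical helper, and the command in the usage comment is a placeholder for whatever reproduces the bug.

```shell
# Hypothetical assertion helper for a one-off regression script.
# Prints PASS or FAIL and returns non-zero on mismatch.
# Usage: assert_eq <expected> <actual> <label>
assert_eq() {
    if [ "$1" = "$2" ]; then
        echo "PASS: $3"
    else
        echo "FAIL: $3 (expected '$1', got '$2')"
        return 1
    fi
}

# Example call (placeholder command, run BEFORE the fix so it fails first):
#   assert_eq "42" "$(./compute sample.txt)" "sample input yields 42"
```

The script should fail before the fix and pass after it. A test that already passes before the fix is not reproducing the bug.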

Red Lines — Stop and Follow the Process

If you catch yourself thinking:

"I'll fix it temporarily and investigate later"
"Let me try changing X to see if it works"
"Change multiple things at once and run tests"
"Skip tests, I'll verify manually"
"It's probably X, let me fix it"
"I don't fully understand, but this should work"
"The pattern says X, but I'll use it differently"
"The main issues are: [listing fixes without investigation]"
Proposing solutions without tracing data flow
"Just try one more fix" (after 2+ failed attempts)
Each fix reveals new issues in different places

All of these mean: Stop. Go back to Phase 1.

If 3+ fixes fail: Question the architecture (see "If 3+ fixes fail" in Phase 4)

Partner Signals — Your Approach Is Wrong

Watch for these reminders:

"Isn't it...?" — You're making assumptions without verification
"Can it tell us...?" — You should collect evidence first
"Don't guess" — You're proposing fixes without understanding
"Think deeper" — You need to question fundamental issues, not just symptoms
"Are we stuck?" (frustrated tone) — Your approach isn't working

When you see these signals: Stop. Go back to Phase 1.

Common Excuses

Excuse	Reality
"The problem is simple, no need to follow the process"	Even simple problems have root causes. For simple bugs, the process is quick to complete.
"It's an emergency, no time for the process"	Systematic debugging is faster than trial and error.
"Try it first, then investigate"	The first fix sets the tone. Do it right from the start.
"Write tests after confirming the fix works"	Fixes without tests don't last. Writing tests first proves the fix works.
"Fixing multiple issues at once saves time"	You can't isolate what worked. It also introduces new bugs.
"The reference implementation is too long, I'll modify it myself"	Partial understanding inevitably leads to bugs. Read it fully.
"I see the problem, let me fix it"	Seeing the symptom ≠ understanding the root cause.
"Just try one more time" (after 2+ failures)	3+ failures = architectural issue. Question the pattern, don't keep fixing.

Cheat Sheet

Phase	Key Activities	Pass Criteria
1. Root Cause	Read errors, reproduce, check changes, collect evidence	Understand what went wrong and why
2. Pattern	Find working examples, compare	Identify differences
3. Hypothesis	Form theory, minimal verification	Hypothesis is verified or new hypothesis formed
4. Implementation	Create test, fix, verify	Bug is fixed, tests pass

When the Process Says "No Root Cause Found"

If systematic investigation reveals the issue is indeed environment-related, timing-related, or caused by external factors:

You've completed the process
Document what you investigated
Implement appropriate handling (retries, timeouts, error messages)
Add monitoring/logging for future investigation

However: 95% of "no root cause found" cases are due to insufficient investigation.
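
When the cause really is external, the "appropriate handling" listed above might look like the following sketch. It assumes a transient failure that is safe to retry, and `retry` is a hypothetical helper. Each failed attempt is logged to stderr so that the evidence is kept for later investigation.

```shell
# Hypothetical helper: retry a command up to <max_attempts> times, sleeping
# <delay_s> seconds between attempts, and log every failure to stderr.
# Usage: retry <max_attempts> <delay_s> <command> [args...]
retry() {
    max_attempts=$1
    delay_s=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay_s"
    done
}
```

A retry wrapper handles a documented external cause. It is not a substitute for the investigation, and the logged failures are the input for revisiting it if the pattern changes.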

Supporting Techniques

The following techniques are part of systematic debugging and can be found in this directory:

root-cause-tracing.md
- Trace bugs backward along the call stack to find the initial trigger
defense-in-depth.md
- After finding the root cause, add validation at multiple levels
condition-based-waiting.md
- Replace hardcoded wait times with conditional polling
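
As an illustration of the idea behind conditional polling (the contents of `condition-based-waiting.md` are not reproduced here, so this is an independent sketch), the hypothetical helper `wait_until` below polls a condition until it succeeds or a deadline passes, instead of sleeping for a fixed guess. It assumes `date +%s` is available, which holds on common Linux and macOS systems.

```shell
# Hypothetical helper: poll a condition until it succeeds or a timeout expires.
# Returns 0 as soon as the condition holds, 1 if the deadline passes first.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    deadline=$(( $(date +%s) + timeout_s ))
    while ! "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval_s"
    done
    return 0
}
```

For example, a fixed `sleep 10` before checking for a marker file could become `wait_until 30 1 test -e /tmp/server.ready` (a hypothetical path). This returns as soon as the file exists and fails explicitly if it never appears.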

Related Skills:

superpowers:test-driven-development - Used to create failing test cases (Phase 4, "Create a failing test case")
superpowers:verification-before-completion - Verify the fix actually works before declaring success

Practical Results

Data from debugging practices:

Systematic approach: 15-30 minutes to fix
Random fix approach: 2-3 hours of trial and error
First-fix success rate: 95% (systematic) vs 40% (random fixes)
New bugs introduced: almost none (systematic) vs frequent (random fixes)
