System Debugging
Overview
Random fixes waste time and introduce new bugs. Quick patches mask the root problem.
Core Principle: Always find the root cause before attempting a fix. Symptom-based fixes fail.
Cutting corners on this process defeats the purpose of debugging.
Iron Rule
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
You cannot propose a fix if you haven't completed the first phase.
When to Use
For any technical issue:
- Test failures
- Production errors
- Unexpected behavior
- Performance issues
- Build failures
- Integration problems
Use this especially when:
- Under time pressure (urgency makes it easy to guess)
- "Just a quick fix" seems obvious
- You've already tried multiple fixes
- Previous fixes didn't work
- You don't fully understand the problem
Do NOT skip this when:
- The problem seems simple (simple bugs also have root causes)
- You're in a hurry (hurry guarantees rework)
- Managers want an immediate fix (systematic approach is faster than chaos)
Four Phases
You must complete each phase before moving to the next.
Phase 1: Root Cause Investigation
Before attempting any fix:
1. Read error messages carefully
- Don't skip past errors or warnings
- They often contain precise solutions
- Read the full stack trace
- Note line numbers, file paths, error codes
2. Reproduce consistently
- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If non-reproducible → collect more data, don't guess
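Reproducibility can be measured rather than guessed. A minimal sketch: run the reproduction repeatedly and count failures (`./repro.sh` is a hypothetical stand-in for your actual repro steps).

```shell
#!/usr/bin/env bash
# Measure how reliably the bug reproduces instead of eyeballing it.
repro() { ./repro.sh; }   # hypothetical stand-in: replace with your real repro steps

runs=10
failures=0
for i in $(seq 1 "$runs"); do
  repro >/dev/null 2>&1 || failures=$((failures + 1))
done
echo "failed $failures of $runs runs"
```

A 10/10 failure rate means you can debug directly; an intermittent rate means collect more data before forming any hypothesis.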
3. Check recent changes
- What change might have caused this?
- Git diff, recent commits
- New dependencies, configuration changes
- Environment differences
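A sketch of that git archaeology, run here against a throwaway repo so the commands work anywhere; in practice you would run them inside the affected repository.

```shell
#!/usr/bin/env bash
# Survey recent history for the change that introduced the regression.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com \
  commit -q --allow-empty -m "baseline: tests green"
git -c user.name=dev -c user.email=dev@example.com \
  commit -q --allow-empty -m "suspect: bump dependency"

git log --oneline -5        # what landed recently?
git diff --stat HEAD~1      # which files did the newest commit touch?
count=$(git rev-list --count HEAD)
echo "commits inspected: $count"
```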
4. Gather evidence in multi-component systems
When the system has multiple components (CI → Build → Signing, API → Service → Database):
Before proposing a fix, add diagnostic tools:
For EACH component boundary:
- Log what data enters component
- Log what data exits component
- Verify environment/config propagation
- Check state at each layer
Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component
Example (multi-layer system):
```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
[ -n "${IDENTITY:-}" ] && echo "IDENTITY: SET" || echo "IDENTITY: UNSET"

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep -q '^IDENTITY=' && echo "IDENTITY present" || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer failed (e.g. Secrets → Workflow ✓, Workflow → Build ✗). Note: log only SET/UNSET, never the secret's value.
5. Trace data flow
When errors are deep in the call stack:
See this directory for the full backward tracing techniques.
Quick version:
- Where does the bad value come from?
- What called this with the bad value?
- Keep tracing until you find the source
- Fix at the source, not the symptom
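The quick version above can be sketched with three hypothetical layers: log the suspect value at each boundary, then read the output bottom-up to find the first layer where it is already wrong.

```shell
#!/usr/bin/env bash
# Backward tracing sketch: instrument each boundary, then walk the log upward.
layer3_use()  { echo "layer3 received: '$1'"; }
layer2_pass() { echo "layer2 received: '$1'"; layer3_use "$1"; }
layer1_read() {
  CONFIG_PATH=""   # planted root cause: layer 1 emits an empty path
  echo "layer1 produced: '$CONFIG_PATH'"
  layer2_pass "$CONFIG_PATH"
}

layer1_read
# layer3 got '' <- layer2 got '' <- layer1 produced '': fix layer 1, not layer 3.
```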
Phase 2: Pattern Analysis
Find patterns before fixing:
1. Look for working examples
- Find similar working code in the same codebase
- What works that's similar to what's broken?
2. Compare to references
- If implementing a pattern, read the reference implementation fully
- Don't skim - read every line
- Understand the pattern fully before applying
3. Identify differences
- What's different between working and broken?
- List every difference, no matter how small
- Don't assume "that doesn't matter"
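Difference-hunting can be made mechanical instead of visual. A sketch with illustrative config files: diff the working and broken setups and treat every differing line as a suspect.

```shell
#!/usr/bin/env bash
# Compare a working and a broken configuration line by line.
work=$(mktemp); broke=$(mktemp)
cat > "$work" <<'EOF'
NODE_ENV=production
RETRIES=3
TIMEOUT=30
EOF
cat > "$broke" <<'EOF'
NODE_ENV=production
RETRIES=3
TIMEOUT=5
EOF

# Every differing line is a suspect -- including the ones that "don't matter".
diff "$work" "$broke" || true
```

The same trick works on live state via process substitution, e.g. `diff <(env | sort) <(ssh broken-host env | sort)` (hypothetical host name).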
4. Understand dependencies
- What other components does this require?
- What settings, configurations, environment?
- What assumptions does it make?
Phase 3: Hypothesis and Testing
Scientific Method:
1. Form a single hypothesis
- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague
2. Minimal testing
- Make the smallest possible change to test the hypothesis
- One variable at a time
- Don't fix multiple issues at once
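As a sketch, with a hypothetical `run_test` whose outcome depends on a single flag, one-variable-at-a-time testing looks like this:

```shell
#!/usr/bin/env bash
# Flip exactly one variable between runs; everything else stays fixed.
run_test() {   # hypothetical stand-in: fails only when caching is enabled
  [ "${FEATURE_CACHE:-off}" = "off" ]
}

FEATURE_CACHE=off
run_test && echo "baseline (cache off): pass"

FEATURE_CACHE=on    # the single change under test
run_test || echo "cache on: fail -> hypothesis supported"
```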
3. Verify before proceeding
- Did it work? Yes → Phase 4
- Didn't work? Form a new hypothesis
- Don't add more fixes on top
4. When you don't know
- Say "I don't understand X"
- Don't pretend to know
- Ask for help
- Research more
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create a failing test case
- Simplest reproduction possible
- Automated test if possible
- One-off test script if no framework
- Must have this before fixing
- Use the superpowers:test-driven-development skill to write correct failing tests
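When no test framework exists, a one-off script is enough. A minimal sketch with a hypothetical buggy `parse_port` helper:

```shell
#!/usr/bin/env bash
# One-off failing test: encode the expected behavior before fixing anything.
parse_port() { echo "${1##*:}"; }   # hypothetical buggy helper: with no ':'
                                    # it echoes the whole input, not a default

expected="80"    # desired default when no port is given
actual=$(parse_port "localhost")
if [ "$actual" = "$expected" ]; then
  echo "PASS"
else
  echo "FAIL: expected '$expected', got '$actual'"  # a real script would also exit 1
fi
```

The script prints FAIL today; after the fix it must print PASS, and it stays around as a regression check.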
2. Implement a single fix
- Address the identified root cause
- One change at a time
- No "while I'm here" improvements
- No bundled refactoring
3. Verify the fix
- Do tests pass now?
- No other tests broken?
- Is the problem truly solved?
4. If the fix doesn't work
- Stop
- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new info
- If ≥ 3: Stop and question the architecture (Step 5 below)
- Don't attempt fix #4 without architectural discussion
5. If 3+ fixes fail: Architectural Problem
Patterns indicating architectural issues:
- Each fix reveals new shared state/coupling/problems in different places
- Fix requires "massive refactoring" to implement
- Each fix creates new symptoms elsewhere
Stop and ask fundamentals:
- Is this pattern fundamentally sound?
- Are we "sticking with it purely out of inertia"?
- Should we refactor the architecture or keep fixing symptoms?
Discuss with your human partner before attempting more fixes
This isn't a failed hypothesis - it's a flawed architecture.
Red Flags - Stop and Follow the Process
If you catch yourself thinking:
- "Quick fix now, investigate later"
- "Try changing X to see if it works"
- "Add multiple changes, run tests"
- "Skip testing, I'll verify manually"
- "Probably X, let me fix that"
- "I don't fully understand, but this might work"
- "The pattern says X, but I'll adapt it differently"
- "Here are the main issues: [list of uninvestigated fixes]"
- Propose solutions before tracing data flow
- "Just try one more fix" (after 2+ attempts)
- Each fix reveals new problems in different places
All of these mean: Stop. Return to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)
Signals from Your Human Partner That You're Doing It Wrong
Watch for these redirects:
- "Is that what's happening?" - You assumed without verifying
- "Would it tell us...?" - You should add evidence collection
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "Are we stuck?" (frustration) - Your approach isn't working
When you see these: Stop. Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "The problem is simple, no need for process" | Simple problems still have root causes. The process is fast for simple bugs. |
| "It's an emergency, no time for process" | Systematic debugging is faster than guess-and-check whack-a-mole. |
| "Try it first, then investigate" | The first fix sets the pattern. Do it right from the start. |
| "I'll write tests after confirming the fix works" | Untested fixes don't last. Tests prove it first. |
| "Multiple fixes at once saves time" | Can't isolate what works. Introduces new bugs. |
| "The reference is too long, I'll adapt the pattern" | Partial understanding guarantees mistakes. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more fix" (after 2+ failures) | 3+ failures = architectural problem. Stop patching and question the pattern. |
Quick Reference
| Phase | Main Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand what and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal testing | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Error resolved, tests pass |
When the Process Shows "No Root Cause"
If systematic investigation shows the problem is truly environmental, time-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retries, timeouts, error messages)
- Add monitoring/logging for future investigation
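For genuinely environmental failures, a bounded retry with backoff is the appropriate handling. A minimal sketch, where `flaky_call` is a stand-in that succeeds on its third attempt:

```shell
#!/usr/bin/env bash
# Bounded retry with exponential backoff for an environmental failure.
attempts_file=$(mktemp)
flaky_call() {   # stand-in: succeeds on the third attempt
  n=$(cat "$attempts_file"); n=$(( ${n:-0} + 1 ))
  echo "$n" > "$attempts_file"
  [ "$n" -ge 3 ]
}

max=5
delay=1
for i in $(seq 1 "$max"); do
  if flaky_call; then
    echo "succeeded on attempt $i"
    break
  fi
  echo "attempt $i failed; retrying in ${delay}s" >&2
  sleep "$delay"
  delay=$((delay * 2))
done
```

Note the bound: unbounded retries just hide the problem you documented above.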
But: 95% of "no root cause" cases are incomplete investigations.
Companion Techniques
These techniques are part of systematic debugging and can be found in this directory:
- Backward tracing - Trace errors backward through the call stack to find the original trigger
- Defense-in-depth validation - Add multiple layers of validation after finding the root cause
- condition-based-waiting.md - Replace arbitrary timeouts with conditional polling
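The condition-based-waiting idea in one sketch: poll a readiness condition against a deadline instead of an arbitrary `sleep 30` (the marker file stands in for a real health check such as an HTTP endpoint or a log line).

```shell
#!/usr/bin/env bash
# Poll for readiness with a deadline instead of a fixed sleep.
ready_marker=$(mktemp -u)              # -u: path only; the file does not exist yet
( sleep 1; touch "$ready_marker" ) &   # simulated service becoming ready

deadline=$(( $(date +%s) + 10 ))
ready=no
until [ -e "$ready_marker" ]; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "timed out waiting for readiness" >&2
    break
  fi
  sleep 0.2                            # short poll interval, long overall deadline
done
[ -e "$ready_marker" ] && ready=yes
echo "ready: $ready"
```

This waits only as long as needed and still fails loudly when the condition never holds.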
Related Skills:
- superpowers:test-driven-development - For creating failing test cases (Phase 4, Step 1)
- superpowers:verify-before-complete - Verify fixes work before declaring success
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fix approach: 2-3 hours of whack-a-mole
- First-fix success rate: 95% vs 40%
- New bugs introduced: Near zero vs common