Diagnose
A discipline for handling tough bugs. Skip stages only if you have a clear reason to do so.
Phase 1 — Build a feedback loop
This is the core of this skill. The rest is just mechanical process. If you have a fast, deterministic, agent-runnable pass/fail signal that covers the bug, you can find the root cause; bisection, hypothesis-testing, and instrumentation all just consume this signal. Without it, staring at code for hours won't save you.
Invest disproportionate effort here. Be proactive. Be creative. Don't give up easily.
Approaches to building a feedback loop. Try them in roughly this order:
- Failing test, placed at a seam that can reach the bug; it can be a unit, integration, or e2e test.
- Curl / HTTP script targeting a running dev server.
- CLI invocation with fixture input, diff stdout against a known correct snapshot.
- Headless browser script (Playwright / Puppeteer) that drives the UI and asserts on DOM/console/network.
- Replay a captured trace. Save real network requests, payloads, or event logs to disk, then replay them in isolation against the relevant code path.
- Throwaway harness. Spin up the minimal subset of the system (one service, mocked dependencies) and trigger the bug's code path with a single function call.
- Property / fuzz loop. If the bug is "sometimes outputs wrong result", run 1000 random inputs to find failure patterns.
- Bisection harness. If the bug appears between two known states (commit, dataset, version), automate "start in state X, check, repeat" so that an automated bisection tool can drive it.
- Differential loop. Run old-version vs new-version (or two sets of configs) on the same input and diff the outputs.
- HITL bash script. Last resort. If someone has to click, use scripts/hitl-loop.template.sh to drive the human and keep the loop structured. Capture the output so that it is fed back to you.
Build the right feedback loop, and the bug is already 90% fixed.
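As an illustration of the property / fuzz and differential approaches listed above, here is a minimal sketch in Python. The functions `buggy_sort` and `reference_sort` are hypothetical stand-ins for the code under test and a trusted oracle; they do not come from any real codebase. The fixed seed is what keeps the loop deterministic.

```python
import random


def reference_sort(xs):
    """Trusted implementation used as the oracle (hypothetical)."""
    return sorted(xs)


def buggy_sort(xs):
    """Hypothetical code under test: it drops duplicates, so some inputs come out wrong."""
    return sorted(set(xs))


def fuzz_loop(candidate, oracle, runs=1000, seed=0):
    """Feed `runs` random inputs to both functions and return the inputs where they disagree."""
    rng = random.Random(seed)  # fixed seed: the same inputs are generated on every run
    failures = []
    for _ in range(runs):
        xs = [rng.randint(0, 9) for _ in range(rng.randint(0, 6))]
        if candidate(xs) != oracle(xs):
            failures.append(xs)
    return failures
```

Calling `fuzz_loop(buggy_sort, reference_sort)` returns the failing inputs; taking `min(failures, key=len)` gives a small one to start minimizing from. An empty list is the pass signal.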
Iterate on the feedback loop itself
Treat the feedback loop as a product. Once you have a loop, ask yourself:
- Can I make it faster? (Cache setup, skip irrelevant initialization, narrow test scope.)
- Can I make the signal sharper? (Assert on specific symptoms instead of "didn't crash".)
- Can I make it more deterministic? (Fix timestamps, set a fixed RNG seed, isolate filesystem, freeze network.)
A 30-second flaky loop is only slightly better than no loop. A 2-second deterministic loop is a debugging superpower.
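A minimal sketch of the determinism checklist above, assuming the code under test accepts its clock and RNG as arguments. `write_report` is a hypothetical function invented for illustration, and the frozen timestamp is arbitrary.

```python
import random
import tempfile
from datetime import datetime, timezone
from pathlib import Path

FIXED_NOW = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary frozen timestamp


def write_report(out_dir, rng, now):
    """Hypothetical code under test: clock and RNG are passed in, not read globally."""
    sample = rng.randint(0, 999)
    path = Path(out_dir) / "report.txt"
    path.write_text(f"{now.isoformat()} sample={sample}\n")
    return path.read_text()


def run_once(seed=1234):
    """One loop iteration with a fixed seed, a frozen clock, and a throwaway directory."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp:  # isolated filesystem, removed afterwards
        return write_report(tmp, rng, FIXED_NOW)
```

Two calls to `run_once()` produce identical output, so any difference between runs points at the code rather than the harness.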
Non-deterministic bugs
The goal isn't a clean reproduction, but increasing the reproduction rate. Loop the trigger 100 times, parallelize it, apply pressure, narrow timing windows, inject sleeps. A bug with 50% flakiness can be debugged; one with 1% can't. Keep increasing the reproduction rate until it's debuggable.
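One way to track progress on a flaky bug is to measure the reproduction rate directly. The sketch below assumes a `trigger` callable that returns True when the bug reproduced; `make_flaky_trigger` is a hypothetical stand-in that simulates a bug with a given probability.

```python
import random


def reproduction_rate(trigger, attempts=100):
    """Call `trigger` repeatedly and return the fraction of attempts that hit the bug."""
    hits = sum(1 for _ in range(attempts) if trigger())
    return hits / attempts


def make_flaky_trigger(probability, seed=0):
    """Stand-in for a real trigger: reproduces the bug with the given probability."""
    rng = random.Random(seed)
    return lambda: rng.random() < probability
```

Measure the rate before and after each change to the trigger (more parallelism, added pressure, injected sleeps) and keep only the changes that raise it.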
When you truly can't build a loop
Stop and state this clearly. List the approaches you've tried. Ask the user for: (a) access to a reproducible environment, (b) captured artifacts (HAR file, log dump, core dump, timestamped screen recording), or (c) permission to add temporary production instrumentation. Do not proceed to hypothesize without a loop.
Do not move to Phase 2 until you have a reliable loop.
Phase 2 — Reproduce
Run the loop. Watch the bug occur.
Verify:
Do not proceed until you've reproduced the bug.
Phase 3 — Hypothesize
Before testing any hypotheses, generate 3-5 prioritized hypotheses. Generating only one hypothesis will anchor you to the first seemingly reasonable idea.
Each hypothesis must be falsifiable: state the prediction it makes.
Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you can't state a prediction, this hypothesis is just a vibe. Discard it or refine it.
Show the prioritized list to the user before testing. They often have domain knowledge that can reorder it instantly ("We just deployed the change for #3") or know which hypotheses have already been ruled out. This checkpoint is cheap but saves time. Don't block here; if the user is AFK, proceed with your prioritization.
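The hypothesis list can be kept as plain data, which makes the "no prediction, no hypothesis" rule mechanical. This is only a sketch; the `Hypothesis` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    prediction: str  # empty string means no testable prediction was stated
    priority: int    # 1 = test first


def falsifiable_in_order(hypotheses):
    """Drop hypotheses without a prediction and return the rest in priority order."""
    kept = [h for h in hypotheses if h.prediction.strip()]
    return sorted(kept, key=lambda h: h.priority)
```

Anything filtered out here is a vibe in the sense above: refine it until it makes a prediction, or discard it.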
Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. Change only one variable at a time.
Tool preferences:
- If the environment supports it, prioritize Debugger / REPL inspection. One breakpoint is worth ten logs.
- Add targeted logs at boundaries that distinguish between hypotheses.
- Never "log everything and grep".
Add a unique prefix to every debug log so that cleaning up later becomes a single grep. Unlabeled logs get left behind; labeled logs get removed.
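A sketch of the cleanup check in Python, equivalent to a single grep over the source tree. The tag `[DBG-7f3a]` is a hypothetical example; any string unlikely to occur elsewhere in the codebase would do.

```python
from pathlib import Path

DEBUG_TAG = "[DBG-7f3a]"  # hypothetical tag, chosen only for this example


def find_tagged_lines(root, tag=DEBUG_TAG, pattern="*.py"):
    """Return (path, line_number, text) for every line under `root` that contains `tag`."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        for number, text in enumerate(path.read_text().splitlines(), start=1):
            if tag in text:
                hits.append((str(path), number, text.strip()))
    return hits
```

An empty result means every tagged log has been removed. Keep this script outside the scanned tree, since it contains the tag itself.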
Perf branch. For performance regressions, logs are usually not suitable. First establish a baseline measurement (timing harness, profiler, query plan), then bisect. Measure first, fix later.
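The bisection step can be sketched as a binary search over an ordered list of states (commits, versions, dataset snapshots). The sketch assumes the first state is good, the last is bad, and every state after the first bad one is also bad; `is_bad` is a caller-supplied check, for example "the timing harness exceeds the baseline by some threshold".

```python
def bisect_first_bad(states, is_bad):
    """Return the index of the first bad state.

    Assumes states[0] is good, states[-1] is bad, and that once a state is bad
    every later state is bad too. Neither endpoint is re-checked.
    """
    lo, hi = 0, len(states) - 1  # invariant: states[lo] is good, states[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(states[mid]):
            hi = mid
        else:
            lo = mid
    return hi
```

Each call to `is_bad` is one run of the feedback loop, so the number of runs grows with the logarithm of the number of states. That only pays off if the check is deterministic; a flaky check can send the search down the wrong half.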
Phase 5 — Fix + regression test
Write a regression test before implementing the fix, but only if a proper seam exists.
A proper seam means the test can reproduce the actual bug pattern at the call site. If the only available seam is too shallow (for example, the bug requires multiple callers but the test exercises only one, or a unit test cannot replicate the chain that triggers the bug), a regression test there will give false confidence.
If no proper seam exists, this is a finding in itself. Document it. The codebase architecture is preventing this bug from being locked in. Flag it for the next phase.
If a proper seam exists:
- Convert the minimized repro into a failing test at this seam.
- Watch it fail.
- Apply the fix.
- Watch it pass.
- Re-run the Phase 1 feedback loop against the original (unminimized) scenario.
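As a sketch of what the steps above produce, here is a regression test built from a minimized repro using Python's `unittest`. The function `normalize_tags` and the bug it once had (treating "Bug" and "bug " as distinct tags) are hypothetical; the function is shown in its fixed state.

```python
import unittest


def normalize_tags(tags):
    """Hypothetical fixed function: deduplicates case-insensitively, keeping first-seen order."""
    seen, result = set(), []
    for tag in tags:
        key = tag.strip().lower()
        if key and key not in seen:
            seen.add(key)
            result.append(key)
    return result


class NormalizeTagsRegression(unittest.TestCase):
    def test_duplicate_tags_differing_only_in_case(self):
        # Minimized repro: before the fix, this input is assumed to have returned two entries.
        self.assertEqual(normalize_tags(["Bug", "bug "]), ["bug"])
```

Run it with `python -m unittest` against the unfixed code first. A regression test that has never been seen failing proves nothing about the bug.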
Phase 6 — Cleanup + post-mortem
Before declaring completion, you must:
Then ask: What could have prevented this bug? If the answer involves an architecture change (no good test seams, tangled callers, hidden coupling), take this specific information to the /improve-codebase-architecture skill. Make suggestions after the fix has landed, not before; you now have more information than when you started.