rlm-debugging

RLM Systematic Debugging (Phase 1.5)

Overview

When a requirement involves fixing a bug or investigating unexpected behavior, ad-hoc fixes waste time and create new bugs. Systematic debugging finds the root cause before any fix is attempted.
Core Principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
The Iron Law for RLM Debugging:
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

When to Use

Mandatory Phase 1.5 when:
  • Requirement is a bug fix
  • Test failures need investigation
  • Unexpected behavior reported
  • Performance problems
  • Integration issues
Use ESPECIALLY when:
  • Under time pressure (emergencies make guessing tempting)
  • "Just one quick fix" seems obvious
  • Previous fix attempts failed
  • You don't fully understand the issue
Don't skip when:
  • Issue seems simple (simple bugs have root causes too)
  • You're in a hurry (rushing guarantees rework)
  • Manager wants it fixed NOW (systematic is faster than thrashing)

Phase 1.5 Insertion

Phase 1.5 is inserted between Phase 1 (AS-IS) and Phase 2 (TO-BE Plan) when debugging is required:
Phase 0: 00-requirements.md
Phase 1: 01-as-is.md (captures current behavior)
Phase 1.5: 01.5-root-cause.md ← NEW (this skill)
Phase 2: 02-to-be-plan.md (includes fix plan based on root cause)

The Four Steps of Systematic Debugging

dot
digraph debugging_phases {
    rankdir=TB;
    
    phase1 [label="Step 1:\nRoot Cause Investigation", shape=box, style=filled, fillcolor="#ffcccc"];
    phase2 [label="Step 2:\nPattern Analysis", shape=box, style=filled, fillcolor="#ffffcc"];
    phase3 [label="Step 3:\nHypothesis & Testing", shape=box, style=filled, fillcolor="#ccffcc"];
    phase4 [label="Step 4:\nFix Implementation", shape=box, style=filled, fillcolor="#ccccff"];
    
    phase1 -> phase2 -> phase3 -> phase4;
    
    // Feedback loops
    phase3 -> phase1 [label="hypothesis\nfailed", style=dashed];
    phase4 -> phase1 [label="fix\nfailed", style=dashed];
}

Step 1: Root Cause Investigation

BEFORE attempting ANY fix:

1.1 Read Error Messages Carefully

  • Don't skip past errors or warnings
  • They often contain the exact solution
  • Read stack traces completely
  • Note line numbers, file paths, error codes
Record in Phase 1.5 artifact:
markdown
## Error Analysis

Error Message: [verbatim]
Stack Trace: [key frames]
File:Line: [locations]
Error Code: [if applicable]
Key Insight: [what the error is telling you]
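The fields of the Error Analysis record can be collected mechanically rather than retyped. The following is a minimal Python sketch, assuming the failing code can be run under a `try`/`except`; `summarize_exception` and `parse_port` are hypothetical names used only for illustration.

```python
import traceback


def summarize_exception(exc: BaseException) -> dict:
    """Collect the Error Analysis fields from a caught exception."""
    frames = traceback.extract_tb(exc.__traceback__)
    message = "".join(traceback.format_exception_only(type(exc), exc)).strip()
    return {
        # Verbatim message, as Python prints it on the last traceback line.
        "error_message": message,
        # Every frame, outermost call first, as "file:line in function".
        "stack_trace": [f"{f.filename}:{f.lineno} in {f.name}" for f in frames],
        # Deepest frame: where the error surfaced, not necessarily the root cause.
        "file_line": f"{frames[-1].filename}:{frames[-1].lineno}" if frames else None,
    }


def parse_port(raw: str) -> int:  # hypothetical failing function
    return int(raw)


try:
    parse_port("80a")
except ValueError as exc:
    report = summarize_exception(exc)
    for key, value in report.items():
        print(f"{key}: {value}")
```

The Key Insight field still has to be written by hand; the sketch only guarantees that the message and frames are recorded verbatim.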

1.2 Reproduce Consistently

  • Can you trigger it reliably?
  • What are the exact steps?
  • Does it happen every time?
  • If not reproducible → gather more data, don't guess
Record in Phase 1.5 artifact:
markdown
## Reproduction Verification

Steps:
1. [exact step]
2. [exact step]
3. [exact step]

Reproducible: Yes / No / Intermittent
Frequency: [X out of Y attempts]
Deterministic: Yes / No
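The Frequency field is easier to trust when it is measured rather than estimated. A small sketch, assuming the reproduction steps can be wrapped in a Python callable that raises on failure; `measure_reproducibility` and `flaky_operation` are hypothetical.

```python
from typing import Callable


def measure_reproducibility(trigger: Callable[[], None], attempts: int = 10) -> dict:
    """Run `trigger` repeatedly and count how often it raises."""
    failures = 0
    for _ in range(attempts):
        try:
            trigger()
        except Exception:
            failures += 1
    if failures == attempts:
        verdict = "Yes"
    elif failures == 0:
        verdict = "No"
    else:
        verdict = "Intermittent"
    return {
        "reproducible": verdict,
        "frequency": f"{failures} out of {attempts} attempts",
    }


# Hypothetical bug: fails on every third call because of leftover state.
calls = {"count": 0}


def flaky_operation() -> None:
    calls["count"] += 1
    if calls["count"] % 3 == 0:
        raise RuntimeError("stale cache entry")


result = measure_reproducibility(flaky_operation, attempts=9)
print(result)
```

An "Intermittent" verdict is itself evidence: it points toward state, timing, or environment rather than straight-line logic.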

1.3 Check Recent Changes

  • What changed that could cause this?
  • Git diff, recent commits
  • New dependencies, config changes
  • Environmental differences
Record in Phase 1.5 artifact:
markdown
## Recent Changes Analysis

Git History: [relevant commits]
Dependency Changes: [package.json, requirements.txt, etc.]
Config Changes: [relevant files]
Environment: [OS, runtime versions]
Likely Culprit: [most suspicious change]
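For the Dependency Changes field, comparing pinned versions between the last known-good state and the current one is often quicker than reading a raw diff. The sketch below assumes `requirements.txt`-style files that use only `name==version` pins; the package names and versions are made up for illustration.

```python
def parse_pins(requirements_text: str) -> dict:
    """Parse 'name==version' lines; comments and blank lines are ignored."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(old: dict, new: dict) -> list:
    """List every added, removed, or changed pin, sorted by package name."""
    changes = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes.append((name, before, after))
    return changes


# Hypothetical file contents from the known-good commit and the current one.
known_good = "requests==2.31.0\nurllib3==1.26.18\n"
current = "requests==2.31.0\nurllib3==2.2.1\nchardet==5.2.0\n"

changes = diff_pins(parse_pins(known_good), parse_pins(current))
for name, before, after in changes:
    print(f"{name}: {before} -> {after}")
```

Each reported change is a candidate for Likely Culprit, not a conclusion; it still needs to be tested as a hypothesis in Step 3.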

1.4 Gather Evidence in Multi-Component Systems

WHEN system has multiple components (CI → build → signing, API → service → database):
BEFORE proposing fixes, add diagnostic instrumentation:
For EACH component boundary:
  • Log what data enters component
  • Log what data exits component
  • Verify environment/config propagation
  • Check state at each layer
Run once to gather evidence showing WHERE it breaks, THEN analyze evidence.
Record in Phase 1.5 artifact:
markdown
## Multi-Layer Evidence

Layer 1: [Component Name]
- Input: [data]
- Output: [data]
- Status: ✅ Working / ❌ Broken

Layer 2: [Component Name]
- Input: [data from Layer 1]
- Output: [data]
- Status: ✅ Working / ❌ Broken

Failure Boundary: Layer X → Layer Y
Root Cause Location: [specific component]
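One way to log what enters and exits each component is a decorator applied at every boundary. This is a sketch under the assumption that the components are Python callables in one process; the `log_boundary` decorator and the three-layer pipeline are hypothetical.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("boundary")


def log_boundary(component: str):
    """Log what enters and what leaves a component without changing its behavior."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.debug("[%s] IN  args=%r kwargs=%r", component, args, kwargs)
            result = func(*args, **kwargs)
            log.debug("[%s] OUT %r", component, result)
            return result
        return wrapper
    return decorator


# Hypothetical three-layer pipeline: API -> service -> database.
@log_boundary("database")
def save(record: dict) -> bool:
    return "user_id" in record


@log_boundary("service")
def build_record(payload: dict) -> dict:
    # Bug under investigation: the key is renamed, so "user_id" never arrives.
    return {"userId": payload["user_id"]}


@log_boundary("api")
def handle_request(payload: dict) -> bool:
    return save(build_record(payload))


ok = handle_request({"user_id": 42})
```

In this example a single run shows the service emitting `userId` while the database expects `user_id`, which places the failure boundary between those two layers before any fix is proposed.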

1.5 Trace Data Flow

WHEN error is deep in call stack:
Trace backward:
  • Where does bad value originate?
  • What called this with bad value?
  • Keep tracing up until you find the source
  • Fix at source, not at symptom
Record in Phase 1.5 artifact:
markdown
## Data Flow Trace

Error Location: [file:line - function]
Bad Value: [what was wrong]

Call Stack Trace:
1. [deepest] functionA() at fileA:line - received [value]
2. functionB() at fileB:line - passed [value]
3. functionC() at fileC:line - passed [value]
4. [source] functionD() at fileD:line - ORIGIN of bad value

Root Cause: [source location] - [explanation]
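When the bad value travels under the same variable name, the backward trace can be read directly off the traceback. A minimal sketch, assuming a Python exception is available; `trace_value` and the `start`/`connect`/`wait` chain are hypothetical.

```python
import traceback


def trace_value(exc: BaseException, name: str) -> list:
    """List frames deepest-first with the value each frame held for `name`."""
    rows = []
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        if name in frame.f_locals:
            rows.append((frame.f_code.co_name, lineno, frame.f_locals[name]))
    return rows[::-1]  # walk_tb yields outermost first; reverse to start at the error


def wait(timeout):
    return timeout * 1000  # the error surfaces here


def connect(timeout):
    return wait(timeout)


def start():
    timeout = {}.get("timeout")  # ORIGIN: a missing key silently yields None
    return connect(timeout)


try:
    start()
except TypeError as exc:
    trace = trace_value(exc, "timeout")
    for func, line, value in trace:
        print(f"{func}() line {line}: timeout={value!r}")
```

The last row printed is the outermost frame that still holds the bad value, which is where the fix belongs; guarding against `None` inside `wait()` would only treat the symptom.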

Step 2: Pattern Analysis

Find the pattern before fixing:

2.1 Find Working Examples

  • Locate similar working code in same codebase
  • What works that's similar to what's broken?

2.2 Compare Against References

  • If implementing a pattern, read the reference implementation COMPLETELY
  • Don't skim - read every line
  • Understand the pattern fully before applying it

2.3 Identify Differences

  • What's different between working and broken?
  • List every difference, however small
  • Don't assume "that can't matter"

2.4 Understand Dependencies

  • What other components does this need?
  • What settings, config, environment?
  • What assumptions does it make?
Record in Phase 1.5 artifact:
markdown
## Pattern Analysis

Working Example: [file:location]
Broken Code: [file:location]

Key Differences:
| Aspect | Working | Broken |
|--------|---------|--------|
| [X] | [value] | [value] |
| [Y] | [value] | [value] |

Likely Cause: [difference that explains the bug]
Dependencies: [what the code needs to work]
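Listing every difference, however small, is easier when the comparison is generated rather than eyeballed. The sketch below assumes the working and broken cases can each be described as a flat dictionary of settings; all keys and values are hypothetical.

```python
def list_differences(working: dict, broken: dict) -> list:
    """Return (aspect, working_value, broken_value) for every key that differs."""
    missing = "<absent>"
    rows = []
    for key in sorted(set(working) | set(broken)):
        w, b = working.get(key, missing), broken.get(key, missing)
        if w != b:
            rows.append((key, w, b))
    return rows


# Hypothetical settings of a working client and a broken one.
working = {"timeout_s": 30, "retries": 3, "tls": True}
broken = {"timeout_s": 30, "retries": 0, "tls": True, "proxy": "http://localhost:3128"}

differences = list_differences(working, broken)
print(f"{'Aspect':<10}{'Working':<12}{'Broken'}")
for aspect, w, b in differences:
    print(f"{aspect:<10}{w!s:<12}{b!s}")
```

The rows map directly onto the Key Differences table; none of them is dismissed as "can't matter" until it has been tested.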

Step 3: Hypothesis and Testing

Scientific method:

3.1 Form Single Hypothesis

  • State clearly: "I think X is the root cause because Y"
  • Write it down
  • Be specific, not vague

3.2 Test Minimally

  • Make the SMALLEST possible change to test hypothesis
  • One variable at a time
  • Don't fix multiple things at once
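"One variable at a time" can be made literal: start from the known-good state, apply a single difference, and re-run the reproduction. A sketch, assuming the state is a flat dictionary and the reproduction can be expressed as a boolean check; `isolate_breaking_change` and `works` are hypothetical.

```python
from typing import Callable


def isolate_breaking_change(
    working: dict, broken: dict, works: Callable[[dict], bool]
) -> list:
    """Apply each difference to the working config alone; report which ones break it."""
    culprits = []
    for key in sorted(set(working) | set(broken)):
        if working.get(key) == broken.get(key):
            continue
        candidate = dict(working)  # start from the known-good state every time
        if key in broken:
            candidate[key] = broken[key]  # change exactly one variable
        else:
            candidate.pop(key, None)
        if not works(candidate):
            culprits.append(key)
    return culprits


# Hypothetical check standing in for "run the reproduction steps".
def works(config: dict) -> bool:
    return config.get("retries", 0) > 0


working = {"timeout_s": 30, "retries": 3}
broken = {"timeout_s": 5, "retries": 0}
culprits = isolate_breaking_change(working, broken, works)
print(culprits)
```

Because every candidate differs from the working state in one key only, a failing check can be attributed to that key without ambiguity. Bugs caused by an interaction between two changes would not be found this way and need their own hypothesis.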

3.3 Verify Before Continuing

  • Did it work? Yes → Step 4
  • Didn't work? Form a NEW hypothesis
  • DON'T add more fixes on top

3.4 When You Don't Know

  • Say "I don't understand X"
  • Don't pretend to know
  • Ask for help
  • Research more
Record in Phase 1.5 artifact:
markdown
## Hypothesis Testing

### Hypothesis 1

Statement: [clear hypothesis]
Rationale: [why you think this]
Test: [minimal change to verify]
Result: [confirmed/rejected]
Evidence: [output/observation]

### Hypothesis 2 (if needed)

[...]

Confirmed Root Cause: [final hypothesis]

Step 4: Fix Summary (Handoff to Phase 2 Planning)

Fix the root cause, not the symptom:

4.1 Create Failing Test Case

  • Simplest possible reproduction
  • Automated test if possible
  • One-off test script if no framework
  • MUST have before fixing
  • This becomes part of Phase 3's test plan

4.2 Root Cause Summary for Phase 2

  • Summarize root cause found
  • Document the fix approach
  • Reference evidence from Phase 1.5
Record in Phase 1.5 artifact:
markdown
## Root Cause Summary

Root Cause: [one sentence]
Location: [file:line]
Explanation: [paragraph explaining why]
Fix Approach: [high-level]
Test Strategy: [how to verify fix]

## Phase 1.5 Gate

Coverage: [Did we find root cause?]
Approval: [Ready to proceed to Phase 2 with fix plan?]

Red Flags - STOP and Follow Process

If you catch yourself thinking:
  • "Quick fix for now, investigate later"
  • "Just try changing X and see if it works"
  • "Add multiple changes, run tests"
  • "Skip the test, I'll manually verify"
  • "It's probably X, let me fix that"
  • "I don't fully understand but this might work"
  • Proposing solutions before tracing data flow
  • "One more fix attempt" (when already tried 2+)
ALL of these mean: STOP. Return to Step 1.

Common Rationalizations (STOP)

| Excuse | Reality |
|--------|---------|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern, don't fix again. |

If 3+ Fix Attempts Failed

Pattern indicating architectural problem:
  • Each fix reveals new shared state/coupling/problem in a different place
  • Fixes require "massive refactoring" to implement
  • Each fix creates new symptoms elsewhere
STOP and question fundamentals:
  • Is this pattern fundamentally sound?
  • Are we "sticking with it through sheer inertia"?
  • Should we refactor architecture vs. continue fixing symptoms?
Document in Phase 1.5:
markdown
## Architectural Concern

Fix Attempts: [number]
Pattern: [what happens with each fix]
Recommendation: [architectural change vs. symptom fix]
Next Steps: [escalate, refactor, or accept risk]

Phase 1.5 Artifact Template

File:
/.codex/rlm/<run-id>/01.5-root-cause.md
markdown
Run: `/.codex/rlm/<run-id>/`
Phase: `01.5 Root Cause Analysis`
Status: `DRAFT` | `LOCKED`
Inputs:
- `/.codex/rlm/<run-id>/01-as-is.md`
- [relevant addenda]
Outputs:
- `/.codex/rlm/<run-id>/01.5-root-cause.md`
Scope note: This document records systematic debugging process and identified root cause.

## Error Analysis

[Step 1.1 - verbatim errors, stack traces]

## Reproduction Verification

[Step 1.2 - exact steps, reproducibility]

## Recent Changes Analysis

[Step 1.3 - git history, dependencies]

## Evidence Gathering

[Step 1.4 - multi-layer diagnostics if applicable]

## Data Flow Trace

[Step 1.5 - backward trace to source]

## Pattern Analysis

[Step 2 - working vs broken comparison]

## Hypothesis Testing

[Step 3 - scientific method log]

## Root Cause Summary

Root Cause: [one sentence]
Location: [file:line]
Detailed Explanation: [paragraph]
Fix Strategy: [approach for Phase 2]
Test Plan: [how to verify]

## Traceability

- R1 (Bug fix requirement) -> Root cause identified at [location] | Evidence: [section]

## Coverage Gate

- Error messages analyzed
- Reproduction verified
- Recent changes reviewed
- Data flow traced to source
- Pattern analysis completed
- Hypothesis tested and confirmed
- Root cause documented
- Fix strategy defined
Coverage: PASS / FAIL

## Approval Gate

- Root cause identified (not just symptom)
- Fix approach clear
- Test strategy defined
- No "quick fixes" attempted
- Ready to proceed to Phase 2
Approval: PASS / FAIL
LockedAt: [when locked]
LockHash: [sha256]
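The template does not say exactly which bytes `LockHash` covers. As a sketch only, assuming the hash is a SHA-256 digest of the artifact's UTF-8 text and `LockedAt` is a UTC timestamp, the two fields could be produced like this (`lock_fields` is a hypothetical helper):

```python
import hashlib
from datetime import datetime, timezone


def lock_fields(artifact_text: str) -> dict:
    """Compute LockedAt and LockHash values for an artifact's text."""
    digest = hashlib.sha256(artifact_text.encode("utf-8")).hexdigest()
    return {
        # Assumption: an ISO 8601 UTC timestamp is an acceptable LockedAt value.
        "LockedAt": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        # Assumption: the hash covers the full artifact text passed in.
        "LockHash": digest,
    }


fields = lock_fields("Status: `DRAFT`\n")
print(fields["LockedAt"], fields["LockHash"])
```

Whatever convention the RLM tooling actually uses, recomputing the digest later and comparing it to the stored `LockHash` is what reveals edits made after locking.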

Integration with RLM

Phase 1 → 1.5 Transition

When Phase 1 (AS-IS) identifies a bug/issue that needs fixing:
  1. Lock Phase 1 (01-as-is.md)
  2. Create Phase 1.5 (01.5-root-cause.md) with Status: DRAFT
  3. Execute systematic debugging
  4. Lock Phase 1.5 when root cause found
  5. Proceed to Phase 2 with root cause knowledge

Phase 1.5 → 2 Transition

Phase 2 (02-to-be-plan.md) builds ON Phase 1.5:
markdown
## Root Cause Reference

Root cause identified in 01.5-root-cause.md:
- Location: [file:line]
- Cause: [summary]
- Full analysis: [reference]

## Fix Plan

Based on root cause analysis:
1. [specific fix steps]
2. [test strategy from Phase 1.5]

References

  • REQUIRED: Use this skill for all bug-fix requirements
  • TRIGGERS: When requirement involves debugging/fixing
  • OUTPUT: 01.5-root-cause.md artifact
  • NEXT: Phase 2 (TO-BE Plan) incorporates findings