ultrathink-protocol

Ultrathink Protocol

What this protocol is for

Investigations of complex technical problems fail in a specific pattern: pattern-match to a plausible fix, apply it, observe that it didn't work, apply a variation, repeat. Each iteration feels productive. None produces understanding. The loop can run for hours.
The protocol breaks the loop by separating two modes that must not be mixed:
  • Diagnosis mode — building a verified model of what is actually happening
  • Implementation mode — applying a fix once the cause is known
You cannot enter implementation mode until diagnosis mode has produced a verified cause — meaning you hold direct evidence (a log line, a port number, a source function, a network trace, a config value read from the running process) that explains the symptom. "It might be X" is a hypothesis, not verified evidence. "The running process reads config from path Y, not path Z, because the startup script overwrites the env var before exec" is verified.

The execution cycle

Every step in a complex investigation follows this three-part structure. Write it out explicitly — do not compress it into a single paragraph.
THOUGHT:      What hypothesis am I testing? What do I expect to find?
              What would this result mean for my current model of the problem?

ACTION:       The single most informative thing I can do right now —
              a command, a source read, a log grep, a process inspection.
              One action per cycle. Pick the action that would most change
              your model if the result is unexpected.

OBSERVATION:  What actually happened. Quote the relevant output directly.
              Does this confirm or refute the hypothesis?
              How does it change the model?
The OBSERVATION step is where understanding is built. An observation that says "that confirms my theory" without explaining why is a red flag — it means you may be fitting evidence to a pre-formed conclusion rather than updating your model.
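One way to keep the three parts from collapsing into a single paragraph is to record each cycle as a structured entry. The sketch below is only an illustration: the `Cycle` class, its field names, and the example text are assumptions, not part of the protocol.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    """One THOUGHT / ACTION / OBSERVATION step (hypothetical helper)."""
    thought: str       # hypothesis under test and the expected result
    action: str        # the single command or inspection performed
    observation: str   # quoted output and what it means for the model

    def render(self) -> str:
        """Format the cycle as three separately labelled parts."""
        parts = [
            ("THOUGHT:", self.thought),
            ("ACTION:", self.action),
            ("OBSERVATION:", self.observation),
        ]
        # Refuse to render a cycle that skips one of the three parts.
        missing = [label for label, text in parts if not text.strip()]
        if missing:
            raise ValueError(f"incomplete cycle, missing: {', '.join(missing)}")
        return "\n\n".join(f"{label:<14}{text}" for label, text in parts)


cycle = Cycle(
    thought="Consumer B is bound to a different port than producer A sends to.",
    action="Inspect the sockets owned by B and compare against A's target port.",
    observation="(quote the relevant output here, then state what it implies)",
)
print(cycle.render())
```

Raising on an empty part is a deliberate choice in this sketch: a cycle with no written observation is exactly the skipped verification step the structure is meant to prevent.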

The diagnosis ladder

Work top to bottom. Stop at the level where you find verified evidence. Going further than necessary wastes time; stopping too early produces wrong diagnoses.

Level 1 — State the symptom precisely

Before any action, restate the symptom in the most concrete observable terms:
  • Weak: "it's not working"
  • Strong: "process A writes output at 1 Hz according to its own logs, but consumer B receives nothing after 30 seconds, even though both claim to be connected on the same channel"
The gap between what is observed and what is expected defines the shape of the problem. Every diagnostic action should be aimed at explaining that gap specifically — not at exploring adjacent possibilities.

Level 2 — Eliminate the obvious

Check what is free to check and would explain everything if wrong:
  • Is the process actually running?
  • Does the running process actually see the config/env vars you think it does? (critical: check the process's own environment, not the shell or config file — they are frequently different)
  • Is the correct version of the code/binary deployed?
  • Is the process connected to the right endpoint, interface, or address?
The process environment trap. A very common failure mode across all stacks: you configure a variable in a launcher, compose file, or wrapper script, but something in the startup chain overwrites it before the process reads it. Always verify what the running process sees, not what you told the launcher to pass. On Linux, read /proc/<pid>/environ, which holds the environment the process was started with (changes the process later makes to its own environment do not appear there). On other platforms, use the equivalent process inspection tools.
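The check described above can be sketched in Python. This is a minimal illustration assuming a Linux host; the function names `parse_environ`, `read_process_env`, and `env_mismatches`, and the sample variable names, are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def parse_environ(raw: bytes) -> Dict[str, str]:
    """Parse the NUL-separated KEY=VALUE entries of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the empty trailing entry or malformed chunks
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_env(pid: int) -> Dict[str, str]:
    """Read the start-up environment of a running process (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


def env_mismatches(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, actual)} for each variable that differs."""
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }


# Demonstration on literal bytes, so it runs without a target process.
seen = parse_environ(b"CONFIG_PATH=/etc/app/z.conf\0LOG_LEVEL=info\0")
print(env_mismatches({"CONFIG_PATH": "/etc/app/y.conf"}, seen))
```

Against a real process, `env_mismatches(expected, read_process_env(pid))` would list every variable where the launcher's intent and the process's start-up environment disagree; an actual value of `None` means the variable never reached the process at all.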

Level 3 — Trace the actual data path

Identify the intended flow and then verify each step in the actual running system:
  • At what point does the actual behaviour diverge from the intended behaviour?
  • Which component in the chain is the last one behaving correctly?
  • Which is the first one where the behaviour is wrong?
The root cause is almost always at exactly one point of divergence — a layer boundary, a config value that was silently overridden, an interface that the code bound to differently than expected. The gap between "last correct" and "first wrong" is where to look.
Concretely: rather than theorising about what could go wrong, inspect running state — open file descriptors, bound addresses, active connections, actual values being processed — and compare them to what you expect.
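The "last correct, first wrong" search can be written down as a walk over an ordered list of checks. The sketch below is illustrative only: `find_divergence`, the stage names, and the stand-in check results are assumptions, and each real check would inspect running state rather than return a constant.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def find_divergence(checks: List[Check]) -> Tuple[Optional[str], Optional[str]]:
    """Walk the chain in data-flow order; return (last_correct, first_wrong).

    first_wrong is None when every stage behaves as expected.
    """
    last_correct = None
    for name, check in checks:
        if not check():
            return last_correct, name
        last_correct = name
    return last_correct, None


# Hypothetical chain; each lambda stands in for a real inspection.
chain = [
    ("producer writes output", lambda: True),
    ("socket bound to expected address", lambda: True),
    ("packets arrive at consumer host", lambda: False),
    ("consumer parses message", lambda: False),
]
last_ok, first_bad = find_divergence(chain)
print(f"investigate between {last_ok!r} and {first_bad!r}")
```

The walk stops at the first failing stage on purpose: stages after the divergence are expected to fail too, so checking them adds no information.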

Level 4 — Read the source

When a component does not behave as documented or configured, read the code that processes the config or handles the relevant path. This sounds slow. It is reliably faster than guessing at configuration variations.
Source access priority:
  1. Grep installed headers or bundled scripts — often present in installed packages and reveals function signatures, constant values, env var names
  2. Grep the binary for string literals — finds env var names, config keys, magic values that the code actually reads at runtime
  3. Read the upstream source (GitHub, package registry) — definitive; a single function's implementation resolves hours of config iteration
  4. Search documentation and issue trackers for the exact behaviour you're seeing
The pattern that makes source reading so valuable: a function that reads CONFIG_VAR_B instead of the CONFIG_VAR_A you've been setting terminates the investigation immediately. No configuration change to CONFIG_VAR_A would ever work, regardless of how many variations you tried.
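When only the binary is available, the same question (which variable name does the code actually contain?) can be approximated by scanning for string literals. A rough sketch, assuming variable names are upper-case ASCII identifiers; `candidate_names` and `mentions` are hypothetical helpers, and a name being present in the binary only suggests, but does not prove, that it is read at runtime.

```python
import re
from typing import List

# Upper-case identifiers of four or more characters, e.g. CONFIG_VAR_B.
NAME_PATTERN = re.compile(rb"[A-Z_][A-Z0-9_]{3,}")


def candidate_names(blob: bytes) -> List[str]:
    """Return the sorted, de-duplicated identifier-like strings in a binary."""
    return sorted({m.decode("ascii") for m in NAME_PATTERN.findall(blob)})


def mentions(blob: bytes, name: str) -> bool:
    """True if the exact name appears anywhere in the binary's bytes."""
    return name.encode("ascii") in blob


# Demonstration on a small stand-in for the contents of a binary file.
blob = b"\x00\x01getenv\x00CONFIG_VAR_B\x00junk\x7f"
print(candidate_names(blob))
print(mentions(blob, "CONFIG_VAR_A"), mentions(blob, "CONFIG_VAR_B"))
```

For a real file, `Path("/path/to/binary").read_bytes()` would supply `blob`. If the variable you have been setting is absent and a similarly named one is present, that is the cue to go and read the function that uses it.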

Level 5 — Controlled experiment

When source is unavailable or the code path is too complex to trace statically, run a controlled experiment that isolates exactly one variable:
  • Reduce the system to the minimal case that still reproduces the failure
  • Change one thing and observe the effect
  • Design the experiment to falsify your current hypothesis, not to confirm it
A good experiment is one that could prove you wrong. If it can only confirm what you already believe, it is not diagnostic — it is confirmation bias with extra steps.
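A one-variable-at-a-time experiment can be sketched as follows. Everything here is an illustrative assumption: `isolate_single_changes`, the configuration keys, and the `fails` callback, which stands in for running the minimal reproduction and reporting whether the failure still occurs.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]


def isolate_single_changes(
    baseline: Config,
    candidates: Config,
    fails: Callable[[Config], bool],
) -> List[str]:
    """Return the keys whose change, applied alone, makes the failure vanish."""
    if not fails(baseline):
        # Without a reproducing baseline, no trial result means anything.
        raise ValueError("baseline does not reproduce the failure")
    fixing_keys = []
    for key, value in candidates.items():
        trial = dict(baseline)   # copy, so each trial differs in one key only
        trial[key] = value
        if not fails(trial):
            fixing_keys.append(key)
    return fixing_keys


# Stand-in reproduction: the failure occurs whenever the interface is "lo".
def fails(cfg: Config) -> bool:
    return cfg["interface"] == "lo"


baseline = {"interface": "lo", "port": 7400}
candidates = {"interface": "eth0", "port": 7401}
print(isolate_single_changes(baseline, candidates, fails))
```

A hypothesis of the form "the port is the cause" is falsified here if changing the port alone leaves the failure in place, which is the outcome the experiment is designed to be able to show.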

The stuck gate

You are stuck when any of these is true:
  1. You have applied the same class of fix more than twice (different config values, different versions of the same patch, different restart sequences) without new verified evidence that the root cause has changed
  2. Your last three OBSERVATION steps have not changed your model of the problem
  3. You are considering a more complex version of something that already failed
  4. The word "maybe" appears in your reasoning without a plan to test it
When stuck: stop. Do not apply another fix variant.
Instead:
  1. State the situation explicitly:
    "I am stuck. My current model of the problem is [X]. The evidence I have is [Y]. The part I cannot explain is [Z]."
  2. Descend the diagnosis ladder — you have not gone far enough. The most common reason for being stuck is that the actual behaviour at some layer has not been inspected directly; there is an assumption standing in for observation.
  3. If all available diagnostic tools have been exhausted, escalate to external research (see below).
The phrase "stop doing trial and error without knowing the root cause" is a hard stop signal from the user. It means the protocol was violated — return immediately to diagnosis mode, regardless of how close the current fix attempt feels.
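Three of the four stuck conditions are mechanical enough to check against an investigation log. The sketch below is a rough heuristic under stated assumptions: `stuck_reasons` and its inputs are hypothetical, it cannot tell whether new verified evidence justified a repeated fix, and it flags every "maybe" for review rather than detecting whether a test plan exists.

```python
import re
from collections import Counter
from typing import List


def stuck_reasons(
    fix_classes: List[str],     # class of each fix applied so far, in order
    model_changed: List[bool],  # per OBSERVATION: did the model change?
    reasoning: str,             # the current written reasoning
) -> List[str]:
    """Return the stuck conditions that appear to hold (heuristic)."""
    reasons = []
    repeated = [c for c, n in Counter(fix_classes).items() if n > 2]
    if repeated:
        reasons.append(f"same class of fix applied more than twice: {repeated}")
    if len(model_changed) >= 3 and not any(model_changed[-3:]):
        reasons.append("last three observations did not change the model")
    if re.search(r"\bmaybe\b", reasoning, flags=re.IGNORECASE):
        reasons.append('"maybe" appears in the reasoning; needs a test plan')
    return reasons


reasons = stuck_reasons(
    fix_classes=["config value", "config value", "config value"],
    model_changed=[True, False, False, False],
    reasoning="Maybe the port is wrong.",
)
for reason in reasons:
    print("STUCK:", reason)
```

An empty result does not prove the investigation is healthy; the third condition in the list (escalating the complexity of something that already failed) still needs a human judgment.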

Escalation: when and how to research

Escalate to web search, documentation, or source code research when:
  • The root cause requires understanding a system you cannot directly inspect (closed-source binary, third-party middleware, undocumented protocol behaviour)
  • You have arrived at a clear, specific, answerable question
  • Continued diagnosis without more information would be speculation
A researchable question is specific enough that a search could answer it directly:
"What environment variable does [library X]'s [function Y] actually read at runtime?"
A non-researchable question is what you write when stuck and hoping research will rescue you:
"How does [technology A] work with [technology B]?"
If you cannot write a specific question, you have not diagnosed far enough. More diagnosis, not more research, is the right move.

Communication during investigation

What to say

  • Before each ACTION, state the hypothesis being tested in one sentence
  • After each OBSERVATION, state what changed in your model — even if the answer is "nothing changed, which itself narrows the possibilities"
  • When the root cause is confirmed, state it completely and precisely before proposing any fix:
    "Root cause confirmed: [component A] uses [value X] because [mechanism Y] overrides the configured [value Z] at startup. Evidence: [direct quote from log/source/process state]."

What not to say

  • Do not narrate tool calls as progress ("Let me check the logs..." is not a finding — report the finding, not the intent)
  • Do not announce fixes before the cause is verified
  • Do not say "I think the issue might be X" and immediately apply a fix for X — "might be" is a hypothesis; test it first
  • Do not compress THOUGHT/ACTION/OBSERVATION into a single paragraph — the explicit structure is precisely what prevents skipping the verification step

Multi-layer decomposition

When a system has multiple layers (any stack: hardware → driver → OS → runtime → middleware → application → config), failures almost always occur at exactly one layer boundary. The strategy for finding it:
  1. Find the last layer that works correctly. Where in the chain does behaviour match expectation? Start from the input end and work forward.
  2. Find the first layer that fails. Where does behaviour first diverge from expectation?
  3. The root cause is at that boundary. You now have a precise, one-layer question instead of a whole-system question. Investigate only that boundary.
This decomposition converts "nothing works end-to-end" into a single-layer question that can be answered with one or two diagnostic actions.
Avoid theorising about deep layers before the layers above them have been checked. It is tempting to hypothesise about a deep layer when the surface layers have not been fully inspected. The actual divergence point is almost always shallower than expected.

Anti-patterns this protocol prevents

| Anti-pattern | Signal | Correct response |
|---|---|---|
| Config iteration | Trying the third variation of the same config change | Stop. Read what config the running process actually loads. |
| Restart loop | Rebuild → restart → check, without new diagnostic information | Stop. The code did not change. Inspect state before restarting again. |
| Assumption drift | Fix is written for cause X before X has been verified | Treat X as a hypothesis. Find direct evidence before writing a fix. |
| Complexity escalation | Each failed fix attempt adds more layers or indirection | Apply Occam's razor. The simpler explanation is right more often. A 3-line change in the right place beats a 50-line workaround in the wrong place. |
| Confirmation reading | Reading diagnostic output to confirm an existing belief rather than test it | Ask: what would I see if my hypothesis is wrong? Look for that specifically. |
| Shell ≠ process env | Assuming the process sees what the launcher was told to pass | Verify the process's own environment at runtime, not the launch config. |
| Layer skipping | Theorising about a deep layer without inspecting the surface layers first | Walk the chain from input to output. The first divergence is the root cause. |

Diagnostic toolkit (generic)

What environment does the running process actually see? (Linux)

cat /proc/<pid>/environ | tr '\0' '\n'

Or filter for a specific variable:

cat /proc/<pid>/environ | tr '\0' '\n' | grep VAR_NAME

Which network sockets does a process own? (Linux) Map process file descriptors to UDP/TCP ports:

python3 -c "
import os, re

pid = <pid>

# Map socket inode to fd number for every socket the process holds open.
inodes = {}
for fd in os.listdir(f'/proc/{pid}/fd'):
    try:
        m = re.match(r'socket:\[(\d+)\]', os.readlink(f'/proc/{pid}/fd/{fd}'))
    except OSError:
        continue  # fd was closed between listdir and readlink
    if m:
        inodes[int(m.group(1))] = fd

# Look each inode up in the kernel socket tables to find its local port.
for proto in ['udp', 'tcp']:
    try:
        with open(f'/proc/net/{proto}') as f:
            for line in f:
                p = line.split()
                if len(p) < 10 or not p[9].isdigit():
                    continue  # header line or short line
                i = int(p[9])
                if i in inodes:
                    port = int(p[1].split(':')[1], 16)
                    print(f'{proto.upper()} fd{inodes[i]} port={port}')
    except OSError:
        pass  # table not readable on this system
"

What string literals (env var names, config keys) does a binary contain? (-E is needed for the {3,} repetition)

grep -oaE '[A-Z_][A-Z0-9_]{3,}' /path/to/binary | sort -u | head -50

Read source from a public GitHub repo without cloning:

gh api repos/<org>/<repo>/contents/<path/to/file> --jq '.content' | base64 -d

Which file is a process actually reading? (Linux, requires strace)

strace -p <pid> -e trace=openat 2>&1 | grep -v ENOENT

What is the process's working directory and open files?

ls -la /proc/<pid>/fd
readlink /proc/<pid>/cwd


Completion checklist

Do not declare a fix done until every item is checked:
  • The root cause is stated in one sentence with a direct evidence citation
  • The fix targets the root cause, not a downstream symptom
  • No complexity was added to work around unexplained behaviour
  • The fix is the simplest change that addresses the verified cause
  • Before applying, a specific prediction was made: "after this fix, I expect to observe [X]"
  • After applying, the prediction was verified by observation
  • The system is tested at the layer where the failure occurred, not just end-to-end