error-design

Error Design Review Lens

When invoked with $ARGUMENTS, focus the analysis on the specified file or module. Read the target code first, then apply the checks below.

Each exception a module throws is an interface element. The best way to deal with exceptions is to not have them.

"Exception handling code rarely executes. Bugs can go undetected for a long time, and when the exception handling code is finally needed, there's a good chance that it won't work." — John Ousterhout, A Philosophy of Software Design

The "Too Many Exceptions" Anti-Pattern

Programmers are taught that "the more errors detected, the better," but this leads to an over-defensive style that throws exceptions for anything suspicious. Throwing exceptions is easy. Handling them is hard.

"Classes with lots of exceptions have complex interfaces, and they are shallower than classes with fewer exceptions." — John Ousterhout, A Philosophy of Software Design

When to Apply

When reviewing error-handling code or exception hierarchies
When a function has many error cases or throws many exception types
When callers are burdened with handling errors that rarely occur
When error handling code is longer than the happy path

Core Principles

The Decision Tree

The four techniques below have no canonical ordering. This tree sequences them by preference for practical use.

For every error condition:

1. Can the error be defined out of existence?

Change the interface so the condition isn't an error. If yes: do this. Always the best option.

2. Can the error be masked?

Handle internally without propagating. If yes: mask if handling is safe and complete.

3. Can the error be aggregated?

Replace many specific exceptions with one general mechanism. If yes: aggregate to reduce interface surface.

4. Must the caller handle it?

Propagate only if the caller genuinely must decide. If the caller can't do anything meaningful: crash.
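
One way to read the tree is as a function from a few yes/no judgments to a strategy. The sketch below is illustrative only: `ErrorCondition`, its fields, and the `Strategy` names are hypothetical labels invented here, and the judgments themselves still have to come from a human reviewer.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    DEFINE_OUT = "define out of existence"
    MASK = "mask internally"
    AGGREGATE = "aggregate into one handler"
    PROPAGATE = "propagate to caller"
    CRASH = "crash with diagnostics"


@dataclass
class ErrorCondition:
    # Hypothetical reviewer judgments about one error condition.
    can_redefine_interface: bool = False
    can_recover_safely_inside: bool = False
    shares_handling_with_others: bool = False
    caller_must_decide: bool = False


def choose_strategy(cond: ErrorCondition) -> Strategy:
    """Ask the four questions in order of preference; crash is the fallback."""
    if cond.can_redefine_interface:
        return Strategy.DEFINE_OUT
    if cond.can_recover_safely_inside:
        return Strategy.MASK
    if cond.shares_handling_with_others:
        return Strategy.AGGREGATE
    if cond.caller_must_decide:
        return Strategy.PROPAGATE
    return Strategy.CRASH
```

Because the questions are checked in order, a condition that could be both masked and propagated is masked; the earlier option wins.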

Define Errors Out of Existence

Error conditions follow from how an operation is specified. Change the specification, and the error disappears.

The general move: instead of "do X" (fails if preconditions aren't met), write "ensure state S" (trivially satisfied if state already holds).

Unset variable? "Delete this variable" (fails if absent) → "ensure this variable no longer exists" (always succeeds).
File in use on delete? Unix `unlink` doesn't "delete a file." It removes a directory entry, and it returns success even if processes have the file open; the data is freed once the last of them closes it.
Substring indices out of range? Python slicing clamps out-of-range indices (no exception, no defensive code). Java's `substring` throws `IndexOutOfBoundsException`, forcing bounds-clamping around a one-line call.
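
A minimal Python sketch of the "ensure state S" move. `unset_strict` and `ensure_unset` are hypothetical helpers written for this illustration; `dict.pop` with a default and slice clamping are standard Python behavior.

```python
def unset_strict(variables: dict[str, str], name: str) -> None:
    """'Delete this variable': raises KeyError if the variable is absent."""
    del variables[name]


def ensure_unset(variables: dict[str, str], name: str) -> None:
    """'Ensure this variable no longer exists': there is nothing to fail on."""
    variables.pop(name, None)


env = {"PATH": "/usr/bin"}
ensure_unset(env, "PATH")  # removes the variable
ensure_unset(env, "PATH")  # already absent: still succeeds, no try/except needed

# Slicing clamps an out-of-range end index instead of raising.
tail = "hello"[2:100]
```

Callers of `ensure_unset` never need a `try`/`except`, whereas every caller of `unset_strict` must either check first or catch `KeyError`.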

Defining errors out of existence is like a spice: a small amount improves the result but too much ruins the dish. The technique only works when the exception information is genuinely not needed outside the module. A networking module that masked all network exceptions left callers with no way to detect lost messages or failed peers. Those errors needed to be exposed because callers depended on them to build reliable applications.

Exception Masking

Handle internally without exposing to callers. Valid when:

The module can recover completely
Recovery doesn't lose important information
The masking behavior is part of the module's specification

TCP masks packet loss this way: it retransmits lost packets internally, so applications using it never see the loss. Before masking, ask whether a developer debugging the system would want to know it happened. If yes, log it. If the loss is irreversible and important, don't mask; propagate.
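
As a rough sketch of masking with logging, assume a hypothetical `TransientError` that marks failures a retry can fully recover from. `call_with_retry` and the logger name are likewise invented for this example.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("transport")  # hypothetical logger name


class TransientError(Exception):
    """Hypothetical: a failure that a retry can fully recover from."""


def call_with_retry(op: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Mask transient failures of `op`; retrying is part of this function's contract."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientError as exc:
            if attempt == attempts:
                # Recovery is no longer complete, so stop masking and propagate.
                raise
            # Masked, but logged so someone debugging can see that it happened.
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")
```

Only `TransientError` is masked. Any other exception passes straight through, which keeps the masking limited to cases where recovery is known to be complete.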

Exception Aggregation

Replace many specific exceptions with fewer general ones handled in one place. Masking absorbs errors low and aggregation catches errors high. Together they produce an hourglass where middle layers have no exception handling at all.

Web Server Pattern

Let all `NoSuchParameter` exceptions propagate to the top-level dispatcher where a single handler generates the error response. New handlers automatically work with the system. The same applies to any request-processing loop: catch in one place near the top, abort the current request, clean up and continue.
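
A small sketch of this pattern, assuming a toy dispatcher rather than any real web framework. `get_param`, `HANDLERS`, `dispatch`, and the two handlers are hypothetical names; only `NoSuchParameter` comes from the text above.

```python
class NoSuchParameter(Exception):
    """Raised when a request lacks a required parameter."""


def get_param(params: dict[str, str], name: str) -> str:
    try:
        return params[name]
    except KeyError:
        raise NoSuchParameter(name) from None


# Handlers contain no exception handling of their own.
def greet(params: dict[str, str]) -> str:
    return f"Hello, {get_param(params, 'name')}!"


def add(params: dict[str, str]) -> str:
    return str(int(get_param(params, "a")) + int(get_param(params, "b")))


HANDLERS = {"/greet": greet, "/add": add}


def dispatch(path: str, params: dict[str, str]) -> tuple[int, str]:
    """Top-level dispatcher: the only place NoSuchParameter is caught."""
    try:
        return 200, HANDLERS[path](params)
    except NoSuchParameter as exc:
        return 400, f"missing parameter: {exc}"
```

Adding a third handler requires no new error handling: any `NoSuchParameter` it raises reaches the same `except` clause. Unknown paths and malformed numbers are deliberately left out of this sketch.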

Aggregation Through Promotion

Rather than building separate recovery for each failure type, promote smaller failures into a single crash-recovery mechanism. Fewer code paths, more frequently exercised (which surfaces bugs in recovery sooner). Trade-off: promotion increases recovery cost per incident, so it only makes sense when the promoted errors are rare.
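
The idea can be sketched as follows, under heavy simplification. `Crash`, `load_record`, and `run_with_recovery` are hypothetical names, and the newline-terminated record format is invented for the example. A missing key here stands for an internal inconsistency, not an ordinary lookup miss.

```python
class Crash(Exception):
    """Hypothetical: the single failure type the recovery path understands."""


def load_record(store: dict[str, bytes], key: str) -> bytes:
    """Promote rare, hard-to-repair failures to Crash instead of handling each one."""
    try:
        raw = store[key]
    except KeyError:
        raise Crash(f"index points at missing record {key!r}") from None
    if not raw.endswith(b"\n"):
        raise Crash(f"record {key!r} is truncated")
    return raw[:-1]


def run_with_recovery(task, recover):
    """One recovery path for every promoted failure: rebuild state, then retry once."""
    try:
        return task()
    except Crash:
        recover()
        return task()
```

Both failure modes take the same route through `recover`, so that route is exercised more often than two separate repair paths would be. The cost is that a single truncated record triggers a full rebuild.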

Just Crash

When an error is difficult or impossible to handle and occurs infrequently, the simplest response is to print diagnostic information and abort. Out-of-memory errors fit this pattern because there's not much an application can do and the handler itself may need to allocate memory. The same principle applies anywhere: wrap the operation so it aborts on failure, eliminating exception handling at every call site.
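
A minimal sketch of such a wrapper in Python. `must` is a hypothetical helper, and treating every `OSError` as fatal is an assumption that only suits programs where these failures are rare and unrecoverable.

```python
import sys


def must(op, *args, what: str):
    """Run `op`; on OSError, print diagnostics to stderr and abort the process."""
    try:
        return op(*args)
    except OSError as exc:
        print(f"fatal: {what}: {exc}", file=sys.stderr)
        raise SystemExit(1)


# Call sites stay free of error handling, for example:
#     config = must(open, "app.conf", what="opening config")
```

Raising `SystemExit` instead of calling `os._exit` still lets `finally` blocks and `atexit` hooks run, which is usually what a command-line tool wants.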

Appropriate When

The error is infrequent, recovery is impractical, and the caller can't do anything meaningful.

Not Appropriate When

The system's value depends on handling that failure (e.g., a replicated storage system must handle I/O errors, not crash on them).

Review Process

1. Inventory exceptions: List every error case, exception throw, and error return.
2. Apply the decision tree: Can each one be defined out? Masked? Aggregated?
3. Check depth impact: How many exception types are in the module's interface?
4. Audit catch blocks: Are callers doing meaningful work, or just logging and re-throwing?
5. Evaluate safety: For any proposed masking, verify nothing important is lost.
6. Recommend simplification: Propose specific reductions in error surface.

Red flag signals for error design are cataloged in red-flags (Catch-and-Ignore, Overexposure, Shallow Module).
