reliability

/reliability — Reliable Design Enforcement

Every design, plan, and implementation MUST handle failure gracefully. Things WILL go wrong — networks fail, disks fill up, dependencies go down, inputs are invalid. The question is not "will it fail?" but "what happens when it does?"
Why this matters: Unreliable systems erode user trust faster than any other quality issue. A system that crashes on bad input, hangs when a dependency is slow, or loses data on failure is not production-ready — no matter how many features it has.
When to invoke: During PLANNING (after brainstorming, before or alongside writing-plans) and during REVIEW (as part of code review criteria). This skill applies to both new code and modifications to existing code.

The Rules

Rule 1: Every External Call Can Fail

Every network call, database query, file operation, and external service invocation MUST handle failure:
| Failure mode | Required handling |
| --- | --- |
| Timeout | Explicit timeout set. Don't wait forever. |
| Connection refused | Retry with backoff, then degrade gracefully. |
| 5xx response | Retry with backoff (idempotent ops only). |
| 4xx response | Don't retry. Log and handle based on status code. |
| Malformed response | Validate schema. Don't crash on unexpected shapes. |
| Partial failure | Handle incomplete writes. Don't leave data half-updated. |
No external call without a timeout. Default: 5s for API calls, 30s for long operations (document why the longer timeout is needed).
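A minimal sketch of Rule 1 in Python, using only the standard library. The names `fetch_json`, `ExternalCallError`, and the injectable `opener` parameter are assumptions made for illustration. They are not part of this skill. The sketch covers the timeout and malformed-response rows; retries are left to Rule 2.

```python
import json
import socket
import urllib.error
import urllib.request

API_TIMEOUT_S = 5.0        # default for API calls, per the rule above
LONG_OP_TIMEOUT_S = 30.0   # only for long operations, with the reason documented


class ExternalCallError(Exception):
    """Hypothetical error type: an external call failed and the caller must handle it."""


def fetch_json(url, timeout=API_TIMEOUT_S, opener=urllib.request.urlopen):
    """Fetch a JSON object from `url`, never waiting longer than `timeout` seconds per blocking step."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
    except (TimeoutError, socket.timeout) as exc:
        raise ExternalCallError(f"timed out after {timeout}s: {url}") from exc
    except urllib.error.URLError as exc:
        # Covers connection refused, DNS failures, and HTTP error statuses.
        raise ExternalCallError(f"request failed: {url}: {exc.reason}") from exc

    # Validate the response instead of trusting its shape.
    try:
        payload = json.loads(body)
    except ValueError as exc:
        raise ExternalCallError(f"malformed JSON from {url}") from exc
    if not isinstance(payload, dict):
        raise ExternalCallError(f"unexpected response shape from {url}")
    return payload
```

The `opener` parameter exists only so the failure paths can be exercised in tests without a network.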

Rule 2: Retry with Exponential Backoff

Retries MUST use exponential backoff with jitter:
attempt 1: immediate
attempt 2: 1s + random(0-500ms)
attempt 3: 2s + random(0-500ms)
attempt 4: 4s + random(0-500ms)
(max 3-5 retries, max backoff 30s)
Only retry idempotent operations. A retry on a non-idempotent POST can create duplicates.
Never retry:
  • 400 Bad Request (fix the input)
  • 401 Unauthorized / 403 Forbidden (fix the auth)
  • 404 Not Found (it's not there)
  • 409 Conflict (resolve the conflict)
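The schedule and the never-retry list can be sketched as follows. `HttpError`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the sketch assumes the caller only passes idempotent operations. The cap here applies to the total delay, jitter included.

```python
import random
import time

NON_RETRYABLE_STATUS = {400, 401, 403, 404, 409}


class HttpError(Exception):
    """Hypothetical error carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.5, rng=random.random):
    """Seconds to wait before `attempt` (1-based). Attempt 1 is immediate."""
    if attempt <= 1:
        return 0.0
    # 1s, 2s, 4s, ... plus up to `jitter` seconds, never more than `cap`.
    return min(cap, base * 2 ** (attempt - 2) + rng() * jitter)


def call_with_retry(operation, max_attempts=4, sleep=time.sleep):
    """Run an idempotent `operation`, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        delay = backoff_delay(attempt)
        if delay:
            sleep(delay)
        try:
            return operation()
        except HttpError as exc:
            # Client errors will not succeed on retry; give up immediately.
            if exc.status in NON_RETRYABLE_STATUS or attempt == max_attempts:
                raise
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
```

Passing `sleep` and `rng` as parameters keeps the schedule testable without real waiting.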

Rule 3: Circuit Breaker Pattern

When a dependency fails repeatedly, STOP calling it:
| State | Behavior |
| --- | --- |
| Closed (normal) | Requests pass through. Track failure rate. |
| Open (failing) | Requests fail fast. Return cached/default/error. Don't call dependency. |
| Half-open (testing) | Allow 1 request through. If it succeeds, close. If it fails, reopen. |
Thresholds: Open after 5 consecutive failures or a failure rate above 50% in a 60s window. Move to half-open after 30s.
This prevents cascading failures — one down service shouldn't take down everything.
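A minimal, single-threaded sketch of the three states. It implements only the consecutive-failure threshold and the 30s reset; the failure-rate window is omitted, and a production breaker would also need locking. `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the dependency is not called."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial request reopens; otherwise open at the threshold.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and resets the count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers catch `CircuitOpenError` and return the cached, default, or error response described in the table.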

Rule 4: Graceful Degradation

When a dependency fails, the system MUST continue operating with reduced functionality — not crash:
| Scenario | Degraded behavior |
| --- | --- |
| Cache down | Serve from database (slower, but working) |
| Search service down | Show recent/popular items instead |
| Email service down | Queue emails for later delivery |
| Analytics down | Drop analytics events (non-critical) |
| Payment provider slow | Extend timeout, show "processing" state |
Define degradation behavior during design, not during the outage. Every external dependency needs a "what if it's down?" answer.
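Two of the rows above, sketched as plain functions. The dependencies are passed in as callables (`cache_get`, `db_get`, `search`, `popular_items`), all of which are hypothetical stand-ins for real clients.

```python
import logging

logger = logging.getLogger("degradation")


def get_profile(user_id, cache_get, db_get):
    """Read through the cache; fall back to the database if the cache is down."""
    try:
        cached = cache_get(user_id)
        if cached is not None:
            return cached
    except ConnectionError as exc:
        # Degraded, not broken: slower path, and the failure is still logged.
        logger.warning("cache unavailable, serving from database: %s", exc)
    return db_get(user_id)


def search_items(query, search, popular_items):
    """Return search results, or popular items when the search service is down."""
    try:
        return {"items": search(query), "degraded": False}
    except (ConnectionError, TimeoutError) as exc:
        logger.warning("search unavailable, showing popular items: %s", exc)
        return {"items": popular_items(), "degraded": True}
```

The `degraded` flag is one possible way to let the caller show that results are a fallback.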

Rule 5: Idempotent Operations

Every write operation MUST be safe to retry:
  • Use idempotency keys for payment and state-changing operations.
  • Use upserts instead of insert-then-update.
  • Use database transactions for multi-step mutations.
  • Use IF NOT EXISTS / ON CONFLICT for creates.
Test: Call the operation twice with the same input. Does it produce the same result? If not, it's not idempotent — fix it.
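The "call it twice" test can be sketched with SQLite from the Python standard library. The `payments` table and the `record_payment` function are hypothetical. The sketch assumes the bundled SQLite is new enough to support `ON CONFLICT ... DO NOTHING` (3.24 or later).

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS payments (
               idempotency_key TEXT PRIMARY KEY,
               amount_cents    INTEGER NOT NULL
           )"""
    )
    return conn


def record_payment(conn, idempotency_key, amount_cents):
    """Record a payment once per key; a retry with the same key is a no-op."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO payments (idempotency_key, amount_cents) VALUES (?, ?) "
            "ON CONFLICT(idempotency_key) DO NOTHING",
            (idempotency_key, amount_cents),
        )
        row = conn.execute(
            "SELECT amount_cents FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = open_db()
    first = record_payment(conn, "order-42", 1999)
    second = record_payment(conn, "order-42", 1999)  # simulated retry
    count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
    print(first == second, count)
```

Both calls return the same amount and the table holds a single row, which is the property the test above asks for.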

Rule 6: Health Checks and Observability

Every service MUST expose:
| Endpoint | Purpose |
| --- | --- |
| /health (liveness) | "Am I running?" Returns 200 if the process is alive. |
| /ready (readiness) | "Can I serve traffic?" Checks dependencies (DB, cache, etc.). |
Every failure MUST be observable:
  • Structured logging for all errors (not just console.log("error")).
  • Metrics for error rates, latency percentiles, queue depths.
  • Alerts for error rate spikes, latency degradation, queue buildup.
If it fails silently, it might as well not exist.
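A framework-free sketch of the two endpoints and of structured error logging. The handlers return a `(status, body)` pair that any web framework could serialize. Returning 503 when a dependency check fails is an assumption, since the rule only specifies the 200 case. `liveness`, `readiness`, and `JsonFormatter` are hypothetical names.

```python
import json
import logging


def liveness():
    """/health: the process is up and able to answer."""
    return 200, {"status": "alive"}


def readiness(checks):
    """/ready: `checks` maps a dependency name to a callable that raises on failure."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    ready = all(value == "ok" for value in results.values())
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return (200 if ready else 503), body


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object instead of free text."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

Attach the formatter with `handler.setFormatter(JsonFormatter())` so every error line can be parsed by log tooling.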

Rule 7: Data Integrity

Data MUST survive failures:
| Principle | Implementation |
| --- | --- |
| Atomic operations | Database transactions for multi-step writes |
| Write-ahead logging | Log intent before executing (for recovery) |
| Checksums | Verify data integrity after transfer/storage |
| Backup and recovery | Automated backups with tested restore procedures |
| Eventual consistency | Document which operations are eventually consistent and the convergence window |
Never lose acknowledged data. If you told the user "saved," it must be saved — even if the server crashes 1ms later.
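For file-based state, one way to approach "saved means saved" is to write to a temporary file, flush it to disk, and atomically swap it into place before acknowledging. The sketch below also returns a SHA-256 checksum for later verification. `save_durably` and `verify` are hypothetical names. Full durability on some filesystems also requires syncing the containing directory, which this sketch omits.

```python
import hashlib
import os
import tempfile


def save_durably(path, data: bytes) -> str:
    """Atomically replace `path` with `data`; return the SHA-256 of what was written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # force the bytes to disk before the swap
        os.replace(tmp_path, path)   # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return hashlib.sha256(data).hexdigest()


def verify(path, expected_checksum) -> bool:
    """Re-read the file and compare its checksum with the one recorded at write time."""
    with open(path, "rb") as stored:
        return hashlib.sha256(stored.read()).hexdigest() == expected_checksum
```

Only report "saved" to the user after `save_durably` returns.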

Applying This Skill

During Planning (brainstorming / writing-plans)

Before finalizing any design or plan, run the Reliability Checklist:
  • Every external call has a timeout and failure handling strategy
  • Retries use exponential backoff with jitter (idempotent ops only)
  • Circuit breakers protect against cascading failures
  • Graceful degradation is defined for every external dependency
  • Write operations are idempotent (safe to retry)
  • Health check endpoints are defined (liveness + readiness)
  • Data integrity is maintained through failures (transactions, WAL, backups)
If any item fails: redesign before proceeding to implementation.

During Implementation (executing-plans)

As you write code:
  • Set explicit timeouts on every HTTP client, DB connection, and external call.
  • Wrap external calls in try/catch with specific error handling (not bare catch-all).
  • Add circuit breakers for any dependency called >10 times per minute.
  • Return meaningful error responses — status code, error code, human message.
  • Never swallow errors silently. Log them, record a metric, or propagate them.
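A sketch of specific error handling with a structured error response. The exception types, error codes, and the `load_order` callable are hypothetical. The point is that each known failure gets its own branch, and anything unexpected propagates instead of being swallowed.

```python
import logging

logger = logging.getLogger("orders")


class UpstreamTimeout(Exception):
    """Hypothetical: the order store did not answer in time."""


class UpstreamUnavailable(Exception):
    """Hypothetical: the order store refused the connection."""


def error_response(status, code, message):
    """Status code, machine-readable error code, human-readable message."""
    return status, {"error": {"code": code, "message": message}}


def get_order(order_id, load_order):
    try:
        order = load_order(order_id)
    except UpstreamTimeout:
        logger.warning("order store timed out", extra={"order_id": order_id})
        return error_response(504, "ORDER_STORE_TIMEOUT",
                              "The order service is slow. Please retry shortly.")
    except UpstreamUnavailable:
        logger.error("order store unavailable", extra={"order_id": order_id})
        return error_response(503, "ORDER_STORE_UNAVAILABLE",
                              "The order service is temporarily unavailable.")
    except KeyError:
        return error_response(404, "ORDER_NOT_FOUND",
                              f"Order {order_id} does not exist.")
    # No bare catch-all: unexpected exceptions propagate to the caller.
    return 200, order
```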

During Review (code-review / receiving-code-review)

Verify these as part of every code review:
  • Every external call has timeout and error handling
  • No bare catch blocks that swallow errors
  • Retry logic uses backoff (not immediate retry loops)
  • Write operations are idempotent
  • Health check endpoints exist and check real dependencies
  • Error responses are structured and meaningful

When Modifying Existing Code

If existing code violates these rules:
  • You are NOT required to add circuit breakers to all existing external calls.
  • You ARE required to not make reliability worse.
  • If adding a new external call, it MUST have timeout, retry, and error handling.
  • If you find silently swallowed errors in code you're touching, fix them.

Anti-Patterns

| Pattern | Problem | Fix |
| --- | --- | --- |
| Empty catch blocks | Errors silently disappear | Log, metric, or propagate |
| No timeouts | Requests hang forever | Explicit timeout on every call |
| Retry storms | Retries overwhelm failing service | Exponential backoff + circuit breaker |
| Cascading failures | One failure takes everything down | Circuit breakers + degradation |
| Optimistic updates | Assume success, discover failure later | Verify writes, use transactions |
| "It works on my machine" | Local env doesn't simulate failures | Chaos testing, fault injection |

Rationalization Prevention

| Excuse | Reality |
| --- | --- |
| "That service never goes down" | It will. And you won't be ready. |
| "We'll add error handling later" | Later = after the first outage. |
| "It's just a timeout, it'll recover" | Without backoff, your retries will make it worse. |
| "The database is reliable" | Networks between you and the DB are not. |
| "We don't need health checks yet" | You need them the moment you deploy. |
| "This is a simple operation" | Simple operations fail in complex ways. Handle it. |