/reliability — Reliable Design Enforcement
Every design, plan, and implementation MUST handle failure gracefully. Things WILL go wrong — networks fail, disks fill up, dependencies go down, inputs are invalid. The question is not "will it fail?" but "what happens when it does?"
Why this matters: Unreliable systems erode user trust faster than any other quality issue. A system that crashes on bad input, hangs when a dependency is slow, or loses data on failure is not production-ready — no matter how many features it has.
When to invoke: During PLANNING (after brainstorming, before or alongside writing-plans) and during REVIEW (as part of code review criteria). This skill applies to both new code and modifications to existing code.
The Rules
Rule 1: Every External Call Can Fail
Every network call, database query, file operation, and external service invocation MUST handle failure:
| Failure mode | Required handling |
|---|---|
| Timeout | Explicit timeout set. Don't wait forever. |
| Connection refused | Retry with backoff, then degrade gracefully. |
| 5xx response | Retry with backoff (idempotent ops only). |
| 4xx response | Don't retry. Log and handle based on status code. |
| Malformed response | Validate schema. Don't crash on unexpected shapes. |
| Partial failure | Handle incomplete writes. Don't leave data half-updated. |
Never make an external call without a timeout. Defaults: 5s for API calls, 30s for long-running operations (document why the longer timeout is needed).
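The failure table above can be read as a small decision function. The following is a minimal sketch, not a definitive implementation: the names `Action` and `decide` are hypothetical, and it assumes a refused connection never reached the server (so a retry is safe), while a timeout may have, so timeouts are retried only for idempotent operations.

```python
from enum import Enum
from typing import Union

API_TIMEOUT_S = 5.0        # default for API calls, from the rule above
LONG_OP_TIMEOUT_S = 30.0   # long operations; document why when used


class Action(Enum):
    SUCCEED = "succeed"
    RETRY = "retry with backoff, then degrade"
    FAIL = "log and handle, no retry"


def decide(outcome: Union[int, BaseException], idempotent: bool) -> Action:
    """Map one call outcome (HTTP status or raised error) to its handling."""
    if isinstance(outcome, ConnectionRefusedError):
        # Assumption: the request never reached the server, so retrying is safe.
        return Action.RETRY
    if isinstance(outcome, TimeoutError):
        # The server may have processed the request; retry only if idempotent.
        return Action.RETRY if idempotent else Action.FAIL
    if isinstance(outcome, BaseException):
        return Action.FAIL
    if 500 <= outcome <= 599:
        return Action.RETRY if idempotent else Action.FAIL
    if 400 <= outcome <= 499:
        return Action.FAIL  # client error: retrying the same request won't help
    return Action.SUCCEED
```

In real code the timeout itself is passed to the client on every call, for example `urllib.request.urlopen(url, timeout=API_TIMEOUT_S)`.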
Rule 2: Retry with Exponential Backoff
Retries MUST use exponential backoff with jitter:
- attempt 1: immediate
- attempt 2: 1s + random(0-500ms)
- attempt 3: 2s + random(0-500ms)
- attempt 4: 4s + random(0-500ms)
- limits: max 3-5 retries, max backoff 30s
Only retry idempotent operations. A retry on a non-idempotent POST can create duplicates.
Never retry:
- 400 Bad Request (fix the input)
- 401 Unauthorized / 403 Forbidden (fix the auth)
- 404 Not Found (it's not there)
- 409 Conflict (resolve the conflict)
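The backoff schedule in this rule could be sketched as follows. This is an illustration under stated assumptions: `RetryableError`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the caller is responsible for raising `RetryableError` only for transient failures on idempotent operations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

BASE_DELAY_S = 1.0
MAX_BACKOFF_S = 30.0
MAX_JITTER_S = 0.5


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, 5xx)."""


def backoff_delay(attempt: int, rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before `attempt` (1-based); attempt 1 is immediate."""
    if attempt <= 1:
        return 0.0
    base = min(BASE_DELAY_S * 2 ** (attempt - 2), MAX_BACKOFF_S)
    return base + rng() * MAX_JITTER_S  # jitter spreads out simultaneous retries


def call_with_retry(op: Callable[[], T], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run an idempotent `op`, retrying only on RetryableError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        delay = backoff_delay(attempt)
        if delay:
            sleep(delay)
        try:
            return op()
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: propagate, never swallow
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` keeps the schedule testable without real waiting. Errors other than `RetryableError` (for example a 400 or 404 mapped to a different exception) propagate immediately.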
Rule 3: Circuit Breaker Pattern
When a dependency fails repeatedly, STOP calling it:
| State | Behavior |
|---|---|
| Closed (normal) | Requests pass through. Track failure rate. |
| Open (failing) | Requests fail fast. Return cached/default/error. Don't call dependency. |
| Half-open (testing) | Allow 1 request through. If it succeeds, close. If it fails, reopen. |
Thresholds: Open after 5 consecutive failures or a >50% failure rate within a 60s window. Move to half-open after 30s.
This prevents cascading failures — one down service shouldn't take down everything.
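The three states can be sketched as a small class. This is a simplified illustration, not a production breaker: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, only the consecutive-failure threshold is implemented (the failure-rate window is omitted), and the class is not thread-safe, so "allow 1 request" holds only for single-threaded callers.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0                        # consecutive failures so far
        self.opened_at: Optional[float] = None   # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, op: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = op()
        except Exception:
            self.failures += 1
            # A failed trial in half-open reopens immediately; otherwise open
            # once the consecutive-failure threshold is reached.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise  # the caller still sees the original error
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and return a cached or default value, which is where this rule meets graceful degradation.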
Rule 4: Graceful Degradation
When a dependency fails, the system MUST continue operating with reduced functionality — not crash:
| Scenario | Degraded behavior |
|---|---|
| Cache down | Serve from database (slower, but working) |
| Search service down | Show recent/popular items instead |
| Email service down | Queue emails for later delivery |
| Analytics down | Drop analytics events (non-critical) |
| Payment provider slow | Extend timeout, show "processing" state |
Define degradation behavior during design, not during the outage. Every external dependency needs a "what if it's down?" answer.
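As one illustration, the "cache down" row could look like the sketch below. The function name `get_with_fallback` and the two reader callables are assumptions made for the example; a real cache client will raise its own exception types.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("reliability.sketch")


def get_with_fallback(key: str,
                      read_cache: Callable[[str], Optional[str]],
                      read_db: Callable[[str], Optional[str]]) -> Optional[str]:
    """Serve from cache when possible; fall back to the database if it is down."""
    try:
        cached = read_cache(key)
        if cached is not None:
            return cached
    except ConnectionError as exc:
        # Degrade, but never silently: record that the cache is unavailable.
        logger.warning("cache unavailable, serving from database: %s", exc)
    return read_db(key)  # slower, but working
```

The fallback path is decided in the code's design, not improvised during the outage, and only the specific failure that has a planned answer is caught.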
Rule 5: Idempotent Operations
Every write operation MUST be safe to retry:
- Use idempotency keys for payment and state-changing operations.
- Use upserts instead of insert-then-update.
- Use database transactions for multi-step mutations.
- Use `IF NOT EXISTS` / `ON CONFLICT` for creates.
Test: Call the operation twice with the same input. Does it produce the same result? If not, it's not idempotent — fix it.
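A minimal sketch of an idempotency key combined with `IF NOT EXISTS` and `ON CONFLICT`, using Python's built-in `sqlite3` module. The `payments` table, its columns, and the function names are hypothetical, and the `ON CONFLICT ... DO NOTHING` clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # IF NOT EXISTS makes schema creation itself safe to re-run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_cents INTEGER NOT NULL)"
    )


def record_payment(conn: sqlite3.Connection, key: str, amount_cents: int) -> int:
    """Record a payment at most once per key; return the stored amount."""
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO payments (idempotency_key, amount_cents) VALUES (?, ?)"
            " ON CONFLICT(idempotency_key) DO NOTHING",
            (key, amount_cents),
        )
        row = conn.execute(
            "SELECT amount_cents FROM payments WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
    return row[0]
```

Calling `record_payment` twice with the same key leaves exactly one row and returns the same result both times, which is the test this rule describes.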
Rule 6: Health Checks and Observability
Every service MUST expose:
| Endpoint | Purpose |
|---|---|
| Liveness | "Am I running?" Returns 200 if the process is alive. |
| Readiness | "Can I serve traffic?" Checks dependencies (DB, cache, etc.). |
Every failure MUST be observable:
- Structured logging for all errors (not just `console.log("error")`).
- Metrics for error rates, latency percentiles, queue depths.
- Alerts for error rate spikes, latency degradation, queue buildup.
If it fails silently, it might as well not exist.
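The two health checks and a structured error log might be sketched like this. It is framework-agnostic on purpose: `liveness` and `readiness` are hypothetical handler functions returning a status code and a body, and returning 503 when a dependency check fails is an assumption (a common convention), not something the rule prescribes.

```python
import json
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("reliability.sketch")


def liveness() -> Tuple[int, dict]:
    """'Am I running?' If this code executes, the process is alive."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """'Can I serve traffic?' Runs one check per dependency."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception as exc:
            results[name] = False
            # Structured log: machine-parseable fields, not a bare string.
            logger.error(json.dumps({
                "event": "readiness_check_failed",
                "dependency": name,
                "error": repr(exc),
            }))
    status = 200 if all(results.values()) else 503
    body = {"status": "ready" if status == 200 else "not_ready",
            "checks": results}
    return status, body
```

Each check callable should itself use a short timeout, otherwise a hung dependency turns the readiness probe into a hung request.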
Rule 7: Data Integrity
Data MUST survive failures:
| Principle | Implementation |
|---|---|
| Atomic operations | Database transactions for multi-step writes |
| Write-ahead logging | Log intent before executing (for recovery) |
| Checksums | Verify data integrity after transfer/storage |
| Backup and recovery | Automated backups with tested restore procedures |
| Eventual consistency | Document which operations are eventually consistent and the convergence window |
Never lose acknowledged data. If you told the user "saved," it must be saved — even if the server crashes 1ms later.
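For file-based state, "saved means saved" could be sketched as below: write to a temporary file, force it to disk, atomically swap it into place, and only then acknowledge. The names `durable_write` and `verify` are hypothetical. The sketch assumes a POSIX-style filesystem where `os.replace` is atomic, and it omits the directory `fsync` that strict durability would also need.

```python
import hashlib
import os
import tempfile


def durable_write(path: str, data: bytes) -> str:
    """Write `data` so that returning means it is on disk; return its SHA-256."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # force the bytes to stable storage
        os.replace(tmp_path, path)   # swap: readers see old or new, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)      # don't leave partial files behind
        raise
    return hashlib.sha256(data).hexdigest()


def verify(path: str, expected_sha256: str) -> bool:
    """Re-read the file and compare checksums."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_sha256
```

A caller would tell the user "saved" only after `durable_write` returns, and could store the returned checksum to detect corruption later.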
Applying This Skill
During Planning (brainstorming / writing-plans)
Before finalizing any design or plan, run the Reliability Checklist:
- Every external call has a timeout and failure handling strategy
- Retries use exponential backoff with jitter (idempotent ops only)
- Circuit breakers protect against cascading failures
- Graceful degradation is defined for every external dependency
- Write operations are idempotent (safe to retry)
- Health check endpoints are defined (liveness + readiness)
- Data integrity is maintained through failures (transactions, WAL, backups)
If any item fails: redesign before proceeding to implementation.
During Implementation (executing-plans)
As you write code:
- Set explicit timeouts on every HTTP client, DB connection, and external call.
- Wrap external calls in try/catch with specific error handling (not bare catch-all).
- Add circuit breakers for any dependency called >10 times per minute.
- Return meaningful error responses — status code, error code, human message.
- Never swallow errors silently. Log, metric, or propagate.
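The points about specific error handling and meaningful error responses might look like this in practice. It is a sketch with hypothetical names (`error_response`, `handle`, the error codes) and assumes the upstream client raises Python's built-in `TimeoutError` and `ConnectionError`.

```python
import json
import logging
from typing import Callable, Tuple

logger = logging.getLogger("reliability.sketch")


def error_response(status: int, code: str, message: str) -> Tuple[int, str]:
    """Build a structured error body: status code, error code, human message."""
    return status, json.dumps({"error": {"code": code, "message": message}})


def handle(fetch_profile: Callable[[], dict]) -> Tuple[int, str]:
    """Call an external dependency and translate known failures."""
    try:
        return 200, json.dumps(fetch_profile())
    except TimeoutError as exc:
        logger.error("profile fetch timed out: %s", exc)
        return error_response(504, "UPSTREAM_TIMEOUT",
                              "The profile service took too long to respond.")
    except ConnectionError as exc:
        logger.error("profile service unreachable: %s", exc)
        return error_response(503, "UPSTREAM_UNAVAILABLE",
                              "The profile service is currently unavailable.")
    # No catch-all: unexpected errors propagate instead of being hidden.
```

Each `except` clause names one failure it knows how to handle, logs it, and returns a response a client can act on. Anything unexpected, such as a `KeyError` from a bug, reaches the caller untouched.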
During Review (code-review / receiving-code-review)
Verify these as part of every code review:
- Every external call has timeout and error handling
- No bare `catch` blocks that swallow errors
- Retry logic uses backoff (not immediate retry loops)
- Write operations are idempotent
- Health check endpoints exist and check real dependencies
- Error responses are structured and meaningful
When Modifying Existing Code
If existing code violates these rules:
- You are NOT required to add circuit breakers to all existing external calls.
- You ARE required to not make reliability worse.
- If adding a new external call, it MUST have timeout, retry, and error handling.
- If you find silently swallowed errors in code you're touching, fix them.
Anti-Patterns
| Pattern | Problem | Fix |
|---|---|---|
| Empty catch blocks | Errors silently disappear | Log, metric, or propagate |
| No timeouts | Requests hang forever | Explicit timeout on every call |
| Retry storms | Retries overwhelm failing service | Exponential backoff + circuit breaker |
| Cascading failures | One failure takes everything down | Circuit breakers + degradation |
| Optimistic updates | Assume success, discover failure later | Verify writes, use transactions |
| "It works on my machine" | Local env doesn't simulate failures | Chaos testing, fault injection |
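The fault-injection fix in the last row can start very small. The sketch below wraps any callable so that it fails at a configurable rate in tests; `with_faults` is a hypothetical helper, not part of any library.

```python
import random
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def with_faults(op: Callable[[], T], failure_rate: float,
                rng: Callable[[], float] = random.random,
                error: Type[Exception] = ConnectionError) -> Callable[[], T]:
    """Return a version of `op` that raises `error` at roughly `failure_rate`."""
    def wrapped() -> T:
        if rng() < failure_rate:
            raise error("injected fault")
        return op()
    return wrapped
```

Wrapping a dependency client this way in a local or test environment exercises the timeout, retry, and degradation paths that otherwise only run during a real outage.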
Rationalization Prevention
| Excuse | Reality |
|---|---|
| "That service never goes down" | It will. And you won't be ready. |
| "We'll add error handling later" | Later = after the first outage. |
| "It's just a timeout, it'll recover" | Without backoff, your retries will make it worse. |
| "The database is reliable" | Networks between you and the DB are not. |
| "We don't need health checks yet" | You need them the moment you deploy. |
| "This is a simple operation" | Simple operations fail in complex ways. Handle it. |