Release It! Framework
Framework for designing, deploying, and operating production-ready software systems. Based on a fundamental truth: the software that passes QA is not the software that survives production. Production is a hostile environment -- and your system must be built to expect and handle failure at every level.
Core Principle
Every system will eventually be pushed beyond its design limits. The question is not whether failures will happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct -- it is resilient, observable, and designed to operate through partial failures without human intervention.
Scoring
Goal: 10/10. When reviewing or creating production systems, rate them 0-10 based on adherence to the principles below. A 10/10 means full alignment with all guidelines; lower scores indicate gaps to address. Always provide the current score and specific improvements needed to reach 10/10.
The Release It! Framework
Six areas that determine whether software survives contact with production:
1. Stability Anti-Patterns
Core concept: Failures propagate through integration points, cascading across system boundaries. The most dangerous patterns are not bugs in your code -- they are emergent behaviors that arise when systems interact under stress.
Why it works: Recognizing anti-patterns lets you identify and eliminate the cracks before production traffic finds them. Every production outage traces back to one or more of these patterns. They are predictable, recurring, and preventable.
Key insights:
- Integration points are the number-one killer of production systems -- every socket, HTTP call, or queue is a risk
- Cascading failures spread when one system's failure causes its callers to fail, which causes their callers to fail
- Slow responses are worse than no response -- they tie up threads, exhaust pools, and propagate delays across the entire call chain
- Unbounded result sets turn a harmless query into an out-of-memory crash when data grows beyond test assumptions
- Users generate load patterns that no test suite can predict -- bots, retry storms, and flash crowds
- Self-denial attacks occur when your own marketing, coupons, or viral features overwhelm your infrastructure
- Blocked threads are the silent killer -- deadlocks and resource contention show no errors until everything stops
Code applications:
| Context | Pattern | Example |
|---|---|---|
| HTTP calls | Assume every remote call can fail, hang, or return garbage | Wrap all external calls with timeout + circuit breaker |
| Database queries | Enforce result set limits on every query | Add `LIMIT` clauses and paginate large result sets |
| Thread pools | Isolate pools per dependency to prevent cross-contamination | Separate thread pool for payment gateway vs. search |
| Load testing | Simulate realistic traffic including spikes and abuse patterns | Use production traffic replays, not synthetic happy-path scripts |
| Marketing events | Coordinate launches with capacity planning | Pre-scale before Black Friday; add queue for coupon redemption |
See: references/anti-patterns.md for detailed analysis of each anti-pattern with failure scenarios and detection strategies.
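The integration-point guidance above can be sketched as a guarded remote call; a minimal Python illustration in which the endpoint, payload cap, and timeout values are hypothetical, not prescriptions from the book:

```python
import urllib.request
import urllib.error

MAX_BODY_BYTES = 1_000_000  # bound what we will buffer from any peer

def fetch_orders(url: str, timeout_s: float = 5.0) -> bytes:
    """Call a remote integration point, assuming it can fail, hang, or
    return garbage. The timeout reclaims the thread if the peer stalls."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read(MAX_BODY_BYTES)  # cap the result set
    except (urllib.error.URLError, TimeoutError) as exc:
        # Fail fast with a domain error instead of letting the hang propagate
        raise RuntimeError(f"orders service unavailable: {exc}") from exc
```

In a production system this wrapper would sit behind a circuit breaker as well, so repeated failures trip open instead of retrying forever.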
2. Stability Patterns
Core concept: Counter each anti-pattern with a stability pattern. Circuit breakers stop cascading failures. Bulkheads isolate blast radius. Timeouts reclaim stuck resources. Together they create a system that bends under load but does not break.
Why it works: These patterns work because they accept failure as inevitable and design the system's response to failure, rather than trying to prevent all failures. A circuit breaker that trips is the system working correctly -- it is protecting itself from a downstream failure.
Key insights:
- Circuit Breaker: three states (closed, open, half-open) -- trips after threshold failures, periodically tests recovery
- Bulkheads: partition resources so one failing component cannot drain the entire system
- Timeouts: every outbound call needs both a connect timeout and a read timeout -- and timeouts must propagate up the call chain
- Retry with backoff: exponential backoff + jitter prevents thundering herd on recovery
- Fail Fast: if you know a request will fail, reject it immediately -- do not waste resources attempting it
- Steady State: systems accumulate cruft (logs, sessions, temp files) -- design for automatic cleanup
- Let It Crash: sometimes the safest recovery is to restart the process cleanly rather than limping along in an unknown state
- Handshaking: let the server tell the client whether it can accept work before the client sends it
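The retry guidance above can be sketched as a small helper; a Python illustration in which the function name and default parameters are ours, not from the book:

```python
import random
import time

def retry_with_backoff(op, max_retries=3, base_delay_s=0.1, cap_s=2.0):
    """Retry a failing operation with exponential backoff plus full jitter,
    so recovering fleets do not synchronize into a thundering herd."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            delay = min(cap_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```

Full jitter (a uniform draw up to the backoff ceiling) spreads retries across the window instead of stacking them at fixed intervals.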
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Service calls | Circuit Breaker with threshold and recovery timeout | Open after 5 failures in 60s; half-open after 30s |
| Resource isolation | Bulkhead with dedicated pools per dependency | Separate connection pools for critical vs. non-critical services |
| Network calls | Timeout with propagation | Connect: 1s, read: 5s; propagate deadline to downstream calls |
| Retries | Exponential backoff + jitter + retry budget | Base 100ms, max 3 retries, 20% retry budget across fleet |
| Data cleanup | Steady State with automated purging | Delete sessions older than 24h; rotate logs at 500MB |
See: references/stability-patterns.md for implementation details, state machines, threshold tuning, and pattern combinations.
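As one concrete illustration of the Circuit Breaker state machine described above, here is a minimal Python sketch using the example thresholds from the table (open after 5 failures in 60s, allow a probe after 30s); the class and method names are ours:

```python
import time

class CircuitBreaker:
    """Minimal three-state breaker: CLOSED passes calls through, OPEN
    rejects immediately (fail fast), HALF_OPEN lets one probe through."""

    def __init__(self, threshold=5, window_s=60.0, recovery_s=30.0):
        self.threshold = threshold
        self.window_s = window_s
        self.recovery_s = recovery_s
        self.failures = []  # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, op):
        now = time.monotonic()
        if self.state == "OPEN":
            if now - self.opened_at >= self.recovery_s:
                self.state = "HALF_OPEN"  # allow a single probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = op()
        except Exception:
            self._record_failure(now)
            raise
        # Success: a probe in HALF_OPEN closes the circuit again
        self.failures.clear()
        self.state = "CLOSED"
        return result

    def _record_failure(self, now):
        if self.state == "HALF_OPEN":
            self.state = "OPEN"  # probe failed: reopen immediately
            self.opened_at = now
            return
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.state = "OPEN"
            self.opened_at = now
```

Note that a tripped breaker raising immediately is the Fail Fast pattern at work: callers get a quick, explicit error instead of a hung thread.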
3. Capacity and Availability
Core concept: Capacity is not a single number -- it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and thread counts. Capacity planning means understanding which resource becomes the bottleneck first and at what load level.
Why it works: Systems that are not capacity-tested fail in production at the worst possible moment -- during peak load. Understanding your system's actual limits (not theoretical limits) lets you set realistic SLAs and plan scaling before users hit the wall.
Key insights:
- Performance testing taxonomy: load test (expected traffic), stress test (beyond limits), soak test (sustained load over time), spike test (sudden bursts)
- The Universal Scalability Law: throughput does not scale linearly -- contention and coherence costs cause diminishing returns
- Connection pools are finite and precious -- a pool exhaustion looks identical to a database outage from the application's perspective
- Thread pools must be sized based on measured throughput, not guesses -- too few starve the system, too many cause context-switching overhead
- Myth: "The cloud is infinitely scalable" -- auto-scaling has lag time, cold-start costs, and hard limits
- Resource pools need health checks, eviction policies, and maximum lifetime limits
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Load testing | Ramp to expected peak, then 2x, observe degradation curve | Gradually increase RPS until latency exceeds SLO |
| Connection pools | Size based on measured concurrency, not defaults | Measure active connections under load; set pool to P99 + 20% headroom |
| Auto-scaling | Define scaling triggers with appropriate cooldown | Scale on CPU > 70% sustained 3 min; cooldown 5 min |
| Soak testing | Run at 80% capacity for 24-72 hours | Catch memory leaks, connection leaks, file handle exhaustion |
| Capacity model | Document resource bottleneck per service | "Service X is memory-bound at 2000 RPS; needs 4GB per instance" |
See: references/capacity-planning.md for testing methodologies, resource pool management, and scalability modeling.
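The Universal Scalability Law mentioned above can be made concrete; a small Python sketch where the coefficients are illustrative assumptions (in practice alpha and beta must be fitted to measured throughput):

```python
def usl_throughput(n, lam=1000.0, alpha=0.03, beta=0.0001):
    """Universal Scalability Law: X(N) = lam*N / (1 + a(N-1) + b*N(N-1)).
    lam   = throughput of a single node (requests/sec)
    alpha = contention penalty (serialized fraction of work)
    beta  = coherence penalty (crosstalk between nodes)"""
    return (lam * n) / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Diminishing returns: doubling nodes never doubles throughput
for n in (1, 8, 32, 64):
    print(n, round(usl_throughput(n)))
```

Even small contention and coherence costs bend the curve well below linear, which is why capacity must be measured, not extrapolated.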
4. Deployment and Release
Core concept: Deployment (putting code on servers) and release (exposing code to users) are separate operations that should be decoupled. Separating them gives you the ability to deploy without risk and release with confidence.
Why it works: Most outages are caused by changes -- deployments, configuration updates, database migrations. Decoupling deployment from release means you can deploy code to production, verify it works, and only then route traffic to it. If something goes wrong, you roll back the release, not the deployment.
Key insights:
- Zero-downtime deployment is non-negotiable for any system with users -- rolling deploys, blue-green, or canary
- Feature flags decouple deployment from release -- dark-launch code and enable it independently
- Database migrations must be backward-compatible -- the old code and new code will run simultaneously during deployment
- Immutable infrastructure: never patch a running server -- build a new image, deploy it, destroy the old one
- Canary releases limit blast radius by routing a small percentage of traffic to the new version first
- Rollback must be faster than roll-forward -- if rollback takes 30 minutes, you will avoid deploying
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Deploys | Blue-green with health check gate | Deploy to green; run smoke tests; swap router |
| Progressive rollout | Canary with automated rollback | Route 5% traffic to canary; auto-rollback if error rate > 1% |
| Feature launch | Feature flags with emergency off switch | Ship code behind flag; enable for 10% of users; monitor; ramp |
| Schema changes | Expand-contract migration pattern | Add new column; deploy code that writes both; backfill; drop old column |
| Rollback | Instant rollback via traffic routing | Keep previous version running; rollback = switch load balancer target |
See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code practices.
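The feature-flag ramp described above can be sketched with deterministic hash bucketing, so ramping from 10% to 50% only ever adds users and never flips earlier ones back off; a Python illustration with names of our choosing:

```python
import hashlib

def flag_enabled(flag: str, user_id: str, percent: int) -> bool:
    """Deterministic percentage rollout: the same user always lands in the
    same bucket for a given flag, independent of server or restart."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket 0..99
    return bucket < percent
```

Because the bucket is a pure function of flag and user, raising `percent` is monotonic, and setting it to 0 is the emergency off switch.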
5. Health Checks and Observability
Core concept: You cannot operate what you cannot observe. Observability is not an afterthought -- it is a first-class design concern. Health checks, metrics, logs, and traces are the sensory organs of your system in production.
Why it works: Production systems fail in ways that are invisible without proper instrumentation. A health check that only returns "OK" tells you nothing. Metrics without context are noise. Observability done right gives you the ability to answer questions about your system that you did not anticipate at design time.
Key insights:
- Health checks come in two flavors: shallow (process alive) and deep (dependencies reachable, resources available)
- The three pillars of observability: structured logs (what happened), metrics (how much), distributed traces (where and how long)
- RED method for services: Rate (requests/sec), Errors (error rate), Duration (latency distribution)
- USE method for resources: Utilization (%), Saturation (queue depth), Errors (error count)
- SLIs measure user experience; SLOs set targets; SLAs create contractual obligations -- define them in that order
- Alerting on symptoms (user-facing errors) beats alerting on causes (CPU usage) -- alert on what users feel
- Dashboards should answer "Is the system healthy right now?" within 5 seconds of looking
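The RED method above takes only a few lines to instrument; a Python sketch of the shape (a real service would export these through a metrics library rather than in-process dictionaries):

```python
import time
from collections import defaultdict

class RedMetrics:
    """Track the RED signals per endpoint: Rate, Errors, Duration."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies = defaultdict(list)

    def observe(self, endpoint, fn):
        start = time.monotonic()
        self.requests[endpoint] += 1  # Rate numerator
        try:
            return fn()
        except Exception:
            self.errors[endpoint] += 1  # Errors
            raise
        finally:
            # Duration is recorded for successes and failures alike
            self.latencies[endpoint].append(time.monotonic() - start)

    def p95(self, endpoint):
        xs = sorted(self.latencies[endpoint])
        return xs[int(0.95 * (len(xs) - 1))] if xs else 0.0
```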
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Health endpoints | Deep health check with dependency status | `/health` reports per-dependency status (DB, cache, queue) |
| Service metrics | RED method instrumentation | Track request rate, error rate, and p50/p95/p99 latency per endpoint |
| Resource metrics | USE method for infrastructure | Track CPU utilization, request queue depth, and error counts per host |
| Distributed tracing | Propagate trace context across service boundaries | Inject trace ID in headers; correlate logs across services |
| Alerting | Alert on SLO burn rate, not raw thresholds | "Error budget burning 10x normal rate" vs. "CPU > 80%" |
See: references/observability.md for health check design, metrics instrumentation, SLO frameworks, and alerting strategies.
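A deep health check as described above is an aggregator over per-dependency probes; a Python sketch in which the dependency names and probe callables are illustrative:

```python
def deep_health(checks):
    """Aggregate per-dependency probes into one report. Each probe is a
    callable that raises on failure (e.g. ping the DB, GET the cache)."""
    report = {}
    for name, probe in checks.items():
        try:
            probe()
            report[name] = "UP"
        except Exception as exc:
            report[name] = f"DOWN: {exc}"
    deps_up = all(v == "UP" for v in report.values())
    report["status"] = "UP" if deps_up else "DEGRADED"
    return report
```

A load balancer keyed on `status` then stops routing to instances whose dependencies are broken, instead of only checking that the process is alive.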
6. Adaptation and Chaos Engineering
Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.
Core concept: Confidence in your system's resilience comes from testing it under realistic failure conditions. Chaos engineering is the discipline of experimenting on a system in a controlled environment to build confidence in its ability to withstand turbulent conditions.
Why it works: You cannot know how your system handles failure until it actually fails. Waiting for production incidents to discover weaknesses is reactive and expensive. Chaos engineering proactively injects failures in a controlled way, turning unknown-unknowns into known-knowns before they cause real outages.
Key insights:
- Define steady state first -- you need a measurable baseline to detect when behavior deviates
- Start small in non-production environments: terminate a single process, add latency to one call -- then escalate gradually with approvals
- Minimize blast radius: use canary populations, feature flags, and emergency stop mechanisms for experiments
- Production experiments require explicit authorization, monitoring, and immediate rollback capability
- Automate recurring experiments so resilience is continuously verified, not a one-time event
- GameDay exercises combine chaos engineering with incident response practice -- test both the system and the team
- Every experiment should have a hypothesis: "We believe that when X fails, the system will Y"
- Build a culture where finding weaknesses is celebrated, not punished
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Process failure | Controlled instance termination (via chaos tooling) | Terminate one pod using Gremlin/Litmus; verify service recovers within SLO |
| Network failure | Inject latency or partition between services (via chaos tooling) | Add 500ms latency to DB calls; verify circuit breaker trips |
| Dependency failure | Simulate downstream service outage (via chaos tooling) | Return 503 from payment API; verify graceful degradation |
| Resource exhaustion | Simulate resource pressure (via chaos tooling) | Stress-test memory limits; verify process restarts cleanly |
| GameDay | Scheduled team exercise with realistic failure scenario | "Primary database goes read-only at 2pm" -- practice response |
See: references/chaos-engineering.md for experiment design, blast radius management, and building a chaos engineering practice.
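In line with the safety note above, an experiment can be captured as a design-time record rather than executable injection; a Python sketch whose field names are our own:

```python
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    """A design-time experiment plan: what to test and what to verify.
    Execution belongs to authorized engineers using dedicated tooling
    (Gremlin, Litmus, AWS FIS), never to this code."""
    hypothesis: str           # "We believe that when X fails, the system will Y"
    steady_state_metric: str  # measurable baseline, e.g. checkout p99 latency
    blast_radius: str         # canary population or environment limit
    abort_condition: str      # emergency-stop trigger
    approvals: list = field(default_factory=list)

    def ready_for_review(self) -> bool:
        # Reviewable only when every safety-relevant field is filled in
        return all([self.hypothesis, self.steady_state_metric,
                    self.blast_radius, self.abort_condition])
```

Forcing hypothesis, blast radius, and abort condition to exist before review encodes the discipline the section describes.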
Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| No timeouts on outbound calls | One slow dependency freezes the entire system | Set connect and read timeouts on every external call |
| Unbounded retries | Retry storms amplify failures instead of recovering from them | Use exponential backoff, jitter, and fleet-wide retry budgets |
| Shared thread/connection pools | One failing dependency drains resources from all features | Bulkhead: isolate pools per dependency or feature |
| Shallow health checks only | Load balancer routes traffic to instances with broken dependencies | Implement deep health checks that verify downstream connectivity |
| Testing only the happy path | System works perfectly until the first real failure | Load test, soak test, and chaos test before every major release |
| Coupling deploy and release | Every deployment is a high-risk event with all-or-nothing rollout | Use feature flags, canary releases, and blue-green deployments |
| Alerting on causes, not symptoms | High CPU alerts fire but users are fine; errors spike but no alert fires | Alert on user-facing SLIs: error rate, latency, availability |
| No capacity model | System falls over at 2x load during an event nobody planned for | Model bottleneck resources; load test to 3x expected peak |
Quick Diagnostic
Audit any production system:
| Question | If No | Action |
|---|---|---|
| Does every outbound call have a timeout? | Calls can hang indefinitely, blocking threads | Add connect and read timeouts to all external calls |
| Are circuit breakers in place for critical dependencies? | One dependency failure takes down the whole system | Add circuit breakers with appropriate thresholds |
| Are thread/connection pools isolated per dependency? | Shared pools allow cross-contamination of failures | Implement bulkhead pattern with dedicated pools |
| Can you deploy without downtime? | Deployments cause user-visible outages | Implement rolling, blue-green, or canary deployment |
| Do health checks verify dependency connectivity? | Dead instances receive traffic; partial failures go undetected | Add deep health checks that test DB, cache, queue |
| Are logs, metrics, and traces correlated? | Debugging requires manual log searching across services | Implement distributed tracing with correlated IDs |
| Have you load-tested beyond expected peak? | Unknown failure mode under real load | Load test to 2-3x expected peak; document breaking point |
| Do you practice failure injection? | Resilience is theoretical, not verified | Start chaos engineering with low-risk experiments |
Reference Files
- anti-patterns.md: Integration point failures, cascading failures, blocked threads, unbounded result sets, self-denial attacks, slow responses
- stability-patterns.md: Circuit Breaker, Bulkhead, Timeout, Retry, Fail Fast, Steady State, Let It Crash, Handshaking
- capacity-planning.md: Load/stress/soak testing, connection pool sizing, thread pool tuning, Universal Scalability Law
- deployment-strategies.md: Blue-green, canary, rolling deploys, feature flags, database migrations, immutable infrastructure
- observability.md: Health checks, RED/USE methods, SLIs/SLOs/SLAs, distributed tracing, alerting strategy
- chaos-engineering.md: Steady state hypothesis, failure injection, GameDay exercises, blast radius management
Further Reading
This skill is based on Michael Nygard's essential guide to building production-ready software. For the complete methodology, war stories, and implementation details:
- "Release It! Design and Deploy Production-Ready Software" (2nd Edition) by Michael T. Nygard
About the Author
Michael T. Nygard is a software architect and author with over 30 years of experience building and operating large-scale production systems. He has worked across industries including finance, retail, and government, and has been responsible for systems handling millions of transactions per day. Nygard is known for bridging the gap between development and operations, advocating that architects must be responsible for the systems they design long after the code is written. The first edition of Release It! (2007) became a foundational text in the DevOps and site reliability engineering movements. The second edition (2018) expands coverage to cloud-native architectures, containerization, and modern deployment practices. Nygard is a frequent conference speaker and has contributed to the broader conversation about resilience engineering, sociotechnical systems, and the human factors that influence production stability.