reliability-strategy-builder
Reliability Strategy Builder
Build resilient systems with proper failure handling and SLOs.
Reliability Patterns
1. Circuit Breaker
Prevent cascading failures by stopping requests to failing services.
```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        // Cooldown elapsed: let a trial request through
        this.state = "half-open";
      } else {
        // Fail fast without calling the failing service
        throw new Error("Circuit breaker is OPEN");
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();
    // Open after 5 consecutive failures. The count is not reset while open,
    // so a failed half-open trial reopens the circuit immediately.
    if (this.failureCount >= 5) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const now = Date.now();
    const elapsed = now - this.lastFailureTime.getTime();
    return elapsed > 60000; // 1 minute
  }
}
```
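The breaker above hard-codes its threshold (5 failures) and cooldown (1 minute), which makes the state transitions awkward to exercise in a test. The following is a minimal sketch of a configurable variant, not a replacement for the class above. The names `ConfigurableCircuitBreaker`, `BreakerOptions`, `getState`, and the injectable `now` clock are hypothetical and introduced only for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical options object; none of these names come from the original class.
interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  resetTimeoutMs: number;   // how long the circuit stays open before a trial call
  now?: () => number;       // injectable clock, defaults to Date.now
}

class ConfigurableCircuitBreaker {
  private state: BreakerState = "closed";
  private failureCount = 0;
  private openedAt = 0;
  private readonly now: () => number;

  constructor(private readonly options: BreakerOptions) {
    this.now = options.now ?? Date.now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.options.resetTimeoutMs) {
        throw new Error("Circuit breaker is OPEN");
      }
      // Cooldown elapsed: allow a trial request
      this.state = "half-open";
    }
    try {
      const result = await operation();
      this.failureCount = 0;
      this.state = "closed";
      return result;
    } catch (error) {
      this.failureCount++;
      const trialFailed = this.state === "half-open";
      if (trialFailed || this.failureCount >= this.options.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

Passing a fake clock as `now` lets a test move time forward without waiting for the real cooldown.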
2. Retry with Backoff
Handle transient failures with exponential backoff.
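```typescript
// Resolve after `ms` milliseconds
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of retries: surface the last error to the caller
      if (attempt === maxRetries) throw error;
      // Exponential backoff: 1s, 2s, 4s
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  // Unreachable; keeps the compiler satisfied about the return type
  throw new Error("Max retries exceeded");
}
```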
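When many clients retry on the same exponential schedule, their retries can arrive in synchronized waves. A common refinement is to randomize each delay ("full jitter") and cap it. This is a sketch of that extension, not part of the original function; `backoffDelay`, `retryWithJitter`, and the `maxDelay` parameter are hypothetical names.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Pick a random delay in [0, min(maxDelay, baseDelay * 2^attempt)).
// `random` is injectable so the calculation can be tested deterministically.
function backoffDelay(
  attempt: number,
  baseDelay: number,
  maxDelay: number,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(maxDelay, baseDelay * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

async function retryWithJitter<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000,
  maxDelay = 30000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up once the retry budget is spent
      if (attempt >= maxRetries) throw error;
      await sleep(backoffDelay(attempt, baseDelay, maxDelay));
    }
  }
}
```

The cap keeps late attempts from waiting arbitrarily long, and the random factor spreads retries from different clients across the interval.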
3. Fallback Pattern
Provide degraded functionality when primary fails.
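```typescript
// `User`, `primaryDb`, `cache`, and `logger` are application-specific and
// assumed to be defined elsewhere.
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");
    try {
      // Fallback to cache
      const cached = await cache.get(`user:${userId}`);
      if (cached) return cached;
    } catch (cacheError) {
      // A cache outage must not block the final fallback
      logger.warn("Cache lookup failed, returning placeholder user");
    }
    // Final fallback: return minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}
```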
4. Bulkhead Pattern
Isolate failures to prevent resource exhaustion.
```typescript
// `Semaphore` is assumed to provide `acquire(): Promise<void>` and `release()`.
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools for different operations
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    if (!pool) {
      throw new Error(`Unknown priority: ${priority}`);
    }
    // Wait for a free slot in this pool only; other pools are unaffected
    await pool.acquire();
    try {
      return await operation();
    } finally {
      pool.release();
    }
  }
}
```
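The bulkhead example relies on a `Semaphore` class that is not defined in the text and is not part of the JavaScript standard library. One possible minimal implementation, assuming only the `acquire()` and `release()` methods used above, might look like this:

```typescript
// Minimal counting semaphore (an assumed helper, not from the original text).
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (permits < 1 || Math.floor(permits) !== permits) {
      throw new RangeError("permits must be a positive integer");
    }
    this.available = permits;
  }

  // Resolves immediately if a permit is free, otherwise waits in FIFO order.
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hands the permit directly to the oldest waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.available++;
    }
  }
}
```

Because each pool has its own semaphore, a flood of `background` work can exhaust only its own 10 permits and cannot starve `critical` operations.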
SLO Definitions
SLO Template
```yaml
service: user-api
slos:
  - name: Availability
    description: API should be available for successful requests
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
    window: 30 days
  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
    window: 7 days
  - name: Error Rate
    description: Less than 1% of requests result in errors
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
    window: 24 hours
```
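To make the `ratio` and latency measurements in the template concrete, here is a rough sketch of how they could be evaluated over a batch of request records. The `RequestRecord` shape and all function names are assumptions for illustration; a real system would normally compute these from its metrics backend rather than in application code.

```typescript
// Hypothetical record shape; field names are not taken from the template.
interface RequestRecord {
  statusCode: number;
  durationMs: number;
}

// Percentage of records that satisfy `isGood`.
// Assumption: an empty window counts as 100% (no requests, no violations).
function percentMatching(
  records: RequestRecord[],
  isGood: (record: RequestRecord) => boolean
): number {
  if (records.length === 0) return 100;
  const good = records.filter(isGood).length;
  return (good / records.length) * 100;
}

// Availability SLO: success is status_code < 500
function availabilityPercent(records: RequestRecord[]): number {
  return percentMatching(records, (r) => r.statusCode < 500);
}

// Latency SLO: share of requests completing within the threshold
function latencyPercent(records: RequestRecord[], thresholdMs = 500): number {
  return percentMatching(records, (r) => r.durationMs <= thresholdMs);
}

function meetsTarget(achievedPercent: number, targetPercent: number): boolean {
  return achievedPercent >= targetPercent;
}
```

For example, a window where 3 of 4 requests return a status below 500 yields 75% availability, which does not meet a 99.9% target.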
Error Budget
Error Budget = 100% - SLO
Example:
SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes/month downtime allowed (assuming a 30-day month)
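The arithmetic behind that figure is 0.1% of 30 days × 24 hours × 60 minutes = 43,200 minutes. A small sketch of the same calculation for arbitrary targets and windows follows; the function names are hypothetical.

```typescript
// Allowed downtime in minutes for an availability SLO over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (sloPercent <= 0 || sloPercent > 100) {
    throw new RangeError("sloPercent must be in (0, 100]");
  }
  const budgetFraction = (100 - sloPercent) / 100; // Error Budget = 100% - SLO
  return budgetFraction * windowDays * 24 * 60;
}

// Budget left after some downtime; a negative value means the SLO is violated.
function remainingBudgetMinutes(
  sloPercent: number,
  windowDays: number,
  downtimeMinutes: number
): number {
  return errorBudgetMinutes(sloPercent, windowDays) - downtimeMinutes;
}

// 99.9% over 30 days gives roughly 43.2 minutes
// (floating-point rounding may show a value very slightly below 43.2).
console.log(errorBudgetMinutes(99.9, 30).toFixed(1));
```

Tracking the remaining budget, rather than only the SLO percentage, gives a direct signal for when to slow down risky releases.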
Failure Mode Analysis
| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |
Reliability Checklist
Infrastructure
- Load balancer with health checks
- Multiple availability zones
- Auto-scaling configured
- Database replication
- Regular backups (tested!)
Application
- Circuit breakers on external calls
- Retry logic with backoff
- Timeouts on all I/O
- Fallback mechanisms
- Graceful degradation
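Of the items above, "Timeouts on all I/O" is the one not illustrated elsewhere in this document. One way to sketch it is a wrapper that races a promise against a timer. `withTimeout` and `TimeoutError` are hypothetical names, and the wrapper only stops waiting; it does not cancel the underlying operation.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Reject with TimeoutError if `operation` has not settled within `ms`.
async function withTimeout<T>(operation: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([operation, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive
    clearTimeout(timer);
  }
}
```

A timeout also pairs naturally with the patterns above: a timed-out call counts as a failure for the circuit breaker and can trigger a retry or a fallback.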
Monitoring
- SLO dashboard
- Error budgets tracked
- Alerting on SLO violations
- Latency percentiles (p50, p95, p99)
- Dependency health checks
Operations
- Incident response runbook
- On-call rotation
- Postmortem template
- Disaster recovery plan
- Chaos engineering tests
Incident Response Plan
Severity Levels
SEV1 (Critical): Complete service outage, data loss
- Response time: <15 minutes
- Page on-call immediately
SEV2 (High): Partial outage, degraded performance
- Response time: <1 hour
- Alert on-call
SEV3 (Medium): Minor issues, workarounds available
- Response time: <4 hours
- Create ticket
SEV4 (Low): Cosmetic issues, no user impact
- Response time: Next business day
- Backlog
Incident Response Steps
- Acknowledge: Confirm receipt within SLA
- Assess: Determine severity and impact
- Communicate: Update status page
- Mitigate: Stop the bleeding (rollback, scale, disable)
- Resolve: Fix root cause
- Document: Write postmortem
Best Practices
- Design for failure: Assume components will fail
- Fail fast: Don't let slow failures cascade
- Isolate failures: Bulkhead pattern
- Graceful degradation: Reduce functionality, don't crash
- Monitor SLOs: Track error budgets
- Test failure modes: Chaos engineering
- Document runbooks: Clear incident response
Output Checklist
- Circuit breakers implemented
- Retry logic with backoff
- Fallback mechanisms
- Bulkhead isolation
- SLOs defined (availability, latency, errors)
- Error budgets calculated
- Failure mode analysis
- Monitoring dashboard
- Incident response plan
- Runbooks documented