reliability-strategy-builder

Reliability Strategy Builder

Build resilient systems with proper failure handling and SLOs.

Reliability Patterns

1. Circuit Breaker

Prevent cascading failures by stopping requests to failing services.
typescript
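class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  constructor(
    private readonly failureThreshold = 5, // consecutive failures before opening
    private readonly resetTimeoutMs = 60000 // how long to stay open (1 minute)
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        // Let a trial request through to probe the service.
        this.state = "half-open";
      } else {
        // Fail fast without calling the failing service.
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    // Any success closes the circuit and clears the failure streak.
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();

    // A failed trial call in half-open reopens the circuit immediately.
    if (this.state === "half-open" || this.failureCount >= this.failureThreshold) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return elapsed > this.resetTimeoutMs;
  }
}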

2. Retry with Backoff

Handle transient failures with exponential backoff.
typescript
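// Resolve after `ms` milliseconds.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of retries: surface the last error to the caller.
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s with the default arguments
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  // Only reached if maxRetries is negative, so the loop never runs.
  throw new Error("Max retries exceeded");
}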
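
Pure exponential backoff can make many clients retry at the same instants. A common refinement is to randomize each delay ("full jitter") and cap it. The following is a minimal sketch of that idea; the helper name `backoffDelayWithJitter`, the `maxDelayMs` cap, and the injectable `random` parameter are assumptions for illustration, not part of the original function.

```typescript
// Hypothetical helper: delay in ms before retry number `attempt` (0-based).
function backoffDelayWithJitter(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 30000, // assumed cap so delays do not grow without bound
  random: () => number = Math.random // injectable for deterministic tests
): number {
  // Exponential ceiling: base * 2^attempt, limited by the cap.
  const ceiling = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
  // Full jitter: pick a delay uniformly in [0, ceiling).
  return Math.floor(random() * ceiling);
}
```

To use it, the fixed `delay` computation inside the retry loop could be swapped for a call to this helper.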

3. Fallback Pattern

Provide degraded functionality when primary fails.
typescript
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");

    try {
      // Fallback to cache
      const cached = await cache.get(`user:${userId}`);
      if (cached) return cached;
    } catch (cacheError) {
      // A cache failure must not prevent the final fallback.
      logger.warn("Cache lookup failed, returning placeholder user");
    }

    // Final fallback: return minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}

4. Bulkhead Pattern

Isolate failures to prevent resource exhaustion.
typescript
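// Semaphore is assumed to be a counting semaphore with acquire()/release();
// it is not defined in this snippet.
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools for different operations
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    // Reject unknown priorities instead of failing on an undefined pool.
    if (!pool) throw new Error(`Unknown priority: ${priority}`);
    await pool.acquire();

    try {
      return await operation();
    } finally {
      // Always return the permit, even when the operation throws.
      pool.release();
    }
  }
}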
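
The bulkhead above relies on a `Semaphore` with `acquire()` and `release()`, which the snippet does not define. A minimal sketch of one possible implementation follows, assuming a simple FIFO queue of waiters; it is an illustration, not a reference to any particular library.

```typescript
// Hypothetical counting semaphore with FIFO waiters.
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No permit free: wait until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      // Pass the permit directly to the longest-waiting caller.
      next();
    } else {
      this.available++;
    }
  }
}
```

Handing the permit straight to the next waiter, rather than incrementing the counter first, keeps a late caller from jumping ahead of callers that are already queued.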

SLO Definitions

SLO Template

yaml
service: user-api
slos:
  - name: Availability
    description: Requests are served without server errors (5xx)
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
    window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
    window: 7 days

  - name: Error Rate
    description: Less than 1% of requests result in errors (401, 403, and 404 are not counted as errors)
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
    window: 24 hours
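
A ratio-type SLO like the ones above can be checked by dividing successful requests by total requests over the window and comparing the result with the target. The sketch below assumes request counts are already available from a metrics store; the names `RatioSlo` and `evaluateRatioSlo` are hypothetical, and treating a window with no traffic as compliant is an assumption.

```typescript
// Hypothetical shape for a ratio SLO; target is a fraction (0.999 = 99.9%).
interface RatioSlo {
  name: string;
  target: number;
}

function evaluateRatioSlo(
  slo: RatioSlo,
  successCount: number,
  totalCount: number
): { sli: number; met: boolean } {
  // Assumption: no traffic in the window counts as meeting the SLO.
  const sli = totalCount === 0 ? 1 : successCount / totalCount;
  return { sli, met: sli >= slo.target };
}

const availability: RatioSlo = { name: "Availability", target: 0.999 };
// 50 failures in 100,000 requests stays within a 99.9% target.
const result = evaluateRatioSlo(availability, 99950, 100000);
console.log(availability.name, result.met);
```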

Error Budget

Error Budget = 100% - SLO

Example:
SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes of downtime allowed per 30-day month
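
The 43.2-minute figure comes from 0.1% of a 30-day window: 30 × 24 × 60 = 43,200 minutes, and 0.001 × 43,200 = 43.2. The same arithmetic as a small sketch, where the function name `errorBudgetMinutes` is an assumption for illustration:

```typescript
// Allowed downtime in minutes for an availability SLO over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = (100 - sloPercent) / 100; // e.g. 99.9 -> 0.001
  const windowMinutes = windowDays * 24 * 60; // 30 days -> 43,200 minutes
  return budgetFraction * windowMinutes;
}

// Approximately 43.2 (floating-point rounding may add a tiny error).
console.log(errorBudgetMinutes(99.9, 30).toFixed(1));
```

Raising the target to 99.99% over the same window shrinks the budget tenfold, to roughly 4.3 minutes.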

Failure Mode Analysis

markdown
| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |

Reliability Checklist

Infrastructure

  • Load balancer with health checks
  • Multiple availability zones
  • Auto-scaling configured
  • Database replication
  • Regular backups (tested!)

Application

  • Circuit breakers on external calls
  • Retry logic with backoff
  • Timeouts on all I/O
  • Fallback mechanisms
  • Graceful degradation
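
For the "Timeouts on all I/O" item, one way to bound any promise-returning call is to race it against a timer. This is a minimal sketch; the helper name `withTimeout` is an assumption, and note that it only stops waiting for the result and does not cancel the underlying operation.

```typescript
// Hypothetical helper: reject if `operation` has not settled within timeoutMs.
async function withTimeout<T>(
  operation: Promise<T>,
  timeoutMs: number
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });

  try {
    // Whichever settles first decides the outcome.
    return await Promise.race([operation, timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive.
    clearTimeout(timer);
  }
}
```

A call such as `withTimeout(fetchUser(id), 2000)`, with `fetchUser` standing in for any I/O function, would then fail fast after two seconds instead of hanging.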

Monitoring

  • SLO dashboard
  • Error budgets tracked
  • Alerting on SLO violations
  • Latency percentiles (p50, p95, p99)
  • Dependency health checks
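
The latency percentiles in the list above (p50, p95, p99) can be computed from raw samples with the nearest-rank method. The sketch below assumes the samples fit in memory; production monitoring systems typically use histograms or sketches instead, and the function name `percentile` is an assumption.

```typescript
// Nearest-rank percentile: smallest sample with at least p% of values <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");

  // Sort a copy so the caller's array is left untouched.
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

const latenciesMs = [120, 80, 450, 95, 300, 110, 700, 90, 105, 130];
console.log(percentile(latenciesMs, 50), percentile(latenciesMs, 95));
```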

Operations

  • Incident response runbook
  • On-call rotation
  • Postmortem template
  • Disaster recovery plan
  • Chaos engineering tests

Incident Response Plan

Severity Levels

SEV1 (Critical): Complete service outage, data loss
  - Response time: <15 minutes
  - Page on-call immediately

SEV2 (High): Partial outage, degraded performance
  - Response time: <1 hour
  - Alert on-call

SEV3 (Medium): Minor issues, workarounds available
  - Response time: <4 hours
  - Create ticket

SEV4 (Low): Cosmetic issues, no user impact
  - Response time: Next business day
  - Backlog
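
The severity levels above can also be kept as typed configuration so that alerting or ticketing code reads from a single source. This is a sketch only: the type and constant names are hypothetical, response times are expressed in minutes, and "next business day" is modeled as `null` because it is not a fixed duration.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface SeverityPolicy {
  description: string;
  responseMinutes: number | null; // null = next business day
  action: string;
}

const SEVERITY_POLICIES: Record<Severity, SeverityPolicy> = {
  SEV1: { description: "Complete service outage, data loss", responseMinutes: 15, action: "Page on-call immediately" },
  SEV2: { description: "Partial outage, degraded performance", responseMinutes: 60, action: "Alert on-call" },
  SEV3: { description: "Minor issues, workarounds available", responseMinutes: 240, action: "Create ticket" },
  SEV4: { description: "Cosmetic issues, no user impact", responseMinutes: null, action: "Backlog" },
};

// True once the response window for this severity has elapsed.
function isResponseOverdue(severity: Severity, minutesSinceAlert: number): boolean {
  const limit = SEVERITY_POLICIES[severity].responseMinutes;
  return limit !== null && minutesSinceAlert >= limit;
}
```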

Incident Response Steps

  1. Acknowledge: Confirm receipt within SLA
  2. Assess: Determine severity and impact
  3. Communicate: Update status page
  4. Mitigate: Stop the bleeding (rollback, scale, disable)
  5. Resolve: Fix root cause
  6. Document: Write postmortem

Best Practices

  1. Design for failure: Assume components will fail
  2. Fail fast: Don't let slow failures cascade
  3. Isolate failures: Bulkhead pattern
  4. Graceful degradation: Reduce functionality, don't crash
  5. Monitor SLOs: Track error budgets
  6. Test failure modes: Chaos engineering
  7. Document runbooks: Clear incident response

Output Checklist

  • Circuit breakers implemented
  • Retry logic with backoff
  • Fallback mechanisms
  • Bulkhead isolation
  • SLOs defined (availability, latency, errors)
  • Error budgets calculated
  • Failure mode analysis
  • Monitoring dashboard
  • Incident response plan
  • Runbooks documented