Reliability Strategy Builder

Build resilient systems with proper failure handling and SLOs.

Reliability Patterns

1. Circuit Breaker

Prevent cascading failures by stopping requests to failing services.

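```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  // Failures (without an intervening success) that trip the breaker,
  // and how long it stays open before a trial request is allowed.
  private readonly failureThreshold = 5;
  private readonly resetTimeoutMs = 60000; // 1 minute

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        // Let a trial request through to probe the service.
        this.state = "half-open";
      } else {
        // Fail fast without calling the failing service.
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    // Any success closes the breaker and clears the failure count.
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();

    // A failed trial request in half-open also lands here: the count is
    // still at or above the threshold, so the breaker reopens.
    if (this.failureCount >= this.failureThreshold) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return elapsed > this.resetTimeoutMs;
  }
}
```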
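
The state changes in the class above can also be read as a small state machine. The following is a minimal sketch of that reading, not part of the original class: the `Snapshot` type, the `transition` function, and the `"timeout-elapsed"` event name are hypothetical names introduced here for illustration, and the threshold of 5 mirrors the value used above.

```typescript
type BreakerState = "closed" | "open" | "half-open";
type BreakerEvent = "success" | "failure" | "timeout-elapsed";

// Hypothetical snapshot of the breaker: its state plus the failure count.
interface Snapshot {
  state: BreakerState;
  failures: number;
}

const FAILURE_THRESHOLD = 5; // same threshold as the class above

// Pure transition function: returns the next snapshot for a given event.
function transition(s: Snapshot, event: BreakerEvent): Snapshot {
  switch (event) {
    case "success":
      // Calls are rejected while open, so a success cannot occur there.
      return s.state === "open" ? s : { state: "closed", failures: 0 };
    case "failure": {
      if (s.state === "open") return s;
      const failures = s.failures + 1;
      // A failed trial request in half-open reopens the breaker at once.
      const tripped = s.state === "half-open" || failures >= FAILURE_THRESHOLD;
      return { state: tripped ? "open" : "closed", failures };
    }
    case "timeout-elapsed":
      // Only an open breaker reacts to the reset timeout.
      return s.state === "open" ? { state: "half-open", failures: s.failures } : s;
  }
}
```

Writing the transitions as a pure function makes them easy to unit test without timers or real network calls.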

2. Retry with Backoff

Handle transient failures with exponential backoff.

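```typescript
// Resolve after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000 // milliseconds
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of retries: surface the last error to the caller.
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s with the default arguments
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  // Only reached if maxRetries is negative and the loop never runs.
  throw new Error("Max retries exceeded");
}
```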
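
When many clients retry on the same schedule, their retries can arrive in synchronized bursts. One common refinement, which is an addition here and not part of the function above, is to cap the delay and randomize it ("full jitter"). The sketch below assumes a hypothetical helper named `backoffDelay`; the 30-second cap and the injectable `random` parameter are illustrative choices.

```typescript
// Compute a capped, jittered delay in milliseconds for a retry attempt.
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 30000,
  random: () => number = Math.random
): number {
  // Exponential growth, capped so late attempts do not wait unboundedly.
  const exponential = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
  // Full jitter: pick a delay uniformly in [0, exponential).
  return Math.floor(random() * exponential);
}
```

Inside a retry loop, `await sleep(backoffDelay(attempt))` would take the place of the fixed exponential delay.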

3. Fallback Pattern

Provide degraded functionality when the primary dependency fails.

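```typescript
// `User`, `primaryDb`, `cache`, and `logger` are assumed to be defined elsewhere.
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");

    // Fallback to cache; a cache failure must not mask the final fallback
    try {
      const cached = await cache.get(`user:${userId}`);
      if (cached) return cached;
    } catch (cacheError) {
      logger.warn("Cache lookup failed, using placeholder user");
    }

    // Final fallback: return minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}
```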

4. Bulkhead Pattern

Isolate failures to prevent resource exhaustion.

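```typescript
// `Semaphore` is assumed to expose acquire(): Promise<void> and release(): void.
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools so one class of work cannot starve the others
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    if (!pool) {
      throw new Error(`Unknown pool: ${priority}`);
    }
    // Wait for a free slot in this pool only.
    await pool.acquire();

    try {
      return await operation();
    } finally {
      // Always free the slot, even if the operation throws.
      pool.release();
    }
  }
}
```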
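
The class above relies on a `Semaphore` that is not defined in this document. A minimal sketch of one possible implementation follows, assuming only the `acquire()` and `release()` methods used above; the `available` and `waiting` getters are extras added here for inspection.

```typescript
// Counting semaphore for async code: at most `permits` holders at a time.
class Semaphore {
  private permits: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    if (!Number.isInteger(permits) || permits < 1) {
      throw new RangeError("permits must be a positive integer");
    }
    this.permits = permits;
  }

  // Free permits right now.
  get available(): number {
    return this.permits;
  }

  // Callers currently queued for a permit.
  get waiting(): number {
    return this.waiters.length;
  }

  // Resolves immediately if a permit is free, otherwise queues the caller.
  acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hands the permit to the oldest waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.permits++;
    }
  }
}
```

Waiters are served in first-in, first-out order, so a burst of background work queues behind itself without touching the permits of other pools.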

SLO Definitions

SLO Template

```yaml
service: user-api
slos:
  - name: Availability
    description: 99.9% of requests are served without a server error (5xx)
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
    window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
    window: 7 days

  - name: Error Rate
    description: Less than 1% of requests result in errors
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
    window: 24 hours
```
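
To make the two `ratio` measurements in the template concrete, here is a minimal sketch that evaluates them over a list of HTTP status codes. The function and constant names are hypothetical, and treating an empty window as fully compliant is an assumption made here, not something the template specifies.

```typescript
type SuccessPredicate = (statusCode: number) => boolean;

// Availability: success is status_code < 500.
const availabilitySuccess: SuccessPredicate = (code) => code < 500;

// Error Rate: success is status_code < 400 OR status_code IN [401, 403, 404].
const EXPECTED_CLIENT_ERRORS = new Set([401, 403, 404]);
const errorRateSuccess: SuccessPredicate = (code) =>
  code < 400 || EXPECTED_CLIENT_ERRORS.has(code);

// Fraction of requests in the window that count as successful.
function successRatio(statusCodes: number[], isSuccess: SuccessPredicate): number {
  if (statusCodes.length === 0) return 1; // assumption: no traffic, no violation
  const ok = statusCodes.filter(isSuccess).length;
  return ok / statusCodes.length;
}

// Compare a measured ratio (0..1) against a target given in percent.
function meetsTarget(ratio: number, targetPercent: number): boolean {
  return ratio * 100 >= targetPercent;
}
```

Note that the two predicates disagree on some responses: a 429, for example, counts as a success for Availability but as a failure for Error Rate.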

Error Budget

Error Budget = 100% - SLO

Example:
SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes of downtime allowed per 30-day month
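
The arithmetic above can be sketched as a pair of small helpers. This is an illustrative sketch: the function names are hypothetical, and the request-based variant assumes the budget is counted in failed requests, not in minutes of downtime.

```typescript
// Allowed downtime in minutes for a time-based SLO over a window of days.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (sloPercent <= 0 || sloPercent > 100) {
    throw new RangeError("sloPercent must be in (0, 100]");
  }
  const budgetFraction = (100 - sloPercent) / 100;
  const windowMinutes = windowDays * 24 * 60;
  return budgetFraction * windowMinutes;
}

// Fraction of a request-based error budget still unspent (negative if overspent).
function budgetRemaining(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number
): number {
  const allowedFailures = totalRequests * ((100 - sloPercent) / 100);
  if (allowedFailures === 0) return failedRequests === 0 ? 1 : -Infinity;
  return 1 - failedRequests / allowedFailures;
}

// 99.9% over 30 days: 0.001 * 43,200 minutes, about 43.2 minutes.
console.log(errorBudgetMinutes(99.9, 30).toFixed(1));
```

With 1,000,000 requests and a 99.9% SLO, about 1,000 failures are allowed, so 250 failures leave roughly 75% of the budget.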

Failure Mode Analysis

| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |

Reliability Checklist

Infrastructure

Application

Monitoring

Operations

Incident Response Plan

Severity Levels

SEV1 (Critical): Complete service outage, data loss
  - Response time: <15 minutes
  - Page on-call immediately

SEV2 (High): Partial outage, degraded performance
  - Response time: <1 hour
  - Alert on-call

SEV3 (Medium): Minor issues, workarounds available
  - Response time: <4 hours
  - Create ticket

SEV4 (Low): Cosmetic issues, no user impact
  - Response time: Next business day
  - Backlog
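
The severity levels above lend themselves to a lookup table that alerting or ticketing code can consult. The sketch below is one way to encode them; the type and function names are hypothetical, and SEV4 is left without a numeric deadline because "next business day" depends on a business calendar that is not described here.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface SeverityPolicy {
  summary: string;
  // Response time in minutes; null means "next business day".
  responseMinutes: number | null;
  action: string;
}

const SEVERITY_POLICIES: Record<Severity, SeverityPolicy> = {
  SEV1: { summary: "Complete service outage, data loss", responseMinutes: 15, action: "Page on-call immediately" },
  SEV2: { summary: "Partial outage, degraded performance", responseMinutes: 60, action: "Alert on-call" },
  SEV3: { summary: "Minor issues, workarounds available", responseMinutes: 240, action: "Create ticket" },
  SEV4: { summary: "Cosmetic issues, no user impact", responseMinutes: null, action: "Backlog" },
};

// Latest time by which the incident should be acknowledged,
// or null when the level has no fixed deadline in minutes.
function responseDeadline(severity: Severity, openedAt: Date): Date | null {
  const { responseMinutes } = SEVERITY_POLICIES[severity];
  if (responseMinutes === null) return null;
  return new Date(openedAt.getTime() + responseMinutes * 60 * 1000);
}
```

Keeping the policy in one table means the paging rules and the documentation cannot drift apart silently.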

Incident Response Steps

1. Acknowledge: Confirm receipt within SLA
2. Assess: Determine severity and impact
3. Communicate: Update status page
4. Mitigate: Stop the bleeding (rollback, scale, disable)
5. Resolve: Fix root cause
6. Document: Write postmortem

Best Practices

Design for failure: Assume components will fail
Fail fast: Don't let slow failures cascade
Isolate failures: Bulkhead pattern
Graceful degradation: Reduce functionality, don't crash
Monitor SLOs: Track error budgets
Test failure modes: Chaos engineering
Document runbooks: Clear incident response
