SRE Reliability Engineering

Building reliable and scalable distributed systems.

Service Level Objectives (SLOs)

Defining SLOs

SLI: Availability = successful requests / total requests
SLO: 99.9% availability (measured over 30 days)
Error Budget: 0.1% = 43 minutes downtime per month
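
The 43-minute figure comes from multiplying the allowed failure fraction by the length of the measurement window. A minimal sketch of that arithmetic follows; the function name `errorBudgetMinutes` is an illustrative assumption, not part of any library.

```javascript
// Hypothetical helper: downtime allowed by an availability SLO over a window.
function errorBudgetMinutes(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return (1 - sloTarget) * windowMinutes;
}

console.log(errorBudgetMinutes(0.999).toFixed(1));  // 99.9% over 30 days
console.log(errorBudgetMinutes(0.9999).toFixed(1)); // 99.99% over 30 days
```

For a 99.9% target over 30 days this gives about 43.2 minutes, which is where the rounded "43 minutes" comes from.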

SLO Document Template

API Service SLO

Availability SLO

Target: 99.9% of requests succeed (measured over 30 days)

SLI Definition:

Success: HTTP 200-399 responses
Failure: HTTP 500-599 responses, timeouts
Excluded: HTTP 400-499 (client errors)

Measurement:

prometheus

sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total{status!~"4.."}[30d]))

Error Budget: 0.1% = ~43 minutes/month

Consequences:

Budget remaining > 0: Ship features fast
Budget at 50%: Increase caution
Budget exhausted: Feature freeze, focus on reliability
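
The SLI definition in this template can also be applied to individual request records, for example when computing availability from access logs instead of Prometheus. The sketch below assumes each record has a numeric `status` and an optional `timedOut` flag; both field names and the function names are hypothetical.

```javascript
// Hypothetical classifier following the SLI definition in the template.
function classifyResponse({ status, timedOut = false }) {
  if (timedOut) return 'failure';
  if (status >= 200 && status <= 399) return 'success';
  if (status >= 500 && status <= 599) return 'failure';
  // 4xx client errors are excluded; as an assumption, so is anything else (e.g. 1xx)
  return 'excluded';
}

// Availability = successes / (successes + failures); excluded requests are ignored.
function availability(responses) {
  let success = 0;
  let failure = 0;
  for (const response of responses) {
    const kind = classifyResponse(response);
    if (kind === 'success') success++;
    else if (kind === 'failure') failure++;
  }
  const counted = success + failure;
  // Assumption: a window with no counted traffic is treated as fully available
  return counted === 0 ? 1 : success / counted;
}

const sample = [
  { status: 200 }, { status: 302 }, { status: 404 },
  { status: 503 }, { status: 0, timedOut: true },
];
console.log(availability(sample)); // 2 successes out of 4 counted requests
```

The 404 in the sample does not move the result in either direction, which is the point of excluding client errors from the denominator.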

Error Budgets

Tracking

prometheus

# Error budget remaining
error_budget_remaining = 1 - ((1 - current_sli) / (1 - slo_target))

# Example: 99.9% SLO, currently at 99.95%
# Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999))
#                        = 1 - (0.0005 / 0.001) = 0.5 (50% remaining)
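
The same formula can be checked outside Prometheus. This is a small sketch, assuming the SLI and the SLO target are both expressed as fractions between 0 and 1; the function name is illustrative.

```javascript
// Hypothetical helper: fraction of the error budget still unspent.
// 1 means untouched, 0 means exhausted, negative means overspent.
function errorBudgetRemaining(currentSli, sloTarget) {
  return 1 - (1 - currentSli) / (1 - sloTarget);
}

console.log(errorBudgetRemaining(0.9995, 0.999).toFixed(2)); // the worked example
console.log(errorBudgetRemaining(0.998, 0.999).toFixed(2));  // SLI below target
```

An SLI below the target produces a negative value, meaning more than the whole budget has been spent in the window.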

Burn Rate

prometheus

# How fast are we consuming error budget?
error_budget_burn_rate = (1 - current_sli_1h) / (1 - slo_target)

# Alert if burning budget 10x faster than sustainable
alert: FastErrorBudgetBurn
expr: error_budget_burn_rate > 10
for: 1h
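
A burn rate of 1 means the budget would last exactly the SLO window; higher values shorten that proportionally. The sketch below turns a burn rate into a time-to-exhaustion estimate, assuming the budget is full at the start and the rate stays constant. The function names are hypothetical.

```javascript
// Hypothetical helpers mirroring the burn-rate expression.
function burnRate(currentSli, sloTarget) {
  return (1 - currentSli) / (1 - sloTarget);
}

// Hours until a full budget is spent at a constant burn rate.
function hoursUntilExhausted(rate, windowDays = 30) {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

const rate = burnRate(0.99, 0.999); // 1% errors against a 99.9% target
console.log(rate.toFixed(1));       // roughly 10x the sustainable rate
console.log(hoursUntilExhausted(10)); // hours left at exactly 10x
```

At the alert threshold of 10x, a 30-day budget would be gone in 72 hours.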

Policy

Error Budget > 75%: Ship aggressively
Error Budget 25-75%: Normal velocity
Error Budget < 25%: Slow down, increase testing
Error Budget = 0%: Feature freeze, reliability only
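
These thresholds can be encoded so that a deploy pipeline or dashboard reports the current policy. A minimal sketch, assuming the remaining budget is passed as a fraction (0.5 for 50%) and that an overspent (negative) budget is treated the same as 0%; the function name is hypothetical.

```javascript
// Hypothetical mapping from remaining error budget to the policy tiers.
function releasePolicy(budgetRemaining) {
  if (budgetRemaining <= 0) return 'Feature freeze, reliability only';
  if (budgetRemaining < 0.25) return 'Slow down, increase testing';
  if (budgetRemaining <= 0.75) return 'Normal velocity';
  return 'Ship aggressively';
}

for (const remaining of [0.9, 0.5, 0.1, 0]) {
  console.log(`${remaining * 100}% remaining: ${releasePolicy(remaining)}`);
}
```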

Reliability Patterns

Circuit Breaker

javascript

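class CircuitBreaker {
  // threshold: consecutive failures before the circuit opens
  // timeout: milliseconds to stay OPEN before allowing trial calls
  constructor({ threshold = 5, timeout = 60000 } = {}) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt > this.timeout) {
        // Cool-down elapsed: allow trial calls through
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    // Any success resets the failure count and closes the circuit
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    // A failed trial call in HALF_OPEN reopens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}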

Retry with Exponential Backoff

javascript

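// Promise-based sleep helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// maxRetries is the total number of attempts, including the first
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error
      if (i === maxRetries - 1) throw error;

      // Exponential backoff (1s, 2s, 4s, ...) capped at 10s
      const delay = Math.min(1000 * Math.pow(2, i), 10000);
      // Random jitter of up to 1s spreads out simultaneous retries
      const jitter = Math.random() * 1000;

      await sleep(delay + jitter);
    }
  }
}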

Rate Limiting

javascript

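class TokenBucket {
  // capacity: maximum tokens the bucket holds (burst size)
  // refillRate: tokens added per second
  constructor({ capacity, refillRate }) {
    this.capacity = capacity;
    this.tokens = capacity; // start full
    this.refillRate = refillRate;
    this.lastRefill = Date.now();
  }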
  
  tryConsume(tokens = 1) {
    this.refill();
    
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }
    return false;
  }
  
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * this.refillRate;
    
    this.tokens = Math.min(
      this.capacity,
      this.tokens + tokensToAdd
    );
    this.lastRefill = now;
  }
}

Bulkhead

javascript

class Bulkhead {
  constructor({ maxConcurrent }) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.queue = [];
  }
  
  async execute(fn) {
    while (this.current >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    
    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
      if (this.queue.length > 0) {
        const resolve = this.queue.shift();
        resolve();
      }
    }
  }
}

Graceful Degradation

javascript

async function getRecommendations(userId) {
  try {
    // Try personalized recommendations
    return await recommendationService.getPersonalized(userId, {
      timeout: 500, // Fail fast
    });
  } catch (error) {
    logger.warn('Personalized recommendations failed, falling back', {
      userId,
      error: error.message,
    });
    
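    try {
      // Fall back to popular items; a cache miss may resolve to null
      // rather than throw, so check the value before returning it
      const popular = await cache.get('popular_items');
      if (popular) return popular;
    } catch (fallbackError) {
      logger.warn('Popular items fallback failed', {
        error: fallbackError.message,
      });
    }
    // Final fallback: static defaults
    return DEFAULT_RECOMMENDATIONS;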
  }
}

Capacity Planning

Utilization Tracking

prometheus

# Current utilization
current_utilization = sum(rate(http_requests_total[5m])) / capacity_requests_per_second

# Alert when approaching capacity
alert: HighUtilization
expr: current_utilization > 0.80
for: 10m

Growth Projection

Current QPS: 1,000
Growth rate: 20% per month
Capacity per instance: 100 QPS
Current instances: 12

In 6 months:
Projected QPS: 1,000 * (1.20)^6 = 2,986
Instances needed: 2,986 / 100 = 29.86, rounded up to 30
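
The projection is compound growth followed by a round-up to whole instances. A minimal sketch of the calculation follows; the function and parameter names are hypothetical, and `targetUtilization` is an added assumption for sizing with headroom (it defaults to 1, which reproduces the figures above).

```javascript
// Hypothetical helper: project QPS with compound monthly growth and size the fleet.
function projectCapacity({
  currentQps,
  monthlyGrowth,
  months,
  qpsPerInstance,
  targetUtilization = 1, // assumption: 1 means instances run at full rated capacity
}) {
  const projectedQps = currentQps * Math.pow(1 + monthlyGrowth, months);
  const usableQpsPerInstance = qpsPerInstance * targetUtilization;
  const instancesNeeded = Math.ceil(projectedQps / usableQpsPerInstance);
  return { projectedQps, instancesNeeded };
}

const inputs = { currentQps: 1000, monthlyGrowth: 0.2, months: 6, qpsPerInstance: 100 };
const plan = projectCapacity(inputs);
console.log(Math.round(plan.projectedQps), plan.instancesNeeded);

// Same projection, sized to stay at or below 80% utilization
const withHeadroom = projectCapacity({ ...inputs, targetUtilization: 0.8 });
console.log(withHeadroom.instancesNeeded);
```

Sizing at full capacity leaves no headroom. If the fleet should stay under the 80% utilization alert threshold, the same projection calls for 38 instances rather than 30.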

Load Testing

javascript

// k6 load test
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up
    { duration: '5m', target: 100 },   // Steady state
    { duration: '2m', target: 200 },   // Spike
    { duration: '5m', target: 200 },   // Higher steady
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    http_req_failed: ['rate<0.01'],     // Less than 1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/endpoint');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

Chaos Engineering

Fault Injection

javascript

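// Promise-based sleep helper used by the latency injector
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inject latency: delay a fraction of calls by delayMs before running fn
function withLatencyInjection(fn, { probability = 0.1, delayMs = 1000 } = {}) {
  return async (...args) => {
    if (Math.random() < probability) {
      await sleep(delayMs);
    }
    return fn(...args);
  };
}

// Inject failures: make a fraction of calls throw instead of running fn
function withFailureInjection(fn, { probability = 0.05 } = {}) {
  return async (...args) => {
    if (Math.random() < probability) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}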

Best Practices

Design for Failure

Assume all dependencies can fail
Have fallback options
Fail fast and timeout quickly
Implement retries with backoff
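
"Fail fast and timeout quickly" can be applied to any promise-returning call by racing it against a timer. The sketch below is one way to do that under simple assumptions: `withTimeout` is a hypothetical helper name, and the underlying operation is not cancelled when the timer wins; its result is only ignored.

```javascript
// Hypothetical helper: reject if the promise does not settle within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a slow dependency (simulated with a 200 ms timer) against a 50 ms limit
const slowCall = new Promise((resolve) => setTimeout(() => resolve('late'), 200));
withTimeout(slowCall, 50)
  .then((value) => console.log('result:', value))
  .catch((error) => console.log('fallback because:', error.message));
```

Where the client library supports real cancellation, passing a cancellation signal to it is preferable, since that also frees the resources held by the abandoned call.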

Measure User Impact

SLOs should reflect user experience
Don't alert on internal metrics alone
Track real user monitoring (RUM)

Balance Velocity and Reliability

Use error budgets to make decisions
Don't target 100% reliability
Spend error budget on innovation

Automate Everything

Automate deployments
Automate rollbacks
Automate capacity scaling
Automate incident response
