SRE Reliability Engineering

Building reliable and scalable distributed systems.

Service Level Objectives (SLOs)

Defining SLOs

SLI: Availability = successful requests / total requests
SLO: 99.9% availability (measured over 30 days)
Error Budget: 0.1% = 43 minutes downtime per month
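
The 43-minute figure comes from multiplying the allowed failure fraction by the length of the measurement window. A minimal sketch of that arithmetic, assuming a hypothetical helper named `errorBudgetMinutes` (an illustrative name, not part of any library):

```javascript
// Hypothetical helper: minutes of total downtime an availability SLO allows
// within one measurement window.
function errorBudgetMinutes(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return (1 - sloTarget) * windowMinutes;
}

console.log(errorBudgetMinutes(0.999).toFixed(1)); // prints "43.2"
```

Strictly, the budget is a fraction of requests rather than of wall-clock time; the minutes figure is the equivalent if the service were fully down.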

SLO Document Template

API Service SLO

Availability SLO

Target: 99.9% of requests succeed (measured over 30 days)
SLI Definition:
  • Success: HTTP 200-399 responses
  • Failure: HTTP 500-599 responses, timeouts
  • Excluded: HTTP 400-499 (client errors)
Measurement:
prometheus
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total{status!~"4.."}[30d]))
Error Budget: 0.1% = ~43 minutes/month
Consequences:
  • Budget remaining > 0: Ship features fast
  • Budget at 50%: Increase caution
  • Budget exhausted: Feature freeze, focus on reliability
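
The SLI definition in the template can also be checked outside Prometheus. The sketch below mirrors the measurement query: 4xx responses are dropped from both sides, and 2xx/3xx responses count as successes. The function name `availability` is hypothetical, and it assumes timeouts have already been recorded as a 5xx status (for example 504) by the caller.

```javascript
// Hypothetical helper: availability SLI from a list of HTTP status codes.
// Assumption: timeouts are recorded as a 5xx status before reaching here.
function availability(statusCodes) {
  let good = 0;
  let total = 0;
  for (const status of statusCodes) {
    if (status >= 400 && status <= 499) continue; // client errors are excluded
    total++;
    if (status >= 200 && status <= 399) good++;   // success: 2xx and 3xx
  }
  // With no eligible requests, treat the SLI as fully met.
  return total === 0 ? 1 : good / total;
}

console.log(availability([200, 200, 404, 500])); // 2 good out of 3 eligible
```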

Error Budgets

Tracking

prometheus
# Error budget remaining
error_budget_remaining = 1 - ((1 - current_sli) / (1 - slo_target))

# Example: 99.9% SLO, currently at 99.95%
# Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999))
#                        = 1 - (0.0005 / 0.001) = 0.5 (50% remaining)
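
The same formula is easy to evaluate in code, which helps when sanity-checking dashboard numbers. This is a minimal sketch; the function name `errorBudgetRemaining` is an assumption for illustration.

```javascript
// Hypothetical helper: fraction of the error budget still unspent.
// Returns a negative number once the SLI has dropped below the SLO target.
function errorBudgetRemaining(currentSli, sloTarget) {
  const budget = 1 - sloTarget;
  if (budget <= 0) {
    throw new RangeError('sloTarget must be below 1 (100% leaves no budget)');
  }
  return 1 - (1 - currentSli) / budget;
}

// 99.9% SLO, currently at 99.95%: about half the budget is left.
console.log(errorBudgetRemaining(0.9995, 0.999).toFixed(2));
```

A negative result means the budget is overspent, which the policy treats the same as 0%.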

Burn Rate

prometheus
# How fast are we consuming error budget?
error_budget_burn_rate = (1 - current_sli_1h) / (1 - slo_target)

# Alert if burning budget 10x faster than sustainable
- alert: FastErrorBudgetBurn
  expr: error_budget_burn_rate > 10
  for: 1h
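
A burn rate of 1 spends the budget exactly over the SLO window, so a sustained rate of 10 would exhaust a full 30-day budget in about 3 days. The sketch below works through that arithmetic; `burnRate` and `hoursUntilExhausted` are hypothetical names, and the second function assumes the budget starts full and the rate stays constant.

```javascript
// Hypothetical helper: how many times faster than sustainable the budget burns.
function burnRate(currentSli, sloTarget) {
  return (1 - currentSli) / (1 - sloTarget);
}

// Hypothetical helper: hours until a full budget is gone at a constant rate.
function hoursUntilExhausted(rate, windowDays = 30) {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

// 99.9% SLO with a 1-hour SLI of 99%: burning roughly 10x too fast.
console.log(burnRate(0.99, 0.999).toFixed(1));
console.log(hoursUntilExhausted(10)); // prints 72
```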

Policy

Error Budget > 75%: Ship aggressively
Error Budget 25-75%: Normal velocity
Error Budget < 25%: Slow down, increase testing
Error Budget = 0%: Feature freeze, reliability only
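
The policy tiers can be encoded as a small lookup so that release tooling applies them consistently. This is a sketch only: the function name `releasePolicy` is hypothetical, and the handling of the exact boundaries (25% and 75% fall into "Normal velocity", anything at or below 0% is a freeze) is an assumption the list above does not spell out.

```javascript
// Hypothetical helper: map remaining error budget (as a fraction) to a policy.
// Boundary handling is an assumption; adjust to match your team's agreement.
function releasePolicy(remaining) {
  if (remaining <= 0) return 'Feature freeze, reliability only';
  if (remaining < 0.25) return 'Slow down, increase testing';
  if (remaining <= 0.75) return 'Normal velocity';
  return 'Ship aggressively';
}

console.log(releasePolicy(0.5)); // prints "Normal velocity"
```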

Reliability Patterns

Circuit Breaker

javascript
class CircuitBreaker {
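  constructor({ threshold = 5, timeout = 60000 } = {}) {
    this.state = 'CLOSED';      // CLOSED -> OPEN -> HALF_OPEN -> CLOSED
    this.failures = 0;          // consecutive failures since the last success
    this.threshold = threshold; // failures that trip the breaker
    this.timeout = timeout;     // ms to stay OPEN before allowing a trial call
    this.openedAt = null;       // timestamp of the most recent trip
  }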
  
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt > this.timeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }
  
  onFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}

Retry with Exponential Backoff

javascript
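// Resolve after `ms` milliseconds (helper used for the backoff wait).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Note: maxRetries is the total number of attempts, including the first call.
async function retryWithBackoff(fn, maxRetries = 3) {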
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      
      const delay = Math.min(1000 * Math.pow(2, i), 10000);
      const jitter = Math.random() * 1000;
      
      await sleep(delay + jitter);
    }
  }
}
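
To see what the retry loop actually waits between attempts, the base delays can be computed without the random jitter. The helper below is a sketch with a hypothetical name, `backoffDelays`; it reuses the 1,000 ms base and 10,000 ms cap from the function above and produces one delay per gap between attempts.

```javascript
// Hypothetical helper: base delays (ms, before jitter) between attempts.
// There is one fewer delay than attempts, because the last failure is rethrown.
function backoffDelays(maxRetries = 3, baseMs = 1000, capMs = 10000) {
  const delays = [];
  for (let i = 0; i < maxRetries - 1; i++) {
    delays.push(Math.min(baseMs * Math.pow(2, i), capMs));
  }
  return delays;
}

console.log(backoffDelays(3)); // [ 1000, 2000 ]
console.log(backoffDelays(6)); // [ 1000, 2000, 4000, 8000, 10000 ]
```

Each real wait adds up to 1,000 ms of jitter on top of these values, which spreads out retries from clients that failed at the same moment.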

Rate Limiting

javascript
class TokenBucket {
  constructor({ capacity, refillRate }) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate;
    this.lastRefill = Date.now();
  }
  
  tryConsume(tokens = 1) {
    this.refill();
    
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }
    return false;
  }
  
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * this.refillRate;
    
    this.tokens = Math.min(
      this.capacity,
      this.tokens + tokensToAdd
    );
    this.lastRefill = now;
  }
}

Bulkhead

javascript
class Bulkhead {
  constructor({ maxConcurrent }) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.queue = [];
  }
  
  async execute(fn) {
    while (this.current >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    
    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
      if (this.queue.length > 0) {
        const resolve = this.queue.shift();
        resolve();
      }
    }
  }
}

Graceful Degradation

javascript
async function getRecommendations(userId) {
  try {
    // Try personalized recommendations
    return await recommendationService.getPersonalized(userId, {
      timeout: 500, // Fail fast
    });
  } catch (error) {
    logger.warn('Personalized recommendations failed, falling back', {
      userId,
      error: error.message,
    });
    
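    try {
      // Fall back to popular items. Assuming cache.get resolves to
      // null/undefined on a miss, use the static defaults in that case too.
      const popular = await cache.get('popular_items');
      return popular ?? DEFAULT_RECOMMENDATIONS;
    } catch (fallbackError) {
      // Final fallback
      return DEFAULT_RECOMMENDATIONS;
    }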
  }
}

Capacity Planning

Utilization Tracking

prometheus
# Current utilization
current_utilization = sum(rate(http_requests_total[5m])) / capacity_requests_per_second

# Alert when approaching capacity
- alert: HighUtilization
  expr: current_utilization > 0.80
  for: 10m

Growth Projection

Current QPS: 1,000
Growth rate: 20% per month
Capacity per instance: 100 QPS
Current instances: 12

In 6 months:
Projected QPS: 1,000 * (1.20)^6 = 2,986
Instances needed: 2,986 / 100 = 29.86, rounded up to 30 (18 more than today)
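
The projection is compound growth followed by a ceiling division. A minimal sketch, assuming a hypothetical helper `projectCapacity` and no extra headroom for failover or traffic spikes:

```javascript
// Hypothetical helper: project QPS under compound monthly growth and the
// number of instances needed to serve it (no headroom included).
function projectCapacity({ currentQps, monthlyGrowth, months, qpsPerInstance }) {
  const projectedQps = currentQps * Math.pow(1 + monthlyGrowth, months);
  return {
    projectedQps,
    instancesNeeded: Math.ceil(projectedQps / qpsPerInstance),
  };
}

const plan = projectCapacity({
  currentQps: 1000,
  monthlyGrowth: 0.2,
  months: 6,
  qpsPerInstance: 100,
});
console.log(Math.round(plan.projectedQps), plan.instancesNeeded); // 2986 30
```

In practice the instance count is usually padded so the fleet stays below the 80% utilization alert threshold rather than running at 100%.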

Load Testing

javascript
// k6 load test
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up
    { duration: '5m', target: 100 },   // Steady state
    { duration: '2m', target: 200 },   // Spike
    { duration: '5m', target: 200 },   // Higher steady
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95% under 500ms
    http_req_failed: ['rate<0.01'],     // Less than 1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/endpoint');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

Chaos Engineering

Fault Injection

javascript
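// Resolve after `ms` milliseconds (helper for the latency injector).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inject latency: delay a fraction of calls before running the wrapped function
function withLatencyInjection(fn, { probability = 0.1, delayMs = 1000 } = {}) {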
  return async (...args) => {
    if (Math.random() < probability) {
      await sleep(delayMs);
    }
    return fn(...args);
  };
}

// Inject failures
function withFailureInjection(fn, { probability = 0.05 } = {}) {
  return async (...args) => {
    if (Math.random() < probability) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}

Best Practices

Design for Failure

  • Assume all dependencies can fail
  • Have fallback options
  • Fail fast and timeout quickly
  • Implement retries with backoff

Measure User Impact

  • SLOs should reflect user experience
  • Don't alert on internal metrics alone
  • Track real user monitoring (RUM)

Balance Velocity and Reliability

  • Use error budgets to make decisions
  • Don't target 100% reliability
  • Spend error budget on innovation

Automate Everything

  • Automate deployments
  • Automate rollbacks
  • Automate capacity scaling
  • Automate incident response