SRE Reliability Engineering
Building reliable and scalable distributed systems.
Service Level Objectives (SLOs)
Defining SLOs
SLI: Availability = successful requests / total requests
SLO: 99.9% availability (measured over 30 days)
Error Budget: 0.1% = 43 minutes downtime per month
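The 43-minute figure follows directly from the SLO target and the 30-day window. A minimal sketch of that arithmetic, where the function name `errorBudgetMinutes` is an illustrative assumption and not part of the original:

```javascript
// Convert an availability SLO into an allowed-downtime budget.
// sloTarget is a fraction (0.999 for 99.9%); windowDays is the SLO window.
// The function name is hypothetical, chosen for this example only.
function errorBudgetMinutes(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // 43,200 minutes in 30 days
  return (1 - sloTarget) * windowMinutes;
}

console.log(errorBudgetMinutes(0.999).toFixed(1));  // 43.2
console.log(errorBudgetMinutes(0.9999).toFixed(1)); // 4.3
```

Each extra nine cuts the budget by a factor of ten, which is one reason the target should be chosen deliberately.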
SLO Document Template
````markdown
API Service SLO

Availability SLO

Target: 99.9% of requests succeed (measured over 30 days)

SLI Definition:
- Success: HTTP 200-399 responses
- Failure: HTTP 500-599 responses, timeouts
- Excluded: HTTP 400-499 (client errors)

Measurement:
```prometheus
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total{status!~"4.."}[30d]))
```

Error Budget: 0.1% = ~43 minutes/month

Consequences:
- Budget remaining > 0: Ship features fast
- Budget at 50%: Increase caution
- Budget exhausted: Feature freeze, focus on reliability
````
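The SLI definition in the template can be sketched in application code as well. This is a rough illustration only: the function name, the shape of the `counts` object, and the choice to return 1 when there is no eligible traffic are all assumptions, not part of the original.

```javascript
// Availability SLI from request counts grouped by outcome.
// Hypothetical input shape: { success, clientError, serverError, timeout }.
function availabilitySli(counts) {
  // 4xx responses (counts.clientError) are excluded from the calculation entirely.
  const eligible = counts.success + counts.serverError + counts.timeout;
  // Assumption: with no eligible traffic, report the SLI as fully met.
  if (eligible === 0) return 1;
  return counts.success / eligible;
}

const sli = availabilitySli({
  success: 9990,    // HTTP 200-399
  clientError: 500, // HTTP 400-499, ignored
  serverError: 8,   // HTTP 500-599
  timeout: 2,
});
console.log(sli.toFixed(3)); // 0.999
```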
Error Budgets
Tracking
```prometheus
# Error budget remaining
error_budget_remaining = 1 - (
  (1 - current_sli) / (1 - slo_target)
)

# Example: 99.9% SLO, currently at 99.95%
# Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999))
#                        = 1 - (0.0005 / 0.001) = 0.5 (50% remaining)
```
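The same formula can be checked outside the monitoring system. A minimal sketch, where the function name is an illustrative assumption:

```javascript
// Fraction of the error budget still unspent, given the measured SLI
// and the SLO target (both as fractions). Negative means overspent.
function errorBudgetRemaining(currentSli, sloTarget) {
  return 1 - (1 - currentSli) / (1 - sloTarget);
}

console.log(errorBudgetRemaining(0.9995, 0.999).toFixed(2)); // 0.50
console.log(errorBudgetRemaining(0.998, 0.999).toFixed(2));  // -1.00
```

The second call shows that the value goes negative once the SLI drops below the target: at 99.8% against a 99.9% SLO, the budget has been spent twice over.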
Burn Rate
```prometheus
# How fast are we consuming error budget?
error_budget_burn_rate =
  (1 - current_sli_1h) / (1 - slo_target)

# Alert if burning budget 10x faster than sustainable
- alert: FastErrorBudgetBurn
  expr: error_budget_burn_rate > 10
  for: 1h
```
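A burn rate of 1 spends the budget exactly over the SLO window, so a burn rate of 10 spends a 30-day budget in 3 days. A short sketch of that reasoning, with function names that are assumptions for illustration:

```javascript
// Burn rate: observed error rate divided by the error rate the SLO allows.
function burnRate(currentSli, sloTarget) {
  return (1 - currentSli) / (1 - sloTarget);
}

// Hours until a full budget is gone if this burn rate were sustained.
// Assumes the budget starts full at the beginning of the window.
function hoursToExhaustion(rate, windowDays = 30) {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

console.log(burnRate(0.99, 0.999).toFixed(1)); // 10.0
console.log(hoursToExhaustion(10));            // 72
```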
Policy
Error Budget > 75%: Ship aggressively
Error Budget 25-75%: Normal velocity
Error Budget < 25%: Slow down, increase testing
Error Budget = 0%: Feature freeze, reliability only
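The policy table can be encoded as a small lookup, for example to annotate a dashboard or gate a deploy pipeline. This is a sketch: the function name is hypothetical, and treating an overspent (negative) budget the same as 0% is an assumption.

```javascript
// Map the remaining error budget (as a fraction) to the policy tiers above.
function releasePolicy(budgetRemaining) {
  if (budgetRemaining > 0.75) return 'Ship aggressively';
  if (budgetRemaining >= 0.25) return 'Normal velocity';
  if (budgetRemaining > 0) return 'Slow down, increase testing';
  // Assumption: an overspent budget is handled like an exhausted one.
  return 'Feature freeze, reliability only';
}

console.log(releasePolicy(0.5)); // Normal velocity
console.log(releasePolicy(0.1)); // Slow down, increase testing
console.log(releasePolicy(0));   // Feature freeze, reliability only
```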
Reliability Patterns
Circuit Breaker
```javascript
class CircuitBreaker {
  // threshold: consecutive failures before opening; timeout: ms to stay open.
  constructor({ threshold = 5, timeout = 60000 } = {}) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt > this.timeout) {
        // Cool-down elapsed: let a trial request through.
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    // A failed trial in HALF_OPEN reopens at once: the count is still at the threshold.
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}
```
Retry with Exponential Backoff
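```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// maxRetries is the total number of attempts, including the first call.
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error to the caller.
      if (i === maxRetries - 1) throw error;
      // Exponential delay (1s, 2s, 4s, ...) capped at 10s.
      const delay = Math.min(1000 * Math.pow(2, i), 10000);
      // Up to 1s of random jitter so clients do not retry in lockstep.
      const jitter = Math.random() * 1000;
      await sleep(delay + jitter);
    }
  }
}
```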
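To see what wait times the retry function produces, the delay formula can be pulled out on its own. A minimal sketch that omits the jitter so the output is deterministic; `backoffDelays` and its parameters are illustrative names, not part of the original:

```javascript
// Base delays (without jitter) slept between attempts, using the same
// formula as retryWithBackoff: min(baseMs * 2^i, capMs).
// With N total attempts there are N - 1 waits.
function backoffDelays(maxRetries = 3, baseMs = 1000, capMs = 10000) {
  const delays = [];
  for (let i = 0; i < maxRetries - 1; i++) {
    delays.push(Math.min(baseMs * Math.pow(2, i), capMs));
  }
  return delays;
}

console.log(backoffDelays(3)); // [ 1000, 2000 ]
console.log(backoffDelays(6)); // [ 1000, 2000, 4000, 8000, 10000 ]
```

With the default of three attempts the cap is never reached; it only comes into play from the fifth wait onward.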
Rate Limiting
```javascript
class TokenBucket {
  // capacity: maximum burst size; refillRate: tokens added per second.
  constructor({ capacity, refillRate }) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate;
    this.lastRefill = Date.now();
  }

  // Returns true and deducts the tokens if enough are available.
  tryConsume(tokens = 1) {
    this.refill();

    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }

    return false;
  }

  // Add tokens for the time elapsed since the last refill, up to capacity.
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * this.refillRate;

    this.tokens = Math.min(
      this.capacity,
      this.tokens + tokensToAdd
    );
    this.lastRefill = now;
  }
}
```
Bulkhead
```javascript
class Bulkhead {
  // maxConcurrent: how many calls may run at the same time.
  constructor({ maxConcurrent }) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.queue = [];
  }

  async execute(fn) {
    // Wait in line while all slots are taken; re-check after each wake-up.
    while (this.current >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }

    this.current++;
    try {
      return await fn();
    } finally {
      // Free the slot and wake the next waiter, if any.
      this.current--;
      if (this.queue.length > 0) {
        const resolve = this.queue.shift();
        resolve();
      }
    }
  }
}
```
Graceful Degradation
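```javascript
// recommendationService, cache, logger, and DEFAULT_RECOMMENDATIONS
// are assumed to be provided by the surrounding application.
async function getRecommendations(userId) {
  try {
    // Try personalized recommendations
    return await recommendationService.getPersonalized(userId, {
      timeout: 500, // Fail fast
    });
  } catch (error) {
    logger.warn('Personalized recommendations failed, falling back', {
      userId,
      error: error.message,
    });

    try {
      // Fall back to popular items
      const popular = await cache.get('popular_items');
      // A cache miss may resolve to null instead of throwing.
      if (popular) return popular;
    } catch (fallbackError) {
      // Ignore cache errors and use the final fallback below
    }

    // Final fallback
    return DEFAULT_RECOMMENDATIONS;
  }
}
```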
Capacity Planning
Utilization Tracking
```prometheus
# Current utilization
current_utilization =
  sum(rate(http_requests_total[5m]))
  / capacity_requests_per_second

# Alert when approaching capacity
- alert: HighUtilization
  expr: current_utilization > 0.80
  for: 10m
```
Growth Projection
Current QPS: 1,000
Growth rate: 20% per month
Capacity per instance: 100 QPS
Current instances: 12

In 6 months:
Projected QPS: 1,000 * (1.20)^6 = 2,986
Instances needed: 2,986 / 100 = 29.86, rounded up to 30
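The projection is compound growth followed by a ceiling division. A minimal sketch that reproduces the numbers above, where the function and parameter names are assumptions for illustration:

```javascript
// Project QPS under compound monthly growth and derive the instance count.
// monthlyGrowth is a fraction (0.2 for 20% per month).
function projectInstances({ currentQps, monthlyGrowth, months, qpsPerInstance }) {
  const projectedQps = currentQps * Math.pow(1 + monthlyGrowth, months);
  return {
    projectedQps: Math.round(projectedQps),
    // Round up: a fractional instance still requires a whole one.
    instancesNeeded: Math.ceil(projectedQps / qpsPerInstance),
  };
}

const plan = projectInstances({
  currentQps: 1000,
  monthlyGrowth: 0.2,
  months: 6,
  qpsPerInstance: 100,
});
console.log(plan); // { projectedQps: 2986, instancesNeeded: 30 }
```

This figure is the minimum needed to serve the projected load at full utilization, with no headroom included.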
Load Testing
```javascript
// k6 load test
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up
    { duration: '5m', target: 100 }, // Steady state
    { duration: '2m', target: 200 }, // Spike
    { duration: '5m', target: 200 }, // Higher steady
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% under 500ms
    http_req_failed: ['rate<0.01'],   // Less than 1% errors
  },
};

export default function () {
  const res = http.get('https://api.example.com/endpoint');

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1); // k6 sleep takes seconds
}
```
Chaos Engineering
Fault Injection
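```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inject latency
function withLatencyInjection(fn, { probability = 0.1, delayMs = 1000 } = {}) {
  return async (...args) => {
    // Delay a random fraction of calls before running the wrapped function.
    if (Math.random() < probability) {
      await sleep(delayMs);
    }
    return fn(...args);
  };
}

// Inject failures
function withFailureInjection(fn, { probability = 0.05 } = {}) {
  return async (...args) => {
    // Fail a random fraction of calls without invoking the wrapped function.
    if (Math.random() < probability) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}
```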
Best Practices
Design for Failure
- Assume all dependencies can fail
- Have fallback options
- Fail fast and time out quickly
- Implement retries with backoff
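Failing fast usually means putting an upper bound on how long a dependency call may take. One way to sketch that is with `Promise.race`. The helper name `withTimeout` is an assumption, and this version only stops waiting: it does not cancel the underlying work.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
// Hypothetical helper: the underlying operation keeps running after a timeout.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A caller could wrap a dependency call as `withTimeout(fetchProfile(id), 500)`, where `fetchProfile` stands in for any promise-returning function, and pair it with a fallback in the `catch` branch.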
Measure User Impact
- SLOs should reflect user experience
- Don't alert on internal metrics alone
- Track user-facing metrics with real user monitoring (RUM)
Balance Velocity and Reliability
- Use error budgets to make decisions
- Don't target 100% reliability
- Spend error budget on innovation
Automate Everything
- Automate deployments
- Automate rollbacks
- Automate capacity scaling
- Automate incident response