sre-monitoring-and-observability

SRE Monitoring and Observability

Building comprehensive monitoring and observability systems.

Four Golden Signals

Latency

Time to process requests:
prometheus
# Request duration
http_request_duration_seconds

# Query
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
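
To make the `histogram_quantile` query less opaque, here is a minimal JavaScript sketch of the idea behind it: locate the cumulative bucket that contains the target rank, then interpolate linearly inside that bucket. It is a simplified illustration rather than Prometheus's actual implementation; the `estimateQuantile` name and the sample bucket counts are made up for this example.

```javascript
// Simplified sketch of quantile estimation from cumulative histogram buckets.
// Assumption: buckets are sorted by upper bound (le) and counts are cumulative,
// as in Prometheus classic histograms. Names and data are illustrative only.
function estimateQuantile(q, buckets) {
  if (q < 0 || q > 1 || buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;

  for (const { le, count } of buckets) {
    if (count >= rank) {
      // The +Inf bucket has no upper bound, so fall back to the last finite bound.
      if (le === Infinity) return lowerBound;
      const inBucket = count - lowerCount;
      if (inBucket === 0) return le;
      // Assume observations are spread evenly inside the bucket.
      return lowerBound + (le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = le;
    lowerCount = count;
  }
  return lowerBound;
}

const sampleBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, sampleBuckets);
console.log(`estimated p95: ${p95}s`);
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.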

Traffic

Demand on the system:
prometheus
# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
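
As a rough illustration of what `rate()` does with a counter, the sketch below computes a per-second rate from raw samples and compensates for counter resets. It is deliberately simplified: Prometheus also extrapolates to the edges of the window, which this version does not. The `counterRate` helper and the sample data are hypothetical.

```javascript
// Simplified per-second rate over counter samples, with counter-reset handling.
// Assumption: samples are { t: seconds, v: counter value } sorted by time.
// Unlike Prometheus's rate(), this does not extrapolate to the window edges.
function counterRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].v;
    const curr = samples[i].v;
    // A drop means the counter restarted from zero, so curr is all new growth.
    increase += curr >= prev ? curr - prev : curr;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : NaN;
}

const samples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 10 }, // process restarted, counter reset
  { t: 45, v: 40 },
  { t: 60, v: 70 },
];
console.log(`requests per second: ${counterRate(samples).toFixed(2)}`);
```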

Errors

Rate of failed requests:
prometheus
# Error rate: aggregate both sides so the ratio is 5xx requests over all requests
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# SLI compliance: fraction of the error budget still unspent.
# error_rate and slo_target are placeholder series; slo_target here is the
# allowed error ratio (for example 0.001 for a 99.9% SLO).
1 - (error_rate / slo_target)
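
The compliance formula is only meaningful when the denominator is the allowed error ratio, for example 0.001 for a 99.9% target. The sketch below works through that arithmetic, assuming the SLO is given as a success target strictly below 1; the `errorBudgetRemaining` function and its field names are illustrative, not part of any library.

```javascript
// Sketch: fraction of error budget remaining for a ratio-based SLO.
// Assumptions: successTarget is the SLO as a success ratio below 1 (e.g. 0.999),
// and both counts cover the same window as the SLO. Names are illustrative.
function errorBudgetRemaining({ totalRequests, failedRequests, successTarget }) {
  if (totalRequests === 0) return 1; // no traffic, no budget spent
  const errorRatio = failedRequests / totalRequests;
  const allowedErrorRatio = 1 - successTarget;
  // 1 means the budget is untouched, 0 means it is spent, negative means overspent.
  return 1 - errorRatio / allowedErrorRatio;
}

const remaining = errorBudgetRemaining({
  totalRequests: 1000000,
  failedRequests: 250,
  successTarget: 0.999,
});
console.log(`${(remaining * 100).toFixed(1)}% of the error budget remains`);
```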

Saturation

Resource utilization:
prometheus
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
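
For services that want to report similar signals from inside the process, the sketch below samples host-level load and memory with Node's built-in `os` module. It only approximates the node exporter queries (for instance, `os.freemem()` is not guaranteed to match `MemAvailable`), and the `hostSaturationSnapshot` name and returned field names are hypothetical.

```javascript
// Sketch: sampling host-level utilization and saturation signals from Node.js.
// Uses only the built-in os module; function and field names are illustrative.
const os = require('os');

function hostSaturationSnapshot() {
  const cpuCount = os.cpus().length || 1;
  const [load1] = os.loadavg(); // 1-minute load average (always 0 on Windows)
  const totalMem = os.totalmem();
  const freeMem = os.freemem();
  return {
    // Load per core above 1.0 suggests runnable work is queueing for CPU.
    loadPerCore: load1 / cpuCount,
    memoryUsedPercent: ((totalMem - freeMem) / totalMem) * 100,
  };
}

const snapshot = hostSaturationSnapshot();
console.log(snapshot);
```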

Service Level Indicators (SLIs)

Availability SLI

prometheus
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d])) / sum(rate(http_requests_total[30d]))

Latency SLI

prometheus
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) / sum(rate(http_request_duration_seconds_count[30d]))

Throughput SLI

prometheus
# Requests processed within capacity
# capacity_requests_per_second is a placeholder series holding the provisioned capacity;
# summing the request rate and using scalar() avoids label-matching mismatches.
clamp_max(sum(rate(http_requests_total[5m])) / scalar(capacity_requests_per_second), 1.0)

Alerting

Alert Severity Levels

P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed

Example Alerts

yaml
groups:
  - name: sre
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 10m
        labels:
          severity: warning
      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
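
One common way to reason about an alert like `ErrorBudgetBurn` is the burn rate: how many times faster than the SLO allows the budget is currently being spent. The sketch below shows that arithmetic under stated assumptions (a 99.9% success target over a 30-day window); the `burnRate` and `hoursToExhaustion` helpers are hypothetical names, and this is not necessarily how the rule above was derived.

```javascript
// Sketch: error-budget burn rate and projected time to exhaustion.
// Assumptions: successTarget is the SLO as a success ratio (e.g. 0.999) over
// sloWindowHours, and errorRatio is measured over a recent short window.
function burnRate(errorRatio, successTarget) {
  // 1.0 means the budget is being spent exactly as fast as the SLO allows.
  return errorRatio / (1 - successTarget);
}

function hoursToExhaustion(rate, sloWindowHours, budgetRemainingFraction = 1) {
  if (rate <= 0) return Infinity;
  return (sloWindowHours * budgetRemainingFraction) / rate;
}

const currentBurn = burnRate(0.01, 0.999); // 1% errors against a 99.9% target
const hoursLeft = hoursToExhaustion(currentBurn, 30 * 24);
console.log(`burn rate ${currentBurn.toFixed(1)}x, budget gone in ~${hoursLeft.toFixed(0)}h`);
```

A sustained burn rate well above 1 is a symptom worth paging on even when the instantaneous error rate looks modest.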

Dashboards

Overview Dashboard

  • Service health (red/yellow/green)
  • Request rate
  • Error rate
  • Latency percentiles (p50, p95, p99)
  • Saturation metrics

Detailed Dashboard

  • Per-endpoint metrics
  • Dependency health
  • Database performance
  • Cache hit rates
  • Queue depths

Distributed Tracing

OpenTelemetry

javascript
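// trace provides the tracer; SpanStatusCode supplies the OK/ERROR status values.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');

  try {
    // Attach request context so the span can be filtered and correlated.
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);

    const result = await processRequest(req);

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Record the exception as a span event and mark the span as failed.
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    // Always end the span, otherwise it is never exported.
    span.end();
  }
}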

Structured Logging

javascript
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
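
The call above assumes a `logger` whose `info` method takes an event name and a bag of fields. As a minimal sketch of what such a logger might do, the hypothetical `createLogger` helper below emits one JSON object per line; a real service would more likely use an established logging library.

```javascript
// Sketch of a tiny structured logger that emits one JSON object per line.
// createLogger is a hypothetical helper written for this example only.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (event, fields = {}) => {
    // JSON.stringify drops undefined values, so optional fields such as
    // `error` simply disappear when there is nothing to report.
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      event,
      ...fields,
    }));
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

// Capture output in memory to show the resulting shape.
const lines = [];
const logger = createLogger((line) => lines.push(line));
logger.info('request_processed', {
  request_id: 'req-123',
  status_code: 200,
  duration_ms: 42,
  error: undefined,
});
console.log(lines[0]);
```

Keeping every entry a single JSON object with stable field names is what lets log pipelines filter and aggregate by `request_id` or `status_code` without parsing free text.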

Best Practices

USE Method

For resources:
  • Utilization: % time resource is busy
  • Saturation: Work queued but not serviced
  • Errors: Error count

RED Method

For requests:
  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution
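
The three RED measures can be sketched from a window of request records as follows. This assumes each record carries a status code and a duration in milliseconds, that 5xx responses count as failures, and that a nearest-rank percentile is acceptable; the `redSummary` function and the sample data are illustrative.

```javascript
// Sketch: deriving RED metrics from a window of request records.
// Assumptions: each record is { status, durationMs }, the window length is
// known, and 5xx responses count as failures. Names are illustrative.
function redSummary(requests, windowSeconds) {
  const failed = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile over the raw durations.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.max(0, Math.ceil((p / 100) * durations.length) - 1)];
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorsPerSecond: failed / windowSeconds,
    durationMs: { p50: percentile(50), p95: percentile(95), p99: percentile(99) },
  };
}

const windowRequests = [
  { status: 200, durationMs: 12 },
  { status: 200, durationMs: 30 },
  { status: 500, durationMs: 250 },
  { status: 200, durationMs: 18 },
  { status: 503, durationMs: 400 },
];
const summary = redSummary(windowRequests, 10);
console.log(summary);
```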

Alert on Symptoms, Not Causes

Page on what users experience. A cause such as high CPU can occur with no user impact, and user impact can occur without it.
yaml
# p95_latency and cpu_usage are placeholder recording rules.

# Good - alert on user impact
- alert: HighLatency
  expr: p95_latency > 1  # seconds

# Bad - alert on potential cause
- alert: HighCPU
  expr: cpu_usage > 80  # percent

Runbook Links

yaml
annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"