sre-monitoring-and-observability

SRE Monitoring and Observability

Building comprehensive monitoring and observability systems.

Four Golden Signals

Latency

Time to process requests:

prometheus

# Request duration
http_request_duration_seconds

# Query: 95th percentile over the last 5 minutes
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
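
To make the percentile query less of a black box, here is a minimal sketch of estimating a quantile from cumulative histogram buckets by linear interpolation, which is the general idea behind `histogram_quantile`. The function name `estimateQuantile` and the bucket counts are hypothetical, and the real function handles more edge cases than this.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` is sorted by upper bound `le`; each `count` is cumulative.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;

  let lowerBound = 0;
  let lowerCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // The +Inf bucket has no upper bound, so report the last finite bound.
      if (le === Infinity) return lowerBound;
      const inBucket = count - lowerCount;
      if (inBucket === 0) return le;
      // Assume observations are spread evenly inside the bucket.
      return lowerBound + (le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = le;
    lowerCount = count;
  }
  return lowerBound;
}

// Hypothetical cumulative bucket counts for 100 requests.
const sampleBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 98 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, sampleBuckets); // 0.8125
```

Because the estimate interpolates inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.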

Traffic

Demand on the system:

prometheus

# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
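
As a rough illustration of what a per-second rate over a counter means, the sketch below computes it from raw samples and treats a drop in value as a counter reset. The function name `counterRate` and the sample data are hypothetical; Prometheus's own `rate()` also extrapolates to the window edges, which this simplified version does not.

```javascript
// Per-second rate of a monotonically increasing counter across a window.
// A drop in value is treated as a counter reset (for example a process
// restart), so counting resumes from zero at that sample.
function counterRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].value;
    const curr = samples[i].value;
    increase += curr >= prev ? curr - prev : curr;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

// Hypothetical samples: timestamps in seconds, counter values.
const counterSamples = [
  { t: 0, value: 100 },
  { t: 100, value: 400 },
  { t: 200, value: 200 }, // counter reset between these two samples
  { t: 300, value: 600 },
];
const requestsPerSecond = counterRate(counterSamples); // 3
```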

Errors

Rate of failed requests:

prometheus

# Error rate: share of requests that returned 5xx.
# Both sides are aggregated so their label sets match; dividing the raw
# series would pair each 5xx series only with itself.
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# SLI compliance
1 - (error_rate / slo_target)
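
The error-rate ratio can also be checked by hand from per-status request counts. This is a minimal sketch under the assumption that counts are keyed by HTTP status code; `errorRatio` and the numbers are hypothetical.

```javascript
// Share of requests whose status code is 5xx, from per-status counts.
function errorRatio(countsByStatus) {
  let total = 0;
  let errors = 0;
  for (const [status, count] of Object.entries(countsByStatus)) {
    total += count;
    if (/^5\d\d$/.test(status)) errors += count;
  }
  // With no traffic there is nothing to fail, so report zero.
  return total === 0 ? 0 : errors / total;
}

// Hypothetical counts for one window: 25 of 1000 requests failed.
const ratio = errorRatio({ '200': 940, '404': 35, '500': 20, '503': 5 }); // 0.025
```

Note that 4xx responses count toward the total but not toward errors here, matching the `5..` matcher in the query.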

Saturation

Resource utilization:

prometheus

# CPU usage (%)
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
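
The memory formula is simple enough to reproduce locally. The sketch below applies the same arithmetic to values from Node's built-in `os` module; `memoryUsagePercent` is a hypothetical helper, and `os.freemem()` is only a stand-in for `MemAvailable`, which the operating system may compute differently.

```javascript
const os = require('os');

// Memory usage as a percentage of total: (total - available) / total * 100.
function memoryUsagePercent(totalBytes, availableBytes) {
  return ((totalBytes - availableBytes) / totalBytes) * 100;
}

// Local reading; free memory is used here as an approximation of
// "available" memory, so expect it to differ from node_exporter's value.
const localUsage = memoryUsagePercent(os.totalmem(), os.freemem());
```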

Service Level Indicators (SLIs)

Availability SLI

prometheus

# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
  /
sum(rate(http_requests_total[30d]))
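
An availability SLI becomes actionable once it is compared with an objective. The following sketch derives the remaining error budget from request counts; `errorBudget` and the 99.9% target are assumptions for illustration, not values taken from this document.

```javascript
// Availability SLI plus the share of the error budget still unspent.
// sloTarget is a hypothetical objective such as 0.999 (99.9%).
function errorBudget(successful, total, sloTarget) {
  const sli = successful / total;
  const allowedFailures = (1 - sloTarget) * total;
  const actualFailures = total - successful;
  return {
    sli,
    allowedFailures,
    actualFailures,
    // 1 means the budget is untouched, 0 means it is spent,
    // and a negative value means the objective has been missed.
    budgetRemaining: 1 - actualFailures / allowedFailures,
  };
}

// Hypothetical 30-day totals: 500 failures out of one million requests.
const budget = errorBudget(999500, 1000000, 0.999);
```

With these numbers roughly half of the budget remains, since about 1000 failures would be allowed and 500 occurred.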

Latency SLI

prometheus

# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
  /
sum(rate(http_request_duration_seconds_count[30d]))
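
One detail worth spelling out is that the latency threshold has to coincide with a histogram bucket boundary, because only bucket counts are stored. A minimal sketch, with the hypothetical helper `latencySli` and made-up bucket counts:

```javascript
// Fraction of requests at or below the threshold, read straight from a
// cumulative bucket. The threshold must match an existing `le` boundary.
function latencySli(buckets, thresholdSeconds) {
  const bucket = buckets.find((b) => b.le === thresholdSeconds);
  if (!bucket) {
    throw new Error(`no bucket with le=${thresholdSeconds}`);
  }
  // The last (+Inf) bucket counts every observation.
  const total = buckets[buckets.length - 1].count;
  return total === 0 ? 1 : bucket.count / total;
}

// Hypothetical cumulative bucket counts.
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 98 },
  { le: Infinity, count: 100 },
];
const fastFraction = latencySli(latencyBuckets, 0.5); // 0.9
```

Asking for a threshold such as 0.3 throws here, which mirrors the fact that no `le="0.3"` series would exist to query.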

Throughput SLI

prometheus

# Requests processed within capacity
clamp_max(
  rate(http_requests_total[5m]) / capacity_requests_per_second,
  1.0
)

Alerting

Alert Severity Levels

P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed

Example Alerts

yaml

groups:
  - name: sre
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"

      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 10m
        labels:
          severity: warning

      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
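
The `for:` field in these rules delays firing until the condition has held continuously for the stated duration. The sketch below models that behaviour as a small state machine; the class name `AlertState` and the evaluation times are hypothetical, and real rule evaluation has more states and bookkeeping than this.

```javascript
// Tracks one alert through inactive -> pending -> firing.
class AlertState {
  constructor(forSeconds) {
    this.forSeconds = forSeconds;
    this.activeSince = null;
  }

  // Call once per evaluation with the current time and condition result.
  evaluate(nowSeconds, conditionTrue) {
    if (!conditionTrue) {
      this.activeSince = null; // any gap resets the timer
      return 'inactive';
    }
    if (this.activeSince === null) this.activeSince = nowSeconds;
    const heldFor = nowSeconds - this.activeSince;
    return heldFor >= this.forSeconds ? 'firing' : 'pending';
  }
}

const alertState = new AlertState(300); // equivalent of `for: 5m`
const states = [
  alertState.evaluate(0, true),     // pending
  alertState.evaluate(150, true),   // pending
  alertState.evaluate(300, true),   // firing
  alertState.evaluate(360, false),  // inactive
  alertState.evaluate(420, true),   // pending again, timer restarted
];
```

A longer `for:` trades detection speed for fewer alerts on short spikes, which is why the latency rule above waits 10 minutes while the error-rate rule waits 5.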

Dashboards

Overview Dashboard

Service health (red/yellow/green)
Request rate
Error rate
Latency percentiles (p50, p95, p99)
Saturation metrics
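
A red/yellow/green health indicator needs some rule for combining signals. One possible approach, sketched here, is to take the worst level across signals. The thresholds and the names `HEALTH_THRESHOLDS` and `serviceHealth` are assumptions for illustration and should be tuned to the service's SLOs.

```javascript
// Hypothetical thresholds for two of the golden signals.
const HEALTH_THRESHOLDS = {
  errorRate: { yellow: 0.01, red: 0.05 },
  p95Seconds: { yellow: 0.5, red: 1.0 },
};

// 0 = green, 1 = yellow, 2 = red for a single signal.
function signalLevel(value, { yellow, red }) {
  if (value >= red) return 2;
  if (value >= yellow) return 1;
  return 0;
}

// The overall status is the worst of the individual signals.
function serviceHealth({ errorRate, p95Seconds }) {
  const worst = Math.max(
    signalLevel(errorRate, HEALTH_THRESHOLDS.errorRate),
    signalLevel(p95Seconds, HEALTH_THRESHOLDS.p95Seconds)
  );
  return ['green', 'yellow', 'red'][worst];
}

const health = serviceHealth({ errorRate: 0.02, p95Seconds: 0.2 }); // 'yellow'
```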

Detailed Dashboard

Per-endpoint metrics
Dependency health
Database performance
Cache hit rates
Queue depths

Distributed Tracing

OpenTelemetry

javascript

// SpanStatusCode must be imported alongside trace; it is used below.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  // Start a span covering the whole request.
  const span = tracer.startSpan('handle_request');

  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);

    const result = await processRequest(req);

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Mark the span as failed before rethrowing to the caller.
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    // Always end the span, on success or failure.
    span.end();
  }
}
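
The try/catch/finally pattern above tends to be repeated in every handler, so it may help to factor it into a wrapper. This is a sketch, not part of the OpenTelemetry API: `withSpan`, `createRecordingTracer`, and the local `STATUS` constants are hypothetical, with the constant values chosen to mirror `SpanStatusCode` (OK = 1, ERROR = 2). The recording tracer is a stand-in so the wrapper can be exercised without the SDK.

```javascript
// Local copies of the status codes, mirroring SpanStatusCode.
const STATUS = { OK: 1, ERROR: 2 };

// Runs fn inside a span and handles status and end() in one place.
// `tracer` is anything with startSpan(name), such as an OpenTelemetry tracer.
async function withSpan(tracer, name, fn) {
  const span = tracer.startSpan(name);
  try {
    const result = await fn(span);
    span.setStatus({ code: STATUS.OK });
    return result;
  } catch (error) {
    span.setStatus({ code: STATUS.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

// A stand-in tracer that records what happened to each span.
function createRecordingTracer() {
  const spans = [];
  return {
    spans,
    startSpan(name) {
      const span = {
        name,
        attributes: {},
        status: null,
        ended: false,
        setAttribute(key, value) { this.attributes[key] = value; },
        setStatus(status) { this.status = status; },
        end() { this.ended = true; },
      };
      spans.push(span);
      return span;
    },
  };
}
```

A handler could then be written as `withSpan(tracer, 'handle_request', (span) => processRequest(req))`, keeping the span bookkeeping out of the business logic.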

Structured Logging

javascript

logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
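
The call above assumes a `logger` object exists. If no logging library is in place, a minimal sketch of one might look like the following, emitting one JSON object per line. `createLogger` and its `write` parameter are hypothetical, and a production service would more likely use an established logging library.

```javascript
// Minimal structured logger: one JSON object per line.
// Fields whose value is undefined are dropped by JSON.stringify.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (event, fields = {}) => {
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      event,
      ...fields,
    }));
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

// Capture output in memory instead of stdout for this example.
const capturedLines = [];
const exampleLogger = createLogger((line) => capturedLines.push(line));
exampleLogger.info('request_processed', {
  request_id: 'r-1',
  status_code: 200,
  duration_ms: 12,
  error: undefined, // omitted from the output
});
```

Keeping field names stable (for example always `duration_ms`, never `durationMs` in some places) is what makes such logs easy to query later.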

Best Practices

USE Method

For resources:

Utilization: Percentage of time the resource is busy
Saturation: Amount of work queued but not yet serviced
Errors: Count of error events

RED Method

For requests:

Rate: Requests per second
Errors: Failed requests per second
Duration: Request latency distribution
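
The three RED numbers can be derived from a plain list of request records. The sketch below assumes each record has a `status` and a `durationMs` field and uses a nearest-rank percentile; `redSummary` and the sample data are hypothetical.

```javascript
// Summarise a window of requests into Rate, Errors and Duration.
// Each record is { status, durationMs }; windowSeconds is the window length.
function redSummary(requests, windowSeconds) {
  const failed = requests.filter((r) => r.status >= 500).length;
  const sorted = requests.map((r) => r.durationMs).sort((a, b) => a - b);

  // Nearest-rank percentile over the raw durations.
  const percentile = (p) =>
    sorted.length === 0
      ? NaN
      : sorted[Math.max(0, Math.ceil(p * sorted.length) - 1)];

  return {
    ratePerSecond: requests.length / windowSeconds,
    errorsPerSecond: failed / windowSeconds,
    durationMs: {
      p50: percentile(0.5),
      p95: percentile(0.95),
      p99: percentile(0.99),
    },
  };
}

// Hypothetical window: 10 requests in 10 seconds, the first two failing.
const windowRequests = Array.from({ length: 10 }, (_, i) => ({
  status: i < 2 ? 500 : 200,
  durationMs: (i + 1) * 10,
}));
const red = redSummary(windowRequests, 10);
```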

Alert on Symptoms, Not Causes

yaml

# Good - alert on user impact
- alert: HighLatency
  expr: p95_latency > 1s

# Bad - alert on potential cause
- alert: HighCPU
  expr: cpu_usage > 80%

Runbook Links

yaml

annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"
