sre-monitoring-and-observability
SRE Monitoring and Observability
Building comprehensive monitoring and observability systems.
Four Golden Signals
Latency
Time to process requests:
prometheus
# Request duration
http_request_duration_seconds
# Query: 95th-percentile latency over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
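To make the query above less of a black box, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, in the spirit of `histogram_quantile`. The function name `histogramQuantile` and the bucket data are assumptions made for illustration, and the sketch skips edge cases that Prometheus handles (for example a histogram with no finite bucket).

```javascript
// Estimate a quantile from cumulative histogram buckets.
// Assumptions: buckets are sorted by upper bound (le), counts are cumulative,
// the last bucket is +Inf, and at least one finite bucket exists.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // A rank inside the +Inf bucket is reported as the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Linear interpolation inside the selected bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical bucket data: 100 requests in total.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets));
```

The estimate can only be as precise as the bucket boundaries, so bucket bounds close to the latency threshold you care about give more useful percentiles.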
Traffic
Demand on the system:
prometheus
# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
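The `rate()` calls above work on counters that only go up and restart from zero when a process restarts. The following is a simplified sketch of that idea, assuming a hypothetical `counterRate` helper and made-up samples. Unlike Prometheus, it does not extrapolate to the window boundaries; it only illustrates the counter-reset handling.

```javascript
// Simplified per-second rate of a counter over a window of samples.
// Each sample is { t: seconds, value: counter reading }.
function counterRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop means the counter restarted from zero, so the whole new value
    // is counted as increase.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

// Hypothetical scrape data with one restart.
const samples = [
  { t: 0, value: 100 },
  { t: 60, value: 160 },
  { t: 120, value: 20 }, // process restarted
  { t: 180, value: 80 },
];
console.log(counterRate(samples));
```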
Errors
Rate of failed requests:
prometheus
# Error rate: share of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI compliance (error_rate and slo_target are placeholders, not defined in this document)
1 - (error_rate / slo_target)
Saturation
Resource utilization:
prometheus
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
Service Level Indicators (SLIs)
Availability SLI
prometheus
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
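An availability SLI is usually read against an SLO target, which turns it into an error budget. The sketch below shows that arithmetic under assumed numbers; the function `errorBudget`, the 99.9% target, and the request counts are all hypothetical and not taken from any real service.

```javascript
// Derive availability and error-budget usage from request counts.
// sloTarget is a fraction, e.g. 0.999 for "99.9% of requests succeed".
function errorBudget(successful, total, sloTarget) {
  const availability = successful / total;
  const allowedFailures = (1 - sloTarget) * total; // budget, in requests
  const actualFailures = total - successful;
  const budgetConsumed = actualFailures / allowedFailures;
  return { availability, budgetConsumed, budgetRemaining: 1 - budgetConsumed };
}

// Hypothetical 30-day window: 1,000,000 requests, 500 failures.
const report = errorBudget(999500, 1000000, 0.999);
console.log(report);
```

With these numbers the service has used roughly half of its budget for the window, which is the kind of signal the P1 "error budget at risk" severity is meant to capture.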
Latency SLI
prometheus
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
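The ratio above only works when the histogram has a bucket whose upper bound equals the threshold. A small sketch of the same calculation on in-memory bucket counts makes that constraint explicit; `latencySli` and the sample buckets are illustrative assumptions.

```javascript
// Fraction of requests at or below a latency threshold, from cumulative buckets.
// Assumes the last bucket is +Inf, so its count equals the total request count.
function latencySli(cumulativeBuckets, thresholdSeconds) {
  const total = cumulativeBuckets[cumulativeBuckets.length - 1].count;
  const atThreshold = cumulativeBuckets.find((b) => b.le === thresholdSeconds);
  // Without a bucket boundary at the threshold the SLI cannot be read exactly.
  if (!atThreshold) throw new Error(`no bucket with le=${thresholdSeconds}`);
  return atThreshold.count / total;
}

// Hypothetical bucket data.
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(latencySli(latencyBuckets, 0.5));
```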
Throughput SLI
prometheus
# Requests processed within capacity
clamp_max(
  rate(http_requests_total[5m]) / capacity_requests_per_second,
  1.0
)
Alerting
Alert Severity Levels
P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed
Example Alerts
yaml
groups:
  - name: sre
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
      # sli_availability and error_budget_remaining are placeholders, not defined in this document
      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
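Each rule above carries a `for:` duration, which means the condition has to hold continuously before the alert fires. The sketch below models that pending-to-firing transition; the `AlertState` class and the evaluation timeline are assumptions for illustration and leave out details of a real alerting engine such as per-label-set tracking.

```javascript
// Sketch of the pending -> firing transition implied by a `for:` clause.
class AlertState {
  constructor(forSeconds) {
    this.forSeconds = forSeconds;
    this.activeSince = null; // time the condition first became true
  }

  // Called once per evaluation with the current time in seconds and whether
  // the alert expression currently holds.
  evaluate(now, conditionTrue) {
    if (!conditionTrue) {
      this.activeSince = null; // any gap resets the timer
      return 'inactive';
    }
    if (this.activeSince === null) this.activeSince = now;
    return now - this.activeSince >= this.forSeconds ? 'firing' : 'pending';
  }
}

// Hypothetical timeline for a rule with `for: 5m`.
const alertState = new AlertState(300);
const timeline = [
  [0, true], [120, true], [240, false], [300, true], [480, true], [600, true],
].map(([t, cond]) => alertState.evaluate(t, cond));
console.log(timeline);
```

The brief recovery at 240 seconds resets the timer, so the alert only fires once the condition has held for a full five minutes starting at 300 seconds.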
Dashboards
Overview Dashboard
- Service health (red/yellow/green)
- Request rate
- Error rate
- Latency percentiles (p50, p95, p99)
- Saturation metrics
Detailed Dashboard
- Per-endpoint metrics
- Dependency health
- Database performance
- Cache hit rates
- Queue depths
Distributed Tracing
OpenTelemetry
javascript
// SpanStatusCode must be imported alongside trace; it is used below.
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  try {
    // Attach request context so traces can be filtered by user and path.
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);
    const result = await processRequest(req);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    // Always end the span, on success and on failure.
    span.end();
  }
}
Structured Logging
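The logging call in this section assumes a `logger` object that takes an event name and a field object. As a minimal sketch of what such a logger could look like, the hypothetical `createLogger` below writes one JSON object per line; it stands in for whatever logging library the service actually uses.

```javascript
// Minimal JSON-lines logger; a stand-in for a real logging library.
// `write` is injectable so the output can be captured or redirected.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (message, fields = {}) => {
    // JSON.stringify drops keys whose value is undefined, so an absent
    // `error` field is omitted from the output.
    write(JSON.stringify({ ts: new Date().toISOString(), level, message, ...fields }));
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

const exampleLogger = createLogger();
exampleLogger.info('request_processed', {
  request_id: 'req-1',
  status_code: 200,
  duration_ms: 12,
  error: undefined,
});
```

Keeping every event on a single line with stable field names is what lets a log pipeline filter and aggregate on fields such as `status_code` or `duration_ms`.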
javascript
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
Best Practices
USE Method
For resources:
- Utilization: % time resource is busy
- Saturation: Work queued but not serviced
- Errors: Error count
RED Method
For requests:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
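The three RED measurements can be sketched on a plain list of request records. The example below is illustrative only: `redSummary`, the record shape, and the sample data are assumptions, failures are taken to mean 5xx statuses, and percentiles use the simple nearest-rank method rather than histogram buckets.

```javascript
// Summarize RED metrics for the requests seen in one time window.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted durations.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.max(0, Math.ceil(p * durations.length) - 1)];
  const failed = requests.filter((r) => r.status >= 500).length;
  return {
    rate: requests.length / windowSeconds, // requests per second
    errors: failed / windowSeconds, // failed requests per second
    duration: { p50: percentile(0.5), p95: percentile(0.95), p99: percentile(0.99) },
  };
}

// Hypothetical records for a 60-second window.
const requests = [
  { status: 200, durationMs: 20 },
  { status: 200, durationMs: 35 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 50 },
];
console.log(redSummary(requests, 60));
```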
Alert on Symptoms, Not Causes
yaml
# Good - alert on user impact
# (p95_latency is a placeholder for a latency recording rule, in seconds)
- alert: HighLatency
  expr: p95_latency > 1
# Bad - alert on potential cause
# (cpu_usage is a placeholder for a CPU utilization metric, in percent)
- alert: HighCPU
  expr: cpu_usage > 80
Runbook Links
yaml
annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"