observability-sre

Observability & Site Reliability Engineering
Core Principles
核心原则
- Three Pillars — Metrics, Logs, and Traces provide holistic visibility
- Observability-First — Build systems that explain their own behavior
- SLO-Driven — Define reliability targets that matter to users
- Proactive Detection — Find issues before customers do
- Blameless Culture — Learn from failures without blame
- Automate Toil — Reduce repetitive operational work
- Continuous Improvement — Each incident makes systems more resilient
- Full-Stack Visibility — Monitor from infrastructure to business metrics
Hard Rules (Must Follow)
These rules are mandatory. Violating them means the skill is not working correctly.
Symptom-Based Alerts Only
Alert on user-facing symptoms, not internal infrastructure metrics.

```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70%
  # Users don't care about CPU, they care about latency
- alert: MemoryHigh
  expr: memory_usage > 80%
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"
- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```
Low Cardinality Labels
Loki/Prometheus labels must have low cardinality (<10 unique labels).

```yaml
# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"     # Millions of values!
  order_id: "ord_456"    # Millions of values!
  request_id: "req_789"  # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values
```

High cardinality data goes in the log body:

```typescript
logger.info({
  user_id: "usr_123",  // In JSON body, not a label
  order_id: "ord_456",
}, "Order processed");
```
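The split between indexed labels and log-body fields can be enforced mechanically. Below is a minimal sketch of a guard that routes fields to labels or to the JSON body; the `ALLOWED_LABELS` allowlist and `splitLabels` helper are illustrative assumptions, not part of any Loki API.

```typescript
// Route fields to Loki labels (low cardinality only) or the JSON log body.
// ALLOWED_LABELS is an assumed allowlist for illustration.
const ALLOWED_LABELS = new Set(["namespace", "app", "level", "method"]);

function splitLabels(fields: Record<string, string>) {
  const labels: Record<string, string> = {};
  const body: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (ALLOWED_LABELS.has(key)) {
      labels[key] = value; // indexed by Loki
    } else {
      body[key] = value;   // stays in the JSON log line
    }
  }
  return { labels, body };
}
```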
SLO-Based Error Budgets
Every service must have defined SLOs with error budget tracking.

```yaml
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime
groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```
Trace Context in Logs
All logs must include trace_id for correlation with distributed traces.

```typescript
// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```

Quick Reference
When to Use What
| Scenario | Tool/Pattern | Reason |
|---|---|---|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |
The Three Pillars
| Pillar | What | When | Tools |
|---|---|---|---|
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |
Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)
Observability Architecture
Layered Prometheus Setup

```yaml
# 2025 best practice: federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
          - 'cluster-prom-us-east.internal:9090'
          - 'cluster-prom-eu-west.internal:9090'
```
Recording Rules for Performance

```yaml
# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)
      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```
Resource Optimization

```yaml
# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Default is 15s; 30s roughly halves scrape load

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

# Limit sample retention
storage:
  tsdb:
    retention.time: 15d   # Keep only 15 days locally
    retention.size: 50GB  # Or max 50GB
```
---

Distributed Tracing with OpenTelemetry
Auto-Instrumentation Setup

```typescript
// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

Manual Instrumentation for Business Logic
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });
      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Sampling Strategies
```yaml
# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: make decisions after seeing the full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
Context Propagation
```typescript
// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
```

---

Structured Logging Best Practices
JSON Logging Format

```typescript
// Use a structured logging library
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}
```

Log Levels

```typescript
// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');    // Very verbose
logger.debug({ state }, 'Debug information');        // Development
logger.info({ event }, 'Normal operation');          // Production default
logger.warn({ issue }, 'Warning condition');         // Potential issues
logger.error({ error, context }, 'Error occurred');  // Errors
logger.fatal({ critical }, 'Fatal error');           // Process crash
```

---

Grafana Loki Configuration
```yaml
# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Extract low-cardinality fields as labels (trace_id stays in the body)
      - labels:
          level:
```

Loki Best Practices

- Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
- High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
- LogQL for Filtering — Use `{app="api"} | json | user_id="123"`
- Retention Policy — Keep recent logs longer, compress old logs

```promql
# LogQL query examples
{namespace="production", app="api"} |= "error"                   # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}"      # JSON parsing
rate({app="api"}[5m])                                            # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h]))   # Count by level
```

---
SLO/SLI/SLA Management
Definitions

- SLI (Service Level Indicator) — Quantifiable measurement of service behavior
  - Examples: request latency, error rate, availability, throughput
- SLO (Service Level Objective) — Target value/range for an SLI
  - Examples: 99.9% availability, P95 latency < 200ms
- SLA (Service Level Agreement) — Formal commitment with consequences
  - Example: "99.9% uptime or 10% credit"
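The SLI/SLO distinction is mechanical: the SLI is the measurement, the SLO is the target it is checked against. A minimal sketch (function names are illustrative assumptions):

```typescript
// SLI: measured fraction of successful requests.
function availabilitySli(totalRequests: number, errorRequests: number): number {
  return (totalRequests - errorRequests) / totalRequests;
}

// SLO check: does the measured SLI meet the objective?
function meetsSlo(sli: number, objective: number): boolean {
  return sli >= objective;
}
```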
The Four Golden Signals

```yaml
# Google SRE's key metrics for any service
Latency:
  SLI: P95 request latency
  SLO: 95% of requests complete in < 200ms
Traffic:
  SLI: Requests per second
  SLO: Handle 10,000 req/s peak load
Errors:
  SLI: Error rate (5xx / total)
  SLO: < 0.1% error rate
Saturation:
  SLI: Resource utilization (CPU, memory, disk)
  SLO: CPU < 70%, Memory < 80%
```
---

Error Budget

```python
# Error budget = 1 - SLO
SLO = 0.999                 # "three nines"
error_budget = 1 - SLO      # 0.1%

# Monthly calculation (30 days)
total_minutes = 30 * 24 * 60                      # 43,200 minutes
allowed_downtime = total_minutes * error_budget   # 43.2 minutes

# If you've had 20 minutes downtime this month:
budget_remaining = allowed_downtime - 20          # 23.2 minutes
budget_consumed = 20 / allowed_downtime           # 46.3%

# Policy: if budget > 90% consumed, freeze deployments
```
SLO Implementation with Prometheus

```yaml
# Recording rules for SLI calculation
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))
      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))
      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total
      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])
  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Alerting on SLO burn rate
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      slo:api_availability:ratio < 0.999  # Below 99.9% SLO
      and
      slo:api_availability:30d > 0.999    # But 30-day average still OK
    )
  for: 5m
  annotations:
    summary: "Burning error budget too fast"
    description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"
```
---

Incident Response
Incident Severity Levels
| Level | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |
Incident Response Roles (IMAG Framework)

```yaml
Incident Commander (IC):
  - Overall coordination and decision-making
  - Declares incident start/end
  - Decides on escalations
  - Owns communication to leadership
Operations Lead (OL):
  - Technical investigation and mitigation
  - Coordinates engineers
  - Implements fixes
  - Reports status to IC
Communications Lead (CL):
  - Internal/external status updates
  - Customer communication
  - Stakeholder notifications
  - Status page updates
```
Incident Workflow

1. Detection (alert fires or user reports)
   ↓
2. Triage (assess severity, assign IC)
   ↓
3. Response (assemble team, create war room)
   ↓
4. Mitigation (stop the bleeding, restore service)
   ↓
5. Resolution (fix root cause)
   ↓
6. Postmortem (blameless review, action items)
   ↓
7. Follow-up (implement improvements)

---

On-Call Best Practices

- Rotation — 1-week shifts, balanced across timezones
- Escalation — Primary → Secondary → Manager (15 min each)
- Playbooks — Step-by-step debugging guides for common issues
- Runbooks — Automated remediation scripts
- Handoff — 15-min sync at rotation change
- Compensation — On-call pay or comp time
- Health — No more than 2 incidents/night target
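The escalation ladder above (primary → secondary → manager, 15 minutes per level) can be sketched as a small function — an illustrative model, not a PagerDuty/Opsgenie API:

```typescript
// Pick the escalation target for an unacknowledged page, advancing one
// level every 15 minutes. Ladder and timing mirror the policy above.
const ESCALATION_LADDER = ["primary", "secondary", "manager"] as const;

function escalationTarget(minutesUnacked: number): string {
  const step = Math.floor(minutesUnacked / 15);
  return ESCALATION_LADDER[Math.min(step, ESCALATION_LADDER.length - 1)];
}
```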
Alert Fatigue Prevention

```yaml
# Symptoms vs causes alerting
# Alert on WHAT users experience, not WHY it's broken

# GOOD: Symptom-based alert
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200  # User-facing metric
  annotations:
    summary: "API is slow for users"

# BAD: Cause-based alert
- alert: CPUHigh
  expr: cpu_usage > 70%  # Internal metric, might not impact users
  # Don't alert unless this affects SLOs

# Use SLO-based alerting:
# alert when the error budget burn rate is too high
```

---
Blameless Postmortems
Core Principles

- Assume Good Intentions — Everyone did their best with available information
- Focus on Systems — Identify gaps in process/tooling, not people
- Psychological Safety — No punishment for honest mistakes
- Learning Culture — Incidents are opportunities to improve
- Separate from Performance Reviews — Postmortem participation never affects evaluations

Postmortem Template

```markdown
# Incident Postmortem: [Title]

Date: 2025-01-15
Duration: 10:30 - 12:15 UTC (1h 45m)
Severity: SEV-2
Incident Commander: Jane Doe
Responders: John Smith, Alice Johnson

## Impact

- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss

## Timeline (UTC)

- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring

## Root Cause

Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.

## What Went Well

- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss

## What Went Wrong

- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics

## Action Items

- (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- (@john) Document connection pool tuning runbook (Due: Jan 22)

## Lessons Learned

- Black Friday load patterns need dedicated testing
- Database metrics were missing from standard dashboards
- Auto-scaling should cover ALL resources, not just pods
```

Follow-up

- Review postmortem in team meeting within 1 week
- Track action items to completion (not optional!)
- Share learnings across teams
- Update runbooks and playbooks
- Celebrate successful incident response
---

Chaos Engineering
Principles

- Define Steady State — Normal system behavior (e.g., 99.9% success rate)
- Hypothesize — Predict the system will remain stable under failure
- Inject Failures — Simulate real-world events
- Disprove the Hypothesis — Look for deviations from steady state
- Learn and Improve — Fix weaknesses, increase resilience
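The steady-state check at the heart of this loop can be sketched as an abort condition for an experiment — a minimal illustration with assumed names and thresholds, not a chaos framework API:

```typescript
// Abort a chaos experiment when the success-rate SLI deviates from the
// hypothesized steady state (e.g., 99.9% success rate) beyond tolerance.
function steadyStateHolds(
  successRate: number,
  steadyState = 0.999,
  tolerance = 0.001,
): boolean {
  return successRate >= steadyState - tolerance;
}
```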
Failure Types

```yaml
Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage
Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits
Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog
Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts
```
Chaos Mesh Example

```yaml
# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  scheduler:
    cron: "@every 2h"  # Run every 2 hours
---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"
```
undefinedyaml
基础设施:
- Pod/节点终止
- 网络延迟/丢包
- DNS故障
- 云区域 outage
资源:
- CPU压力
- 内存耗尽
- 磁盘I/O饱和
- 文件描述符限制
依赖:
- 数据库连接故障
- API超时/错误
- 缓存不可用
- 消息队列积压
安全:
- DDoS模拟
- 证书过期
- 未授权访问尝试Best Practices
Chaos Mesh示例
- Start Small — Non-production first, then canary production
- Collect Baselines — Know normal metrics before experiments
- Define Success — Clear criteria for what "stable" means
- Monitor Everything — Watch metrics, logs, traces during tests
- Automate Rollback — Stop experiment if SLOs violated
- Game Days — Scheduled chaos exercises with full team
- Blameless Reviews — Treat chaos failures like production incidents
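"Automate Rollback" can be wired up as a small guard that polls the SLI and deletes the experiment when the SLO is violated. A sketch, assuming the `slo:api_errors:rate5m` recording rule from the alerting examples, an in-cluster Prometheus at a hypothetical URL, and `kubectl` access (all names and thresholds are illustrative):

```python
import json
import subprocess
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"   # assumption: in-cluster Prometheus endpoint
SLO_QUERY = "slo:api_errors:rate5m"   # recording rule from the alerting examples
THRESHOLD = 0.001                     # matches the ErrorRateHigh alert

def budget_exceeded(samples: list[dict], threshold: float = THRESHOLD) -> bool:
    """True when any returned sample is above the SLO threshold."""
    return any(float(s["value"][1]) > threshold for s in samples)

def fetch_sli_samples() -> list[dict]:
    """Query Prometheus for the current SLI value."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": SLO_QUERY})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

def guard_experiment(name: str, namespace: str = "production", interval_s: int = 10) -> None:
    """Poll the SLI; delete the chaos experiment as soon as the SLO is violated."""
    while True:
        if budget_exceeded(fetch_sli_samples()):
            subprocess.run(
                ["kubectl", "delete", "networkchaos", name, "-n", namespace],
                check=True,
            )
            return
        time.sleep(interval_s)
```

Run the guard alongside every experiment; a rollback that depends on a human watching a dashboard is not automated rollback.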
AIOps and AI in Observability

2025 Trends
- Anomaly Detection — AI spots unusual patterns in metrics/logs
- Root Cause Analysis — Correlate failures across services automatically
- Predictive Alerting — Predict failures before they happen
- Auto-Remediation — AI suggests or applies fixes autonomously
- Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
- AI Observability — Monitor AI model drift, hallucinations, token usage
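As a toy illustration of the anomaly-detection idea (real AIOps platforms learn seasonal baselines and use far richer models; the window size and threshold here are arbitrary assumptions):

```python
from statistics import mean, stdev

def zscore_anomalies(series: list[float], window: int = 30, threshold: float = 3.0) -> list[int]:
    """Indices where a point deviates more than `threshold` sigmas from its trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        base = series[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Even this naive detector flags a request-rate spike without a hand-tuned static threshold, which is the core advantage over `cpu_usage > 70%`-style alerts.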
AI-Driven Platforms (2025)

```yaml
Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis
Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant
Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting
New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations
```

Implementing AI Observability

Monitor AI model performance
```python
from opentelemetry import trace, metrics
import time
import openai

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create metrics for AI model
model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

client = openai.AsyncOpenAI()

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()
        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens
        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})
        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)
        return response
```

---
Grafana Dashboards

3-3-3 Rule

- 3 rows of panels per dashboard
- 3 panels per row
- 3 key metrics per panel

Avoid "dashboard sprawl" — each dashboard should answer ONE question.

Dashboard Categories

```yaml
RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)
USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count
Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation
SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)
```
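The burn-rate and error-budget panels on an SLO dashboard reduce to simple arithmetic; a minimal sketch (assumes a constant error rate over the SLO window):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    return error_rate / (1 - slo)

def budget_remaining(error_rate: float, slo: float, elapsed_fraction: float) -> float:
    """Fraction of the window's error budget left at the current burn rate."""
    return 1 - burn_rate(error_rate, slo) * elapsed_fraction
```

A burn rate of 1.0 means the budget is exhausted exactly at the end of the window; multi-window burn-rate alerts typically page around 14.4x (fast burn) and ticket around 6x (slow burn).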
Panel Best Practices
```json
{
  "title": "API Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": { "mode": "palette-classic" },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  }
}
```

---

Checklist
Metrics (Prometheus + Grafana)
- Layered architecture (app/cluster/global)
- Recording rules for expensive queries
- Resource limits and retention configured
- Dashboards follow 3-3-3 rule
- Alerts based on SLOs, not internal metrics
Tracing (OpenTelemetry)
- Auto-instrumentation enabled
- Custom spans for business operations
- Sampling strategy configured
- Trace context in logs (correlation)
- Backend connected (Tempo/Jaeger)
Logging (Loki/ELK)
- Structured JSON logging
- Low cardinality labels (<10)
- Trace IDs in logs
- Appropriate log levels
- Retention policy defined
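For the "Trace IDs in logs" item, one stdlib-only approach is a formatter that emits one JSON object per line and carries the trace id passed via `extra` (the field names are a convention, not a standard; OpenTelemetry's logging instrumentation can also inject the id automatically):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Structured JSON logs carrying the active trace id for correlation."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # logging merges `extra` kwargs into the record's attributes
            "trace_id": getattr(record, "trace_id", None),
        })

# Usage: logger.info("payment failed", extra={"trace_id": current_trace_id})
```

With the trace id in every log line, jumping from a slow trace to its logs (and back) becomes a single filter query.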
SLOs
- SLIs defined for key user journeys
- SLOs documented and tracked
- Error budget calculated
- Burn rate alerting configured
- Monthly SLO review process
Incident Response
- Severity levels defined
- On-call rotation scheduled
- Escalation policy documented
- Runbooks for common issues
- Postmortem template ready
Culture
- Blameless postmortem process
- Action items tracked to completion
- Incident learnings shared
- On-call compensation policy
- Regular chaos engineering exercises
---

See Also
- reference/monitoring.md — Prometheus and Grafana deep dive
- reference/logging.md — Structured logging best practices
- reference/tracing.md — OpenTelemetry and distributed tracing
- reference/incident-response.md — Incident management and postmortems
- templates/slo-template.md — SLO definition template