Monitoring and Observability

Prometheus and Grafana Setup

Prometheus

  • Core Concepts
    • Time-series database for metrics
    • Pull-based metrics collection
    • PromQL query language
    • Alerting rules and notifications
  • Best Practices
    • Use appropriate metric types (Counter, Gauge, Histogram, Summary)
    • Label metrics with relevant dimensions
    • Use metric naming conventions
    • Implement relabeling for metric filtering
    • Use federation for multi-cluster setups
  • Configuration
    • Configure scrape targets for services
    • Use service discovery for dynamic targets
    • Configure retention policies
    • Implement remote write for long-term storage
    • Use alert rules for proactive monitoring
  • PromQL Examples
    promql
    # CPU usage rate
    rate(process_cpu_seconds_total[5m])
    
    # Request error rate
    rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
    
    # P95 latency
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    
    # Memory usage
    process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
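The scrape-target, service-discovery, and relabeling practices above can be sketched in a minimal `prometheus.yml`. The job names, annotation convention, and port are illustrative assumptions, not taken from this document:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # Static scrape target for an example service (hypothetical host/port)
  - job_name: api
    static_configs:
      - targets: ["api.example.internal:9090"]

  # Dynamic targets via Kubernetes service discovery
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Relabeling for metric filtering: keep only pods annotated
      # with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy the pod's namespace into a queryable label dimension
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```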

Grafana

  • Core Concepts
    • Visualization and dashboard platform
    • Multiple data source support
    • Alerting and notifications
    • Plugin ecosystem
  • Best Practices
    • Use folder organization for dashboards
    • Use dashboard variables for interactivity
    • Implement dashboard versioning
    • Use annotations for event marking
    • Share dashboards via JSON export
  • Dashboard Design
    • Create role-specific dashboards (SRE, developer, business)
    • Use appropriate visualization types (graph, gauge, table, stat)
    • Implement drill-down capabilities
    • Use consistent color schemes
    • Include context and descriptions
  • Alerting
    • Configure alert rules in Grafana
    • Use notification channels (email, Slack, PagerDuty)
    • Implement alert grouping and routing
    • Use alert templates for clear messages
    • Configure alert silencing and downtime
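The folder-organization and datasource practices above map onto Grafana's file-based provisioning. A minimal sketch, assuming a Prometheus datasource at a hypothetical internal URL and dashboard JSON exports stored on disk:

```yaml
# provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumed internal address
    isDefault: true

# provisioning/dashboards/default.yaml (a separate file in practice)
apiVersion: 1
providers:
  - name: default
    folder: Services              # folder organization for dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards   # dashboard JSON exports live here
```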

CloudWatch (AWS) Monitoring

Core Concepts

  • Metrics: Time-series data points
  • Dashboards: Visualizations of metrics
  • Alarms: Threshold-based alerts
  • Logs: Log data collection and analysis
  • Events: Event-driven monitoring

Best Practices

  • Metric Collection
    • Use custom metrics for application-specific data
    • Use metric filters for log-based metrics
    • Use metric dimensions for filtering
    • Implement metric aggregation
    • Use metric streams for real-time processing
  • Dashboard Design
    • Create service-specific dashboards
    • Use widgets for different visualizations
    • Implement dashboard variables
    • Use cross-account dashboards
    • Share dashboards with teams
  • Alarm Configuration
    • Use appropriate alarm thresholds
    • Implement alarm actions (SNS, Auto Scaling, EC2 actions)
    • Use composite alarms for complex conditions
    • Configure alarm states and transitions
    • Use alarm tags for organization

CloudWatch Logs

  • Log Groups: Logical containers for logs
  • Log Streams: Sequences of log events
  • Metric Filters: Extract metrics from logs
  • Subscription Filters: Stream logs to other services
  • Insights: Query and analyze logs

CloudWatch Examples

json
// Alarm configuration
{
  "AlarmName": "HighCPUUsage",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold"
}

// Metric filter
{
  "filterPattern": "[timestamp, request_id, status_code, latency]",
  "metricTransformations": [
    {
      "metricName": "RequestLatency",
      "metricNamespace": "Application",
      "metricValue": "$latency"
    }
  ]
}
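The composite alarms mentioned under Alarm Configuration combine existing alarms into one condition. A hedged CloudFormation sketch, where `HighCPUUsage`, `HighErrorRate`, and the `AlertTopic` SNS topic are assumed to be defined elsewhere:

```yaml
Resources:
  ServiceDegradedAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: ServiceDegraded
      # Fire only when both child alarms are in ALARM state
      AlarmRule: ALARM("HighCPUUsage") AND ALARM("HighErrorRate")
      AlarmActions:
        - !Ref AlertTopic   # assumed SNS topic defined elsewhere in the stack
```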

Azure Monitor

Core Concepts

  • Metrics: Time-series data
  • Logs: Log data collection and analysis
  • Alerts: Threshold-based alerts
  • Dashboards: Visualizations
  • Application Insights: Application monitoring

Best Practices

  • Metric Collection
    • Use custom metrics for application data
    • Use metric dimensions for filtering
    • Implement metric aggregation
    • Use metric alerts for proactive monitoring
    • Configure metric collection rules
  • Log Analytics
    • Use Kusto Query Language (KQL) for log queries
    • Create custom log tables
    • Implement log collection rules
    • Use log alerts for monitoring
    • Configure log retention policies
  • Application Insights
    • Enable distributed tracing
    • Use custom telemetry
    • Implement dependency tracking
    • Configure performance counters
    • Use smart detection for anomalies

Azure Monitor Examples

kql
// KQL query for error rate
requests
| where timestamp > ago(1h)
| summarize total = count(), failures = countif(success == false)
| project error_rate = 100.0 * failures / total

// Query for slow requests
requests
| where timestamp > ago(1h)
| where duration > 1000
| summarize count() by name
| top 10 by count_

// Query for exceptions
exceptions
| where timestamp > ago(1h)
| summarize count() by type, problemId
| top 10 by count_

Stackdriver (GCP) Monitoring

Core Concepts

  • Metrics: Time-series data
  • Dashboards: Visualizations
  • Alerting: Threshold-based alerts
  • Logging: Log data collection
  • Tracing: Distributed tracing

Best Practices

  • Metric Collection
    • Use custom metrics for application data
    • Use metric labels for filtering
    • Implement metric aggregation
    • Use metric-based alerting policies
    • Configure metric descriptors
  • Dashboard Design
    • Create service-specific dashboards
    • Use dashboard variables
    • Implement dashboard sharing
    • Use dashboard templates
    • Configure dashboard refresh intervals
  • Logging
    • Use log sinks for log routing
    • Implement log-based metrics
    • Configure log exclusions
    • Use log alerts for monitoring
    • Configure log retention

Stackdriver Examples

Alerting policy

yaml
displayName: "High Error Rate"
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      filter: 'metric.type="custom.googleapis.com/error_rate"'
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE

Logging and Log Aggregation

ELK Stack (Elasticsearch, Logstash, Kibana)

  • Elasticsearch: Search and analytics engine
  • Logstash: Data processing pipeline
  • Kibana: Visualization platform
  • Beats: Data shippers

Best Practices

  • Log Collection
    • Use centralized logging
    • Implement log shippers (Filebeat, Fluentd, Logstash)
    • Use log parsing and normalization
    • Configure log retention policies
    • Implement log archiving
  • Log Analysis
    • Use index patterns for organization
    • Implement log queries and filters
    • Use saved searches for common queries
    • Create visualizations for log data
    • Use dashboards for log monitoring
  • Log Security
    • Implement log encryption at rest
    • Use secure log transmission (TLS)
    • Implement log access controls
    • Configure log audit trails
    • Use log redaction for sensitive data
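The log-shipper and secure-transmission items above can be sketched with a minimal Filebeat configuration. The log path and Logstash host are illustrative assumptions:

```yaml
# filebeat.yml — ship application logs into the ELK pipeline
filebeat.inputs:
  - type: filestream
    id: app-logs
    paths:
      - /var/log/app/*.log     # illustrative path

# Send to Logstash for parsing and normalization before Elasticsearch
output.logstash:
  hosts: ["logstash:5044"]     # assumed internal address
  ssl.enabled: true            # secure log transmission (TLS)
```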

Loki

  • Core Concepts
    • Lightweight log aggregation system
    • Label-based indexing
    • Grafana integration
    • PromQL-like query language (LogQL)
  • Best Practices
    • Use appropriate log labels
    • Implement log retention policies
    • Use log streams for organization
    • Configure log scraping
    • Implement log alerting
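Loki's label-based indexing and log scraping are usually configured through Promtail. A minimal sketch, with illustrative label values and log path; note the labels are kept few and low-cardinality, since every label combination creates a separate stream:

```yaml
# promtail config — scrape local logs and push them to Loki
clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki address

scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app               # low-cardinality labels only
          env: production
          __path__: /var/log/app/*.log
```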

Alerting Strategies and Incident Response

Alerting Best Practices

  • Alert Design
    • Use meaningful alert names and descriptions
    • Include relevant context in alerts
    • Use appropriate severity levels
    • Configure alert thresholds carefully
    • Implement alert deduplication
  • Alert Routing
    • Route alerts to appropriate teams
    • Use escalation policies
    • Configure on-call rotations
    • Implement alert grouping
    • Use notification channels (email, Slack, PagerDuty)
  • Alert Quality
    • Reduce alert noise with proper filtering
    • Implement alert suppression
    • Use alert correlation
    • Configure alert cooldown periods
    • Implement alert auto-resolution
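The grouping, routing, escalation, and cooldown practices above can be sketched in an Alertmanager configuration. Receiver names, the email address, and the PagerDuty key are placeholders:

```yaml
# alertmanager.yml — grouping, routing, and notification channels
route:
  group_by: [alertname, service]   # alert grouping / deduplication
  group_wait: 30s                  # wait before sending a new group
  group_interval: 5m
  repeat_interval: 4h              # cooldown between re-notifications
  receiver: default-email
  routes:
    # Route critical alerts to the on-call channel (escalation)
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall

receivers:
  - name: default-email
    email_configs:
      - to: team@example.com       # placeholder address
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <pagerduty-integration-key>   # placeholder
```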

Incident Response

  • Incident Lifecycle
    • Detection: Identify incident
    • Triage: Assess severity and impact
    • Response: Mitigate incident
    • Resolution: Restore service
    • Post-Mortem: Learn and improve
  • Best Practices
    • Use incident severity levels
    • Implement incident communication
    • Use runbooks for common incidents
    • Conduct post-mortems
    • Implement follow-up actions
  • Runbooks
    • Document common incident scenarios
    • Include step-by-step procedures
    • Include relevant commands and tools
    • Update runbooks based on incidents
    • Share runbooks with teams
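A runbook following the practices above might be kept as a small structured file next to the service. This skeleton is purely hypothetical; every field name, command, and URL is an illustrative assumption:

```yaml
# Hypothetical runbook skeleton — field names and commands are illustrative
runbook: high-error-rate
severity: SEV2
detection:
  alert: HighErrorRate
  dashboard: https://grafana.example.com/d/api-overview   # placeholder URL
steps:
  - "Check recent deployments: kubectl rollout history deploy/api"
  - "Inspect error logs: kubectl logs deploy/api --since=15m | grep ERROR"
  - "If errors began after a release, roll back: kubectl rollout undo deploy/api"
escalation:
  after: 15m
  contact: oncall-sre
post_incident:
  - Schedule a post-mortem within 48 hours
  - Update this runbook with findings
```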

SLO/SLI Definitions and Tracking

SLI (Service Level Indicator)

  • Definition: Quantitative measure of service performance
  • Common SLIs:
    • Availability: Percentage of time service is operational
    • Latency: Response time for requests
    • Error Rate: Percentage of failed requests
    • Throughput: Requests per second
    • Saturation: Resource utilization
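An availability SLI like the one above can be precomputed with a Prometheus recording rule. The metric names follow the earlier PromQL examples; the recording-rule name itself is an assumption:

```yaml
groups:
  - name: sli-rules
    rules:
      # Availability SLI: share of non-5xx requests over 5 minutes
      - record: sli:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )
```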

SLO (Service Level Objective)

  • Definition: Target value for SLI
  • Best Practices:
    • Set realistic SLOs based on business requirements
    • Use SLOs to drive reliability improvements
    • Monitor SLOs continuously
    • Alert on SLO breaches
    • Use error budgets for balancing reliability and innovation

Error Budget

  • Definition: Allowable amount of unreliability
  • Calculation: Error Budget = 100% - SLO
  • Best Practices:
    • Use error budget to guide release decisions
    • Freeze deployments when error budget is exhausted
    • Implement error budget alerts
    • Track error budget consumption
    • Use error budget for reliability planning

SLO/SLI Examples

SLO configuration

yaml
slo_name: "API Availability"
sli_name: "api_availability"
slo_target: 0.999
slo_window: 30d
alert_threshold: 0.998

SLI calculation

api_availability = 1 - (error_count / total_count)

Error budget

error_budget = 1 - slo_target
error_budget_remaining = current_availability - slo_target

Monitoring SLOs

  • Tools:
    • Prometheus and Grafana
    • CloudWatch SLOs
    • Azure Monitor SLOs
    • Stackdriver SLOs
    • SRE-specific tools (Sloth, sli-exporter)
  • Best Practices:
    • Visualize SLOs in dashboards
    • Alert on SLO breaches
    • Track SLO trends over time
    • Compare SLOs across services
    • Use SLOs for capacity planning
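Alerting on SLO breaches is typically done via error-budget burn rate. A simplified single-window sketch for a 99.9% SLO (error budget 0.001), assuming the `sli:availability:ratio_rate5m` recording rule exists; the 14.4x multiplier follows the common fast-burn convention from SRE practice:

```yaml
groups:
  - name: slo-alerts
    rules:
      # Fire when the error rate would exhaust the 30-day budget
      # roughly 14.4x faster than allowed
      - alert: ErrorBudgetFastBurn
        expr: (1 - sli:availability:ratio_rate5m) > 14.4 * 0.001
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning >14.4x for the 99.9% availability SLO"
```

Production setups usually pair a fast window (e.g. 5m) with a slower confirmation window to reduce noise.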

Observability Best Practices

The Three Pillars of Observability

  • Metrics: Quantitative data points
  • Logs: Discrete events
  • Traces: Request paths through distributed systems

Distributed Tracing

  • Core Concepts
    • Trace: End-to-end request path
    • Span: Individual operation
    • Trace ID: Unique identifier for trace
    • Span ID: Unique identifier for span
  • Best Practices
    • Use distributed tracing for microservices
    • Implement trace sampling
    • Use trace context propagation
    • Configure trace retention
    • Analyze traces for performance issues
  • Tools
    • Jaeger
    • Zipkin
    • AWS X-Ray
    • Azure Application Insights
    • Google Cloud Trace
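Trace collection, sampling, and export to a backend like Jaeger are often wired up through an OpenTelemetry Collector (not named in this section, so treat this as one possible setup). A sketch with an assumed Jaeger OTLP endpoint and an illustrative 10% sampling rate:

```yaml
# OpenTelemetry Collector sketch: receive OTLP traces, sample 10%,
# and export to Jaeger (which accepts OTLP on port 4317)
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  probabilistic_sampler:
    sampling_percentage: 10   # trace sampling, as recommended above

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317     # assumed backend address
    tls:
      insecure: true          # demo only; use TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp/jaeger]
```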

Observability Patterns

  • RED Method: Rate, Errors, Duration
  • USE Method: Utilization, Saturation, Errors
  • Four Golden Signals: Latency, Traffic, Errors, Saturation

Observability Maturity Model

  • Level 1: Basic metrics and logging
  • Level 2: Structured logging and metrics
  • Level 3: Distributed tracing
  • Level 4: Automated alerting and incident response
  • Level 5: SLO-driven development and error budgets