Monitoring and Observability

Prometheus and Grafana Setup

Prometheus

  • Core Concepts
    • Time-series database for metrics
    • Pull-based metrics collection
    • PromQL query language
    • Alerting rules and notifications
  • Best Practices
    • Use appropriate metric types (Counter, Gauge, Histogram, Summary)
    • Label metrics with relevant dimensions
    • Use metric naming conventions
    • Implement relabeling for metric filtering
    • Use federation for multi-cluster setups
  • Configuration
    • Configure scrape targets for services
    • Use service discovery for dynamic targets
    • Configure retention policies
    • Implement remote write for long-term storage
    • Use alert rules for proactive monitoring
  • PromQL Examples
    promql
    # CPU usage rate
    rate(process_cpu_seconds_total[5m])
    
    # Request error rate
    rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
    
    # P95 latency
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    
    # Memory usage
    process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
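The scrape-target, service-discovery, and relabeling practices above can be sketched in a minimal `prometheus.yml`. The job names, annotation convention, and port are illustrative assumptions, not taken from this document:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # Static scrape target for an example service (hypothetical host/port)
  - job_name: api
    static_configs:
      - targets: ["api.example.internal:9090"]

  # Dynamic targets via Kubernetes service discovery
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Relabeling for metric filtering: keep only pods annotated
      # with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy the pod's namespace into a queryable label dimension
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```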

Grafana

  • Core Concepts
    • Visualization and dashboard platform
    • Multiple data source support
    • Alerting and notifications
    • Plugin ecosystem
  • Best Practices
    • Use folder organization for dashboards
    • Use dashboard variables for interactivity
    • Implement dashboard versioning
    • Use annotations for event marking
    • Share dashboards via JSON export
  • Dashboard Design
    • Create role-specific dashboards (SRE, developer, business)
    • Use appropriate visualization types (graph, gauge, table, stat)
    • Implement drill-down capabilities
    • Use consistent color schemes
    • Include context and descriptions
  • Alerting
    • Configure alert rules in Grafana
    • Use notification channels (email, Slack, PagerDuty)
    • Implement alert grouping and routing
    • Use alert templates for clear messages
    • Configure alert silencing and downtime
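The folder-organization and datasource practices above map onto Grafana's file-based provisioning. A minimal sketch, assuming a Prometheus datasource at a hypothetical internal URL and dashboard JSON exports stored on disk:

```yaml
# provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumed internal address
    isDefault: true

# provisioning/dashboards/default.yaml (a separate file in practice)
apiVersion: 1
providers:
  - name: default
    folder: Services              # folder organization for dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards   # dashboard JSON exports live here
```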

CloudWatch (AWS) Monitoring

Core Concepts

  • Metrics: Time-series data points
  • Dashboards: Visualizations of metrics
  • Alarms: Threshold-based alerts
  • Logs: Log data collection and analysis
  • Events: Event-driven monitoring

Best Practices

  • Metric Collection
    • Use custom metrics for application-specific data
    • Use metric filters for log-based metrics
    • Use metric dimensions for filtering
    • Implement metric aggregation
    • Use metric streams for real-time processing
  • Dashboard Design
    • Create service-specific dashboards
    • Use widgets for different visualizations
    • Implement dashboard variables
    • Use cross-account dashboards
    • Share dashboards with teams
  • Alarm Configuration
    • Use appropriate alarm thresholds
    • Implement alarm actions (SNS, Auto Scaling, EC2 actions)
    • Use composite alarms for complex conditions
    • Configure alarm states and transitions
    • Use alarm tags for organization

CloudWatch Logs

  • Log Groups: Logical containers for logs
  • Log Streams: Sequences of log events
  • Metric Filters: Extract metrics from logs
  • Subscription Filters: Stream logs to other services
  • Insights: Query and analyze logs

CloudWatch Examples

json
// Alarm configuration
{
  "AlarmName": "HighCPUUsage",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold"
}

// Metric filter
{
  "filterPattern": "[timestamp, request_id, status_code, latency]",
  "metricTransformations": [
    {
      "metricName": "RequestLatency",
      "metricNamespace": "Application",
      "metricValue": "$latency"
    }
  ]
}
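The composite alarms mentioned under Alarm Configuration combine existing alarms into one condition. A hedged CloudFormation sketch, where `HighCPUUsage`, `HighErrorRate`, and the `AlertTopic` SNS topic are assumed to be defined elsewhere:

```yaml
Resources:
  ServiceDegradedAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: ServiceDegraded
      # Fire only when both child alarms are in ALARM state
      AlarmRule: ALARM("HighCPUUsage") AND ALARM("HighErrorRate")
      AlarmActions:
        - !Ref AlertTopic   # assumed SNS topic defined elsewhere in the stack
```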

Azure Monitor

Core Concepts

  • Metrics: Time-series data
  • Logs: Log data collection and analysis
  • Alerts: Threshold-based alerts
  • Dashboards: Visualizations
  • Application Insights: Application monitoring

Best Practices

  • Metric Collection
    • Use custom metrics for application data
    • Use metric dimensions for filtering
    • Implement metric aggregation
    • Use metric alerts for proactive monitoring
    • Configure metric collection rules
  • Log Analytics
    • Use Kusto Query Language (KQL) for log queries
    • Create custom log tables
    • Implement log collection rules
    • Use log alerts for monitoring
    • Configure log retention policies
  • Application Insights
    • Enable distributed tracing
    • Use custom telemetry
    • Implement dependency tracking
    • Configure performance counters
    • Use smart detection for anomalies

Azure Monitor Examples

kql
// KQL query for error rate
requests
| where timestamp > ago(1h)
| summarize total = count(), failures = countif(success == false)
| project error_rate = 100.0 * failures / total

// Query for slow requests
requests
| where timestamp > ago(1h)
| where duration > 1000
| summarize count() by name
| top 10 by count_

// Query for exceptions
exceptions
| where timestamp > ago(1h)
| summarize count() by type, problemId
| top 10 by count_

Stackdriver (GCP) Monitoring

Core Concepts

  • Metrics: Time-series data
  • Dashboards: Visualizations
  • Alerting: Threshold-based alerts
  • Logging: Log data collection
  • Tracing: Distributed tracing

Best Practices

  • Metric Collection
    • Use custom metrics for application data
    • Use metric labels for filtering
    • Implement metric aggregation
    • Use metric-based alerting policies
    • Configure metric descriptors
  • Dashboard Design
    • Create service-specific dashboards
    • Use dashboard variables
    • Implement dashboard sharing
    • Use dashboard templates
    • Configure dashboard refresh intervals
  • Logging
    • Use log sinks for log routing
    • Implement log-based metrics
    • Configure log exclusions
    • Use log alerts for monitoring
    • Configure log retention

Stackdriver Examples

Alerting policy

yaml
displayName: "High Error Rate"
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      filter: 'metric.type="custom.googleapis.com/error_rate"'
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE

Logging and Log Aggregation

ELK Stack (Elasticsearch, Logstash, Kibana)

  • Elasticsearch: Search and analytics engine
  • Logstash: Data processing pipeline
  • Kibana: Visualization platform
  • Beats: Data shippers

Best Practices

  • Log Collection
    • Use centralized logging
    • Implement log shippers (Filebeat, Fluentd, Logstash)
    • Use log parsing and normalization
    • Configure log retention policies
    • Implement log archiving
  • Log Analysis
    • Use index patterns for organization
    • Implement log queries and filters
    • Use saved searches for common queries
    • Create visualizations for log data
    • Use dashboards for log monitoring
  • Log Security
    • Implement log encryption at rest
    • Use secure log transmission (TLS)
    • Implement log access controls
    • Configure log audit trails
    • Use log redaction for sensitive data
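The log-shipper and secure-transmission items above can be sketched with a minimal Filebeat configuration. The log path and Logstash host are illustrative assumptions:

```yaml
# filebeat.yml — ship application logs into the ELK pipeline
filebeat.inputs:
  - type: filestream
    id: app-logs
    paths:
      - /var/log/app/*.log     # illustrative path

# Send to Logstash for parsing and normalization before Elasticsearch
output.logstash:
  hosts: ["logstash:5044"]     # assumed internal address
  ssl.enabled: true            # secure log transmission (TLS)
```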

Loki

  • Core Concepts
    • Lightweight log aggregation system
    • Label-based indexing
    • Grafana integration
    • PromQL-like query language (LogQL)
  • Best Practices
    • Use appropriate log labels
    • Implement log retention policies
    • Use log streams for organization
    • Configure log scraping
    • Implement log alerting
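Loki's label-based indexing and log scraping are usually configured through Promtail. A minimal sketch, with illustrative label values and log path; note the labels are kept few and low-cardinality, since every label combination creates a separate stream:

```yaml
# promtail config — scrape local logs and push them to Loki
clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki address

scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app               # low-cardinality labels only
          env: production
          __path__: /var/log/app/*.log
```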

Alerting Strategies and Incident Response

Alerting Best Practices

  • Alert Design
    • Use meaningful alert names and descriptions
    • Include relevant context in alerts
    • Use appropriate severity levels
    • Configure alert thresholds carefully
    • Implement alert deduplication
  • Alert Routing
    • Route alerts to appropriate teams
    • Use escalation policies
    • Configure on-call rotations
    • Implement alert grouping
    • Use notification channels (email, Slack, PagerDuty)
  • Alert Quality
    • Reduce alert noise with proper filtering
    • Implement alert suppression
    • Use alert correlation
    • Configure alert cooldown periods
    • Implement alert auto-resolution
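The grouping, routing, escalation, and cooldown practices above can be sketched in an Alertmanager configuration. Receiver names, the email address, and the PagerDuty key are placeholders:

```yaml
# alertmanager.yml — grouping, routing, and notification channels
route:
  group_by: [alertname, service]   # alert grouping / deduplication
  group_wait: 30s                  # wait before sending a new group
  group_interval: 5m
  repeat_interval: 4h              # cooldown between re-notifications
  receiver: default-email
  routes:
    # Route critical alerts to the on-call channel (escalation)
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall

receivers:
  - name: default-email
    email_configs:
      - to: team@example.com       # placeholder address
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <pagerduty-integration-key>   # placeholder
```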

Incident Response

  • Incident Lifecycle
    • Detection: Identify incident
    • Triage: Assess severity and impact
    • Response: Mitigate incident
    • Resolution: Restore service
    • Post-Mortem: Learn and improve
  • Best Practices
    • Use incident severity levels
    • Implement incident communication
    • Use runbooks for common incidents
    • Conduct post-mortems
    • Implement follow-up actions
  • Runbooks
    • Document common incident scenarios
    • Include step-by-step procedures
    • Include relevant commands and tools
    • Update runbooks based on incidents
    • Share runbooks with teams
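A runbook following the practices above might be kept as a small structured file next to the service. This skeleton is purely hypothetical; every field name, command, and URL is an illustrative assumption:

```yaml
# Hypothetical runbook skeleton — field names and commands are illustrative
runbook: high-error-rate
severity: SEV2
detection:
  alert: HighErrorRate
  dashboard: https://grafana.example.com/d/api-overview   # placeholder URL
steps:
  - "Check recent deployments: kubectl rollout history deploy/api"
  - "Inspect error logs: kubectl logs deploy/api --since=15m | grep ERROR"
  - "If errors began after a release, roll back: kubectl rollout undo deploy/api"
escalation:
  after: 15m
  contact: oncall-sre
post_incident:
  - Schedule a post-mortem within 48 hours
  - Update this runbook with findings
```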

SLO/SLI Definitions and Tracking

SLI (Service Level Indicator)

  • Definition: Quantitative measure of service performance
  • Common SLIs:
    • Availability: Percentage of time service is operational
    • Latency: Response time for requests
    • Error Rate: Percentage of failed requests
    • Throughput: Requests per second
    • Saturation: Resource utilization
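An availability SLI like the one above can be precomputed with a Prometheus recording rule. The metric names follow the earlier PromQL examples; the recording-rule name itself is an assumption:

```yaml
groups:
  - name: sli-rules
    rules:
      # Availability SLI: share of non-5xx requests over 5 minutes
      - record: sli:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )
```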

SLO (Service Level Objective)

  • Definition: Target value for SLI
  • Best Practices:
    • Set realistic SLOs based on business requirements
    • Use SLOs to drive reliability improvements
    • Monitor SLOs continuously
    • Alert on SLO breaches
    • Use error budgets for balancing reliability and innovation

Error Budget

  • Definition: Allowable amount of unreliability
  • Calculation: Error Budget = 100% - SLO
  • Best Practices:
    • Use error budget to guide release decisions
    • Freeze deployments when error budget is exhausted
    • Implement error budget alerts
    • Track error budget consumption
    • Use error budget for reliability planning

SLO/SLI Examples

SLO configuration

yaml
slo_name: "API Availability"
sli_name: "api_availability"
slo_target: 0.999
slo_window: 30d
alert_threshold: 0.998

SLI calculation

api_availability = 1 - (error_count / total_count)

Error budget

error_budget = 1 - slo_target
error_budget_remaining = current_availability - slo_target

Monitoring SLOs

  • Tools:
    • Prometheus and Grafana
    • CloudWatch SLOs
    • Azure Monitor SLOs
    • Stackdriver SLOs
    • SRE-specific tools (Sloth, sli-exporter)
  • Best Practices:
    • Visualize SLOs in dashboards
    • Alert on SLO breaches
    • Track SLO trends over time
    • Compare SLOs across services
    • Use SLOs for capacity planning
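Alerting on SLO breaches is typically done via error-budget burn rate. A simplified single-window sketch for a 99.9% SLO (error budget 0.001), assuming the `sli:availability:ratio_rate5m` recording rule exists; the 14.4x multiplier follows the common fast-burn convention from SRE practice:

```yaml
groups:
  - name: slo-alerts
    rules:
      # Fire when the error rate would exhaust the 30-day budget
      # roughly 14.4x faster than allowed
      - alert: ErrorBudgetFastBurn
        expr: (1 - sli:availability:ratio_rate5m) > 14.4 * 0.001
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning >14.4x for the 99.9% availability SLO"
```

Production setups usually pair a fast window (e.g. 5m) with a slower confirmation window to reduce noise.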

Observability Best Practices

The Three Pillars of Observability

  • Metrics: Quantitative data points
  • Logs: Discrete events
  • Traces: Request paths through distributed systems

Distributed Tracing

  • Core Concepts
    • Trace: End-to-end request path
    • Span: Individual operation
    • Trace ID: Unique identifier for trace
    • Span ID: Unique identifier for span
  • Best Practices
    • Use distributed tracing for microservices
    • Implement trace sampling
    • Use trace context propagation
    • Configure trace retention
    • Analyze traces for performance issues
  • Tools
    • Jaeger
    • Zipkin
    • AWS X-Ray
    • Azure Application Insights
    • Google Cloud Trace
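Trace collection, sampling, and export to a backend like Jaeger are often wired up through an OpenTelemetry Collector (not named in this section, so treat this as one possible setup). A sketch with an assumed Jaeger OTLP endpoint and an illustrative 10% sampling rate:

```yaml
# OpenTelemetry Collector sketch: receive OTLP traces, sample 10%,
# and export to Jaeger (which accepts OTLP on port 4317)
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  probabilistic_sampler:
    sampling_percentage: 10   # trace sampling, as recommended above

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317     # assumed backend address
    tls:
      insecure: true          # demo only; use TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp/jaeger]
```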

Observability Patterns

  • RED Method: Rate, Errors, Duration
  • USE Method: Utilization, Saturation, Errors
  • Four Golden Signals: Latency, Traffic, Errors, Saturation

Observability Maturity Model

  • Level 1: Basic metrics and logging
  • Level 2: Structured logging and metrics
  • Level 3: Distributed tracing
  • Level 4: Automated alerting and incident response
  • Level 5: SLO-driven development and error budgets