Monitoring and Observability
Prometheus and Grafana Setup
Prometheus
Core Concepts
- Time-series database for metrics
- Pull-based metrics collection
- PromQL query language
- Alerting rules and notifications
Best Practices
- Use appropriate metric types (Counter, Gauge, Histogram, Summary)
- Label metrics with relevant dimensions
- Use metric naming conventions
- Implement relabeling for metric filtering
- Use federation for multi-cluster setups
Configuration
- Configure scrape targets for services
- Use service discovery for dynamic targets
- Configure retention policies
- Implement remote write for long-term storage
- Use alert rules for proactive monitoring
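The configuration points above can be sketched as a minimal `prometheus.yml`; the job names, targets, and annotation convention are illustrative assumptions, not a definitive setup:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  # Static scrape target for an example service
  - job_name: "api-service"        # hypothetical job name
    static_configs:
      - targets: ["api:8080"]      # hypothetical host:port

  # Kubernetes service discovery for dynamic targets
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via annotation (common convention)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```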
PromQL Examples

```promql
# CPU usage rate
rate(process_cpu_seconds_total[5m])

# Request error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Memory usage
process_resident_memory_bytes / node_memory_MemTotal_bytes * 100
```
Grafana
Core Concepts
- Visualization and dashboard platform
- Multiple data source support
- Alerting and notifications
- Plugin ecosystem
Best Practices
- Use folder organization for dashboards
- Use dashboard variables for interactivity
- Implement dashboard versioning
- Use annotations for event marking
- Share dashboards via JSON export
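The variable and JSON-export points above can be illustrated with a trimmed dashboard JSON; the field values here are hypothetical, and a real export carries many more fields:

```json
{
  "title": "Service Overview",
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up, instance)"
      }
    ]
  },
  "annotations": {
    "list": [
      { "name": "Deployments", "datasource": "Prometheus", "enable": true }
    ]
  }
}
```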
Dashboard Design
- Create role-specific dashboards (SRE, developer, business)
- Use appropriate visualization types (graph, gauge, table, stat)
- Implement drill-down capabilities
- Use consistent color schemes
- Include context and descriptions
Alerting
- Configure alert rules in Grafana
- Use notification channels (email, Slack, PagerDuty)
- Implement alert grouping and routing
- Use alert templates for clear messages
- Configure alert silencing and downtime
CloudWatch (AWS) Monitoring
Core Concepts
- Metrics: Time-series data points
- Dashboards: Visualizations of metrics
- Alarms: Threshold-based alerts
- Logs: Log data collection and analysis
- Events: Event-driven monitoring
Best Practices
Metric Collection
- Use custom metrics for application-specific data
- Use metric filters for log-based metrics
- Use metric dimensions for filtering
- Implement metric aggregation
- Use metric streams for real-time processing
Dashboard Design
- Create service-specific dashboards
- Use widgets for different visualizations
- Implement dashboard variables
- Use cross-account dashboards
- Share dashboards with teams
Alarm Configuration
- Use appropriate alarm thresholds
- Implement alarm actions (SNS, Auto Scaling, EC2 actions)
- Use composite alarms for complex conditions
- Configure alarm states and transitions
- Use alarm tags for organization
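As a sketch of the composite-alarm point above, a composite alarm combines existing alarms with a boolean rule; the alarm names and SNS topic ARN here are hypothetical:

```json
{
  "AlarmName": "ServiceDegraded",
  "AlarmRule": "ALARM(HighCPUUsage) AND ALARM(HighErrorRate)",
  "ActionsEnabled": true,
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-topic"]
}
```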
CloudWatch Logs
- Log Groups: Logical containers for logs
- Log Streams: Sequences of log events
- Metric Filters: Extract metrics from logs
- Subscription Filters: Stream logs to other services
- Insights: Query and analyze logs
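A typical CloudWatch Logs Insights query, counting error lines from a hypothetical application log group in 5-minute buckets:

```
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)
```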
CloudWatch Examples
```json
// Alarm configuration
{
  "AlarmName": "HighCPUUsage",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold"
}

// Metric filter
{
  "filterPattern": "[timestamp, request_id, status_code, latency]",
  "metricTransformations": [
    {
      "metricName": "RequestLatency",
      "metricNamespace": "Application",
      "metricValue": "$latency"
    }
  ]
}
```
Azure Monitor
Core Concepts
- Metrics: Time-series data
- Logs: Log data collection and analysis
- Alerts: Threshold-based alerts
- Dashboards: Visualizations
- Application Insights: Application monitoring
Best Practices
Metric Collection
- Use custom metrics for application data
- Use metric dimensions for filtering
- Implement metric aggregation
- Use metric alerts for proactive monitoring
- Configure metric collection rules
Log Analytics
- Use Kusto Query Language (KQL) for log queries
- Create custom log tables
- Implement log collection rules
- Use log alerts for monitoring
- Configure log retention policies
Application Insights
- Enable distributed tracing
- Use custom telemetry
- Implement dependency tracking
- Configure performance counters
- Use smart detection for anomalies
Azure Monitor Examples
```kql
// KQL query for error rate
requests
| where timestamp > ago(1h)
| summarize total = count(), failed = countif(success == false)
| project error_rate = 100.0 * failed / total

// Query for slow requests
requests
| where timestamp > ago(1h)
| where duration > 1000
| summarize count() by name
| top 10 by count_

// Query for exceptions
exceptions
| where timestamp > ago(1h)
| summarize count() by type, problemId
| top 10 by count_
```
Stackdriver (GCP) Monitoring
Core Concepts
- Metrics: Time-series data
- Dashboards: Visualizations
- Alerting: Threshold-based alerts
- Logging: Log data collection
- Tracing: Distributed tracing
Best Practices
Metric Collection
- Use custom metrics for application data
- Use metric labels for filtering
- Implement metric aggregation
- Use metric-based alerting policies
- Configure metric descriptors
Dashboard Design
- Create service-specific dashboards
- Use dashboard variables
- Implement dashboard sharing
- Use dashboard templates
- Configure dashboard refresh intervals
Logging
- Use log sinks for log routing
- Implement log-based metrics
- Configure log exclusions
- Use log alerts for monitoring
- Configure log retention
Stackdriver Examples
Alerting policy

```yaml
displayName: "High Error Rate"
conditions:
  - displayName: "Error rate > 5%"
    conditionThreshold:
      filter: 'metric.type="custom.googleapis.com/error_rate"'
      comparison: COMPARISON_GT
      thresholdValue: 0.05
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_RATE
```
Logging and Log Aggregation
ELK Stack (Elasticsearch, Logstash, Kibana)
- Elasticsearch: Search and analytics engine
- Logstash: Data processing pipeline
- Kibana: Visualization platform
- Beats: Data shippers
Best Practices
Log Collection
- Use centralized logging
- Implement log shippers (Filebeat, Fluentd, Logstash)
- Use log parsing and normalization
- Configure log retention policies
- Implement log archiving
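As a sketch of the log-shipper point above, a minimal Filebeat configuration forwarding application logs to Logstash; the input id, paths, and host are assumptions:

```yaml
filebeat.inputs:
  - type: filestream
    id: app-logs                # hypothetical input id
    paths:
      - /var/log/app/*.log      # hypothetical log path

output.logstash:
  hosts: ["logstash:5044"]      # default Beats port on a hypothetical host
```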
Log Analysis
- Use index patterns for organization
- Implement log queries and filters
- Use saved searches for common queries
- Create visualizations for log data
- Use dashboards for log monitoring
Log Security
- Implement log encryption at rest
- Use secure log transmission (TLS)
- Implement log access controls
- Configure log audit trails
- Use log redaction for sensitive data
Loki
Core Concepts
- Lightweight log aggregation system
- Label-based indexing
- Grafana integration
- PromQL-like query language (LogQL)
Best Practices
- Use appropriate log labels
- Implement log retention policies
- Use log streams for organization
- Configure log scraping
- Implement log alerting
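Loki's label-based model and LogQL can be illustrated with a few queries; the app label and values are assumptions:

```logql
# All error lines for one application
{app="api"} |= "error"

# Error rate over 5 minutes
rate({app="api"} |= "error" [5m])

# Top 5 noisiest pods by log volume
topk(5, sum by (pod) (rate({app="api"}[5m])))
```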
Alerting Strategies and Incident Response
Alerting Best Practices
Alert Design
- Use meaningful alert names and descriptions
- Include relevant context in alerts
- Use appropriate severity levels
- Configure alert thresholds carefully
- Implement alert deduplication
Alert Routing
- Route alerts to appropriate teams
- Use escalation policies
- Configure on-call rotations
- Implement alert grouping
- Use notification channels (email, Slack, PagerDuty)
Alert Quality
- Reduce alert noise with proper filtering
- Implement alert suppression
- Use alert correlation
- Configure alert cooldown periods
- Implement alert auto-resolution
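The grouping, routing, and repeat-interval practices above map directly onto a Prometheus Alertmanager configuration; a minimal sketch with hypothetical receiver names:

```yaml
route:
  receiver: team-email            # hypothetical default receiver
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty         # hypothetical escalation receiver

receivers:
  - name: team-email
  - name: pagerduty
```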
Incident Response
Incident Lifecycle
- Detection: Identify incident
- Triage: Assess severity and impact
- Response: Mitigate incident
- Resolution: Restore service
- Post-Mortem: Learn and improve
Best Practices
- Use incident severity levels
- Implement incident communication
- Use runbooks for common incidents
- Conduct post-mortems
- Implement follow-up actions
Runbooks
- Document common incident scenarios
- Include step-by-step procedures
- Include relevant commands and tools
- Update runbooks based on incidents
- Share runbooks with teams
SLO/SLI Definitions and Tracking
SLI (Service Level Indicator)
- Definition: Quantitative measure of service performance
- Common SLIs:
- Availability: Percentage of time service is operational
- Latency: Response time for requests
- Error Rate: Percentage of failed requests
- Throughput: Requests per second
- Saturation: Resource utilization
SLO (Service Level Objective)
- Definition: Target value for SLI
- Best Practices:
- Set realistic SLOs based on business requirements
- Use SLOs to drive reliability improvements
- Monitor SLOs continuously
- Alert on SLO breaches
- Use error budgets for balancing reliability and innovation
Error Budget
- Definition: Allowable amount of unreliability
- Calculation: Error Budget = 100% - SLO
- Best Practices:
- Use error budget to guide release decisions
- Freeze deployments when error budget is exhausted
- Implement error budget alerts
- Track error budget consumption
- Use error budget for reliability planning
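The budget arithmetic above can be made concrete with a small helper; this is a sketch, not tied to any particular SLO tool:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining_fraction(slo_target: float, current_availability: float) -> float:
    """Fraction of the error budget still unspent (negative when exhausted)."""
    budget = 1 - slo_target
    spent = 1 - current_availability
    return (budget - spent) / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime
print(round(error_budget_minutes(0.999), 1))               # → 43.2
# At 99.95% measured availability, half the budget remains
print(round(budget_remaining_fraction(0.999, 0.9995), 2))  # → 0.5
```

A deployment-freeze policy then reduces to checking whether `budget_remaining_fraction` has gone below some threshold.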
SLO/SLI Examples
SLO configuration

```yaml
slo_name: "API Availability"
sli_name: "api_availability"
slo_target: 0.999
slo_window: 30d
alert_threshold: 0.998
```

SLI calculation

```
api_availability = 1 - (error_count / total_count)
```

Error budget

```
error_budget = 1 - slo_target
error_budget_remaining = current_availability - slo_target
```
Monitoring SLOs
Tools:
- Prometheus and Grafana
- CloudWatch SLOs
- Azure Monitor SLOs
- Stackdriver SLOs
- SRE-specific tools (Sloth, sli-exporter)
Best Practices:
- Visualize SLOs in dashboards
- Alert on SLO breaches
- Track SLO trends over time
- Compare SLOs across services
- Use SLOs for capacity planning
Observability Best Practices
The Three Pillars of Observability
- Metrics: Quantitative data points
- Logs: Discrete events
- Traces: Request paths through distributed systems
Distributed Tracing
Core Concepts
- Trace: End-to-end request path
- Span: Individual operation
- Trace ID: Unique identifier for trace
- Span ID: Unique identifier for span
Best Practices
- Use distributed tracing for microservices
- Implement trace sampling
- Use trace context propagation
- Configure trace retention
- Analyze traces for performance issues
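Trace context propagation is commonly done via the W3C `traceparent` header (`version-traceid-spanid-flags`); a minimal sketch of building and parsing one, independent of any tracing library:

```python
import secrets
from typing import Optional


def make_traceparent(trace_id: Optional[str] = None,
                     span_id: Optional[str] = None) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # 01 = sampled flag


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}


# A downstream service keeps the trace_id but starts a new span
parent = parse_traceparent(make_traceparent())
child = make_traceparent(trace_id=parent["trace_id"])
```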
Tools
- Jaeger
- Zipkin
- AWS X-Ray
- Azure Application Insights
- Google Cloud Trace
Observability Patterns
- RED Method: Rate, Errors, Duration
- USE Method: Utilization, Saturation, Errors
- Four Golden Signals: Latency, Traffic, Errors, Saturation
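The RED method maps naturally onto PromQL; assuming the conventional `http_requests_total` and `http_request_duration_seconds` metric names with a `service` label, the three signals look like:

```promql
# Rate: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: failed requests per second
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# Duration: P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```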
Observability Maturity Model
- Level 1: Basic metrics and logging
- Level 2: Structured logging and metrics
- Level 3: Distributed tracing
- Level 4: Automated alerting and incident response
- Level 5: SLO-driven development and error budgets