Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
Core Concepts
The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
- Metrics: Numerical measurements of system behavior over time
  - Counter: Monotonically increasing values (requests served, errors)
  - Gauge: Point-in-time values that go up and down (memory usage, temperature)
  - Histogram: Distribution of values (request duration buckets)
  - Summary: Similar to a histogram, but calculates quantiles client-side
- Logs: Discrete events with contextual information
  - Structured logging (JSON, key-value pairs)
  - Centralized log aggregation (ELK, Loki)
  - Correlation with metrics and traces
- Traces: Request flow through distributed systems
  - Span: Single unit of work with start/end time
  - Trace: Collection of spans representing an end-to-end request
  - OpenTelemetry for distributed tracing
- Events: Significant occurrences in the system lifecycle
  - Deployments, configuration changes
  - Scaling events, incidents
  - Business events and user actions
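The metric types above differ mainly in what they accumulate. A minimal stdlib-only sketch of their semantics (this is not the `prometheus_client` API, just an illustration of the behavior):

```python
import bisect

class Counter:
    """Monotonically increasing; resets only on process restart."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Point-in-time value; may go up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets, like the *_bucket series."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.upper_bounds = sorted(buckets)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last slot = +Inf
        self.count = 0
        self.sum = 0.0
    def observe(self, value):
        self.count += 1
        self.sum += value
        # le is inclusive: an observation equal to a bound lands in that bucket
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
    def cumulative(self):
        """Prometheus exposes buckets cumulatively (le="0.1", ..., le="+Inf")."""
        out, total = [], 0
        for c in self.bucket_counts:
            total += c
            out.append(total)
        return out

h = Histogram()
for v in (0.05, 0.3, 0.7, 2.0, 9.0):
    h.observe(v)
# h.cumulative() -> [1, 2, 3, 4, 5]: each bucket includes all smaller observations
```

This cumulative layout is what makes `histogram_quantile` and server-side aggregation across instances possible, which a client-side Summary cannot offer.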
Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)
- Stores metrics as time-series data
- Efficient compression and retention policies
- Local storage with optional remote storage
Scrape Targets
- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection
PromQL Query Engine
- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors
Alertmanager
- Alert rule evaluation
- Grouping, silencing, and routing
- Integration with PagerDuty, Slack, email, webhooks
Exporters
- Bridge between Prometheus and systems
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases, services
Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:

```prometheus
http_requests_total{method="GET", endpoint="/api/users", status="200"}
```

Label Best Practices:
- Use labels for dimensions you query/aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize external labels

Cardinality Considerations:
- Each unique label combination = a new time-series
- High cardinality = increased memory and storage
- Monitor cardinality with `prometheus_tsdb_symbol_table_size_bytes`
- Use recording rules to pre-aggregate high-cardinality metrics
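Because each unique label combination creates a separate series, the worst-case cardinality of one metric is the product of the value counts of its labels. A quick back-of-envelope check (the label counts are illustrative):

```python
from math import prod

def worst_case_series(label_value_counts: dict) -> int:
    """Upper bound on time-series count for one metric name,
    assuming every label combination actually occurs."""
    return prod(label_value_counts.values())

base = {"method": 5, "endpoint": 40, "status": 8}
print(worst_case_series(base))                         # 1600 series: fine
print(worst_case_series({**base, "user_id": 50_000}))  # 80,000,000: never do this
```

One unbounded label can turn a cheap metric into millions of series, which is why IDs and timestamps belong in logs or traces, not labels.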
Recording Rules
Pre-compute frequently-used or expensive queries:

```yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: api:http_requests:rate5m
        expr: rate(http_requests_total[5m])
```

Benefits:
- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
Service Level Objectives (SLOs)
Define and track reliability targets:
SLI (Service Level Indicator): Metric measuring service quality
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second
SLO (Service Level Objective): Target for SLI
- 99.9% availability (43.8 minutes downtime/month)
- 95% of requests < 200ms
- 1000 RPS sustained
SLA (Service Level Agreement): Contract with consequences
- External commitments to customers
- Financial penalties for SLO violations
Error Budget: Acceptable failure rate
- Error budget = 100% - SLO
- 99.9% SLO = 0.1% error budget
- Use budget for innovation vs. reliability tradeoff
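The downtime figures above fall directly out of the error budget. A small helper (the 43.8-minute figure assumes a 30.44-day average month):

```python
def downtime_budget_minutes(slo: float, days: float = 30.44) -> float:
    """Minutes of full downtime allowed per period at a given availability SLO."""
    error_budget = 1.0 - slo          # e.g. 99.9% SLO -> 0.1% budget
    return error_budget * days * 24 * 60

print(round(downtime_budget_minutes(0.999), 1))   # 43.8 min/month at 99.9%
print(round(downtime_budget_minutes(0.9999), 1))  # 4.4 min/month at 99.99%
```

Each extra nine divides the budget by ten, which is why SLO targets should be set from user needs rather than aspiration.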
Prometheus Setup and Configuration

Basic Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # Default scrape interval
  evaluation_interval: 15s  # Alert rule evaluation interval
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - 'rules/*.yml'
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          env: 'production'
          tier: 'backend'
```
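Targets like `api-1:8080` above must expose metrics in the Prometheus text exposition format on `/metrics`. A minimal stdlib-only sketch of such an endpoint (a real service would normally use an official client library such as `prometheus_client`; the metric and handler names here are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"GET": 0, "POST": 0}  # toy counter, keyed by the method label

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(REQUESTS.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Prometheus then scrapes this endpoint on the configured interval; the server only ever reports current values and never pushes.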
Kubernetes Service Discovery
```yaml
scrape_configs:
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
        target_label: __address__
      # Use the path from the prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Kubernetes services probed via the blackbox exporter
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```

Storage and Retention
```yaml
# Storage configuration (note: stock Prometheus sets retention via the
# --storage.tsdb.retention.time / --storage.tsdb.retention.size flags)
storage:
  tsdb:
    path: /prometheus/data
    retention.time: 15d
    retention.size: 50GB

# Remote write for long-term storage
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 5000
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

# Remote read for querying historical data
remote_read:
  - url: "https://prometheus-remote-storage.example.com/api/v1/read"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    read_recent: true
```
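Relabeling rules like the ones used above (`relabel_configs`, `write_relabel_configs`) are regex rewrites applied to joined source-label values. A simplified Python sketch of the core semantics (real relabeling supports more actions; this covers `keep`, `drop`, and `replace`):

```python
import re

def apply_relabel(labels, source_labels, regex, action,
                  target_label=None, replacement="${1}", separator=";"):
    """Simplified Prometheus relabeling step.
    Returns the new label dict, or None if the target/series is dropped."""
    value = separator.join(labels.get(l, "") for l in source_labels)
    match = re.fullmatch(regex, value)  # Prometheus fully anchors the regex
    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace" and match and target_label:
        # Expand $1 / ${1}-style group references in the replacement
        expanded = re.sub(r"\$\{?(\d+)\}?",
                          lambda m: match.group(int(m.group(1))), replacement)
        return {**labels, target_label: expanded}
    return labels  # replace with no match leaves labels untouched

# Strip the port from __address__ into an 'instance' label, as in the node job:
out = apply_relabel({"__address__": "node1:9100"}, ["__address__"],
                    r"([^:]+):.*", "replace",
                    target_label="instance", replacement="${1}")
print(out["instance"])  # node1
```

Because rules run in order and each sees the previous rule's output, a `keep` early in the list cheaply discards targets before any later rewriting happens.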
PromQL: The Prometheus Query Language

Instant Vectors and Selectors
```promql
# Basic metric selection
http_requests_total

# Filter by label
http_requests_total{job="api", status="200"}

# Regex matching
http_requests_total{status=~"2..|3.."}

# Negative matching
http_requests_total{status!="500"}

# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
```

Range Vectors and Aggregations
```promql
# 5-minute range vector
http_requests_total[5m]

# Rate of increase per second
rate(http_requests_total[5m])

# Increase over time window
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Max/Min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])

# Standard deviation
stddev_over_time(response_time_seconds[5m])
```

Aggregation Operators
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))

# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)

# Count instances that are up
count(up == 1)

# Topk and bottomk
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_available_bytes)

# Quantile across instances
quantile(0.95, http_request_duration_seconds)
```

Mathematical Operations
```promql
# Arithmetic operations
(node_memory_total_bytes - node_memory_available_bytes) / node_memory_total_bytes * 100

# Comparison operators
http_request_duration_seconds > 0.5

# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100

# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
```

Advanced PromQL Patterns
```promql
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Latency percentiles (histogram)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Predict linear growth
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
>
3 * stddev_over_time(cpu_usage[1h])

# Calculate saturation (USE method)
sum(rate(cpu_seconds_total{mode!="idle"}[5m])) by (instance)
/
count(cpu_seconds_total{mode="idle"}) by (instance)

# Burn rate for SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
>
(14.4 * (1 - 0.999))  # For a 99.9% SLO
```
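The `histogram_quantile` pattern above interpolates linearly inside the bucket that contains the target rank. A stdlib re-implementation of that logic over cumulative `le` buckets (simplified: PromQL also handles edge cases like negative bounds):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, the last
    bound being float('inf'). Mirrors PromQL's linear interpolation."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # PromQL caps at the highest finite bound
            width = bound - prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + width * fraction
        prev_bound, prev_count = bound, count

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # 0.95: interpolated in the 0.5-1.0 bucket
```

The interpolation is why bucket boundaries matter: a p99 that lands in a wide bucket can only ever be an estimate between that bucket's edges.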
Alerting with Prometheus and Alertmanager

Alert Rule Definitions
```yaml
# alerts/api_alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/HighErrorRate"

      # High latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s on {{ $labels.service }}"

      # Service down alert
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Memory pressure alert
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # CPU saturation alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route tree for alert distribution
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    # Critical alerts also go to Slack
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 0s
    # Warning alerts to Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
    # Team-specific routing
    - match:
        team: backend
      receiver: 'team-backend'
    - match:
        team: frontend
      receiver: 'team-frontend'
    # Database alerts to DBA team
    - match_re:
        service: 'postgres|mysql|mongodb'
      receiver: 'team-dba'

# Alert receivers/integrations
receivers:
  - name: 'team-default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'
        send_resolved: true

  - name: 'team-backend'
    slack_configs:
      - channel: '#team-backend'
        send_resolved: true
    email_configs:
      - to: 'backend-team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password_file: '/etc/alertmanager/email_password'

  - name: 'team-frontend'
    slack_configs:
      - channel: '#team-frontend'
        send_resolved: true

  - name: 'team-dba'
    slack_configs:
      - channel: '#team-dba'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'DBA_PAGERDUTY_KEY'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Inhibit warnings if a critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # Don't alert on instance down if the cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'InstanceDown|ServiceDown'
    equal: ['cluster']
```
Multi-Window Multi-Burn-Rate Alerts for SLOs
```yaml
# SLO-based alerting using burn rate
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Fast burn (1h window, 5m burn)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 14.4x faster than normal"

      # Slow burn (6h window, 30m burn)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          slo: "99.9%"
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 6x faster than normal"
```
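The 14.4 and 6 multipliers come from fixing how much of the 30-day error budget an alert may consume before it fires: a burn rate of B means the budget is being spent B times faster than the SLO allows. The factor for "X% of the budget in window W" is:

```python
def burn_rate_factor(budget_fraction_consumed, window_hours, period_hours=30 * 24):
    """Burn rate at which `budget_fraction_consumed` of the error budget
    is spent within `window_hours` of a `period_hours` SLO period."""
    return budget_fraction_consumed * period_hours / window_hours

print(burn_rate_factor(0.02, 1))  # 14.4 -> page: 2% of the monthly budget gone in 1h
print(burn_rate_factor(0.05, 6))  # 6.0  -> ticket: 5% of the monthly budget gone in 6h
```

Pairing each long window with a short one (1h with 5m, 6h with 30m), as in the rules above, makes the alert reset quickly once the error rate recovers instead of staying red for the full long window.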
undefinedGrafana Dashboards and Visualization
Grafana仪表盘与可视化
Dashboard JSON Structure
仪表盘JSON结构
json
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p99",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p50",
"refId": "C"
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
```
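The latency panels above lean on `histogram_quantile()`, which assumes observations are spread uniformly inside the bucket that contains the target rank and interpolates linearly between bucket bounds. A minimal stdlib sketch of that interpolation, using made-up bucket counts (real PromQL adds more edge-case handling):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets, mirroring the
    linear interpolation PromQL performs inside the bucket holding the rank.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by bound,
    ending with the +Inf bucket (whose count is the total observation count).
    """
    total = buckets[-1][1]
    rank = q * total  # the observation rank the quantile falls on
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for http_request_duration_seconds_bucket
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # 0.1: rank 50 sits exactly at the first bound
p95 = histogram_quantile(0.95, buckets)  # interpolated inside the 0.5-1.0 bucket
```

This is also why quantile accuracy depends entirely on bucket layout: the result can never be more precise than the width of the bucket it lands in.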
RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
```json
{
"panels": [
{
"title": "Request Rate (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
}
]
},
{
"title": "Error Rate % (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
}
]
},
{
"title": "Duration p99 (per service)",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
}
]
}
]
}
```
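The error-rate panel divides the 5xx rate by the total rate, and both rely on `rate()` over monotonic counters. A rough stdch sketch of what `rate()` computes from hypothetical counter samples, including PromQL-style counter-reset handling (simplified: real `rate()` also extrapolates to the window boundaries):

```python
def prom_rate(samples, window_seconds):
    """Per-second rate from (timestamp, counter_value) samples.

    A drop in value means the process restarted and the counter reset, so the
    post-reset value is counted from zero rather than treated as a decrease.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur  # reset: count from 0
    return increase / window_seconds


# Hypothetical samples over a 60s window
total_reqs = [(0, 1000), (30, 1060), (60, 1120)]  # +120 requests
errors_5xx = [(0, 40), (30, 43), (60, 4)]         # counter reset at t=60

request_rate = prom_rate(total_reqs, 60)                    # 2.0 req/s
error_pct = prom_rate(errors_5xx, 60) / request_rate * 100  # ~5.83 %
```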
USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
```json
{
"panels": [
{
"title": "CPU Utilization %",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"targets": [
{
"expr": "node_load1"
}
]
},
{
"title": "Memory Utilization %",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
},
{
"title": "Disk I/O Utilization %",
"targets": [
{
"expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
}
]
},
{
"title": "Network Errors",
"targets": [
{
"expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
}
]
}
]
}
```
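The CPU utilization query works because `node_cpu_seconds_total` is a cumulative counter per mode: utilization is the non-idle share of the delta between two scrapes, i.e. `100 - idle-rate * 100`. A small sketch of that arithmetic with illustrative numbers (not real `/proc/stat` output):

```python
def cpu_utilization_pct(prev, cur):
    """CPU utilization % between two cumulative CPU-time samples.

    prev/cur map mode -> cumulative seconds, in the spirit of
    node_cpu_seconds_total{mode=...}; utilization is the share of the
    elapsed CPU time that was not spent idle.
    """
    deltas = {mode: cur[mode] - prev[mode] for mode in cur}
    total = sum(deltas.values())
    return 100.0 * (1 - deltas["idle"] / total)


prev = {"user": 100.0, "system": 50.0, "idle": 850.0}
cur = {"user": 130.0, "system": 60.0, "idle": 910.0}
util = cpu_utilization_pct(prev, cur)  # 40.0: 40s busy out of 100s elapsed
```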
Exporters and Metric Collection

Node Exporter for System Metrics
```bash
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
```
**Key Metrics from Node Exporter:**
- `node_cpu_seconds_total`: CPU usage by mode
- `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`: Memory
- `node_disk_io_time_seconds_total`: Disk I/O
- `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`: Network
- `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: Disk space

Custom Application Exporter (Python)
```python
# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random

# Define metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total app requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'Request duration in seconds',
['method', 'endpoint'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge(
'app_active_users',
'Number of active users'
)
QUEUE_SIZE = Gauge(
'app_queue_size',
'Current queue size',
['queue_name']
)
DATABASE_CONNECTIONS = Gauge(
'app_database_connections',
'Number of database connections',
['pool', 'state']
)
CACHE_HITS = Counter(
'app_cache_hits_total',
'Total cache hits',
['cache_name']
)
CACHE_MISSES = Counter(
'app_cache_misses_total',
'Total cache misses',
['cache_name']
)
def simulate_metrics():
    """Simulate application metrics"""
    while True:
        # Simulate requests
        method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
        endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
        status = random.choice(['200', '200', '200', '400', '500'])
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()

        # Simulate request duration
        duration = random.uniform(0.01, 2.0)
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)

        # Update gauges
        ACTIVE_USERS.set(random.randint(100, 1000))
        QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
        QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))

        # Database connection pool
        DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
        DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))

        # Cache metrics
        if random.random() > 0.3:
            CACHE_HITS.labels(cache_name='redis').inc()
        else:
            CACHE_MISSES.labels(cache_name='redis').inc()

        time.sleep(1)


if __name__ == '__main__':
    # Start metrics server on port 8000
    start_http_server(8000)
    print("Metrics server started on port 8000")
    simulate_metrics()
```
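What the exporter serves on `/metrics` is the Prometheus text exposition format: one `name{labels} value` line per sample, plus `# HELP`/`# TYPE` comments. A toy parser for the common sample-line shape (real scrapers use a full parser; this is only an illustration of the format):

```python
import re

def parse_metric_line(line):
    """Parse one sample line of the text exposition format.

    Handles only the simple 'name{label="value",...} value' shape and
    returns (name, labels_dict, float_value), or None for comments and
    anything it does not recognize.
    """
    m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$', line)
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = dict(re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"', raw_labels or ""))
    return name, labels, float(value)


sample = 'app_requests_total{method="GET",endpoint="/api/users",status="200"} 42'
name, labels, value = parse_metric_line(sample)
```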
Custom Exporter (Go)

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "app_requests_total",
			Help: "Total number of requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "app_request_duration_seconds",
			Help:    "Request duration in seconds",
			Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
		},
		[]string{"method", "endpoint"},
	)

	activeUsers = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "app_active_users",
			Help: "Number of active users",
		},
	)

	databaseConnections = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "app_database_connections",
			Help: "Number of database connections",
		},
		[]string{"pool", "state"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal)
	prometheus.MustRegister(requestDuration)
	prometheus.MustRegister(activeUsers)
	prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Simulate requests
		methods := []string{"GET", "POST", "PUT", "DELETE"}
		endpoints := []string{"/api/users", "/api/products", "/api/orders"}
		statuses := []string{"200", "200", "200", "400", "500"}

		method := methods[rand.Intn(len(methods))]
		endpoint := endpoints[rand.Intn(len(endpoints))]
		status := statuses[rand.Intn(len(statuses))]

		requestsTotal.WithLabelValues(method, endpoint, status).Inc()
		requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

		// Update gauges
		activeUsers.Set(float64(rand.Intn(900) + 100))
		databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
		databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
	}
}

func main() {
	go simulateMetrics()

	http.Handle("/metrics", promhttp.Handler())
	log.Println("Starting metrics server on :8000")
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

PostgreSQL Exporter
```yaml
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'
```
**Key PostgreSQL Metrics:**
- `pg_up`: Database reachability
- `pg_stat_database_tup_returned`: Rows read
- `pg_stat_database_tup_inserted`: Rows inserted
- `pg_stat_database_deadlocks`: Deadlock count
- `pg_stat_replication_lag`: Replication lag in seconds
- `pg_locks_count`: Active locks

Blackbox Exporter for Probing
```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
```

```yaml
# Prometheus config for blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Best Practices
Metric Naming Conventions

Follow Prometheus naming best practices:

```
# Format: <namespace>_<subsystem>_<name>_<unit>

# Good examples
http_requests_total              # Counter
http_request_duration_seconds    # Histogram
database_connections_active      # Gauge
cache_hits_total                 # Counter
memory_usage_bytes               # Gauge

# Include unit suffixes:
# _seconds, _bytes, _total, _ratio, _percentage

# Avoid
RequestCount                     # Use snake_case
http_requests                    # Missing _total for counter
request_time                     # Missing unit (should be _seconds)
```
Label Guidelines

```promql
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}

# Bad: High cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}

# Good: Use bounded label values
http_requests_total{status_class="2xx"}

# Bad: Unbounded label values
http_requests_total{response_size="1234567"}
```
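A quick way to reason about why unbounded labels are dangerous: the worst-case series count for one metric is the product of the possible values per label, so a single unbounded label multiplies everything else. A tiny sketch with hypothetical label sets:

```python
from math import prod

def series_estimate(label_values):
    """Worst-case time-series count for one metric: the product of the
    number of possible values of each label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status_class": ["2xx", "3xx", "4xx", "5xx"],
    "endpoint": ["/api/users", "/api/products", "/api/orders"],
}
series_estimate(bounded)  # 4 * 4 * 3 = 48 series: fine

# Adding one unbounded label (say 100k user ids) multiplies that by 100000.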
Recording Rule Patterns

```yaml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-aggregate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Namespace aggregations
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI calculations
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
```
Alert Design Principles
- Alert on symptoms, not causes: Alert on user-facing issues
- Make alerts actionable: Include runbook links
- Use appropriate severity levels: Critical, warning, info
- Set proper thresholds: Based on historical data
- Include context in annotations: Help on-call engineers
- Group related alerts: Reduce alert fatigue
- Use inhibition rules: Suppress redundant alerts
- Test alert rules: Verify they fire when expected
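Threshold choice for error-rate alerts is often framed as a burn rate against the SLO's error budget rather than a fixed percentage. A minimal sketch of that arithmetic (the 14.4 figure is the commonly cited fast-burn paging threshold for a 99.9% SLO over a 30-day window):

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent.

    burn_rate == 1 means the budget would last exactly the SLO window;
    higher values exhaust it proportionally faster.
    """
    return error_ratio / (1 - slo_target)


burn_rate(0.0144, 0.999)  # ~14.4 -> fast burn, page immediately
burn_rate(0.0005, 0.999)  # ~0.5  -> within budget, no alert
```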
Dashboard Best Practices
- One dashboard per audience: SRE, developers, business
- Use consistent time ranges: Make comparisons easier
- Include SLI/SLO metrics: Show business impact
- Add annotations for deploys: Correlate changes with metrics
- Use template variables: Make dashboards reusable
- Show trends and aggregates: Not just raw metrics
- Include links to runbooks: Enable quick response
- Use appropriate visualizations: Graphs, gauges, tables
High Availability Setup

```yaml
# Prometheus HA with Thanos:
# deploy multiple Prometheus instances with the same scrape config,
# then use Thanos to deduplicate and provide a global view.

# prometheus-1.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '1'
```

```yaml
# prometheus-2.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '2'
```

The Thanos sidecar uploads TSDB blocks to object storage and exposes a StoreAPI for querying.
Capacity Planning Queries

```promql
# Disk space exhaustion prediction (free space goes negative within 4 hours)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

# Memory availability trend, projected 24 hours ahead
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)

# Request rate growth, projected 7 days ahead
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)

# Storage capacity planning
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
```
Advanced Patterns
Federation for Multi-Cluster Monitoring

```yaml
# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # All recording rules
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'
```
Cost Monitoring Pattern

Track cloud costs with custom metrics:

```yaml
groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03  # CPU cost/hour
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005  # Memory cost/hour
          )

      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730  # Hours in an average month
```
Custom SLO Implementation

```yaml
# SLO: 99.9% availability for the API
groups:
  - name: api_slo
    interval: 30s
    rules:
      # Success rate SLI
      - record: api:sli:success_rate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))

      # Error budget remaining (30 days)
      - record: api:error_budget:remaining
        expr: |
          1 - (
            (1 - api:sli:success_rate)
            /
            (1 - 0.999)
          )

      # Latency SLI (p99 < 500ms)
      - record: api:sli:latency_success_rate
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < 0.5
          )
```
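The error-budget rule reduces to simple arithmetic: the observed error ratio divided by the ratio the SLO allows, subtracted from one. A sketch for back-of-the-envelope checks:

```python
def error_budget_remaining(success_ratio, slo_target):
    """Fraction of the error budget left over the measurement window:
    1 - (observed error ratio / allowed error ratio)."""
    return 1 - (1 - success_ratio) / (1 - slo_target)


error_budget_remaining(0.9995, 0.999)  # ~0.5 -> half the budget left
error_budget_remaining(0.999, 0.999)   # ~0.0 -> budget exhausted
```

Negative values mean the SLO has already been violated for the window.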
Examples Summary
This skill includes 20+ comprehensive examples covering:
- Prometheus configuration (basic, Kubernetes SD, storage)
- PromQL queries (instant vectors, range vectors, aggregations)
- Mathematical operations and advanced patterns
- Alert rule definitions (error rate, latency, resource usage)
- Alertmanager configuration (routing, receivers, inhibition)
- Multi-window multi-burn-rate SLO alerts
- Grafana dashboard JSON (full dashboard, RED method, USE method)
- Custom exporters (Python, Go)
- Third-party exporters (PostgreSQL, Blackbox)
- Recording rules for performance
- Federation for multi-cluster monitoring
- Cost monitoring and SLO implementation
- High availability patterns
- Capacity planning queries
Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes