Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
Core Concepts
The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
- Metrics: Numerical measurements of system behavior over time
  - Counter: Monotonically increasing values (requests served, errors)
  - Gauge: Point-in-time values that go up and down (memory usage, temperature)
  - Histogram: Distribution of values (request duration buckets)
  - Summary: Similar to a histogram, but calculates quantiles client-side
- Logs: Discrete events with contextual information
  - Structured logging (JSON, key-value pairs)
  - Centralized log aggregation (ELK, Loki)
  - Correlation with metrics and traces
- Traces: Request flow through distributed systems
  - Span: Single unit of work with start/end time
  - Trace: Collection of spans representing an end-to-end request
  - OpenTelemetry for distributed tracing
- Events: Significant occurrences in the system lifecycle
  - Deployments, configuration changes
  - Scaling events, incidents
  - Business events and user actions
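The metric types above differ mainly in what they accumulate. A minimal stdlib-only sketch of their semantics (this is not the `prometheus_client` API, just an illustration of the behavior):

```python
import bisect

class Counter:
    """Monotonically increasing; resets only on process restart."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Point-in-time value; may go up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets, like the *_bucket series."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.upper_bounds = sorted(buckets)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last slot = +Inf
        self.count = 0
        self.sum = 0.0
    def observe(self, value):
        self.count += 1
        self.sum += value
        # le is inclusive: an observation equal to a bound lands in that bucket
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
    def cumulative(self):
        """Prometheus exposes buckets cumulatively (le="0.1", ..., le="+Inf")."""
        out, total = [], 0
        for c in self.bucket_counts:
            total += c
            out.append(total)
        return out

h = Histogram()
for v in (0.05, 0.3, 0.7, 2.0, 9.0):
    h.observe(v)
# h.cumulative() -> [1, 2, 3, 4, 5]: each bucket includes all smaller observations
```

This cumulative layout is what makes `histogram_quantile` and server-side aggregation across instances possible, which a client-side Summary cannot offer.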
Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)
- Stores metrics as time-series data
- Efficient compression and retention policies
- Local storage with optional remote storage
Scrape Targets
- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection
PromQL Query Engine
- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors
Alertmanager
- Alert rule evaluation
- Grouping, silencing, and routing
- Integration with PagerDuty, Slack, email, webhooks
Exporters
- Bridge between Prometheus and systems
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases, services
Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:

```prometheus
http_requests_total{method="GET", endpoint="/api/users", status="200"}
```

Label Best Practices:
- Use labels for dimensions you query/aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize external labels

Cardinality Considerations:
- Each unique label combination = a new time-series
- High cardinality = increased memory and storage
- Monitor cardinality with `prometheus_tsdb_symbol_table_size_bytes`
- Use recording rules to pre-aggregate high-cardinality metrics
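Because each unique label combination creates a separate series, the worst-case cardinality of one metric is the product of the value counts of its labels. A quick back-of-envelope check (the label counts are illustrative):

```python
from math import prod

def worst_case_series(label_value_counts: dict) -> int:
    """Upper bound on time-series count for one metric name,
    assuming every label combination actually occurs."""
    return prod(label_value_counts.values())

base = {"method": 5, "endpoint": 40, "status": 8}
print(worst_case_series(base))                         # 1600 series: fine
print(worst_case_series({**base, "user_id": 50_000}))  # 80,000,000: never do this
```

One unbounded label can turn a cheap metric into millions of series, which is why IDs and timestamps belong in logs or traces, not labels.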
Recording Rules
Pre-compute frequently-used or expensive queries:

```yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: api:http_requests:rate5m
        expr: rate(http_requests_total[5m])
```

Benefits:
- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
Service Level Objectives (SLOs)
Define and track reliability targets:
SLI (Service Level Indicator): Metric measuring service quality
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second
SLO (Service Level Objective): Target for SLI
- 99.9% availability (43.8 minutes downtime/month)
- 95% of requests < 200ms
- 1000 RPS sustained
SLA (Service Level Agreement): Contract with consequences
- External commitments to customers
- Financial penalties for SLO violations
Error Budget: Acceptable failure rate
- Error budget = 100% - SLO
- 99.9% SLO = 0.1% error budget
- Use budget for innovation vs. reliability tradeoff
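The downtime figures above fall directly out of the error budget. A small helper (the 43.8-minute figure assumes a 30.44-day average month):

```python
def downtime_budget_minutes(slo: float, days: float = 30.44) -> float:
    """Minutes of full downtime allowed per period at a given availability SLO."""
    error_budget = 1.0 - slo          # e.g. 99.9% SLO -> 0.1% budget
    return error_budget * days * 24 * 60

print(round(downtime_budget_minutes(0.999), 1))   # 43.8 min/month at 99.9%
print(round(downtime_budget_minutes(0.9999), 1))  # 4.4 min/month at 99.99%
```

Each extra nine divides the budget by ten, which is why SLO targets should be set from user needs rather than aspiration.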
Prometheus Setup and Configuration

Basic Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s      # Default scrape interval
  evaluation_interval: 15s  # Alert rule evaluation interval
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - 'rules/*.yml'
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          env: 'production'
          tier: 'backend'
```
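Targets like `api-1:8080` above must expose metrics in the Prometheus text exposition format on `/metrics`. A minimal stdlib-only sketch of such an endpoint (a real service would normally use an official client library such as `prometheus_client`; the metric and handler names here are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"GET": 0, "POST": 0}  # toy counter, keyed by the method label

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(REQUESTS.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Prometheus then scrapes this endpoint on the configured interval; the server only ever reports current values and never pushes.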
Kubernetes Service Discovery
```yaml
scrape_configs:
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
        target_label: __address__
      # Use the path from the prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Kubernetes services probed via the blackbox exporter
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```

Storage and Retention
```yaml
# Storage configuration (note: stock Prometheus sets retention via the
# --storage.tsdb.retention.time / --storage.tsdb.retention.size flags)
storage:
  tsdb:
    path: /prometheus/data
    retention.time: 15d
    retention.size: 50GB

# Remote write for long-term storage
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 5000
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

# Remote read for querying historical data
remote_read:
  - url: "https://prometheus-remote-storage.example.com/api/v1/read"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    read_recent: true
```
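Relabeling rules like the ones used above (`relabel_configs`, `write_relabel_configs`) are regex rewrites applied to joined source-label values. A simplified Python sketch of the core semantics (real relabeling supports more actions; this covers `keep`, `drop`, and `replace`):

```python
import re

def apply_relabel(labels, source_labels, regex, action,
                  target_label=None, replacement="${1}", separator=";"):
    """Simplified Prometheus relabeling step.
    Returns the new label dict, or None if the target/series is dropped."""
    value = separator.join(labels.get(l, "") for l in source_labels)
    match = re.fullmatch(regex, value)  # Prometheus fully anchors the regex
    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace" and match and target_label:
        # Expand $1 / ${1}-style group references in the replacement
        expanded = re.sub(r"\$\{?(\d+)\}?",
                          lambda m: match.group(int(m.group(1))), replacement)
        return {**labels, target_label: expanded}
    return labels  # replace with no match leaves labels untouched

# Strip the port from __address__ into an 'instance' label, as in the node job:
out = apply_relabel({"__address__": "node1:9100"}, ["__address__"],
                    r"([^:]+):.*", "replace",
                    target_label="instance", replacement="${1}")
print(out["instance"])  # node1
```

Because rules run in order and each sees the previous rule's output, a `keep` early in the list cheaply discards targets before any later rewriting happens.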
PromQL: The Prometheus Query Language

Instant Vectors and Selectors
```promql
# Basic metric selection
http_requests_total

# Filter by label
http_requests_total{job="api", status="200"}

# Regex matching
http_requests_total{status=~"2..|3.."}

# Negative matching
http_requests_total{status!="500"}

# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
```

Range Vectors and Aggregations
```promql
# 5-minute range vector
http_requests_total[5m]

# Rate of increase per second
rate(http_requests_total[5m])

# Increase over time window
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Max/Min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])

# Standard deviation
stddev_over_time(response_time_seconds[5m])
```

Aggregation Operators
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))

# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)

# Count instances that are up
count(up == 1)

# Topk and bottomk
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_available_bytes)

# Quantile across instances
quantile(0.95, http_request_duration_seconds)
```

Mathematical Operations
```promql
# Arithmetic operations
(node_memory_total_bytes - node_memory_available_bytes) / node_memory_total_bytes * 100

# Comparison operators
http_request_duration_seconds > 0.5

# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100

# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
```

Advanced PromQL Patterns
```promql
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Latency percentiles (histogram)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Predict linear growth
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
>
3 * stddev_over_time(cpu_usage[1h])

# Calculate saturation (USE method)
sum(rate(cpu_seconds_total{mode!="idle"}[5m])) by (instance)
/
count(cpu_seconds_total{mode="idle"}) by (instance)

# Burn rate for SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
>
(14.4 * (1 - 0.999))  # For a 99.9% SLO
```
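The `histogram_quantile` pattern above interpolates linearly inside the bucket that contains the target rank. A stdlib re-implementation of that logic over cumulative `le` buckets (simplified: PromQL also handles edge cases like negative bounds):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, the last
    bound being float('inf'). Mirrors PromQL's linear interpolation."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # PromQL caps at the highest finite bound
            width = bound - prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + width * fraction
        prev_bound, prev_count = bound, count

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # 0.95: interpolated in the 0.5-1.0 bucket
```

The interpolation is why bucket boundaries matter: a p99 that lands in a wide bucket can only ever be an estimate between that bucket's edges.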
Alerting with Prometheus and Alertmanager

Alert Rule Definitions
```yaml
# alerts/api_alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/HighErrorRate"

      # High latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s on {{ $labels.service }}"

      # Service down alert
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Memory pressure alert
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # CPU saturation alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route tree for alert distribution
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    # Critical alerts also go to Slack
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 0s
    # Warning alerts to Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
    # Team-specific routing
    - match:
        team: backend
      receiver: 'team-backend'
    - match:
        team: frontend
      receiver: 'team-frontend'
    # Database alerts to DBA team
    - match_re:
        service: 'postgres|mysql|mongodb'
      receiver: 'team-dba'

# Alert receivers/integrations
receivers:
  - name: 'team-default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'
        send_resolved: true

  - name: 'team-backend'
    slack_configs:
      - channel: '#team-backend'
        send_resolved: true
    email_configs:
      - to: 'backend-team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password_file: '/etc/alertmanager/email_password'

  - name: 'team-frontend'
    slack_configs:
      - channel: '#team-frontend'
        send_resolved: true

  - name: 'team-dba'
    slack_configs:
      - channel: '#team-dba'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'DBA_PAGERDUTY_KEY'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Inhibit warnings if a critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # Don't alert on instance down if the cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'InstanceDown|ServiceDown'
    equal: ['cluster']
```
Multi-Window Multi-Burn-Rate Alerts for SLOs
```yaml
# SLO-based alerting using burn rate
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Fast burn (1h window, 5m burn)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 14.4x faster than normal"

      # Slow burn (6h window, 30m burn)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          slo: "99.9%"
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 6x faster than normal"
```
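The 14.4 and 6 multipliers come from fixing how much of the 30-day error budget an alert may consume before it fires: a burn rate of B means the budget is being spent B times faster than the SLO allows. The factor for "X% of the budget in window W" is:

```python
def burn_rate_factor(budget_fraction_consumed, window_hours, period_hours=30 * 24):
    """Burn rate at which `budget_fraction_consumed` of the error budget
    is spent within `window_hours` of a `period_hours` SLO period."""
    return budget_fraction_consumed * period_hours / window_hours

print(burn_rate_factor(0.02, 1))  # 14.4 -> page: 2% of the monthly budget gone in 1h
print(burn_rate_factor(0.05, 6))  # 6.0  -> ticket: 5% of the monthly budget gone in 6h
```

Pairing each long window with a short one (1h with 5m, 6h with 30m), as in the rules above, makes the alert reset quickly once the error rate recovers instead of staying red for the full long window.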
undefinedGrafana Dashboards and Visualization
Grafana仪表盘与可视化
Dashboard JSON Structure
仪表盘JSON结构
json
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p99",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p50",
"refId": "C"
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
```
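The latency panels above lean on `histogram_quantile()`, which assumes observations are spread uniformly inside the bucket that contains the target rank and interpolates linearly between bucket bounds. A minimal stdlib sketch of that interpolation, using made-up bucket counts (real PromQL adds more edge-case handling):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets, mirroring the
    linear interpolation PromQL performs inside the bucket holding the rank.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by bound,
    ending with the +Inf bucket (whose count is the total observation count).
    """
    total = buckets[-1][1]
    rank = q * total  # the observation rank the quantile falls on
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for http_request_duration_seconds_bucket
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # 0.1: rank 50 sits exactly at the first bound
p95 = histogram_quantile(0.95, buckets)  # interpolated inside the 0.5-1.0 bucket
```

This is also why quantile accuracy depends entirely on bucket layout: the result can never be more precise than the width of the bucket it lands in.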
RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
```json
{
"panels": [
{
"title": "Request Rate (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
}
]
},
{
"title": "Error Rate % (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
}
]
},
{
"title": "Duration p99 (per service)",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
}
]
}
]
}
```
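The error-rate panel divides the 5xx rate by the total rate, and both rely on `rate()` over monotonic counters. A rough stdch sketch of what `rate()` computes from hypothetical counter samples, including PromQL-style counter-reset handling (simplified: real `rate()` also extrapolates to the window boundaries):

```python
def prom_rate(samples, window_seconds):
    """Per-second rate from (timestamp, counter_value) samples.

    A drop in value means the process restarted and the counter reset, so the
    post-reset value is counted from zero rather than treated as a decrease.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur  # reset: count from 0
    return increase / window_seconds


# Hypothetical samples over a 60s window
total_reqs = [(0, 1000), (30, 1060), (60, 1120)]  # +120 requests
errors_5xx = [(0, 40), (30, 43), (60, 4)]         # counter reset at t=60

request_rate = prom_rate(total_reqs, 60)                    # 2.0 req/s
error_pct = prom_rate(errors_5xx, 60) / request_rate * 100  # ~5.83 %
```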
USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
```json
{
"panels": [
{
"title": "CPU Utilization %",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"targets": [
{
"expr": "node_load1"
}
]
},
{
"title": "Memory Utilization %",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
},
{
"title": "Disk I/O Utilization %",
"targets": [
{
"expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
}
]
},
{
"title": "Network Errors",
"targets": [
{
"expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
}
]
}
]
}
```
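The CPU utilization query works because `node_cpu_seconds_total` is a cumulative counter per mode: utilization is the non-idle share of the delta between two scrapes, i.e. `100 - idle-rate * 100`. A small sketch of that arithmetic with illustrative numbers (not real `/proc/stat` output):

```python
def cpu_utilization_pct(prev, cur):
    """CPU utilization % between two cumulative CPU-time samples.

    prev/cur map mode -> cumulative seconds, in the spirit of
    node_cpu_seconds_total{mode=...}; utilization is the share of the
    elapsed CPU time that was not spent idle.
    """
    deltas = {mode: cur[mode] - prev[mode] for mode in cur}
    total = sum(deltas.values())
    return 100.0 * (1 - deltas["idle"] / total)


prev = {"user": 100.0, "system": 50.0, "idle": 850.0}
cur = {"user": 130.0, "system": 60.0, "idle": 910.0}
util = cpu_utilization_pct(prev, cur)  # 40.0: 40s busy out of 100s elapsed
```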
Exporters and Metric Collection

Node Exporter for System Metrics
```bash
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
```
**Key Metrics from Node Exporter:**
- `node_cpu_seconds_total`: CPU usage by mode
- `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`: Memory
- `node_disk_io_time_seconds_total`: Disk I/O
- `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`: Network
- `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: Disk space

Custom Application Exporter (Python)
```python
# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random

# Define metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total app requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'Request duration in seconds',
['method', 'endpoint'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge(
'app_active_users',
'Number of active users'
)
QUEUE_SIZE = Gauge(
'app_queue_size',
'Current queue size',
['queue_name']
)
DATABASE_CONNECTIONS = Gauge(
'app_database_connections',
'Number of database connections',
['pool', 'state']
)
CACHE_HITS = Counter(
'app_cache_hits_total',
'Total cache hits',
['cache_name']
)
CACHE_MISSES = Counter(
'app_cache_misses_total',
'Total cache misses',
['cache_name']
)
def simulate_metrics():
    """Simulate application metrics"""
    while True:
        # Simulate requests
        method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
        endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
        status = random.choice(['200', '200', '200', '400', '500'])
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()

        # Simulate request duration
        duration = random.uniform(0.01, 2.0)
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)

        # Update gauges
        ACTIVE_USERS.set(random.randint(100, 1000))
        QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
        QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))

        # Database connection pool
        DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
        DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))

        # Cache metrics
        if random.random() > 0.3:
            CACHE_HITS.labels(cache_name='redis').inc()
        else:
            CACHE_MISSES.labels(cache_name='redis').inc()

        time.sleep(1)


if __name__ == '__main__':
    # Start metrics server on port 8000
    start_http_server(8000)
    print("Metrics server started on port 8000")
    simulate_metrics()
```
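What the exporter serves on `/metrics` is the Prometheus text exposition format: one `name{labels} value` line per sample, plus `# HELP`/`# TYPE` comments. A toy parser for the common sample-line shape (real scrapers use a full parser; this is only an illustration of the format):

```python
import re

def parse_metric_line(line):
    """Parse one sample line of the text exposition format.

    Handles only the simple 'name{label="value",...} value' shape and
    returns (name, labels_dict, float_value), or None for comments and
    anything it does not recognize.
    """
    m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$', line)
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = dict(re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"', raw_labels or ""))
    return name, labels, float(value)


sample = 'app_requests_total{method="GET",endpoint="/api/users",status="200"} 42'
name, labels, value = parse_metric_line(sample)
```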
Custom Exporter (Go)

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "app_requests_total",
			Help: "Total number of requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "app_request_duration_seconds",
			Help:    "Request duration in seconds",
			Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
		},
		[]string{"method", "endpoint"},
	)

	activeUsers = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "app_active_users",
			Help: "Number of active users",
		},
	)

	databaseConnections = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "app_database_connections",
			Help: "Number of database connections",
		},
		[]string{"pool", "state"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal)
	prometheus.MustRegister(requestDuration)
	prometheus.MustRegister(activeUsers)
	prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Simulate requests
		methods := []string{"GET", "POST", "PUT", "DELETE"}
		endpoints := []string{"/api/users", "/api/products", "/api/orders"}
		statuses := []string{"200", "200", "200", "400", "500"}

		method := methods[rand.Intn(len(methods))]
		endpoint := endpoints[rand.Intn(len(endpoints))]
		status := statuses[rand.Intn(len(statuses))]

		requestsTotal.WithLabelValues(method, endpoint, status).Inc()
		requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

		// Update gauges
		activeUsers.Set(float64(rand.Intn(900) + 100))
		databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
		databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
	}
}

func main() {
	go simulateMetrics()

	http.Handle("/metrics", promhttp.Handler())
	log.Println("Starting metrics server on :8000")
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

PostgreSQL Exporter
```yaml
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'
```
**Key PostgreSQL Metrics:**
- `pg_up`: Database reachability
- `pg_stat_database_tup_returned`: Rows read
- `pg_stat_database_tup_inserted`: Rows inserted
- `pg_stat_database_deadlocks`: Deadlock count
- `pg_stat_replication_lag`: Replication lag in seconds
- `pg_locks_count`: Active locks

Blackbox Exporter for Probing
```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
```

```yaml
# Prometheus config for blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Best Practices
Metric Naming Conventions

Follow Prometheus naming best practices:

```
# Format: <namespace>_<subsystem>_<name>_<unit>

# Good examples
http_requests_total              # Counter
http_request_duration_seconds    # Histogram
database_connections_active      # Gauge
cache_hits_total                 # Counter
memory_usage_bytes               # Gauge

# Include unit suffixes:
# _seconds, _bytes, _total, _ratio, _percentage

# Avoid
RequestCount                     # Use snake_case
http_requests                    # Missing _total for counter
request_time                     # Missing unit (should be _seconds)
```
Label Guidelines

```promql
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}

# Bad: High cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}

# Good: Use bounded label values
http_requests_total{status_class="2xx"}

# Bad: Unbounded label values
http_requests_total{response_size="1234567"}
```
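A quick way to reason about why unbounded labels are dangerous: the worst-case series count for one metric is the product of the possible values per label, so a single unbounded label multiplies everything else. A tiny sketch with hypothetical label sets:

```python
from math import prod

def series_estimate(label_values):
    """Worst-case time-series count for one metric: the product of the
    number of possible values of each label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status_class": ["2xx", "3xx", "4xx", "5xx"],
    "endpoint": ["/api/users", "/api/products", "/api/orders"],
}
series_estimate(bounded)  # 4 * 4 * 3 = 48 series: fine

# Adding one unbounded label (say 100k user ids) multiplies that by 100000.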
Recording Rule Patterns

```yaml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-aggregate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Namespace aggregations
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI calculations
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
```
Alert Design Principles
- Alert on symptoms, not causes: Alert on user-facing issues
- Make alerts actionable: Include runbook links
- Use appropriate severity levels: Critical, warning, info
- Set proper thresholds: Based on historical data
- Include context in annotations: Help on-call engineers
- Group related alerts: Reduce alert fatigue
- Use inhibition rules: Suppress redundant alerts
- Test alert rules: Verify they fire when expected
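Threshold choice for error-rate alerts is often framed as a burn rate against the SLO's error budget rather than a fixed percentage. A minimal sketch of that arithmetic (the 14.4 figure is the commonly cited fast-burn paging threshold for a 99.9% SLO over a 30-day window):

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent.

    burn_rate == 1 means the budget would last exactly the SLO window;
    higher values exhaust it proportionally faster.
    """
    return error_ratio / (1 - slo_target)


burn_rate(0.0144, 0.999)  # ~14.4 -> fast burn, page immediately
burn_rate(0.0005, 0.999)  # ~0.5  -> within budget, no alert
```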
Dashboard Best Practices
- One dashboard per audience: SRE, developers, business
- Use consistent time ranges: Make comparisons easier
- Include SLI/SLO metrics: Show business impact
- Add annotations for deploys: Correlate changes with metrics
- Use template variables: Make dashboards reusable
- Show trends and aggregates: Not just raw metrics
- Include links to runbooks: Enable quick response
- Use appropriate visualizations: Graphs, gauges, tables
High Availability Setup

```yaml
# Prometheus HA with Thanos:
# deploy multiple Prometheus instances with the same scrape config,
# then use Thanos to deduplicate and provide a global view.

# prometheus-1.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '1'
```

```yaml
# prometheus-2.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '2'
```

The Thanos sidecar uploads TSDB blocks to object storage and exposes a StoreAPI for querying.
Capacity Planning Queries

```promql
# Disk space exhaustion prediction (free space goes negative within 4 hours)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

# Memory availability trend, projected 24 hours ahead
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)

# Request rate growth, projected 7 days ahead
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)

# Storage capacity planning
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
```
Advanced Patterns
Federation for Multi-Cluster Monitoring

```yaml
# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # All recording rules
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'
```
Cost Monitoring Pattern

Track cloud costs with custom metrics:

```yaml
groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03  # CPU cost/hour
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005  # Memory cost/hour
          )

      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730  # Hours in an average month
```
Custom SLO Implementation

```yaml
# SLO: 99.9% availability for the API
groups:
  - name: api_slo
    interval: 30s
    rules:
      # Success rate SLI
      - record: api:sli:success_rate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))

      # Error budget remaining (30 days)
      - record: api:error_budget:remaining
        expr: |
          1 - (
            (1 - api:sli:success_rate)
            /
            (1 - 0.999)
          )

      # Latency SLI (p99 < 500ms)
      - record: api:sli:latency_success_rate
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < 0.5
          )
```
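The error-budget rule reduces to simple arithmetic: the observed error ratio divided by the ratio the SLO allows, subtracted from one. A sketch for back-of-the-envelope checks:

```python
def error_budget_remaining(success_ratio, slo_target):
    """Fraction of the error budget left over the measurement window:
    1 - (observed error ratio / allowed error ratio)."""
    return 1 - (1 - success_ratio) / (1 - slo_target)


error_budget_remaining(0.9995, 0.999)  # ~0.5 -> half the budget left
error_budget_remaining(0.999, 0.999)   # ~0.0 -> budget exhausted
```

Negative values mean the SLO has already been violated for the window.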
Examples Summary
This skill includes 20+ comprehensive examples covering:
- Prometheus configuration (basic, Kubernetes SD, storage)
- PromQL queries (instant vectors, range vectors, aggregations)
- Mathematical operations and advanced patterns
- Alert rule definitions (error rate, latency, resource usage)
- Alertmanager configuration (routing, receivers, inhibition)
- Multi-window multi-burn-rate SLO alerts
- Grafana dashboard JSON (full dashboard, RED method, USE method)
- Custom exporters (Python, Go)
- Third-party exporters (PostgreSQL, Blackbox)
- Recording rules for performance
- Federation for multi-cluster monitoring
- Cost monitoring and SLO implementation
- High availability patterns
- Capacity planning queries
Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes