Observability & Monitoring

A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.

When to Use This Skill

Use this skill when:
  • Setting up monitoring for production systems and applications
  • Implementing metrics collection and observability for microservices
  • Creating dashboards and visualizations for system health monitoring
  • Defining alerting rules and incident response automation
  • Analyzing system performance and capacity using time-series data
  • Implementing SLIs, SLOs, and SLAs for service reliability
  • Debugging production issues using metrics and traces
  • Building custom exporters for application-specific metrics
  • Setting up federation for multi-cluster monitoring
  • Migrating from legacy monitoring to cloud-native solutions
  • Implementing cost monitoring and optimization tracking
  • Creating real-time operational dashboards for DevOps teams

Core Concepts

The Four Pillars of Observability

Modern observability is built on four fundamental pillars:
  1. Metrics: Numerical measurements of system behavior over time
    • Counter: Monotonically increasing values (requests served, errors)
    • Gauge: Point-in-time values that go up and down (memory usage, temperature)
    • Histogram: Distribution of values (request duration buckets)
    • Summary: Similar to histogram but calculates quantiles on client-side
  2. Logs: Discrete events with contextual information
    • Structured logging (JSON, key-value pairs)
    • Centralized log aggregation (ELK, Loki)
    • Correlation with metrics and traces
  3. Traces: Request flow through distributed systems
    • Span: Single unit of work with start/end time
    • Trace: Collection of spans representing end-to-end request
    • OpenTelemetry for distributed tracing
  4. Events: Significant occurrences in system lifecycle
    • Deployments, configuration changes
    • Scaling events, incidents
    • Business events and user actions
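
To make the histogram type concrete, here is a small pure-Python model of how Prometheus histograms record observations into cumulative `le` ("less than or equal") buckets: each observation increments every bucket whose upper bound is at or above the observed value. The bucket bounds below are illustrative, not a recommendation.

```python
# Minimal model of a Prometheus histogram: cumulative buckets keyed by
# upper bound ("le"), plus a running count and sum of observations.
class Histogram:
    def __init__(self, buckets=(0.1, 0.25, 0.5, 1.0, float("inf"))):
        self.buckets = {le: 0 for le in buckets}
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        for le in self.buckets:
            if value <= le:          # cumulative: every bucket at or above the value
                self.buckets[le] += 1

h = Histogram()
for duration in (0.05, 0.2, 0.3, 0.7, 2.0):
    h.observe(duration)

# A 0.2s observation lands in le=0.25, le=0.5, le=1.0 and le=+Inf.
print(h.buckets)   # {0.1: 1, 0.25: 2, 0.5: 3, 1.0: 4, inf: 5}
```

The cumulative layout is what lets `histogram_quantile()` estimate percentiles server-side from nothing but bucket counters.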

Prometheus Architecture

Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)
  • Stores metrics as time-series data
  • Efficient compression and retention policies
  • Local storage with optional remote storage
Scrape Targets
  • Service discovery (Kubernetes, Consul, EC2, etc.)
  • Static configuration
  • Relabeling for flexible target selection
PromQL Query Engine
  • Powerful query language for metrics analysis
  • Aggregation, filtering, and mathematical operations
  • Range vectors and instant vectors
Alertmanager
  • Alert rule evaluation
  • Grouping, silencing, and routing
  • Integration with PagerDuty, Slack, email, webhooks
Exporters
  • Bridge between Prometheus and systems
  • Node exporter, cAdvisor, custom exporters
  • Third-party exporters for databases, services
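
Because Prometheus pulls, a scrape target is simply an HTTP endpoint returning plain text in the Prometheus exposition format. As a sketch of what an exporter emits on `/metrics` (the metric and sample values here are hypothetical):

```python
# Build a /metrics payload in the Prometheus text exposition format:
# HELP/TYPE metadata lines followed by one line per labeled sample.
def render_metrics(name, help_text, metric_type, samples):
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

payload = render_metrics(
    "http_requests_total",
    "Total HTTP requests served.",
    "counter",
    [({"method": "GET", "status": "200"}, 1024),
     ({"method": "POST", "status": "500"}, 3)],
)
print(payload)
```

In practice you would use an official client library (e.g. `prometheus_client` for Python) rather than formatting this by hand, but the wire format really is this simple.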

Metric Labels and Cardinality

Labels are key-value pairs attached to metrics:
prometheus
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Label Best Practices:
  • Use labels for dimensions you query/aggregate by
  • Avoid high-cardinality labels (user IDs, timestamps)
  • Keep label names consistent across metrics
  • Use relabeling to normalize external labels
Cardinality Considerations:
  • Each unique label combination = new time-series
  • High cardinality = increased memory and storage
  • Monitor cardinality with prometheus_tsdb_symbol_table_size_bytes
  • Use recording rules to pre-aggregate high-cardinality metrics

Recording Rules

Pre-compute frequently-used or expensive queries:
yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: api:http_requests:rate5m
        expr: rate(http_requests_total[5m])
Benefits:
  • Faster dashboard loading
  • Reduced query load on Prometheus
  • Consistent metric naming conventions
  • Enable complex aggregations

Service Level Objectives (SLOs)

Define and track reliability targets:
SLI (Service Level Indicator): Metric measuring service quality
  • Availability: % of successful requests
  • Latency: % of requests under threshold
  • Throughput: Requests per second
SLO (Service Level Objective): Target for SLI
  • 99.9% availability (43.8 minutes downtime/month)
  • 95% of requests < 200ms
  • 1000 RPS sustained
SLA (Service Level Agreement): Contract with consequences
  • External commitments to customers
  • Financial penalties for SLO violations
Error Budget: Acceptable failure rate
  • Error budget = 100% - SLO
  • 99.9% SLO = 0.1% error budget
  • Use budget for innovation vs. reliability tradeoff
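
The downtime figures above follow directly from the error-budget formula, using an average month of 43,830 minutes (365.25 days / 12). A quick check:

```python
# Allowed downtime per month for a given availability SLO.
AVG_MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12   # 43,830 minutes

def error_budget_minutes(slo):
    return (1 - slo) * AVG_MINUTES_PER_MONTH

for slo in (0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min/month")
# 99.90% SLO -> 43.8 min/month
# 99.99% SLO -> 4.4 min/month
```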

Prometheus Setup and Configuration

Basic Prometheus Configuration

yaml
# prometheus.yml
global:
  scrape_interval: 15s      # Default scrape interval
  evaluation_interval: 15s  # Alert rule evaluation interval
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - 'rules/*.yml'
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          env: 'production'
          tier: 'backend'
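
The `relabel_configs` entry in the node job strips the port from the scrape address to produce a clean `instance` label. The same regex can be verified locally with Python's `re` module (Prometheus anchors relabel regexes to the full string, which `re.fullmatch` mirrors here):

```python
import re

# Mirror of the relabeling rule: regex '([^:]+):.*', replacement '${1}'.
def relabel_instance(address):
    m = re.fullmatch(r"([^:]+):.*", address)
    return m.group(1) if m else address  # no match: label left unchanged

print(relabel_instance("node1:9100"))      # node1
print(relabel_instance("10.0.0.7:8080"))   # 10.0.0.7
```

Testing relabel rules this way before deploying avoids silently dropping or mangling targets; `promtool` can also validate the full configuration.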

Kubernetes Service Discovery

yaml
scrape_configs:
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port from prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Use the path from prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Kubernetes services
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance

Storage and Retention

yaml
# Storage configuration (TSDB path and retention are set via
# command-line flags, not in prometheus.yml):
#   --storage.tsdb.path=/prometheus/data
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB

# Remote write for long-term storage
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 5000
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

# Remote read for querying historical data
remote_read:
  - url: "https://prometheus-remote-storage.example.com/api/v1/read"

PromQL: The Prometheus Query Language

Instant Vectors and Selectors

promql
# Basic metric selection
http_requests_total

# Filter by label
http_requests_total{job="api", status="200"}

# Regex matching
http_requests_total{status=~"2..|3.."}

# Negative matching
http_requests_total{status!="500"}

# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}

Range Vectors and Aggregations

promql
# 5-minute range vector
http_requests_total[5m]

# Rate of increase per second
rate(http_requests_total[5m])

# Increase over a time window
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Max/min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])

# Standard deviation over time
stddev_over_time(response_time_seconds[5m])
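
`rate()` is counter-reset aware: when a process restarts and its counter drops, the new value is counted as fresh increase rather than a negative delta. A simplified sketch of that logic (real PromQL additionally extrapolates to the edges of the range window):

```python
# Simplified rate(): per-second increase of a counter over a window,
# treating any decrease between samples as a counter reset.
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), ascending by time."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur  # reset: restarted from 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Counter resets between t=120 and t=180 (process restart).
samples = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]
print(simple_rate(samples))   # (60 + 60 + 30 + 60) / 240 = 0.875
```

This is also why `rate()` must always be applied to raw counters, never to gauges or already-averaged values.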

Aggregation Operators

promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))

# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)

# Count instances that are up
count(up == 1)

# Top-k and bottom-k series
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_available_bytes)

# Quantile across instances
quantile(0.95, http_request_duration_seconds)

Mathematical Operations

promql
# Arithmetic operations: memory usage percentage
(node_memory_total_bytes - node_memory_available_bytes) / node_memory_total_bytes * 100

# Comparison operators
http_request_duration_seconds > 0.5

# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100

# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])

Advanced PromQL Patterns

promql
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency percentiles (histogram)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Predict linear growth (disk space four hours out)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
  > 3 * stddev_over_time(cpu_usage[1h])

# Calculate CPU saturation (USE method)
sum(rate(cpu_seconds_total{mode!="idle"}[5m])) by (instance)
  / count(cpu_seconds_total{mode="idle"}) by (instance)

# Burn rate for a 99.9% SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
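
The `histogram_quantile` pattern above works by locating the cumulative bucket containing the target rank, then interpolating linearly within it. A pure-Python sketch of that calculation, with illustrative bucket counts:

```python
# Sketch of histogram_quantile(): find the cumulative bucket containing
# the target rank, then interpolate linearly inside that bucket.
def histogram_quantile(q, buckets):
    """buckets: ascending list of (le, cumulative_count); last le is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower_bound  # PromQL returns the highest finite bound
            return lower_bound + (le - lower_bound) * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = le, count

buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.9, buckets))   # rank 90 falls in the 0.25–0.5 bucket
```

The interpolation assumes observations are evenly spread within a bucket, so quantile accuracy depends entirely on choosing bucket bounds near the latencies you care about.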

Alerting with Prometheus and Alertmanager

Alert Rule Definitions

yaml
# alerts/api_alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/HighErrorRate"

      # High latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s on {{ $labels.service }}"

      # Service down alert
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Memory pressure alert
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # CPU saturation alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"

Alertmanager Configuration

yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route tree for alert distribution
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    # Critical alerts also go to Slack
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 0s

    # Warning alerts to Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'

    # Team-specific routing
    - match:
        team: backend
      receiver: 'team-backend'

    - match:
        team: frontend
      receiver: 'team-frontend'

    # Database alerts to DBA team
    - match_re:
        service: 'postgres|mysql|mongodb'
      receiver: 'team-dba'

# Alert receivers/integrations
receivers:
  - name: 'team-default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'
        send_resolved: true

  - name: 'team-backend'
    slack_configs:
      - channel: '#team-backend'
        send_resolved: true
    email_configs:
      - to: 'backend-team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password_file: '/etc/alertmanager/email_password'

  - name: 'team-frontend'
    slack_configs:
      - channel: '#team-frontend'
        send_resolved: true

  - name: 'team-dba'
    slack_configs:
      - channel: '#team-dba'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'DBA_PAGERDUTY_KEY'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Inhibit warnings if a critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # Don't alert on instance down if the cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'InstanceDown|ServiceDown'
    equal: ['cluster']
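
The route tree is evaluated top-down: the first matching route wins unless it sets `continue: true`, in which case evaluation falls through to later routes. A minimal model of that dispatch (flat route list with exact label matches only; real Alertmanager also supports nested routes and regex matchers):

```python
# Minimal model of Alertmanager routing: walk routes in order, collect
# the receiver of each matching route, and stop at the first match that
# does not set continue: true. Fall back to the default receiver.
def dispatch(alert_labels, routes, default_receiver):
    receivers = []
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default_receiver]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

print(dispatch({"severity": "critical"}, routes, "team-default"))
# ['pagerduty-critical', 'slack-critical']
print(dispatch({"severity": "info"}, routes, "team-default"))
# ['team-default']
```

This is why route ordering matters: a broad route placed before a team-specific one will swallow alerts unless it sets `continue: true`.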

Multi-Window Multi-Burn-Rate Alerts for SLOs

yaml
# SLO-based alerting using burn rate
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Fast burn (1h window, confirmed by 5m window)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 14.4x faster than normal"

      # Slow burn (6h window, confirmed by 30m window)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          slo: "99.9%"
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 6x faster than normal"
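
The 14.4 and 6 multipliers are not arbitrary: a burn rate of N consumes the error budget N times faster than a steady burn that would exactly exhaust it over the 30-day SLO window, so 14.4 sustained for one hour spends roughly 2% of the month's budget. A quick check of the arithmetic:

```python
# Burn-rate arithmetic for a 30-day SLO window (720 hours).
# burn_rate = observed error rate / (1 - SLO); a steady burn rate of 1
# exhausts the budget exactly at the end of the window.
SLO_WINDOW_HOURS = 30 * 24

def budget_fraction_consumed(burn_rate, window_hours):
    return burn_rate * window_hours / SLO_WINDOW_HOURS

print(budget_fraction_consumed(14.4, 1))    # ~2% of the budget per hour (fast burn)
print(budget_fraction_consumed(6, 6))       # ~5% of the budget in six hours (slow burn)

# On the error-rate scale, the alert threshold is burn_rate * (1 - SLO):
print(14.4 * (1 - 0.999))                   # ~1.44% errors trips the fast-burn alert
```

Pairing each long window with a short confirmation window (1h with 5m, 6h with 30m) keeps alerts responsive while letting them resolve quickly once the error rate recovers.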

Grafana Dashboards and Visualization

Dashboard JSON Structure

json
{
  "dashboard": {
    "title": "API Performance Dashboard",
    "tags": ["api", "performance", "production"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "timepicker": {
      "refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
      "time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
    },
    "templating": {
      "list": [
        {
          "name": "cluster",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up, cluster)",
          "refresh": 1,
          "multi": false,
          "includeAll": false
        },
        {
          "name": "service",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up{cluster=\"$cluster\"}, service)",
          "refresh": 1,
          "multi": true,
          "includeAll": true
        },
        {
          "name": "interval",
          "type": "interval",
          "query": "1m,5m,10m,30m,1h",
          "auto": true,
          "auto_count": 30,
          "auto_min": "10s"
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
            "legendFormat": "{{ service }}",
            "refId": "A"
          }
        ],
        "yaxes": [
          {"format": "reqps", "label": "Requests/sec"},
          {"format": "short"}
        ],
        "legend": {
          "show": true,
          "values": true,
          "current": true,
          "avg": true,
          "max": true
        }
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
            "legendFormat": "{{ service }} error %",
            "refId": "A"
          }
        ],
        "yaxes": [
          {"format": "percent", "label": "Error Rate"},
          {"format": "short"}
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"params": [], "type": "avg"},
              "type": "query"
            }
          ],
          "executionErrorState": "alerting",
          "frequency": "1m",
          "handler": 1,
          "name": "High Error Rate",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 3,
        "title": "Latency Percentiles",
        "type": "graph",
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p99",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p50",
            "refId": "C"
          }
        ],
        "yaxes": [
          {"format": "s", "label": "Duration"},
          {"format": "short"}
        ]
      }
    ]
  }
}
json
{
  "dashboard": {
    "title": "API性能仪表盘",
    "tags": ["api", "performance", "production"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "timepicker": {
      "refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
      "time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
    },
    "templating": {
      "list": [
        {
          "name": "cluster",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up, cluster)",
          "refresh": 1,
          "multi": false,
          "includeAll": false
        },
        {
          "name": "service",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(up{cluster=\"$cluster\"}, service)",
          "refresh": 1,
          "multi": true,
          "includeAll": true
        },
        {
          "name": "interval",
          "type": "interval",
          "query": "1m,5m,10m,30m,1h",
          "auto": true,
          "auto_count": 30,
          "auto_min": "10s"
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "请求速率",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
            "legendFormat": "{{ service }}",
            "refId": "A"
          }
        ],
        "yaxes": [
          {"format": "reqps", "label": "请求数/秒"},
          {"format": "short"}
        ],
        "legend": {
          "show": true,
          "values": true,
          "current": true,
          "avg": true,
          "max": true
        }
      },
      {
        "id": 2,
        "title": "错误率",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
            "legendFormat": "{{ service }} 错误率%",
            "refId": "A"
          }
        ],
        "yaxes": [
          {"format": "percent", "label": "错误率"},
          {"format": "short"}
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "reducer": {"params": [], "type": "avg"},
              "type": "query"
            }
          ],
          "executionErrorState": "alerting",
          "frequency": "1m",
          "handler": 1,
          "name": "高错误率",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 3,
        "title": "延迟分位数",
        "type": "graph",
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p99",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p50",
            "refId": "C"
          }
        ],
        "yaxes": [
          {"format": "s", "label": "延迟时间"},
          {"format": "short"}
        ]
      }
    ]
  }
}

RED Method Dashboard

RED方法仪表盘

The RED method focuses on Request rate, Error rate, and Duration:
json
{
  "panels": [
    {
      "title": "Request Rate (per service)",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
        }
      ]
    },
    {
      "title": "Error Rate % (per service)",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
        }
      ]
    },
    {
      "title": "Duration p99 (per service)",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
        }
      ]
    }
  ]
}
RED方法聚焦于请求速率(Request rate)、错误率(Error rate)和延迟(Duration):
json
{
  "panels": [
    {
      "title": "请求速率(按服务)",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
        }
      ]
    },
    {
      "title": "错误率百分比(按服务)",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
        }
      ]
    },
    {
      "title": "P99延迟(按服务)",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
        }
      ]
    }
  ]
}
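Each RED panel reduces to the same counter arithmetic: a per-second rate over a window, and for errors, a ratio of two such rates. A toy illustration with hypothetical counter samples (PromQL's rate() also handles counter resets, which this sketch ignores):

```python
def rate(start_value: float, end_value: float, window_seconds: float) -> float:
    """Per-second increase of a counter over a window, like PromQL's rate()."""
    return (end_value - start_value) / window_seconds

# Hypothetical samples taken 300s (5m) apart
requests = rate(10_000, 13_000, 300)   # 10.0 req/s
errors = rate(120, 150, 300)           # 0.1 errors/s

error_pct = errors / requests * 100
print(requests, round(error_pct, 2))   # 10.0 1.0
```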

USE Method Dashboard

USE方法仪表盘

The USE method monitors Utilization, Saturation, and Errors:
json
{
  "panels": [
    {
      "title": "CPU Utilization %",
      "targets": [
        {
          "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
        }
      ]
    },
    {
      "title": "CPU Saturation (Load Average)",
      "targets": [
        {
          "expr": "node_load1"
        }
      ]
    },
    {
      "title": "Memory Utilization %",
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
        }
      ]
    },
    {
      "title": "Disk I/O Utilization %",
      "targets": [
        {
          "expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
        }
      ]
    },
    {
      "title": "Network Errors",
      "targets": [
        {
          "expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
        }
      ]
    }
  ]
}
USE方法监控利用率(Utilization)、饱和度(Saturation)和错误(Errors):
json
{
  "panels": [
    {
      "title": "CPU利用率%",
      "targets": [
        {
          "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
        }
      ]
    },
    {
      "title": "CPU饱和度(负载均值)",
      "targets": [
        {
          "expr": "node_load1"
        }
      ]
    },
    {
      "title": "内存利用率%",
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
        }
      ]
    },
    {
      "title": "磁盘I/O利用率%",
      "targets": [
        {
          "expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
        }
      ]
    },
    {
      "title": "网络错误",
      "targets": [
        {
          "expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
        }
      ]
    }
  ]
}
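The CPU panel above works by inversion: node_cpu_seconds_total{mode="idle"} accumulates idle seconds, so its per-second rate is the idle fraction, and utilization is what remains. A small sketch of that inversion:

```python
def cpu_utilization_pct(idle_start: float, idle_end: float,
                        window_seconds: float) -> float:
    """100 minus the idle fraction, per core, as in the panel's expression."""
    idle_fraction = (idle_end - idle_start) / window_seconds
    return 100 - idle_fraction * 100

# Hypothetical: one core accumulated 45 idle seconds over a 60s window
print(cpu_utilization_pct(1000.0, 1045.0, 60.0))  # 25.0
```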

Exporters and Metric Collection

导出器与指标采集

Node Exporter for System Metrics

系统指标的Node Exporter


Install node_exporter

安装node_exporter

wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"

**Key Metrics from Node Exporter:**
- `node_cpu_seconds_total`: CPU usage by mode
- `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`: Memory
- `node_disk_io_time_seconds_total`: Disk I/O
- `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`: Network
- `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: Disk space
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"

**Node Exporter核心指标:**
- `node_cpu_seconds_total`: 按模式统计的CPU使用率
- `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`: 内存指标
- `node_disk_io_time_seconds_total`: 磁盘I/O指标
- `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`: 网络指标
- `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: 磁盘空间指标

Custom Application Exporter (Python)

自定义应用导出器(Python)


app_exporter.py

app_exporter.py

from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random

Define metrics

定义指标

REQUEST_COUNT = Counter(
    'app_requests_total', 'Total app requests',
    ['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
    'app_request_duration_seconds', 'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge('app_active_users', 'Number of active users')
QUEUE_SIZE = Gauge('app_queue_size', 'Current queue size', ['queue_name'])
DATABASE_CONNECTIONS = Gauge(
    'app_database_connections', 'Number of database connections',
    ['pool', 'state']
)
CACHE_HITS = Counter('app_cache_hits_total', 'Total cache hits', ['cache_name'])
CACHE_MISSES = Counter('app_cache_misses_total', 'Total cache misses', ['cache_name'])

def simulate_metrics():
    """Simulate application metrics"""
    while True:
        # Simulate requests
        method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
        endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
        status = random.choice(['200', '200', '200', '400', '500'])
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()

        # Simulate request duration
        duration = random.uniform(0.01, 2.0)
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)

        # Update gauges
        ACTIVE_USERS.set(random.randint(100, 1000))
        QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
        QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))

        # Database connection pool
        DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
        DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))

        # Cache metrics
        if random.random() > 0.3:
            CACHE_HITS.labels(cache_name='redis').inc()
        else:
            CACHE_MISSES.labels(cache_name='redis').inc()

        time.sleep(1)

if __name__ == '__main__':
    # Start metrics server on port 8000
    start_http_server(8000)
    print("Metrics server started on port 8000")
    simulate_metrics()
REQUEST_COUNT = Counter(
    'app_requests_total', '应用总请求数',
    ['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
    'app_request_duration_seconds', '请求时长(秒)',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge('app_active_users', '活跃用户数')
QUEUE_SIZE = Gauge('app_queue_size', '当前队列长度', ['queue_name'])
DATABASE_CONNECTIONS = Gauge(
    'app_database_connections', '数据库连接数',
    ['pool', 'state']
)
CACHE_HITS = Counter('app_cache_hits_total', '缓存命中总数', ['cache_name'])
CACHE_MISSES = Counter('app_cache_misses_total', '缓存未命中总数', ['cache_name'])

def simulate_metrics():
    """模拟应用指标"""
    while True:
        # 模拟请求
        method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
        endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
        status = random.choice(['200', '200', '200', '400', '500'])
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()

        # 模拟请求时长
        duration = random.uniform(0.01, 2.0)
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)

        # 更新Gauge指标
        ACTIVE_USERS.set(random.randint(100, 1000))
        QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
        QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))

        # 数据库连接池
        DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
        DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))

        # 缓存指标
        if random.random() > 0.3:
            CACHE_HITS.labels(cache_name='redis').inc()
        else:
            CACHE_MISSES.labels(cache_name='redis').inc()

        time.sleep(1)

if __name__ == '__main__':
    # 在8000端口启动指标服务
    start_http_server(8000)
    print("Metrics server started on port 8000")
    simulate_metrics()

Custom Exporter (Go)

自定义导出器(Go)

go
package main

import (
    "log"
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_requests_total",
            Help: "Total number of requests",
        },
        []string{"method", "endpoint", "status"},
    )

    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "app_request_duration_seconds",
            Help:    "Request duration in seconds",
            Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
        },
        []string{"method", "endpoint"},
    )

    activeUsers = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "app_active_users",
            Help: "Number of active users",
        },
    )

    databaseConnections = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "app_database_connections",
            Help: "Number of database connections",
        },
        []string{"pool", "state"},
    )
)

func init() {
    prometheus.MustRegister(requestsTotal)
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(activeUsers)
    prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        // Simulate requests
        methods := []string{"GET", "POST", "PUT", "DELETE"}
        endpoints := []string{"/api/users", "/api/products", "/api/orders"}
        statuses := []string{"200", "200", "200", "400", "500"}

        method := methods[rand.Intn(len(methods))]
        endpoint := endpoints[rand.Intn(len(endpoints))]
        status := statuses[rand.Intn(len(statuses))]

        requestsTotal.WithLabelValues(method, endpoint, status).Inc()
        requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

        // Update gauges
        activeUsers.Set(float64(rand.Intn(900) + 100))
        databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
        databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
    }
}

func main() {
    go simulateMetrics()

    http.Handle("/metrics", promhttp.Handler())
    log.Println("Starting metrics server on :8000")
    log.Fatal(http.ListenAndServe(":8000", nil))
}
go
package main

import (
    "log"
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_requests_total",
            Help: "总请求数",
        },
        []string{"method", "endpoint", "status"},
    )

    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "app_request_duration_seconds",
            Help:    "请求时长(秒)",
            Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
        },
        []string{"method", "endpoint"},
    )

    activeUsers = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "app_active_users",
            Help: "活跃用户数",
        },
    )

    databaseConnections = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "app_database_connections",
            Help: "数据库连接数",
        },
        []string{"pool", "state"},
    )
)

func init() {
    prometheus.MustRegister(requestsTotal)
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(activeUsers)
    prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        // 模拟请求
        methods := []string{"GET", "POST", "PUT", "DELETE"}
        endpoints := []string{"/api/users", "/api/products", "/api/orders"}
        statuses := []string{"200", "200", "200", "400", "500"}

        method := methods[rand.Intn(len(methods))]
        endpoint := endpoints[rand.Intn(len(endpoints))]
        status := statuses[rand.Intn(len(statuses))]

        requestsTotal.WithLabelValues(method, endpoint, status).Inc()
        requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

        // 更新仪表盘指标
        activeUsers.Set(float64(rand.Intn(900) + 100))
        databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
        databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
    }
}

func main() {
    go simulateMetrics()

    http.Handle("/metrics", promhttp.Handler())
    log.Println("Starting metrics server on :8000")
    log.Fatal(http.ListenAndServe(":8000", nil))
}

PostgreSQL Exporter

PostgreSQL导出器


docker-compose.yml for postgres_exporter

postgres_exporter的docker-compose.yml

version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'

**Key PostgreSQL Metrics:**
- `pg_up`: Database reachability
- `pg_stat_database_tup_returned`: Rows read
- `pg_stat_database_tup_inserted`: Rows inserted
- `pg_stat_database_deadlocks`: Deadlock count
- `pg_stat_replication_lag`: Replication lag in seconds
- `pg_locks_count`: Active locks
version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'

**PostgreSQL核心指标:**
- `pg_up`: 数据库可达性
- `pg_stat_database_tup_returned`: 读取的行数
- `pg_stat_database_tup_inserted`: 插入的行数
- `pg_stat_database_deadlocks`: 死锁数量
- `pg_stat_replication_lag`: 复制延迟(秒)
- `pg_locks_count`: 活跃锁数量

Blackbox Exporter for Probing

探测用Blackbox Exporter


blackbox.yml

blackbox.yml

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"


Prometheus config for blackbox exporter

用于Blackbox Exporter的Prometheus配置

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Best Practices

最佳实践

Metric Naming Conventions

指标命名规范

Follow Prometheus naming best practices:
遵循Prometheus命名最佳实践:

Format: <namespace>_<subsystem>_<metric>_<unit>

格式:<命名空间>_<子系统>_<指标>_<单位>

Good examples

优秀示例

http_requests_total              # Counter
http_request_duration_seconds    # Histogram
database_connections_active     # Gauge
cache_hits_total                 # Counter
memory_usage_bytes               # Gauge

http_requests_total              # 计数器(Counter)
http_request_duration_seconds    # 直方图(Histogram)
database_connections_active     # 仪表(Gauge)
cache_hits_total                 # 计数器(Counter)
memory_usage_bytes               # 仪表(Gauge)

Include unit suffixes

包含单位后缀

_seconds, _bytes, _total, _ratio, _percentage
_seconds, _bytes, _total, _ratio, _percentage

Avoid

避免以下写法

RequestCount      # Use snake_case
http_requests     # Missing _total for counter
request_time      # Missing unit (should be _seconds)

RequestCount      # 应使用蛇形命名法
http_requests     # 计数器缺少_total后缀
request_time      # 缺少单位(应为_seconds)

Label Guidelines

标签指南


Good: Low cardinality labels

推荐:低基数标签

http_requests_total{method="GET", endpoint="/api/users", status="200"}
http_requests_total{method="GET", endpoint="/api/users", status="200"}

Bad: High cardinality labels (avoid)

不推荐:高基数标签(避免使用)

http_requests_total{user_id="12345", session_id="abc-def-ghi"}
http_requests_total{user_id="12345", session_id="abc-def-ghi"}

Good: Use bounded label values

推荐:使用有限的标签值

http_requests_total{status_class="2xx"}
http_requests_total{status_class="2xx"}

Bad: Unbounded label values

不推荐:无限的标签值

http_requests_total{response_size="1234567"}
http_requests_total{response_size="1234567"}
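The bounded-label pattern above is easiest to enforce at instrumentation time: collapse raw values into a small fixed set before they ever become label values. A minimal sketch:

```python
def status_class(status_code: int) -> str:
    """Map an HTTP status code to one of at most five bounded label values."""
    return f"{status_code // 100}xx"

print(status_class(204), status_class(301), status_class(503))  # 2xx 3xx 5xx
```

The same trick applies to response sizes (bucket into "small"/"medium"/"large") or user agents (normalize to a known list) whenever a raw value would otherwise explode cardinality.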

Recording Rule Patterns

记录规则模式

yaml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-aggregate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Namespace aggregations
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI calculations
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
yaml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # 预聚合开销大的查询
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # 命名空间聚合
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI计算
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

Alert Design Principles

告警设计原则

  1. Alert on symptoms, not causes: Alert on user-facing issues
  2. Make alerts actionable: Include runbook links
  3. Use appropriate severity levels: Critical, warning, info
  4. Set proper thresholds: Based on historical data
  5. Include context in annotations: Help on-call engineers
  6. Group related alerts: Reduce alert fatigue
  7. Use inhibition rules: Suppress redundant alerts
  8. Test alert rules: Verify they fire when expected
  1. 告警症状而非原因:针对用户可见的问题告警
  2. 告警需可执行:包含运行手册链接
  3. 使用合适的严重级别:严重、警告、信息
  4. 设置合理阈值:基于历史数据
  5. 在注解中包含上下文:帮助值班工程师快速定位
  6. 关联告警分组:减少告警疲劳
  7. 使用抑制规则:屏蔽冗余告警
  8. 测试告警规则:验证触发条件
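Principle 8 can be automated with promtool's rule unit tests. A sketch, assuming a hypothetical `alerts.yml` containing an `InstanceDown` rule (labeled `severity: critical`) that fires after `up == 0` persists for 5 minutes:

```yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="api-1"}'
        values: '0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 8m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: api-1
```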

Dashboard Best Practices

仪表盘最佳实践

  1. One dashboard per audience: SRE, developers, business
  2. Use consistent time ranges: Make comparisons easier
  3. Include SLI/SLO metrics: Show business impact
  4. Add annotations for deploys: Correlate changes with metrics
  5. Use template variables: Make dashboards reusable
  6. Show trends and aggregates: Not just raw metrics
  7. Include links to runbooks: Enable quick response
  8. Use appropriate visualizations: Graphs, gauges, tables
  1. 按受众设计仪表盘:SRE、开发、业务人员
  2. 使用一致的时间范围:便于对比
  3. 包含SLI/SLO指标:展示业务影响
  4. 添加部署注解:关联变更与指标波动
  5. 使用模板变量:提升仪表盘复用性
  6. 展示趋势与聚合:而非仅原始指标
  7. 包含运行手册链接:快速响应
  8. 使用合适的可视化类型:图表、仪表盘、表格

High Availability Setup

高可用部署


Prometheus HA with Thanos

基于Thanos的Prometheus高可用方案

Deploy multiple Prometheus instances with same config

部署多个配置相同的Prometheus实例

Use Thanos to deduplicate and provide global view

使用Thanos去重并提供全局视图

prometheus-1.yml

prometheus-1.yml

global:
  external_labels:
    cluster: 'prod'
    replica: '1'

prometheus-2.yml

prometheus-2.yml

global:
  external_labels:
    cluster: 'prod'
    replica: '2'

Thanos sidecar configuration

Thanos sidecar配置

Uploads blocks to object storage

将数据块上传至对象存储

Provides StoreAPI for querying

提供StoreAPI用于查询


Capacity Planning Queries

容量规划查询


Disk space exhaustion prediction

磁盘空间耗尽预测

predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

Memory growth trend

内存增长趋势

predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)

Request rate growth

请求速率增长趋势

predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)

Storage capacity planning

存储容量规划

prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
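predict_linear fits a least-squares line through the samples in the range and extrapolates it forward. The same idea in plain Python (this sketch extrapolates from the last sample's timestamp; PromQL's exact semantics around the evaluation timestamp differ slightly):

```python
def predict_linear(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Least-squares fit over (timestamp, value) pairs, extrapolated forward."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / sum(
        (t - mean_t) ** 2 for t, _ in samples
    )
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Hypothetical: free disk space falling 1 GB every 10 minutes
samples = [(0, 20e9), (600, 19e9), (1200, 18e9)]
print(predict_linear(samples, 4 * 3600) < 0)  # True: disk full within 4 hours
```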

Advanced Patterns

高级模式

Federation for Multi-Cluster Monitoring

多集群监控联邦


Global Prometheus federating from cluster Prometheus instances

全局Prometheus从集群Prometheus实例聚合数据

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # All recording rules
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # 所有记录规则
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'

Cost Monitoring Pattern

成本监控模式


Track cloud costs with custom metrics

使用自定义指标追踪云成本

groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03  # CPU cost/hour
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005  # Memory cost/hour
          )
      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730  # Hours in average month
groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03  # 每小时CPU成本
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005  # 每小时内存成本
          )
      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730  # 平均每月小时数
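The recording rule above is just requested-resources-times-rate arithmetic. A sketch of the same calculation (US$0.03 per core-hour and US$0.005 per GiB-hour are illustrative placeholders from the rule, not real cloud prices):

```python
def hourly_cost(cpu_cores: float, memory_bytes: float,
                cpu_rate: float = 0.03, mem_rate: float = 0.005) -> float:
    """Requested CPU and memory priced per hour, mirroring the rule's expr."""
    memory_gib = memory_bytes / 1024 / 1024 / 1024
    return cpu_cores * cpu_rate + memory_gib * mem_rate

# Hypothetical cluster: 40 requested cores, 128 GiB requested memory
hourly = hourly_cost(40, 128 * 1024**3)
print(round(hourly, 2), round(hourly * 730, 2))  # 1.84 1343.2
```

Note that this prices requests, not usage: a persistently over-requesting workload shows up as cost even when its actual utilization is low, which is exactly what makes the metric useful for right-sizing.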

Custom SLO Implementation

自定义SLO实现


SLO: 99.9% availability for API

SLO:API可用性99.9%

groups:
  - name: api_slo
    interval: 30s
    rules:
      # Success rate SLI
      - record: api:sli:success_rate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))

      # Error budget remaining (30 days)
      - record: api:error_budget:remaining
        expr: |
          1 - (
            (1 - api:sli:success_rate)
            /
            (1 - 0.999)
          )

      # Latency SLI (p99 < 500ms)
      - record: api:sli:latency_success_rate
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < 0.5
          )
groups:
  • name: api_slo interval: 30s rules:

    成功率SLI

    • record: api:sli:success_rate expr: | sum(rate(http_requests_total{job="api",status=~"2.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m]))

    剩余错误预算(30天)

    • record: api:error_budget:remaining expr: | 1 - ( (1 - api:sli:success_rate) / (1 - 0.999) )

    延迟SLI(p99 < 500ms)

    • record: api:sli:latency_success_rate expr: | ( histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le) ) < 0.5 )
undefined
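A numeric walk-through of the error-budget rule: with a 99.9% target, a measured success rate of 99.95% means half of the 30-day error budget remains. A minimal sketch of that arithmetic (the 0.9995 sample value is illustrative):

对错误预算规则做一个数值演算:在99.9%目标下,若实测成功率为99.95%,则30天错误预算剩余一半。以下为该算术的示意实现(0.9995为示例数值):

```python
SLO_TARGET = 0.999  # 99.9% availability, as in the rules above

def error_budget_remaining(success_rate: float, target: float = SLO_TARGET) -> float:
    # Mirrors the rule: 1 - ((1 - sli) / (1 - target))
    return 1 - ((1 - success_rate) / (1 - target))

print(round(error_budget_remaining(0.9995), 3))  # 0.5
print(round(error_budget_remaining(0.999), 3))   # 0.0 -- budget exhausted
```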

Examples Summary

示例汇总

This skill includes 20+ comprehensive examples covering:
  1. Prometheus configuration (basic, Kubernetes SD, storage)
  2. PromQL queries (instant vectors, range vectors, aggregations)
  3. Mathematical operations and advanced patterns
  4. Alert rule definitions (error rate, latency, resource usage)
  5. Alertmanager configuration (routing, receivers, inhibition)
  6. Multi-window multi-burn-rate SLO alerts
  7. Grafana dashboard JSON (full dashboard, RED method, USE method)
  8. Custom exporters (Python, Go)
  9. Third-party exporters (PostgreSQL, Blackbox)
  10. Recording rules for performance
  11. Federation for multi-cluster monitoring
  12. Cost monitoring and SLO implementation
  13. High availability patterns
  14. Capacity planning queries

Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes
本技能包含20+个完整示例,覆盖:
  1. Prometheus配置(基础、Kubernetes服务发现、存储)
  2. PromQL查询(瞬时向量、范围向量、聚合)
  3. 数学运算与高级模式
  4. 告警规则定义(错误率、延迟、资源使用率)
  5. Alertmanager配置(路由、接收方、抑制)
  6. 多窗口多燃烧率SLO告警
  7. Grafana仪表盘JSON(完整仪表盘、RED方法、USE方法)
  8. 自定义导出器(Python、Go)
  9. 第三方导出器(PostgreSQL、Blackbox)
  10. 性能相关记录规则
  11. 多集群监控联邦
  12. 成本监控与SLO实现
  13. 高可用模式
  14. 容量规划查询

技能版本: 1.0.0
最后更新: 2025年10月
技能分类: 可观测性、监控、SRE、DevOps
兼容工具: Prometheus、Grafana、Alertmanager、OpenTelemetry、Kubernetes