
Prometheus Monitoring and Alerting

Overview

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It is built around a multi-dimensional time-series data model with flexible querying via PromQL.

Architecture Components

  • Prometheus Server: Core component that scrapes and stores time-series data with local TSDB
  • Alertmanager: Handles alerts, deduplication, grouping, routing, and notifications to receivers
  • Pushgateway: Allows ephemeral jobs to push metrics (use sparingly - prefer pull model)
  • Exporters: Convert metrics from third-party systems to Prometheus format (node, blackbox, etc.)
  • Client Libraries: Instrument application code (Go, Java, Python, Rust, etc.)
  • Prometheus Operator: Kubernetes-native deployment and management via CRDs
  • Remote Storage: Long-term storage via Thanos, Cortex, Mimir for multi-cluster federation

Data Model

  • Metrics: Time-series data identified by metric name and key-value labels
  • Format:
    metric_name{label1="value1", label2="value2"} sample_value timestamp
  • Metric Types:
    • Counter: Monotonically increasing value (requests, errors) - use `rate()` or `increase()` for querying
    • Gauge: Value that can go up/down (temperature, memory usage, queue length)
    • Histogram: Observations in configurable buckets (latency, request size) - exposes `_bucket`, `_sum`, `_count`
    • Summary: Similar to histogram but calculates quantiles client-side - use histograms for aggregation
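The sample format above is the plain-text exposition format scraped from `/metrics`. As a minimal sketch (in practice, use an official client library rather than hand-rolling this), rendering one sample looks like:

```python
def render_sample(name, labels, value, timestamp_ms=None):
    """Render one sample in the Prometheus text exposition format:
    metric_name{label1="value1",label2="value2"} sample_value [timestamp]"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    line = f"{name}{{{label_str}}} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line

print(render_sample("http_requests_total",
                    {"method": "GET", "status": "200"}, 1027))
# http_requests_total{method="GET",status="200"} 1027
```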

Setup and Configuration

Basic Prometheus Server Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace

      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

      # Add service name label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node Exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"
```

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Template files for custom notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Route alerts to appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true

    # Database alerts to DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]

    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning alerts if a critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

  # Suppress instance alerts if the entire service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"

  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database Alert: {{ .GroupLabels.alertname }}"

  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true
```

Best Practices

Metric Naming Conventions

Follow these naming patterns for consistency:

```text
# Format: <namespace>_<subsystem>_<metric>_<unit>

# Counters (always use the _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

# Gauges
memory_usage_bytes
active_connections
queue_size

# Histograms (expose _bucket, _sum, _count suffixes automatically)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds
```

Use consistent base units:

  • seconds for duration (not milliseconds)
  • bytes for size (not kilobytes)
  • ratio for percentages (0.0-1.0, not 0-100)
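These conventions are mechanically checkable. A small sketch of a linter: the name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*` is Prometheus's documented metric-name grammar, while the suffix checks are illustrative encodings of the unit conventions above:

```python
import re

# Valid metric names match [a-zA-Z_:][a-zA-Z0-9_:]* (colons are reserved
# for recording rules and should not appear in exported names).
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def check_metric_name(name):
    """Return a list of problems with a metric name (empty = OK)."""
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append("invalid characters")
    # Base-unit conventions: seconds, bytes, ratio
    for bad, good in (("_milliseconds", "_seconds"),
                      ("_kilobytes", "_bytes"),
                      ("_percent", "_ratio")):
        if name.endswith(bad):
            problems.append(f"use {good} instead of {bad}")
    return problems

print(check_metric_name("http_request_duration_seconds"))  # []
print(check_metric_name("latency_milliseconds"))
```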

Label Cardinality Management

DO

```yaml
# Good: Bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

# Good: Reasonable number of label values
db_queries_total{table="users", operation="select"}
```

DON'T

```yaml
# Bad: Unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

# Bad: High cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}
```

Guidelines

  • Keep the number of values per label under 10 (ideally)
  • Total unique time-series per metric should be < 10,000
  • Use recording rules to pre-aggregate high-cardinality metrics
  • Avoid labels with unbounded values (IDs, timestamps, user input)
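Why unbounded labels are dangerous follows from simple arithmetic: the worst-case series count for a metric is the product of its labels' value counts. A quick back-of-the-envelope check (the label sets below are hypothetical):

```python
from math import prod

def estimated_series(label_values: dict) -> int:
    """Worst-case unique time-series for one metric: the product of
    the number of possible values per label."""
    return prod(len(v) for v in label_values.values())

labels = {
    "method":   ["GET", "POST", "PUT", "DELETE"],
    "status":   ["200", "400", "404", "500"],
    "endpoint": ["/api/users", "/api/orders", "/health"],
}
print(estimated_series(labels))  # 4 * 4 * 3 = 48 -- well within budget
# Adding a user_id label with ~100k values would multiply this to 4.8M series.
```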

Recording Rules for Performance

Use recording rules to pre-compute expensive queries:
```yaml
# rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate request rates
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Pre-calculate error rates
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)

      # Pre-calculate error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m

      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )
```

Alert Design (Symptoms vs Causes)

Alert on symptoms (user-facing impact), not causes

```yaml
# alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # GOOD: Alert on user-facing symptoms
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users experiencing slow page loads"

      # GOOD: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))  # 14.4x burn rate for 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO budget burning too fast"
          description: "At current rate, monthly error budget will be exhausted in {{ $value | humanizeDuration }}"
```

Cause-based alerts (use for debugging, not paging)

```yaml
# alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower severity for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning  # Not critical unless symptoms appear
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          action: "Clean up logs or expand disk"
```

Alert Best Practices

  1. For duration: Use the `for` clause to avoid flapping
  2. Meaningful annotations: Include summary, description, runbook URL, impact
  3. Proper severity levels: critical (page immediately), warning (ticket), info (log)
  4. Actionable alerts: Every alert should require human action
  5. Include context: Add labels for team ownership, service, environment

PromQL Query Patterns

PromQL is the query language for Prometheus. Key concepts: instant vectors, range vectors, scalar, string literals, selectors, operators, functions, and aggregation.
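PromQL queries can also be issued programmatically against the HTTP API (`GET /api/v1/query` for instant queries). A small stdlib-only sketch that builds the request URL; the server address is a placeholder:

```python
from urllib.parse import urlencode

def build_query_url(base, promql, time=None):
    """Build an instant-query URL for the Prometheus HTTP API
    (GET /api/v1/query). `time` is an optional evaluation timestamp."""
    params = {"query": promql}
    if time is not None:
        params["time"] = time
    return f"{base}/api/v1/query?{urlencode(params)}"

url = build_query_url("http://localhost:9090",
                      'sum(rate(http_requests_total[5m])) by (service)')
print(url)
```

The JSON response carries `status` and `data.result`; for range queries, `/api/v1/query_range` additionally takes `start`, `end`, and `step`.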

Selectors and Matchers

```promql
# Instant vector selector (latest sample for each time-series)
http_requests_total

# Filter by label values
http_requests_total{method="GET", status="200"}

# Regex matching (=~) and negative regex (!~)
http_requests_total{status=~"5.."}        # 5xx errors
http_requests_total{endpoint!~"/admin.*"} # exclude admin endpoints

# Label absence/presence
http_requests_total{job="api", status=""}  # empty label
http_requests_total{job="api", status!=""} # non-empty label

# Range vector selector (samples over time)
http_requests_total[5m]  # last 5 minutes of samples
```

Rate Calculations

```promql
# Request rate (requests per second) - ALWAYS use rate() for counters
rate(http_requests_total[5m])

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Increase over time window (total count) - for alerts/dashboards showing totals
increase(http_requests_total[1h])

# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])
```
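The reason `rate()` is mandatory for counters is its reset handling: a counter restarts at zero when the process restarts, and a naive `last - first` would go negative. A simplified sketch of the idea (the real function also extrapolates to the window boundaries, which this omits):

```python
def counter_rate(samples):
    """Approximate rate() over (timestamp_seconds, value) samples of a
    counter: per-second increase across the window, compensating for
    counter resets (a reset shows up as a drop in the value)."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a reset the counter restarted at 0, so `cur` is the increase
        increase += cur - prev if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration

# 60s window; the counter resets between t=30 and t=45
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(counter_rate(samples))  # (30 + 30 + 20 + 30) / 60 s
```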

Error Ratios

```promql
# Error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

Histogram Queries

```promql
# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
```
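`histogram_quantile()` estimates the quantile from cumulative bucket counts by finding the bucket the quantile falls into and interpolating linearly inside it, which is why results depend on bucket layout. A simplified sketch of that estimation (edge cases like empty histograms are omitted):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.
    `buckets` maps upper bound (le) -> cumulative count; the float('inf')
    entry plays the role of the +Inf bucket (total observations)."""
    bounds = sorted(buckets)          # ascending le values, inf last
    total = buckets[bounds[-1]]       # +Inf bucket holds the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == float("inf"):
                return prev_bound     # quantile beyond the last finite bucket
            # Linear interpolation within the bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (le - prev_bound) * frac
        prev_bound, prev_count = le, count

buckets = {0.1: 50, 0.5: 90, 1.0: 99, float("inf"): 100}
print(histogram_quantile(0.95, buckets))  # interpolated inside the (0.5, 1.0] bucket
```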

Aggregation Operations

```promql
# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

# Average CPU idle fraction per instance (CPU usage = 1 - this value)
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Maximum value
max(http_request_duration_seconds) by (service)

# Minimum value
min(node_filesystem_avail_bytes) by (instance)

# Count number of instances
count(up == 1) by (job)

# Standard deviation
stddev(http_request_duration_seconds) by (service)
```

Advanced Queries

```promql
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# Bottom 3 instances by available memory
bottomk(3, node_memory_MemAvailable_bytes)

# Predict disk full within 4 hours (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

# Absent metric detection
absent(up{job="critical-service"})
```

Complex Aggregations

```promql
# Calculate Apdex score (Application Performance Index):
# the satisfied (le="0.1") and tolerating (le="0.5") buckets are cumulative,
# so (satisfied + tolerating) / 2 weights tolerating-only requests at 0.5
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

# Multi-window multi-burn-rate SLO (99.9% target, 14.4x burn rate)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)
```
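The Apdex arithmetic is worth seeing with concrete numbers. Because `_bucket` counts are cumulative, satisfied requests appear in both buckets, and dividing the sum by two yields weight 1 for satisfied and 0.5 for tolerating-only requests (the counts below are made up):

```python
def apdex(satisfied_cum, tolerating_cum, total):
    """Apdex from cumulative histogram buckets: satisfied requests are
    counted in both buckets, so (satisfied + tolerating) / 2 / total
    equals (satisfied + 0.5 * tolerating_only) / total."""
    return (satisfied_cum + tolerating_cum) / 2 / total

# 80 requests under 0.1s, 95 under 0.5s (cumulative), 100 total:
# 80 satisfied, 15 tolerating, 5 frustrated -> (80 + 0.5*15) / 100
print(apdex(80, 95, 100))  # 0.875
```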

Binary Operators and Vector Matching

```promql
# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

# One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total

# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
/ on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
/ on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)
```
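Vector matching is essentially a join on label values: `on(...)` picks the join key, and `group_left` allows many left-hand series per right-hand series. A sketch of the many-to-one case with hypothetical series (each series is a `(labels, value)` pair):

```python
def divide_on(left, right, on):
    """Sketch of `left / on(<labels>) group_left right`: each left-hand
    series is divided by the single right-hand series sharing the
    `on` labels. Series are (labels_dict, value) pairs."""
    def key(labels):
        return tuple(labels[k] for k in on)
    rhs = {key(lbls): v for lbls, v in right}       # one entry per join key
    return [(lbls, v / rhs[key(lbls)])
            for lbls, v in left if key(lbls) in rhs]

per_method = [({"instance": "a", "method": "GET"},  30.0),
              ({"instance": "a", "method": "POST"}, 10.0)]
per_instance = [({"instance": "a"}, 40.0)]
print(divide_on(per_method, per_instance, on=("instance",)))
# GET is 0.75 of instance a's traffic, POST is 0.25
```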

Time Functions and Offsets

```promql
# Compare with previous time period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
http_requests_total - http_requests_total offset 1d

# Time-based filtering (match on the empty label set, since hour() has no labels)
http_requests_total and on() (hour() >= 9) and on() (hour() < 17)  # business hours
day_of_week() == 0 or day_of_week() == 6                           # weekends

# Timestamp functions
time() - process_start_time_seconds  # uptime in seconds
```

Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.

Static Configuration

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets:
          - "host1:9100"
          - "host2:9100"
        labels:
          env: production
          region: us-east-1
```

File-based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s
```

targets/webservers.json:

```json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": { "job": "web", "env": "prod" }
  }
]
```
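Because file-based discovery just re-reads JSON/YAML on `refresh_interval`, any tool can feed Prometheus targets by writing that format. A minimal generator sketch (the output path and target list are hypothetical):

```python
import json

def write_file_sd(path, targets, labels):
    """Write a single target group in the file_sd JSON format:
    a list of {"targets": [...], "labels": {...}} objects."""
    groups = [{"targets": targets, "labels": labels}]
    with open(path, "w") as f:
        json.dump(groups, f, indent=2)

write_file_sd("/tmp/webservers.json",
              ["web1:8080", "web2:8080"],
              {"job": "web", "env": "prod"})
```

In production, write to a temp file and rename atomically so Prometheus never reads a half-written file.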

Kubernetes Service Discovery

yaml
scrape_configs:
  # Pod-based discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Extract custom scrape path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Extract custom port from annotation
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node exporters)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: "kubernetes-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
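For the pod-based job above to discover a target, the pod itself must carry the matching annotations. A minimal illustrative pod manifest (name, image, and port are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp                      # hypothetical pod
  namespace: production
  annotations:
    prometheus.io/scrape: "true"   # opt in to scraping
    prometheus.io/path: "/metrics" # picked up by the path relabel rule
    prometheus.io/port: "8080"     # picked up by the port relabel rule
spec:
  containers:
    - name: myapp
      image: myapp:latest          # hypothetical image
      ports:
        - containerPort: 8080
```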

Consul Service Discovery

yaml
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        datacenter: "dc1"
        services: ["web", "api", "cache"]
        tags: ["production"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags

EC2 Service Discovery

yaml
scrape_configs:
  - job_name: "ec2-instances"
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type

DNS Service Discovery

yaml
scrape_configs:
  - job_name: "dns-srv-records"
    dns_sd_configs:
      - names:
          - "_prometheus._tcp.example.com"
        type: "SRV"
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance

Relabeling Actions Reference

Action    | Description                                         | Use Case
----------|-----------------------------------------------------|----------------------------------
keep      | Keep targets where regex matches source labels      | Filter targets by annotation/label
drop      | Drop targets where regex matches source labels      | Exclude specific targets
replace   | Replace target label with value from source labels  | Extract custom labels/paths/ports
labelmap  | Map source label names to target labels via regex   | Copy all Kubernetes labels
labeldrop | Drop labels matching regex                          | Remove internal metadata labels
labelkeep | Keep only labels matching regex                     | Reduce cardinality
hashmod   | Set target label to hash of source labels modulo N  | Sharding/routing
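To make the table concrete, here is a small sketch combining `drop` and `labeldrop` in a `metric_relabel_configs` block (the job name, target, and label name are illustrative):

```yaml
scrape_configs:
  - job_name: "myapp"                  # hypothetical job
    static_configs:
      - targets: ["localhost:8080"]
    metric_relabel_configs:
      # drop: discard entire series whose name matches the regex
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
      # labeldrop: keep the series but remove a high-cardinality label
      - regex: "pod_template_hash"
        action: labeldrop
```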

High Availability and Scalability

Prometheus High Availability Setup


Deploy multiple identical Prometheus instances scraping same targets

Use external labels to distinguish instances

global:
  external_labels:
    replica: prometheus-1 # Change to prometheus-2, etc.
    cluster: production

Alertmanager will deduplicate alerts from multiple Prometheus instances

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093

Alertmanager Clustering


alertmanager.yml - HA cluster configuration

global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:

Start Alertmanager cluster members

alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094

alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094

alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094

Federation for Hierarchical Monitoring


Global Prometheus federating from regional instances

scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        # Pull aggregated metrics only
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}' # Recording rules
        - "up"
    static_configs:
      - targets:
          - "prometheus-us-east-1:9090"
          - "prometheus-us-west-2:9090"
          - "prometheus-eu-west-1:9090"
        labels:
          region: "us-east-1"

Remote Storage for Long-term Retention


Prometheus remote write to Thanos/Cortex/Mimir

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
    write_relabel_configs:
      # Drop high-cardinality metrics before remote write
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

Prometheus remote read from long-term storage

remote_read:
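The body of the `remote_read` block did not survive extraction; a minimal illustrative fragment, assuming a hypothetical Thanos Query endpoint, could look like:

```yaml
remote_read:
  - url: "http://thanos-query:9090/api/v1/read" # hypothetical endpoint
    read_recent: false # serve recent data from the local TSDB
```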

Thanos Architecture for Global View


Thanos Sidecar - runs alongside Prometheus

thanos sidecar
--prometheus.url=http://localhost:9090
--tsdb.path=/prometheus
--objstore.config-file=/etc/thanos/bucket.yml
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902

Thanos Store - queries object storage

thanos store
--data-dir=/var/thanos/store
--objstore.config-file=/etc/thanos/bucket.yml
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902

Thanos Query - global query interface

thanos query
--http-address=0.0.0.0:9090
--grpc-address=0.0.0.0:10901
--store=prometheus-1-sidecar:10901
--store=prometheus-2-sidecar:10901
--store=thanos-store:10901

Thanos Compactor - downsample and compact blocks

thanos compact
--data-dir=/var/thanos/compact
--objstore.config-file=/etc/thanos/bucket.yml
--retention.resolution-raw=30d
--retention.resolution-5m=90d
--retention.resolution-1h=365d

Horizontal Sharding with Hashmod


Split scrape targets across multiple Prometheus instances using hashmod

scrape_configs:
  - job_name: "kubernetes-pods-shard-0"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

  - job_name: "kubernetes-pods-shard-1"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep

  # shard-2 follows the same pattern...
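To predict which shard a given target lands on, the hashmod computation can be sketched in Python. Prometheus hashes the joined source-label values with MD5; treating the low 8 bytes of the digest as a big-endian integer before applying the modulus is an assumption about the exact implementation detail:

```python
import hashlib

def hashmod(value: str, modulus: int) -> int:
    """Approximate Prometheus's hashmod relabel action.

    Assumption: MD5 of the label value, low 8 bytes interpreted as a
    big-endian unsigned integer, then reduced modulo `modulus`.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus

# Assign pods to 3 shards (pod names are hypothetical)
for pod in ["myapp-7d4b9c-abcde", "myapp-7d4b9c-fghij", "myapp-7d4b9c-klmno"]:
    print(pod, "-> shard", hashmod(pod, 3))
```

The useful property is determinism: every Prometheus replica computes the same shard for the same pod name, so each target is scraped by exactly one shard.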

Kubernetes Integration

ServiceMonitor for Prometheus Operator


servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select services to monitor
  selector:
    matchLabels:
      app: myapp

  # Define namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging

  # Endpoint configuration
  endpoints:
    - port: metrics # Service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace

      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop # Drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"

      # Optional: TLS configuration
      # tlsConfig:
      #   insecureSkipVerify: true
      #   ca:
      #     secret:
      #       name: prometheus-tls
      #       key: ca.crt

PodMonitor for Direct Pod Scraping


podmonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select pods to monitor
  selector:
    matchLabels:
      app: myapp

  # Namespace selection
  namespaceSelector:
    matchNames:
      - production

  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s

      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node

PrometheusRule for Alerts and Recording Rules


prometheusrule.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} has restarted {{ $value }} times in 15m"

    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)

        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )

Prometheus Custom Resource


prometheus.yaml

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0

  # Service account for Kubernetes API access
  serviceAccountName: prometheus

  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules

  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m

  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd

  # Retention
  retention: 30d
  retentionSize: 45GB

  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web

  # External labels
  externalLabels:
    cluster: production
    region: us-east-1

  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000

  # Enable admin API for management operations
  enableAdminAPI: false

  # Additional scrape configs (from Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml

Application Instrumentation Examples

Go Application

go
// main.go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter for total requests
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Histogram for request duration
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )

    // Gauge for active connections
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Summary for response sizes
    responseSizeBytes = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_response_size_bytes",
            Help:       "HTTP response size in bytes",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"endpoint"},
    )
)

// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Wrap response writer to capture status code
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

        handler(wrapped, r)

        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
        // Use the numeric status code ("200", "503") so PromQL matchers
        // like status=~"5.." work as shown elsewhere in this guide
        httpRequestsTotal.WithLabelValues(r.Method, endpoint,
            strconv.Itoa(wrapped.statusCode)).Inc()
    }
}
    }
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}

func main() {
    // Register handlers
    http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
    http.Handle("/metrics", promhttp.Handler())

    // Start server
    http.ListenAndServe(":8080", nil)
}

Python Application (Flask)


app.py

from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

active_requests = Gauge(
    'active_requests',
    'Number of active requests'
)

# Middleware for instrumentation
@app.before_request
def before_request():
    active_requests.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    active_requests.dec()

    duration = time.time() - request.start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)

    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()

    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
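Both applications above ultimately serve the Prometheus text exposition format described in the Data Model section. As a stdlib-only sketch of that format (not the client library, which handles escaping, HELP/TYPE lines, and timestamps), one sample can be rendered like this:

```python
def render_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample in Prometheus text exposition format:
    metric_name{label1="value1",label2="value2"} value
    """
    if labels:
        body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"

print(render_sample("http_requests_total",
                    {"method": "GET", "status": "200"}, 1027))
# http_requests_total{method="GET",status="200"} 1027
```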

Production Deployment Checklist

  • Set appropriate retention period (balance storage vs history needs)
  • Configure persistent storage with adequate size
  • Enable high availability (multiple Prometheus replicas or federation)
  • Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
  • Configure service discovery for dynamic environments
  • Implement recording rules for frequently-used queries
  • Create symptom-based alerts with proper annotations
  • Set up Alertmanager with appropriate routing and receivers
  • Configure inhibition rules to reduce alert noise
  • Add runbook URLs to all critical alerts
  • Implement proper label hygiene (avoid high cardinality)
  • Monitor Prometheus itself (meta-monitoring)
  • Set up authentication and authorization
  • Enable TLS for scrape targets and remote storage
  • Configure rate limiting for queries
  • Test alert and recording rule validity (promtool check rules)
  • Implement backup and disaster recovery procedures
  • Document metric naming conventions for the team
  • Create dashboards in Grafana for common queries
  • Set up log aggregation alongside metrics (Loki)

Troubleshooting Commands


Check Prometheus configuration syntax

promtool check config prometheus.yml

Check rules file syntax

promtool check rules alerts/*.yml

Test PromQL queries

promtool query instant http://localhost:9090 'up'

Check which targets are up

curl -s http://localhost:9090/api/v1/targets

Query current metric values

curl -s 'http://localhost:9090/api/v1/query?query=up'

Check service discovery (targets dropped by relabeling)

curl -s 'http://localhost:9090/api/v1/targets?state=dropped'

View TSDB stats

curl -s http://localhost:9090/api/v1/status/tsdb

Check runtime information

curl -s http://localhost:9090/api/v1/status/runtimeinfo

Quick Reference

Common PromQL Patterns


Request rate per second

rate(http_requests_total[5m])
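rate() computes a per-second increase from counter samples and is robust to counter resets. The core idea, without PromQL's extrapolation to the full range boundaries, can be sketched in Python over hypothetical (timestamp, value) samples:

```python
def simple_rate(samples):
    """Per-second rate between the first and last sample.

    samples: list of (unix_ts, counter_value). When the counter value
    drops, assume a reset and count the post-reset value as the increase
    (as PromQL does). Omits PromQL's boundary extrapolation.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur  # reset: restarted near 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Counter resets between t=120 and t=180 (values are hypothetical)
samples = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]
print(simple_rate(samples))  # → 0.875
```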

Error ratio percentage

错误率百分比

100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

P95 latency from histogram

从直方图获取P95延迟

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Average latency from histogram

从直方图获取平均延迟

sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

Memory utilization percentage

内存使用率百分比

100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

CPU utilization (non-idle)

CPU使用率(非空闲)

100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

Disk space remaining percentage

剩余磁盘空间百分比

100 * node_filesystem_avail_bytes / node_filesystem_size_bytes
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

Top 5 endpoints by request rate

请求速率Top 5端点

topk(5, sum(rate(http_requests_total[5m])) by (endpoint))
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

Service uptime in days

服务运行时长(天)

(time() - process_start_time_seconds) / 86400
(time() - process_start_time_seconds) / 86400

Request rate growth compared to 1 hour ago

与1小时前对比的请求速率变化

rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
undefined
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
undefined

Alert Rule Patterns


```yaml
# High error rate (symptom-based)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate is {{ $value | humanizePercentage }}"
    runbook: "https://runbooks.example.com/high-error-rate"

# High latency P95
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 5m
  labels:
    severity: warning

# Service down
- alert: ServiceDown
  expr: up{job="critical-service"} == 0
  for: 2m
  labels:
    severity: critical

# Disk space low (cause-based, warning only)
- alert: DiskSpaceLow
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.1
  for: 10m
  labels:
    severity: warning

# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: warning
```

Recording Rule Naming Convention


```yaml
# Format: level:metric:operations
#   level      = aggregation level (job, instance, cluster)
#   metric     = base metric name
#   operations = transformations applied (rate5m, sum, ratio)
groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)

      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
```
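A side benefit of the naming convention: alerts can reference the recorded series directly, which keeps alert expressions cheap to evaluate and easy to read. A brief sketch using the `job:http_request_errors:ratio` rule above (alert name and threshold are illustrative):

```yaml
- alert: JobErrorBudgetBurn
  # Evaluates the precomputed recording rule instead of the raw rate() expression
  expr: job:http_request_errors:ratio > 0.02
  for: 15m
  labels:
    severity: warning
```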

Metric Naming Best Practices


| Pattern | Good Example | Bad Example |
|---|---|---|
| Counter suffix | `http_requests_total` | `http_requests` |
| Base units | `http_request_duration_seconds` | `http_request_duration_ms` |
| Ratio range | `cache_hit_ratio` (0.0-1.0) | `cache_hit_percentage` (0-100) |
| Byte units | `response_size_bytes` | `response_size_kb` |
| Namespace prefix | `myapp_http_requests_total` | `http_requests_total` |
| Label naming | `{method="GET", status="200"}` | `{httpMethod="GET", statusCode="200"}` |

Label Cardinality Guidelines


| Cardinality | Examples | Recommendation |
|---|---|---|
| Low (<10) | HTTP method, status code, environment | Safe for all labels |
| Medium (10-100) | API endpoint, service name, pod name | Safe with aggregation |
| High (100-1000) | Container ID, hostname | Use only when necessary |
| Unbounded | User ID, IP address, timestamp, URL path | Never use as label |
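To find which metrics are actually driving cardinality, query Prometheus about its own series. A sketch of two common checks (`http_requests_total` and the `endpoint` label are illustrative):

```promql
# Top 10 metric names by active series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count for one suspect metric, broken out by a label
count(http_requests_total) by (endpoint)
```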

Kubernetes Annotation-based Scraping


```yaml
# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
    prometheus.io/scheme: "http"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 8080
          name: metrics
```
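These annotations do nothing on their own; the Prometheus server must be configured to act on them. A sketch of the widely used community relabeling pattern that honors `prometheus.io/*` annotations (job name is illustrative; assumes Kubernetes service discovery is reachable):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the metrics path if prometheus.io/path is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```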

Alertmanager Routing Patterns


```yaml
route:
  receiver: default
  group_by: ["alertname", "cluster"]
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true # Also send to default

    # Team-based routing
    - match:
        team: database
      receiver: dba-team
      group_by: ["alertname", "instance"]

    # Environment-based routing
    - match:
        env: development
      receiver: slack-dev
      repeat_interval: 4h

    # Time-based routing (office hours only)
    - match:
        severity: warning
      receiver: email
      active_time_intervals:
        - business-hours

time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: "09:00"
            end_time: "17:00"
        weekdays: ["monday:friday"]
```
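Routing pairs naturally with inhibition: when a critical alert is already firing, related warnings add noise rather than signal. A minimal sketch of an inhibition rule (uses the matcher syntax of Alertmanager 0.22+; label names are illustrative):

```yaml
inhibit_rules:
  # Suppress warning-level alerts while a critical alert with the same
  # alertname and cluster labels is firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["alertname", "cluster"]
```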

Additional Resources
