Metrics with Prometheus and Grafana

Value Proposition

Prometheus is an open-source monitoring and alerting toolkit for cloud-native environments. Grafana Cloud Metrics (powered by Grafana Mimir) adds a fully managed, Prometheus-compatible service with long-term storage, global query performance, and enterprise scalability.

Key Differentiators: Pull-based model, dimensional data model with labels, PromQL, automatic service discovery, scales to billions of active series.

PromQL Quick Reference

Instant Vector Selectors

By metric name

http_requests_total

Label filter

http_requests_total{job="api-server"}

Multiple labels (AND)

http_requests_total{job="api-server", method="GET"}

Regex

http_requests_total{job=~"api.*", status=~"5.."}

Negative

http_requests_total{status!="200"}
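The matcher semantics above can be illustrated with a minimal Python sketch. It is not how Prometheus implements selection; the `matches` helper and the sample series are hypothetical. It relies on two documented behaviours: regex matchers are fully anchored, and a missing label is treated like an empty string. Note that Python's `re` module is used here in place of the RE2 syntax that Prometheus uses.

```python
import re

def matches(labels, matchers):
    """Return True if a label set satisfies every matcher (AND semantics).

    `matchers` is a list of (label_name, operator, value) tuples.
    This helper is hypothetical and only mimics PromQL matcher behaviour.
    """
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True

# Hypothetical series, written as plain label dictionaries.
series = [
    {"__name__": "http_requests_total", "job": "api-server", "status": "200"},
    {"__name__": "http_requests_total", "job": "api-server", "status": "503"},
    {"__name__": "http_requests_total", "job": "web", "status": "500"},
]

# Equivalent of http_requests_total{job=~"api.*", status=~"5.."}
selector = [("job", "=~", "api.*"), ("status", "=~", "5..")]
selected = [s for s in series if matches(s, selector)]
```

Because the regex is anchored, `status=~"5.."` does not match a value such as `1500`, which an unanchored search would accept.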

Range Vectors & Rates

Per-second rate over 5 minutes

rate(http_requests_total[5m])

Increase over interval

increase(http_requests_total[1h])

Instant rate (last two samples)

irate(http_requests_total[5m])

Offset (5 minutes ago)

rate(http_requests_total[5m] offset 5m)
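To show what `increase` and `rate` compute over a range vector, here is a simplified Python sketch. It handles counter resets the way the functions are documented to (a drop in value is treated as a restart from zero), but it deliberately omits the extrapolation to the window boundaries that Prometheus performs, so real results will differ slightly. The function names and sample data are hypothetical.

```python
def simple_increase(samples):
    """Sum the counter growth across (timestamp_seconds, value) samples.

    Samples must be sorted by timestamp. A decrease is treated as a
    counter reset, so the new value counts as growth from zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # counter reset: assume restart from zero
        else:
            total += cur - prev
    return total

def simple_rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return None               # not enough samples to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed

# Hypothetical scrape results, 60 s apart, with a reset before t=120.
samples = [(0, 100.0), (60, 160.0), (120, 20.0), (180, 80.0)]
growth = simple_increase(samples)   # 60 + 20 + 60 = 140.0
per_second = simple_rate(samples)   # 140 / 180
```

`irate` would use only the last two samples of the window, which is why it reacts faster but is noisier than `rate`.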

Aggregations

Sum by label

sum by (job) (rate(http_requests_total[5m]))

Average

avg by (instance) (node_cpu_seconds_total)

Top-K

topk(5, rate(http_requests_total[5m]))

Histogram quantiles

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Count matching series

count(up{job="api"})
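As a rough illustration of what `histogram_quantile` does with classic `_bucket` series, the sketch below interpolates linearly inside the bucket that contains the requested rank. It is a simplification under stated assumptions: the input is one histogram's cumulative bucket counts (already passed through `rate`), the lowest bucket is assumed to start at 0, and the names and data are hypothetical.

```python
import math

def simple_histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return lower_bound
            if count == lower_count:
                return lower_bound  # empty bucket, nothing to interpolate
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical request-duration buckets (seconds, cumulative counts).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = simple_histogram_quantile(0.95, buckets)  # lands in the (0.5, 1.0] bucket
p50 = simple_histogram_quantile(0.5, buckets)   # lands in the first bucket
```

The estimate is only as precise as the bucket boundaries, so bucket layout should be chosen around the latencies that matter for the service.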

Common Patterns

Error rate percentage

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Saturation (CPU usage %)

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage

node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

Predict disk full within 24 hours (linear extrapolation)

predict_linear(node_filesystem_free_bytes[6h], 24*3600) < 0
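The idea behind `predict_linear` is an ordinary least-squares line fitted to the samples in the range and then extended into the future. The Python sketch below shows that idea only. One assumption to note: it extrapolates from the timestamp of the last sample, whereas Prometheus extrapolates from the query evaluation time. The function name and data are hypothetical.

```python
def simple_predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and
    return the predicted value `seconds_ahead` after the last sample.

    Requires at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    t_future = samples[-1][0] + seconds_ahead
    return slope * t_future + intercept

# Hypothetical free space in GiB, sampled hourly for 6 hours,
# shrinking by 1 GiB per hour (10, 9, ..., 4).
samples = [(h * 3600, 10.0 - h) for h in range(7)]
predicted = simple_predict_linear(samples, 24 * 3600)
disk_will_fill = predicted < 0  # mirrors the `< 0` comparison in the query
```

A longer range smooths out short bursts of writes, at the cost of reacting more slowly to a genuine change in trend.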

Metrics Drilldown

Queryless Prometheus metrics exploration (preinstalled in Grafana 12+):

Browse metrics without writing PromQL
Smart segmentation and anomaly detection
Auto-visualization with optimal chart types
Metric relationship discovery
Telemetry pivoting (metrics to logs)

Alerting

Prometheus Alertmanager

Route, group, silence, and deduplicate alerts. Multi-destination routing (PagerDuty, Slack, Email, webhooks).

Grafana Alerting

Unified alerting across all data sources. Supports multi-dimensional alerts, notification policies, and contact points.

Recording Rules

Pre-compute expensive PromQL queries for dashboard performance:

yaml

groups:
  - name: api_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

Architecture

Pull-based scraping: Prometheus scrapes HTTP endpoints at configured intervals
Service discovery: Automatic target discovery for K8s, EC2, Consul
Pushgateway: For short-lived jobs that can't be scraped
Remote write/read: Send metrics to Grafana Cloud, Thanos, Mimir
Local storage: Efficient on-disk time-series database
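In the pull model listed above, each target exposes its current metric values as plain text over HTTP and Prometheus fetches them on every scrape. The following Python sketch renders a counter in the text exposition format. It is illustrative only: the `render_exposition` helper is hypothetical, it does not escape special characters in label values, and in practice an official client library would normally produce this output.

```python
def render_exposition(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Label values are assumed to need no escaping.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} counter",
    ]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical application state that a /metrics handler would return.
body = render_exposition(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 42)],
)
```

Serving this text from an HTTP endpoint (conventionally `/metrics`) is all a target needs to do; scheduling, labelling with `job` and `instance`, and storage happen on the Prometheus side.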
