
PromQL Query Patterns


PromQL is a functional query language for time series data. Every query returns either an instant vector (one value per label set at a point in time), a range vector (a sliding window of samples), or a scalar.

**Golden rule:** `rate()` and `increase()` always require a range vector. The range must be at least 4x the scrape interval to avoid gaps. For a 60s scrape interval, use `[5m]` minimum.
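
The arithmetic behind the golden rule can be sketched in a few lines of Python. This is only an illustration: the helper name `is_safe_rate_window` and its `factor` parameter are assumptions made for this example, not anything Prometheus provides.

```python
def is_safe_rate_window(range_seconds, scrape_interval_seconds, factor=4):
    """Return True when the range covers at least `factor` scrape intervals."""
    return range_seconds >= factor * scrape_interval_seconds


# 60s scrape interval: [5m] = 300s covers 4 x 60s = 240s, [2m] = 120s does not.
print(is_safe_rate_window(300, 60))
print(is_safe_rate_window(120, 60))
```

With a 15s scrape interval the same check accepts `[1m]`, which is why short ranges that work on one job can leave gaps on another.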


Rate and counter queries

**Rate (per-second average over a window):**
```promql
rate(http_requests_total[5m])
```

**Rate with label aggregation:** "sum then rate" is wrong; always rate first, then sum.
```promql
# CORRECT: rate first, then aggregate
sum(rate(http_requests_total{job="api"}[5m])) by (status_code)

# WRONG: sum first destroys the counter monotonicity
sum(http_requests_total) by (status_code)  # do NOT then rate() this
```
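
To see why the order matters, here is a minimal Python sketch of counter-reset handling. It assumes a simplified model (any drop in value is treated as a reset to zero, and the extrapolation Prometheus applies at window edges is ignored); `counter_increase` and the sample values are made up for illustration.

```python
def counter_increase(samples):
    """Increase of a counter over its samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted from 0, so the whole new value counts.
        total += (cur - prev) if cur >= prev else cur
    return total


instance_a = [100, 110, 120, 130]  # no reset
instance_b = [500, 510, 5, 15]     # process restarted after the second sample

# Rate-then-sum: resets are detected per series.
per_series = counter_increase(instance_a) + counter_increase(instance_b)

# Sum-then-rate: the reset of one instance looks like a reset of the whole sum.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_first = counter_increase(summed)

print(per_series, sum_first)
```

In this model the per-series result is 55 while the sum-first result is 165, because the restart of one instance makes the combined series look as if it had reset from 620 to zero.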

**Increase (total count over a window, not per-second):**
```promql
increase(http_requests_total[1h])
```

**irate vs rate:**
- `rate()`: smooth average over the full window. Use for dashboards and alerts.
- `irate()`: instantaneous rate from the last two samples. Use only when you need to capture spikes that `rate()` would average away. Never use for alerting.

Filtering with label matchers

```promql
# Exact match
http_requests_total{job="api", status_code="200"}

# Regex match (anchored automatically)
http_requests_total{status_code=~"5.."}

# Negative regex
http_requests_total{status_code!~"2.."}

# Multiple values with regex OR
http_requests_total{env=~"staging|production"}
```
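
"Anchored automatically" means the pattern must match the entire label value, not a substring. A rough Python analogy is `re.fullmatch`; note that Prometheus uses RE2 syntax rather than Python's `re`, so this sketch only holds for simple patterns like the ones above, and `label_matches` is a name invented for the example.

```python
import re


def label_matches(value: str, pattern: str) -> bool:
    """Mimic `=~`: the pattern has to match the whole label value."""
    return re.fullmatch(pattern, value) is not None


print(label_matches("503", "5.."))                              # whole value matches
print(label_matches("1503", "5.."))                             # substring only
print(label_matches("production", "staging|production"))        # one alternative matches fully
print(label_matches("production-eu", "staging|production"))     # prefix only
```

Only the first and third calls match. To match a substring in PromQL, the wildcard has to be written out, for example `=~".*5.."`.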

---

Aggregation operators

Always aggregate after `rate()`:
```promql
# Sum across all instances, keep service label
sum(rate(http_requests_total[5m])) by (service)

# Average idle CPU fraction per node, drop all other labels
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# 95th percentile request duration
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# Count of distinct values of a label (here: job)
count(count(up) by (job)) by ()
```

**`without` vs `by`:**
```promql
# Keep only the labels listed
sum(rate(http_requests_total[5m])) by (service, status_code)

# Drop only the labels listed, keep everything else
sum(rate(http_requests_total[5m])) without (instance, pod)
```
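
The difference between the two modifiers is easiest to see on a toy data set. The Python sketch below models series as `(labels, value)` pairs; `aggregate_sum` and the sample series are assumptions for illustration and do not reflect how Prometheus implements aggregation internally.

```python
from collections import defaultdict


def aggregate_sum(series, labels, mode="by"):
    """Sum values grouped on `labels` (mode="by") or on all other labels (mode="without")."""
    groups = defaultdict(float)
    for label_set, value in series:
        if mode == "by":
            key = {k: v for k, v in label_set.items() if k in labels}
        else:
            key = {k: v for k, v in label_set.items() if k not in labels}
        groups[tuple(sorted(key.items()))] += value
    return dict(groups)


series = [
    ({"service": "api", "status_code": "200", "instance": "a", "pod": "p1"}, 10.0),
    ({"service": "api", "status_code": "200", "instance": "b", "pod": "p2"}, 5.0),
    ({"service": "api", "status_code": "500", "instance": "a", "pod": "p1"}, 1.0),
]

by_result = aggregate_sum(series, {"service", "status_code"}, mode="by")
without_result = aggregate_sum(series, {"instance", "pod"}, mode="without")

# A label added later (hypothetical "region") survives `without` but not `by`.
series_with_region = [({**labels, "region": "eu"}, value) for labels, value in series]
by_region = aggregate_sum(series_with_region, {"service", "status_code"}, mode="by")
without_region = aggregate_sum(series_with_region, {"instance", "pod"}, mode="without")
```

On the original series both calls give the same groups. Once the extra label appears, `by` still returns two label pairs per group while `without` carries `region` through, which is usually what you want for labels you did not anticipate.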

---

Histogram quantiles

Native histograms (Prometheus 2.40+) and classic histograms use different syntax.

**Classic histogram (bucket metrics with `_bucket` suffix):**
```promql
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)
```

**Multi-service comparison:**
```promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
```

**Common mistake:** forgetting `by (le)` in the inner aggregation drops the bucket boundaries, making `histogram_quantile` produce wrong results or NaN.

**Native histograms (simpler syntax):**
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))
```
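
For classic histograms, the reason `le` must survive the aggregation becomes clearer with a sketch of the estimate itself. The Python below is a simplified approximation of the linear interpolation over cumulative buckets, assuming `0 < q < 1` and a lowest bucket that starts at 0; the function name and bucket values are made up for the example, and edge cases Prometheus handles are left out.

```python
import math


def classic_histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets given as {upper_bound: count}."""
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        return math.nan  # without the +Inf bucket the total count is unknown
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                return prev_bound  # quantile lies beyond the last finite bucket
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative per-second rates per `le` boundary (illustrative numbers).
buckets = {0.1: 50.0, 0.5: 90.0, 1.0: 100.0, math.inf: 100.0}
p95 = classic_histogram_quantile(0.95, buckets)
no_le = classic_histogram_quantile(0.95, {})  # what is left when `le` is aggregated away
```

Here the 95th percentile lands halfway through the 0.5 to 1.0 bucket. With the boundaries gone there is nothing to interpolate between, so the sketch returns NaN.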

Ratio and error rate

```promql
# Error ratio (errors / total)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success rate as percentage
(1 -
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# Avoid division by zero: the `> 0` filter drops a zero denominator,
# so the query returns no data instead of NaN
sum(rate(errors_total[5m])) / (sum(rate(requests_total[5m])) > 0)
```

---

Absence and staleness

```promql
# Alert when a metric disappears (e.g. a job stops reporting)
absent(up{job="api"})

# Alert when a metric value hasn't changed (potential stale exporter)
changes(up{job="api"}[5m]) == 0

# Check if a metric has been present in the last window
count_over_time(up{job="api"}[5m]) > 0
```

---

Time functions and offsets

```promql
# Compare current value to 1 hour ago
rate(http_requests_total[5m])
  -
rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1d)

# Predict value in 2 hours based on current trend (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 2 * 3600)
```
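
The idea behind `predict_linear` can be sketched as an ordinary least-squares fit followed by extrapolation. This Python version is an approximation under stated assumptions: it extrapolates from the last sample's timestamp rather than the query evaluation time, and the function name and sample data are invented for illustration.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit v = slope * t + intercept by least squares and extrapolate past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# One hour of samples, one per minute: 10 GB free, shrinking by 1 MB per second.
samples = [(t, 10e9 - 1e6 * t) for t in range(0, 3601, 60)]
predicted = predict_linear_sketch(samples, 2 * 3600)
print(predicted < 0)
```

A negative prediction for available bytes is what turns this into a "disk will be full" signal, which is why alert expressions compare the result against 0.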

---

Recording rules

Recording rules pre-compute expensive queries, improving dashboard load time and reducing Prometheus query load. Store them in a rules file loaded by Prometheus or Grafana Mimir.

```yaml
groups:
  - name: http_request_rates
    interval: 1m
    rules:
      # Pre-compute per-service request rate
      - record: job:http_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (job)

      # Pre-compute error ratio per service
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # Pre-compute p95 latency per service (avoids expensive histogram_quantile on dashboards)
      - record: job:http_request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
```

**Naming convention:** `<aggregation_level>:<metric_name>:<operation_and_window>`

SLO queries

```promql
# Availability SLO: fraction of successful requests over 30 days
1 - (
  sum(increase(http_requests_total{status_code=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)

# Error budget burn rate (1h window, alerting when burning > 14.4x the allowed rate)
(
  sum(rate(http_requests_total{status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999)  # replace 0.999 with your SLO target
```
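
The arithmetic behind the burn-rate expression can be checked with a short Python sketch. It assumes a 30-day SLO period to match the availability query above; both helper names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_fraction_consumed(rate, window_hours, slo_period_days=30):
    """Share of the whole error budget used if `rate` is sustained for `window_hours`."""
    return rate * window_hours / (slo_period_days * 24)


# With a 99.9% target, a 1.44% error ratio burns the budget 14.4x too fast.
rate_1h = burn_rate(0.0144, 0.999)
spent = budget_fraction_consumed(14.4, window_hours=1)
print(round(rate_1h, 1), round(spent, 4))
```

Sustaining a 14.4x burn rate for one hour uses 14.4 / 720 = 2% of a 30-day budget, which is where the threshold in the comment comes from.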

---

Cardinality and performance

High cardinality label values (UUIDs, user IDs, URLs) make queries slow and storage expensive.

```promql
# Find metrics with the most label combinations (run in Grafana Explore)
topk(10, count by (__name__)({__name__=~".+"}))

# Find series count for a specific metric
count(http_requests_total)

# Check label value cardinality
count(count by (user_id)(http_requests_total))
```

**Rules for controlling cardinality:**
- Never put high-cardinality values (request IDs, user IDs, email addresses) in label values
- Group URLs into route patterns: `/api/users/123` → `/api/users/{id}`
- Use `metric_relabel_configs` to drop labels before ingestion

```alloy
// Drop a high-cardinality label during scrape (Alloy shown; in a Prometheus
// scrape config, use a labeldrop rule under metric_relabel_configs)
prometheus.scrape "api" {
  targets    = [...]
  forward_to = [prometheus.relabel.drop_user_id.receiver]
}

// "drop_user_id" is a placeholder name; labeldrop matches label names via regex
prometheus.relabel "drop_user_id" {
  forward_to = [...]
  rule {
    action = "labeldrop"
    regex  = "user_id"
  }
}
```

---

Common patterns

**Service availability (for use in alert rules):**
```promql
avg_over_time(up{job="api"}[5m]) < 0.9
```

**Saturation (resource near-full):**
```promql
# Disk filling up (predict full in < 4h based on 1h trend)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
```

**Throughput spike:**
```promql
# Current rate > 3x the 1-hour average
rate(http_requests_total[5m])
  >
3 * avg_over_time(rate(http_requests_total[5m])[1h:5m])
```

---

References
