# PromQL Query Patterns

PromQL is a functional query language for time series data. Every query returns either an
instant vector (one value per label set at a point in time), a range vector (a sliding
window of samples), or a scalar.

**Golden rule:** `rate()` and `increase()` always require a range vector. The range must be at
least 4x the scrape interval to avoid gaps. For a 60s scrape interval, use `[5m]` as the minimum.

---

## Rate and counter queries

**Rate (per-second average over a window):**

```promql
rate(http_requests_total[5m])
```

**Rate with label aggregation:** "sum then rate" is wrong; always rate first, then sum.

```promql
# CORRECT: rate first, then aggregate
sum(rate(http_requests_total{job="api"}[5m])) by (status_code)

# WRONG: sum first destroys the counter monotonicity
sum(http_requests_total) by (status_code)  # do NOT then rate() this
```

**Increase (total count over a window, not per-second):**

```promql
increase(http_requests_total[1h])
```

**irate vs rate:**

- `rate()` - smooth average over the full window. Use for dashboards and alerts.
- `irate()` - instantaneous rate from the last two samples. Use only when you need to capture spikes that `rate()` would average away. Never use for alerting.
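
To see why the order matters, the sketch below mimics counter-reset handling in plain Python. It is a simplified model rather than Prometheus code: `counter_increase` and the sample values are hypothetical, and the real `rate()` and `increase()` also extrapolate to the window boundaries.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset.

    Simplified: no extrapolation to the window edges.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


# Two hypothetical instances; instance_b restarts after its third sample.
instance_a = [1000, 1010, 1020, 1030, 1040]
instance_b = [500, 510, 520, 5, 15]

# Rate first, then sum: each series handles its own reset.
rate_then_sum = counter_increase(instance_a) + counter_increase(instance_b)

# Sum first, then rate: the restart looks like a reset of the whole total.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_rate = counter_increase(summed)

print(rate_then_sum, sum_then_rate)
```

With these samples the per-series approach reports an increase of 75, while summing first reports 1095, because the drop in the summed series is misread as the entire total resetting.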

---

## Filtering with label matchers

```promql
# Exact match
http_requests_total{job="api", status_code="200"}

# Regex match (anchored automatically)
http_requests_total{status_code=~"5.."}

# Negative regex
http_requests_total{status_code!~"2.."}

# Multiple values with regex OR
http_requests_total{env=~"staging|production"}
```

---

## Aggregation operators

Always aggregate after `rate()`:

```promql
# Sum across all instances, keep service label
sum(rate(http_requests_total[5m])) by (service)

# Average idle CPU per node (across cores), drop all other labels
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# 95th percentile request duration
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# Count of distinct values of the job label
count(count(up) by (job))
```

**`without` vs `by`:**

```promql
# Keep only the labels listed
sum(rate(http_requests_total[5m])) by (service, status_code)

# Drop only the labels listed, keep everything else
sum(rate(http_requests_total[5m])) without (instance, pod)
```

---

## Histogram quantiles

Native histograms (Prometheus 2.40+) and classic histograms use different syntax.

**Classic histogram (bucket metrics with `_bucket` suffix):**

```promql
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)
```

**Multi-service comparison:**

```promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
```

**Common mistake:** forgetting `by (le)` in the inner aggregation drops the bucket boundaries,
making `histogram_quantile` produce wrong results or NaN.

**Native histograms (simpler syntax):**

```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))
```
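
The reason `le` must survive the inner aggregation is easier to see from how a classic-histogram quantile is estimated. The following Python sketch is a simplified model under stated assumptions (cumulative buckets, `0 < q <= 1`, a `+Inf` bucket present); the function and the bucket values are illustrative, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Finds the bucket the target rank falls into, then interpolates
    linearly between that bucket's lower and upper bound.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the +Inf bucket: fall back to the
                # highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


# Hypothetical per-bucket rates keyed by their `le` upper bound.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95)
```

Here the 95th percentile falls in the `le="1"` bucket and interpolates to roughly 0.81 seconds. Without the upper bounds that the `le` label carries, there is nothing to interpolate between.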

---

## Ratio and error rate

```promql
# Error ratio (errors / total)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success rate as percentage
(1 -
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# Avoid division by zero: filter out a zero denominator, then fall back to 0 with or vector(0)
(
  sum(rate(errors_total[5m]))
  /
  (sum(rate(requests_total[5m])) > 0)
) or vector(0)
```

---

## Absence and staleness

```promql
# Alert when a metric disappears (e.g. a job stops reporting)
absent(up{job="api"})

# Alert when a metric value hasn't changed (potential stale exporter)
changes(up{job="api"}[5m]) == 0

# Check if a metric has been present in the last window
count_over_time(up{job="api"}[5m]) > 0
```

---

## Time functions and offsets

```promql
# Compare current value to 1 hour ago
rate(http_requests_total[5m])
-
rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1d)

# Predict value in 2 hours based on current trend (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 2 * 3600)
```
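
As a rough illustration of what a linear-regression prediction does, the Python sketch below fits a least-squares line through `(timestamp, value)` samples and extends it forward. It is a simplified model: the function, the sample data, and the choice to predict relative to the last sample (rather than the query evaluation time) are assumptions made for the example, and it expects at least two samples with distinct timestamps.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extended forward.

    Returns the predicted value `seconds_ahead` after the last sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical free disk space in GB: one sample every 10 minutes for an
# hour, losing 1 GB per sample (20 GB down to 14 GB).
samples = [(i * 600, 20 - i) for i in range(7)]

in_two_hours = predict_linear(samples, 2 * 3600)
in_four_hours = predict_linear(samples, 4 * 3600)
print(in_two_hours, in_four_hours)
```

With this trend the prediction is about 2 GB free in two hours and a negative value in four hours, which is the condition a `predict_linear(...) < 0` alert checks for.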

---

## Recording rules

Recording rules pre-compute expensive queries, improving dashboard load time and reducing
Prometheus query load. Store them in a rules file loaded by Prometheus or Grafana Mimir.

```yaml
groups:
  - name: http_request_rates
    interval: 1m
    rules:
      # Pre-compute per-service request rate
      - record: job:http_requests_total:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (job)

      # Pre-compute error ratio per service
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # Pre-compute p95 latency per service (avoids expensive histogram_quantile on dashboards)
      - record: job:http_request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
```

**Naming convention:** `<aggregation_level>:<metric_name>:<operation_and_window>`
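
If rule names are generated or linted by tooling, the three-part convention can be checked mechanically. The helper below is a hypothetical Python sketch, not part of Prometheus; the character set allowed in each part is an assumption chosen for the example.

```python
import re

# One colon-separated part: letters, digits and underscores.
# (Assumed for this sketch; adjust to your own conventions.)
_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
_RULE_NAME = re.compile(rf"^{_PART}:{_PART}:{_PART}$")


def rule_name(level, metric, operations):
    """Build a level:metric:operations name and reject malformed parts."""
    name = f"{level}:{metric}:{operations}"
    if not _RULE_NAME.match(name):
        raise ValueError(f"not a level:metric:operations name: {name!r}")
    return name


print(rule_name("job", "http_requests_total", "rate5m"))
```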

---

## SLO queries

```promql
# Availability SLO: fraction of successful requests over 30 days
1 - (
  sum(increase(http_requests_total{status_code=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)

# Error budget burn rate (1h window, alerting when burning > 14.4x the allowed rate)
(
  sum(rate(http_requests_total{status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
/
(1 - 0.999)  # replace 0.999 with your SLO target
```
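
The arithmetic behind the burn-rate query can be sketched in a few lines of Python. The function names and input numbers below are illustrative assumptions; the 14.4x threshold is the commonly used value that corresponds to spending 2% of a 30-day error budget within one hour.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def budget_spent(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's total error budget spent during the window."""
    return rate * window_hours / period_hours


# Hypothetical measurement: 1.44% of requests failed over the last hour
# against a 99.9% availability target.
observed = burn_rate(error_ratio=0.0144, slo_target=0.999)

# Sustaining a 14.4x burn rate for one hour of a 30-day period.
spent = budget_spent(14.4, window_hours=1)

print(observed, spent)
```

A 1.44% error ratio against a 99.9% target is a burn rate of about 14.4, and one hour at that rate uses about 2% of the 30-day budget.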

---

## Cardinality and performance

High cardinality label values (UUIDs, user IDs, URLs) make queries slow and storage expensive.

```promql
# Find metrics with the most label combinations (run in Grafana Explore)
topk(10, count by (__name__)({__name__=~".+"}))

# Find series count for a specific metric
count(http_requests_total)

# Check label value cardinality
count(count by (user_id)(http_requests_total))
```

**Rules for controlling cardinality:**

- Never put high-cardinality values (request IDs, user IDs, email addresses) in label values
- Group URLs into route patterns: `/api/users/123` → `/api/users/{id}`
- Use `metric_relabel_configs` to drop labels before ingestion (`relabel_configs` only rewrites target labels before the scrape)

```alloy
// Drop a high-cardinality label before ingestion (Grafana Alloy).
// In a Prometheus scrape config, use metric_relabel_configs with action: labeldrop and regex: user_id.
prometheus.scrape "api" {
  targets    = [...]
  forward_to = [prometheus.relabel.drop_user_id.receiver]
}

prometheus.relabel "drop_user_id" {
  // Assumes a remote_write component named "default" is defined elsewhere.
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    action = "labeldrop"
    regex  = "user_id"
  }
}
```
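
Grouping URLs into route patterns usually happens in the application before the label is set. The Python sketch below shows one possible approach; the function name, the `{id}` and `{uuid}` placeholders, and the rule that numeric or UUID-shaped segments are identifiers are all assumptions for illustration. Most web frameworks expose the matched route template directly, which is preferable when available.

```python
import re

# Hypothetical rule: UUID-shaped path segments are identifiers.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_pattern(path):
    """Collapse identifier-like path segments into fixed placeholders."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_pattern("/api/users/123"))
```

Every request to `/api/users/<number>` then shares the single label value `/api/users/{id}` instead of creating one series per user.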

---

## Common patterns

**Service availability (for use in alert rules):**

```promql
avg_over_time(up{job="api"}[5m]) < 0.9
```

**Saturation (resource near-full):**

```promql
# Disk filling up (predict full in < 4h based on 1h trend)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
```

**Throughput spike:**

```promql
# Current rate > 3x the 1-hour average
rate(http_requests_total[5m])
>
3 * avg_over_time(rate(http_requests_total[5m])[1h:5m])
```

---