prometheus
Prometheus
PromQL Gotchas
Counter Functions (Critical)
Counters only increase, apart from resetting to zero when the process restarts. Never graph or alert on raw counter values; always wrap them in a rate function:
promql
rate(http_requests_total[5m]) # Per-second average rate
irate(http_requests_total[5m]) # Instant rate (last 2 points, spiky)
increase(http_requests_total[1h]) # Total increase over range
- rate() handles counter resets automatically
- Use rate() for dashboards, irate() only for high-resolution spikes
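To see why raw counter values mislead, here is a minimal Python sketch of a reset-aware increase. It is a simplified model rather than Prometheus's own implementation: the real rate() and increase() also extrapolate to the edges of the range window. The function names and the sample data are hypothetical.

```python
def simple_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets.

    Simplified model: no extrapolation to the window edges.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second average rate over the span covered by the samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# Hypothetical scrape data: the process restarts between t=30 and t=45.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0), (60, 50.0)]

naive_delta = samples[-1][1] - samples[0][1]  # negative: misleading
reset_aware = simple_increase(samples)        # counts growth on both sides of the reset
```

Subtracting the first raw value from the last gives a negative number here, while the reset-aware version still reports the growth before and after the restart.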
Range Vector Required
Rate functions need a range vector, written as [duration]:
promql
rate(metric[5m]) # Correct
rate(metric) # Error: expected range vector
Vector Matching
Binary operations require matching labels:
promql
# Returns no result where label sets differ (unmatched series are dropped silently):
metric_a / metric_b
# Ignore extra labels:
metric_a / ignoring(extra_label) metric_b
# Match on specific labels only:
metric_a / on(common_label) metric_b
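The matching behaviour can be sketched in Python as a lookup on a label "signature". This is a simplified model of one-to-one matching only: it assumes each signature is unique on both sides, whereas real PromQL raises an error for many-to-one matches unless group_left or group_right is used. The divide function and the label values are hypothetical.

```python
def divide(lhs, rhs, on=None, ignoring=()):
    """One-to-one division of two instant vectors.

    Each vector maps a frozenset of (label, value) pairs to a sample value.
    """
    def signature(labels):
        if on is not None:
            return frozenset((k, v) for k, v in labels if k in on)
        return frozenset((k, v) for k, v in labels if k not in ignoring)

    rhs_by_sig = {signature(labels): value for labels, value in rhs.items()}
    result = {}
    for labels, value in lhs.items():
        sig = signature(labels)
        if sig in rhs_by_sig:  # unmatched series are dropped, not reported
            result[sig] = value / rhs_by_sig[sig]
    return result


metric_a = {frozenset({("job", "api"), ("extra_label", "x")}): 10.0}
metric_b = {frozenset({("job", "api")}): 4.0}

plain = divide(metric_a, metric_b)                               # label sets differ
ignored = divide(metric_a, metric_b, ignoring=("extra_label",))  # extra label dropped
on_job = divide(metric_a, metric_b, on=("job",))                 # match on job only
```

The plain division comes back empty, which is why a mismatched query often looks like "no data" rather than an error.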
Histogram Quantiles
promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
- Must use the _bucket metric, which carries the le label
- Always wrap in rate() for counters
- by (le) is required; add other labels as needed: by (le, endpoint)
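The reason le must survive the aggregation is that the quantile is interpolated from the cumulative bucket boundaries. A rough Python sketch of that estimate follows, assuming classic cumulative buckets, a quantile in (0, 1], and evenly spread observations inside each bucket. It is an illustration of the idea, not Prometheus's exact code, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Cannot interpolate into the open-ended bucket, so fall
                # back to the highest finite boundary.
                return prev_le
            # Linear interpolation inside the bucket that holds the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count


# Hypothetical per-bucket counts (or rates) keyed by upper bound `le`.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, buckets)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.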
Common Query Patterns
promql
# Error ratio (multiply by 100 for a percentage)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# CPU usage in percent (node_exporter)
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)
# Memory usage as a fraction of total
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Container memory in bytes per pod (Kubernetes)
sum by (pod) (container_memory_working_set_bytes{container!=""})
Alerting Rules
yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)
          > 0.05
        for: 5m # Condition must hold this long before the alert fires
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.job }}"
for Clause
- Prevents flapping on brief spikes
- Alert stays "pending" until the duration is met
- Omitting for means the alert fires immediately
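The pending-to-firing transition can be pictured as a small state machine. The Python sketch below is a simplification that measures the for duration in whole rule evaluations (duration divided by evaluation interval) instead of timestamps; alert_states and its parameters are hypothetical names, not part of Prometheus.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation of the alert expression.
    for_evals: number of evaluation intervals the `for` duration spans
               (0 models a rule with no `for` clause).
    """
    states = []
    active = 0  # consecutive evaluations where the condition was true
    for is_true in condition_by_eval:
        active = active + 1 if is_true else 0
        if active == 0:
            states.append("inactive")
        elif active > for_evals:
            states.append("firing")
        else:
            states.append("pending")
    return states


# A brief dip at the third evaluation resets the pending timer.
history = [True, True, False, True, True, True, True]
with_for = alert_states(history, for_evals=3)
without_for = alert_states([True, False], for_evals=0)
```

With the for clause the short interruption sends the alert back to inactive and the wait starts over; without it the alert fires on the first true evaluation.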
Recording Rules
Pre-compute expensive queries:
yaml
rules:
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
Naming convention: level:metric:operations
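As a small illustration of the level:metric:operations convention, the Python helper below splits a rule name into its three parts. parse_rule_name is a hypothetical helper for linting rule files, not a Prometheus API, and it only checks the three-part shape, not the content of each part.

```python
def parse_rule_name(name):
    """Split a recording rule name into level, metric and operations.

    Raises ValueError when the name does not have exactly three
    colon-separated parts.
    """
    level, metric, operations = name.split(":")
    return {"level": level, "metric": metric, "operations": operations}


parts = parse_rule_name("job:http_requests:rate5m")
```

Here the level is the aggregation label (job), the metric is the base name without the _total suffix, and the operations part records the function and window that were applied.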
Staleness
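- Samples older than 5 minutes are "stale": an instant query ignores them (5 minutes is the default lookback window)
- up == 0 only fires while Prometheus is still trying to scrape the target; once the target leaves service discovery, up itself goes stale
- Use absent(metric) to detect metrics that are missing entirely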
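The lookback behaviour behind these points can be sketched in Python as follows. This is a simplified model that assumes the default 5-minute lookback and ignores the staleness markers Prometheus writes when a target or series disappears, which make a series stale sooner. The function names and sample data are hypothetical.

```python
LOOKBACK_SECONDS = 300  # assumed default lookback of 5 minutes


def instant_value(samples, eval_time, lookback=LOOKBACK_SECONDS):
    """Value an instant query would see at eval_time, or None if stale."""
    candidates = [
        (ts, value)
        for ts, value in samples
        if eval_time - lookback < ts <= eval_time
    ]
    if not candidates:
        return None  # nothing recent enough: the series is treated as absent
    return max(candidates)[1]  # most recent sample inside the window


def absent(samples, eval_time):
    """Mimic absent(): 1 when the series has no current value, else None."""
    return 1 if instant_value(samples, eval_time) is None else None


# Hypothetical `up` samples: the last scrape at t=120 failed, then the
# target vanished from service discovery.
up_samples = [(0, 1.0), (60, 1.0), (120, 0.0)]

soon_after = instant_value(up_samples, 150)  # still visible, so up == 0 matches
much_later = instant_value(up_samples, 500)  # outside the window, nothing to match
```

Once the series has aged out, a condition such as up == 0 has nothing left to compare, which is the gap absent() is meant to cover.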