| Symptom | Likely Cause | First Action |
|---|---|---|
| Prometheus OOMKilled or memory growing linearly | Active series growth (often from a new bad metric or label) | Active Series triage |
| Single PromQL query slow or OOMs the querier | One or more metrics in the query have high cardinality | Per-query drill-down |
| Remote write lagging, WAL growing | Sample throughput spike: series count or scrape interval changed | Active Series triage + check scrape intervals |
| Hitting Mimir/Cortex ingester per-tenant series limit | | Per-metric drill-down, find the new offender |
| Grafana Cloud Active Series bill spiked | New metric, new label, or rollout creating churn | Per-metric drill-down + churn check |
| Grafana Cloud DPM bill spiked but Active Series flat | Scrape interval shortened, or remote_write sending duplicates | DPM-side issue: route to |
| Appeared after a deploy | Application change introduced a new bad label | Recent change diff |
| Series count grows then resets every restart | Series churn from ephemeral label values | Churn diagnosis |
Compare the active series count to recent history.
A growth rate > a few % per day on a stable application set is a red flag.
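As a rough illustration of that red-flag check, the sketch below computes the average compound growth per day between two series counts. The function names, the 3% default threshold, and the sample numbers are assumptions for illustration only; in practice the two counts might come from a gauge such as `prometheus_tsdb_head_series` sampled a few days apart.

```python
def daily_growth_pct(count_then: int, count_now: int, days: float) -> float:
    """Average compound growth per day, in percent, between two series counts."""
    if count_then <= 0 or days <= 0:
        raise ValueError("count_then and days must be positive")
    return ((count_now / count_then) ** (1.0 / days) - 1.0) * 100.0


def is_red_flag(count_then: int, count_now: int, days: float,
                threshold_pct: float = 3.0) -> bool:
    """True when average daily growth exceeds the threshold (3% is an assumed default)."""
    return daily_growth_pct(count_then, count_now, days) > threshold_pct


if __name__ == "__main__":
    # Hypothetical numbers: 1.0M active series a week ago, 1.5M today.
    print(round(daily_growth_pct(1_000_000, 1_500_000, 7), 2))
    print(is_red_flag(1_000_000, 1_500_000, 7))
```

Using compound rather than linear growth keeps the figure comparable across windows of different lengths.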
Query the TSDB status endpoint:

```bash
curl -s http://prometheus:9090/api/v1/status/tsdb | jq
```

The response includes `seriesCountByMetricName`, `labelValueCountByLabelName`, `memoryInBytesByLabelName`, and `seriesCountByLabelValuePair`.
---
```json
"seriesCountByMetricName": [
  { "name": "http_request_duration_seconds_bucket", "value": 184320 },
  { "name": "go_gc_duration_seconds", "value": 80 },
  ...
]
```

Metric name suffixes to check: `_bucket`, `_total`, `_count`, `_sum`.

```json
"labelValueCountByLabelName": [
  { "name": "url", "value": 84210 },
  { "name": "trace_id", "value": 41000 },
  { "name": "pod", "value": 1820 }
]
```

Label names to check: `trace_id`, `request_id`, `session_id`, `query`, `email`, `path`, `url`, `pod`.
Repeat per label, or use the helper.
If you see UUIDs, hashes, timestamps, or numeric IDs in the top values → that label has unbounded values from the source.
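The check described above can be roughed out in code. This is a minimal sketch, not the helper the text refers to: the regular expressions are heuristic assumptions about what UUIDs, timestamps, numeric IDs, and hashes look like, and `classify` and `unbounded_share` are hypothetical names. Each value is tested as a whole and then per path segment, so `/users/123456` is caught through its last segment.

```python
import re
from typing import List, Optional

# Heuristic patterns; the exact shapes are assumptions, tune them to your data.
_PATTERNS = [
    ("uuid", re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)),
    ("timestamp", re.compile(
        r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}.*|\d{10}|\d{13}")),
    ("numeric_id", re.compile(r"\d+")),
    ("hash", re.compile(r"[0-9a-f]{16,}", re.I)),
]


def classify(value: str) -> Optional[str]:
    """Return the first pattern name matching the value or one of its path segments."""
    candidates = [value] + [seg for seg in value.split("/") if seg]
    for candidate in candidates:
        for name, pattern in _PATTERNS:
            if pattern.fullmatch(candidate):
                return name
    return None


def unbounded_share(values: List[str]) -> float:
    """Fraction of label values that look machine-generated."""
    if not values:
        return 0.0
    return sum(classify(v) is not None for v in values) / len(values)
```

A high share among the top values of a label suggests the label is unbounded at the source; a low share does not prove the opposite, since the patterns are deliberately narrow.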
---
Diff externally. A new metric near the top of `seriesCountByMetricName` that wasn't there a week ago → that's your offender.
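One way to do that external diff is sketched below, assuming two saved response bodies from `/api/v1/status/tsdb`. The function names and the `min_growth` threshold are hypothetical. The endpoint lists only the top entries, so a metric missing from the older snapshot may simply have been below that cut-off rather than absent.

```python
from typing import Dict, List, Tuple


def counts_by_name(tsdb_status: dict) -> Dict[str, int]:
    """Map metric name to series count from a /api/v1/status/tsdb response body."""
    entries = tsdb_status["data"]["seriesCountByMetricName"]
    return {entry["name"]: int(entry["value"]) for entry in entries}


def diff_snapshots(old: dict, new: dict,
                   min_growth: int = 1000) -> List[Tuple[str, int, int]]:
    """Return (name, old_count, new_count) for metrics that grew by at least
    min_growth series, largest growth first. Metrics missing from the old
    snapshot are treated as having 0 series."""
    before = counts_by_name(old)
    after = counts_by_name(new)
    offenders = []
    for name, now in after.items():
        was = before.get(name, 0)
        if now - was >= min_growth:
            offenders.append((name, was, now))
    offenders.sort(key=lambda row: row[2] - row[1], reverse=True)
    return offenders


if __name__ == "__main__":
    # Hypothetical snapshots, shaped like the API response.
    last_week = {"data": {"seriesCountByMetricName": [
        {"name": "go_gc_duration_seconds", "value": 80}]}}
    today = {"data": {"seriesCountByMetricName": [
        {"name": "http_request_duration_seconds_bucket", "value": 184320},
        {"name": "go_gc_duration_seconds", "value": 80}]}}
    for row in diff_snapshots(last_week, today):
        print(row)
```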
A vertical step in series count aligned with a deploy is conclusive.
---
A creation rate that materially exceeds the removal rate, sustained, means cardinality is on a one-way trip up. Common causes:
| Cause | Tell |
|---|---|
| Pod rollouts emitting `pod` label | Churn spike aligns with deploy timing; affects pod-discovered scrapes |
| `version` / `git_sha` / `image_tag` label on every metric | Churn spike on every deploy across many metrics |
| Ephemeral hostnames in `instance` | Cloud autoscaling event timing |
| Bug: dynamic label names | Churn climbs forever, never plateaus |
| Application bug emitting fresh UUIDs as labels | Linear unbounded growth, no deploy correlation |
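To make "materially exceeds" concrete, the sketch below compares series created against series removed over a window. It assumes the two inputs are counter increases, for example from `prometheus_tsdb_head_series_created_total` and `prometheus_tsdb_head_series_removed_total`; the function name and the 1.5 ratio threshold are assumptions, not values from this guide.

```python
def churn_summary(created_delta: float, removed_delta: float,
                  window_hours: float, ratio_threshold: float = 1.5) -> dict:
    """Summarize head-series churn over a window.

    created_delta and removed_delta are counter increases over the window.
    ratio_threshold is an assumed cut-off for "materially exceeds".
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    net_per_hour = (created_delta - removed_delta) / window_hours
    if removed_delta > 0:
        ratio = created_delta / removed_delta
    else:
        # Nothing removed: any creation at all is one-way growth.
        ratio = float("inf") if created_delta > 0 else 1.0
    return {
        "net_series_per_hour": net_per_hour,
        "created_to_removed_ratio": ratio,
        "one_way_growth": ratio > ratio_threshold and net_per_hour > 0,
    }
```

Run it over a window long enough to cover at least one deploy cycle; a single rollout legitimately creates more series than it removes for a short time.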
Restarting Prometheus drops churned series but is not a fix. The fix is at the source.
---
Histograms: check the `*_bucket` entries in `seriesCountByMetricName` and the labels they carry, such as `path`, `method`, and `status_code`.

kube-state-metrics: `kube_pod_labels` and `kube_pod_annotations` expose `label_*` and `annotation_*` labels; `--metric-labels-allowlist` and `--metric-annotations-allowlist` control which ones are exported.

Unbounded paths: for `http_requests_total`, run `topk(20, count by (path) (http_requests_total))` and look for values such as `/users/123456`. Fix the label in the application (see `prometheus-label-strategy`).
Or normalize via relabel:

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
```

Metrics with suffixes such as `_details` or `_per_request` can be dropped:

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: my_app_request_details
    action: drop
```

To strip target labels such as `instance=...` from scraped series, use `metric_relabel_configs` (applied to samples after the scrape) as opposed to `relabel_configs` (applied to targets before the scrape):

```yaml
metric_relabel_configs:
  # Only safe when the remaining labels still identify each series uniquely
  - regex: (instance|pod|node|host|job)
    action: labeldrop
```

For federation across `cluster` and `region`, pull only recording-rule series:

```yaml
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
      - '{__name__=~".*:.*"}'  # Recording-rule naming convention only
```

```
Cardinality fire confirmed
│
├── Need to stop the bleeding NOW (production OOM, ingest 429s)
│   └── Apply emergency drop via metric_relabel_configs at scrape config
│       (also applies to Alloy/Agent, same rule fields)
│       Then schedule the proper fix.
│
├── It's a Grafana Cloud Active Series bill issue, not a perf issue
│   ├── Cardinality is structural and you can't fix the app
│   │   └── Route to `adaptive-metrics` skill (post-ingest aggregation rules)
│   └── You want metric-by-metric DPM breakdown
│       └── Route to `dpm-finder` skill
│
├── It's a fixable application bug (unbounded label, debug metric in prod)
│   ├── Short-term: metric_relabel_configs drop at scrape
│   └── Long-term: fix in code; route to `prometheus-label-strategy` for design guidance
│
├── It's histogram cardinality
│   ├── Trim labels on the underlying histogram (14× win per label)
│   ├── Reduce bucket count if appropriate
│   └── Consider native histograms for high-resolution latency
│
└── It's churn (deploy-driven)
    ├── Remove `pod`, `version`, `git_sha` from application metrics
    ├── Use info-metric pattern for `version` (route to `prometheus-label-strategy`)
    └── Verify K8s SD relabel rules aren't carrying `uid` or other ephemeral fields
```
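One reading of the "14× win per label" note in the histogram branch: a classic histogram with 11 explicit buckets produces 14 series per label combination (11 buckets, the `+Inf` bucket, `_sum`, and `_count`). The sketch below works through that arithmetic. The bucket count of 11 is an assumption (it matches the Go client's default bucket set), the function name is hypothetical, and the result is a worst case that assumes every label combination actually occurs.

```python
from typing import Dict


def histogram_series(label_value_counts: Dict[str, int],
                     explicit_buckets: int = 11) -> int:
    """Worst-case series count for one classic histogram.

    Per label combination: one series per explicit bucket, one for +Inf,
    plus the _sum and _count series.
    """
    combinations = 1
    for distinct_values in label_value_counts.values():
        combinations *= distinct_values
    return combinations * (explicit_buckets + 1 + 2)


if __name__ == "__main__":
    # Hypothetical label cardinalities.
    with_path = histogram_series({"method": 5, "status_code": 6, "path": 50})
    without_path = histogram_series({"method": 5, "status_code": 6})
    print(with_path, without_path)
```

Under these assumptions, dropping the `path` label from the histogram removes 14 series for every label combination that no longer exists.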
In `scrape_configs`:

```yaml
metric_relabel_configs:
  # Drop a specific bad metric
  - source_labels: [__name__]
    regex: bad_metric_name
    action: drop
  # Drop a high-cardinality label from all metrics
  - regex: bad_label_name
    action: labeldrop
  # Drop all labels matching a prefix (e.g., debug_*)
  - regex: debug_.*
    action: labeldrop
  # Drop series of a metric only when a label value matches an unbounded pattern
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop
  # Aggregate path patterns to a template (replace the value)
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
  # Aggregate status codes to classes
  - source_labels: [status_code]
    regex: (\d)\d\d
    target_label: status_code
    replacement: ${1}xx
```

In Alloy, use `prometheus.relabel`:

```alloy
prometheus.relabel "drop_bad_metric" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex         = "bad_metric_name"
    action        = "drop"
  }

  rule {
    regex  = "bad_label_name"
    action = "labeldrop"
  }
}
```

`labeldrop` removes the label from every series that passes through the rule, so the remaining labels must still identify each series uniquely.
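Before rolling out a `replace` rule like the path and status-code examples, it can help to try it on sample label sets. The sketch below is a simplified stand-in for that relabel action, not Prometheus code: `apply_replace` is a hypothetical name, it uses Python's `re` instead of Go's RE2, and it only understands the `${n}` reference form. It does reproduce one behavior that often surprises people: the regex is anchored at both ends, so `/users/\d+` does not match `/users/123456/orders`.

```python
import re
from typing import Dict, List


def apply_replace(labels: Dict[str, str], source_labels: List[str], regex: str,
                  target_label: str, replacement: str,
                  separator: str = ";") -> Dict[str, str]:
    """Mimic a `replace` relabel rule on a single label set."""
    # Source label values are joined with the separator, missing labels as "".
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # fullmatch mirrors the implicit anchoring of relabel regexes.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)
    # Translate ${1}-style references into Python's \g<1> form.
    template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


if __name__ == "__main__":
    print(apply_replace({"path": "/users/123456"}, ["path"],
                        r"/users/\d+", "path", "/users/:id"))
    print(apply_replace({"status_code": "503"}, ["status_code"],
                        r"(\d)\d\d", "status_code", "${1}xx"))
```

If sub-paths should collapse too, the rule needs a wider pattern such as `/users/\d+(/.*)?`, at the cost of folding more routes into one value.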
Related skills: `prometheus-label-strategy`, `adaptive-metrics`, `dpm-finder`, `promql`, `alloy`, `loki-label-analyzer`.