Diagnostic guide for active Prometheus cardinality problems — slow queries, OOMing Prometheus, high Grafana Cloud Active Series or DPM bills, "too many samples" ingest errors, series churn, or rapid memory growth. Walks through tsdb status endpoints, per-metric and per-label drill-downs, common-culprit galleries, and remediation paths. Use when the user is *currently experiencing* a cardinality fire. For preventing cardinality issues at the source, route to prometheus-label-strategy. For post-ingest aggregation, route to adaptive-metrics. For DPM-specific analysis, route to dpm-finder.
```bash
npx skill4agent add grafana/skills prometheus-cardinality-troubleshooter
```

| Symptom | Likely Cause | First Action |
|---|---|---|
| Prometheus OOMKilled or memory growing linearly | Active series growth (often from a new bad metric or label) | Active Series triage |
| Single PromQL query slow or OOMs the querier | One or more metrics in the query have high cardinality | Per-query drill-down |
| Remote write lagging, WAL growing | Sample throughput spike — series count OR scrape interval changed | Active Series triage + check scrape intervals |
| | Hitting Mimir/Cortex ingester per-tenant series limit | Per-metric drill-down, find the new offender |
| Grafana Cloud Active Series bill spiked | New metric, new label, or rollout creating churn | Per-metric drill-down + churn check |
| Grafana Cloud DPM bill spiked but Active Series flat | Scrape interval shortened, OR remote_write sending duplicates | DPM-side issue; route to `dpm-finder` |
| | Application change introduced a new bad label | Recent change diff |
| Series count grows then resets every restart | Series churn from ephemeral label values | Churn diagnosis |
```promql
# Total active series in the local Prometheus
prometheus_tsdb_head_series
# Or for Mimir / Grafana Cloud Metrics (per tenant)
cortex_ingester_memory_series{user="<tenant>"}
```

```promql
# Growth over the last 7 days (series per day)
deriv(prometheus_tsdb_head_series[7d]) * 86400
```

```bash
curl -s http://prometheus:9090/api/v1/status/tsdb | jq
```

Fields in the response: `seriesCountByMetricName`, `labelValueCountByLabelName`, `memoryInBytesByLabelName`, `seriesCountByLabelValuePair`.

```bash
# Same endpoint, authenticated against the per-tenant Mimir
curl -s -u "<user>:<token>" \
  "https://prometheus-prod-XX.grafana.net/api/prom/api/v1/status/tsdb" | jq
```

```json
"seriesCountByMetricName": [
  { "name": "http_request_duration_seconds_bucket", "value": 184320 },
  { "name": "go_gc_duration_seconds", "value": 80 },
  ...
]
```

Suffixes to check: `_bucket`, `_total`, `_count`, `_sum`.

```json
"labelValueCountByLabelName": [
  { "name": "url", "value": 84210 },
  { "name": "trace_id", "value": 41000 },
  { "name": "pod", "value": 1820 }
]
```

Labels to check: `trace_id`, `request_id`, `session_id`, `query`, `email`, `path`, `url`, `pod`.

```promql
# How many unique values does each label have on this metric?
count by (__name__) (
  count by (__name__, label_name_here) (
    http_request_duration_seconds_bucket
  )
)
```

```bash
# Via the Prometheus HTTP API
# -g stops curl from treating the [] in match[] as a glob pattern
curl -s -g "http://prometheus:9090/api/v1/labels?match[]=http_request_duration_seconds_bucket" | jq -r '.data[]' | \
while read -r label; do
  count=$(curl -s -g "http://prometheus:9090/api/v1/label/${label}/values?match[]=http_request_duration_seconds_bucket" | jq '.data | length')
  echo "${count} ${label}"
done | sort -rn | head -20
```

```promql
# Top 20 path values for http_requests_total
topk(20,
  count by (path) (http_requests_total)
)
```

```promql
# Series-per-instance breakdown; if uneven, one instance is misbehaving
# (count, not sum: this counts series rather than adding up sample values)
count by (job, instance) ({__name__=~"my_metric.*"})
```

```promql
# Current metrics
group by (__name__) ({__name__!=""})
# Compare to last week (offset)
group by (__name__) ({__name__!=""} offset 7d)
```

See also `seriesCountByMetricName`.

```promql
# Active series correlated with build_info
prometheus_tsdb_head_series
# Overlay with:
changes(app_build_info[1d])
```

```promql
# Series created vs. removed per second
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])
# Ratio of churned to live
prometheus_tsdb_head_series_created_total / prometheus_tsdb_head_series
```

| Cause | Tell |
|---|---|
| Pod rollouts emitting | Churn spike aligns with deploy timing; affects pod-discovered scrapes |
| | Churn spike on every deploy across many metrics |
| Ephemeral hostnames in | Cloud autoscaling event timing |
| Bug: dynamic label names | Churn climbs forever, never plateaus |
| Application bug emitting fresh UUIDs as labels | Linear unbounded growth, no deploy correlation |
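
The "aligns with deploy timing" tell in the table above can be checked mechanically. The following is a minimal Python sketch, assuming the output of `rate(prometheus_tsdb_head_series_created_total[5m])` has been exported as `(timestamp, value)` pairs and that a list of deploy timestamps is available. The function names, the 3x-median spike threshold, and the 10-minute window are illustrative assumptions, not part of any Prometheus API.

```python
from statistics import median


def spike_times(samples, factor=3.0):
    """Return timestamps whose series-creation rate exceeds factor x the median rate.

    samples: list of (unix_timestamp, created_per_second) pairs (hypothetical export).
    """
    baseline = median(rate for _, rate in samples)
    if baseline <= 0:
        return []  # no meaningful baseline to compare against
    return [ts for ts, rate in samples if rate > factor * baseline]


def fraction_near_deploys(spikes, deploys, window_s=600):
    """Fraction of churn spikes that fall within window_s seconds of any deploy."""
    if not spikes:
        return 0.0
    hits = sum(1 for s in spikes if any(abs(s - d) <= window_s for d in deploys))
    return hits / len(spikes)
```

A fraction close to 1.0 suggests deploy-driven churn; a fraction near 0.0 with steadily rising creation rates points toward the unbounded-label rows of the table instead.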
```promql
# A churn-driven head block carries old series until tsdb compaction
prometheus_tsdb_head_chunks
go_memstats_heap_inuse_bytes{job="prometheus"}
```

Histograms: `*_bucket` entries in `seriesCountByMetricName`; labels `path`, `method`, `status_code`.

kube-state-metrics: `kube_pod_labels`, `kube_pod_annotations`; labels `label_*`, `annotation_*`; flags `--metric-labels-allowlist`, `--metric-annotations-allowlist`.

```
# kube-state-metrics flags
--metric-labels-allowlist=pods=[app,team,version]
--metric-annotations-allowlist=pods=[checksum/config]
```

Unbounded path values: `http_requests_total`, `topk(20, count by (path) (http_requests_total))`, `/users/123456`; see `prometheus-label-strategy`.

```yaml
# In Prometheus scrape config: emergency drop
metric_relabel_configs:
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop
```

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
```

Per-request metrics: `_details`, `_per_request`.

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: my_app_request_details
    action: drop
```

`instance=...`, `metric_relabel_configs`

```yaml
metric_relabel_configs:
  - regex: (instance|pod|node|host|job)
    action: labeldrop
```

`relabel_configs`, `cluster`, `region`

```yaml
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
      - '{__name__=~".*:.*"}'  # Recording-rule naming convention only
```

```
Cardinality fire confirmed
│
├── Need to stop the bleeding NOW (production OOM, ingest 429s)
│   └── Apply emergency drop via metric_relabel_configs at scrape config
│       (also applies to Alloy/Agent; same syntax)
│       Then schedule the proper fix.
│
├── It's a Grafana Cloud Active Series bill issue, not a perf issue
│   ├── Cardinality is structural and you can't fix the app
│   │   └── Route to `adaptive-metrics` skill (post-ingest aggregation rules)
│   └── You want metric-by-metric DPM breakdown
│       └── Route to `dpm-finder` skill
│
├── It's a fixable application bug (unbounded label, debug metric in prod)
│   ├── Short-term: metric_relabel_configs drop at scrape
│   └── Long-term: fix in code; route to `prometheus-label-strategy` for design guidance
│
├── It's histogram cardinality
│   ├── Trim labels on the underlying histogram (14× win per label)
│   ├── Reduce bucket count if appropriate
│   └── Consider native histograms for high-resolution latency
│
└── It's churn (deploy-driven)
    ├── Remove `pod`, `version`, `git_sha` from application metrics
    ├── Use info-metric pattern for `version` (route to `prometheus-label-strategy`)
    └── Verify K8s SD relabel rules aren't carrying `uid` or other ephemeral fields
```

`scrape_configs`

```yaml
metric_relabel_configs:
  # Drop a specific bad metric
  - source_labels: [__name__]
    regex: bad_metric_name
    action: drop
  # Drop a high-cardinality label from all metrics
  - regex: bad_label_name
    action: labeldrop
  # Drop all labels matching a prefix (e.g., debug_*)
  - regex: debug_.*
    action: labeldrop
  # Drop a metric only when a specific label has unbounded values
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop
  # Aggregate path patterns to a template (replace the value)
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
  # Aggregate status codes to classes
  - source_labels: [status_code]
    regex: (\d)\d\d
    target_label: status_code
    replacement: ${1}xx
```

`prometheus.relabel`

```alloy
prometheus.relabel "drop_bad_metric" {
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    source_labels = ["__name__"]
    regex         = "bad_metric_name"
    action        = "drop"
  }
  rule {
    regex  = "bad_label_name"
    action = "labeldrop"
  }
}
```

`labeldrop`

Related skills: `prometheus-label-strategy`, `adaptive-metrics`, `dpm-finder`, `promql`, `alloy`, `loki-label-analyzer`
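
Before rolling out a `replace` rule such as the path-template or status-class rules in the relabel examples, it can help to preview its effect on a sample of real label values. The following is a minimal Python sketch, not a Prometheus tool: `apply_rule` and `preview` are hypothetical helper names, and the sample values are made up. It relies on one documented behavior, that Prometheus anchors relabel regexes at both ends, which `re.fullmatch` emulates here. Python's `re` is not RE2, so exotic patterns may behave differently.

```python
import re

# (pattern, replacement) pairs mirroring two rules from the relabel examples.
# Prometheus writes the capture group as ${1}; Python's equivalent is \g<1>.
PATH_RULE = (re.compile(r"/users/\d+"), "/users/:id")
STATUS_RULE = (re.compile(r"(\d)\d\d"), r"\g<1>xx")


def apply_rule(rule, value):
    """Return the rewritten label value, or the original if the regex does not fully match."""
    pattern, replacement = rule
    match = pattern.fullmatch(value)  # anchored match, as in Prometheus relabeling
    return match.expand(replacement) if match else value


def preview(rule, values):
    """Return (distinct values before, distinct values after) applying the rule."""
    after = {apply_rule(rule, v) for v in values}
    return len(set(values)) , len(after)
```

For example, `preview(PATH_RULE, ["/users/1", "/users/2", "/health"])` reports three distinct values collapsing to two. Note that `/users/123/orders` is left untouched, because the anchored regex does not match it; a broader pattern would be needed to template nested routes.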