prometheus-label-strategy
Prometheus Label Strategy Evaluator
You are an expert in Prometheus label strategy. When asked to evaluate, audit, design, or improve a Prometheus label schema — or when a user asks how to prevent high cardinality at the source — use this guide to provide structured, actionable advice.
This skill is about preventing bad labels at ingest (instrumentation, scrape configuration, relabeling). For post-ingest cost reduction via aggregation rules, route the user to the `adaptive-metrics` skill. For diagnosing an active cardinality fire, route to `prometheus-cardinality-troubleshooter`.
Core Concepts
Series are the fundamental unit in Prometheus. Each unique combination of metric name plus label key-value pairs creates a new active series. Too many series means memory pressure, slow queries, ingest pressure, and a high bill.
Cardinality is the number of unique values a label can have. Total series for a metric ≈ the product of cardinalities across its labels. A metric with `path` (100 values), `status_code` (10 values), `method` (5 values), and `instance` (50 values) can produce up to 100 × 10 × 5 × 50 = 250,000 series. Adding one more high-cardinality label often multiplies the count by 10–100×.
The dual impact rule: High-cardinality labels hurt on both paths:
- Ingestion path: More active series → larger head block, larger WAL, more memory, larger remote_write payloads, higher Grafana Cloud bill (Active Series + DPM)
- Query path: PromQL operators (`rate`, `sum by`, joins) must materialize matching series in memory. High cardinality balloons query memory and latency
Series churn is the silent killer. If a label value changes frequently (deploy version, pod name, ephemeral IDs), every change creates a new series while the old one lingers until it ages out. Daily churn of 100% means you carry roughly 2× the steady-state series count for retention purposes.
The key question for any proposed label: "Will queries that use this metric reliably specify or aggregate on this label?" If no → it should NOT be a label.
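The multiplication rule is easy to check with a few lines of code. The sketch below is a minimal illustration, not a sizing tool: `estimate_series` is a hypothetical helper, and `customer_tier` with 20 values is an invented extra label used only to show the multiplier effect.

```python
from math import prod


def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of its label cardinalities."""
    return prod(label_cardinalities.values())


# Cardinalities taken from the example in the text.
labels = {"path": 100, "status_code": 10, "method": 5, "instance": 50}
base = estimate_series(labels)

# Hypothetical extra label with 20 distinct values (an assumption for illustration).
with_extra = estimate_series({**labels, "customer_tier": 20})

print(base)        # 250000
print(with_extra)  # 5000000
```

The product is an upper bound: real metrics rarely populate every combination, but the bound is what you have to budget for.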
Label Evaluation Framework
When auditing a label set, assess each label against these criteria.
Cardinality Scoring
| Label Example | Cardinality | Verdict |
|---|---|---|
| | 2–5 values | ✅ Good |
| | 5–50 values | ✅ Good |
| | Tens | ✅ Good |
| | Tens–low hundreds | ✅ Acceptable |
| | Tens–hundreds | ✅ Acceptable |
| | Hundreds–low thousands | ⚠️ Evaluate: fine on per-instance metrics, risky on aggregated ones |
| | Thousands + transient = high churn | ❌ Drop at scrape unless required |
| | Bounded if templated; unbounded if raw URLs | ⚠️ Only with templated values ( |
| | Grows on every deploy → churn | ⚠️ Use sparingly; consider info-metric pattern |
| | Unbounded | ❌ Never as a label; use exemplars |
| | Often unbounded | ❌ Only acceptable for small fixed tenant counts |
| | Unbounded text | ❌ Never |
Access Pattern Alignment
For each label, ask:
- Do queries on this metric reliably aggregate by or filter on this label?
- Does this label logically segment the metric the way users think about it?
- Would removing this label force users to use exemplars, logs, or traces instead — and would that be acceptable for the rare lookup case?
Static vs. Dynamic Label Values
- Static / target labels (set once per scrape target via `relabel_configs`, e.g., `env=prod`, `cluster=us-east`, `team=payments`) add cardinality proportional to targets, not requests. Cheap and high-value. Use freely.
- Dynamic / sample labels (emitted by the application per measurement, e.g., `status_code`, `method`, `cache_hit`) multiply cardinality by value count. Keep possible values in the single digits or low tens. The application code is the source of truth: fix it there, not in Prometheus.
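The difference between the two kinds of label can be made concrete with a rough count. This is a minimal sketch under simple assumptions: `active_series` is a hypothetical helper, the target count of 50 and the value counts are invented, and every label combination is assumed to occur.

```python
from math import prod


def active_series(targets: int, sample_label_cardinalities: dict[str, int]) -> int:
    """Series for one metric: each target contributes the product of its sample-label cardinalities."""
    return targets * prod(sample_label_cardinalities.values())


# Target labels such as env, cluster, or team are fixed per target,
# so they contribute no extra factor and do not appear in the dict.
baseline = active_series(50, {"status_code": 5, "method": 4})

# Adding one more sample label multiplies the total by its value count.
with_cache_hit = active_series(50, {"status_code": 5, "method": 4, "cache_hit": 2})

print(baseline)        # 1000
print(with_cache_hit)  # 2000
```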
Consistency Check
- Label names consistent across services? (`status` vs `status_code` vs `http_status` produces three separate label families, and joins break)
- Label values normalized? (`200` vs `"200"`, `GET` vs `get`, `Error` vs `error`)
- Naming convention consistent? Prometheus convention is `snake_case` for both metric and label names
- Same concept, same name across services? (`service` vs `svc` vs `app_name`)
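These checks are mechanical enough to automate in a review script. The sketch below is one possible shape, not an established tool: `lint_label_names`, `normalize_value`, and the `CANONICAL` map are hypothetical, and the choice of `status_code` and `service` as the canonical names is an assumption each team would make for itself.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

# Hypothetical synonym map: variant label name -> the name the team has standardized on.
CANONICAL = {
    "status": "status_code",
    "http_status": "status_code",
    "svc": "service",
    "app_name": "service",
}


def lint_label_names(names):
    """Return (name, problem) tuples for label names that break the conventions."""
    findings = []
    for name in names:
        if not SNAKE_CASE.match(name):
            findings.append((name, "not snake_case"))
        if name in CANONICAL:
            findings.append((name, f"use {CANONICAL[name]} instead"))
    return findings


def normalize_value(label, value):
    """Collapse equivalent spellings (GET/get, 200/"200") into one label value."""
    text = str(value).strip().strip('"')
    # HTTP methods are conventionally upper case; everything else is lower-cased.
    return text.upper() if label == "method" else text.lower()


findings = lint_label_names(["statusCode", "svc", "method"])
print(findings)
```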
Histogram Bucket Discipline (critical, often missed)
Every classic histogram multiplies its base cardinality by (bucket count + 3): one `_bucket{le="..."}` series per configured bucket, plus the implicit `le="+Inf"` bucket, `_sum`, and `_count`. Clients that also expose a `_created` series add one more.
- Default `prometheus.DefBuckets` has 11 buckets → 14× multiplier
- A histogram with `method`, `path`, `status` already at 1,000 series becomes 14,000 series after adding histogram cardinality
- Always trim histogram label cardinality first: labels matter 14× more on histograms than on counters/gauges
- Consider native histograms (Prometheus 2.40+), which use a single sparse series instead of one per bucket: a major cardinality reduction for high-resolution latency tracking
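The multiplier is worth computing before adding a label to a histogram. The sketch below applies the (bucket count + 3) rule from this section; `histogram_series` is a hypothetical helper, and the bucket boundaries are the Go client's `prometheus.DefBuckets` values.

```python
# The 11 default bucket boundaries of the Go client (prometheus.DefBuckets), in seconds.
DEF_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)


def histogram_series(base_series: int, bucket_count: int) -> int:
    """Series produced by a classic histogram: base label combinations × (buckets + 3)."""
    return base_series * (bucket_count + 3)


# 1,000 label combinations with default buckets.
total = histogram_series(1000, len(DEF_BUCKETS))

# Trimming the labels to 100 combinations shrinks the histogram by the same factor.
trimmed = histogram_series(100, len(DEF_BUCKETS))

print(total)    # 14000
print(trimmed)  # 1400
```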
Info-Metric Pattern (for high-churn metadata)
When you want to know about a label (e.g., `version`, `git_sha`, `image_tag`) without paying for it on every metric, use an info metric:
```
# A single low-cardinality counter/gauge of value 1, with the metadata attached
app_build_info{app="payment-api", version="2.4.1", git_sha="a1b2c3"} 1
```
Then join at query time:
```promql
sum by (version) (
  rate(http_requests_total{app="payment-api"}[5m])
  * on (app) group_left (version) app_build_info
)
```
The `version` label lives on exactly one series per build, not on every metric. The join requires exactly one `app_build_info` series per `app` value; if several versions report at once (for example mid-rollout), add a per-target label such as `instance` to the `on (...)` clause.
Evaluation Output Format
When auditing a label set, produce a report in this structure:
Prometheus Label Strategy Audit
Summary
[1-2 sentence overall assessment: total estimated active series, biggest risks]
Per-Label Analysis
| Metric Family | Label | Cardinality | Used in Queries? | Verdict | Action |
|---|---|---|---|---|---|
| http_requests_total | path | Unbounded (raw URLs) | Sometimes | ❌ Remove | Template in code: |
| http_requests_total | pod | High + churn | Rarely | ❌ Drop via metric_relabel_configs | Already in target metadata |
Histogram-Specific Findings
[Highlight any histograms with high label cardinality; these are amplified 14× or more]
Estimated Impact
- Active series reduction: [X series → Y series]
- DPM reduction: [X DPM → Y DPM] (samples per minute = series × ~6 at a 10s scrape interval)
- Memory impact: [if measurable]
Recommended Label Set
[Final recommended labels per metric family]
Implementation Plan
- [Code changes: instrumentation hygiene]
- [Scrape config changes: relabel_configs]
- [Drop-at-scrape changes: metric_relabel_configs]
- [Recording rules to materialize useful aggregates]
---
Recommended Common Target Labels
These should be set as target labels (via `relabel_configs` on the scrape job, NOT emitted by the app). They are per-target, low cardinality, and high query value:
| Label | Purpose | Notes |
|---|---|---|
| | Prometheus scrape job name | Set automatically by Prometheus |
| | Target endpoint ( | Set automatically; rename via |
| | Environment ( | Set via static_configs labels or service discovery |
| | Multi-cluster differentiation | Critical for federation/Mimir multi-tenant |
| | Geographic region | |
| | Ownership; also useful for access control | |
| | Logical service identity | One service may span multiple jobs |
These should NOT be re-emitted by the application. If the app emits a `cluster` label, it duplicates the target label and creates collisions / `honor_labels` decisions you don't want to make.
Kubernetes Patterns
Recommended Labels (from kubernetes_sd_configs)
| Label | Source | Notes |
|---|---|---|
| | Pod metadata | Always keep |
| | Pod spec | Low cardinality, useful for multi-container pods |
| | Derived: | Strongly preferred over |
| | K8s Service | If scraping via Service |
Labels to AVOID by Default in Kubernetes
**`pod` label**
- Highly transient: rolls on every deploy and on every restart
- High cardinality: 100 pods × N metrics = N × 100 series, but on rollouts you carry both old and new pods until they age out
- Almost never the right query dimension: users want workload, not pod instance
- Solution: Keep `workload` as a label; drop `pod` via `metric_relabel_configs`; use exemplars or kube-state-metrics for pod-specific lookups
```yaml
# Drop the pod label from application metrics at scrape time
metric_relabel_configs:
  - regex: pod
    action: labeldrop
```
**`uid` label** ❌
- Completely unbounded (regenerates on every pod recreation)
- No legitimate query use; kept only by accident in default kubernetes_sd configs
**Application-emitted `instance` / `pod` / `node`** ❌
- These should come from target labels, not from the app code
- Drop them at scrape with `metric_relabel_configs` or fix in code
**kube-state-metrics annotation / label propagation** ⚠️
- `kube_pod_labels{label_app_kubernetes_io_*=...}` can carry dozens of metadata labels
- Each unique pod label combination is a new series
- Use kube-state-metrics' `--metric-labels-allowlist` to restrict to the labels you actually query on
---
Source-Side Prevention: Where to Fix What
There are four levers, in order of preference:
1. Fix in the Application (best)
Bad labels emitted by the app are the root cause. Examples:
- HTTP paths: use templated routes (`/users/:id`) not raw paths
- Error metrics: use a small enum (`error_type="timeout"`) not the error message string
- User-scoped metrics: don't include `user_id`; use exemplars to point to logs/traces
- Free-form input: never emit user-supplied strings as label values
If you control the code, this is always the right fix. It saves cost on every downstream system (Prometheus, remote_write, Mimir, Grafana Cloud).
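In application code, the first two fixes usually come down to mapping unbounded inputs onto a small fixed set before they reach the metrics library. The following is a minimal sketch using only the standard library: the `ROUTES` table, the `"other"` and `"unknown"` fallbacks, and the exception-to-enum mapping are all assumptions for illustration, and most web frameworks can supply the matched route template directly, which is preferable when available.

```python
import re

# Hypothetical route table: (pattern for raw paths, template used as the label value).
ROUTES = [
    (re.compile(r"^/users/[^/]+$"), "/users/:id"),
    (re.compile(r"^/users/[^/]+/orders/[^/]+$"), "/users/:id/orders/:order_id"),
]


def template_path(raw_path: str) -> str:
    """Map a raw request path to a bounded template; unknown paths share one bucket."""
    path = raw_path.split("?", 1)[0]  # the query string never belongs in a label
    for pattern, template in ROUTES:
        if pattern.match(path):
            return template
    return "other"


# Hypothetical enum: exception class -> bounded error_type label value.
ERROR_TYPES = {
    TimeoutError: "timeout",
    ConnectionError: "connection",
    ValueError: "validation",
}


def error_type(exc: BaseException) -> str:
    """Return a small enum value instead of the free-form error message."""
    for cls, name in ERROR_TYPES.items():
        if isinstance(exc, cls):
            return name
    return "unknown"


print(template_path("/users/42?tab=orders"))  # /users/:id
print(error_type(TimeoutError("upstream took 31s")))  # timeout
```

The fallback values matter as much as the mappings: without them, scanners probing random URLs or unexpected exception types would still create new series.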
2. `relabel_configs` (target-time relabeling)
Runs before the scrape. Used to:
- Set target labels (`env`, `cluster`, `team`) on discovered targets
- Drop entire targets you don't want to scrape
- Rewrite `instance` to a friendly value
- Add identity from service discovery metadata
```yaml
scrape_configs:
  - job_name: my-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Set workload from controller metadata
      - source_labels: [__meta_kubernetes_pod_controller_kind, __meta_kubernetes_pod_controller_name]
        target_label: workload
        separator: /
      # Set env from a pod label
      - source_labels: [__meta_kubernetes_pod_label_env]
        target_label: env
      # Only scrape pods explicitly opted in
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: "true"
        action: keep
```
3. `metric_relabel_configs` (scrape-time relabeling)
Runs after the scrape, before storage. Used to:
- Drop high-cardinality labels the app shouldn't have emitted
- Drop entire metrics you don't want
- Rewrite label values for normalization
```yaml
scrape_configs:
  - job_name: my-app
    metric_relabel_configs:
      # Drop the pod label from every metric
      - regex: pod
        action: labeldrop
      # Drop a specific high-cardinality metric entirely
      - source_labels: [__name__]
        regex: my_app_request_details
        action: drop
      # Normalize status_code to a class (2xx, 3xx, 4xx, 5xx)
      - source_labels: [status_code]
        regex: (\d)\d\d
        target_label: status_code
        replacement: ${1}xx
```
This is the emergency stop for bad labels. Use it when you can't fix the app immediately.
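A relabel rule can be sanity-checked offline before it is deployed. The sketch below imitates the `replace` action for a single source label with Python's `re` module; `relabel_replace` is a hypothetical helper, Python's regex dialect only approximates the RE2 syntax Prometheus uses, and the replacement is written as `\g<1>xx` because Python does not understand the `${1}xx` form.

```python
import re


def relabel_replace(value: str, regex: str, replacement: str) -> str:
    """Imitate a `replace` relabel action on one source label value."""
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = re.fullmatch(regex, value)
    if match is None:
        return value  # no match: the label is left unchanged
    return match.expand(replacement)


# Same pattern as the status_code rule, with Python replacement syntax.
STATUS_REGEX = r"(\d)\d\d"
STATUS_REPLACEMENT = r"\g<1>xx"

print(relabel_replace("404", STATUS_REGEX, STATUS_REPLACEMENT))      # 4xx
print(relabel_replace("timeout", STATUS_REGEX, STATUS_REPLACEMENT))  # timeout
```

Because the pattern is anchored, a four-digit value such as `1234` does not match and passes through untouched, which is worth knowing before relying on the rule to bound the label.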
4. Recording Rules (query-time cardinality reduction)
Pre-aggregate expensive series into a lower-cardinality recorded series. The result is stored at the same data point density but with far fewer series.
```yaml
groups:
  - name: http-requests-aggregates
    interval: 30s
    rules:
      # Drop pod/instance dimension; keep only service-level rollup
      - record: service:http_requests:rate5m
        expr: sum by (service, env, cluster, status_code) (rate(http_requests_total[5m]))
```
Queries that target the rollup are dramatically cheaper. The raw series still exist: recording rules don't reduce ingest cost (use Adaptive Metrics or `metric_relabel_configs` for that). They reduce query cost.
Instrumentation Hygiene (for app developers)
If the user is writing instrumentation code, these are the rules:
| Rule | Why |
|---|---|
| Never use unbounded user input as a label value | |
| Template HTTP paths before recording | |
| Bound error labels via small enums | |
| Don't put | Use an info metric and join at query time |
| Don't emit | Comes from scrape targets; duplicating creates collisions |
| Avoid dynamically constructed label names (keys) | |
| Use histograms sparingly and trim labels first | 14× cardinality amplification |
| Prefer exemplars over labels for trace correlation | Exemplars carry |
Exemplars (the escape hatch)
Exemplars attach a `trace_id` (or any key-value pair) to specific samples without making it a label dimension. They are the ideal home for high-cardinality correlation data.
Requires OpenMetrics format, Prometheus 2.26+, scrape config:
```yaml
scrape_configs:
  - job_name: my-app
    enable_protobuf_negotiation: true
    # Or for text-format:
    follow_redirects: true
```
And on the Prometheus server:
```yaml
storage:
  exemplars:
    max_exemplars: 100000
```
Exemplar storage also has to be switched on with the `--enable-feature=exemplar-storage` server flag.
Use exemplars for:
- `trace_id` correlation (Tempo, Jaeger)
- `request_id` for specific debug lookups
- Any sparse "useful when you need it" key
Query exemplars via Grafana's exemplars-on-graph feature, not via PromQL aggregation.
The 80/20 Rule
The most impactful improvements almost always come from these five changes:
- Drop unbounded labels at the app layer: `path` (untemplated), `user_id`, `error_message`. Single biggest win.
- Trim histogram label cardinality before anything else: 14× amplification on every histogram.
- Drop `pod` from application metrics and keep `workload` instead. Eliminates churn and gives a big series-count reduction.
- Use info metrics for `version` / `git_sha` / `image_tag`: eliminates deploy-driven churn.
- Set target labels via `relabel_configs`, not app code: `env`, `cluster`, `team`, `service` should never be emitted by the application.
Focus on these before anything else.
Labels to Avoid — Quick Reference
| Label | Why | Alternative |
|---|---|---|
| | Unbounded | Exemplars; aggregate by |
| | Unbounded | Exemplars |
| | Unbounded | Template in code: |
| | Unbounded text | Bounded |
| | Churn on every deploy | Info metric pattern |
| | Transient + high cardinality | |
| | Unbounded; regenerates on restart | Drop entirely |
| Application-emitted | Should come from scrape target | Drop via |
| Dynamically-named label keys | Cannot be bounded | Use fixed keys with bounded values |
| Raw | 14× amplification | Bucket to |
When to Route Elsewhere
- "Reduce my Grafana Cloud bill" → also engage the `adaptive-metrics` skill (post-ingest aggregation rules)
- "Which metrics are driving my DPM?" → engage the `dpm-finder` skill
- "My Prometheus is OOMing / scraping is failing right now" → engage the `prometheus-cardinality-troubleshooter` skill
- "How do I write the query to find the bad metric?" → engage the `promql` skill
- "How do I configure relabel rules in Alloy?" → engage the `alloy` skill
This skill's lane is strategy and design. Other skills own diagnosis and operational remediation.