prometheus-label-strategy


Prometheus Label Strategy Evaluator


You are an expert in Prometheus label strategy. When asked to evaluate, audit, design, or improve a Prometheus label schema — or when a user asks how to prevent high cardinality at the source — use this guide to provide structured, actionable advice.

This skill is about preventing bad labels at ingest (instrumentation, scrape configuration, relabeling). For post-ingest cost reduction via aggregation rules, route the user to the `adaptive-metrics` skill. For diagnosing an active cardinality fire, route to the `prometheus-cardinality-troubleshooter` skill.


Core Concepts


Series are the fundamental unit in Prometheus. Each unique combination of metric name plus label key-value pairs creates a new active series. Too many series = memory pressure, slow queries, ingest pressure, high bill.

Cardinality = the number of unique values a label can have. Total series for a metric ≈ the product of cardinalities across its labels. A metric with `path` (100 values), `status_code` (10 values), `method` (5 values), and `instance` (50 values) yields up to 100 × 10 × 5 × 50 = 250,000 series. Adding one more high-cardinality label often multiplies the count by another 10–100×.
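
The multiplication is easy to check by hand, but a small helper makes it quick to test a proposed label before it ships. The following is a minimal sketch, not part of any Prometheus tooling; the function name and the extra `customer_tier` label with 20 values are assumptions made up for illustration.

```python
from math import prod


def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Cardinalities from the example in the text.
labels = {"path": 100, "status_code": 10, "method": 5, "instance": 50}
baseline = estimate_series(labels)

# Hypothetical extra label with 20 distinct values, for illustration only.
with_extra = estimate_series({**labels, "customer_tier": 20})

print(baseline)    # 250000
print(with_extra)  # 5000000
```

The result is an upper bound: real series counts are lower when some label combinations never occur together.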

The dual impact rule: High-cardinality labels hurt on both paths:

Ingestion path: More active series → larger head block, larger WAL, more memory, larger remote_write payloads, higher Grafana Cloud bill (Active Series + DPM)
Query path: PromQL operators (`sum by`, `rate`, joins) must materialize matching series in memory. High cardinality balloons query memory and latency

Series churn is the silent killer. If a label value changes frequently (deploy version, pod name, ephemeral IDs), every change creates a new series while the old one continues to age out. Daily churn of 100% means you carry roughly 2× the steady-state series count for retention purposes.
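
The 2× figure follows from a deliberately simple model, sketched below under the assumption that churned series are replaced one-for-one at a constant rate. The function name and the linear model are illustrative assumptions, not a description of how Prometheus accounts for series internally.

```python
def series_in_window(steady_state: int, daily_churn: float, window_days: float) -> int:
    """Distinct series seen during a window, assuming one-for-one replacement.

    daily_churn is the fraction of series replaced per day (1.0 means 100%).
    """
    return round(steady_state * (1 + daily_churn * window_days))


# 10,000 steady-state series, every one replaced once per day, observed over one day.
print(series_in_window(10_000, 1.0, 1))  # 20000
```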

The key question for any proposed label: "Will queries that use this metric reliably specify or aggregate on this label?" If no → it should NOT be a label.


Label Evaluation Framework

When auditing a label set, assess each label against these criteria.

Cardinality Scoring


| Label Example | Cardinality | Verdict |
| --- | --- | --- |
| `env` (prod/staging/dev) | 2–5 values | ✅ Good |
| `job` (Prometheus scrape job) | 5–50 values | ✅ Good |
| `cluster`, `region` | Tens | ✅ Good |
| `namespace` (K8s) | Tens to low hundreds | ✅ Acceptable |
| `service`, `workload`, `container` | Tens to hundreds | ✅ Acceptable |
| `instance` (host:port) | Hundreds to low thousands | ⚠️ Evaluate: fine on per-instance metrics, risky on aggregated ones |
| `pod` (K8s) | Thousands + transient = high churn | ❌ Drop at scrape unless required |
| `path` / `route` (HTTP) | Bounded if templated; unbounded if raw URLs | ⚠️ Only with templated values (`/users/:id`) |
| `version`, `image_tag`, `git_sha` | Grows on every deploy → churn | ⚠️ Use sparingly; consider info-metric pattern |
| `user_id`, `request_id`, `trace_id` | Unbounded | ❌ Never as label; use exemplars |
| `customer_id`, `tenant_id` | Often unbounded | ❌ Only acceptable for small fixed tenant counts |
| `error_message`, `query`, `sql` | Unbounded text | ❌ Never |


Access Pattern Alignment


For each label, ask:

Do queries on this metric reliably aggregate by or filter on this label?
Does this label logically segment the metric the way users think about it?
Would removing this label force users to use exemplars, logs, or traces instead — and would that be acceptable for the rare lookup case?


Static vs. Dynamic Label Values


- Static / target labels (set once per scrape target via `relabel_configs`, e.g., `env=prod`, `cluster=us-east`, `team=payments`) add cardinality proportional to targets, not requests. Cheap and high-value. Use freely.
- Dynamic / sample labels (emitted by the application per measurement, e.g., `status_code`, `method`, `cache_hit`) multiply cardinality by value count. Keep possible values in the single digits or low tens. The application code is the source of truth: fix it there, not in Prometheus.


Consistency Check


- Label names consistent across services? (`status` vs `status_code` vs `http_status` produces three separate label families, so joins break)
- Label values normalized? (`200` vs `"200"`, `GET` vs `get`, `Error` vs `error`)
- Naming convention consistent? Prometheus convention is `snake_case` for both metric and label names
- Same concept, same name across services? (`service` vs `svc` vs `app_name`)


Histogram Bucket Discipline (critical, often missed)


Every histogram metric multiplies its base cardinality by (bucket count + 3): one `_bucket{le="..."}` series per configured bucket, plus the implicit `le="+Inf"` bucket, `_sum`, and `_count`. Client libraries that expose OpenMetrics `_created` series add one more on top.

- Default `prometheus.DefBuckets` has 11 buckets → 14× multiplier
- A histogram with `method`, `path`, `status` already at 1,000 label combinations becomes 14,000 series once the histogram multiplier is applied
- Always trim histogram label cardinality first: each label combination costs 14× more on a histogram than on a counter or gauge
- Consider native histograms (Prometheus 2.40+), which use a single sparse series instead of one per bucket: a major cardinality reduction for high-resolution latency tracking
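
To see how quickly this compounds, the arithmetic above can be written out as a short sketch. The function name is an assumption for illustration, and the multiplier is simply the (bucket count + 3) rule stated in this section.

```python
def histogram_series(label_combinations: int, configured_buckets: int) -> int:
    """Series produced by a classic histogram, using the (bucket count + 3) rule."""
    multiplier = configured_buckets + 3
    return label_combinations * multiplier


# 1,000 label combinations with the 11 default buckets.
print(histogram_series(1_000, 11))  # 14000

# Halving label combinations saves far more than trimming a few buckets.
print(histogram_series(500, 11))    # 7000
print(histogram_series(1_000, 8))   # 11000
```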


Info-Metric Pattern (for high-churn metadata)


When you want to know about a label (e.g., `version`, `git_sha`, `image_tag`) without paying for it on every metric, use an info metric:

```
# A single low-cardinality counter/gauge of value 1, with the metadata attached
app_build_info{app="payment-api", version="2.4.1", git_sha="a1b2c3"} 1
```

Then join at query time:
```promql
sum by (version) (
  rate(http_requests_total{app="payment-api"}[5m])
  * on (app) group_left (version) app_build_info
)
```

The `version` label lives on exactly one series per build, not on every metric. Note that `on (app)` requires exactly one `app_build_info` series per `app` on the right-hand side. If every instance exports the info metric, or two versions are live during a rollout, the join fails with a many-to-many matching error unless you also match on a per-target label such as `instance`.


Evaluation Output Format

When auditing a label set, produce a report in this structure:

Prometheus Label Strategy Audit

Summary

[1-2 sentence overall assessment: total estimated active series, biggest risks]

Per-Label Analysis

| Metric Family | Label | Cardinality | Used in Queries? | Verdict | Action |
| --- | --- | --- | --- | --- | --- |
| http_requests_total | path | Unbounded (raw URLs) | Sometimes | ❌ Remove | Template in code: `/users/:id` not `/users/12345` |
| http_requests_total | pod | High + churn | Rarely | ❌ Drop via metric_relabel_configs | Already in target metadata |

Histogram-Specific Findings

[Highlight any histograms with high label cardinality; these are amplified 14× or more]

Estimated Impact

- Active series reduction: [X series → Y series]
- DPM reduction: [X DPM → Y DPM] (samples per minute = series × ~6 at a 10s scrape interval)
- Memory impact: [if measurable]

Recommended Label Set

Implementation Plan

1. [Code changes: instrumentation hygiene]
2. [Scrape config changes: relabel_configs]
3. [Drop-at-scrape changes: metric_relabel_configs]
4. [Recording rules to materialize useful aggregates]

---

Recommended Common Target Labels

| Label | Purpose | Notes |
| --- | --- | --- |
| `job` | Prometheus scrape job name | Set automatically by Prometheus |
| `instance` | Target endpoint (`host:port`) | Set automatically; rename via `relabel_configs` to a friendlier value if needed |
| `env` | Environment (`prod`, `staging`, `dev`) | Set via static_configs labels or service discovery |
| `cluster` | Multi-cluster differentiation | Critical for federation/Mimir multi-tenant |
| `region` | Geographic region | |
| `team` / `squad` | Ownership; also useful for access control | |
| `service` | Logical service identity | One service may span multiple jobs |


Kubernetes Patterns


Recommended Labels (from kubernetes_sd_configs)

| Label | Source | Notes |
| --- | --- | --- |
| `namespace` | Pod metadata | Always keep |
| `container` | Pod spec | Low cardinality, useful for multi-container pods |
| `workload` | Derived: `{controller_kind}/{controller_name}` | Strongly preferred over `pod`: static, predictable |
| `service` | K8s Service | If scraping via Service |


Labels to AVOID by Default in Kubernetes


**`pod` label** ⚠️
- Highly transient: rolls on every deploy and on every restart
- High cardinality: 100 pods × N metrics = N × 100 series, and during rollouts you carry both old and new pods until the old series age out
- Almost never the right query dimension: users want workload, not pod instance
- Solution: keep `workload` as a label; drop `pod` via `metric_relabel_configs`; use exemplars or kube-state-metrics for pod-specific lookups

```yaml
# Drop the pod label from application metrics at scrape time
metric_relabel_configs:
  - regex: pod
    action: labeldrop
```

**`uid` label** ❌
- Completely unbounded (regenerates on every pod recreation)
- No legitimate query use; kept only by accident in default kubernetes_sd configs

**Application-emitted `instance` / `pod` / `node`** ❌
- These should come from target labels, not from the app code
- Drop them at scrape with `metric_relabel_configs` or fix in code

**kube-state-metrics annotation / label propagation** ⚠️
- `kube_pod_labels{label_app_kubernetes_io_*=...}` can carry dozens of metadata labels
- Each unique pod label combination is a new series
- Use kube-state-metrics' `--metric-labels-allowlist` to restrict to the labels you actually query on

---


Source-Side Prevention: Where to Fix What


There are four levers, in order of preference:

1. Fix in the Application (best)

Bad labels emitted by the app are the root cause. Examples:

- HTTP paths: use templated routes (`/users/:id`) not raw paths
- Error metrics: use a small enum (`error_type="timeout"`) not the error message string
- User-scoped metrics: don't include `user_id`; use exemplars to point to logs/traces
- Free-form input: never emit user-supplied strings as label values

If you control the code, this is always the right fix. It saves cost on every downstream system (Prometheus, remote_write, Mimir, Grafana Cloud).
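
As an illustration of the first point, path templating can be sketched as a small lookup that runs before a label value is recorded. This is a minimal sketch under stated assumptions: the route patterns, the `path_label` helper, and the `"other"` fallback value are all hypothetical, and most web frameworks already expose the matched route template, which is preferable to hand-written patterns.

```python
import re

# Hypothetical route patterns; a real service would take these from its router.
ROUTE_PATTERNS = [
    (re.compile(r"^/users/[0-9]+$"), "/users/:id"),
    (re.compile(r"^/orders/[0-9a-f-]{36}$"), "/orders/:order_id"),
    (re.compile(r"^/healthz$"), "/healthz"),
]


def path_label(raw_path: str) -> str:
    """Map a raw request path to a bounded label value."""
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(raw_path):
            return template
    # Anything unrecognised collapses into one value so the label stays bounded.
    return "other"


print(path_label("/users/12345"))       # /users/:id
print(path_label("/totally/unknown"))   # other
```

The fallback matters as much as the templates: without it, scanners and typos in URLs would still create one series per distinct path.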


2. `relabel_configs` (target-time relabeling)

Runs before the scrape. Used to:

- Set target labels (`env`, `cluster`, `team`) on discovered targets
- Drop entire targets you don't want to scrape
- Rewrite `instance` to a friendly value
- Add identity from service discovery metadata

```yaml
scrape_configs:
  - job_name: my-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Set workload from controller metadata, joined as "<kind>/<name>".
      # Note: for Deployment-managed pods the direct controller is the
      # ReplicaSet, whose name carries a per-rollout hash.
      - source_labels: [__meta_kubernetes_pod_controller_kind, __meta_kubernetes_pod_controller_name]
        target_label: workload
        separator: /
      # Set env from a pod label
      - source_labels: [__meta_kubernetes_pod_label_env]
        target_label: env
      # Only scrape pods explicitly opted in
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: "true"
        action: keep
```


3. `metric_relabel_configs` (scrape-time relabeling)

Runs after the scrape, before storage. Used to:

Drop high-cardinality labels the app shouldn't have emitted
Drop entire metrics you don't want
Rewrite label values for normalization

```yaml
scrape_configs:
  - job_name: my-app
    metric_relabel_configs:
      # Drop the pod label from every metric
      - regex: pod
        action: labeldrop

      # Drop a specific high-cardinality metric entirely
      - source_labels: [__name__]
        regex: my_app_request_details
        action: drop

      # Normalize status_code to a class (2xx, 3xx, 4xx, 5xx)
      - source_labels: [status_code]
        regex: (\d)\d\d
        target_label: status_code
        replacement: ${1}xx
```

This is the emergency stop for bad labels. Use when you can't fix the app immediately.
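
Before rolling out rules like these, it can help to reason about what they do to a single sample's label set. The sketch below imitates two of the rules in plain Python; it is not Prometheus code, the helper names are made up, and it relies on the fact that relabeling regexes are fully anchored, which `re.fullmatch` approximates.

```python
import re


def normalize_status(labels: dict[str, str]) -> dict[str, str]:
    """Imitate the status_code rule: rewrite only when the whole value matches."""
    match = re.fullmatch(r"([0-9])[0-9][0-9]", labels.get("status_code", ""))
    if match is None:
        return dict(labels)  # no match: labels pass through unchanged
    return {**labels, "status_code": f"{match.group(1)}xx"}


def drop_label(labels: dict[str, str], name_regex: str) -> dict[str, str]:
    """Imitate labeldrop: remove every label whose name fully matches the regex."""
    return {k: v for k, v in labels.items() if not re.fullmatch(name_regex, k)}


sample = {"__name__": "http_requests_total", "status_code": "404", "pod": "api-7d9f-abcde"}
result = drop_label(normalize_status(sample), "pod")
print(result)  # {'__name__': 'http_requests_total', 'status_code': '4xx'}
```

One thing the sketch makes visible: after dropping a label, two previously distinct series can end up with identical label sets, so check that the remaining labels still identify each series uniquely.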


4. Recording Rules (query-time cardinality reduction)


Pre-aggregate expensive series into a lower-cardinality recorded series. Stored at the same data point density but with far fewer series.

```yaml
groups:
  - name: http-requests-aggregates
    interval: 30s
    rules:
      # Drop pod/instance dimension; keep only service-level rollup
      - record: service:http_requests:rate5m
        expr: sum by (service, env, cluster, status_code) (rate(http_requests_total[5m]))
```

Queries that target the rollup are dramatically cheaper. The raw series still exist: recording rules don't reduce ingest cost (use Adaptive Metrics or `metric_relabel_configs` for that). They reduce query cost.


Instrumentation Hygiene (for app developers)


If the user is writing instrumentation code, these are the rules:

| Rule | Why |
| --- | --- |
| Never use unbounded user input as a label value | `email`, `user_id`, `query string`, `error message`: they're the #1 cardinality bug |
| Template HTTP paths before recording | `/users/{id}` not `/users/12345`. Most frameworks do this via routing metadata |
| Bound error labels via small enums | `error_type="timeout"` not `error="connection to db-shard-7 timed out at 14:32:09"` |
| Don't put `version` / `git_sha` / `build_id` on every metric | Use an info metric and join at query time |
| Don't emit `pod` / `node` / `host` from code | Comes from scrape targets; duplicating creates collisions |
| Avoid dynamically constructed label names (keys) | `metric{[user]=1}` cannot be bounded; use a fixed key |
| Use histograms sparingly and trim labels first | 14× cardinality amplification |
| Prefer exemplars over labels for trace correlation | Exemplars carry `trace_id` without inflating cardinality |
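
The "bound error labels via small enums" rule can be sketched as a classifier that maps exceptions to a fixed set of values. The function name and the particular set of categories below are assumptions chosen for illustration; a real service would pick categories that match its own failure modes.

```python
def error_type_label(exc: BaseException) -> str:
    """Map an exception to one of a small, fixed set of label values."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(exc, ValueError):
        return "invalid_input"
    # The message text never reaches the label; unknown errors share one value.
    return "other"


err = TimeoutError("connection to db-shard-7 timed out at 14:32:09")
print(error_type_label(err))  # timeout
```

The full message still belongs in logs or traces, where unbounded text is cheap; the metric only needs enough detail to aggregate on.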


Exemplars (the escape hatch)


Exemplars attach a `trace_id` (or any key-value pair) to specific samples without making it a label dimension. The ideal home for high-cardinality correlation data.

Requires the application to expose exemplars in OpenMetrics format and Prometheus 2.26+ running with the `--enable-feature=exemplar-storage` flag. Prometheus requests OpenMetrics from targets by default, so no exemplar-specific `scrape_configs` option is needed.

On the Prometheus server, size the exemplar buffer:

```yaml
storage:
  exemplars:
    max_exemplars: 100000
```

Use exemplars for:

- `trace_id` correlation (Tempo, Jaeger)
- `request_id` for specific debug lookups
- Any sparse "useful when you need it" key

Query exemplars via Grafana's exemplars-on-graph feature, not via PromQL aggregation.


The 80/20 Rule


The most impactful improvements almost always come from these five changes:

1. Drop unbounded labels at the app layer: `path` (untemplated), `user_id`, `error_message`. Single biggest win.
2. Trim histogram label cardinality before anything else: 14× amplification on every histogram.
3. Drop `pod` from application metrics and keep `workload` instead. Eliminates churn and sharply reduces series count.
4. Use info metrics for `version` / `git_sha` / `image_tag`: eliminates deploy-driven churn.
5. Set target labels via `relabel_configs`, not app code: `env`, `cluster`, `team`, `service` should never be emitted by the application.

Focus on these before anything else.


Labels to Avoid — Quick Reference


| Label | Why | Alternative |
| --- | --- | --- |
| `user_id`, `customer_id` (large tenant base) | Unbounded | Exemplars; aggregate by `tenant_tier` |
| `request_id`, `trace_id` | Unbounded | Exemplars |
| `path` / `route` (raw URLs) | Unbounded | Template in code: `/users/:id` |
| `error_message`, `query`, `sql` | Unbounded text | Bounded `error_type` enum |
| `version`, `git_sha`, `image_tag` (on every metric) | Churn on every deploy | Info metric pattern |
| `pod` (on app metrics) | Transient + high cardinality | `workload`; exemplars for pod-specific debug |
| `uid` (K8s) | Unbounded; regenerates on restart | Drop entirely |
| Application-emitted `instance`, `node`, `host` | Should come from scrape target | Drop via `metric_relabel_configs` |
| Dynamically-named label keys | Cannot be bounded | Use fixed keys with bounded values |
| Raw `status_code` on histograms | 14× amplification | Bucket to `status_class` (`2xx`, `4xx`, `5xx`) |


When to Route Elsewhere

- "Reduce my Grafana Cloud bill" → also engage the `adaptive-metrics` skill (post-ingest aggregation rules)
- "Which metrics are driving my DPM?" → engage the `dpm-finder` skill
- "My Prometheus is OOMing / scraping is failing right now" → engage the `prometheus-cardinality-troubleshooter` skill
- "How do I write the query to find the bad metric?" → engage the `promql` skill
- "How do I configure relabel rules in Alloy?" → engage the `alloy` skill

This skill's lane is strategy and design. Other skills own diagnosis and operational remediation.

