prometheus-label-strategy

Prometheus Label Strategy Evaluator

You are an expert in Prometheus label strategy. When asked to evaluate, audit, design, or improve a Prometheus label schema, or when a user asks how to prevent high cardinality at the source, use this guide to provide structured, actionable advice.
This skill is about preventing bad labels at ingest (instrumentation, scrape configuration, relabeling). For post-ingest cost reduction via aggregation rules, route the user to the `adaptive-metrics` skill. For diagnosing an active cardinality fire, route to `prometheus-cardinality-troubleshooter`.

Core Concepts

Series are the fundamental unit in Prometheus. Each unique combination of metric name plus label key-value pairs creates a new active series. Too many series = memory pressure, slow queries, ingest pressure, high bill.
Cardinality = the number of unique values a label can have. Total series for a metric is at most the product of cardinalities across its labels. A metric with `path` (100 values), `status_code` (10 values), `method` (5 values), and `instance` (50 values) can reach 100 × 10 × 5 × 50 = 250,000 series. Adding one more high-cardinality label often multiplies the count by 10–100×.
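The multiplication above can be sketched in a few lines of Python. This is only an upper-bound estimate, assuming every label combination actually occurs; the function name and the extra `tenant` label are illustrative and not part of any Prometheus API.

```python
from math import prod

def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())

labels = {"path": 100, "status_code": 10, "method": 5, "instance": 50}
print(estimate_series(labels))  # 250000

# Hypothetical extra label with 20 distinct values: the bound grows 20x.
print(estimate_series({**labels, "tenant": 20}))  # 5000000
```

Real series counts are usually lower than this bound because not every combination is observed, but the bound is what a worst-case capacity plan has to assume.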
The dual impact rule: High-cardinality labels hurt on both paths:
  • Ingestion path: More active series → larger head block, larger WAL, more memory, larger remote_write payloads, higher Grafana Cloud bill (Active Series + DPM)
  • Query path: PromQL operators (`sum by`, `rate`, joins) must materialize matching series in memory. High cardinality balloons query memory and latency
Series churn is the silent killer. If a label value changes frequently (deploy version, pod name, ephemeral IDs), every change creates a new series while the old one stays in storage until it ages out. Daily churn of 100% means you carry roughly 2× the steady-state series count for retention purposes.
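To make the churn arithmetic concrete, here is a rough sketch. It assumes a simple linear model that is not taken from this guide: each day a fraction `daily_churn` of the steady-state series is replaced, and the replaced series still count inside the window. The function name is illustrative.

```python
def series_in_window(steady_state: int, daily_churn: float, days: float = 1.0) -> int:
    """Distinct series seen over a window under a linear churn model (an assumption)."""
    return round(steady_state * (1 + daily_churn * days))

print(series_in_window(100_000, daily_churn=1.0))  # 200000
print(series_in_window(100_000, daily_churn=0.1))  # 110000
```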
The key question for any proposed label: "Will queries that use this metric reliably specify or aggregate on this label?" If no → it should NOT be a label.

Label Evaluation Framework

When auditing a label set, assess each label against these criteria.

Cardinality Scoring

| Label Example | Cardinality | Verdict |
|---|---|---|
| `env` (prod/staging/dev) | 2–5 values | ✅ Good |
| `job` (Prometheus scrape job) | 5–50 values | ✅ Good |
| `cluster`, `region` | Tens | ✅ Good |
| `namespace` (K8s) | Tens–low hundreds | ✅ Acceptable |
| `service`, `workload`, `container` | Tens–hundreds | ✅ Acceptable |
| `instance` (host:port) | Hundreds–low thousands | ⚠️ Evaluate: fine on per-instance metrics, risky on aggregated ones |
| `pod` (K8s) | Thousands + transient = high churn | ❌ Drop at scrape unless required |
| `path` / `route` (HTTP) | Bounded if templated; unbounded if raw URLs | ⚠️ Only with templated values (`/users/:id`) |
| `version`, `image_tag`, `git_sha` | Grows on every deploy → churn | ⚠️ Use sparingly; consider info-metric pattern |
| `user_id`, `request_id`, `trace_id` | Unbounded | ❌ Never as label; use exemplars |
| `customer_id`, `tenant_id` | Often unbounded | ❌ Only acceptable for small fixed tenant counts |
| `error_message`, `query`, `sql` | Unbounded text | ❌ Never |

Access Pattern Alignment

For each label, ask:
  • Do queries on this metric reliably aggregate by or filter on this label?
  • Does this label logically segment the metric the way users think about it?
  • Would removing this label force users to use exemplars, logs, or traces instead — and would that be acceptable for the rare lookup case?

Static vs. Dynamic Label Values

  • Static / target labels (set once per scrape target via `relabel_configs`, e.g., `env=prod`, `cluster=us-east`, `team=payments`) add cardinality proportional to targets, not requests. Cheap and high-value. Use freely.
  • Dynamic / sample labels (emitted by the application per measurement, e.g., `status_code`, `method`, `cache_hit`) multiply cardinality by value count. Keep possible values in the single digits or low tens. The application code is the source of truth: fix it there, not in Prometheus.

Consistency Check

  • Label names consistent across services? (`status` vs `status_code` vs `http_status` produces three separate label families, and joins break)
  • Label values normalized? (`200` vs `"200"`, `GET` vs `get`, `Error` vs `error`)
  • Naming convention consistent? Prometheus convention is `snake_case` for both metric and label names
  • Same concept, same name across services? (`service` vs `svc` vs `app_name`)
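The checks above lend themselves to a small lint script. The following is a minimal sketch, assuming you have already collected the label names each service emits into a dictionary; the function names and the sample services are hypothetical.

```python
import re

# snake_case as used by the Prometheus naming convention (lower-case words joined by "_").
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def non_snake_case(label_names):
    """Return label names that break the snake_case convention."""
    return sorted(name for name in label_names if not SNAKE_CASE.match(name))

def label_name_drift(labels_by_service):
    """Return label names that appear in some services but not in all of them.

    Assumes at least one service is given.
    """
    name_sets = [set(names) for names in labels_by_service.values()]
    return sorted(set.union(*name_sets) - set.intersection(*name_sets))

services = {
    "checkout": {"status_code", "method"},
    "search": {"http_status", "method"},
}
print(label_name_drift(services))                  # ['http_status', 'status_code']
print(non_snake_case({"httpStatus", "app-name", "env"}))  # ['app-name', 'httpStatus']
```

Drift reported here is only a hint: a label may legitimately exist on one service and not another, so the output is a review list rather than a verdict.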

Histogram Bucket Discipline (critical, often missed)

Every classic histogram multiplies its base cardinality by (bucket count + 3): one `_bucket{le="..."}` series per configured bucket, the implicit `le="+Inf"` bucket, plus `_sum` and `_count`. Client libraries that also expose an OpenMetrics `_created` series add one more.
  • Default `prometheus.DefBuckets` has 11 buckets → 14× multiplier
  • A histogram whose `method`, `path`, `status` labels already produce 1,000 combinations becomes 14,000 series once the histogram series are counted
  • Always trim histogram label cardinality first: every label combination costs 14× more on a histogram than on a counter or gauge
  • Consider native histograms (experimental since Prometheus 2.40), which use a single sparse series instead of one per bucket, a major cardinality reduction for high-resolution latency tracking
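The amplification can be checked with a short calculation. This sketch assumes a classic histogram exposes one series per configured bucket, one for `le="+Inf"`, and one each for `_sum` and `_count`; it ignores `_created`, and the function name is illustrative.

```python
DEF_BUCKET_COUNT = 11  # number of boundaries in the Go client's prometheus.DefBuckets

def classic_histogram_series(label_combinations: int, bucket_count: int = DEF_BUCKET_COUNT) -> int:
    """Series for a classic histogram: configured buckets + le="+Inf" + _sum + _count."""
    per_combination = bucket_count + 1 + 2
    return label_combinations * per_combination

print(classic_histogram_series(1_000))                  # 14000
print(classic_histogram_series(1_000, bucket_count=5))  # 8000
```

The second call shows the other lever: trimming buckets helps, but much less than trimming the label combinations that get multiplied.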

Info-Metric Pattern (for high-churn metadata)

When you want to know about a label (e.g., `version`, `git_sha`, `image_tag`) without paying for it on every metric, use an info metric:

A single low-cardinality counter/gauge of value 1, with the metadata attached:
app_build_info{app="payment-api", version="2.4.1", git_sha="a1b2c3"} 1

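Then join at query time:
```promql
sum by (version) (
  rate(http_requests_total{app="payment-api"}[5m])
  * on (app) group_left (version) app_build_info
)
```
The `version` label lives on exactly one series per build, not on every metric. The `on (app)` match assumes one `app_build_info` series per `app`; while several versions are live during a rollout, match on a per-target label such as `instance` as well so each request series still finds exactly one info series.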

Evaluation Output Format

When auditing a label set, produce a report in this structure:

Prometheus Label Strategy Audit

Summary

[1-2 sentence overall assessment — total estimated active series, biggest risks]

Per-Label Analysis

| Metric Family | Label | Cardinality | Used in Queries? | Verdict | Action |
|---|---|---|---|---|---|
| `http_requests_total` | `path` | Unbounded (raw URLs) | Sometimes | ❌ Remove | Template in code: `/users/:id` not `/users/12345` |
| `http_requests_total` | `pod` | High + churn | Rarely | ❌ Drop via `metric_relabel_configs` | Already in target metadata |

Histogram-Specific Findings

[Highlight any histograms with high label cardinality — these are 14×+ amplified]

Estimated Impact

  • Active series reduction: [X series → Y series]
  • DPM reduction: [X DPM → Y DPM] (samples-per-minute = series × ~6 at 10s scrape)
  • Memory impact: [if measurable]

Recommended Label Set

[Final recommended labels per metric family]

Implementation Plan

  1. [Code changes — instrumentation hygiene]
  2. [Scrape config changes — relabel_configs]
  3. [Drop-at-scrape changes — metric_relabel_configs]
  4. [Recording rules to materialize useful aggregates]

---

Recommended Common Target Labels

These should be set as target labels (via `relabel_configs` on the scrape job, NOT emitted by the app). They are per-target, low cardinality, and high query value:
| Label | Purpose | Notes |
|---|---|---|
| `job` | Prometheus scrape job name | Set automatically by Prometheus |
| `instance` | Target endpoint (`host:port`) | Set automatically; rename via `relabel_configs` to a friendlier value if needed |
| `env` | Environment (`prod`, `staging`, `dev`) | Set via static_configs labels or service discovery |
| `cluster` | Multi-cluster differentiation | Critical for federation/Mimir multi-tenant |
| `region` | Geographic region | |
| `team` / `squad` | Ownership; also useful for access control | |
| `service` | Logical service identity | One service may span multiple jobs |
These should NOT be re-emitted by the application. If the app emits a `cluster` label, it duplicates the target label and creates collisions and `honor_labels` decisions you don't want to make.

Kubernetes Patterns

Recommended Labels (from kubernetes_sd_configs)

| Label | Source | Notes |
|---|---|---|
| `namespace` | Pod metadata | Always keep |
| `container` | Pod spec | Low cardinality, useful for multi-container pods |
| `workload` | Derived: `{controller_kind}/{controller_name}` | Strongly preferred over `pod`: static, predictable |
| `service` | K8s Service | If scraping via Service |

Labels to AVOID by Default in Kubernetes

**`pod` label** ⚠️
  • Highly transient: rolls on every deploy and on every restart
  • High cardinality: 100 pods × N metrics = N × 100 series, and on rollouts you carry both old and new pods until the old series age out
  • Almost never the right query dimension: users want workload, not pod instance
  • Solution: Keep `workload` as a label; drop `pod` via `metric_relabel_configs`; use exemplars or kube-state-metrics for pod-specific lookups
yaml
# Drop the pod label from application metrics at scrape time
metric_relabel_configs:
  - regex: pod
    action: labeldrop

**`uid` label** ❌
- Completely unbounded (regenerates on every pod recreation)
- No legitimate query use — kept only by accident in default kubernetes_sd configs

**Application-emitted `instance` / `pod` / `node`** ❌
- These should come from target labels, not from the app code
- Drop them at scrape with `metric_relabel_configs` or fix in code

**kube-state-metrics annotation / label propagation** ⚠️
- `kube_pod_labels{label_app_kubernetes_io_*=...}` can carry dozens of metadata labels
- Each unique pod label combination is a new series
- Use kube-state-metrics' `--metric-labels-allowlist` to restrict to the labels you actually query on

---

Source-Side Prevention: Where to Fix What

There are four levers, in order of preference:

1. Fix in the Application (best)

Bad labels emitted by the app are the root cause. Examples:
  • HTTP paths: use templated routes (`/users/:id`) not raw paths
  • Error metrics: use a small enum (`error_type="timeout"`) not the error message string
  • User-scoped metrics: don't include `user_id`; use exemplars to point to logs/traces
  • Free-form input: never emit user-supplied strings as label values
If you control the code, this is always the right fix. It saves cost on every downstream system (Prometheus, remote_write, Mimir, Grafana Cloud).
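The first two fixes can be sketched in application code. The example below is a minimal, framework-free illustration: the rule that numeric and UUID segments are path parameters, the `:id` placeholder, and the error-type mapping are all assumptions made for the example. Where a web framework exposes the matched route template, using that is preferable to pattern matching.

```python
import re

# Assumption: numeric segments and UUIDs are path parameters.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")

def template_path(raw_path: str) -> str:
    """Replace ID-like path segments with a fixed placeholder and drop the query string."""
    path_only = raw_path.split("?")[0]
    segments = [
        ":id" if segment.isdigit() or _UUID.match(segment) else segment
        for segment in path_only.split("/")
    ]
    return "/".join(segments)

# Hypothetical bounded enum; anything unrecognized collapses to "other".
_ERROR_TYPES = {
    TimeoutError: "timeout",
    ConnectionError: "connection",
    PermissionError: "permission",
}

def error_type(exc: BaseException) -> str:
    """Map an exception to a small, fixed set of label values."""
    for exc_class, name in _ERROR_TYPES.items():
        if isinstance(exc, exc_class):
            return name
    return "other"

print(template_path("/users/12345/orders?page=2"))  # /users/:id/orders
print(error_type(ConnectionResetError()))           # connection
```

The catch-all `"other"` value is what keeps the label bounded: new failure modes show up as a growing `other` count rather than as new series.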

2. `relabel_configs` (target-time relabeling)

Runs before the scrape. Used to:
  • Set target labels (`env`, `cluster`, `team`) on discovered targets
  • Drop entire targets you don't want to scrape
  • Rewrite `instance` to a friendly value
  • Add identity from service discovery metadata
yaml
scrape_configs:
  - job_name: my-app
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Set workload from controller metadata
      - source_labels: [__meta_kubernetes_pod_controller_kind, __meta_kubernetes_pod_controller_name]
        target_label: workload
        separator: /
      # Set env from a pod label
      - source_labels: [__meta_kubernetes_pod_label_env]
        target_label: env
      # Only scrape pods explicitly opted in
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: "true"
        action: keep
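One caveat worth checking in your own cluster: for Pods managed by a Deployment, the Pod's owning controller is usually a ReplicaSet whose name ends in a pod-template-hash, so a `workload` value built directly from controller kind and name would change on every rollout. The Python sketch below models one way to normalize it. The function is hypothetical, and it assumes every ReplicaSet belongs to a Deployment, which does not hold for standalone ReplicaSets.

```python
import re

def workload_label(controller_kind: str, controller_name: str) -> str:
    """Build a `kind/name` workload value, collapsing ReplicaSets to their Deployment."""
    if controller_kind == "ReplicaSet":
        # Assumption: the name is "<deployment>-<pod-template-hash>".
        controller_kind = "Deployment"
        controller_name = re.sub(r"-[a-z0-9]+$", "", controller_name)
    return f"{controller_kind}/{controller_name}"

print(workload_label("ReplicaSet", "checkout-7d9f8b6c5"))  # Deployment/checkout
print(workload_label("StatefulSet", "db"))                 # StatefulSet/db
```

The same normalization can be expressed as an additional relabel rule with a regex that strips the hash suffix before `workload` is written.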

3. `metric_relabel_configs` (scrape-time relabeling)

Runs after the scrape, before storage. Used to:
  • Drop high-cardinality labels the app shouldn't have emitted
  • Drop entire metrics you don't want
  • Rewrite label values for normalization
yaml
scrape_configs:
  - job_name: my-app
    metric_relabel_configs:
      # Drop the pod label from every metric
      - regex: pod
        action: labeldrop

      # Drop a specific high-cardinality metric entirely
      - source_labels: [__name__]
        regex: my_app_request_details
        action: drop

      # Normalize status_code to a class (2xx, 3xx, 4xx, 5xx)
      - source_labels: [status_code]
        regex: (\d)\d\d
        target_label: status_code
        replacement: ${1}xx
This is the emergency stop for bad labels. Use when you can't fix the app immediately.
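To see what the `status_code` rule does to individual values, the following Python sketch mimics a `replace` rule on a single label. It is an approximation for illustration, not the Prometheus implementation: it relies on the documented behavior that relabel regexes are anchored at both ends and that a non-matching value is left untouched. The function name is hypothetical.

```python
import re

def relabel_replace(value: str, regex: str, replacement: str) -> str:
    """Mimic a Prometheus `replace` relabel rule applied to one label value."""
    # Relabel regexes are fully anchored, so fullmatch is the closest equivalent.
    match = re.fullmatch(regex, value)
    if match is None:
        return value  # no match: the label keeps its original value
    # Translate Prometheus-style ${1} references to Python's \g<1> syntax.
    py_replacement = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    return match.expand(py_replacement)

for code in ["200", "404", "503", "timeout"]:
    print(code, "->", relabel_replace(code, r"(\d)\d\d", "${1}xx"))
# 200 -> 2xx
# 404 -> 4xx
# 503 -> 5xx
# timeout -> timeout
```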

4. Recording Rules (query-time cardinality reduction)

Pre-aggregate expensive series into a lower-cardinality recorded series. Stored at the same data point density but with far fewer series.
yaml
groups:
  - name: http-requests-aggregates
    interval: 30s
    rules:
      # Drop pod/instance dimension; keep only service-level rollup
      - record: service:http_requests:rate5m
        expr: sum by (service, env, cluster, status_code) (rate(http_requests_total[5m]))
Queries that target the rollup are dramatically cheaper. The raw series still exist, so recording rules don't reduce ingest cost (use Adaptive Metrics or `metric_relabel_configs` for that). They reduce query cost.
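The effect of the rollup on series count can be illustrated with a small simulation. This sketch groups in-memory samples the way `sum by (...)` groups series; the sample data, pod names, and function name are made up for the example.

```python
from collections import defaultdict

def sum_by(samples, keep):
    """Aggregate (labels, value) pairs, keeping only the label names in `keep`."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A missing label is treated as an empty value.
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)

# 50 pods, each reporting a rate for two status codes: 100 raw series.
samples = [
    ({"service": "checkout", "status_code": code, "pod": f"checkout-{i}"}, rate)
    for i in range(50)
    for code, rate in (("200", 2.0), ("500", 0.5))
]

rollup = sum_by(samples, keep=("service", "status_code"))
print(len(samples), "->", len(rollup))  # 100 -> 2
```

The 100 raw series are still ingested and stored; only dashboards and alerts that read the two rollup series get cheaper.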

Instrumentation Hygiene (for app developers)

If the user is writing instrumentation code, these are the rules:
| Rule | Why |
|---|---|
| Never use unbounded user input as a label value | `email`, `user_id`, `query string`, `error message`: they're the #1 cardinality bug |
| Template HTTP paths before recording | `/users/{id}` not `/users/12345`. Most frameworks do this via routing metadata |
| Bound error labels via small enums | `error_type="timeout"` not `error="connection to db-shard-7 timed out at 14:32:09"` |
| Don't put `version` / `git_sha` / `build_id` on every metric | Use an info metric and join at query time |
| Don't emit `pod` / `node` / `host` from code | Comes from scrape targets; duplicating creates collisions |
| Avoid dynamically constructed label names (keys) | `metric{[user]=1}` cannot be bounded; use a fixed key |
| Use histograms sparingly and trim labels first | 14× cardinality amplification |
| Prefer exemplars over labels for trace correlation | Exemplars carry `trace_id` without inflating cardinality |

Exemplars (the escape hatch)

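Exemplars attach a `trace_id` (or any key-value pair) to specific samples without making it a label dimension. They are the ideal home for high-cardinality correlation data.
Requires the application to expose exemplars in OpenMetrics format and Prometheus 2.26+ started with the exemplar storage feature flag:
shell
prometheus --config.file=prometheus.yml --enable-feature=exemplar-storage
No exemplar-specific scrape_config setting is needed, because Prometheus requests OpenMetrics from targets by default.
The in-memory exemplar buffer is sized in the Prometheus configuration file: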
yaml
storage:
  exemplars:
    max_exemplars: 100000
Use exemplars for:
  • `trace_id` correlation (Tempo, Jaeger)
  • `request_id` for specific debug lookups
  • Any sparse "useful when you need it" key
Query exemplars via Grafana's exemplars-on-graph feature, not via PromQL aggregation.
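For orientation, this is roughly what an exemplar looks like on the wire in the OpenMetrics text format: the normal sample, then `#`, then the exemplar's own label set and value. The helper below is a simplified sketch for illustration only. It does not escape label values or emit timestamps, and in practice the client library renders this line for you.

```python
def exemplar_line(metric, labels, value, exemplar_labels, exemplar_value):
    """Render one OpenMetrics-style sample line with an attached exemplar."""
    def fmt(pairs):
        return ",".join(f'{key}="{val}"' for key, val in pairs.items())
    return f"{metric}{{{fmt(labels)}}} {value} # {{{fmt(exemplar_labels)}}} {exemplar_value}"

print(exemplar_line(
    "http_request_duration_seconds_bucket", {"le": "0.5"}, 12,
    {"trace_id": "abc123"}, 0.43,
))
# http_request_duration_seconds_bucket{le="0.5"} 12 # {trace_id="abc123"} 0.43
```

Note that `trace_id` sits after the `#`, outside the series label set, which is why it adds no series.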

The 80/20 Rule

The most impactful improvements almost always come from these five changes:
  1. Drop unbounded labels at the app layer: `path` (untemplated), `user_id`, `error_message`. Single biggest win.
  2. Trim histogram label cardinality before anything else: 14× amplification on every histogram.
  3. Drop `pod` from application metrics and keep `workload` instead. Eliminates churn, big series-count reduction.
  4. Use info metrics for `version` / `git_sha` / `image_tag`, which eliminates deploy-driven churn.
  5. Set target labels via `relabel_configs`, not app code: `env`, `cluster`, `team`, `service` should never be emitted by the application.
Focus on these before anything else.

Labels to Avoid — Quick Reference

| Label | Why | Alternative |
|---|---|---|
| `user_id`, `customer_id` (large tenant base) | Unbounded | Exemplars; aggregate by `tenant_tier` |
| `request_id`, `trace_id` | Unbounded | Exemplars |
| `path` / `route` (raw URLs) | Unbounded | Template in code: `/users/:id` |
| `error_message`, `query`, `sql` | Unbounded text | Bounded `error_type` enum |
| `version`, `git_sha`, `image_tag` (on every metric) | Churn on every deploy | Info metric pattern |
| `pod` (on app metrics) | Transient + high cardinality | `workload`; exemplars for pod-specific debug |
| `uid` (K8s) | Unbounded; regenerates on restart | Drop entirely |
| Application-emitted `instance`, `node`, `host` | Should come from scrape target | Drop via `metric_relabel_configs` |
| Dynamically-named label keys | Cannot be bounded | Use fixed keys with bounded values |
| Raw `status_code` on histograms | 14× amplification | Bucket to `status_class` (`2xx`, `4xx`, `5xx`) |

When to Route Elsewhere

  • "Reduce my Grafana Cloud bill" → also engage `adaptive-metrics` skill (post-ingest aggregation rules)
  • "Which metrics are driving my DPM?" → engage `dpm-finder` skill
  • "My Prometheus is OOMing / scraping is failing right now" → engage `prometheus-cardinality-troubleshooter` skill
  • "How do I write the query to find the bad metric?" → engage `promql` skill
  • "How do I configure relabel rules in Alloy?" → engage `alloy` skill
This skill's lane is strategy and design. Other skills own diagnosis and operational remediation.