victoriametrics-cardinality-analysis

VictoriaMetrics Cardinality Analysis


Systematic cardinality analysis for VictoriaMetrics. Collects TSDB status, metric usage stats, and label value patterns, then produces a structured report with specific relabeling and stream aggregation configs the user can apply directly.
The goal is to find the highest-impact optimization opportunities — metrics nobody queries, labels that explode cardinality for no monitoring value, and patterns that indicate data hygiene problems (error messages as labels, SQL text as labels, UUIDs as labels).

Environment


Uses the same env vars as the victoriametrics-query skill:
  • $VM_METRICS_URL - base URL
    • cluster: export VM_METRICS_URL="https://vmselect.example.com/select/0/prometheus"
    • single: export VM_METRICS_URL="http://localhost:8428"
  • $VM_AUTH_HEADER - auth header (empty if no auth is required)

Workflow


Phase 1: Data Collection


Spawn 3 subagents in a single response to collect data in parallel. Each subagent prompt must include the curl auth pattern and environment variable references above.
If the user specified a scope (job, namespace, metric prefix), pass it as a match[] parameter to TSDB status queries and as series selectors to label queries.


Subagent 1: TSDB Overview


Agent name: cardinality-tsdb | Description: "Collect TSDB cardinality stats"

Query 1 — Yesterday's series (captures recently churned series):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50&date=$(date -d 'yesterday' +%Y-%m-%d)" | jq '.data'
```

Queries yesterday's stats — broader than today (includes series that may have already churned) without scanning the entire TSDB.

Query 2 — Today's active series:

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | jq '.data'
```

Query 3 — Focus on known high-cardinality labels:

```bash
for label in pod instance container path url user_id request_id session_id trace_id le name; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, focus: .data.seriesCountByFocusLabelValue}'
done
```

Return: All raw JSON, preserving structure. Include totalSeries, totalLabelValuePairs, seriesCountByMetricName, seriesCountByLabelName, and seriesCountByLabelValuePair from each query.


Subagent 2: Metric Usage Stats


Agent name: cardinality-usage | Description: "Find unused and rarely-queried metrics"

Query 1 — Never-queried metrics:

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=0&limit=500" | jq '.'
```

Query 2 — Rarely-queried metrics (≤5 total queries):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=5&limit=500" | jq '.'
```

Query 3 — Stats overview (tracking period):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=1" | \
  jq '{statsCollectedSince: .statsCollectedSince, statsCollectedRecordsTotal: .statsCollectedRecordsTotal}'
```

If the endpoint returns an error, storage.trackMetricNamesStats may not be enabled on vmstorage. Note this in the return and proceed — the analysis can still work with TSDB status data alone.

Query 4 — Cross-check: are "unused" metrics referenced in alerting rules?

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/rules" | jq '[.data.groups[].rules[].query]'
```

Extract metric names from rule queries. Any "unused" metric that appears in an alert/recording rule is NOT safe to drop — it's queried indirectly.

Return: Unused metrics with cross-reference against alert rules. Flag each as:
  • safe to drop: never queried AND not in any rule
  • used by rules only: never queried by dashboards but referenced in rules — verify intent
  • rarely used: low query count, may be accessed infrequently (e.g., monthly reports)
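The Query 4 cross-check can be sketched in plain shell. This is a rough illustration on canned rule queries — the metric names are hypothetical, and the token regex deliberately over-matches (it also grabs function names like `rate`), which is safe for a "do not drop" check. In the real workflow the queries come from the jq extraction above:

```shell
# Canned rule queries (illustrative; in practice, the output of the jq command above)
queries='rate(http_requests_total[5m]) > 100
sum(app_errors_total) by (service)'

# Extract candidate metric-name tokens from the rule queries
rule_tokens=$(printf '%s\n' "$queries" | grep -oE '[a-zA-Z_][a-zA-Z0-9_:]*' | sort -u)

# Flag each "unused" metric: referenced by a rule, or a genuine drop candidate
for m in http_requests_total legacy_unused_metric; do
  if printf '%s\n' "$rule_tokens" | grep -qx "$m"; then
    echo "used-by-rules: $m"
  else
    echo "safe-to-drop-candidate: $m"
  fi
done
```

False positives from over-matching only make the check more conservative — a metric is never wrongly marked safe to drop.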


Subagent 3: Label Pattern Inspection


Agent name: cardinality-labels | Description: "Inspect label values for problematic patterns"

All data comes from the TSDB status endpoint — do NOT use /api/v1/labels or /api/v1/label/.../values.

Query 1 — Label cardinality overview (unique value counts + series counts):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '{labelValueCountByLabelName: .data.labelValueCountByLabelName, seriesCountByLabelName: .data.seriesCountByLabelName}'
```

labelValueCountByLabelName returns labels sorted by unique value count (replaces per-label /values counting). seriesCountByLabelName shows how many series each label appears in.

Query 2 — Sample values for high-cardinality labels via focusLabel. For each label with >100 unique values from Query 1, fetch sample values:

```bash
for label in <top labels from Query 1>; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, topValues: .data.seriesCountByFocusLabelValue}'
done
```

seriesCountByFocusLabelValue returns label values sorted by series count — use the value names to detect problematic patterns.

Query 3 — High-cardinality label-value pairs:

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '.data.seriesCountByLabelValuePair'
```

Shows which specific label=value pairs contribute the most series.

Pattern detection — classify label values from focusLabel samples:

| Pattern | Regex hint | Indicates |
|---|---|---|
| UUIDs | `[0-9a-f]{8}-[0-9a-f]{4}-` | Request/session/trace IDs as labels |
| IP addresses | `\d+\.\d+\.\d+\.\d+` | Per-client or per-pod IP tracking |
| Long strings (>50 chars) | length check | Error messages, SQL, stack traces |
| SQL keywords | `SELECT\|INSERT\|UPDATE\|DELETE\|FROM\|WHERE` | Query text stored as label |
| URL paths with IDs | `/api/.*/[0-9a-f]+` | Unsanitized HTTP paths |
| Timestamps | epoch or ISO8601 | Time values as labels (unbounded) |
| Stack traces | `at .*\.(java\|go\|py):` | Error details as labels |

Return: Table of labels sorted by unique value count, with detected pattern, sample values from focusLabel, and series impact.
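The pattern checks above can be sketched as a small shell classifier. The sample values and the first-match-wins ordering are illustrative, not from the source:

```shell
# Canned sample label values (illustrative)
values='9f1b2c3d-4e5f-6789-abcd-ef0123456789
10.42.7.13
SELECT id FROM orders WHERE id = 7
/api/v1/users/8d3f9a2c'

# Classify each value with the regex hints from the table (first match wins)
printf '%s\n' "$values" | while IFS= read -r v; do
  if printf '%s' "$v" | grep -qE '[0-9a-f]{8}-[0-9a-f]{4}-'; then echo "uuid: $v"
  elif printf '%s' "$v" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'; then echo "ip: $v"
  elif printf '%s' "$v" | grep -qE 'SELECT|INSERT|UPDATE|DELETE|FROM|WHERE'; then echo "sql: $v"
  elif printf '%s' "$v" | grep -qE '^/api/.*/[0-9a-f]+$'; then echo "url-with-id: $v"
  else echo "ok: $v"
  fi
done
```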


Phase 2: Analysis


After all subagents return, compile and classify findings. This is the analytical core — apply judgment, not mechanical filtering.

Category 1: Unused Metrics (Quick Wins)


Cross-reference metric usage stats with TSDB series counts:
  • Drop candidates: queryRequestsCount=0, not in any alert/recording rule, >100 series
  • Verify candidates: queryRequestsCount=0 but referenced in rules — check if rule is still needed
  • Low-priority: queryRequestsCount≤5 with few series — not worth the config churn

Sort by series count descending — the biggest unused metrics are the biggest wins.
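A minimal sketch of this triage logic on canned stats — metric names, counts, and the 100-series threshold below are illustrative:

```shell
# Canned usage stats: name, query count, series count, referenced-in-rules
stats='old_metric_foo 0 5000 no
old_metric_bar 0 2000 yes
tiny_metric 3 12 no'

# Apply the triage rules: drop-candidate / verify / low-priority
printf '%s\n' "$stats" | while read -r name q series rules; do
  if [ "$q" -eq 0 ] && [ "$rules" = "no" ] && [ "$series" -gt 100 ]; then
    echo "drop-candidate: $name ($series series)"
  elif [ "$q" -eq 0 ] && [ "$rules" = "yes" ]; then
    echo "verify: $name"
  else
    echo "low-priority: $name"
  fi
done
```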

Category 2: High-Cardinality Labels


Labels with excessive unique values that drive series explosion:

| Label pattern | Assessment | Typical remedy |
|---|---|---|
| user_id, customer_id, account_id | Should NEVER be metric labels — belongs in logs/traces | Drop label |
| request_id, session_id, trace_id, span_id | Correlation IDs — never metric labels | Drop label |
| error, error_message, reason, status_message | Unbounded strings | Drop label or replace with error code |
| sql, query, command, statement | Query text in labels — unbounded | Drop label |
| path, url, uri, endpoint | Unbounded if not sanitized | Relabel to normalize, or stream aggregate without |
| pod, container | Normal for k8s but high churn | Stream aggregate without, if per-pod not needed |
| instance | Normal for node metrics, wasteful for app metrics | Stream aggregate without for app-level metrics |
| le (histogram buckets) | Fine-grained buckets multiply every label combination | Reduce bucket count |

For each finding, estimate impact: (series with this label) - (series without) ≈ series saved.
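A quick back-of-the-envelope version of that estimate, with made-up numbers:

```shell
# Hypothetical metric: 3 methods x 4 status codes x 50 pods, plus a `path` label
# with 200 unique values. Dropping `path` collapses that whole dimension.
other_combinations=$((3 * 4 * 50))        # series without `path`: 600
with_path=$((other_combinations * 200))   # series with `path`: 120000
echo "series saved: $((with_path - other_combinations))"
```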

Category 3: Histogram Bloat

类别3:直方图膨胀

Check metrics ending in _bucket:
  • How many unique le values?
  • Each additional bucket multiplies series by (number of label combinations)
  • Look for histograms where most buckets are empty or redundant
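One way to sketch the per-histogram le count, here on a canned series dump — in practice this would be fed from the TSDB status focusLabel=le output or a series listing, and the metric names below are illustrative:

```shell
# Canned series dump: metric name + le value, one pair per line (illustrative)
series='http_request_duration_seconds_bucket le=0.1
http_request_duration_seconds_bucket le=0.5
http_request_duration_seconds_bucket le=1
http_request_duration_seconds_bucket le=+Inf
rpc_latency_bucket le=0.1
rpc_latency_bucket le=+Inf'

# Count unique le values per _bucket metric
printf '%s\n' "$series" | sort -u | awk '{count[$1]++} END {for (m in count) print m, count[m]}' | sort
```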

Category 4: Series Churn

类别4:序列变动

Compare yesterday's stats vs today's:
  • Ratio >3:1 suggests significant churn from pod restarts, deployments, short-lived jobs
  • Not directly fixable via relabeling, but indicates opportunity for dedup_interval or -search.maxStalenessInterval tuning
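If churn points at duplicated short-lived series, stream aggregation rules can deduplicate samples before aggregating. A minimal sketch, assuming vmagent stream aggregation — the matcher and intervals are illustrative:

```yaml
- match: '{__name__=~"http_.*"}'
  interval: 1m
  dedup_interval: 30s   # deduplicate samples before aggregating
  without: [pod]
  outputs: [total]
```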


Phase 3: Report

阶段3:报告

Compile into a structured report. Every finding must include an impact estimate and a specific remedy config.
Use this template:

VictoriaMetrics Cardinality Report — <date>


Overview


| Metric | Value |
|---|---|
| Total active series (today) | X |
| Total series (yesterday) | X |
| Churn ratio (yesterday / today) | X:1 |
| Unique metric names | X |
| Stats tracking since | <date> |

1. Unused Metrics


Potential savings: ~X series (Y% of total)

| Metric | Series | Last Queried | In Alert Rules | Action |
|---|---|---|---|---|
| ... | ... | never | no | Drop |
| ... | ... | never | yes — verify | Check rule |

<details> <summary>Relabeling config to drop unused metrics</summary>

```yaml
# Add to vmagent metric_relabel_configs (VMServiceScrape or global)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric1|metric2|metric3"
    action: drop
```

</details>

2. High-Cardinality Labels


Potential savings: ~X series (Y%)

| Label | Unique Values | Top Affected Metrics | Pattern | Action |
|---|---|---|---|---|
| user_id | 50,000 | http_requests_total | UUID | Drop |
| path | 10,000 | http_request_duration | URL paths | Aggregate |
| error_message | 5,000 | app_errors_total | Long strings | Drop |

<details> <summary>Drop labels that should never be in metrics</summary>

```yaml
metric_relabel_configs:
  - regex: "user_id|request_id|session_id|trace_id|error_message|sql_query"
    action: labeldrop
```

</details> <details> <summary>Stream aggregation for high-cardinality HTTP labels</summary>

```yaml
# vmagent stream aggregation config
- match: '{__name__=~"http_request.*"}'
  interval: 1m
  without: [path, instance, pod]
  outputs: [total]
  drop_input: true # enable after verifying aggregated output
- match: '{__name__=~"http_request_duration.*_bucket"}'
  interval: 1m
  without: [pod, instance]
  outputs: [total]
  keep_metric_names: true
```

</details> <details> <summary>Normalize URL paths via relabeling</summary>

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
  - source_labels: [path]
    regex: "/api/v1/orders/[^/]+"
    target_label: path
    replacement: "/api/v1/orders/:id"
```

</details>

3. Histogram Optimization


Potential savings: ~X series (Y%)

| Metric | Bucket Count | Recommendation |
|---|---|---|
| ... | 30 | Reduce to standard 11 buckets |

4. Series Churn


| Observation | Value |
|---|---|
| Yesterday / today ratio | X:1 |
| Primary driver | Pod restarts / short-lived jobs |

Summary


| Category | Est. Series Saved | % of Total | Effort |
|---|---|---|---|
| Drop unused metrics | X | Y% | Low — relabeling only |
| Drop bad labels | X | Y% | Low — labeldrop |
| Stream aggregation | X | Y% | Medium — new config |
| Histogram reduction | X | Y% | Low — bucket filtering |
| Total | X | Y% | |

Implementation Priority


  1. [Low effort] Drop unused metrics — pure relabeling, no data loss risk
  2. [Low effort] Drop labels that should never be in metrics (IDs, messages, SQL)
  3. [Medium effort] Stream aggregation for high-cardinality HTTP/app metrics
  4. [Medium effort] Histogram bucket reduction

Adapt the template to actual findings — omit sections with no findings, expand sections
with significant findings.


---

Remediation Reference


Relabeling (metric_relabel_configs)


Applied at scrape time or remote write. Changes affect new data immediately.

Drop entire metrics:

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric_to_drop|another_metric"
    action: drop
```

Drop labels:

```yaml
metric_relabel_configs:
  - regex: "label_to_drop|another_label"
    action: labeldrop
```

Normalize label values (reduce unique values):

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
```

Stream Aggregation


Applied at vmagent level. Aggregates in-flight before writing to storage. Docs: https://docs.victoriametrics.com/victoriametrics/stream-aggregation/

Remove labels while preserving metric semantics:

```yaml
- match: '{__name__=~"http_.*"}'
  interval: 1m
  without: [instance, pod]
  outputs: [total]
```

Aggregate counters (drop high-cardinality dimension):

```yaml
- match: 'http_requests_total'
  interval: 30s
  without: [path, user_id]
  outputs: [total]
```

Aggregate histograms:

```yaml
- match: '{__name__=~".*_bucket"}'
  interval: 1m
  without: [pod, instance]
  outputs: [quantiles(0.5, 0.9, 0.99)]
  keep_metric_names: true
```

Common output functions:

| Function | Use for | Example |
|---|---|---|
| total | Counters (running sum) | request counts |
| sum_samples | Gauge sums | memory usage across pods |
| count_samples | Sample counts | number of reporting instances |
| last | Latest gauge value | current temperature |
| min, max | Extremes | peak latency |
| avg | Averages | mean CPU usage |
| quantiles(0.5, 0.9, 0.99) | Distribution estimation | latency percentiles |
| histogram_bucket | Re-bucket histograms | reduce bucket granularity |

Important: use total for counters, last/avg/sum_samples for gauges. Using total on gauges produces nonsensical running sums.
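For example, a gauge aggregated across pods would use a gauge-appropriate output. A sketch — the metric name here is illustrative:

```yaml
# Sum a memory gauge across pods, keeping the original metric name
- match: 'process_resident_memory_bytes'
  interval: 1m
  without: [pod, instance]
  outputs: [sum_samples]
  keep_metric_names: true
```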

Where to Apply in Kubernetes


| Method | CRD / Config | Scope |
|---|---|---|
| metric_relabel_configs | VMServiceScrape / VMPodScrape .spec.metricRelabelConfigs | Per scrape target |
| Global relabeling | VMAgent -remoteWrite.relabelConfig | All metrics |
| Stream aggregation | VMAgent -remoteWrite.streamAggr.config | All remote-written metrics |
| Per-remote-write SA | VMAgent .spec.remoteWrite[].streamAggrConfig | Per destination |
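As an illustration of the per-scrape-target method, a labeldrop via the operator CRD might look like this — a sketch; the resource name, selector, and label list are hypothetical:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      metricRelabelConfigs:
        - regex: "user_id|request_id|session_id"
          action: labeldrop
```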

Common Mistakes


| Mistake | Fix |
|---|---|
| Dropping a metric used by alerts | Always cross-check /api/v1/rules before dropping |
| drop_input: true without testing | Verify aggregation output matches expectations first |
| Stream aggregating gauges with total | Use last, avg, or sum_samples for gauges |
| Forgetting keep_metric_names: true | Without it, output gets a long auto-generated suffix |
| Dropping the le label entirely from histograms | Only drop specific le values, never the label itself |
| Not considering recording rule dependencies | Check both alerting AND recording rules |
| Applying relabeling without testing | Use the -dryRun flag or test on a single scrape target first |