victoriametrics-cardinality-analysis

VictoriaMetrics Cardinality Analysis


Systematic cardinality analysis for VictoriaMetrics. Collects TSDB status, metric usage stats, and label value patterns, then produces a structured report with specific relabeling and stream aggregation configs the user can apply directly.
The goal is to find the highest-impact optimization opportunities — metrics nobody queries, labels that explode cardinality for no monitoring value, and patterns that indicate data hygiene problems (error messages as labels, SQL text as labels, UUIDs as labels).

Environment


Uses the same env vars as the victoriametrics-query skill:
  • $VM_METRICS_URL - base URL
    • cluster: export VM_METRICS_URL="https://vmselect.example.com/select/0/prometheus"
    • single: export VM_METRICS_URL="http://localhost:8428"
  • $VM_AUTH_HEADER - auth header (empty if no auth is required)

Workflow


Phase 1: Data Collection


Spawn 3 subagents in a single response to collect data in parallel. Each subagent prompt must include the curl auth pattern and environment variable references above.
If the user specified a scope (job, namespace, metric prefix), pass it as a match[] parameter to TSDB status queries and as series selectors to label queries.


Subagent 1: TSDB Overview


Agent name: cardinality-tsdb | Description: "Collect TSDB cardinality stats"

Query 1 — Yesterday's series (captures recently churned series):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50&date=$(date -d 'yesterday' +%Y-%m-%d)" | jq '.data'
```

Queries yesterday's stats — broader than today (includes series that may have already churned) without scanning the entire TSDB.

Query 2 — Today's active series:

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | jq '.data'
```

Query 3 — Focus on known high-cardinality labels:

```bash
for label in pod instance container path url user_id request_id session_id trace_id le name; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, focus: .data.seriesCountByFocusLabelValue}'
done
```

Return: All raw JSON, preserving structure. Include totalSeries, totalLabelValuePairs, seriesCountByMetricName, seriesCountByLabelName, and seriesCountByLabelValuePair from each query.


Subagent 2: Metric Usage Stats


Agent name: cardinality-usage | Description: "Find unused and rarely-queried metrics"

Query 1 — Never-queried metrics:

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=0&limit=500" | jq '.'
```

Query 2 — Rarely-queried metrics (≤5 total queries):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=5&limit=500" | jq '.'
```

Query 3 — Stats overview (tracking period):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=1" | \
  jq '{statsCollectedSince: .statsCollectedSince, statsCollectedRecordsTotal: .statsCollectedRecordsTotal}'
```

If the endpoint returns an error, storage.trackMetricNamesStats may not be enabled on vmstorage. Note this in the return and proceed — the analysis can still work with TSDB status data alone.

Query 4 — Cross-check: are "unused" metrics referenced in alerting rules?

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/rules" | jq '[.data.groups[].rules[].query]'
```

Extract metric names from rule queries. Any "unused" metric that appears in an alert/recording rule is NOT safe to drop — it's queried indirectly.

Return: Unused metrics with cross-reference against alert rules. Flag each as:
  • safe to drop: never queried AND not in any rule
  • used by rules only: never queried by dashboards but referenced in rules — verify intent
  • rarely used: low query count, may be accessed infrequently (e.g., monthly reports)
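The Query 4 cross-check can be sketched in plain shell. This is a rough illustration on canned rule queries — the metric names are hypothetical, and the token regex deliberately over-matches (it also grabs function names like `rate`), which is safe for a "do not drop" check. In the real workflow the queries come from the jq extraction above:

```shell
# Canned rule queries (illustrative; in practice, the output of the jq command above)
queries='rate(http_requests_total[5m]) > 100
sum(app_errors_total) by (service)'

# Extract candidate metric-name tokens from the rule queries
rule_tokens=$(printf '%s\n' "$queries" | grep -oE '[a-zA-Z_][a-zA-Z0-9_:]*' | sort -u)

# Flag each "unused" metric: referenced by a rule, or a genuine drop candidate
for m in http_requests_total legacy_unused_metric; do
  if printf '%s\n' "$rule_tokens" | grep -qx "$m"; then
    echo "used-by-rules: $m"
  else
    echo "safe-to-drop-candidate: $m"
  fi
done
```

False positives from over-matching only make the check more conservative — a metric is never wrongly marked safe to drop.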


Subagent 3: Label Pattern Inspection


Agent name: cardinality-labels | Description: "Inspect label values for problematic patterns"

All data comes from the TSDB status endpoint — do NOT use /api/v1/labels or /api/v1/label/.../values.

Query 1 — Label cardinality overview (unique value counts + series counts):

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '{labelValueCountByLabelName: .data.labelValueCountByLabelName, seriesCountByLabelName: .data.seriesCountByLabelName}'
```

labelValueCountByLabelName returns labels sorted by unique value count (replaces per-label /values counting). seriesCountByLabelName shows how many series each label appears in.

Query 2 — Sample values for high-cardinality labels via focusLabel. For each label with >100 unique values from Query 1, fetch sample values:

```bash
for label in <top labels from Query 1>; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, topValues: .data.seriesCountByFocusLabelValue}'
done
```

seriesCountByFocusLabelValue returns label values sorted by series count — use the value names to detect problematic patterns.

Query 3 — High-cardinality label-value pairs:

```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '.data.seriesCountByLabelValuePair'
```

Shows which specific label=value pairs contribute the most series.

Pattern detection — classify label values from focusLabel samples:

| Pattern | Regex hint | Indicates |
|---|---|---|
| UUIDs | `[0-9a-f]{8}-[0-9a-f]{4}-` | Request/session/trace IDs as labels |
| IP addresses | `\d+\.\d+\.\d+\.\d+` | Per-client or per-pod IP tracking |
| Long strings (>50 chars) | length check | Error messages, SQL, stack traces |
| SQL keywords | `SELECT\|INSERT\|UPDATE\|DELETE\|FROM\|WHERE` | Query text stored as label |
| URL paths with IDs | `/api/.*/[0-9a-f]+` | Unsanitized HTTP paths |
| Timestamps | epoch or ISO8601 | Time values as labels (unbounded) |
| Stack traces | `at .*\.(java\|go\|py):` | Error details as labels |

Return: Table of labels sorted by unique value count, with detected pattern, sample values from focusLabel, and series impact.
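The pattern checks above can be sketched as a small shell classifier. The sample values and the first-match-wins ordering are illustrative, not from the source:

```shell
# Canned sample label values (illustrative)
values='9f1b2c3d-4e5f-6789-abcd-ef0123456789
10.42.7.13
SELECT id FROM orders WHERE id = 7
/api/v1/users/8d3f9a2c'

# Classify each value with the regex hints from the table (first match wins)
printf '%s\n' "$values" | while IFS= read -r v; do
  if printf '%s' "$v" | grep -qE '[0-9a-f]{8}-[0-9a-f]{4}-'; then echo "uuid: $v"
  elif printf '%s' "$v" | grep -qE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'; then echo "ip: $v"
  elif printf '%s' "$v" | grep -qE 'SELECT|INSERT|UPDATE|DELETE|FROM|WHERE'; then echo "sql: $v"
  elif printf '%s' "$v" | grep -qE '^/api/.*/[0-9a-f]+$'; then echo "url-with-id: $v"
  else echo "ok: $v"
  fi
done
```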


Phase 2: Analysis


After all subagents return, compile and classify findings. This is the analytical core — apply judgment, not mechanical filtering.

Category 1: Unused Metrics (Quick Wins)


Cross-reference metric usage stats with TSDB series counts:
  • Drop candidates: queryRequestsCount=0, not in any alert/recording rule, >100 series
  • Verify candidates: queryRequestsCount=0 but referenced in rules — check if rule is still needed
  • Low-priority: queryRequestsCount≤5 with few series — not worth the config churn

Sort by series count descending — the biggest unused metrics are the biggest wins.
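A minimal sketch of this triage logic on canned stats — metric names, counts, and the 100-series threshold below are illustrative:

```shell
# Canned usage stats: name, query count, series count, referenced-in-rules
stats='old_metric_foo 0 5000 no
old_metric_bar 0 2000 yes
tiny_metric 3 12 no'

# Apply the triage rules: drop-candidate / verify / low-priority
printf '%s\n' "$stats" | while read -r name q series rules; do
  if [ "$q" -eq 0 ] && [ "$rules" = "no" ] && [ "$series" -gt 100 ]; then
    echo "drop-candidate: $name ($series series)"
  elif [ "$q" -eq 0 ] && [ "$rules" = "yes" ]; then
    echo "verify: $name"
  else
    echo "low-priority: $name"
  fi
done
```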

Category 2: High-Cardinality Labels


Labels with excessive unique values that drive series explosion:

| Label pattern | Assessment | Typical remedy |
|---|---|---|
| user_id, customer_id, account_id | Should NEVER be metric labels — belongs in logs/traces | Drop label |
| request_id, session_id, trace_id, span_id | Correlation IDs — never metric labels | Drop label |
| error, error_message, reason, status_message | Unbounded strings | Drop label or replace with error code |
| sql, query, command, statement | Query text in labels — unbounded | Drop label |
| path, url, uri, endpoint | Unbounded if not sanitized | Relabel to normalize, or stream aggregate without |
| pod, container | Normal for k8s but high churn | Stream aggregate without, if per-pod not needed |
| instance | Normal for node metrics, wasteful for app metrics | Stream aggregate without for app-level metrics |
| le (histogram buckets) | Fine-grained buckets multiply every label combination | Reduce bucket count |

For each finding, estimate impact: (series with this label) - (series without) ≈ series saved.
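A quick back-of-the-envelope version of that estimate, with made-up numbers:

```shell
# Hypothetical metric: 3 methods x 4 status codes x 50 pods, plus a `path` label
# with 200 unique values. Dropping `path` collapses that whole dimension.
other_combinations=$((3 * 4 * 50))        # series without `path`: 600
with_path=$((other_combinations * 200))   # series with `path`: 120000
echo "series saved: $((with_path - other_combinations))"
```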

Category 3: Histogram Bloat

类别3:直方图膨胀

Check metrics ending in _bucket:
  • How many unique le values?
  • Each additional bucket multiplies series by (number of label combinations)
  • Look for histograms where most buckets are empty or redundant
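One way to sketch the per-histogram le count, here on a canned series dump — in practice this would be fed from the TSDB status focusLabel=le output or a series listing, and the metric names below are illustrative:

```shell
# Canned series dump: metric name + le value, one pair per line (illustrative)
series='http_request_duration_seconds_bucket le=0.1
http_request_duration_seconds_bucket le=0.5
http_request_duration_seconds_bucket le=1
http_request_duration_seconds_bucket le=+Inf
rpc_latency_bucket le=0.1
rpc_latency_bucket le=+Inf'

# Count unique le values per _bucket metric
printf '%s\n' "$series" | sort -u | awk '{count[$1]++} END {for (m in count) print m, count[m]}' | sort
```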

Category 4: Series Churn

类别4:序列变动

Compare yesterday's stats vs today's:
  • Ratio >3:1 suggests significant churn from pod restarts, deployments, short-lived jobs
  • Not directly fixable via relabeling, but indicates opportunity for dedup_interval or -search.maxStalenessInterval tuning
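If churn points at duplicated short-lived series, stream aggregation rules can deduplicate samples before aggregating. A minimal sketch, assuming vmagent stream aggregation — the matcher and intervals are illustrative:

```yaml
- match: '{__name__=~"http_.*"}'
  interval: 1m
  dedup_interval: 30s   # deduplicate samples before aggregating
  without: [pod]
  outputs: [total]
```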


Phase 3: Report

阶段3:报告

Compile into a structured report. Every finding must include an impact estimate and a specific remedy config.
Use this template:

VictoriaMetrics Cardinality Report — <date>


Overview


| Metric | Value |
|---|---|
| Total active series (today) | X |
| Total series (yesterday) | X |
| Churn ratio (yesterday / today) | X:1 |
| Unique metric names | X |
| Stats tracking since | <date> |

1. Unused Metrics


Potential savings: ~X series (Y% of total)

| Metric | Series | Last Queried | In Alert Rules | Action |
|---|---|---|---|---|
| ... | ... | never | no | Drop |
| ... | ... | never | yes — verify | Check rule |

<details> <summary>Relabeling config to drop unused metrics</summary>

```yaml
# Add to vmagent metric_relabel_configs (VMServiceScrape or global)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric1|metric2|metric3"
    action: drop
```

</details>

2. High-Cardinality Labels


Potential savings: ~X series (Y%)

| Label | Unique Values | Top Affected Metrics | Pattern | Action |
|---|---|---|---|---|
| user_id | 50,000 | http_requests_total | UUID | Drop |
| path | 10,000 | http_request_duration | URL paths | Aggregate |
| error_message | 5,000 | app_errors_total | Long strings | Drop |

<details> <summary>Drop labels that should never be in metrics</summary>

```yaml
metric_relabel_configs:
  - regex: "user_id|request_id|session_id|trace_id|error_message|sql_query"
    action: labeldrop
```

</details> <details> <summary>Stream aggregation for high-cardinality HTTP labels</summary>

```yaml
# vmagent stream aggregation config
- match: '{__name__=~"http_request.*"}'
  interval: 1m
  without: [path, instance, pod]
  outputs: [total]
  drop_input: true # enable after verifying aggregated output
- match: '{__name__=~"http_request_duration.*_bucket"}'
  interval: 1m
  without: [pod, instance]
  outputs: [total]
  keep_metric_names: true
```

</details> <details> <summary>Normalize URL paths via relabeling</summary>

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
  - source_labels: [path]
    regex: "/api/v1/orders/[^/]+"
    target_label: path
    replacement: "/api/v1/orders/:id"
```

</details>

3. Histogram Optimization


Potential savings: ~X series (Y%)

| Metric | Bucket Count | Recommendation |
|---|---|---|
| ... | 30 | Reduce to standard 11 buckets |

4. Series Churn


| Observation | Value |
|---|---|
| Yesterday / today ratio | X:1 |
| Primary driver | Pod restarts / short-lived jobs |

Summary


| Category | Est. Series Saved | % of Total | Effort |
|---|---|---|---|
| Drop unused metrics | X | Y% | Low — relabeling only |
| Drop bad labels | X | Y% | Low — labeldrop |
| Stream aggregation | X | Y% | Medium — new config |
| Histogram reduction | X | Y% | Low — bucket filtering |
| Total | X | Y% | |

Implementation Priority


  1. [Low effort] Drop unused metrics — pure relabeling, no data loss risk
  2. [Low effort] Drop labels that should never be in metrics (IDs, messages, SQL)
  3. [Medium effort] Stream aggregation for high-cardinality HTTP/app metrics
  4. [Medium effort] Histogram bucket reduction

Adapt the template to actual findings — omit sections with no findings, expand sections
with significant findings.


---

Remediation Reference


Relabeling (metric_relabel_configs)


Applied at scrape time or remote write. Changes affect new data immediately.

Drop entire metrics:

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric_to_drop|another_metric"
    action: drop
```

Drop labels:

```yaml
metric_relabel_configs:
  - regex: "label_to_drop|another_label"
    action: labeldrop
```

Normalize label values (reduce unique values):

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
```

Stream Aggregation


Applied at vmagent level. Aggregates in-flight before writing to storage. Docs: https://docs.victoriametrics.com/victoriametrics/stream-aggregation/

Remove labels while preserving metric semantics:

```yaml
- match: '{__name__=~"http_.*"}'
  interval: 1m
  without: [instance, pod]
  outputs: [total]
```

Aggregate counters (drop high-cardinality dimension):

```yaml
- match: 'http_requests_total'
  interval: 30s
  without: [path, user_id]
  outputs: [total]
```

Aggregate histograms:

```yaml
- match: '{__name__=~".*_bucket"}'
  interval: 1m
  without: [pod, instance]
  outputs: [quantiles(0.5, 0.9, 0.99)]
  keep_metric_names: true
```

Common output functions:

| Function | Use for | Example |
|---|---|---|
| total | Counters (running sum) | request counts |
| sum_samples | Gauge sums | memory usage across pods |
| count_samples | Sample counts | number of reporting instances |
| last | Latest gauge value | current temperature |
| min, max | Extremes | peak latency |
| avg | Averages | mean CPU usage |
| quantiles(0.5, 0.9, 0.99) | Distribution estimation | latency percentiles |
| histogram_bucket | Re-bucket histograms | reduce bucket granularity |

Important: use total for counters, last/avg/sum_samples for gauges. Using total on gauges produces nonsensical running sums.
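For example, a gauge aggregated across pods would use a gauge-appropriate output. A sketch — the metric name here is illustrative:

```yaml
# Sum a memory gauge across pods, keeping the original metric name
- match: 'process_resident_memory_bytes'
  interval: 1m
  without: [pod, instance]
  outputs: [sum_samples]
  keep_metric_names: true
```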

Where to Apply in Kubernetes


| Method | CRD / Config | Scope |
|---|---|---|
| metric_relabel_configs | VMServiceScrape / VMPodScrape .spec.metricRelabelConfigs | Per scrape target |
| Global relabeling | VMAgent -remoteWrite.relabelConfig | All metrics |
| Stream aggregation | VMAgent -remoteWrite.streamAggr.config | All remote-written metrics |
| Per-remote-write SA | VMAgent .spec.remoteWrite[].streamAggrConfig | Per destination |
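As an illustration of the per-scrape-target method, a labeldrop via the operator CRD might look like this — a sketch; the resource name, selector, and label list are hypothetical:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http
      metricRelabelConfigs:
        - regex: "user_id|request_id|session_id"
          action: labeldrop
```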

Common Mistakes


| Mistake | Fix |
|---|---|
| Dropping a metric used by alerts | Always cross-check /api/v1/rules before dropping |
| drop_input: true without testing | Verify aggregation output matches expectations first |
| Stream aggregating gauges with total | Use last, avg, or sum_samples for gauges |
| Forgetting keep_metric_names: true | Without it, output gets a long auto-generated suffix |
| Dropping the le label entirely from histograms | Only drop specific le values, never the label itself |
| Not considering recording rule dependencies | Check both alerting AND recording rules |
| Applying relabeling without testing | Use the -dryRun flag or test on a single scrape target first |