victoriametrics-cardinality-analysis
VictoriaMetrics Cardinality Analysis
Systematic cardinality analysis for VictoriaMetrics. Collects TSDB status, metric usage stats, and label
value patterns, then produces a structured report with specific relabeling and stream aggregation configs
the user can apply directly.
The goal is to find the highest-impact optimization opportunities — metrics nobody queries, labels that
explode cardinality for no monitoring value, and patterns that indicate data hygiene problems (error
messages as labels, SQL text as labels, UUIDs as labels).
Environment
Uses the same env vars as the `victoriametrics-query` skill:
```bash
# $VM_METRICS_URL - base URL
#   cluster: export VM_METRICS_URL="https://vmselect.example.com/select/0/prometheus"
#   single:  export VM_METRICS_URL="http://localhost:8428"
# $VM_AUTH_HEADER - auth header (empty if no auth is required)
```
Workflow
Phase 1: Data Collection
Spawn 3 subagents in a single response to collect data in parallel. Each subagent prompt must
include the curl auth pattern and environment variable references above.
If the user specified a scope (job, namespace, metric prefix), pass it as a `match[]` parameter to
TSDB status queries and as series selectors to label queries.
Subagent 1: TSDB Overview
Agent name: `cardinality-tsdb` | Description: "Collect TSDB cardinality stats"
Query 1: Yesterday's series (captures recently churned series):
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50&date=$(date -d 'yesterday' +%Y-%m-%d)" | jq '.data'
```
Queries yesterday's stats, which is broader than today (it includes series that may have already churned) without scanning the entire TSDB.
Query 2: Today's active series:
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | jq '.data'
```
Query 3: Focus on known high-cardinality labels:
```bash
for label in pod instance container path url user_id request_id session_id trace_id le name; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, focus: .data.seriesCountByFocusLabelValue}'
done
```
Return: All raw JSON preserving structure. Include `totalSeries`, `totalLabelValuePairs`,
`seriesCountByMetricName`, `seriesCountByLabelName`, `seriesCountByLabelValuePair` from each query.
Subagent 2: Metric Usage Stats
Agent name: `cardinality-usage` | Description: "Find unused and rarely-queried metrics"
Query 1: Never-queried metrics:
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=0&limit=500" | jq '.'
```
Query 2: Rarely-queried metrics (≤5 total queries):
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=5&limit=500" | jq '.'
```
Query 3: Stats overview (tracking period):
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=1" | \
  jq '{statsCollectedSince: .statsCollectedSince, statsCollectedRecordsTotal: .statsCollectedRecordsTotal}'
```
If the endpoint returns an error, `storage.trackMetricNamesStats` may not be enabled on vmstorage.
Note this in the return and proceed: the analysis can still work with TSDB status data alone.
Query 4: Cross-check: are "unused" metrics referenced in alerting rules?
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/rules" | jq '[.data.groups[].rules[].query]'
```
Extract metric names from rule queries. Any "unused" metric that appears in an alert/recording rule
is NOT safe to drop, because it is queried indirectly.
Return: Unused metrics with cross-reference against alert rules. Flag each as:
- safe to drop: never queried AND not in any rule
- used by rules only: never queried by dashboards but referenced in rules; verify intent
- rarely used: low query count, may be accessed infrequently (e.g., monthly reports)
Subagent 3: Label Pattern Inspection
Agent name: `cardinality-labels` | Description: "Inspect label values for problematic patterns"
All data comes from the TSDB status endpoint; do NOT use `/api/v1/labels` or `/api/v1/label/.../values`.
Query 1: Label cardinality overview (unique value counts + series counts):
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '{labelValueCountByLabelName: .data.labelValueCountByLabelName, seriesCountByLabelName: .data.seriesCountByLabelName}'
```
`labelValueCountByLabelName` reports the number of unique values per label, so no `/values` lookups are needed; `seriesCountByLabelName` reports the number of series carrying each label.
Query 2: Sample values for high-cardinality labels via focusLabel:
For each label with >100 unique values from Query 1, fetch sample values:
```bash
# TOP_LABELS: space-separated label names with >100 unique values, taken from Query 1
for label in $TOP_LABELS; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, topValues: .data.seriesCountByFocusLabelValue}'
done
```
`seriesCountByFocusLabelValue` lists the top values of the focus label with their series counts.
Query 3: High-cardinality label-value pairs:
```bash
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '.data.seriesCountByLabelValuePair'
```
Shows which specific `label=value` pairs contribute the most series.
Pattern detection: classify label values from focusLabel samples:
| Pattern | Regex hint | Indicates |
|---|---|---|
| UUIDs | | Request/session/trace IDs as labels |
| IP addresses | | Per-client or per-pod IP tracking |
| Long strings (>50 chars) | length check | Error messages, SQL, stack traces |
| SQL keywords | | Query text stored as label |
| URL paths with IDs | | Unsanitized HTTP paths |
| Timestamps | epoch or ISO8601 | Time values as labels (unbounded) |
| Stack traces | | Error details as labels |
Return: Table of labels sorted by unique value count, with detected pattern, sample values from focusLabel, and series impact.
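The pattern table above can be turned into a rough classifier for the sampled values. The following is a minimal sketch, not the skill's own detection logic: the function name `classify_label_value`, the class names, and every regex are assumptions chosen for illustration, and the checks run in order with the first match winning. Stack traces are not covered because they usually span several lines.

```shell
# classify_label_value VALUE
# Prints one coarse class for a single label value; first matching check wins.
# All regexes are illustrative assumptions, not the original regex hints.
classify_label_value() {
  v=$1
  if printf '%s\n' "$v" | grep -Eqi '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'; then
    echo uuid
  elif printf '%s\n' "$v" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]+)?$'; then
    echo ip_address
  elif printf '%s\n' "$v" | grep -Eq '^[0-9]{10}([0-9]{3})?$|^[0-9]{4}-[0-9]{2}-[0-9]{2}T'; then
    echo timestamp
  elif printf '%s\n' "$v" | grep -Eqi '(^|[^a-z])(select|insert|update|delete)[^a-z]'; then
    echo sql
  elif printf '%s\n' "$v" | grep -Eq '/[0-9]+(/|$)'; then
    echo url_path_with_id
  elif [ "${#v}" -gt 50 ]; then
    echo long_string
  else
    echo unclassified
  fi
}
```

For example, `classify_label_value "10.0.3.17:8080"` prints `ip_address`. A class only marks a value as suspicious; the judgment about whether the label is worth keeping still belongs to Phase 2.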
Phase 2: Analysis
After all subagents return, compile and classify findings. This is the analytical core — apply
judgment, not mechanical filtering.
Category 1: Unused Metrics (Quick Wins)
Cross-reference metric usage stats with TSDB series counts:
- Drop candidates: `queryRequestsCount=0`, not in any alert/recording rule, >100 series
- Verify candidates: `queryRequestsCount=0` but referenced in rules; check whether the rule is still needed
- Low-priority: `queryRequestsCount≤5` with few series; not worth the config churn
Sort by series count descending: the biggest unused metrics are the biggest wins.
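The rule cross-check can be sketched as a small shell function. This is an illustration under stated assumptions: `classify_unused` is a hypothetical helper, and its two inputs are hypothetical plain-text files, one with never-queried metric names (one per line) and one with rule expressions (one per line), both extracted beforehand from the subagent output. Matching uses whole-word fixed-string search, so `http_requests` does not match `http_requests_total`.

```shell
# classify_unused UNUSED_FILE RULES_FILE
# For every metric name in UNUSED_FILE, print "<metric><TAB><verdict>".
# A metric that appears as a whole word in any rule expression is not safe to drop.
classify_unused() (
  unused_file=$1
  rules_file=$2
  while IFS= read -r metric; do
    [ -n "$metric" ] || continue          # skip blank lines
    if grep -Fqw -- "$metric" "$rules_file"; then
      printf '%s\tused by rules only\n' "$metric"
    else
      printf '%s\tsafe to drop\n' "$metric"
    fi
  done < "$unused_file"
)
```

Whole-word matching is a heuristic: it can still produce false positives when a metric name appears inside a label value or annotation, so "used by rules only" entries deserve a manual look.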
Category 2: High-Cardinality Labels
Labels with excessive unique values that drive series explosion:
| Label pattern | Assessment | Typical remedy |
|---|---|---|
| `user_id` | Should NEVER be a metric label; belongs in logs/traces | Drop label |
| `request_id`, `session_id`, `trace_id` | Correlation IDs; never metric labels | Drop label |
| `error_message` | Unbounded strings | Drop label or replace with error code |
| `sql_query` | Query text in labels; unbounded | Drop label |
| `path`, `url` | Unbounded if not sanitized | Relabel to normalize, or stream aggregate `without` the label |
| `pod` | Normal for k8s but high churn | Stream aggregate `without` the label, if per-pod detail is not needed |
| `instance` | Normal for node metrics, wasteful for app metrics | Stream aggregate `without` the label for app-level metrics |
| `le` | Fine-grained buckets multiply every label combination | Reduce bucket count |
For each finding, estimate impact: `(series with this label) - (series without) ≈ series saved`.
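As a worked instance of the impact estimate, the sketch below computes the saved series and their share of the total. The helper name `estimate_savings` and the numbers in the usage note are made up for illustration; only the subtraction itself comes from the formula above.

```shell
# estimate_savings WITH WITHOUT TOTAL
#   WITH    = series count while the label is present
#   WITHOUT = series count once the label is dropped or aggregated away
#   TOTAL   = total active series
# Prints "<series saved> <percent of total>".
estimate_savings() {
  [ "$3" -gt 0 ] || { echo "total must be positive" >&2; return 1; }
  awk -v w="$1" -v wo="$2" -v t="$3" 'BEGIN {
    saved = w - wo
    printf "%d %.1f\n", saved, saved * 100 / t
  }'
}
```

With 50000 series carrying the label, 1200 remaining after removal, and 400000 series in total, `estimate_savings 50000 1200 400000` prints `48800 12.2`.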
Category 3: Histogram Bloat
Check metrics ending in `_bucket`:
- How many unique `le` values?
- Each additional bucket adds one series for every label combination, so the `_bucket` series count is (number of buckets) × (number of label combinations)
- Look for histograms where most buckets are empty or redundant
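The bucket arithmetic can be sketched in two one-line helpers. Both function names and the example numbers are hypothetical; the only assumption about the data is that every label combination exposes the same set of `le` buckets.

```shell
# bucket_series BUCKETS COMBOS
# Number of _bucket series for one histogram: one series per bucket per label combination.
bucket_series() { echo $(( $1 * $2 )); }

# bucket_savings OLD_BUCKETS NEW_BUCKETS COMBOS
# Series removed when the bucket count is reduced from OLD_BUCKETS to NEW_BUCKETS.
bucket_savings() { echo $(( ($1 - $2) * $3 )); }
```

A histogram with 30 buckets and 500 label combinations has `bucket_series 30 500` = 15000 bucket series; trimming it to 11 buckets saves `bucket_savings 30 11 500` = 9500 of them.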
Category 4: Series Churn
Compare yesterday's stats vs today:
- Ratio >3:1 suggests significant churn from pod restarts, deployments, short-lived jobs
- Not directly fixable via relabeling, but indicates an opportunity for `dedup_interval` or `-search.maxStalenessInterval` tuning
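The churn comparison reduces to one division. The sketch below assumes the two inputs are the `totalSeries` values returned by the yesterday and today TSDB status queries; the helper name `churn_ratio` and the `high`/`normal` labels are illustrative, with the threshold taken from the >3:1 rule above.

```shell
# churn_ratio YESTERDAY_SERIES TODAY_SERIES
# Prints "<ratio> high" when the ratio exceeds 3:1, otherwise "<ratio> normal".
churn_ratio() {
  [ "$2" -gt 0 ] || { echo "today's series count must be positive" >&2; return 1; }
  awk -v y="$1" -v t="$2" 'BEGIN {
    r = y / t
    printf "%.1f %s\n", r, (r > 3 ? "high" : "normal")
  }'
}
```

For example, `churn_ratio 900000 250000` prints `3.6 high`, while `churn_ratio 300000 250000` prints `1.2 normal`.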
Phase 3: Report
Compile into a structured report. Every finding must include impact estimate and specific remedy config.
Use this template:
VictoriaMetrics Cardinality Report - <date>
Overview
| Metric | Value |
|---|---|
| Total active series (today) | X |
| Total series (yesterday) | X |
| Churn ratio (yesterday / today) | X:1 |
| Unique metric names | X |
| Stats tracking since | <date> |
1. Unused Metrics
Potential savings: ~X series (Y% of total)
| Metric | Series | Last Queried | In Alert Rules | Action |
|---|---|---|---|---|
| ... | ... | never | no | Drop |
| ... | ... | never | yes — verify | Check rule |
```yaml
# Add to vmagent metric_relabel_configs (VMServiceScrape or global)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric1|metric2|metric3"
    action: drop
```
2. High-Cardinality Labels
Potential savings: ~X series (Y%)
| Label | Unique Values | Top Affected Metrics | Pattern | Action |
|---|---|---|---|---|
| user_id | 50,000 | http_requests_total | UUID | Drop |
| path | 10,000 | http_request_duration | URL paths | Aggregate |
| error_message | 5,000 | app_errors_total | Long strings | Drop |
```yaml
metric_relabel_configs:
  - regex: "user_id|request_id|session_id|trace_id|error_message|sql_query"
    action: labeldrop
```
```yaml
# vmagent stream aggregation config
- match: '{__name__=~"http_request.*"}'
  interval: 1m
  without: [path, instance, pod]
  outputs: [total]
  drop_input: true  # enable after verifying aggregated output
- match: '{__name__=~"http_request_duration.*_bucket"}'
  interval: 1m
  without: [pod, instance]
  outputs: [total]
  keep_metric_names: true
```
```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
  - source_labels: [path]
    regex: "/api/v1/orders/[^/]+"
    target_label: path
    replacement: "/api/v1/orders/:id"
```
3. Histogram Optimization
Potential savings: ~X series (Y%)
| Metric | Bucket Count | Recommendation |
|---|---|---|
| ... | 30 | Reduce to standard 11 buckets |
4. Series Churn
| Observation | Value |
|---|---|
| Yesterday / today ratio | X:1 |
| Primary driver | Pod restarts / short-lived jobs |
Summary
| Category | Est. Series Saved | % of Total | Effort |
|---|---|---|---|
| Drop unused metrics | X | Y% | Low — relabeling only |
| Drop bad labels | X | Y% | Low — labeldrop |
| Stream aggregation | X | Y% | Medium — new config |
| Histogram reduction | X | Y% | Low — bucket filtering |
| Total | X | Y% | |
Implementation Priority
- [Low effort] Drop unused metrics — pure relabeling, no data loss risk
- [Low effort] Drop labels that should never be in metrics (IDs, messages, SQL)
- [Medium effort] Stream aggregation for high-cardinality HTTP/app metrics
- [Medium effort] Histogram bucket reduction
Adapt the template to actual findings — omit sections with no findings, expand sections
with significant findings.
Remediation Reference
Relabeling (metric_relabel_configs)
Applied at scrape time or remote write. Changes affect new data immediately.
Drop entire metrics:
```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric_to_drop|another_metric"
    action: drop
```
Drop labels:
```yaml
metric_relabel_configs:
  - regex: "label_to_drop|another_label"
    action: labeldrop
```
Normalize label values (reduce unique values):
```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
```
Stream Aggregation
Applied at vmagent level. Aggregates in-flight before writing to storage.
Docs: https://docs.victoriametrics.com/victoriametrics/stream-aggregation/
Remove labels while preserving metric semantics:
```yaml
- match: '{__name__=~"http_.*"}'
  interval: 1m
  without: [instance, pod]
  outputs: [total]
```
Aggregate counters (drop high-cardinality dimension):
```yaml
- match: 'http_requests_total'
  interval: 30s
  without: [path, user_id]
  outputs: [total]
```
Aggregate histograms:
```yaml
- match: '{__name__=~".*_bucket"}'
  interval: 1m
  without: [pod, instance]
  # _bucket series are cumulative counters, so sum them with total;
  # quantiles over raw counter values would not be meaningful
  outputs: [total]
  keep_metric_names: true
```
Common output functions:
| Function | Use for | Example |
|---|---|---|
| `total` | Counters (running sum) | request counts |
| `sum_samples` | Gauge sums | memory usage across pods |
| `count_samples` | Sample counts | number of reporting instances |
| `last` | Latest gauge value | current temperature |
| `min`, `max` | Extremes | peak latency |
| `avg` | Averages | mean CPU usage |
| `quantiles(...)` | Distribution estimation | latency percentiles |
| `histogram_bucket` | Re-bucket histograms | reduce bucket granularity |
Important: use `total` for counters, `last`/`avg`/`sum_samples` for gauges. Using `total` on
gauges produces nonsensical running sums.
Where to Apply in Kubernetes
| Method | CRD / Config | Scope |
|---|---|---|
| `metric_relabel_configs` | VMServiceScrape | Per scrape target |
| Global relabeling | VMAgent | All metrics |
| Stream aggregation | VMAgent | All remote-written metrics |
| Per-remote-write SA | VMAgent | Per destination |
Common Mistakes
| Mistake | Fix |
|---|---|
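| Dropping a metric used by alerts | Always cross-check `/api/v1/rules` before dropping |
| Enabling `drop_input: true` without testing | Verify aggregation output matches expectations first |
| Stream aggregating gauges with `total` | Use `last`, `avg`, or `sum_samples` |
| Forgetting `keep_metric_names: true` | Without it, output gets long auto-generated suffix |
| Dropping the `le` label entirely | Only drop specific `le` buckets |
| Not considering recording rule dependencies | Check both alerting AND recording rules |
| Applying relabeling without testing | Test the rules on sample series first, for example in vmagent's relabel debug UI |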