vm-trace-analyzer

VictoriaMetrics Query Trace Analyzer


You are analyzing a VictoriaMetrics query trace — a JSON tree that records every step of a PromQL query execution. Your goal is to read this tree, understand what happened, and produce a clear performance report with actionable recommendations.

Background

In Cluster mode, two components are involved in query processing:
  • vmselect — query frontend that accepts PromQL or MetricsQL queries, fetches data from vmstorage nodes, and applies calculations
  • vmstorage — stores time series data and serves it to vmselect over RPC
Single-node mode runs everything in one process. The trace structure is similar but without RPC wrappers.
You can tell which mode you're looking at from the root message of the trace:
  • Cluster traces contain `vmselect-<version>: /select/...`
  • Single-node traces contain `/victoria-metrics-<version>: /api/v1/...`

What is a query trace?

When you add `trace=1` to a VictoriaMetrics HTTP query, it returns a JSON tree describing every internal operation. Each node looks like this:

```json
{
  "duration_msec": 123.456,
  "message": "description of what happened",
  "children": [ ... ]
}
```

The tree is rooted at vmselect. It captures the full query execution pipeline: parsing, series search, data fetch from storage, rollup computation, aggregation, and response generation.
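This node shape is all you need to walk a trace programmatically. Here is a minimal sketch (not one of this skill's scripts) that flattens the tree into (depth, duration, message) rows; the sample trace content is illustrative:

```python
def flatten(node, depth=0):
    """Flatten a trace node of the shape shown above into (depth, duration_msec, message) rows."""
    rows = [(depth, node.get("duration_msec", 0.0), node.get("message", ""))]
    for child in node.get("children", []):
        rows.extend(flatten(child, depth + 1))
    return rows

# Hypothetical two-node trace with the documented shape:
trace = {
    "duration_msec": 123.456,
    "message": "root: /api/v1/query_range ...",
    "children": [
        {"duration_msec": 100.0, "message": "eval: query=...", "children": []},
    ],
}

for depth, dur, msg in flatten(trace):
    print(f"{'  ' * depth}{dur:>10.3f}ms  {msg}")
```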

How to analyze the trace

Step 0: Run the parse script

Before manually reading the trace file, run the analysis script to extract structured data:
```bash
python3 <skill_base_dir>/scripts/parse_trace.py <trace_file>
```
This outputs: root info, trace tree (depth 3), key nodes with durations, per-vmstorage RPC breakdown, and computed totals (bytes, samples, series). Use this output as your primary data source for the report.
Additional subcommands for deeper investigation:
  • `python3 <script> <trace> tree --depth N` — print the trace tree to depth N
  • `python3 <script> <trace> nodes --pattern "fetch unique"` — find all nodes matching a substring
Only drill deeper if the summary output reveals ambiguities or missing information.
After running the summary, also check for relevant performance improvements in newer VictoriaMetrics versions:
```bash
python3 <skill_base_dir>/scripts/check_changelog.py <version> <mode>
```
Where `<version>` is the semver from the parse script output (e.g., `v1.130.0`) and `<mode>` is `cluster` or `single-node`. This fetches changelogs from GitHub and shows performance-relevant fixes and features in versions newer than the one the trace was captured on. If the fetch fails, skip this section gracefully.

Step 1: Start at the root

Read the trace JSON file the user provides (or use the script output from Step 0). The root node tells you the big picture. Extract:
  • Endpoint: `/api/v1/query` (instant) or `/api/v1/query_range` (range)
  • Query: the PromQL expression after `query=`
  • Time parameters: `start=`, `end=`, `step=` (for range queries)
  • Result count: `series=` at the end
  • Total duration: the root `duration_msec`
  • Version: printed at the very start of the trace
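If this extraction needs to be scripted, the fields can be pulled from the root message with a few regexes. The root message below is hypothetical, modeled on the cluster pattern described in the Background section:

```python
import re

# Hypothetical root message, modeled on the documented cluster pattern.
root_msg = (
    "vmselect-v1.130.0: /select/0/prometheus/api/v1/query_range?"
    "query=rate(http_requests_total[5m])&start=1700000000&end=1700003600&step=60"
)

endpoint = re.search(r"/api/v1/query(?:_range)?", root_msg).group(0)
query = re.search(r"query=([^&\s]+)", root_msg).group(1)
version = re.search(r"v\d+\.\d+\.\d+", root_msg).group(0)
mode = "cluster" if root_msg.startswith("vmselect-") else "single-node"

print(endpoint, query, version, mode)
```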

Step 2: Identify the phases

Walk the top-level children and classify each into one of these phases, identifying them by the message patterns described below. Not every trace has all of them — just report what's there.
For large traces, focus on the top-level children first. Drill into subtrees only when they are relevant to the bottleneck or when durations are surprising.
A query trace typically has these phases, roughly in this order:
Expression evaluation — nodes matching:
`eval: query=..., timeRange=..., step=..., mayCache=...: series=N, points=N, pointsPerSeries=N`
These trace the recursive evaluation of the PromQL/MetricsQL expression tree. Each eval node may have children for sub-expressions. Key numbers:
  • series — number of time series produced by this sub-expression
  • points — total data points across all series
  • pointsPerSeries — data points per series
Functions and aggregations — nodes matching:
  • `transform <func>(): series=N` — PromQL functions (histogram_quantile, clamp, etc.)
  • `aggregate <func>(): series=N` — aggregation operators (sum, avg, max, etc.)
  • `binary op "<op>": series=N` — binary operations
Series search (index lookup) — where label matchers get resolved to internal series IDs:
  • In Cluster traces, wrapped in `rpc at vmstorage <addr>` and `rpc call search_v7()`; in Single-node traces it appears directly, without RPC wrappers
  • Key messages:
    • `init series search`
    • `search TSIDs`
    • `search N indexDBs in parallel` — parallel index database search
    • `search indexDB` — individual index partition
    • `found N metric ids for filter=...` — a metric ID is a unique time series identifier within vmstorage
    • `found N TSIDs for N metricIDs` — a TSID corresponds one-to-one to a metric ID
    • `sort N TSIDs`
  • Cache-related messages in this phase:
    • `search for metricIDs in tag filters cache` followed by `cache miss`, or a cache hit (no `cache miss` child)
    • `put N metricIDs in cache` / `stored N metricIDs into cache`
Data fetch — getting raw data:
  • Cluster: `fetch matching series: ...` wraps RPC calls to each vmstorage node:
    • `rpc at vmstorage <addr>` — per-node RPC
    • `sent N blocks to vmselect` — amount of raw data transmitted back
    • `fetch unique series=N, blocks=N, samples=N, bytes=N` — aggregate summary across all vmstorage nodes
  • Single-node: `search for parts with data for N series` followed by data scan messages. The bytes value in `fetch unique series` tells you the total data transferred and is a good indicator of I/O load.
Rollup computation — computing rate(), increase(), avg_over_time(), etc.:
  • `rollup <func>(): timeRange=..., step=N, window=N`
  • `rollup <func>() with incremental aggregation <agg>() over N series` — this is an optimization
  • `the rollup evaluation needs an estimated N bytes of RAM for N series and N points per series` — memory estimate
  • `parallel process of fetched data: series=N, samples=N` — the actual computation over raw samples
  • `series after aggregation with <func>(): N; samplesScanned=N` — post-aggregation result
This phase often dominates execution time for queries that scan large amounts of raw data.
Response generation — usually trivial:
  • `sort series by metric name and labels`
  • `generate /api/v1/query_range response for series=N, points=N`
This phase is usually trivially fast. It can become a bottleneck only when the response is huge (hundreds of series with thousands of data points per series) and the client reads the response slowly.
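The phase identification above can be sketched as substring matching on node messages. The needle list here is illustrative, not an exhaustive catalog of every message VictoriaMetrics emits:

```python
# Rough phase classifier; needles are a sketch based on the patterns above.
PHASE_PATTERNS = [
    ("series search", ("init series search", "search TSIDs", "search indexDB")),
    ("data fetch", ("fetch matching series", "fetch unique series", "rpc at vmstorage")),
    ("rollup computation", ("rollup ", "parallel process of fetched data")),
    ("aggregation / functions", ("aggregate ", "transform ", "binary op ")),
    ("response generation", ("sort series by metric name", "generate /api/v1/")),
    ("expression evaluation", ("eval: ",)),
]

def classify(message):
    """Return the first phase whose needle appears in the message, else 'other'."""
    for phase, needles in PHASE_PATTERNS:
        if any(n in message for n in needles):
            return phase
    return "other"
```

Ordering matters: data-fetch needles are checked before rollup so that `fetch unique series=...` summaries are not misfiled.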

Step 3: Build the time breakdown

For each phase, note the `duration_msec`. In Cluster traces, the same phases repeat for each vmstorage node — aggregate for the summary but also track per-node numbers to spot imbalances.
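Gathering the per-node numbers can be sketched as a recursive walk that groups `rpc at vmstorage <addr>` durations by address. The message prefix and the synthetic trace are assumptions modeled on the patterns from Step 2:

```python
from collections import defaultdict

def per_node_durations(node, totals=None):
    """Sum duration_msec per vmstorage address over 'rpc at vmstorage <addr>' nodes.

    Assumes rpc nodes are not nested inside one another (as in typical traces);
    nested rpc nodes would be double-counted.
    """
    if totals is None:
        totals = defaultdict(float)
    msg = node.get("message", "")
    prefix = "rpc at vmstorage "
    if msg.startswith(prefix):
        addr = msg[len(prefix):].split()[0].rstrip(":")
        totals[addr] += node.get("duration_msec", 0.0)
    for child in node.get("children", []):
        per_node_durations(child, totals)
    return totals

# Synthetic two-node example:
trace = {
    "duration_msec": 12.0,
    "message": "fetch matching series: ...",
    "children": [
        {"duration_msec": 4.0, "message": "rpc at vmstorage 10.0.0.1:8401", "children": []},
        {"duration_msec": 6.0, "message": "rpc at vmstorage 10.0.0.2:8401", "children": []},
    ],
}
```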

Step 4: Find the bottleneck

Identify which phase consumed the most time and explain why in concrete terms. For instance, "The rollup scanned 212M raw samples" is useful; "the query was slow" is not.

Step 5: Write recommendations

Base recommendations only on what the trace actually shows. If the query is fast and healthy, say so — don't invent problems.
Follow this algorithm to select recommendations:
  • Step 5a: From the time breakdown, identify which single phase dominates (>60% of total latency). Map it to the matching pattern in the "Recommendation patterns" section below.
  • Step 5b: Use ONLY that pattern's recommendations, in the listed priority order. Do not pull recommendations from other patterns.
  • Step 5c: If no single phase exceeds 60%, pick the pattern with the highest contribution and note secondary factors, but still do not mix recommendations across patterns.
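The selection algorithm in Steps 5a-5c reduces to a small helper; the 60% threshold is the one stated above:

```python
def pick_pattern(breakdown):
    """Given {phase: duration_msec}, return (phase, share, dominant).

    dominant is True when the top phase exceeds 60% of total latency, in which
    case recommendations must come from that single pattern only (Step 5b).
    """
    total = sum(breakdown.values())
    phase = max(breakdown, key=breakdown.get)
    share = breakdown[phase] / total if total else 0.0
    return phase, share, share > 0.6
```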

Report format


Query Overview

  • Query: `<the PromQL/MetricsQL expression>`
  • Type: instant / range query
  • Time range: <start> to <end> (<duration>)
  • Step: <step>
  • Result: <N> series, <N> points
  • Version: vmselect or VM single-node version

Performance Summary

  • Total duration: <N>ms
  • Duration score: <Fast / Acceptable / Slow / Very Slow>
  • Matched series: <N> (across all storage nodes)
  • Raw samples scanned: <N>
  • Bytes transferred: <N>
"Duration score" thresholds:
  • Fast: < 500ms
  • Acceptable: 500ms–5s
  • Slow: 5s–10s
  • Very Slow: > 10s
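The thresholds translate directly into a scoring helper:

```python
def duration_score(total_msec):
    """Map total query duration to the report's score buckets."""
    if total_msec < 500:
        return "Fast"
    if total_msec < 5000:
        return "Acceptable"
    if total_msec < 10000:
        return "Slow"
    return "Very Slow"
```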

Execution Time Breakdown

| Phase | Duration | % of Total | Notes |
|---|---|---|---|
| Series search (index) | Xms | X% | |
| Data fetch | Xms | X% | |
| Rollup computation | Xms | X% | |
| Aggregation / functions | Xms | X% | |
| Response generation | Xms | X% | |
(Adapt the phases to what actually appears in the trace. For cluster traces, break down data fetch per storage node.)

Storage Node Breakdown (cluster only)

| Node | Series | Bytes sent | Duration |
|---|---|---|---|
| vmstorage-1 | N | N | Xms |
| vmstorage-2 | N | N | Xms |

Bottleneck Analysis

Name the single biggest contributor to total query time. Explain why it's slow with specific numbers from the trace.

Recommendations

Provide actionable suggestions to reduce query latency (see guidance below).

Upgrade Recommendations (if applicable)

If the changelog check found performance-relevant improvements in newer versions, list them here with version, release date, and description. Only include this section if there are concrete relevant entries. Omit entirely otherwise.

Report generation rules

  • Do not speculate about issues that are not evidenced in the trace. If the trace looks healthy and the query is fast, say so.
  • Don't show block counts in the report; they are useless to users and can be confusing. Focus on bytes instead.
  • Don't report duration imbalance between vmstorage nodes unless it exceeds 1-2s.
  • For durations use the format "Xm" / "Xs" / "Xms" (e.g., "123ms"): minutes only for durations above 60s, seconds for durations above 1000ms, and ms for shorter durations.
  • For data volumes, use human-readable formats (e.g., "268 MB" instead of "268000000 bytes"). Use appropriate units (KB, MB, GB) based on the size.
  • Use the `duration_msec` values directly from the trace — don't estimate durations by subtraction.
  • In cluster traces, the same phases (index search, data scan) appear for each vmstorage node. Aggregate these for the summary but also show a per-node breakdown so the user can spot imbalances.
  • If the trace is too large to read in a single pass, read the top-level structure first, identify the slowest branches by `duration_msec`, and drill into those.
  • Some traces may include `subquery` or `@ modifier` evaluation — treat these like nested eval phases.
  • `mayCache=false` in eval messages is informational, not a problem.
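The duration and data-volume formatting rules above can be sketched as two helpers. Decimal (1000-based) units are assumed here, matching the "268 MB" example:

```python
def fmt_duration(msec):
    """Format durations per the rules above: ms below 1s, seconds below 60s, else minutes."""
    if msec < 1000:
        return f"{msec:.0f}ms"
    if msec < 60_000:
        return f"{msec / 1000:.1f}s"
    return f"{msec / 60_000:.1f}m"

def fmt_bytes(n):
    """Human-readable data volume; decimal (1000-based) units assumed."""
    if n < 1000:
        return f"{n:.0f} bytes"
    for unit in ("KB", "MB", "GB", "TB"):
        n /= 1000
        if n < 1000 or unit == "TB":
            return f"{n:g} {unit}"
```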

Recommendation patterns

CRITICAL: Pattern selection rules
  1. First, identify which ONE pattern below matches your bottleneck from the time breakdown.
  2. Use ONLY the recommendations from that single pattern, in the listed priority order.
  3. Do NOT mix recommendations from different patterns. If the dominant phase exceeds 60% of total latency, all recommendations MUST come from that pattern only.
  4. If a recommendation appears in multiple patterns, that does not make it pattern-independent — only use it if it's listed in YOUR selected pattern.
Base recommendations on what the trace actually shows, sorted by priority. Here are common patterns and the corresponding advice:
High series cardinality (many matched series)
  1. Suggest adding more specific label matchers to reduce series count
  2. Suggest using recording rules to pre-aggregate if this is a dashboard query
  3. Suggest using stream aggregation to pre-aggregate series before storing if possible
  4. Suggest narrowing the time range if the matched metric has a high churn rate
Large raw sample scan (high samplesScanned)
  1. If samplesScanned significantly exceeds the number of samples fetched, the query is using too short a `[window]` or overly aggressive subqueries.
  2. For `rate()/irate()/increase()`, suggest a shorter `[window]` if semantically acceptable
  3. Suggest increasing the query step to reduce points per series
  4. Suggest narrowing the time range
Slow index lookups (series search dominates)
  1. Tag filter cache misses are normal on first query; note that repeated queries should be faster
  2. A large number of metric IDs per day partition suggests a high churn rate or a high-cardinality issue
  3. Suggest more selective filters if possible
  4. Suggest increasing the amount of memory on vmstorage nodes if possible
Slow data fetch / high bytes transferred (cluster)
  1. Large sent bytes suggests vmstorage is doing heavy I/O
  2. For resource saturation recommend checking resource usage on the official Grafana dashboard for VictoriaMetrics.
  3. Suggest checking vmstorage disk I/O and network bandwidth
  4. Multiple vmstorage nodes with very uneven durations (more than 1-2 seconds) may indicate resource saturation or hardware issues.
  5. Suggest adding more vmstorage nodes if possible to horizontally scale I/O capacity
Slow data fetch / high bytes transferred (Single-node)
  1. Large sent bytes suggests VM single-node is doing heavy I/O
  2. For resource saturation recommend checking resource usage on the official Grafana dashboard for VictoriaMetrics.
  3. Suggest checking VM single-node disk I/O and CPU
  4. Suggest increasing disk I/O limits
  5. If optimizing the query or data it queries isn't possible, suggest switching to cluster topology for horizontal scaling of I/O capacity
Rollup computation dominates (often caused by scanning millions of raw samples)
  1. Suggest increasing vmselect CPU limits to improve computation speed.
  2. Suggest adding more specific label matchers to reduce the number of series being processed.
  3. Suggest narrowing the time range to reduce the volume of raw samples.
  4. Suggest increasing the query `step` to reduce points per series.
  5. For recurring dashboard queries, suggest recording rules to pre-compute the result.
Version upgrade opportunities
If the `check_changelog.py` script found relevant performance improvements in newer versions, mention the upgrade as an additional recommendation in the "Upgrade Recommendations" report section. This is the ONE exception to the "single pattern only" rule — upgrade recommendations are supplementary and can be appended regardless of which bottleneck pattern was selected. Only include entries that are directly relevant to the observed bottleneck or the components involved in the trace.
重要:模式选择规则
  1. 首先,从时间拆解中识别瓶颈对应的唯一模式。
  2. 仅使用该模式下的建议,按列出的优先级排序。
  3. 请勿混合不同模式的建议。如果主导阶段占总延迟的60%以上,所有建议必须仅来自该模式。
  4. 如果某条建议出现在多个模式中,并不代表它可以独立于模式使用——仅当它属于你选择的模式时才使用。
基于追踪中实际显示的内容,按优先级排序给出建议。以下是常见模式及对应建议:
高序列基数(匹配大量序列)
  1. 建议添加更具体的标签匹配器以减少序列数量
  2. 如果是仪表盘查询,建议使用记录规则(recording rules)预聚合数据
  3. 如果可能,建议使用流聚合在存储前预聚合序列
  4. 如果匹配的指标具有高变更率,建议缩小时间范围
扫描大量原始样本(samplesScanned值高)
  1. 如果samplesScanned显著超过获取的样本数,说明查询使用的
    [window]
    过短或子查询过于激进
  2. 对于
    rate()/irate()/increase()
    函数,如果语义允许,建议使用更短的
    [window]
  3. 建议增大查询步长以减少每个序列的数据点数
  4. 建议缩小时间范围
索引查找缓慢(序列搜索占主导)
  1. 首次查询时标签过滤器缓存未命中是正常现象;请注明重复查询速度会更快
  2. 每天分区的metric ID数量过多,表明存在高变更率或高基数问题
  3. 如果可能,建议使用更具选择性的过滤器
  4. 如果可能,建议增加vmstorage节点的内存
数据获取缓慢 / 传输字节数高(集群模式)
  1. 传输字节数大表明vmstorage正在执行大量I/O操作
  2. 对于资源饱和问题,建议查看VictoriaMetrics官方Grafana仪表盘的资源使用情况
  3. 建议检查vmstorage的磁盘I/O和网络带宽
  4. 多个vmstorage节点的时长差异过大(超过1-2秒)可能表明资源饱和或硬件问题
  5. 如果可能,建议添加更多vmstorage节点以横向扩展I/O能力
数据获取缓慢 / 传输字节数高(单节点模式)
  1. 传输字节数大表明VM单节点正在执行大量I/O操作
  2. 对于资源饱和问题,建议查看VictoriaMetrics官方Grafana仪表盘的资源使用情况
  3. 建议检查VM单节点的磁盘I/O和CPU使用情况
  4. 建议增加磁盘I/O限制
  5. 如果无法优化查询或查询的数据,建议切换到集群拓扑以横向扩展I/O能力
Rollup计算占主导(通常由扫描数百万原始样本导致)
  1. 建议增大vmselect的CPU限制以提升计算速度
  2. 建议添加更具体的标签匹配器以减少处理的序列数量
  3. 建议缩小时间范围以减少原始样本量
  4. 建议增大查询
    step
    以减少每个序列的数据点数
  5. 对于重复执行的仪表盘查询,建议使用记录规则预计算结果
版本升级机会 如果
check_changelog.py
脚本发现新版本中有相关的性能改进,请在报告的“版本升级建议”部分提及升级。这是“仅使用单一模式建议”规则的唯一例外——升级建议是补充内容,无论选择哪种瓶颈模式,都可以附加。仅包含与观察到的瓶颈或追踪中涉及的组件直接相关的条目。