VictoriaMetrics Query Trace Analyzer
You are analyzing a VictoriaMetrics query trace — a JSON tree that records every step of a PromQL query execution. Your goal is to read this tree, understand what happened, and produce a clear performance report with actionable recommendations.
Background
In Cluster mode two components are involved in query processing:
- vmselect — query frontend that accepts PromQL or MetricsQL queries, fetches data from vmstorage nodes, and applies calculations
- vmstorage — stores time series data and serves it to vmselect over RPC
Single-node mode runs everything in one process. The trace structure is similar but without RPC wrappers.
You can tell which mode you're looking at from the root message in the trace:
- Cluster traces contain
vmselect-<version>: /select/...
,
- Single-node traces contain
/victoria-metrics-<version>: /api/v1/...
.
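The mode check above can be sketched in a few lines. This is a minimal illustration, not part of the bundled scripts: the function name `detect_mode` and the version strings in the usage lines are hypothetical, and it assumes the root message contains one of the two prefixes quoted above.

```python
def detect_mode(root_message: str) -> str:
    """Classify a trace as cluster or single-node from its root message."""
    # Cluster roots look like "vmselect-<version>: /select/..."
    if "vmselect-" in root_message:
        return "cluster"
    # Single-node roots look like "/victoria-metrics-<version>: /api/v1/..."
    if "victoria-metrics-" in root_message:
        return "single-node"
    # Anything else is left for manual inspection.
    return "unknown"


# Hypothetical root messages, shortened for illustration.
print(detect_mode("vmselect-v0.0.0: /select/..."))
print(detect_mode("/victoria-metrics-v0.0.0: /api/v1/..."))
```

Returning "unknown" instead of raising keeps the sketch usable on truncated or unexpected traces.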
What is a query trace?
When you add
trace=1
to a VictoriaMetrics HTTP query, it returns a JSON tree describing every internal operation.
Each node looks like this:
```json
{
  "duration_msec": 123.456,
  "message": "description of what happened",
  "children": [ ... ]
}
```
The tree is rooted at vmselect. It captures the full query execution pipeline: parsing, series search, data fetch from storage, rollup computation, aggregation, and response generation.
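To make the node structure concrete, here is a minimal sketch that walks such a tree and lists the slowest nodes. It assumes only the three fields shown above (`duration_msec`, `message`, `children`). The helper name `slowest_nodes` and the sample trace are hypothetical, and this is not the logic of the bundled parse script.

```python
import json


def slowest_nodes(node, limit=5):
    """Return up to `limit` nodes as (duration_msec, depth, message), slowest first."""
    found = []

    def walk(current, depth):
        found.append((current.get("duration_msec", 0.0), depth, current.get("message", "")))
        # "children" may be missing or null on leaf nodes.
        for child in current.get("children") or []:
            walk(child, depth + 1)

    walk(node, 0)
    return sorted(found, key=lambda item: item[0], reverse=True)[:limit]


# Hypothetical trace with made-up messages and durations.
trace = json.loads("""
{
  "duration_msec": 100.0,
  "message": "root",
  "children": [
    {"duration_msec": 80.0, "message": "child a",
     "children": [{"duration_msec": 75.5, "message": "leaf"}]},
    {"duration_msec": 15.0, "message": "child b"}
  ]
}
""")

for duration, depth, message in slowest_nodes(trace, limit=3):
    print(f"{duration:>8.1f} ms  depth={depth}  {message}")
```

A parent's duration includes its children, so the slowest entries are usually nested along one branch rather than independent costs.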
How to analyze the trace
Step 0: Run the parse script
Before manually reading the trace file, run the analysis script to extract structured data:
```bash
python3 <skill_base_dir>/scripts/parse_trace.py <trace_file>
```
This outputs: root info, trace tree (depth 3), key nodes with durations, per-vmstorage RPC breakdown, and computed totals (bytes, samples, series). Use this output as your primary data source for the report.
Additional subcommands for deeper investigation:
python3 <script> <trace> tree --depth N
— print the trace tree to depth N
python3 <script> <trace> nodes --pattern "fetch unique"
— find all nodes matching a substring
Only drill deeper if the summary output reveals ambiguities or missing information.
After running the summary, also check for relevant performance improvements in newer VictoriaMetrics versions:
```bash
python3 <skill_base_dir>/scripts/check_changelog.py <version> <mode>
```
Here <version> is the semver from the parse script output and <mode> is the deployment mode (cluster or single-node). The script fetches changelogs from GitHub and shows performance-relevant fixes/features in versions newer than the one the trace was captured on. If the fetch fails, skip this section gracefully.
Step 1: Start at the root
Read the trace JSON file the user provides (or use the script output from Step 0).
The root node tells you the big picture. Extract:
- Endpoint: (instant) or (range)
- Query: the PromQL expression after
- Time parameters: , , (for range queries)
- Result count: at the end
- Total duration: the root duration_msec
- Version: printed at the very start of the trace.
Step 2: Identify the phases
Walk the top-level children and classify each into one of these phases.
Not every trace has all of them — just report what's there.
For large traces, focus on the top-level children first.
Drill into subtrees only when they are relevant to the bottleneck or when durations are surprising.
A query trace typically has these phases, roughly in this order.
Not all phases appear in every trace. Identify them by matching the message patterns described here.
Expression evaluation — nodes matching:
eval: query=..., timeRange=..., step=..., mayCache=...: series=N, points=N, pointsPerSeries=N
These trace the recursive evaluation of the PromQL/MetricsQL expression tree.
Each eval node may have children for sub-expressions. Key numbers:
- series — number of time series produced by this sub-expression
- points — total data points across all series
- pointsPerSeries — data points per series
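The three key numbers can be pulled out of an eval message with a regular expression. This is a sketch under the assumption that the counters appear as plain integers in exactly the order shown in the pattern above. The function name `parse_eval_counts` and the sample message are hypothetical.

```python
import re

# Matches the trailing counters of an eval message.
EVAL_COUNTS_RE = re.compile(r"series=(\d+), points=(\d+), pointsPerSeries=(\d+)")


def parse_eval_counts(message):
    """Return the eval counters as a dict, or None if the message has no such counters."""
    match = EVAL_COUNTS_RE.search(message)
    if match is None:
        return None
    series, points, points_per_series = (int(value) for value in match.groups())
    return {"series": series, "points": points, "pointsPerSeries": points_per_series}


# Hypothetical eval message following the pattern above.
message = "eval: query=up, timeRange=..., step=..., mayCache=...: series=3, points=90, pointsPerSeries=30"
print(parse_eval_counts(message))
```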
Functions and aggregations — nodes matching:
transform <func>(): series=N
— PromQL functions (histogram_quantile, clamp, etc.)
aggregate <func>(): series=N
— aggregation operators (sum, avg, max, etc.)
binary op "<op>": series=N
— binary operations
Series search (index lookup) — where label matchers get resolved to internal series IDs:
- In Cluster traces, wrapped in → , in Single-node - appears directly without RPC wrappers
- Key messages:
- ,
- ,
search N indexDBs in parallel
— parallel index database search,
- — individual index partition,
found N metric ids for filter=...
— metric ID, unique time series identifier within vmstorage,
found N TSIDs for N metricIDs
— same as metric ID,
- Cache-related messages in this phase:
search for metricIDs in tag filters cache
followed by or a cache hit (no child)
- /
stored N metricIDs into cache
Data fetch — getting raw data:
- Cluster:
fetch matching series: ...
wraps RPC calls to each vmstorage node:
- — per-node RPC,
sent N blocks to vmselect
— amount of raw data transmitted back,
fetch unique series=N, blocks=N, samples=N, bytes=N
— aggregate summary across all vmstorage nodes,
- Single-node:
search for parts with data for N series
followed by data scan messages.
The bytes value in the fetch unique series message tells you the total data transferred and is a good indicator of I/O load.
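As an illustration, the aggregate fetch summary can be parsed like this. The sketch assumes the message follows the `fetch unique series=N, blocks=N, samples=N, bytes=N` pattern listed above with plain integer values. The name `parse_fetch_summary` and the sample numbers are hypothetical.

```python
import re

FETCH_SUMMARY_RE = re.compile(
    r"fetch unique series=(\d+), blocks=(\d+), samples=(\d+), bytes=(\d+)"
)


def parse_fetch_summary(message):
    """Return series, blocks, samples and bytes from a fetch summary, or None."""
    match = FETCH_SUMMARY_RE.search(message)
    if match is None:
        return None
    keys = ("series", "blocks", "samples", "bytes")
    return dict(zip(keys, (int(value) for value in match.groups())))


# Hypothetical message with made-up numbers.
summary = parse_fetch_summary("fetch unique series=1200, blocks=4800, samples=2120000, bytes=268000000")
print(summary["bytes"])
```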
Rollup computation — computing rate(), increase(), avg_over_time(), etc.:
rollup <func>(): timeRange=..., step=N, window=N
rollup <func>() with incremental aggregation <agg>() over N series
— this is an optimization
the rollup evaluation needs an estimated N bytes of RAM for N series and N points per series
— memory estimate
parallel process of fetched data: series=N, samples=N
— the actual computation over raw samples
series after aggregation with <func>(): N; samplesScanned=N
— post-aggregation result
This phase often dominates execution time for queries that scan large amounts of raw data.
Response generation — usually trivial:
sort series by metric name and labels
generate /api/v1/query_range response for series=N, points=N
This phase is usually trivially fast. It can become a bottleneck when the response is huge (hundreds of series with thousands of data points each) and the client reads the response slowly.
Step 3: Build the time breakdown
For each phase, note its duration_msec.
In Cluster traces, the same phases repeat for each vmstorage node. Aggregate them for the summary, but also track per-node numbers to spot imbalances.
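A per-node comparison can be sketched as follows. The helper name `node_imbalance`, the node labels, and the durations are hypothetical. The 2000 ms default is an assumption taken from the upper end of the 1-2s guidance in the report generation rules.

```python
def node_imbalance(per_node_ms, threshold_ms=2000.0):
    """Return (slowest, fastest, gap_ms) if the gap exceeds threshold_ms, else None."""
    if len(per_node_ms) < 2:
        return None  # nothing to compare against
    slowest = max(per_node_ms, key=per_node_ms.get)
    fastest = min(per_node_ms, key=per_node_ms.get)
    gap_ms = per_node_ms[slowest] - per_node_ms[fastest]
    return (slowest, fastest, gap_ms) if gap_ms > threshold_ms else None


# Hypothetical per-node RPC durations in milliseconds.
durations = {"vmstorage-1": 500.0, "vmstorage-2": 3200.0, "vmstorage-3": 650.0}
print(node_imbalance(durations))
```

Comparing only the slowest and fastest node keeps the check simple. A real report may also want to show the full per-node list.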
Step 4: Find the bottleneck
Identify which phase consumed the most time and explain why in concrete terms.
For instance, "The rollup scanned 212M raw samples" is useful; "the query was slow" is not.
Step 5: Write recommendations
Base recommendations only on what the trace actually shows.
If the query is fast and healthy, say so — don't invent problems.
Follow this algorithm to select recommendations:
- Step 5a: From the time breakdown, identify which single phase dominates (>60% of total latency). Map it to the matching pattern in the "Recommendation patterns" section below.
- Step 5b: Use ONLY that pattern's recommendations, in the listed priority order. Do not pull recommendations from other patterns.
- Step 5c: If no single phase exceeds 60%, pick the pattern with the highest contribution and note secondary factors, but still do not mix recommendations across patterns.
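The selection rule in Steps 5a to 5c can be expressed as a small function. This is a sketch only: the name `dominant_phase`, the phase labels, and the durations are hypothetical, and the 60% threshold comes from Step 5a.

```python
def dominant_phase(phase_ms, total_ms, threshold=0.60):
    """Return (phase, share_of_total, exceeds_threshold) for the largest phase."""
    if not phase_ms or total_ms <= 0:
        raise ValueError("need at least one phase and a positive total duration")
    phase = max(phase_ms, key=phase_ms.get)
    share = phase_ms[phase] / total_ms
    # When exceeds_threshold is False, Step 5c applies: still use this phase's
    # pattern, but mention the secondary factors.
    return phase, share, share > threshold


# Hypothetical time breakdown in milliseconds.
breakdown = {"Series search": 200.0, "Data fetch": 300.0, "Rollup computation": 1500.0}
print(dominant_phase(breakdown, total_ms=2000.0))
```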
Report format
markdown
## Query Overview
- **Query:** `<the PromQL/MetricsQL expression>`
- **Type:** instant / range query
- **Time range:** <start> to <end> (<duration>)
- **Step:** <step>
- **Result:** <N> series, <N> points
- **Version:** vmselect or VM single-node version
## Performance Summary
- **Total duration:** <N>ms
- **Duration score:** <Fast / Acceptable / Slow / Very Slow>
- **Matched series:** <N> (across all storage nodes)
- **Raw samples scanned:** <N>
- **Bytes transferred:** <N>
"Duration score" thresholds:
- Fast: < 500ms
- Acceptable: 500ms–5s
- Slow: 5s–10s
- Very Slow: > 10s
## Execution Time Breakdown
| Phase | Duration | % of Total | Notes |
|-------|----------|------------|-------|
| Series search (index) | Xms | X% | |
| Data fetch | Xms | X% | |
| Rollup computation | Xms | X% | |
| Aggregation / functions | Xms | X% | |
| Response generation | Xms | X% | |
(Adapt the phases to what actually appears in the trace.
For cluster traces, break down data fetch per storage node.)
## Storage Node Breakdown (cluster only)
| Node | Series | Bytes | Duration |
|------|--------|-------------|----------|
| vmstorage-1 | N | N | Xms |
| vmstorage-2 | N | N | Xms |
## Bottleneck Analysis
Name the single biggest contributor to total query time. Explain why it's slow with specific numbers from the trace.
## Recommendations
Provide actionable suggestions to reduce query latency (see guidance below).
## Upgrade Recommendations (if applicable)
If the changelog check found performance-relevant improvements in newer versions,
list them here with version, release date, and description.
Only include this section if there are concrete relevant entries. Omit entirely otherwise.
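The "Duration score" thresholds from the report format can be sketched as a lookup. The function name `duration_score` is hypothetical. Putting exactly 5s under Acceptable and exactly 10s under Slow is an assumption, since the listed ranges share their boundaries.

```python
def duration_score(total_ms):
    """Map a total query duration in milliseconds to the report's duration score."""
    if total_ms < 500:
        return "Fast"
    if total_ms <= 5_000:   # assumption: exactly 5s counts as Acceptable
        return "Acceptable"
    if total_ms <= 10_000:  # assumption: exactly 10s counts as Slow
        return "Slow"
    return "Very Slow"


print(duration_score(123.456))
print(duration_score(7_000))
```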
Report generation rules
- Do not speculate about issues that are not evidenced in the trace. If the trace looks healthy and the query is fast, say so.
- Don't show information about blocks in the report: it is useless for users and can be confusing. Focus on bytes instead.
- Don't report an imbalance in duration between vmstorage nodes unless it exceeds 1-2s.
- For durations use format "Xm" / "Xs" / "Xms" (e.g., "123ms"), use minutes only for durations above 60s, seconds for durations above 1000ms, and ms for shorter durations.
- For data volumes, use human-readable formats (e.g., "268 MB" instead of "268000000 bytes"). Use appropriate units (KB, MB, GB) based on the size.
- Use the values directly from the trace — don't estimate durations by subtraction
- In cluster traces, the same phases (index search, data scan) appear for each vmstorage node. Aggregate these for the summary but also show per-node breakdown so the user can spot imbalances.
- If the trace is too large to read in a single pass, read the top-level structure first, identify the slowest branches by duration_msec, and drill into those.
- Some traces may include or evaluation — treat these like nested eval phases.
- in eval messages is informational, not a problem
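The duration and data volume formatting rules above can be sketched as two helpers. The names `format_duration` and `format_bytes` are hypothetical. Rounding to one decimal place and using decimal units (1 MB = 1,000,000 bytes, matching the "268 MB" example) are assumptions.

```python
def format_duration(ms):
    """Format milliseconds as "Xm", "Xs" or "Xms" following the report rules."""
    if ms > 60_000:
        return f"{round(ms / 60_000, 1):g}m"
    if ms > 1_000:
        return f"{round(ms / 1_000, 1):g}s"
    return f"{ms:.0f}ms"


def format_bytes(num_bytes):
    """Format a byte count with decimal units, e.g. 268000000 -> "268 MB"."""
    value = float(num_bytes)
    for unit in ("bytes", "KB", "MB"):
        if value < 1_000:
            return f"{round(value, 1):g} {unit}"
        value /= 1_000
    return f"{round(value, 1):g} GB"


print(format_duration(123.456), format_duration(2_500), format_duration(90_000))
print(format_bytes(268_000_000))
```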
Recommendation patterns
CRITICAL: Pattern selection rules
- First, identify which ONE pattern below matches your bottleneck from the time breakdown.
- Use ONLY the recommendations from that single pattern, in the listed priority order.
- Do NOT mix recommendations from different patterns. If the dominant phase exceeds 60% of total latency, all recommendations MUST come from that pattern only.
- If a recommendation appears in multiple patterns, that does not make it pattern-independent — only use it if it's listed in YOUR selected pattern.
Base recommendations on what the trace actually shows, sorted by priority.
Here are common patterns and the corresponding advice:
High series cardinality (many matched series)
- Suggest adding more specific label matchers to reduce series count
- Suggest using recording rules to pre-aggregate if this is a dashboard query
- Suggest using stream aggregation to pre-aggregate series before storing if possible
- Suggest narrowing the time range if the matched metric has a high churn rate
Large raw sample scan (high samplesScanned)
- If samplesScanned significantly exceeds the number of samples fetched, then the query is using subqueries that are too short or too aggressive.
- For
rate()/irate()/increase()
suggest a shorter lookbehind window if semantically acceptable
- Suggest increasing the query step to reduce points per series
- Suggest narrowing the time range
Slow index lookups (series search dominates)
- Tag filter cache misses are normal on first query; note that repeated queries should be faster
- A large number of metric IDs per day partition suggests a high churn rate or high cardinality
- Suggest more selective filters if possible
- Suggest increasing the amount of memory on vmstorage nodes if possible
Slow data fetch / high bytes transferred (cluster)
- A large number of bytes sent suggests vmstorage is doing heavy I/O
- If resource saturation is suspected, recommend checking resource usage on the official Grafana dashboard for VictoriaMetrics.
- Suggest checking vmstorage disk I/O and network bandwidth
- Multiple vmstorage nodes with very uneven durations (more than 1-2 seconds) may indicate resource saturation or hardware issues.
- Suggest adding more vmstorage nodes if possible to horizontally scale I/O capacity
Slow data fetch / high bytes transferred (Single-node)
- A large number of bytes sent suggests VM single-node is doing heavy I/O
- If resource saturation is suspected, recommend checking resource usage on the official Grafana dashboard for VictoriaMetrics.
- Suggest checking VM single-node disk I/O and CPU
- Suggest increasing disk I/O limits
- If optimizing the query or data it queries isn't possible, suggest switching to cluster topology for horizontal scaling of I/O capacity
Rollup computation dominates (often caused by scanning millions of raw samples)
- Suggest increasing vmselect CPU limits to improve computation speed.
- Suggest adding more specific label matchers to reduce the number of series being processed.
- Suggest narrowing the time range to reduce the volume of raw samples.
- Suggest increasing the query step to reduce points per series.
- For recurring dashboard queries, suggest recording rules to pre-compute the result.
Version upgrade opportunities
If the check_changelog.py script found relevant performance improvements in newer versions, mention the upgrade as an additional recommendation in the "Upgrade Recommendations" report section. This is the ONE exception to the "single pattern only" rule: upgrade recommendations are supplementary and can be appended regardless of which bottleneck pattern was selected. Only include entries that are directly relevant to the observed bottleneck or the components involved in the trace.