promql-validator

How This Skill Works

This skill performs multi-level validation and provides interactive query planning:
  1. Syntax Validation: Checks for syntactically correct PromQL expressions
  2. Semantic Validation: Ensures queries make logical sense (e.g., rate() on counters, not gauges)
  3. Anti-Pattern Detection: Identifies common mistakes and inefficient patterns
  4. Optimization Suggestions: Recommends performance improvements
  5. Query Explanation: Translates PromQL to plain English
  6. Interactive Planning: Helps users clarify intent and refine queries

Workflow

When a user provides a PromQL query, follow this workflow:

Working Directory Requirement

Run validation commands from the repository root so relative paths resolve correctly:
bash
cd "$(git rev-parse --show-toplevel)"
If running from another location, use absolute paths to the `scripts/` files.

Step 1: Validate Syntax

Run the syntax validation script to check for basic correctness:
bash
python3 devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py "<query>"
Output parsing notes:
  • Exit `0`: syntax valid
  • Exit non-zero: syntax failure; include stderr and pinpoint token/position
  • Prefer quoting the smallest failing fragment, then provide corrected query
The script will check for:
  • Valid metric names and label matchers
  • Correct operator usage
  • Proper function syntax
  • Valid time durations and ranges
  • Balanced brackets and quotes
  • Correct use of modifiers (offset, @)
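The exit-code handling described above can be sketched as a small Python wrapper. This is only an illustration: `run_validator` and its return shape are hypothetical names that are not part of the skill, and the command passed in is whatever invokes `validate_syntax.py` in your checkout.

```python
import subprocess
import sys


def run_validator(cmd):
    """Run a validator command and return (is_valid, stderr_text).

    Hypothetical helper: exit code 0 is treated as valid syntax,
    any other exit code as a syntax failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    query = 'rate(http_requests_total{job="api"}[5m])'
    script = "devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py"
    ok, err = run_validator([sys.executable, script, query])
    # On failure, surface stderr so the failing token/position can be quoted
    print("syntax valid" if ok else f"syntax failure: {err}")
```

Returning stderr alongside the boolean keeps the failing token or position available for the report, as the parsing notes require.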

Step 2: Check Best Practices

Run the best practices checker to detect anti-patterns and optimization opportunities:
bash
python3 devops-skills-plugin/skills/promql-validator/scripts/check_best_practices.py "<query>"
Output parsing notes:
  • Treat script sections as independent findings (cardinality, metric-type misuse, regex misuse, etc.)
  • If script output is empty but the query is complex, add a manual sanity pass and mark it as `manual-review`
  • Preserve script wording for finding labels, then add remediation in plain English
The script will identify:
  • High cardinality queries without label filters
  • Inefficient regex matchers that could be exact matches
  • Missing rate()/increase() on counter metrics
  • rate() used on gauge metrics
  • Averaging pre-calculated quantiles
  • Subqueries with excessive time ranges
  • irate() over long time ranges
  • Opportunities to add more specific label filters
  • Complex queries that should use recording rules

Step 3: Explain the Query

Parse and explain what the query does in plain English:
  • What metrics are being queried
  • What type of metrics they are (counter, gauge, histogram, summary)
  • What functions are applied and why
  • What the query calculates
  • What labels will be in the output
  • What the expected result structure looks like
Required Output Details (always include these explicitly):
**Output Labels**: [list labels that will be in the result, or "None (fully aggregated to a single unlabeled series)"]
**Expected Result Structure**: [instant vector / range vector / scalar] with [N series / single value]
Example:
**Output Labels**: job, instance
**Expected Result Structure**: Instant vector with one series per job/instance combination

Line-Number Citation Method (Required)

When citing examples/docs in recommendations, include file path + 1-based line numbers:
text
examples/good_queries.promql:42
docs/best_practices.md:88
Rules:
  • Cite the most relevant single line (or start line if multi-line snippet)
  • Keep citations tight; do not cite full files
  • If line numbers are unavailable, state `line number unavailable` and provide the file path
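A minimal sketch of how such a citation could be produced, assuming the file contents are already loaded as a string. `cite_line` is a hypothetical helper, not something the skill ships.

```python
def cite_line(path, text, needle):
    """Return 'path:N' for the first line containing needle (1-based),
    or the fallback wording when no line matches."""
    for number, line in enumerate(text.splitlines(), start=1):
        if needle in line:
            return f"{path}:{number}"
    return f"{path} (line number unavailable)"


# Illustrative content only; not the real examples file
sample = "# request rate\nrate(http_requests_total[5m])\n"
print(cite_line("examples/good_queries.promql", sample, "rate("))
```

Citing only the first matching line keeps the citation tight, in line with the rules above.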

Step 4: Interactive Query Planning (Phase 1 - STOP AND WAIT)

Ask the user clarifying questions to verify the query matches their intent:
  1. Understand the Goal: "What are you trying to monitor or measure?"
    • Request rate, error rate, latency, resource usage, etc.
  2. Verify Metric Type: "Is this a counter (always increasing), gauge (can go up/down), histogram, or summary?"
    • This affects which functions to use
  3. Clarify Time Range: "What time window do you need?"
    • Instant value, rate over time, historical analysis
  4. Confirm Aggregation: "Do you need to aggregate data across labels? If so, which labels?"
    • by (job), by (instance), without (pod), etc.
  5. Check Output Intent: "Are you using this for alerting, dashboarding, or ad-hoc analysis?"
    • Affects optimization priorities
IMPORTANT: Two-Phase Dialogue
After presenting Steps 1-4 results (Syntax, Best Practices, Query Explanation, and Intent Questions):
⏸️ STOP HERE AND WAIT FOR USER RESPONSE
Do NOT proceed to Steps 5-7 until the user answers the clarifying questions. This ensures the subsequent recommendations are tailored to the user's actual intent.

Step 5: Compare Intent vs Implementation (Phase 2 - After User Response)

Only proceed to this step after the user has answered the clarifying questions from Step 4.
After understanding the user's intent:
  • Explain what the current query actually does
  • Highlight any mismatches between intent and implementation
  • Suggest corrections if the query doesn't match the goal
  • Offer alternative approaches if applicable
When relevant, mention known limitations:
  • Note when metric type detection is heuristic-based (e.g., "The script inferred this is a gauge based on the `_bytes` suffix. Please confirm if this is correct.")
  • Acknowledge when high-cardinality warnings might be false positives (e.g., "This warning may not apply if you're using a recording rule or know your cardinality is low.")

Step 6: Offer Optimizations

Based on validation results:
  • Suggest more efficient query patterns
  • Recommend recording rules for complex/repeated queries
  • Propose better label matchers to reduce cardinality
  • Advise on appropriate time ranges
Reference Examples: When suggesting corrections, cite relevant examples using this format:
As shown in `examples/bad_queries.promql` (lines 91-97):
❌ BAD: `avg(http_request_duration_seconds{quantile="0.95"})`
✅ GOOD: Use histogram_quantile() with histogram buckets
Citation sources:
  • `examples/good_queries.promql` - for well-formed patterns
  • `examples/optimization_examples.promql` - for before/after comparisons
  • `examples/bad_queries.promql` - for showing what to avoid
  • `docs/best_practices.md` - for detailed explanations
  • `docs/anti_patterns.md` - for anti-pattern deep dives
Citation Format: `file_path (lines X-Y)` with the relevant code snippet quoted

Step 7: Let User Plan/Refine

Give the user control:
  • Ask if they want to modify the query
  • Offer to help rewrite it for better performance
  • Provide multiple alternatives if applicable
  • Explain trade-offs between different approaches

Key Validation Rules

Syntax Rules

  1. Metric Names: Must match `[a-zA-Z_:][a-zA-Z0-9_:]*` or use UTF-8 quoting syntax (Prometheus 3.0+):
    • Quoted form: `{"my.metric.with.dots"}`
    • Using name label: `{__name__="my.metric.with.dots"}`
  2. Label Matchers: `=` (equal), `!=` (not equal), `=~` (regex match), `!~` (regex not match)
  3. Time Durations: `[0-9]+(ms|s|m|h|d|w|y)` - e.g., `5m`, `1h`, `7d`
  4. Range Vectors: `metric_name[duration]` - e.g., `http_requests_total[5m]`
  5. Offset Modifier: `offset <duration>` - e.g., `metric_name offset 5m`
  6. @ Modifier: `@ <timestamp>` or `@ start()` / `@ end()`
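The two patterns above can be applied directly with Python's `re` module. The sketch below mirrors them as written; the function names are hypothetical and this is not necessarily how `validate_syntax.py` is implemented. Note that the simple duration pattern rejects compound durations such as `1h30m`, which Prometheus itself accepts.

```python
import re

# Patterns copied from the syntax rules above
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
DURATION = re.compile(r"[0-9]+(ms|s|m|h|d|w|y)")


def is_valid_metric_name(name):
    """True if the whole string is a legacy (unquoted) metric name."""
    return METRIC_NAME.fullmatch(name) is not None


def is_valid_duration(text):
    """True if the whole string is a single number-plus-unit duration."""
    return DURATION.fullmatch(text) is not None


print(is_valid_metric_name("http_requests_total"))   # unquoted name
print(is_valid_metric_name("my.metric.with.dots"))   # needs UTF-8 quoting
print(is_valid_duration("5m"))
```

Using `fullmatch` rather than `match` matters here: a prefix match would accept strings like `5min` because `5m` matches at the start.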

Semantic Rules

  1. rate() and irate(): Should only be used with counter metrics (by convention, metrics ending in `_total`, `_count`, `_sum`, or `_bucket`)
  2. Counters: Should typically use `rate()` or `increase()`, not raw values
  3. Gauges: Should not use `rate()` or `increase()`
  4. Histograms: Use `histogram_quantile()` with the `le` label and `rate()` on `_bucket` metrics
  5. Summaries: Don't average quantiles; calculate averages from `_sum` and `_count`
  6. Aggregations: Use `by()` or `without()` to control output labels
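The suffix convention in rule 1 lends itself to a simple heuristic check. The following is a sketch under that assumption only; `looks_like_counter` and `check_rate_usage` are hypothetical names, and a suffix test cannot classify metrics with non-standard names.

```python
# Suffixes listed in the semantic rules above
COUNTER_SUFFIXES = ("_total", "_count", "_sum", "_bucket")


def looks_like_counter(metric_name):
    """Heuristic: counters are recognized by their naming suffix only."""
    return metric_name.endswith(COUNTER_SUFFIXES)


def check_rate_usage(function, metric_name):
    """Return a warning string when a rate-style function wraps a
    metric that does not look like a counter, otherwise None."""
    rate_like = ("rate", "irate", "increase")
    if function in rate_like and not looks_like_counter(metric_name):
        return f"{function}() applied to {metric_name}, which does not look like a counter"
    return None


print(check_rate_usage("rate", "memory_usage_bytes"))
print(check_rate_usage("rate", "http_requests_total"))
```

Because the result is a guess based on naming, any warning it produces is best phrased as a question to the user rather than a definite error.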

Performance Rules

  1. Cardinality: Always use specific label matchers to reduce series count
  2. Regex: Use `=` instead of `=~` when possible for exact matches
  3. Rate Range: Should be at least 4x the scrape interval (typically `[2m]` minimum)
  4. irate(): Best for short ranges (5m or less); use `rate()` for longer periods
  5. Subqueries: Avoid excessive time ranges that process millions of samples
  6. Recording Rules: Use for complex queries accessed frequently
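The rate-range rule is plain arithmetic, sketched below with hypothetical helper names. With a 30-second scrape interval, four intervals give the `[2m]` minimum mentioned above; a 15-second interval would allow `[1m]`.

```python
def min_rate_window_seconds(scrape_interval_seconds, factor=4):
    """Smallest recommended range for rate(), in seconds."""
    return scrape_interval_seconds * factor


def is_rate_window_ok(window_seconds, scrape_interval_seconds):
    """True if the range covers at least four scrape intervals."""
    return window_seconds >= min_rate_window_seconds(scrape_interval_seconds)


print(min_rate_window_seconds(30))   # seconds needed for a 30s scrape interval
print(is_rate_window_ok(60, 30))     # is [1m] enough at a 30s interval?
```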

Anti-Patterns to Detect

High Cardinality Issues

Bad:
http_requests_total{}
  • Matches all time series without filtering
Good:
http_requests_total{job="api", instance="prod-1"}
  • Specific label filters reduce cardinality
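A rough sketch of how an unfiltered selector might be flagged. It assumes a single selector rather than a full expression, and it skips names containing a colon, since those follow the recording-rule naming convention. The names `BARE_SELECTOR` and `should_warn_unfiltered` are hypothetical.

```python
import re

# A metric name, optionally followed by an empty {} selector
BARE_SELECTOR = re.compile(r"^\s*([a-zA-Z_:][a-zA-Z0-9_:]*)\s*(\{\s*\})?\s*$")


def should_warn_unfiltered(selector):
    """True when the selector has no label matchers and the metric
    does not look like a recording rule (no ':' in its name)."""
    match = BARE_SELECTOR.match(selector)
    if match is None:
        return False
    return ":" not in match.group(1)


print(should_warn_unfiltered("http_requests_total{}"))
print(should_warn_unfiltered('http_requests_total{job="api", instance="prod-1"}'))
```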

Regex Overuse

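Bad:
http_requests_total{status=~"2.."}
  • Regex is slower and, when only one status code is wanted, matches more series than intended
Good:
http_requests_total{status="200"}
  • Exact match is faster; keep the regex only when every 2xx code is really needed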

Missing rate() on Counters

Bad:
http_requests_total
  • Raw counter values are rarely useful on their own (they only ever increase until a reset)
Good:
rate(http_requests_total[5m])
  • Rate shows requests per second
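To illustrate why the rate is more informative than the raw counter, here is a simplified per-second rate calculation over made-up samples. It is a sketch only: unlike Prometheus's `rate()`, it does not extrapolate to the window boundaries, and `simple_rate` is a hypothetical name.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.

    Sums the increases between consecutive samples, treating any
    decrease as a counter reset (the counter restarted from zero),
    then divides by the elapsed time.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# Illustrative samples: the counter grows by 60 every 60 seconds
print(simple_rate([(0, 100), (60, 160), (120, 220)]))
```

The raw values 100, 160, 220 say little by themselves, while the computed rate of one request per second is directly comparable across instances and restarts.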

rate() on Gauges

Bad:
rate(memory_usage_bytes[5m])
  • Gauges measure current state, not cumulative values
Good:
memory_usage_bytes
  • Use gauge value directly or with `avg_over_time()`

Averaging Quantiles

Bad:
avg(http_request_duration_seconds{quantile="0.95"})
  • Mathematically invalid to average pre-calculated quantiles
Good:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  • Calculate quantile from histogram buckets
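A small numeric illustration of why averaging quantiles misleads. The latency values are invented for the example, and the nearest-rank `percentile` helper is a stand-in for however each instance computes its own quantile.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling division
    return ordered[rank - 1]


# Hypothetical traffic: one busy fast instance, one quiet slow instance
fast_instance = [0.1] * 1000   # 1000 requests at 100 ms
slow_instance = [2.0] * 10     # 10 requests at 2 s

avg_of_p95 = (percentile(fast_instance, 95) + percentile(slow_instance, 95)) / 2
true_p95 = percentile(fast_instance + slow_instance, 95)

print(f"average of per-instance p95: {avg_of_p95:.2f}s")
print(f"p95 over all requests:       {true_p95:.2f}s")
```

Averaging gives about 1.05 s, while the 95th percentile over all requests is 0.1 s, because the average ignores how many requests each instance served. Aggregating bucket counts before computing the quantile avoids this.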

Excessive Subquery Ranges

Bad:
rate(metric[5m])[90d:1m]
  • Processes millions of samples, very slow
Good: Use recording rules or limit range to necessary duration
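A back-of-the-envelope estimate of the work behind the bad example, per matched series. The 15-second scrape interval is an assumption for illustration, `subquery_cost` is a hypothetical helper, and the figures are rough because real evaluation details differ.

```python
def subquery_cost(range_seconds, step_seconds, inner_window_seconds, scrape_interval_seconds):
    """Return (evaluation_steps, samples_read) for one series."""
    steps = range_seconds // step_seconds
    samples_per_step = inner_window_seconds // scrape_interval_seconds
    return steps, steps * samples_per_step


DAY = 24 * 60 * 60
# rate(metric[5m])[90d:1m], assuming a 15s scrape interval
steps, samples = subquery_cost(90 * DAY, 60, 5 * 60, 15)
print(f"{steps} evaluations, about {samples} samples read per series")
```

Under these assumptions that is 129,600 evaluations and roughly 2.6 million sample reads for every series the selector matches, which is why a recording rule or a shorter range is preferable.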

irate() Over Long Ranges

Bad:
irate(metric[1h])
  • irate() only looks at last two samples, range is wasted
Good:
rate(metric[1h])
or
irate(metric[5m])
  • Use rate() for longer ranges or reduce irate() range
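One way this check could be sketched with a regular expression. The function names and the 5-minute threshold are assumptions taken from the guidance in this document, and the pattern is deliberately simple: it will not handle label matchers that themselves contain square brackets.

```python
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}
# irate( <selector without brackets> [ <number><unit> ]
IRATE_RANGE = re.compile(r"irate\([^\[\]]*\[([0-9]+)(ms|s|m|h|d|w|y)\]")


def irate_range_seconds(query):
    """Range of the first irate() call in seconds, or None if absent."""
    match = IRATE_RANGE.search(query)
    if match is None:
        return None
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def irate_range_too_long(query, limit_seconds=300):
    """True when irate() is used with a range above the limit."""
    seconds = irate_range_seconds(query)
    return seconds is not None and seconds > limit_seconds


print(irate_range_too_long("irate(metric[1h])"))
print(irate_range_too_long("irate(metric[5m])"))
```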

Mixed Metric Types

Bad:
avg(http_request_duration_seconds{quantile="0.95"}) / rate(node_memory_usage_bytes[1h]) + sum(http_requests_total)
  • Combines summary quantiles, gauge metrics, and counters in arithmetic
  • Produces meaningless results
Good: Keep each metric type in separate, purpose-specific queries:
  • Latency: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
  • Memory: `node_memory_usage_bytes{instance="prod-1"}`
  • Request rate: `rate(http_requests_total{job="api"}[5m])`

Output Format

Provide validation results in this structure:

PromQL Validation Results

Syntax Check

  • Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
  • Issues: [list any syntax errors with line/position]

Semantic Check

  • Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
  • Issues: [list any logical problems]

Performance Analysis

  • Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ INEFFICIENT
  • Issues: [list optimization opportunities]
  • Suggestions: [specific improvements]

Query Explanation

Your query:
<query>
This query does:
  • [Plain English explanation]
  • Metrics: [list metrics and their types]
  • Functions: [explain each function]
  • Output: [describe result structure]

Intent Verification

Let me verify this matches your needs:
  1. What are you trying to measure? [your goal here]
  2. Is this a counter/gauge/histogram/summary? [metric type]
  3. What time range interests you? [time window]
  4. Do you need aggregation? If so, by which labels? [aggregation needs]
  5. Is this for alerting, dashboarding, or analysis? [use case]

Recommendations

[Based on the analysis, suggest improvements or alternatives]

Interactive Dialogue

After validation, engage in dialogue:
Claude: "I've validated your query. It's syntactically correct, but I notice it queries `http_requests_total` without any label filters. This could match thousands of time series. What specific service or endpoint are you trying to monitor?"
User: [provides intent]
Claude: "Great! Based on that, here's an optimized version: `rate(http_requests_total{job="api-service", path="/users"}[5m])`. This calculates the per-second rate of requests to the /users endpoint over the last 5 minutes. Does this match what you need?"
User: [confirms or asks for changes]
Claude: [provides refined query or alternatives]

Examples

See the `examples/` directory for:
  • `good_queries.promql`: Well-written queries following best practices
  • `bad_queries.promql`: Common mistakes and anti-patterns (with corrections)
  • `optimization_examples.promql`: Before/after optimization examples

Documentation

See the `docs/` directory for:
  • `best_practices.md`: Comprehensive PromQL best practices guide
  • `anti_patterns.md`: Detailed anti-pattern reference with explanations

Important Notes

  1. Be Interactive: Always ask clarifying questions to understand user intent
  2. Be Educational: Explain WHY something is wrong, not just THAT it's wrong
  3. Be Helpful: Offer to rewrite queries, don't just criticize
  4. Be Context-Aware: Consider the user's use case (alerting vs dashboarding)
  5. Be Thorough: Check all four levels (syntax, semantics, performance, intent)
  6. Be Practical: Suggest realistic optimizations, not theoretical perfection

Integration

This skill can be used:
  • Standalone for query review
  • During monitoring setup to validate alert rules
  • When troubleshooting slow Prometheus queries
  • As part of code review for recording rules
  • For teaching PromQL to team members

Validation Tools

The skill uses two main Python scripts:
  1. validate_syntax.py: Pure syntax checking using regex patterns
  2. check_best_practices.py: Semantic and performance analysis
Both scripts output JSON for programmatic parsing and human-readable messages for display.

Success Criteria

A successful validation session should:
  1. Identify all syntax errors
  2. Detect semantic problems
  3. Suggest at least one optimization (if applicable)
  4. Clearly explain what the query does
  5. Verify the query matches user intent
  6. Provide actionable next steps

Known Limitations

The validation scripts have some limitations to be aware of:

Metric Type Detection

  • Heuristic-based: Metric types (counter, gauge, histogram, summary) are inferred from naming conventions (e.g., `_total`, `_bytes`)
  • Custom metrics: Metrics with non-standard names may not be correctly classified
  • Recommendation: When the script can't determine metric type, ask the user to clarify

High Cardinality Detection

  • Conservative approach: The script flags metrics without label selectors, but some use cases legitimately query all series
  • Recording rules: Queries using recording rule metrics (e.g., `job:http_requests:rate5m`) are valid without label filters
  • Recommendation: Use judgment - if the user knows their cardinality is manageable, the warning can be safely ignored

Semantic Validation

  • No runtime context: The scripts cannot verify if metrics actually exist or if label values are valid
  • Schema-agnostic: No knowledge of specific Prometheus deployments or metric schemas
  • Recommendation: For production validation, test queries against actual Prometheus instances

Script Detection Coverage

The scripts detect common anti-patterns but cannot catch:
  • Business logic errors (e.g., calculating the wrong KPI)
  • Context-specific optimizations (depends on scrape interval, retention, etc.)
  • Custom function behavior from extensions

Remember

The goal is not just to validate queries, but to help users write better PromQL and understand their monitoring data. Always be educational, interactive, and helpful!