promql-validator
How This Skill Works
This skill performs multi-level validation and provides interactive query planning:
- Syntax Validation: Checks for syntactically correct PromQL expressions
- Semantic Validation: Ensures queries make logical sense (e.g., rate() on counters, not gauges)
- Anti-Pattern Detection: Identifies common mistakes and inefficient patterns
- Optimization Suggestions: Recommends performance improvements
- Query Explanation: Translates PromQL to plain English
- Interactive Planning: Helps users clarify intent and refine queries
Workflow
When a user provides a PromQL query, follow this workflow:
Working Directory Requirement
Run validation commands from the repository root so relative paths resolve correctly:
```bash
cd "$(git rev-parse --show-toplevel)"
```
If running from another location, use absolute paths to `scripts/` files.
Step 1: Validate Syntax
Run the syntax validation script to check for basic correctness:
```bash
python3 devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py "<query>"
```
Output parsing notes:
- Exit `0`: syntax valid
- Exit non-zero: syntax failure; include stderr and pinpoint token/position
- Prefer quoting the smallest failing fragment, then provide corrected query
The script will check for:
- Valid metric names and label matchers
- Correct operator usage
- Proper function syntax
- Valid time durations and ranges
- Balanced brackets and quotes
- Correct use of modifiers (offset, @)
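The exit-code convention above can be wrapped in a small helper. This is a minimal sketch and not part of the skill: the function name `run_validator` and the keys of the returned dictionary are illustrative assumptions. The only behaviour taken from the text is that exit `0` means valid syntax and any other exit code is a failure whose stderr should be surfaced.

```python
import subprocess
import sys


def run_validator(argv):
    """Run a validator command line (a list of strings) and summarize the result.

    Hypothetical helper: the result keys are illustrative, not the script's API.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "valid": proc.returncode == 0,      # exit 0 means the syntax is valid
        "exit_code": proc.returncode,
        "stderr": proc.stderr.strip(),      # quote this when reporting a failure
    }


# Example call, run from the repository root:
# run_validator([sys.executable,
#                "devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py",
#                "rate(http_requests_total[5m])"])
```

Passing the command as a list avoids shell quoting problems when the query itself contains quotes or braces.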
Step 2: Check Best Practices
Run the best practices checker to detect anti-patterns and optimization opportunities:
```bash
python3 devops-skills-plugin/skills/promql-validator/scripts/check_best_practices.py "<query>"
```
Output parsing notes:
- Treat script sections as independent findings (cardinality, metric-type misuse, regex misuse, etc.)
- If script output is empty but query is complex, add a manual sanity pass and mark it as `manual-review`
- Preserve script wording for finding labels, then add remediation in plain English
The script will identify:
- High cardinality queries without label filters
- Inefficient regex matchers that could be exact matches
- Missing rate()/increase() on counter metrics
- rate() used on gauge metrics
- Averaging pre-calculated quantiles
- Subqueries with excessive time ranges
- irate() over long time ranges
- Opportunities to add more specific label filters
- Complex queries that should use recording rules
Step 3: Explain the Query
Parse and explain what the query does in plain English:
- What metrics are being queried
- What type of metrics they are (counter, gauge, histogram, summary)
- What functions are applied and why
- What the query calculates
- What labels will be in the output
- What the expected result structure looks like
Required Output Details (always include these explicitly):
**Output Labels**: [list labels that will be in the result, or "None (fully aggregated to scalar)"]
**Expected Result Structure**: [instant vector / range vector / scalar] with [N series / single value]
Example:
**Output Labels**: job, instance
**Expected Result Structure**: Instant vector with one series per job/instance combination
Line-Number Citation Method (Required)
When citing examples/docs in recommendations, include file path + 1-based line numbers:
```text
examples/good_queries.promql:42
docs/best_practices.md:88
```
Rules:
- Cite the most relevant single line (or start line if multi-line snippet)
- Keep citations tight; do not cite full files
- If line numbers are unavailable, state `line number unavailable` and provide file path
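The citation rules above can be captured in a few lines. This is only a sketch: `format_citation` is a hypothetical helper, and the exact layout of the fallback text (path followed by the phrase in parentheses) is an assumption, since the rules only require stating that the line number is unavailable and giving the file path.

```python
from typing import Optional


def format_citation(path: str, line: Optional[int] = None) -> str:
    """Return `path:line` (1-based), or the path plus the documented fallback phrase."""
    if line is None:
        # Assumed layout for the fallback; the rules only fix the wording of the phrase.
        return f"{path} (line number unavailable)"
    if line < 1:
        raise ValueError("line numbers are 1-based")
    return f"{path}:{line}"
```

For a multi-line snippet, pass the start line, as the rules require.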
Step 4: Interactive Query Planning (Phase 1 - STOP AND WAIT)
Ask the user clarifying questions to verify the query matches their intent:
- Understand the Goal: "What are you trying to monitor or measure?"
  - Request rate, error rate, latency, resource usage, etc.
- Verify Metric Type: "Is this a counter (always increasing), gauge (can go up/down), histogram, or summary?"
  - This affects which functions to use
- Clarify Time Range: "What time window do you need?"
  - Instant value, rate over time, historical analysis
- Confirm Aggregation: "Do you need to aggregate data across labels? If so, which labels?"
  - by (job), by (instance), without (pod), etc.
- Check Output Intent: "Are you using this for alerting, dashboarding, or ad-hoc analysis?"
  - Affects optimization priorities
IMPORTANT: Two-Phase Dialogue
After presenting Steps 1-4 results (Syntax, Best Practices, Query Explanation, and Intent Questions):
⏸️ STOP HERE AND WAIT FOR USER RESPONSE
Do NOT proceed to Steps 5-7 until the user answers the clarifying questions. This ensures the subsequent recommendations are tailored to the user's actual intent.
Step 5: Compare Intent vs Implementation (Phase 2 - After User Response)
Only proceed to this step after the user has answered the clarifying questions from Step 4.
After understanding the user's intent:
- Explain what the current query actually does
- Highlight any mismatches between intent and implementation
- Suggest corrections if the query doesn't match the goal
- Offer alternative approaches if applicable
When relevant, mention known limitations:
- Note when metric type detection is heuristic-based (e.g., "The script inferred this is a gauge based on the `_bytes` suffix. Please confirm if this is correct.")
- Acknowledge when high-cardinality warnings might be false positives (e.g., "This warning may not apply if you're using a recording rule or know your cardinality is low.")
Step 6: Offer Optimizations
Based on validation results:
- Suggest more efficient query patterns
- Recommend recording rules for complex/repeated queries
- Propose better label matchers to reduce cardinality
- Advise on appropriate time ranges
Reference Examples: When suggesting corrections, cite relevant examples using this format:
As shown in `examples/bad_queries.promql` (lines 91-97):
❌ BAD: `avg(http_request_duration_seconds{quantile="0.95"})`
✅ GOOD: Use histogram_quantile() with histogram buckets
Citation sources:
- `examples/good_queries.promql` - for well-formed patterns
- `examples/optimization_examples.promql` - for before/after comparisons
- `examples/bad_queries.promql` - for showing what to avoid
- `docs/best_practices.md` - for detailed explanations
- `docs/anti_patterns.md` - for anti-pattern deep dives
Citation Format: `file_path (lines X-Y)` with the relevant code snippet quoted
Step 7: Let User Plan/Refine
Give the user control:
- Ask if they want to modify the query
- Offer to help rewrite it for better performance
- Provide multiple alternatives if applicable
- Explain trade-offs between different approaches
Key Validation Rules
Syntax Rules
- Metric Names: Must match `[a-zA-Z_:][a-zA-Z0-9_:]*` or use UTF-8 quoting syntax (Prometheus 3.0+):
  - Quoted form: `{"my.metric.with.dots"}`
  - Using `__name__` label: `{__name__="my.metric.with.dots"}`
- Label Matchers: `=` (equal), `!=` (not equal), `=~` (regex match), `!~` (regex not match)
- Time Durations: `[0-9]+(ms|s|m|h|d|w|y)` - e.g., `5m`, `1h`, `7d`
- Range Vectors: `metric_name[duration]` - e.g., `http_requests_total[5m]`
- Offset Modifier: `offset <duration>` - e.g., `metric_name offset 5m`
- @ Modifier: `@ <timestamp>` or `@ start()` / `@ end()`
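The two patterns quoted in the syntax rules can be exercised directly. The following is a minimal sketch using Python's `re` module; the function names are illustrative and this is not claimed to be how `validate_syntax.py` is implemented. It covers only the classic metric-name form and single-unit durations exactly as listed above.

```python
import re

# Classic metric-name pattern quoted in the syntax rules.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# Single-unit duration pattern quoted in the syntax rules.
DURATION = re.compile(r"[0-9]+(ms|s|m|h|d|w|y)")


def is_valid_metric_name(name: str) -> bool:
    """True when the whole string matches the classic (unquoted) metric-name pattern."""
    return METRIC_NAME.fullmatch(name) is not None


def is_valid_duration(text: str) -> bool:
    """True when the whole string is a number followed by one of the listed units."""
    return DURATION.fullmatch(text) is not None
```

A name such as `my.metric.with.dots` fails `is_valid_metric_name`, which is the case where the quoted UTF-8 form applies.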
Semantic Rules
- rate() and irate(): Should only be used with counter metrics (metrics ending in `_total`, `_count`, `_sum`, or `_bucket`)
- Counters: Should typically use `rate()` or `increase()`, not raw values
- Gauges: Should not use `rate()` or `increase()`
- Histograms: Use `histogram_quantile()` with `le` label and `rate()` on `_bucket` metrics
- Summaries: Don't average quantiles; calculate from `_sum` and `_count`
- Aggregations: Use `by()` or `without()` to control output labels
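The suffix-based rule for `rate()`, `irate()`, and `increase()` can be sketched as a simple heuristic. The helper names below are hypothetical, and the check relies purely on the naming suffixes listed above, so it will misclassify metrics with non-standard names.

```python
# Suffixes the semantic rules treat as counter-like.
COUNTER_SUFFIXES = ("_total", "_count", "_sum", "_bucket")
# Functions that only make sense on counters.
COUNTER_FUNCTIONS = ("rate", "irate", "increase")


def looks_like_counter(metric: str) -> bool:
    """Heuristic only: infer 'counter' from the metric name suffix."""
    return metric.endswith(COUNTER_SUFFIXES)


def check_rate_usage(function: str, metric: str):
    """Return a warning string, or None when the pairing looks fine."""
    if function in COUNTER_FUNCTIONS and not looks_like_counter(metric):
        return f"{function}() applied to '{metric}', which does not look like a counter"
    return None
```

Because the inference is name-based, a warning from such a check is a prompt to confirm the metric type with the user rather than a definitive error.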
Performance Rules
- Cardinality: Always use specific label matchers to reduce series count
- Regex: Use `=` instead of `=~` when possible for exact matches
- Rate Range: Should be at least 4x the scrape interval (typically `[2m]` minimum)
- irate(): Best for short ranges (<5m); use `rate()` for longer periods
- Subqueries: Avoid excessive time ranges that process millions of samples
- Recording Rules: Use for complex queries accessed frequently
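The rate-range rule is simple arithmetic, shown here as a small sketch. The function names are illustrative; the factor of 4 comes from the rule above, and the scrape interval has to be supplied by the caller because the scripts have no knowledge of the deployment.

```python
def min_rate_range_seconds(scrape_interval_seconds: float, factor: int = 4) -> float:
    """Smallest recommended rate() range: factor times the scrape interval."""
    return scrape_interval_seconds * factor


def rate_range_ok(range_seconds: float, scrape_interval_seconds: float) -> bool:
    """True when the range meets the 4x-scrape-interval guideline."""
    return range_seconds >= min_rate_range_seconds(scrape_interval_seconds)
```

With a 30-second scrape interval the minimum works out to 120 seconds, which matches the `[2m]` figure; with a 15-second interval it would be 60 seconds.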
Anti-Patterns to Detect
High Cardinality Issues
❌ Bad: `http_requests_total{}`
- Matches all time series without filtering
✅ Good: `http_requests_total{job="api", instance="prod-1"}`
- Specific label filters reduce cardinality
Regex Overuse
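❌ Bad: `http_requests_total{status=~"2.."}`
- Regex is slower and matches every 2xx status, which is broader than needed if only `200` matters
✅ Good: `http_requests_total{status="200"}`
- Exact match is faster when a single value is all you need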
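One way to spot a regex matcher that could be an exact match is to check whether its value contains any regex metacharacters at all. The sketch below is an assumption about how such a check might look, not the logic of `check_best_practices.py`; `could_be_exact_match` is a hypothetical name.

```python
# Characters that give a label value regex meaning beyond a literal string.
REGEX_META = set(".*+?()[]{}|^$\\")


def could_be_exact_match(value: str) -> bool:
    """True when a `=~` value is a plain literal, so `=` would match the same series."""
    return not any(ch in REGEX_META for ch in value)
```

Under this check `status=~"200"` is flagged as replaceable by `status="200"`, while `status=~"2.."` is not, because swapping it for an exact match would change which series are selected.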
Missing rate() on Counters
❌ Bad: `http_requests_total`
- Counter raw values are not useful (always increasing)
✅ Good: `rate(http_requests_total[5m])`
- Rate shows requests per second
rate() on Gauges
❌ Bad: `rate(memory_usage_bytes[5m])`
- Gauges measure current state, not cumulative values
✅ Good: `memory_usage_bytes`
- Use gauge value directly or with `avg_over_time()`
Averaging Quantiles
❌ Bad: `avg(http_request_duration_seconds{quantile="0.95"})`
- Mathematically invalid to average pre-calculated quantiles
✅ Good: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
- Calculate quantile from histogram buckets
Excessive Subquery Ranges
❌ Bad: `rate(metric[5m])[90d:1m]`
- Processes millions of samples, very slow
✅ Good: Use recording rules or limit range to necessary duration
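A rough count shows why the `[90d:1m]` subquery is expensive. The sketch below estimates the number of evaluation steps as range divided by resolution; the exact count can differ by one because of step alignment, and the figure of 100 series is purely an illustrative assumption.

```python
DAY_SECONDS = 24 * 60 * 60


def subquery_steps(range_seconds: int, resolution_seconds: int) -> int:
    """Approximate number of inner evaluations per series for [range:resolution]."""
    return range_seconds // resolution_seconds


# rate(metric[5m])[90d:1m]: 90 days evaluated at 1-minute resolution.
steps_per_series = subquery_steps(90 * DAY_SECONDS, 60)   # 129600
# Assumed 100 matching series, for illustration only.
total_evaluations = steps_per_series * 100                 # 12960000
```

Each of those steps also reads a 5-minute window of raw samples for the inner `rate()`, so the number of samples touched is larger still.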
irate() Over Long Ranges
❌ Bad: `irate(metric[1h])`
- irate() only looks at last two samples, range is wasted
✅ Good: `rate(metric[1h])` or `irate(metric[5m])`
- Use rate() for longer ranges or reduce irate() range
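A check for this anti-pattern could extract the range from each `irate()` call and compare it with a threshold. The sketch below is regex-based and deliberately simple: the 300-second default reflects the 5-minute guidance in the performance rules, the names are hypothetical, and nested brackets or compound durations are not handled.

```python
import re

# Seconds per duration unit (1y is treated as 365 days).
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}
# Captures the amount and unit of the range selector inside irate(...).
IRATE_RANGE = re.compile(r"\birate\s*\([^\[\]]*\[(\d+)(ms|s|m|h|d|w|y)\]")


def irate_range_too_long(query: str, limit_seconds: float = 300) -> bool:
    """True when any irate() call in the query uses a range above the limit."""
    for amount, unit in IRATE_RANGE.findall(query):
        if int(amount) * UNIT_SECONDS[unit] > limit_seconds:
            return True
    return False
```

Since `irate()` uses only the last two samples in the window, a flagged query is usually better served by `rate()` over the same range.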
Mixed Metric Types
❌ Bad: `avg(http_request_duration_seconds{quantile="0.95"}) / rate(node_memory_usage_bytes[1h]) + sum(http_requests_total)`
- Combines summary quantiles, gauge metrics, and counters in arithmetic
- Produces meaningless results
✅ Good: Keep each metric type in separate, purpose-specific queries:
- Latency: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
- Memory: `node_memory_usage_bytes{instance="prod-1"}`
- Request rate: `rate(http_requests_total{job="api"}[5m])`
Output Format
Provide validation results in this structure:
PromQL Validation Results
Syntax Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
- Issues: [list any syntax errors with line/position]
Semantic Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
- Issues: [list any logical problems]
Performance Analysis
- Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ INEFFICIENT
- Issues: [list optimization opportunities]
- Suggestions: [specific improvements]
Query Explanation
Your query: `<query>`
This query does:
- [Plain English explanation]
- Metrics: [list metrics and their types]
- Functions: [explain each function]
- Output: [describe result structure]
Intent Verification
Let me verify this matches your needs:
- What are you trying to measure? [your goal here]
- Is this a counter/gauge/histogram/summary? [metric type]
- What time range interests you? [time window]
- Do you need aggregation? If so, by which labels? [aggregation needs]
- Is this for alerting, dashboarding, or analysis? [use case]
Recommendations
[Based on the analysis, suggest improvements or alternatives]
Interactive Dialogue
After validation, engage in dialogue:
Claude: "I've validated your query. It's syntactically correct, but I notice it queries `http_requests_total` without any label filters. This could match thousands of time series. What specific service or endpoint are you trying to monitor?"
User: [provides intent]
Claude: "Great! Based on that, here's an optimized version: `rate(http_requests_total{job="api-service", path="/users"}[5m])`. This calculates the per-second rate of requests to the /users endpoint over the last 5 minutes. Does this match what you need?"
User: [confirms or asks for changes]
Claude: [provides refined query or alternatives]
Examples
See the `examples/` directory for:
- `good_queries.promql`: Well-written queries following best practices
- `bad_queries.promql`: Common mistakes and anti-patterns (with corrections)
- `optimization_examples.promql`: Before/after optimization examples
Documentation
See the `docs/` directory for:
- `best_practices.md`: Comprehensive PromQL best practices guide
- `anti_patterns.md`: Detailed anti-pattern reference with explanations
Important Notes
- Be Interactive: Always ask clarifying questions to understand user intent
- Be Educational: Explain WHY something is wrong, not just THAT it's wrong
- Be Helpful: Offer to rewrite queries, don't just criticize
- Be Context-Aware: Consider the user's use case (alerting vs dashboarding)
- Be Thorough: Check all four levels (syntax, semantics, performance, intent)
- Be Practical: Suggest realistic optimizations, not theoretical perfection
Integration
This skill can be used:
- Standalone for query review
- During monitoring setup to validate alert rules
- When troubleshooting slow Prometheus queries
- As part of code review for recording rules
- For teaching PromQL to team members
Validation Tools
The skill uses two main Python scripts:
- validate_syntax.py: Pure syntax checking using regex patterns
- check_best_practices.py: Semantic and performance analysis
Both scripts output JSON for programmatic parsing and human-readable messages for display.
Success Criteria
A successful validation session should:
- Identify all syntax errors
- Detect semantic problems
- Suggest at least one optimization (if applicable)
- Clearly explain what the query does
- Verify the query matches user intent
- Provide actionable next steps
Known Limitations
The validation scripts have some limitations to be aware of:
Metric Type Detection
- Heuristic-based: Metric types (counter, gauge, histogram, summary) are inferred from naming conventions (e.g., `_total`, `_bytes`)
- Custom metrics: Metrics with non-standard names may not be correctly classified
- Recommendation: When the script can't determine metric type, ask the user to clarify
High Cardinality Detection
- Conservative approach: The script flags metrics without label selectors, but some use cases legitimately query all series
- Recording rules: Queries using recording rule metrics (e.g., `job:http_requests:rate5m`) are valid without label filters
- Recommendation: Use judgment - if the user knows their cardinality is manageable, the warning can be safely ignored
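The recording-rule exception can be approximated by looking for colons in the metric name, since names such as `job:http_requests:rate5m` follow the conventional `level:metric:operations` form. This is a sketch under that naming assumption; the function names are hypothetical and the scripts may decide differently.

```python
def looks_like_recording_rule(metric: str) -> bool:
    """Heuristic: by convention, colons in a metric name mark recording rule output."""
    return ":" in metric


def should_warn_unfiltered(metric: str, has_label_matchers: bool) -> bool:
    """Warn about missing label filters only for metrics that are not recording rules."""
    return not has_label_matchers and not looks_like_recording_rule(metric)
```

A deployment that does not follow the colon convention would defeat this heuristic, which is one more reason to treat the warning as advisory.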
Semantic Validation
- No runtime context: The scripts cannot verify if metrics actually exist or if label values are valid
- Schema-agnostic: No knowledge of specific Prometheus deployments or metric schemas
- Recommendation: For production validation, test queries against actual Prometheus instances
Script Detection Coverage
The scripts detect common anti-patterns but cannot catch:
- Business logic errors (e.g., calculating the wrong KPI)
- Context-specific optimizations (depends on scrape interval, retention, etc.)
- Custom function behavior from extensions
Remember
The goal is not just to validate queries, but to help users write better PromQL and understand their monitoring data. Always be educational, interactive, and helpful!