promql-generator
PromQL Query Generator
Overview
This skill provides a comprehensive, interactive workflow for generating production-ready PromQL queries with best practices built-in. Generate queries for monitoring dashboards, alerting rules, and ad-hoc analysis with an emphasis on user collaboration and planning before code generation.
When to Use This Skill
Invoke this skill when:
- Creating new PromQL queries from scratch
- Building monitoring dashboards (Grafana, Prometheus UI, etc.)
- Implementing alerting rules for Prometheus Alertmanager
- Analyzing metrics for troubleshooting or capacity planning
- Converting monitoring requirements into PromQL expressions
- Learning PromQL or teaching others
- The user asks to "create", "generate", "build", or "write" PromQL queries
- Working with Prometheus metrics (counters, gauges, histograms, summaries)
- Implementing RED (Rate, Errors, Duration) or USE (Utilization, Saturation, Errors) metrics
Interactive Query Planning Workflow
CRITICAL: This skill emphasizes interactive planning before query generation. Always engage the user in a collaborative planning process to ensure the generated query matches their exact intentions.
Follow this workflow when generating PromQL queries:
Stage 1: Understand the Monitoring Goal
Start by understanding what the user wants to monitor or measure. Ask clarifying questions to gather requirements:
- **Primary Goal**: What are you trying to monitor or measure?
  - Request rate (requests per second)
  - Error rate (percentage of failed requests)
  - Latency/duration (response times, percentiles)
  - Resource usage (CPU, memory, disk, network)
  - Availability/uptime
  - Queue depth, saturation, throughput
  - Custom business metrics
- **Use Case**: What will this query be used for?
  - Dashboard visualization (Grafana, Prometheus UI)
  - Alerting rule (fires when a threshold is exceeded)
  - Ad-hoc troubleshooting/analysis
  - Recording rule (pre-computed aggregation)
  - Capacity planning or SLO tracking
- **Context**: Any additional context?
  - Service/application name
  - Team or project
  - Priority level
  - Existing metrics or naming conventions

Use the AskUserQuestion tool to gather this information if not provided.
When to Ask vs. Infer: If the user's initial request already clearly specifies the goal, use case, and context (e.g., "Create an alert for P95 latency > 500ms for payment-service"), you may acknowledge these details in your response instead of re-asking. Only ask clarifying questions for information that is missing or ambiguous.
Stage 2: Identify Available Metrics
Determine which metrics are available and relevant:
- **Metric Discovery**: What metrics are available?
  - Ask the user for metric names
  - If uncertain, suggest common naming patterns
  - Check for metric type indicators in the name:
    - `_total` suffix → Counter
    - `_bucket`, `_sum`, `_count` suffixes → Histogram
    - No suffix → Likely Gauge
    - `_created` suffix → Counter creation timestamp
- **Metric Type Identification**: Confirm the metric type(s)
  - Counter: Cumulative metric that only increases (or resets to zero)
    - Examples: `http_requests_total`, `errors_total`, `bytes_sent_total`
    - Use with: `rate()`, `irate()`, `increase()`
  - Gauge: Point-in-time value that can go up or down
    - Examples: `memory_usage_bytes`, `cpu_temperature_celsius`, `queue_length`
    - Use with: `avg_over_time()`, `min_over_time()`, `max_over_time()`, or directly
  - Histogram: Buckets of observations with cumulative counts
    - Examples: `http_request_duration_seconds_bucket`, `response_size_bytes_bucket`
    - Use with: `histogram_quantile()`, `rate()`
  - Summary: Pre-calculated quantiles with count and sum
    - Examples: `rpc_duration_seconds{quantile="0.95"}`
    - Use `_sum` and `_count` for averages; don't average quantiles
- **Label Discovery**: What labels are available on these metrics?
  - Common labels: `job`, `instance`, `environment`, `service`, `endpoint`, `status_code`, `method`
  - Ask which labels are important for filtering or grouping
Use the AskUserQuestion tool to confirm metric names, types, and available labels.
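When metric names are unknown, a discovery query can enumerate candidates before confirming with the user. A minimal sketch, assuming a hypothetical `api-server` job:

```promql
# Count series per metric name exposed by a job (the job value is a placeholder)
count by (__name__) ({job="api-server"})

# Then inspect the labels on a candidate metric
http_requests_total{job="api-server"}
```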
Stage 3: Determine Query Parameters
Gather specific requirements for the query.
Pre-confirmation for User-Provided Parameters
IMPORTANT: When the user has already specified parameters in their initial request (e.g., "5-minute window", "500ms threshold", "> 5% error rate"), you MUST:
- Acknowledge the provided values explicitly in your response
- Present them as pre-filled defaults in AskUserQuestion, with the first option being "Use specified values"
- Allow quick confirmation rather than re-asking for information already given

Example: If the user says "alert when P95 latency exceeds 500ms", use:

AskUserQuestion:
- Question: "Confirm the alert threshold?"
- Options:
  1. "500ms (as specified)" - Use the threshold from your request
  2. "Different threshold" - Let me specify a different value

This respects the user's input and speeds up the workflow while still allowing modifications.
- **Time Range**: What time window should the query cover?
  - Instant value (current)
  - Rate over time (`[5m]`, `[1h]`, `[1d]`)
  - For rate calculations: typically `[1m]` to `[5m]` for real-time, `[1h]` to `[1d]` for trends
  - Rule of thumb: the rate range should be at least 4x the scrape interval
- **Label Filtering**: Which labels should filter the data?
  - Exact matches: `job="api-server"`, `status_code="200"`
  - Negative matches: `status_code!="200"`
  - Regex matches: `instance=~"prod-.*"`
  - Multiple conditions: `{job="api", environment="production"}`
- **Aggregation**: Should the data be aggregated?
  - No aggregation: return all time series as-is
  - Aggregate by labels: `sum by (job, endpoint)`, `avg by (instance)`
  - Aggregate without labels: `sum without (instance, pod)`, `avg without (job)`
  - Common aggregations: `sum`, `avg`, `max`, `min`, `count`, `topk`, `bottomk`
- **Thresholds or Conditions**: Are there specific conditions?
  - For alerting: threshold values (e.g., error rate > 5%)
  - For filtering: only show series above/below a value
  - For comparison: compare against historical data (offset)

Use the AskUserQuestion tool to gather or confirm these parameters. When the user has already provided values (e.g., "5-minute window", "> 5%"), present them as the default option for confirmation.
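To illustrate the 4x rule of thumb, assuming a hypothetical 15-second scrape interval:

```promql
# 4 × 15s = 60s, so [1m] is the smallest window that reliably spans
# enough samples for rate(); shorter windows risk empty results
rate(http_requests_total{job="api-server"}[1m])
```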
Stage 4: Present the Query Plan
BEFORE GENERATING ANY CODE, present a plain-English query plan and ask for user confirmation:
PromQL Query Plan

Based on your requirements, here's what the query will do:

Goal: [Describe the monitoring goal in plain English]

Query Structure:
- Start with metric: `[metric_name]`
- Filter by labels: `{label1="value1", label2="value2"}`
- Apply function: `[function_name]([metric][time_range])`
- Aggregate: `[aggregation] by ([label_list])`
- Additional operations: [any calculations, ratios, or transformations]

Expected Output:
- Data type: [instant vector/scalar]
- Labels in result: [list of labels]
- Value represents: [what the number means]
- Typical range: [expected value range]

Example Interpretation:
If the query returns `0.05`, it means: [plain English explanation]

Does this match your intentions?
- If yes, I'll generate the query and validate it
- If no, let me know what needs to change

Use the **AskUserQuestion** tool to confirm the plan with options:
- "Yes, generate this query"
- "Modify [specific aspect]"
- "Show me alternative approaches"
Stage 5: Generate the PromQL Query
Once the user confirms the plan, generate the actual PromQL query following best practices.
IMPORTANT: Consult Reference Files Before Generating
Before writing any query code, you MUST:
1. Read the relevant reference file(s) using the Read tool:
   - For histogram queries → Read `references/metric_types.md` (Histogram section)
   - For error/latency patterns → Read `references/promql_patterns.md` (RED method section)
   - For resource monitoring → Read `references/promql_patterns.md` (USE method section)
   - For optimization questions → Read `references/best_practices.md`
   - For specific functions → Read `references/promql_functions.md`
2. Cite the applicable pattern or best practice in your response:

   As documented in references/promql_patterns.md (Pattern 3: Latency Percentile):
   ```promql
   # 95th percentile latency
   histogram_quantile(0.95, sum by (le) (rate(...)))
   ```
3. Reference example files when generating similar queries:

   Based on examples/red_method.promql (lines 64-82):
   ```promql
   # P95 latency with proper histogram_quantile usage
   ```

This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.
Best Practices for Query Generation
1. **Always Use Label Filters**
   ```promql
   # Good: Specific filtering reduces cardinality
   rate(http_requests_total{job="api-server", environment="prod"}[5m])

   # Bad: Matches all time series, high cardinality
   rate(http_requests_total[5m])
   ```
2. **Use Appropriate Functions for Metric Types**
   ```promql
   # Counter: Use rate() or increase()
   rate(http_requests_total[5m])

   # Gauge: Use directly or with *_over_time()
   memory_usage_bytes
   avg_over_time(memory_usage_bytes[5m])

   # Histogram: Use histogram_quantile()
   histogram_quantile(0.95,
     sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
   )
   ```
3. **Apply Aggregations with by() or without()**
   ```promql
   # Aggregate by specific labels (keeps only these labels)
   sum by (job, endpoint) (rate(http_requests_total[5m]))

   # Aggregate without specific labels (removes these labels)
   sum without (instance, pod) (rate(http_requests_total[5m]))
   ```
4. **Use Exact Matches Over Regex When Possible**
   ```promql
   # Good: Faster exact match
   http_requests_total{status_code="200"}

   # Bad: Slower regex match when not needed
   http_requests_total{status_code=~"200"}
   ```
5. **Calculate Ratios Properly**
   ```promql
   # Error rate: errors / total requests
   sum(rate(http_requests_total{status_code=~"5.."}[5m]))
   /
   sum(rate(http_requests_total[5m]))
   ```
6. **Use Recording Rules for Complex Queries**
   - If a query is used frequently or is computationally expensive
   - Pre-aggregate data to reduce query load
   - Follow the naming convention: `level:metric:operations`
7. **Format for Readability**
   ```promql
   # Good: Multi-line for complex queries
   histogram_quantile(0.95,
     sum by (le, job) (
       rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
     )
   )
   ```
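The `level:metric:operations` naming convention can be sketched as a recording rule; the group and rule names here are illustrative, not prescribed:

```yaml
groups:
  - name: example_rules
    rules:
      # level = job, metric = http_requests, operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```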
Common Query Patterns
Pattern 1: Request Rate
```promql
# Requests per second
rate(http_requests_total{job="api-server"}[5m])

# Total requests per second across all instances
sum(rate(http_requests_total{job="api-server"}[5m]))
```

Pattern 2: Error Rate
```promql
# Error ratio (0 to 1)
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))

# Error percentage (0 to 100)
(
  sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="api-server"}[5m]))
) * 100
```

Pattern 3: Latency Percentile (Histogram)
```promql
# 95th percentile latency
histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
  )
)
```

Pattern 4: Resource Usage
```promql
# Current memory usage
process_resident_memory_bytes{job="api-server"}

# CPU usage (cores) averaged over 5 minutes
# Note: process_cpu_seconds_total is a counter, so use rate(), not avg_over_time()
rate(process_cpu_seconds_total{job="api-server"}[5m])
```

Pattern 5: Availability
```promql
# Percentage of up instances
(
  count(up{job="api-server"} == 1)
  /
  count(up{job="api-server"})
) * 100
```

Pattern 6: Saturation/Queue Depth
```promql
# Average queue length
avg_over_time(queue_depth{job="worker"}[5m])

# Maximum queue depth in the last hour
max_over_time(queue_depth{job="worker"}[1h])
```
Stage 6: Validate the Generated Query

ALWAYS validate the generated query using the devops-skills:promql-validator skill.

After generating the query, automatically invoke:

Skill(devops-skills:promql-validator)

The devops-skills:promql-validator skill will:
1. Check syntax correctness
2. Validate semantic logic (correct functions for metric types)
3. Identify anti-patterns and inefficiencies
4. Suggest optimizations
5. Explain what the query does
6. Verify it matches user intent

Validation checklist:
- Syntax is correct (balanced brackets, valid operators)
- Metric type matches function usage
- Label filters are specific enough
- Aggregation is appropriate
- Time ranges are reasonable
- No known anti-patterns
- Query is optimized for performance

If validation fails, fix issues and re-validate until all checks pass.

IMPORTANT: Display Validation Results to User

After running validation, you MUST display the structured results to the user in this format:
undefined始终验证生成的查询,使用devops-skills:promql-validator技能:
生成查询后,自动调用:
Skill(devops-skills:promql-validator)
devops-skills:promql-validator技能将:
1. 检查语法正确性
2. 验证语义逻辑(指标类型与函数匹配)
3. 识别反模式和低效问题
4. 建议优化方案
5. 解释查询功能
6. 验证是否符合用户意图验证清单:
- 语法正确(括号匹配、运算符有效)
- 指标类型与函数使用匹配
- 标签过滤足够精确
- 聚合方式合适
- 时间范围合理
- 无已知反模式
- 查询已针对性能优化
若验证失败,修复问题并重新验证,直至所有检查通过。
重要提示:向用户展示验证结果
运行验证后,必须以下列格式向用户展示结构化结果:
PromQL Validation Results

Syntax Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
- Issues: [list any syntax errors]

Best Practices Check
- Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ HAS ISSUES
- Issues: [list any problems found]
- Suggestions: [list optimization opportunities]

Query Explanation
- What it measures: [plain English description]
- Output labels: [list labels in result, or "None (scalar)"]
- Expected result structure: [instant vector / scalar / etc.]

This transparency helps users understand the validation process and any recommendations.

Stage 7: Provide Usage Instructions
After successful generation and validation, provide the user with:
1. **The Final Query**:
   ```promql
   [Generated and validated PromQL query]
   ```
2. **Query Explanation**:
   - What the query measures
   - How to interpret the results
   - Expected value range
   - Labels in the output
3. **How to Use It**:
   - For Dashboards: Copy into a Grafana/Prometheus UI panel query
   - For Alerts: Integrate into a Prometheus alerting rule with a threshold (alerts are routed via Alertmanager)
   - For Recording Rules: Add to the Prometheus recording rule config
   - For Ad-hoc: Run directly in the Prometheus expression browser
4. **Customization Notes**:
   - Time ranges that might need adjustment
   - Labels to modify for different environments
   - Threshold values to tune
   - Alternative functions if requirements change
5. **Related Queries**:
   - Suggest complementary queries
   - Mention recording rule opportunities
   - Recommend dashboard panels
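For the alerting path, a generated query typically lands in a Prometheus rule file like the sketch below; the alert name, threshold, and labels are hypothetical:

```yaml
groups:
  - name: example_alerts
    rules:
      - alert: HighErrorRate  # hypothetical name
        expr: |
          (
            sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api-server"}[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "api-server error rate above 5% for 5 minutes"
```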
Native Histograms (Prometheus 3.x+)
Native histograms are stable as of recent Prometheus 3.x releases (Prometheus 3.0 shipped in November 2024). They offer significant advantages over classic histograms:
- Sparse bucket representation with near-zero cost for empty buckets
- No configuration of bucket boundaries during instrumentation
- Coverage of the full float64 range
- Efficient mergeability across histograms
- Simpler query syntax
Important: Starting with Prometheus v3.8.0, native histograms are fully stable. However, scraping native histograms still requires explicit activation via the `scrape_native_histograms` configuration setting. Starting with v3.9, no feature flag is needed, but `scrape_native_histograms` must still be set explicitly.
Native vs Classic Histogram Syntax
```promql
# Classic histogram (requires _bucket suffix and le label)
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Native histogram (simpler - no _bucket suffix, no le label needed)
histogram_quantile(0.95,
  sum by (job) (rate(http_request_duration_seconds[5m]))
)
```
Native Histogram Functions

```promql
# Get observation count rate from native histogram
histogram_count(rate(http_request_duration_seconds[5m]))

# Get sum of observations from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))

# Calculate fraction of observations between two values
histogram_fraction(0, 0.1, rate(http_request_duration_seconds[5m]))

# Average request duration from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))
```
Detecting Native vs Classic Histograms

Native histograms are identified by:
- No `_bucket` suffix on the metric name
- No `le` label in the time series
- The metric stores histogram data directly (not separate bucket counters)

When querying, check if your Prometheus instance has native histograms enabled:

```yaml
# prometheus.yml - Enable native histogram scraping
scrape_configs:
  - job_name: 'my-app'
    scrape_native_histograms: true  # Prometheus 3.x+
```
Custom Bucket Native Histograms (NHCB)
Prometheus 3.4+ supports custom bucket native histograms (schema -53), allowing classic histogram to native histogram conversion. This is a key migration path for users with existing classic histograms.
Benefits of NHCB:
- Keep existing instrumentation (no code changes needed)
- Store classic histograms as native histograms for lower costs
- Query with native histogram syntax
- Improved reliability and compression
Configuration (Prometheus 3.4+):
```yaml
# prometheus.yml - Convert classic histograms to NHCB on scrape
scrape_configs:
  - job_name: 'my-app'
    convert_classic_histograms_to_nhcb: true  # Prometheus 3.4+
```

**Querying NHCB**:
```promql
# Query NHCB metrics the same way as native histograms
histogram_quantile(0.95, sum by (job) (rate(http_request_duration_seconds[5m])))

# histogram_fraction also works with NHCB (Prometheus 3.4+)
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[5m]))
```

**Note**: Schema -53 indicates custom bucket boundaries. Histograms with different custom bucket boundaries are generally not mergeable with each other.

---

SLO, Error Budget, and Burn Rate Patterns
Service Level Objectives (SLOs) are critical for modern SRE practices. These patterns help implement SLO-based monitoring and alerting.
Error Budget Calculation
```promql
# Error budget remaining (for 99.9% SLO over 30 days)
# Returns value between 0 and 1 (1 = full budget, 0 = exhausted)
1 - (
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{job="api"}[30d]))
) / 0.001  # 0.001 = 1 - 0.999 (allowed error rate)

# Simplified: Availability over 30 days
sum(rate(http_requests_total{job="api", status_code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
```
Burn Rate Calculation
Burn rate measures how fast you're consuming error budget. A burn rate of 1 means you'll exhaust the budget exactly at the end of the SLO window.
```promql
# Current burn rate (1 hour window, 99.9% SLO)
# Burn rate = (current error rate) / (allowed error rate)
(
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001  # 0.001 = allowed error rate for 99.9% SLO

# Burn rate > 1 means consuming budget faster than allowed
# A burn rate of 14.4 consumes 2% of a monthly budget in 1 hour
```
Multi-Window, Multi-Burn-Rate Alerts (Google SRE Standard)
The recommended approach for SLO alerting uses multiple windows to balance detection speed and precision:
```promql
# Page-level alert: 2% budget in 1 hour (burn rate 14.4)
# Long window (1h) AND short window (5m) must both exceed threshold
(
  (
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
    /
    sum(rate(http_requests_total{job="api"}[1h]))
  ) > 14.4 * 0.001
)
and
(
  (
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))
  ) > 14.4 * 0.001
)

# Ticket-level alert: 5% budget in 6 hours (burn rate 6)
(
  (
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[6h]))
    /
    sum(rate(http_requests_total{job="api"}[6h]))
  ) > 6 * 0.001
)
and
(
  (
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[30m]))
    /
    sum(rate(http_requests_total{job="api"}[30m]))
  ) > 6 * 0.001
)
```
SLO Recording Rules
Pre-compute SLO metrics for efficient alerting:
```yaml
# Recording rules for SLO calculations
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Error ratio over different windows
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Availability (success ratio)
      - record: job:slo_availability:ratio_rate1h
        expr: |
          1 - job:slo_errors_per_request:ratio_rate1h
```
Latency SLO Queries
```promql
# Percentage of requests faster than SLO target (200ms)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m]))
  /
  sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
) * 100

# Fraction of requests violating the latency SLO (slower than 500ms)
(
  sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
  -
  sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m]))
)
/
sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
```
Burn Rate Reference Table
| Burn Rate | Budget Consumed | Time to Exhaust 30-day Budget | Alert Severity |
|---|---|---|---|
| 1 | 100% over 30d | 30 days | None |
| 2 | 100% over 15d | 15 days | Low |
| 6 | 5% in 6h | 5 days | Ticket |
| 14.4 | 2% in 1h | ~2 days | Page |
| 36 | 5% in 1h | ~20 hours | Page (urgent) |
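The table above maps burn rates to alert windows. A common multi-window implementation pairs a long and a short window so pages fire quickly and also resolve quickly once the burn stops; a sketch for the 14.4 row, assuming a 99.9% SLO (allowed error ratio 0.001) and the error-ratio recording rules defined earlier:

```promql
# Page when both the 1h and 5m error ratios exceed 14.4x the error budget
# (0.001 is the allowed error ratio for a 99.9% SLO; adjust for your target)
(
  job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
and
  job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
)
```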
Advanced Query Techniques
Using Subqueries
Subqueries enable complex time-based calculations:

```promql
# Maximum 5-minute rate over the past 30 minutes
max_over_time(
  rate(http_requests_total[5m])[30m:1m]
)
```

**Syntax**: `<query>[<range>:<resolution>]`
- `<range>`: Time window to evaluate over
- `<resolution>`: Step size between evaluations
Using Offset Modifier
Compare current data with historical data:
```promql
# Compare current rate with rate from 1 week ago
rate(http_requests_total[5m])
-
rate(http_requests_total[5m] offset 1w)
```
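For dashboards, the same week-over-week comparison is often easier to read as a percentage change; a sketch:

```promql
# Percent change in request rate vs. the same time last week
(
  rate(http_requests_total[5m])
  /
  rate(http_requests_total[5m] offset 1w)
  - 1
) * 100
```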
Using @ Modifier
Query metrics at specific timestamps:
```promql
# Rate at the end of the range query
rate(http_requests_total[5m] @ end())

# Rate at a specific Unix timestamp
rate(http_requests_total[5m] @ 1609459200)
```
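A common use of `@` is pinning `topk` to a single instant so a range query graphs the same series across the whole window instead of a changing set per step; a sketch:

```promql
# The 5 highest-traffic series, selected once at the end of the graph window
topk(5, rate(http_requests_total[5m] @ end()))
```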
Binary Operators and Vector Matching
Combine metrics with operators and control label matching:
```promql
# One-to-one matching (default)
metric_a + metric_b

# Many-to-one with group_left (join version info onto the rate)
rate(http_requests_total[5m])
* on (job, instance) group_left (version) app_version_info

# Ignoring specific labels
metric_a + ignoring(instance) metric_b
```
Logical Operators
Filter time series based on conditions:
```promql
# Return series only where value > 100
http_requests_total > 100

# Return series present in both metrics
metric_a and metric_b

# Return series in A but not in B
metric_a unless metric_b
```
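`unless` is handy for suppressing known-noisy series; a sketch, assuming a hypothetical `maintenance_mode` metric you export yourself (value 1 during a maintenance window):

```promql
# Instances that are down, excluding those flagged as in maintenance
up == 0 unless on (instance) maintenance_mode == 1
```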
Documentation Lookup
If the user asks about specific Prometheus features, operators, or custom metrics:
1. Try context7 MCP first (preferred):
   - Use `mcp__context7__resolve-library-id` with "prometheus"
   - Then use `mcp__context7__get-library-docs` with:
     - `context7CompatibleLibraryID`: /prometheus/docs
     - `topic`: [specific feature, function, or operator]
     - `page`: 1 (fetch additional pages if needed)
2. Fall back to WebSearch:
   - Search query pattern: "Prometheus PromQL [function/operator/feature] documentation [version] examples"
   - Examples:
     - "Prometheus PromQL rate function documentation examples"
     - "Prometheus PromQL histogram_quantile documentation best practices"
     - "Prometheus PromQL aggregation operators documentation"
Common Monitoring Scenarios
RED Method (for Request-Driven Services)
- Rate: Request throughput
  ```promql
  sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)
  ```
- Errors: Error rate
  ```promql
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="api"}[5m]))
  ```
- Duration: Latency percentiles
  ```promql
  histogram_quantile(0.95,
    sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
  )
  ```
USE Method (for Resources)
- Utilization: Resource usage percentage
  ```promql
  (
    sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
    /
    count(node_cpu_seconds_total{mode="idle"})
  ) * 100
  ```
- Saturation: Queue depth or resource contention
  ```promql
  avg_over_time(node_load1[5m])
  ```
- Errors: Error counters
  ```promql
  rate(node_network_receive_errs_total[5m])
  ```
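Memory utilization follows the same shape; a sketch using standard node_exporter metrics:

```promql
# Memory utilization percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```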
Alerting Rules
When generating queries for alerting:
- Include the Threshold: Make the condition explicit
  ```promql
  # Alert when error rate exceeds 5%
  (
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
  ```
- Understand That Comparisons Filter: A comparison keeps only the series that satisfy the condition; the alert fires while any series remain
  ```promql
  # Returns series only while memory usage > 90%
  (process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.9
  ```
- Consider a `for` Duration: Alerts typically use the `for` clause to avoid firing on brief spikes
  ```yaml
  alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  for: 10m  # Only fire after 10 minutes of continuous violation
  ```
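Putting these pieces together, a complete rule-file entry might look like the following sketch (group name, labels, and annotations are illustrative):

```yaml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5% for 10 minutes"
```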
Recording Rules
When generating queries for recording rules:
- Follow the Naming Convention: `level:metric:operations`
  ```yaml
  # level: aggregation level (job, instance, etc.)
  # metric: base metric name
  # operations: functions applied
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
  ```
- Pre-aggregate Expensive Queries:
  ```yaml
  # Recording rule for a frequently-used latency query
  - record: job_endpoint:http_request_duration_seconds:p95
    expr: |
      histogram_quantile(0.95,
        sum by (job, endpoint, le) (
          rate(http_request_duration_seconds_bucket[5m])
        )
      )
  ```
- Use Recorded Metrics in Dashboards:
  ```promql
  # Instead of the expensive query, use the pre-recorded metric
  job_endpoint:http_request_duration_seconds:p95{job="api-server"}
  ```
Error Handling
Common Issues and Solutions
- Empty Results:
  - Check if metrics exist: `up{job="your-job"}`
  - Verify label filters are correct
  - Check that the time range is appropriate
  - Confirm the metric is being scraped
- Too Many Series (High Cardinality):
  - Add more specific label filters
  - Use aggregation to reduce series count
  - Consider using recording rules
  - Check for label explosion (dynamic labels)
- Incorrect Values:
  - Verify the metric type (counter vs. gauge)
  - Check function usage (`rate` on counters, not gauges)
  - Verify the time range is appropriate
  - Check for counter resets
- Performance Issues:
  - Reduce the time range for range vectors
  - Add label filters to reduce cardinality
  - Use recording rules for complex queries
  - Avoid expensive regex patterns
  - Consider query timeout settings
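When chasing empty results, `absent()` distinguishes "the metric is missing entirely" from "the filters match nothing"; a sketch:

```promql
# Returns 1 only if no series with this name and job exist at all
absent(http_requests_total{job="your-job"})
```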
Communication Guidelines
When generating queries:
- Explain the Plan: Always present a plain-English plan before generating
- Ask Questions: Use AskUserQuestion tool to gather requirements
- Confirm Intent: Verify the query matches user goals before finalizing
- Educate: Explain why certain functions or patterns are used
- Provide Context: Show how to interpret results
- Suggest Improvements: Offer optimizations or alternative approaches
- Validate Proactively: Always validate and fix issues
- Follow Up: Ask if adjustments are needed
Integration with devops-skills:promql-validator
After generating any PromQL query, automatically invoke the devops-skills:promql-validator skill to ensure quality:
Steps:
1. Generate the PromQL query based on user requirements
2. Invoke devops-skills:promql-validator skill with the generated query
3. Review validation results (syntax, semantics, performance)
4. Fix any issues identified by the validator
5. Re-validate until all checks pass
6. Provide the final validated query with usage instructions
7. Ask the user if further refinements are needed

This ensures all generated queries follow best practices and are production-ready.
Resources
IMPORTANT: Explicit Reference Consultation
When generating queries, you SHOULD explicitly read the relevant reference files using the Read tool and cite applicable best practices. This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.
references/
promql_functions.md
- Comprehensive reference of all PromQL functions
- Grouped by category (aggregation, math, time, histogram, etc.)
- Usage examples for each function
- Read this file when: implementing specific function requirements or when user asks about function behavior
promql_patterns.md
- Common query patterns for typical monitoring scenarios
- RED method patterns (Rate, Errors, Duration)
- USE method patterns (Utilization, Saturation, Errors)
- Alerting and recording rule patterns
- Read this file when: implementing standard monitoring patterns like error rates, latency, or resource usage
best_practices.md
- PromQL best practices and anti-patterns
- Performance optimization guidelines
- Cardinality management
- Query structure recommendations
- Read this file when: optimizing queries, reviewing for anti-patterns, or when cardinality concerns arise
metric_types.md
- Detailed guide to Prometheus metric types
- Counter, Gauge, Histogram, Summary
- When to use each type
- Appropriate functions for each type
- Read this file when: clarifying metric type questions or determining appropriate functions for a metric
examples/
common_queries.promql
- Collection of commonly-used PromQL queries
- Request rate, error rate, latency queries
- Resource usage queries
- Availability and uptime queries
- Can be copied and customized
red_method.promql
- Complete RED method implementation
- Request rate queries
- Error rate queries
- Duration/latency queries
use_method.promql
- Complete USE method implementation
- Utilization queries
- Saturation queries
- Error queries
alerting_rules.yaml
- Example Prometheus alerting rules
- Various threshold-based alerts
- Best practices for alert expressions
recording_rules.yaml
- Example Prometheus recording rules
- Pre-aggregated metrics
- Naming conventions
slo_patterns.promql
- SLO, error budget, and burn rate queries
- Multi-window, multi-burn-rate alerting patterns
- Latency SLO compliance queries
kubernetes_patterns.promql
- Kubernetes monitoring patterns
- kube-state-metrics queries (pods, deployments, nodes)
- cAdvisor container metrics (CPU, memory)
- Vector matching and joins for Kubernetes
Important Notes
- Always Plan Interactively: Never generate a query without confirming the plan with the user
- Use AskUserQuestion: Leverage the tool to gather requirements and confirm plans
- Validate Everything: Always invoke devops-skills:promql-validator after generation
- Educate Users: Explain what the query does and why it's structured that way
- Consider Use Case: Tailor the query based on whether it's for dashboards, alerts, or analysis
- Think About Performance: Always include label filters and consider cardinality
- Follow Metric Types: Use appropriate functions for counters, gauges, and histograms
- Format for Readability: Use multi-line formatting for complex queries
Success Criteria
A successful query generation session should:
- Fully understand the user's monitoring goal
- Identify correct metrics and their types
- Present a clear plan in plain English
- Get user confirmation before generating code
- Generate a syntactically correct query
- Use appropriate functions for metric types
- Include specific label filters
- Pass devops-skills:promql-validator validation
- Provide clear usage instructions
- Offer customization guidance
Remember
The goal is to collaboratively plan and generate PromQL queries that exactly match user intentions. Always prioritize clarity, correctness, and performance. The interactive planning phase is the most important part of this skill—never skip it!