promql-generator


PromQL Query Generator


Overview


This skill provides a comprehensive, interactive workflow for generating production-ready PromQL queries with best practices built-in. Generate queries for monitoring dashboards, alerting rules, and ad-hoc analysis with an emphasis on user collaboration and planning before code generation.

When to Use This Skill


Invoke this skill when:
  • Creating new PromQL queries from scratch
  • Building monitoring dashboards (Grafana, Prometheus UI, etc.)
  • Implementing alerting rules for Prometheus Alertmanager
  • Analyzing metrics for troubleshooting or capacity planning
  • Converting monitoring requirements into PromQL expressions
  • Learning PromQL or teaching others
  • The user asks to "create", "generate", "build", or "write" PromQL queries
  • Working with Prometheus metrics (counters, gauges, histograms, summaries)
  • Implementing RED (Rate, Errors, Duration) or USE (Utilization, Saturation, Errors) metrics

Interactive Query Planning Workflow


CRITICAL: This skill emphasizes interactive planning before query generation. Always engage the user in a collaborative planning process to ensure the generated query matches their exact intentions.
Follow this workflow when generating PromQL queries:

Stage 1: Understand the Monitoring Goal


Start by understanding what the user wants to monitor or measure. Ask clarifying questions to gather requirements:
  1. Primary Goal: What are you trying to monitor or measure?
    • Request rate (requests per second)
    • Error rate (percentage of failed requests)
    • Latency/duration (response times, percentiles)
    • Resource usage (CPU, memory, disk, network)
    • Availability/uptime
    • Queue depth, saturation, throughput
    • Custom business metrics
  2. Use Case: What will this query be used for?
    • Dashboard visualization (Grafana, Prometheus UI)
    • Alerting rule (firing when threshold exceeded)
    • Ad-hoc troubleshooting/analysis
    • Recording rule (pre-computed aggregation)
    • Capacity planning or SLO tracking
  3. Context: Any additional context?
    • Service/application name
    • Team or project
    • Priority level
    • Existing metrics or naming conventions
Use the AskUserQuestion tool to gather this information if not provided.
When to Ask vs. Infer: If the user's initial request already clearly specifies the goal, use case, and context (e.g., "Create an alert for P95 latency > 500ms for payment-service"), you may acknowledge these details in your response instead of re-asking. Only ask clarifying questions for information that is missing or ambiguous.

Stage 2: Identify Available Metrics


Determine which metrics are available and relevant:
  1. Metric Discovery: What metrics are available?
    • Ask the user for metric names
    • If uncertain, suggest common naming patterns
    • Check for metric type indicators in the name:
      • `_total` suffix → Counter
      • `_bucket`, `_sum`, `_count` suffixes → Histogram
      • No suffix → Likely Gauge
      • `_created` suffix → Counter creation timestamp
  2. Metric Type Identification: Confirm the metric type(s)
    • Counter: Cumulative metric that only increases (or resets to zero)
      • Examples: `http_requests_total`, `errors_total`, `bytes_sent_total`
      • Use with: `rate()`, `irate()`, `increase()`
    • Gauge: Point-in-time value that can go up or down
      • Examples: `memory_usage_bytes`, `cpu_temperature_celsius`, `queue_length`
      • Use with: `avg_over_time()`, `min_over_time()`, `max_over_time()`, or directly
    • Histogram: Buckets of observations with cumulative counts
      • Examples: `http_request_duration_seconds_bucket`, `response_size_bytes_bucket`
      • Use with: `histogram_quantile()`, `rate()`
    • Summary: Pre-calculated quantiles with count and sum
      • Examples: `rpc_duration_seconds{quantile="0.95"}`
      • Use `_sum` and `_count` for averages; don't average quantiles
  3. Label Discovery: What labels are available on these metrics?
    • Common labels: `job`, `instance`, `environment`, `service`, `endpoint`, `status_code`, `method`
    • Ask which labels are important for filtering or grouping
Use the AskUserQuestion tool to confirm metric names, types, and available labels.
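When the user cannot name exact metrics, a few ad-hoc discovery queries in the Prometheus expression browser can help. These are generic sketches; the `http_request.*` prefix and metric names are illustrative assumptions:

```promql
# List series whose name matches a guessed prefix (regex on __name__)
{__name__=~"http_request.*"}

# Count series per job to see which jobs expose the metric
count by (job) (http_requests_total)

# Enumerate label combinations without returning sample values
group by (job, status_code) (http_requests_total)
```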

Stage 3: Determine Query Parameters


Gather specific requirements for the query.

Pre-confirmation for User-Provided Parameters


IMPORTANT: When the user has already specified parameters in their initial request (e.g., "5-minute window", "500ms threshold", "> 5% error rate"), you MUST:
  1. Acknowledge the provided values explicitly in your response
  2. Present them as pre-filled defaults in AskUserQuestion with the first option being "Use specified values"
  3. Allow quick confirmation rather than re-asking for information already given
Example: If user says "alert when P95 latency exceeds 500ms", use:
AskUserQuestion:
- Question: "Confirm the alert threshold?"
- Options:
  1. "500ms (as specified)" - Use the threshold from your request
  2. "Different threshold" - Let me specify a different value
This respects the user's input and speeds up the workflow while still allowing modifications.
  1. Time Range: What time window should the query cover?
    • Instant value (current)
    • Rate over time (`[5m]`, `[1h]`, `[1d]`)
    • For rate calculations: typically `[1m]` to `[5m]` for real-time, `[1h]` to `[1d]` for trends
    • Rule of thumb: Rate range should be at least 4x the scrape interval
  2. Label Filtering: Which labels should filter the data?
    • Exact matches: `job="api-server"`, `status_code="200"`
    • Negative matches: `status_code!="200"`
    • Regex matches: `instance=~"prod-.*"`
    • Multiple conditions: `{job="api", environment="production"}`
  3. Aggregation: Should the data be aggregated?
    • No aggregation: Return all time series as-is
    • Aggregate by labels: `sum by (job, endpoint)`, `avg by (instance)`
    • Aggregate without labels: `sum without (instance, pod)`, `avg without (job)`
    • Common aggregations: `sum`, `avg`, `max`, `min`, `count`, `topk`, `bottomk`
  4. Thresholds or Conditions: Are there specific conditions?
    • For alerting: threshold values (e.g., error rate > 5%)
    • For filtering: only show series above/below a value
    • For comparison: compare against historical data (offset)
Use the AskUserQuestion tool to gather or confirm these parameters. When the user has already provided values (e.g., "5-minute window", "> 5%"), present them as the default option for confirmation.
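Put together, a fully specified query combines a time range, label filters, an aggregation, and an optional threshold. A sketch using illustrative metric and label names:

```promql
# Per-endpoint request rate in production, kept only when above 100 req/s
sum by (endpoint) (
  rate(http_requests_total{job="api-server", environment="production"}[5m])
) > 100
```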

Stage 4: Present the Query Plan


BEFORE GENERATING ANY CODE, present a plain-English query plan and ask for user confirmation:

PromQL Query Plan


Based on your requirements, here's what the query will do:
Goal: [Describe the monitoring goal in plain English]
Query Structure:
  1. Start with metric:
    [metric_name]
  2. Filter by labels:
    {label1="value1", label2="value2"}
  3. Apply function:
    [function_name]([metric][time_range])
  4. Aggregate:
    [aggregation] by ([label_list])
  5. Additional operations: [any calculations, ratios, or transformations]
Expected Output:
  • Data type: [instant vector/scalar]
  • Labels in result: [list of labels]
  • Value represents: [what the number means]
  • Typical range: [expected value range]
Example Interpretation: If the query returns `0.05`, it means: [plain English explanation]
Does this match your intentions?
  • If yes, I'll generate the query and validate it
  • If no, let me know what needs to change

Use the **AskUserQuestion** tool to confirm the plan with options:
- "Yes, generate this query"
- "Modify [specific aspect]"
- "Show me alternative approaches"

Stage 5: Generate the PromQL Query


Once the user confirms the plan, generate the actual PromQL query following best practices.

IMPORTANT: Consult Reference Files Before Generating


Before writing any query code, you MUST:
  1. Read the relevant reference file(s) using the Read tool:
    • For histogram queries → Read `references/metric_types.md` (Histogram section)
    • For error/latency patterns → Read `references/promql_patterns.md` (RED method section)
    • For resource monitoring → Read `references/promql_patterns.md` (USE method section)
    • For optimization questions → Read `references/best_practices.md`
    • For specific functions → Read `references/promql_functions.md`
  2. Cite the applicable pattern or best practice in your response:
    As documented in references/promql_patterns.md (Pattern 3: Latency Percentile):
    # 95th percentile latency
    histogram_quantile(0.95, sum by (le) (rate(...)))
  3. Reference example files when generating similar queries:
    Based on examples/red_method.promql (lines 64-82):
    # P95 latency with proper histogram_quantile usage
This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.

Best Practices for Query Generation


  1. Always Use Label Filters
    ```promql
    # Good: Specific filtering reduces cardinality
    rate(http_requests_total{job="api-server", environment="prod"}[5m])

    # Bad: Matches all time series, high cardinality
    rate(http_requests_total[5m])
    ```
  2. Use Appropriate Functions for Metric Types
    ```promql
    # Counter: Use rate() or increase()
    rate(http_requests_total[5m])

    # Gauge: Use directly or with *_over_time()
    memory_usage_bytes
    avg_over_time(memory_usage_bytes[5m])

    # Histogram: Use histogram_quantile()
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    )
    ```
  3. Apply Aggregations with by() or without()
    ```promql
    # Aggregate by specific labels (keeps only these labels)
    sum by (job, endpoint) (rate(http_requests_total[5m]))

    # Aggregate without specific labels (removes these labels)
    sum without (instance, pod) (rate(http_requests_total[5m]))
    ```
  4. Use Exact Matches Over Regex When Possible
    ```promql
    # Good: Faster exact match
    http_requests_total{status_code="200"}

    # Bad: Slower regex match when not needed
    http_requests_total{status_code=~"200"}
    ```
  5. Calculate Ratios Properly
    ```promql
    # Error rate: errors / total requests
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    ```
  6. Use Recording Rules for Complex Queries
    • If a query is used frequently or is computationally expensive
    • Pre-aggregate data to reduce query load
    • Follow naming convention: `level:metric:operations`
  7. Format for Readability
    ```promql
    # Good: Multi-line for complex queries
    histogram_quantile(0.95,
      sum by (le, job) (
        rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
      )
    )
    ```
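As a sketch of the `level:metric:operations` naming convention from item 6 (rule name and expression are illustrative):

```yaml
groups:
  - name: example_recording_rules
    rules:
      # level = job, metric = http_requests, operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```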

Common Query Patterns


**Pattern 1: Request Rate**
```promql
# Requests per second
rate(http_requests_total{job="api-server"}[5m])

# Total requests per second across all instances
sum(rate(http_requests_total{job="api-server"}[5m]))
```

**Pattern 2: Error Rate**
```promql
# Error ratio (0 to 1)
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))

# Error percentage (0 to 100)
(
  sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="api-server"}[5m]))
) * 100
```

**Pattern 3: Latency Percentile (Histogram)**
```promql
# 95th percentile latency
histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
  )
)
```

**Pattern 4: Resource Usage**
```promql
# Current memory usage
process_resident_memory_bytes{job="api-server"}

# Average CPU usage over 5 minutes (process_cpu_seconds_total is a counter,
# so use rate() rather than avg_over_time())
rate(process_cpu_seconds_total{job="api-server"}[5m])
```

**Pattern 5: Availability**
```promql
# Percentage of up instances
(
  count(up{job="api-server"} == 1)
  /
  count(up{job="api-server"})
) * 100
```

**Pattern 6: Saturation/Queue Depth**
```promql
# Average queue length
avg_over_time(queue_depth{job="worker"}[5m])

# Maximum queue depth in the last hour
max_over_time(queue_depth{job="worker"}[1h])
```

Stage 6: Validate the Generated Query


ALWAYS validate the generated query using the devops-skills:promql-validator skill:
After generating the query, automatically invoke:
Skill(devops-skills:promql-validator)

The devops-skills:promql-validator skill will:
1. Check syntax correctness
2. Validate semantic logic (correct functions for metric types)
3. Identify anti-patterns and inefficiencies
4. Suggest optimizations
5. Explain what the query does
6. Verify it matches user intent
Validation checklist:
  • Syntax is correct (balanced brackets, valid operators)
  • Metric type matches function usage
  • Label filters are specific enough
  • Aggregation is appropriate
  • Time ranges are reasonable
  • No known anti-patterns
  • Query is optimized for performance
If validation fails, fix issues and re-validate until all checks pass.
IMPORTANT: Display Validation Results to User
After running validation, you MUST display the structured results to the user in this format:

PromQL Validation Results


Syntax Check


  • Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
  • Issues: [list any syntax errors]

Best Practices Check


  • Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ HAS ISSUES
  • Issues: [list any problems found]
  • Suggestions: [list optimization opportunities]

Query Explanation


  • What it measures: [plain English description]
  • Output labels: [list labels in result, or "None (scalar)"]
  • Expected result structure: [instant vector / scalar / etc.]

This transparency helps users understand the validation process and any recommendations.

Stage 7: Provide Usage Instructions


After successful generation and validation, provide the user with:
  1. The Final Query:
    ```promql
    [Generated and validated PromQL query]
    ```
  2. Query Explanation:
    • What the query measures
    • How to interpret the results
    • Expected value range
    • Labels in the output
  3. How to Use It:
    • For Dashboards: Copy into Grafana/Prometheus UI panel query
    • For Alerts: Integrate into Alertmanager rule with threshold
    • For Recording Rules: Add to Prometheus recording rule config
    • For Ad-hoc: Run directly in Prometheus expression browser
  4. Customization Notes:
    • Time ranges that might need adjustment
    • Labels to modify for different environments
    • Threshold values to tune
    • Alternative functions if requirements change
  5. Related Queries:
    • Suggest complementary queries
    • Mention recording rule opportunities
    • Recommend dashboard panels

Native Histograms (Prometheus 3.x+)


Native histograms are now stable in Prometheus 3.0+ (released November 2024). They offer significant advantages over classic histograms:
  • Sparse bucket representation with near-zero cost for empty buckets
  • No configuration of bucket boundaries during instrumentation
  • Coverage of the full float64 range
  • Efficient mergeability across histograms
  • Simpler query syntax
Important: Starting with Prometheus v3.8.0, native histograms are fully stable. However, scraping native histograms still requires explicit activation via the `scrape_native_histograms` configuration setting. Starting with v3.9, no feature flag is needed, but `scrape_native_histograms` must still be set explicitly.

Native vs Classic Histogram Syntax


```promql
# Classic histogram (requires _bucket suffix and le label)
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Native histogram (simpler - no _bucket suffix, no le label needed)
histogram_quantile(0.95,
  sum by (job) (rate(http_request_duration_seconds[5m]))
)
```

Native Histogram Functions


```promql
# Get observation count rate from native histogram
histogram_count(rate(http_request_duration_seconds[5m]))

# Get sum of observations from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))

# Calculate fraction of observations between two values
histogram_fraction(0, 0.1, rate(http_request_duration_seconds[5m]))

# Average request duration from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))
```

Detecting Native vs Classic Histograms


Native histograms are identified by:
  • No `_bucket` suffix on the metric name
  • No `le` label in the time series
  • The metric stores histogram data directly (not separate bucket counters)
When querying, check if your Prometheus instance has native histograms enabled:

```yaml
# prometheus.yml - Enable native histogram scraping
scrape_configs:
  - job_name: 'my-app'
    scrape_native_histograms: true  # Prometheus 3.x+
```

Custom Bucket Native Histograms (NHCB)


Prometheus 3.4+ supports custom bucket native histograms (schema -53), allowing classic histogram to native histogram conversion. This is a key migration path for users with existing classic histograms.
Benefits of NHCB:
  • Keep existing instrumentation (no code changes needed)
  • Store classic histograms as native histograms for lower costs
  • Query with native histogram syntax
  • Improved reliability and compression
Configuration (Prometheus 3.4+):
```yaml
# prometheus.yml - Convert classic histograms to NHCB on scrape
scrape_configs:
  - job_name: 'my-app'
    convert_classic_histograms_to_nhcb: true  # Prometheus 3.4+
```

**Querying NHCB**:
```promql
# Query NHCB metrics the same way as native histograms
histogram_quantile(0.95, sum by (job) (rate(http_request_duration_seconds[5m])))

# histogram_fraction also works with NHCB (Prometheus 3.4+)
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[5m]))
```

**Note**: Schema -53 indicates custom bucket boundaries. Histograms with different custom bucket boundaries are generally not mergeable with each other.

---

SLO, Error Budget, and Burn Rate Patterns


Service Level Objectives (SLOs) are critical for modern SRE practices. These patterns help implement SLO-based monitoring and alerting.

Error Budget Calculation


```promql
# Error budget remaining (for 99.9% SLO over 30 days)
# Returns value between 0 and 1 (1 = full budget, 0 = exhausted)
1 - (
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{job="api"}[30d]))
) / 0.001  # 0.001 = 1 - 0.999 (allowed error rate)

# Simplified: Availability over 30 days
sum(rate(http_requests_total{job="api", status_code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
```
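A quick worked example of the error-budget formula (the observed ratio is illustrative):

```promql
# Observed 30-day error ratio: 0.0005 (0.05%)
# Budget remaining = 1 - 0.0005 / 0.001 = 0.5
# i.e. half of the 99.9% SLO's error budget is still available
```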

Burn Rate Calculation

燃烧速率计算

Burn rate measures how fast you're consuming error budget. A burn rate of 1 means you'll exhaust the budget exactly at the end of the SLO window.
```promql
# Current burn rate (1 hour window, 99.9% SLO)
# Burn rate = (current error rate) / (allowed error rate)
(
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001  # 0.001 = allowed error rate for 99.9% SLO

# Burn rate > 1 means consuming budget faster than allowed
# Burn rate of 14.4 consumes 2% of monthly budget in 1 hour
```
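The 14.4 figure comes from window arithmetic: a 30-day SLO window is 720 hours, so consuming 2% of the budget in 1 hour means burning at 0.02 × 720 / 1 = 14.4 times the sustainable rate:

```promql
# burn_rate = (budget fraction consumed) * (SLO window) / (alert window)
# Page alert:   0.02 * 720h / 1h = 14.4
# Ticket alert: 0.05 * 720h / 6h = 6
```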

Multi-Window, Multi-Burn-Rate Alerts (Google SRE Standard)

多窗口、多燃烧速率告警(Google SRE标准)

The recommended approach for SLO alerting uses multiple windows to balance detection speed and precision:
promql
undefined
SLO告警的推荐方法使用多个窗口,平衡检测速度和精度:
promql
undefined

Page-level alert: 2% budget in 1 hour (burn rate 14.4)

页面级告警:1小时内消耗2%预算(燃烧速率14.4)

Long window (1h) AND short window (5m) must both exceed threshold

长窗口(1h)和短窗口(5m)必须同时超过阈值

( ( sum(rate(http_requests_total{job="api", status_code="5.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h])) ) > 14.4 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code="5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m])) ) > 14.4 * 0.001 )
( ( sum(rate(http_requests_total{job="api", status_code="5.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h])) ) > 14.4 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code="5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m])) ) > 14.4 * 0.001 )

Ticket-level alert: 5% budget in 6 hours (burn rate 6)

工单级告警:6小时内消耗5%预算(燃烧速率6)

( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[6h])) / sum(rate(http_requests_total{job="api"}[6h])) ) > 6 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[30m])) / sum(rate(http_requests_total{job="api"}[30m])) ) > 6 * 0.001 )
( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[6h])) / sum(rate(http_requests_total{job="api"}[6h])) ) > 6 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[30m])) / sum(rate(http_requests_total{job="api"}[30m])) ) > 6 * 0.001 )
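As a sketch, the page-level condition can be wrapped in a Prometheus alerting rule built on the recording rules defined below; the group name, alert name, and labels here are illustrative, not prescribed:

```yaml
groups:
  - name: slo_burn_rate_alerts        # illustrative name
    rules:
      - alert: ErrorBudgetBurnPage    # illustrative name
        expr: |
          (
            job:slo_errors_per_request:ratio_rate1h > 14.4 * 0.001
          and
            job:slo_errors_per_request:ratio_rate5m > 14.4 * 0.001
          )
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >= 14.4x the allowed rate (2% of the 30-day budget per hour)"
```

Because both windows are pre-computed by recording rules, the alert expression stays cheap to evaluate at short intervals.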

SLO Recording Rules

SLO记录规则

Pre-compute SLO metrics for efficient alerting:
预计算SLO指标以实现高效告警:

Recording rules for SLO calculations

SLO计算记录规则

```yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Error ratio over different windows
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Availability (success ratio)
      - record: job:slo_availability:ratio_rate1h
        expr: |
          1 - job:slo_errors_per_request:ratio_rate1h
```
```yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # 不同窗口的错误比率
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # 可用性(成功比率)
      - record: job:slo_availability:ratio_rate1h
        expr: |
          1 - job:slo_errors_per_request:ratio_rate1h
```

Latency SLO Queries

延迟SLO查询


Percentage of requests faster than SLO target (200ms)

快于SLO目标(200ms)的请求百分比

( sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="api"}[5m])) ) * 100
( sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="api"}[5m])) ) * 100

Requests violating latency SLO (slower than 500ms)

违反延迟SLO的请求(慢于500ms)

( sum(rate(http_request_duration_seconds_count{job="api"}[5m])) - sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m])) ) / sum(rate(http_request_duration_seconds_count{job="api"}[5m]))

( sum(rate(http_request_duration_seconds_count{job="api"}[5m])) - sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m])) ) / sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
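The same bucket counters also support an Apdex-style score, a pattern shown in the Prometheus histogram documentation; the 0.2s "satisfied" and 0.5s "tolerated" thresholds below are illustrative:

```promql
# Apdex-like score: satisfied requests count fully, tolerated requests count half
(
  sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
```

Note this works because histogram buckets are cumulative: the `le="0.5"` bucket already includes every request counted in `le="0.2"`.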

Burn Rate Reference Table

燃烧速率参考表

| Burn Rate | Budget Consumed | Time to Exhaust 30-day Budget | Alert Severity |
| --- | --- | --- | --- |
| 1 | 100% over 30d | 30 days | None |
| 2 | 100% over 15d | 15 days | Low |
| 6 | 5% in 6h | 5 days | Ticket |
| 14.4 | 2% in 1h | ~2 days | Page |
| 36 | 5% in 1h | ~20 hours | Page (urgent) |

| 燃烧速率 | 消耗预算占比 | 耗尽30天预算所需时间 | 告警级别 |
| --- | --- | --- | --- |
| 1 | 30天内100% | 30天 | 无 |
| 2 | 15天内100% | 15天 | 低 |
| 6 | 6小时内5% | 5天 | 工单 |
| 14.4 | 1小时内2% | ~2天 | 页面 |
| 36 | 1小时内5% | ~20小时 | 紧急页面 |
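Both derived columns in the table follow from one relationship between burn rate $B$, observation window $w$, and the SLO period $T$ (30 days = 720 hours):

```latex
\text{budget consumed} = B \cdot \frac{w}{T}, \qquad
\text{time to exhaust} = \frac{T}{B}
% Example: B = 14.4, w = 1\,\text{h}:
%   14.4 \cdot \tfrac{1}{720} = 0.02 = 2\%, \qquad
%   \tfrac{720}{14.4} = 50\,\text{h} \approx 2\,\text{days}
```

The same arithmetic checks the other rows, e.g. $B = 36$: $36 \cdot \tfrac{1}{720} = 5\%$ per hour and $\tfrac{720}{36} = 20$ hours to exhaustion.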

Advanced Query Techniques

高级查询技巧

Using Subqueries

使用子查询

Subqueries enable complex time-based calculations:
子查询支持复杂的时间计算:

Maximum 5-minute rate over the past 30 minutes

过去30分钟内的最大5分钟速率

max_over_time( rate(http_requests_total[5m])[30m:1m] )

**Syntax**: `<query>[<range>:<resolution>]`
- `<range>`: Time window to evaluate over
- `<resolution>`: Step size between evaluations
max_over_time( rate(http_requests_total[5m])[30m:1m] )

**语法**:`<query>[<range>:<resolution>]`
- `<range>`:评估的时间窗口
- `<resolution>`:评估的步长

Using Offset Modifier

使用Offset修饰符

Compare current data with historical data:
将当前数据与历史数据对比:

Compare current rate with rate from 1 week ago

将当前速率与1周前的速率对比

rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1w)

rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1w)
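A common variant expresses the comparison as a percentage change rather than an absolute difference (aggregated with `sum` so the two sides match without label juggling):

```promql
# Week-over-week change in total request rate, as a percentage
(
  sum(rate(http_requests_total[5m]))
  /
  sum(rate(http_requests_total[5m] offset 1w))
  - 1
) * 100
```

A value of 25 means traffic is 25% higher than at the same time last week; negative values mean a drop.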

Using @ Modifier

使用@修饰符

Query metrics at specific timestamps:
查询特定时间戳的指标:

Rate at the end of the range query

范围查询结束时的速率

rate(http_requests_total[5m] @ end())
rate(http_requests_total[5m] @ end())

Rate at specific Unix timestamp

特定Unix时间戳的速率

rate(http_requests_total[5m] @ 1609459200)
rate(http_requests_total[5m] @ 1609459200)

Binary Operators and Vector Matching

二元运算符与向量匹配

Combine metrics with operators and control label matching:
使用运算符组合指标并控制标签匹配:

One-to-one matching (default)

一对一匹配(默认)

metric_a + metric_b
metric_a + metric_b

Many-to-one with group_left

多对一匹配,使用group_left

rate(http_requests_total[5m]) * on (job, instance) group_left (version) app_version_info
rate(http_requests_total[5m]) * on (job, instance) group_left (version) app_version_info

Ignoring specific labels

忽略特定标签

metric_a + ignoring(instance) metric_b
metric_a + ignoring(instance) metric_b

Logical Operators

逻辑运算符

Filter time series based on conditions:
根据条件过滤时间序列:

Return series only where value > 100

仅返回值>100的序列

http_requests_total > 100
http_requests_total > 100

Return series present in both

返回同时存在于两个指标中的序列

metric_a and metric_b
metric_a and metric_b

Return series in A but not in B

返回存在于A但不存在于B的序列

metric_a unless metric_b
metric_a unless metric_b
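These operators compose naturally. A sketch that suppresses an error-ratio condition during low-traffic periods (the 1 req/s floor is an arbitrary illustration, not a recommendation):

```promql
# Error ratio > 5%, but only while total traffic exceeds 1 req/s
(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.05
and
sum(rate(http_requests_total[5m])) > 1
```

Both sides aggregate away all labels, so the `and` matches on the empty label set; with grouped aggregations, add matching `by (...)` clauses to both operands.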

Documentation Lookup

文档查询

If the user asks about specific Prometheus features, operators, or custom metrics:
  1. Try context7 MCP first (preferred):
    Use mcp__context7__resolve-library-id with "prometheus"
    Then use mcp__context7__get-library-docs with:
    - context7CompatibleLibraryID: /prometheus/docs
    - topic: [specific feature, function, or operator]
    - page: 1 (fetch additional pages if needed)
  2. Fallback to WebSearch:
    Search query pattern:
    "Prometheus PromQL [function/operator/feature] documentation [version] examples"
    
    Examples:
    "Prometheus PromQL rate function documentation examples"
    "Prometheus PromQL histogram_quantile documentation best practices"
    "Prometheus PromQL aggregation operators documentation"
若用户询问特定Prometheus功能、运算符或自定义指标:
  1. 优先使用context7 MCP:
    使用mcp__context7__resolve-library-id,参数为"prometheus"
    然后使用mcp__context7__get-library-docs,参数:
    - context7CompatibleLibraryID: /prometheus/docs
    - topic: [特定功能、函数或运算符]
    - page: 1(若需要,获取更多页面)
  2. 备用方案:Web搜索:
    搜索查询模式:
    "Prometheus PromQL [函数/运算符/功能] 文档 [版本] 示例"
    
    示例:
    "Prometheus PromQL rate函数 文档 示例"
    "Prometheus PromQL histogram_quantile 文档 最佳实践"
    "Prometheus PromQL 聚合运算符 文档"

Common Monitoring Scenarios

常见监控场景

RED Method (for Request-Driven Services)

RED方法(适用于请求驱动型服务)

  1. Rate: Request throughput
    promql
    sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)
  2. Errors: Error rate
    promql
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))
  3. Duration: Latency percentiles
    promql
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
    )
  1. Rate:请求吞吐量
    promql
    sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)
  2. Errors:错误率
    promql
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))
  3. Duration:延迟分位数
    promql
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
    )

USE Method (for Resources)

USE方法(适用于资源)

  1. Utilization: Resource usage percentage
    promql
    (
      sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      /
      count(node_cpu_seconds_total{mode="idle"})
    ) * 100
  2. Saturation: Queue depth or resource contention
    promql
    avg_over_time(node_load1[5m])
  3. Errors: Error counters
    promql
    rate(node_network_receive_errs_total[5m])
  1. Utilization:资源使用率百分比
    promql
    (
      sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      /
      count(node_cpu_seconds_total{mode="idle"})
    ) * 100
  2. Saturation:队列深度或资源竞争
    promql
    avg_over_time(node_load1[5m])
  3. Errors:错误计数器
    promql
    rate(node_network_receive_errs_total[5m])

Alerting Rules

告警规则

When generating queries for alerting:
  1. Include the Threshold: Make the condition explicit
    promql
    # Alert when error rate exceeds 5%
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  2. Use Boolean Operators: Return 1 (fire) or 0 (no alert)
    promql
    # Returns 1 when memory usage > 90%
    (process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.9
  3. Consider `for` Duration: Alerts typically use the `for` clause
    yaml
    alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status_code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      ) > 0.05
    for: 10m  # Only fire after 10 minutes of continuous violation
生成告警查询时:
  1. 包含阈值:明确条件
    promql
    # 错误率超过5%时告警
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  2. 使用布尔运算符:返回1(触发)或0(不告警)
    promql
    # 内存使用率>90%时返回1
    (process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.9
  3. 考虑持续时间:告警通常使用`for`子句
    yaml
    alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status_code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      ) > 0.05
    for: 10m  # 持续违反阈值10分钟后才触发

Recording Rules

记录规则

When generating queries for recording rules:
  1. Follow Naming Convention:
    level:metric:operations
    yaml
    # level: aggregation level (job, instance, etc.)
    # metric: base metric name
    # operations: functions applied
    
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
  2. Pre-aggregate Expensive Queries:
    yaml
    # Recording rule for frequently-used latency query
    - record: job_endpoint:http_request_duration_seconds:p95
      expr: |
        histogram_quantile(0.95,
          sum by (job, endpoint, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
        )
  3. Use Recorded Metrics in Dashboards:
    promql
    # Instead of expensive query, use pre-recorded metric
    job_endpoint:http_request_duration_seconds:p95{job="api-server"}
生成记录规则查询时:
  1. 遵循命名规范
    level:metric:operations
    yaml
    # level: 聚合级别(job、instance等)
    # metric: 基础指标名称
    # operations: 应用的函数
    
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
  2. 预聚合复杂查询:
    yaml
    # 频繁使用的延迟查询的记录规则
    - record: job_endpoint:http_request_duration_seconds:p95
      expr: |
        histogram_quantile(0.95,
          sum by (job, endpoint, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
        )
  3. 在仪表板中使用预记录指标:
    promql
    # 不使用复杂查询,而是使用预记录指标
    job_endpoint:http_request_duration_seconds:p95{job="api-server"}

Error Handling

错误处理

Common Issues and Solutions

常见问题与解决方案

  1. Empty Results:
    • Check if metrics exist:
      up{job="your-job"}
    • Verify label filters are correct
    • Check time range is appropriate
    • Confirm metric is being scraped
  2. Too Many Series (High Cardinality):
    • Add more specific label filters
    • Use aggregation to reduce series count
    • Consider using recording rules
    • Check for label explosion (dynamic labels)
  3. Incorrect Values:
    • Verify metric type (counter vs gauge)
    • Check function usage (rate on counters, not gauges)
    • Verify time range is appropriate
    • Check for counter resets
  4. Performance Issues:
    • Reduce time range for range vectors
    • Add label filters to reduce cardinality
    • Use recording rules for complex queries
    • Avoid expensive regex patterns
    • Consider query timeout settings
  1. 结果为空:
    • 检查指标是否存在:
      up{job="your-job"}
    • 验证标签过滤是否正确
    • 检查时间范围是否合适
    • 确认指标正在被抓取
  2. 序列过多(高基数):
    • 添加更精确的标签过滤
    • 使用聚合减少序列数量
    • 考虑使用记录规则
    • 检查是否存在标签爆炸(动态标签)
  3. 数值不正确:
    • 验证指标类型(counter vs gauge)
    • 检查函数使用(rate用于counter,而非gauge)
    • 验证时间范围是否合适
    • 检查是否存在counter重置
  4. 性能问题:
    • 缩小范围向量的时间范围
    • 添加标签过滤以降低基数
    • 为复杂查询使用记录规则
    • 避免使用昂贵的正则模式
    • 考虑查询超时设置
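A few diagnostic queries for the issues above, sketched with the example metrics used throughout this document (the cardinality query can be expensive on large servers, so run it sparingly):

```promql
# 1. Empty results: is the target being scraped at all?
up{job="your-job"}

# 2. High cardinality: top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))

# 2. High cardinality: series count for one metric, split by a suspect label
count by (status_code) (http_requests_total)

# 3. Incorrect values: counter resets in the last hour
#    (non-zero output explains otherwise puzzling rate() results)
resets(http_requests_total[1h])
```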

Communication Guidelines

沟通指南

When generating queries:
  1. Explain the Plan: Always present a plain-English plan before generating
  2. Ask Questions: Use AskUserQuestion tool to gather requirements
  3. Confirm Intent: Verify the query matches user goals before finalizing
  4. Educate: Explain why certain functions or patterns are used
  5. Provide Context: Show how to interpret results
  6. Suggest Improvements: Offer optimizations or alternative approaches
  7. Validate Proactively: Always validate and fix issues
  8. Follow Up: Ask if adjustments are needed
生成查询时:
  1. 解释规划:生成前始终呈现通俗易懂的规划
  2. 提问:使用AskUserQuestion工具收集需求
  3. 确认意图:最终确定前验证查询是否符合用户目标
  4. 教育用户:解释为何使用特定函数或模式
  5. 提供上下文:展示如何解读结果
  6. 建议改进:提供优化或替代方案
  7. 主动验证:始终验证并修复问题
  8. 跟进:询问是否需要调整

Integration with devops-skills:promql-validator

与devops-skills:promql-validator集成

After generating any PromQL query, automatically invoke the devops-skills:promql-validator skill to ensure quality:
Steps:
1. Generate the PromQL query based on user requirements
2. Invoke devops-skills:promql-validator skill with the generated query
3. Review validation results (syntax, semantics, performance)
4. Fix any issues identified by the validator
5. Re-validate until all checks pass
6. Provide the final validated query with usage instructions
7. Ask user if further refinements are needed
This ensures all generated queries follow best practices and are production-ready.
生成任何PromQL查询后,自动调用devops-skills:promql-validator技能以确保质量:
步骤:
1. 根据用户需求生成PromQL查询
2. 使用生成的查询调用devops-skills:promql-validator技能
3. 查看验证结果(语法、语义、性能)
4. 修复验证器发现的问题
5. 重新验证直至所有检查通过
6. 提供最终验证后的查询及使用说明
7. 询问用户是否需要进一步调整
这确保所有生成的查询遵循最佳实践并可用于生产环境。

Resources

资源

IMPORTANT: Explicit Reference Consultation
When generating queries, you SHOULD explicitly read the relevant reference files using the Read tool and cite applicable best practices. This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.
重要提示:明确参考文档
生成查询时,应使用Read工具明确阅读相关参考文档,并引用适用的最佳实践。这确保生成的查询遵循文档化模式,并帮助用户理解为何推荐特定方法。

references/

references/

promql_functions.md
  • Comprehensive reference of all PromQL functions
  • Grouped by category (aggregation, math, time, histogram, etc.)
  • Usage examples for each function
  • Read this file when: implementing specific function requirements or when user asks about function behavior
promql_patterns.md
  • Common query patterns for typical monitoring scenarios
  • RED method patterns (Rate, Errors, Duration)
  • USE method patterns (Utilization, Saturation, Errors)
  • Alerting and recording rule patterns
  • Read this file when: implementing standard monitoring patterns like error rates, latency, or resource usage
best_practices.md
  • PromQL best practices and anti-patterns
  • Performance optimization guidelines
  • Cardinality management
  • Query structure recommendations
  • Read this file when: optimizing queries, reviewing for anti-patterns, or when cardinality concerns arise
metric_types.md
  • Detailed guide to Prometheus metric types
  • Counter, Gauge, Histogram, Summary
  • When to use each type
  • Appropriate functions for each type
  • Read this file when: clarifying metric type questions or determining appropriate functions for a metric
promql_functions.md
  • 所有PromQL函数的综合参考
  • 按类别分组(聚合、数学、时间、直方图等)
  • 每个函数的使用示例
  • 阅读场景:实现特定函数需求或用户询问函数行为时
promql_patterns.md
  • 常见监控场景的查询模式
  • RED方法模式(Rate、Errors、Duration)
  • USE方法模式(Utilization、Saturation、Errors)
  • 告警和记录规则模式
  • 阅读场景:实现标准监控模式如错误率、延迟或资源使用率时
best_practices.md
  • PromQL最佳实践与反模式
  • 性能优化指南
  • 基数管理
  • 查询结构建议
  • 阅读场景:优化查询、检查反模式或存在基数问题时
metric_types.md
  • Prometheus指标类型的详细指南
  • Counter、Gauge、Histogram、Summary
  • 各类型的适用场景
  • 各类型的合适函数
  • 阅读场景:澄清指标类型问题或确定指标的合适函数时

examples/

examples/

common_queries.promql
  • Collection of commonly-used PromQL queries
  • Request rate, error rate, latency queries
  • Resource usage queries
  • Availability and uptime queries
  • Can be copied and customized
red_method.promql
  • Complete RED method implementation
  • Request rate queries
  • Error rate queries
  • Duration/latency queries
use_method.promql
  • Complete USE method implementation
  • Utilization queries
  • Saturation queries
  • Error queries
alerting_rules.yaml
  • Example Prometheus alerting rules
  • Various threshold-based alerts
  • Best practices for alert expressions
recording_rules.yaml
  • Example Prometheus recording rules
  • Pre-aggregated metrics
  • Naming conventions
slo_patterns.promql
  • SLO, error budget, and burn rate queries
  • Multi-window, multi-burn-rate alerting patterns
  • Latency SLO compliance queries
kubernetes_patterns.promql
  • Kubernetes monitoring patterns
  • kube-state-metrics queries (pods, deployments, nodes)
  • cAdvisor container metrics (CPU, memory)
  • Vector matching and joins for Kubernetes
common_queries.promql
  • 常用PromQL查询集合
  • 请求速率、错误率、延迟查询
  • 资源使用率查询
  • 可用性和在线时长查询
  • 可复制并自定义
red_method.promql
  • 完整的RED方法实现
  • 请求速率查询
  • 错误率查询
  • 时长/延迟查询
use_method.promql
  • 完整的USE方法实现
  • 使用率查询
  • 饱和度查询
  • 错误查询
alerting_rules.yaml
  • Prometheus告警规则示例
  • 各种基于阈值的告警
  • 告警表达式最佳实践
recording_rules.yaml
  • Prometheus记录规则示例
  • 预聚合指标
  • 命名规范
slo_patterns.promql
  • SLO、错误预算和燃烧速率查询
  • 多窗口、多燃烧速率告警模式
  • 延迟SLO合规查询
kubernetes_patterns.promql
  • Kubernetes监控模式
  • kube-state-metrics查询(pod、部署、节点)
  • cAdvisor容器指标(CPU、内存)
  • Kubernetes的向量匹配与连接

Important Notes

重要注意事项

  1. Always Plan Interactively: Never generate a query without confirming the plan with the user
  2. Use AskUserQuestion: Leverage the tool to gather requirements and confirm plans
  3. Validate Everything: Always invoke devops-skills:promql-validator after generation
  4. Educate Users: Explain what the query does and why it's structured that way
  5. Consider Use Case: Tailor the query based on whether it's for dashboards, alerts, or analysis
  6. Think About Performance: Always include label filters and consider cardinality
  7. Follow Metric Types: Use appropriate functions for counters, gauges, and histograms
  8. Format for Readability: Use multi-line formatting for complex queries
  1. 始终交互式规划:未与用户确认规划前,切勿生成查询
  2. 使用AskUserQuestion:利用工具收集需求并确认规划
  3. 验证所有内容:生成后始终调用devops-skills:promql-validator
  4. 教育用户:解释查询功能和结构原因
  5. 考虑使用场景:根据仪表板、告警或分析场景调整查询
  6. 考虑性能:始终包含标签过滤并考虑基数
  7. 遵循指标类型:为counter、gauge和直方图使用合适的函数
  8. 格式化提升可读性:复杂查询使用多行格式

Success Criteria

成功标准

A successful query generation session should:
  1. Fully understand the user's monitoring goal
  2. Identify correct metrics and their types
  3. Present a clear plan in plain English
  4. Get user confirmation before generating code
  5. Generate a syntactically correct query
  6. Use appropriate functions for metric types
  7. Include specific label filters
  8. Pass devops-skills:promql-validator validation
  9. Provide clear usage instructions
  10. Offer customization guidance
成功的查询生成会话应:
  1. 完全理解用户的监控目标
  2. 识别正确的指标及其类型
  3. 呈现清晰的通俗易懂的规划
  4. 生成代码前获得用户确认
  5. 生成语法正确的查询
  6. 为指标类型使用合适的函数
  7. 包含精确的标签过滤
  8. 通过devops-skills:promql-validator验证
  9. 提供清晰的使用说明
  10. 提供自定义指导

Remember

请记住

The goal is to collaboratively plan and generate PromQL queries that exactly match user intentions. Always prioritize clarity, correctness, and performance. The interactive planning phase is the most important part of this skill—never skip it!
目标是协作规划并生成完全符合用户意图的PromQL查询。始终优先考虑清晰性、正确性和性能。交互式规划阶段是本技能最重要的部分——切勿跳过!