promql-generator


PromQL Query Generator


Overview


This skill provides a comprehensive, interactive workflow for generating production-ready PromQL queries with best practices built-in. Generate queries for monitoring dashboards, alerting rules, and ad-hoc analysis with an emphasis on user collaboration and planning before code generation.

When to Use This Skill


Invoke this skill when:
  • Creating new PromQL queries from scratch
  • Building monitoring dashboards (Grafana, Prometheus UI, etc.)
  • Implementing alerting rules for Prometheus Alertmanager
  • Analyzing metrics for troubleshooting or capacity planning
  • Converting monitoring requirements into PromQL expressions
  • Learning PromQL or teaching others
  • The user asks to "create", "generate", "build", or "write" PromQL queries
  • Working with Prometheus metrics (counters, gauges, histograms, summaries)
  • Implementing RED (Rate, Errors, Duration) or USE (Utilization, Saturation, Errors) metrics

Interactive Query Planning Workflow


CRITICAL: This skill emphasizes interactive planning before query generation. Always engage the user in a collaborative planning process to ensure the generated query matches their exact intentions.
Follow this workflow when generating PromQL queries:

Stage 1: Understand the Monitoring Goal


Start by understanding what the user wants to monitor or measure. Ask clarifying questions to gather requirements:
  1. Primary Goal: What are you trying to monitor or measure?
    • Request rate (requests per second)
    • Error rate (percentage of failed requests)
    • Latency/duration (response times, percentiles)
    • Resource usage (CPU, memory, disk, network)
    • Availability/uptime
    • Queue depth, saturation, throughput
    • Custom business metrics
  2. Use Case: What will this query be used for?
    • Dashboard visualization (Grafana, Prometheus UI)
    • Alerting rule (firing when threshold exceeded)
    • Ad-hoc troubleshooting/analysis
    • Recording rule (pre-computed aggregation)
    • Capacity planning or SLO tracking
  3. Context: Any additional context?
    • Service/application name
    • Team or project
    • Priority level
    • Existing metrics or naming conventions
Use the AskUserQuestion tool to gather this information if not provided.
When to Ask vs. Infer: If the user's initial request already clearly specifies the goal, use case, and context (e.g., "Create an alert for P95 latency > 500ms for payment-service"), you may acknowledge these details in your response instead of re-asking. Only ask clarifying questions for information that is missing or ambiguous.

Stage 2: Identify Available Metrics


Determine which metrics are available and relevant:
  1. Metric Discovery: What metrics are available?
    • Ask the user for metric names
    • If uncertain, suggest common naming patterns
    • Check for metric type indicators in the name:
      • `_total` suffix → Counter
      • `_bucket`, `_sum`, `_count` suffixes → Histogram
      • No suffix → Likely Gauge
      • `_created` suffix → Counter creation timestamp
  2. Metric Type Identification: Confirm the metric type(s)
    • Counter: Cumulative metric that only increases (or resets to zero)
      • Examples: `http_requests_total`, `errors_total`, `bytes_sent_total`
      • Use with: `rate()`, `irate()`, `increase()`
    • Gauge: Point-in-time value that can go up or down
      • Examples: `memory_usage_bytes`, `cpu_temperature_celsius`, `queue_length`
      • Use with: `avg_over_time()`, `min_over_time()`, `max_over_time()`, or directly
    • Histogram: Buckets of observations with cumulative counts
      • Examples: `http_request_duration_seconds_bucket`, `response_size_bytes_bucket`
      • Use with: `histogram_quantile()`, `rate()`
    • Summary: Pre-calculated quantiles with count and sum
      • Examples: `rpc_duration_seconds{quantile="0.95"}`
      • Use `_sum` and `_count` for averages; don't average quantiles
  3. Label Discovery: What labels are available on these metrics?
    • Common labels: `job`, `instance`, `environment`, `service`, `endpoint`, `status_code`, `method`
    • Ask which labels are important for filtering or grouping
Use the AskUserQuestion tool to confirm metric names, types, and available labels.
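When the user cannot name exact metrics, a few ad-hoc discovery queries in the Prometheus expression browser can help. These are generic sketches; the `http_request.*` prefix and metric names are illustrative assumptions:

```promql
# List series whose name matches a guessed prefix (regex on __name__)
{__name__=~"http_request.*"}

# Count series per job to see which jobs expose the metric
count by (job) (http_requests_total)

# Enumerate label combinations without returning sample values
group by (job, status_code) (http_requests_total)
```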

Stage 3: Determine Query Parameters


Gather specific requirements for the query.

Pre-confirmation for User-Provided Parameters


IMPORTANT: When the user has already specified parameters in their initial request (e.g., "5-minute window", "500ms threshold", "> 5% error rate"), you MUST:
  1. Acknowledge the provided values explicitly in your response
  2. Present them as pre-filled defaults in AskUserQuestion with the first option being "Use specified values"
  3. Allow quick confirmation rather than re-asking for information already given
Example: If user says "alert when P95 latency exceeds 500ms", use:
AskUserQuestion:
- Question: "Confirm the alert threshold?"
- Options:
  1. "500ms (as specified)" - Use the threshold from your request
  2. "Different threshold" - Let me specify a different value
This respects the user's input and speeds up the workflow while still allowing modifications.
  1. Time Range: What time window should the query cover?
    • Instant value (current)
    • Rate over time (`[5m]`, `[1h]`, `[1d]`)
    • For rate calculations: typically `[1m]` to `[5m]` for real-time, `[1h]` to `[1d]` for trends
    • Rule of thumb: Rate range should be at least 4x the scrape interval
  2. Label Filtering: Which labels should filter the data?
    • Exact matches: `job="api-server"`, `status_code="200"`
    • Negative matches: `status_code!="200"`
    • Regex matches: `instance=~"prod-.*"`
    • Multiple conditions: `{job="api", environment="production"}`
  3. Aggregation: Should the data be aggregated?
    • No aggregation: Return all time series as-is
    • Aggregate by labels: `sum by (job, endpoint)`, `avg by (instance)`
    • Aggregate without labels: `sum without (instance, pod)`, `avg without (job)`
    • Common aggregations: `sum`, `avg`, `max`, `min`, `count`, `topk`, `bottomk`
  4. Thresholds or Conditions: Are there specific conditions?
    • For alerting: threshold values (e.g., error rate > 5%)
    • For filtering: only show series above/below a value
    • For comparison: compare against historical data (offset)
Use the AskUserQuestion tool to gather or confirm these parameters. When the user has already provided values (e.g., "5-minute window", "> 5%"), present them as the default option for confirmation.
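Put together, a fully specified query combines a time range, label filters, an aggregation, and an optional threshold. A sketch using illustrative metric and label names:

```promql
# Per-endpoint request rate in production, kept only when above 100 req/s
sum by (endpoint) (
  rate(http_requests_total{job="api-server", environment="production"}[5m])
) > 100
```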

Stage 4: Present the Query Plan


BEFORE GENERATING ANY CODE, present a plain-English query plan and ask for user confirmation:

PromQL Query Plan


Based on your requirements, here's what the query will do:
Goal: [Describe the monitoring goal in plain English]
Query Structure:
  1. Start with metric:
    [metric_name]
  2. Filter by labels:
    {label1="value1", label2="value2"}
  3. Apply function:
    [function_name]([metric][time_range])
  4. Aggregate:
    [aggregation] by ([label_list])
  5. Additional operations: [any calculations, ratios, or transformations]
Expected Output:
  • Data type: [instant vector/scalar]
  • Labels in result: [list of labels]
  • Value represents: [what the number means]
  • Typical range: [expected value range]
Example Interpretation: If the query returns `0.05`, it means: [plain English explanation]
Does this match your intentions?
  • If yes, I'll generate the query and validate it
  • If no, let me know what needs to change

Use the **AskUserQuestion** tool to confirm the plan with options:
- "Yes, generate this query"
- "Modify [specific aspect]"
- "Show me alternative approaches"

Stage 5: Generate the PromQL Query


Once the user confirms the plan, generate the actual PromQL query following best practices.

IMPORTANT: Consult Reference Files Before Generating


Before writing any query code, you MUST:
  1. Read the relevant reference file(s) using the Read tool:
    • For histogram queries → Read `references/metric_types.md` (Histogram section)
    • For error/latency patterns → Read `references/promql_patterns.md` (RED method section)
    • For resource monitoring → Read `references/promql_patterns.md` (USE method section)
    • For optimization questions → Read `references/best_practices.md`
    • For specific functions → Read `references/promql_functions.md`
  2. Cite the applicable pattern or best practice in your response:
    As documented in references/promql_patterns.md (Pattern 3: Latency Percentile):
    # 95th percentile latency
    histogram_quantile(0.95, sum by (le) (rate(...)))
  3. Reference example files when generating similar queries:
    Based on examples/red_method.promql (lines 64-82):
    # P95 latency with proper histogram_quantile usage
This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.

Best Practices for Query Generation


  1. Always Use Label Filters
    ```promql
    # Good: Specific filtering reduces cardinality
    rate(http_requests_total{job="api-server", environment="prod"}[5m])

    # Bad: Matches all time series, high cardinality
    rate(http_requests_total[5m])
    ```
  2. Use Appropriate Functions for Metric Types
    ```promql
    # Counter: Use rate() or increase()
    rate(http_requests_total[5m])

    # Gauge: Use directly or with *_over_time()
    memory_usage_bytes
    avg_over_time(memory_usage_bytes[5m])

    # Histogram: Use histogram_quantile()
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    )
    ```
  3. Apply Aggregations with by() or without()
    ```promql
    # Aggregate by specific labels (keeps only these labels)
    sum by (job, endpoint) (rate(http_requests_total[5m]))

    # Aggregate without specific labels (removes these labels)
    sum without (instance, pod) (rate(http_requests_total[5m]))
    ```
  4. Use Exact Matches Over Regex When Possible
    ```promql
    # Good: Faster exact match
    http_requests_total{status_code="200"}

    # Bad: Slower regex match when not needed
    http_requests_total{status_code=~"200"}
    ```
  5. Calculate Ratios Properly
    ```promql
    # Error rate: errors / total requests
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    ```
  6. Use Recording Rules for Complex Queries
    • If a query is used frequently or is computationally expensive
    • Pre-aggregate data to reduce query load
    • Follow naming convention: `level:metric:operations`
  7. Format for Readability
    ```promql
    # Good: Multi-line for complex queries
    histogram_quantile(0.95,
      sum by (le, job) (
        rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
      )
    )
    ```
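As a sketch of the `level:metric:operations` naming convention from item 6 (rule name and expression are illustrative):

```yaml
groups:
  - name: example_recording_rules
    rules:
      # level = job, metric = http_requests, operations = rate5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```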

Common Query Patterns


**Pattern 1: Request Rate**
```promql
# Requests per second
rate(http_requests_total{job="api-server"}[5m])

# Total requests per second across all instances
sum(rate(http_requests_total{job="api-server"}[5m]))
```

**Pattern 2: Error Rate**
```promql
# Error ratio (0 to 1)
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))

# Error percentage (0 to 100)
(
  sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="api-server"}[5m]))
) * 100
```

**Pattern 3: Latency Percentile (Histogram)**
```promql
# 95th percentile latency
histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
  )
)
```

**Pattern 4: Resource Usage**
```promql
# Current memory usage
process_resident_memory_bytes{job="api-server"}

# Average CPU usage over 5 minutes (process_cpu_seconds_total is a counter,
# so use rate() rather than avg_over_time())
rate(process_cpu_seconds_total{job="api-server"}[5m])
```

**Pattern 5: Availability**
```promql
# Percentage of up instances
(
  count(up{job="api-server"} == 1)
  /
  count(up{job="api-server"})
) * 100
```

**Pattern 6: Saturation/Queue Depth**
```promql
# Average queue length
avg_over_time(queue_depth{job="worker"}[5m])

# Maximum queue depth in the last hour
max_over_time(queue_depth{job="worker"}[1h])
```

Stage 6: Validate the Generated Query


ALWAYS validate the generated query using the devops-skills:promql-validator skill:
After generating the query, automatically invoke:
Skill(devops-skills:promql-validator)

The devops-skills:promql-validator skill will:
1. Check syntax correctness
2. Validate semantic logic (correct functions for metric types)
3. Identify anti-patterns and inefficiencies
4. Suggest optimizations
5. Explain what the query does
6. Verify it matches user intent
Validation checklist:
  • Syntax is correct (balanced brackets, valid operators)
  • Metric type matches function usage
  • Label filters are specific enough
  • Aggregation is appropriate
  • Time ranges are reasonable
  • No known anti-patterns
  • Query is optimized for performance
If validation fails, fix issues and re-validate until all checks pass.
IMPORTANT: Display Validation Results to User
After running validation, you MUST display the structured results to the user in this format:

PromQL Validation Results


Syntax Check


  • Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR
  • Issues: [list any syntax errors]

Best Practices Check


  • Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ HAS ISSUES
  • Issues: [list any problems found]
  • Suggestions: [list optimization opportunities]

Query Explanation


  • What it measures: [plain English description]
  • Output labels: [list labels in result, or "None (scalar)"]
  • Expected result structure: [instant vector / scalar / etc.]

This transparency helps users understand the validation process and any recommendations.

Stage 7: Provide Usage Instructions


After successful generation and validation, provide the user with:
  1. The Final Query:
    ```promql
    [Generated and validated PromQL query]
    ```
  2. Query Explanation:
    • What the query measures
    • How to interpret the results
    • Expected value range
    • Labels in the output
  3. How to Use It:
    • For Dashboards: Copy into Grafana/Prometheus UI panel query
    • For Alerts: Integrate into Alertmanager rule with threshold
    • For Recording Rules: Add to Prometheus recording rule config
    • For Ad-hoc: Run directly in Prometheus expression browser
  4. Customization Notes:
    • Time ranges that might need adjustment
    • Labels to modify for different environments
    • Threshold values to tune
    • Alternative functions if requirements change
  5. Related Queries:
    • Suggest complementary queries
    • Mention recording rule opportunities
    • Recommend dashboard panels

Native Histograms (Prometheus 3.x+)


Native histograms are now stable in Prometheus 3.0+ (released November 2024). They offer significant advantages over classic histograms:
  • Sparse bucket representation with near-zero cost for empty buckets
  • No configuration of bucket boundaries during instrumentation
  • Coverage of the full float64 range
  • Efficient mergeability across histograms
  • Simpler query syntax
Important: Starting with Prometheus v3.8.0, native histograms are fully stable. However, scraping native histograms still requires explicit activation via the `scrape_native_histograms` configuration setting. Starting with v3.9, no feature flag is needed, but `scrape_native_histograms` must still be set explicitly.

Native vs Classic Histogram Syntax


```promql
# Classic histogram (requires _bucket suffix and le label)
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Native histogram (simpler - no _bucket suffix, no le label needed)
histogram_quantile(0.95,
  sum by (job) (rate(http_request_duration_seconds[5m]))
)
```

Native Histogram Functions


```promql
# Get observation count rate from native histogram
histogram_count(rate(http_request_duration_seconds[5m]))

# Get sum of observations from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))

# Calculate fraction of observations between two values
histogram_fraction(0, 0.1, rate(http_request_duration_seconds[5m]))

# Average request duration from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))
```

Detecting Native vs Classic Histograms


Native histograms are identified by:
  • No `_bucket` suffix on the metric name
  • No `le` label in the time series
  • The metric stores histogram data directly (not separate bucket counters)
When querying, check if your Prometheus instance has native histograms enabled:

```yaml
# prometheus.yml - Enable native histogram scraping
scrape_configs:
  - job_name: 'my-app'
    scrape_native_histograms: true  # Prometheus 3.x+
```

Custom Bucket Native Histograms (NHCB)


Prometheus 3.4+ supports custom bucket native histograms (schema -53), allowing classic histogram to native histogram conversion. This is a key migration path for users with existing classic histograms.
Benefits of NHCB:
  • Keep existing instrumentation (no code changes needed)
  • Store classic histograms as native histograms for lower costs
  • Query with native histogram syntax
  • Improved reliability and compression
Configuration (Prometheus 3.4+):
```yaml
# prometheus.yml - Convert classic histograms to NHCB on scrape
scrape_configs:
  - job_name: 'my-app'
    convert_classic_histograms_to_nhcb: true  # Prometheus 3.4+
```

**Querying NHCB**:
```promql
# Query NHCB metrics the same way as native histograms
histogram_quantile(0.95, sum by (job) (rate(http_request_duration_seconds[5m])))

# histogram_fraction also works with NHCB (Prometheus 3.4+)
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[5m]))
```

**Note**: Schema -53 indicates custom bucket boundaries. Histograms with different custom bucket boundaries are generally not mergeable with each other.

---

SLO, Error Budget, and Burn Rate Patterns


Service Level Objectives (SLOs) are critical for modern SRE practices. These patterns help implement SLO-based monitoring and alerting.

Error Budget Calculation


```promql
# Error budget remaining (for 99.9% SLO over 30 days)
# Returns value between 0 and 1 (1 = full budget, 0 = exhausted)
1 - (
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{job="api"}[30d]))
) / 0.001  # 0.001 = 1 - 0.999 (allowed error rate)

# Simplified: Availability over 30 days
sum(rate(http_requests_total{job="api", status_code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
```
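A quick worked example of the error-budget formula (the observed ratio is illustrative):

```promql
# Observed 30-day error ratio: 0.0005 (0.05%)
# Budget remaining = 1 - 0.0005 / 0.001 = 0.5
# i.e. half of the 99.9% SLO's error budget is still available
```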

Burn Rate Calculation

燃烧速率计算

Burn rate measures how fast you're consuming error budget. A burn rate of 1 means you'll exhaust the budget exactly at the end of the SLO window.
```promql
# Current burn rate (1 hour window, 99.9% SLO)
# Burn rate = (current error rate) / (allowed error rate)
(
  sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001  # 0.001 = allowed error rate for 99.9% SLO

# Burn rate > 1 means consuming budget faster than allowed
# Burn rate of 14.4 consumes 2% of monthly budget in 1 hour
```
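The 14.4 figure comes from window arithmetic: a 30-day SLO window is 720 hours, so consuming 2% of the budget in 1 hour means burning at 0.02 × 720 / 1 = 14.4 times the sustainable rate:

```promql
# burn_rate = (budget fraction consumed) * (SLO window) / (alert window)
# Page alert:   0.02 * 720h / 1h = 14.4
# Ticket alert: 0.05 * 720h / 6h = 6
```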

Multi-Window, Multi-Burn-Rate Alerts (Google SRE Standard)

多窗口、多燃烧速率告警(Google SRE标准)

The recommended approach for SLO alerting uses multiple windows to balance detection speed and precision:
promql
undefined
SLO告警的推荐方法使用多个窗口,平衡检测速度和精度:
promql
undefined

Page-level alert: 2% budget in 1 hour (burn rate 14.4)

页面级告警:1小时内消耗2%预算(燃烧速率14.4)

Long window (1h) AND short window (5m) must both exceed threshold

长窗口(1h)和短窗口(5m)必须同时超过阈值

( ( sum(rate(http_requests_total{job="api", status_code="5.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h])) ) > 14.4 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code="5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m])) ) > 14.4 * 0.001 )
( ( sum(rate(http_requests_total{job="api", status_code="5.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h])) ) > 14.4 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code="5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m])) ) > 14.4 * 0.001 )

Ticket-level alert: 5% budget in 6 hours (burn rate 6)

工单级告警:6小时内消耗5%预算(燃烧速率6)

( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[6h])) / sum(rate(http_requests_total{job="api"}[6h])) ) > 6 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[30m])) / sum(rate(http_requests_total{job="api"}[30m])) ) > 6 * 0.001 )
( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[6h])) / sum(rate(http_requests_total{job="api"}[6h])) ) > 6 * 0.001 ) and ( ( sum(rate(http_requests_total{job="api", status_code=~"5.."}[30m])) / sum(rate(http_requests_total{job="api"}[30m])) ) > 6 * 0.001 )
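As a sketch, the page-level condition can be wrapped in a Prometheus alerting rule built on the recording rules defined below; the group name, alert name, and labels here are illustrative, not prescribed:

```yaml
groups:
  - name: slo_burn_rate_alerts        # illustrative name
    rules:
      - alert: ErrorBudgetBurnPage    # illustrative name
        expr: |
          (
            job:slo_errors_per_request:ratio_rate1h > 14.4 * 0.001
          and
            job:slo_errors_per_request:ratio_rate5m > 14.4 * 0.001
          )
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >= 14.4x the allowed rate (2% of the 30-day budget per hour)"
```

Because both windows are pre-computed by recording rules, the alert expression stays cheap to evaluate at short intervals.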

SLO Recording Rules

SLO记录规则

Pre-compute SLO metrics for efficient alerting:
预计算SLO指标以实现高效告警:

Recording rules for SLO calculations

SLO计算记录规则

```yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Error ratio over different windows
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Availability (success ratio)
      - record: job:slo_availability:ratio_rate1h
        expr: |
          1 - job:slo_errors_per_request:ratio_rate1h
```
```yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # 不同窗口的错误比率
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(http_requests_total[1h]))
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # 可用性(成功比率)
      - record: job:slo_availability:ratio_rate1h
        expr: |
          1 - job:slo_errors_per_request:ratio_rate1h
```

Latency SLO Queries

延迟SLO查询


Percentage of requests faster than SLO target (200ms)

快于SLO目标(200ms)的请求百分比

( sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="api"}[5m])) ) * 100
( sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="api"}[5m])) ) * 100

Requests violating latency SLO (slower than 500ms)

违反延迟SLO的请求(慢于500ms)

( sum(rate(http_request_duration_seconds_count{job="api"}[5m])) - sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m])) ) / sum(rate(http_request_duration_seconds_count{job="api"}[5m]))

( sum(rate(http_request_duration_seconds_count{job="api"}[5m])) - sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m])) ) / sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
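The same bucket counters also support an Apdex-style score, a pattern shown in the Prometheus histogram documentation; the 0.2s "satisfied" and 0.5s "tolerated" thresholds below are illustrative:

```promql
# Apdex-like score: satisfied requests count fully, tolerated requests count half
(
  sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
```

Note this works because histogram buckets are cumulative: the `le="0.5"` bucket already includes every request counted in `le="0.2"`.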

Burn Rate Reference Table

燃烧速率参考表

| Burn Rate | Budget Consumed | Time to Exhaust 30-day Budget | Alert Severity |
| --- | --- | --- | --- |
| 1 | 100% over 30d | 30 days | None |
| 2 | 100% over 15d | 15 days | Low |
| 6 | 5% in 6h | 5 days | Ticket |
| 14.4 | 2% in 1h | ~2 days | Page |
| 36 | 5% in 1h | ~20 hours | Page (urgent) |

| 燃烧速率 | 消耗预算占比 | 耗尽30天预算所需时间 | 告警级别 |
| --- | --- | --- | --- |
| 1 | 30天内100% | 30天 | 无 |
| 2 | 15天内100% | 15天 | 低 |
| 6 | 6小时内5% | 5天 | 工单 |
| 14.4 | 1小时内2% | ~2天 | 页面 |
| 36 | 1小时内5% | ~20小时 | 紧急页面 |
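Both derived columns in the table follow from one relationship between burn rate $B$, observation window $w$, and the SLO period $T$ (30 days = 720 hours):

```latex
\text{budget consumed} = B \cdot \frac{w}{T}, \qquad
\text{time to exhaust} = \frac{T}{B}
% Example: B = 14.4, w = 1\,\text{h}:
%   14.4 \cdot \tfrac{1}{720} = 0.02 = 2\%, \qquad
%   \tfrac{720}{14.4} = 50\,\text{h} \approx 2\,\text{days}
```

The same arithmetic checks the other rows, e.g. $B = 36$: $36 \cdot \tfrac{1}{720} = 5\%$ per hour and $\tfrac{720}{36} = 20$ hours to exhaustion.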

Advanced Query Techniques

高级查询技巧

Using Subqueries

使用子查询

Subqueries enable complex time-based calculations:
子查询支持复杂的时间计算:

Maximum 5-minute rate over the past 30 minutes

过去30分钟内的最大5分钟速率

max_over_time( rate(http_requests_total[5m])[30m:1m] )

**Syntax**: `<query>[<range>:<resolution>]`
- `<range>`: Time window to evaluate over
- `<resolution>`: Step size between evaluations
max_over_time( rate(http_requests_total[5m])[30m:1m] )

**语法**:`<query>[<range>:<resolution>]`
- `<range>`:评估的时间窗口
- `<resolution>`:评估的步长

Using Offset Modifier

使用Offset修饰符

Compare current data with historical data:
将当前数据与历史数据对比:

Compare current rate with rate from 1 week ago

将当前速率与1周前的速率对比

rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1w)

rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1w)
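A common variant expresses the comparison as a percentage change rather than an absolute difference (aggregated with `sum` so the two sides match without label juggling):

```promql
# Week-over-week change in total request rate, as a percentage
(
  sum(rate(http_requests_total[5m]))
  /
  sum(rate(http_requests_total[5m] offset 1w))
  - 1
) * 100
```

A value of 25 means traffic is 25% higher than at the same time last week; negative values mean a drop.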

Using @ Modifier

使用@修饰符

Query metrics at specific timestamps:
查询特定时间戳的指标:

Rate at the end of the range query

范围查询结束时的速率

rate(http_requests_total[5m] @ end())
rate(http_requests_total[5m] @ end())

Rate at specific Unix timestamp

特定Unix时间戳的速率

rate(http_requests_total[5m] @ 1609459200)
rate(http_requests_total[5m] @ 1609459200)

Binary Operators and Vector Matching

二元运算符与向量匹配

Combine metrics with operators and control label matching:
使用运算符组合指标并控制标签匹配:

One-to-one matching (default)

一对一匹配(默认)

metric_a + metric_b
metric_a + metric_b

Many-to-one with group_left

多对一匹配,使用group_left

rate(http_requests_total[5m]) * on (job, instance) group_left (version) app_version_info
rate(http_requests_total[5m]) * on (job, instance) group_left (version) app_version_info

Ignoring specific labels

忽略特定标签

metric_a + ignoring(instance) metric_b
metric_a + ignoring(instance) metric_b

Logical Operators

逻辑运算符

Filter time series based on conditions:
根据条件过滤时间序列:

Return series only where value > 100

仅返回值>100的序列

http_requests_total > 100
http_requests_total > 100

Return series present in both

返回同时存在于两个指标中的序列

metric_a and metric_b
metric_a and metric_b

Return series in A but not in B

返回存在于A但不存在于B的序列

metric_a unless metric_b
metric_a unless metric_b
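These operators compose naturally. A sketch that suppresses an error-ratio condition during low-traffic periods (the 1 req/s floor is an arbitrary illustration, not a recommendation):

```promql
# Error ratio > 5%, but only while total traffic exceeds 1 req/s
(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.05
and
sum(rate(http_requests_total[5m])) > 1
```

Both sides aggregate away all labels, so the `and` matches on the empty label set; with grouped aggregations, add matching `by (...)` clauses to both operands.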

Documentation Lookup

文档查询

If the user asks about specific Prometheus features, operators, or custom metrics:
  1. Try context7 MCP first (preferred):
    Use mcp__context7__resolve-library-id with "prometheus"
    Then use mcp__context7__get-library-docs with:
    - context7CompatibleLibraryID: /prometheus/docs
    - topic: [specific feature, function, or operator]
    - page: 1 (fetch additional pages if needed)
  2. Fallback to WebSearch:
    Search query pattern:
    "Prometheus PromQL [function/operator/feature] documentation [version] examples"
    
    Examples:
    "Prometheus PromQL rate function documentation examples"
    "Prometheus PromQL histogram_quantile documentation best practices"
    "Prometheus PromQL aggregation operators documentation"
若用户询问特定Prometheus功能、运算符或自定义指标:
  1. 优先使用context7 MCP:
    使用mcp__context7__resolve-library-id,参数为"prometheus"
    然后使用mcp__context7__get-library-docs,参数:
    - context7CompatibleLibraryID: /prometheus/docs
    - topic: [特定功能、函数或运算符]
    - page: 1(若需要,获取更多页面)
  2. 备用方案:Web搜索:
    搜索查询模式:
    "Prometheus PromQL [函数/运算符/功能] 文档 [版本] 示例"
    
    示例:
    "Prometheus PromQL rate函数 文档 示例"
    "Prometheus PromQL histogram_quantile 文档 最佳实践"
    "Prometheus PromQL 聚合运算符 文档"

Common Monitoring Scenarios

常见监控场景

RED Method (for Request-Driven Services)

RED方法(适用于请求驱动型服务)

  1. Rate: Request throughput
    promql
    sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)
  2. Errors: Error rate
    promql
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))
  3. Duration: Latency percentiles
    promql
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
    )
  1. Rate:请求吞吐量
    promql
    sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)
  2. Errors:错误率
    promql
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))
  3. Duration:延迟分位数
    promql
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
    )

USE Method (for Resources)

USE方法(适用于资源)

  1. Utilization: Resource usage percentage
    promql
    (
      sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      /
      count(node_cpu_seconds_total{mode="idle"})
    ) * 100
  2. Saturation: Queue depth or resource contention
    promql
    avg_over_time(node_load1[5m])
  3. Errors: Error counters
    promql
    rate(node_network_receive_errs_total[5m])
  1. Utilization:资源使用率百分比
    promql
    (
      sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      /
      count(node_cpu_seconds_total{mode="idle"})
    ) * 100
  2. Saturation:队列深度或资源竞争
    promql
    avg_over_time(node_load1[5m])
  3. Errors:错误计数器
    promql
    rate(node_network_receive_errs_total[5m])

Alerting Rules

告警规则

When generating queries for alerting:
  1. Include the Threshold: Make the condition explicit
    promql
    # Alert when error rate exceeds 5%
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  2. Use Boolean Operators: Return 1 (fire) or 0 (no alert)
    promql
    # Returns 1 when memory usage > 90%
    (process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.9
  3. Consider `for` Duration: Alerts typically use the `for` clause
    yaml
    alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status_code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      ) > 0.05
    for: 10m  # Only fire after 10 minutes of continuous violation
生成告警查询时:
  1. 包含阈值:明确条件
    promql
    # 错误率超过5%时告警
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  2. 使用布尔运算符:返回1(触发)或0(不告警)
    promql
    # 内存使用率>90%时返回1
    (process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.9
  3. 考虑持续时间:告警通常使用`for`子句
    yaml
    alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status_code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      ) > 0.05
    for: 10m  # 持续违反阈值10分钟后才触发

Recording Rules

记录规则

When generating queries for recording rules:
  1. Follow Naming Convention:
    level:metric:operations
    yaml
    # level: aggregation level (job, instance, etc.)
    # metric: base metric name
    # operations: functions applied
    
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
  2. Pre-aggregate Expensive Queries:
    yaml
    # Recording rule for frequently-used latency query
    - record: job_endpoint:http_request_duration_seconds:p95
      expr: |
        histogram_quantile(0.95,
          sum by (job, endpoint, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
        )
  3. Use Recorded Metrics in Dashboards:
    promql
    # Instead of expensive query, use pre-recorded metric
    job_endpoint:http_request_duration_seconds:p95{job="api-server"}
生成记录规则查询时:
  1. 遵循命名规范
    level:metric:operations
    yaml
    # level: 聚合级别(job、instance等)
    # metric: 基础指标名称
    # operations: 应用的函数
    
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
  2. 预聚合复杂查询:
    yaml
    # 频繁使用的延迟查询的记录规则
    - record: job_endpoint:http_request_duration_seconds:p95
      expr: |
        histogram_quantile(0.95,
          sum by (job, endpoint, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
        )
  3. 在仪表板中使用预记录指标:
    promql
    # 不使用复杂查询,而是使用预记录指标
    job_endpoint:http_request_duration_seconds:p95{job="api-server"}

Error Handling

错误处理

Common Issues and Solutions

常见问题与解决方案

  1. Empty Results:
    • Check if metrics exist:
      up{job="your-job"}
    • Verify label filters are correct
    • Check time range is appropriate
    • Confirm metric is being scraped
  2. Too Many Series (High Cardinality):
    • Add more specific label filters
    • Use aggregation to reduce series count
    • Consider using recording rules
    • Check for label explosion (dynamic labels)
  3. Incorrect Values:
    • Verify metric type (counter vs gauge)
    • Check function usage (rate on counters, not gauges)
    • Verify time range is appropriate
    • Check for counter resets
  4. Performance Issues:
    • Reduce time range for range vectors
    • Add label filters to reduce cardinality
    • Use recording rules for complex queries
    • Avoid expensive regex patterns
    • Consider query timeout settings
  1. 结果为空:
    • 检查指标是否存在:
      up{job="your-job"}
    • 验证标签过滤是否正确
    • 检查时间范围是否合适
    • 确认指标正在被抓取
  2. 序列过多(高基数):
    • 添加更精确的标签过滤
    • 使用聚合减少序列数量
    • 考虑使用记录规则
    • 检查是否存在标签爆炸(动态标签)
  3. 数值不正确:
    • 验证指标类型(counter vs gauge)
    • 检查函数使用(rate用于counter,而非gauge)
    • 验证时间范围是否合适
    • 检查是否存在counter重置
  4. 性能问题:
    • 缩小范围向量的时间范围
    • 添加标签过滤以降低基数
    • 为复杂查询使用记录规则
    • 避免使用昂贵的正则模式
    • 考虑查询超时设置
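A few diagnostic queries for the issues above, sketched with the example metrics used throughout this document (the cardinality query can be expensive on large servers, so run it sparingly):

```promql
# 1. Empty results: is the target being scraped at all?
up{job="your-job"}

# 2. High cardinality: top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))

# 2. High cardinality: series count for one metric, split by a suspect label
count by (status_code) (http_requests_total)

# 3. Incorrect values: counter resets in the last hour
#    (non-zero output explains otherwise puzzling rate() results)
resets(http_requests_total[1h])
```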

Communication Guidelines

沟通指南

When generating queries:
  1. Explain the Plan: Always present a plain-English plan before generating
  2. Ask Questions: Use AskUserQuestion tool to gather requirements
  3. Confirm Intent: Verify the query matches user goals before finalizing
  4. Educate: Explain why certain functions or patterns are used
  5. Provide Context: Show how to interpret results
  6. Suggest Improvements: Offer optimizations or alternative approaches
  7. Validate Proactively: Always validate and fix issues
  8. Follow Up: Ask if adjustments are needed
生成查询时:
  1. 解释规划:生成前始终呈现通俗易懂的规划
  2. 提问:使用AskUserQuestion工具收集需求
  3. 确认意图:最终确定前验证查询是否符合用户目标
  4. 教育用户:解释为何使用特定函数或模式
  5. 提供上下文:展示如何解读结果
  6. 建议改进:提供优化或替代方案
  7. 主动验证:始终验证并修复问题
  8. 跟进:询问是否需要调整

Integration with devops-skills:promql-validator

与devops-skills:promql-validator集成

After generating any PromQL query, automatically invoke the devops-skills:promql-validator skill to ensure quality:
Steps:
1. Generate the PromQL query based on user requirements
2. Invoke devops-skills:promql-validator skill with the generated query
3. Review validation results (syntax, semantics, performance)
4. Fix any issues identified by the validator
5. Re-validate until all checks pass
6. Provide the final validated query with usage instructions
7. Ask user if further refinements are needed
This ensures all generated queries follow best practices and are production-ready.
生成任何PromQL查询后,自动调用devops-skills:promql-validator技能以确保质量:
步骤:
1. 根据用户需求生成PromQL查询
2. 使用生成的查询调用devops-skills:promql-validator技能
3. 查看验证结果(语法、语义、性能)
4. 修复验证器发现的问题
5. 重新验证直至所有检查通过
6. 提供最终验证后的查询及使用说明
7. 询问用户是否需要进一步调整
这确保所有生成的查询遵循最佳实践并可用于生产环境。

Resources

资源

IMPORTANT: Explicit Reference Consultation
When generating queries, you SHOULD explicitly read the relevant reference files using the Read tool and cite applicable best practices. This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.
重要提示:明确参考文档
生成查询时,应使用Read工具明确阅读相关参考文档,并引用适用的最佳实践。这确保生成的查询遵循文档化模式,并帮助用户理解为何推荐特定方法。

references/

references/

promql_functions.md
  • Comprehensive reference of all PromQL functions
  • Grouped by category (aggregation, math, time, histogram, etc.)
  • Usage examples for each function
  • Read this file when: implementing specific function requirements or when user asks about function behavior
promql_patterns.md
  • Common query patterns for typical monitoring scenarios
  • RED method patterns (Rate, Errors, Duration)
  • USE method patterns (Utilization, Saturation, Errors)
  • Alerting and recording rule patterns
  • Read this file when: implementing standard monitoring patterns like error rates, latency, or resource usage
best_practices.md
  • PromQL best practices and anti-patterns
  • Performance optimization guidelines
  • Cardinality management
  • Query structure recommendations
  • Read this file when: optimizing queries, reviewing for anti-patterns, or when cardinality concerns arise
metric_types.md
  • Detailed guide to Prometheus metric types
  • Counter, Gauge, Histogram, Summary
  • When to use each type
  • Appropriate functions for each type
  • Read this file when: clarifying metric type questions or determining appropriate functions for a metric
promql_functions.md
  • 所有PromQL函数的综合参考
  • 按类别分组(聚合、数学、时间、直方图等)
  • 每个函数的使用示例
  • 阅读场景:实现特定函数需求或用户询问函数行为时
promql_patterns.md
  • 常见监控场景的查询模式
  • RED方法模式(Rate、Errors、Duration)
  • USE方法模式(Utilization、Saturation、Errors)
  • 告警和记录规则模式
  • 阅读场景:实现标准监控模式如错误率、延迟或资源使用率时
best_practices.md
  • PromQL最佳实践与反模式
  • 性能优化指南
  • 基数管理
  • 查询结构建议
  • 阅读场景:优化查询、检查反模式或存在基数问题时
metric_types.md
  • Prometheus指标类型的详细指南
  • Counter、Gauge、Histogram、Summary
  • 各类型的适用场景
  • 各类型的合适函数
  • 阅读场景:澄清指标类型问题或确定指标的合适函数时

examples/

examples/

common_queries.promql
  • Collection of commonly-used PromQL queries
  • Request rate, error rate, latency queries
  • Resource usage queries
  • Availability and uptime queries
  • Can be copied and customized
red_method.promql
  • Complete RED method implementation
  • Request rate queries
  • Error rate queries
  • Duration/latency queries
use_method.promql
  • Complete USE method implementation
  • Utilization queries
  • Saturation queries
  • Error queries
alerting_rules.yaml
  • Example Prometheus alerting rules
  • Various threshold-based alerts
  • Best practices for alert expressions
recording_rules.yaml
  • Example Prometheus recording rules
  • Pre-aggregated metrics
  • Naming conventions
slo_patterns.promql
  • SLO, error budget, and burn rate queries
  • Multi-window, multi-burn-rate alerting patterns
  • Latency SLO compliance queries
kubernetes_patterns.promql
  • Kubernetes monitoring patterns
  • kube-state-metrics queries (pods, deployments, nodes)
  • cAdvisor container metrics (CPU, memory)
  • Vector matching and joins for Kubernetes
common_queries.promql
  • 常用PromQL查询集合
  • 请求速率、错误率、延迟查询
  • 资源使用率查询
  • 可用性和在线时长查询
  • 可复制并自定义
red_method.promql
  • 完整的RED方法实现
  • 请求速率查询
  • 错误率查询
  • 时长/延迟查询
use_method.promql
  • 完整的USE方法实现
  • 使用率查询
  • 饱和度查询
  • 错误查询
alerting_rules.yaml
  • Prometheus告警规则示例
  • 各种基于阈值的告警
  • 告警表达式最佳实践
recording_rules.yaml
  • Prometheus记录规则示例
  • 预聚合指标
  • 命名规范
slo_patterns.promql
  • SLO、错误预算和燃烧速率查询
  • 多窗口、多燃烧速率告警模式
  • 延迟SLO合规查询
kubernetes_patterns.promql
  • Kubernetes监控模式
  • kube-state-metrics查询(pod、部署、节点)
  • cAdvisor容器指标(CPU、内存)
  • Kubernetes的向量匹配与连接

Important Notes

重要注意事项

  1. Always Plan Interactively: Never generate a query without confirming the plan with the user
  2. Use AskUserQuestion: Leverage the tool to gather requirements and confirm plans
  3. Validate Everything: Always invoke devops-skills:promql-validator after generation
  4. Educate Users: Explain what the query does and why it's structured that way
  5. Consider Use Case: Tailor the query based on whether it's for dashboards, alerts, or analysis
  6. Think About Performance: Always include label filters and consider cardinality
  7. Follow Metric Types: Use appropriate functions for counters, gauges, and histograms
  8. Format for Readability: Use multi-line formatting for complex queries
  1. 始终交互式规划:未与用户确认规划前,切勿生成查询
  2. 使用AskUserQuestion:利用工具收集需求并确认规划
  3. 验证所有内容:生成后始终调用devops-skills:promql-validator
  4. 教育用户:解释查询功能和结构原因
  5. 考虑使用场景:根据仪表板、告警或分析场景调整查询
  6. 考虑性能:始终包含标签过滤并考虑基数
  7. 遵循指标类型:为counter、gauge和直方图使用合适的函数
  8. 格式化提升可读性:复杂查询使用多行格式

Success Criteria

成功标准

A successful query generation session should:
  1. Fully understand the user's monitoring goal
  2. Identify correct metrics and their types
  3. Present a clear plan in plain English
  4. Get user confirmation before generating code
  5. Generate a syntactically correct query
  6. Use appropriate functions for metric types
  7. Include specific label filters
  8. Pass devops-skills:promql-validator validation
  9. Provide clear usage instructions
  10. Offer customization guidance
成功的查询生成会话应:
  1. 完全理解用户的监控目标
  2. 识别正确的指标及其类型
  3. 呈现清晰的通俗易懂的规划
  4. 生成代码前获得用户确认
  5. 生成语法正确的查询
  6. 为指标类型使用合适的函数
  7. 包含精确的标签过滤
  8. 通过devops-skills:promql-validator验证
  9. 提供清晰的使用说明
  10. 提供自定义指导

Remember

请记住

The goal is to collaboratively plan and generate PromQL queries that exactly match user intentions. Always prioritize clarity, correctness, and performance. The interactive planning phase is the most important part of this skill—never skip it!
目标是协作规划并生成完全符合用户意图的PromQL查询。始终优先考虑清晰性、正确性和性能。交互式规划阶段是本技能最重要的部分——切勿跳过!