cx-metrics-query


Metrics Query Skill


Use this skill to investigate production issues, answer performance questions, and explore metrics data using the `cx metrics` CLI commands. The workflow guides metric discovery, label exploration, and PromQL query construction.

CLI Commands

All metrics operations use `cx metrics` with four subcommands:

| Command | Purpose | Key flags |
| --- | --- | --- |
| `cx metrics search --name <pattern>` | Find metrics by name (wildcard or substring) | `--name` |
| `cx metrics get-labels <metric>` | List available label names for a metric | - |
| `cx metrics query '<expr>'` | Instant PromQL query (single point in time) | `--time <timestamp>` |
| `cx metrics query-range '<expr>'` | Range PromQL query (time series) | `--start`, `--end`, `--step` |

Output format: append `-o json` or `-o agents` to any command for machine-readable output.

Search examples

```bash
# Exact substring match
cx metrics search --name http_requests

# Wildcard: find all CPU metrics
cx metrics search --name '*cpu*'

# List all metrics
cx metrics search --name '*'
```

Instant query examples

```bash
# Current state
cx metrics query 'up'

# At a specific time
cx metrics query 'rate(http_requests_total[5m])' --time 2024-01-01T12:00:00Z

# With output for further processing
cx metrics query 'sum by (service) (rate(http_errors_total[5m]))' -o agents
```

Range query examples

```bash
# Last hour, default step (1m)
cx metrics query-range 'rate(http_requests_total[5m])'

# Custom window and step
cx metrics query-range 'sum by (service) (rate(http_requests_total[5m]))' \
  --start now-6h --end now --step 5m

# Daily aggregation over the last week
cx metrics query-range 'max by () (max_over_time(cpu_usage[1d]))' \
  --start now-7d --end now --step 1d
```

Label discovery example

```bash
cx metrics get-labels http_requests_total
# Returns: job, instance, method, route, status_code, ...
```

Time Syntax

All time arguments accept:
  • Relative: `now`, `now-1h`, `now-30m`, `now-2d`, `now-1w`
  • Absolute: RFC3339/ISO 8601, e.g. `2024-01-01T00:00:00Z`
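As a rough illustration of the two accepted forms, the sketch below checks a time argument before it is passed to `--time`, `--start`, or `--end`. The function name `is_time_arg` is hypothetical, and the patterns cover only the forms listed above (units `m`, `h`, `d`, `w`, and UTC timestamps ending in `Z`); the CLI itself may accept more.

```bash
# Hypothetical helper: succeeds if "$1" looks like one of the documented time forms.
# Assumption: only `now`, `now-<n><m|h|d|w>`, and UTC RFC3339 timestamps are checked;
# the real CLI may accept other units or timezone offsets.
is_time_arg() {
  case "$1" in
    now) return 0 ;;
  esac
  printf '%s\n' "$1" | grep -Eq \
    -e '^now-[0-9]+[mhdw]$' \
    -e '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}
```

A script could call `is_time_arg "$start" || echo "unexpected time format: $start" >&2` before building a `query-range` command.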

Investigation Workflow


1. Initial Assessment

When given a vague problem, ask 1–2 focused clarifying questions before proceeding:
  • What exactly is failing or behaving unexpectedly?
  • When did it start? What is the affected time window?
If the question is already specific enough, start investigating immediately.

2. Metric Discovery

Always start by searching for relevant metrics before querying:

```bash
# Try domain-specific patterns first
cx metrics search --name '*http*'
cx metrics search --name '*error*'
cx metrics search --name '*latency*'
cx metrics search --name '*cpu*'
cx metrics search --name '*memory*'

# If nothing is found, broaden the search
cx metrics search --name '*request*'
cx metrics search --name '*'  # full list as last resort
```

When two similar metrics are found and one is suffixed with `_count`, prefer the one without the suffix - `_count` typically tracks the number of observations, not the measured value itself.
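The `_count` rule above can be applied mechanically to a list of candidate names. The following is a minimal sketch, assuming the metric names have already been extracted one per line (the exact output layout of `cx metrics search` is not described here); the function name `prefer_base_metrics` is hypothetical.

```bash
# Hypothetical helper: reads metric names (one per line) on stdin and drops any
# "<name>_count" entry whose base "<name>" is also present in the input.
# A "_count" metric with no matching base name is kept.
prefer_base_metrics() {
  awk '
    { names[NR] = $0; seen[$0] = 1 }
    END {
      for (i = 1; i <= NR; i++) {
        n = names[i]
        base = n
        # sub() returns 1 only when the name ended in "_count".
        if (sub(/_count$/, "", base) && (base in seen)) continue
        print n
      }
    }'
}
```

Given `latency_seconds` and `latency_seconds_count` on stdin, only `latency_seconds` is printed.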

3. Label Discovery

Once a relevant metric is identified, discover its labels before filtering:

```bash
cx metrics get-labels <metric_name>
```

Use the returned label names to build precise PromQL filters. Note: label values are not directly queryable via the CLI - infer them from query results or domain knowledge.
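One common PromQL idiom for inferring label values from query results is to aggregate by the label of interest: each series in the result then carries one distinct value. The sketch below only builds the expression; the helper name `label_values_expr` is hypothetical.

```bash
# Hypothetical helper: prints a PromQL expression whose result has one series
# per distinct value of the given label. Arguments: <metric> <label>
label_values_expr() {
  printf 'count by (%s) (%s)' "$2" "$1"
}
```

For example, `cx metrics query "$(label_values_expr http_requests_total status_code)"` would list the observed `status_code` values, assuming the metric carries that label.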

4. Query Construction & Execution

Choose the right query type:
  • Instant query (`cx metrics query`) - use for current state, single values, or absolute aggregations over a window. Use `--time` to query historical data at a specific moment.
  • Range query (`cx metrics query-range`) - use when comparing across time periods (e.g., per-day DAU, hourly error rate trend). Set `--step` to match any `[window]` used in temporal functions.

Start simple, add complexity as needed:

```bash
# Step 1: Check if metric exists and has data
cx metrics query 'http_requests_total'

# Step 2: Add label filters and aggregation
cx metrics query 'sum by (status) (rate(http_requests_total[5m]))'

# Step 3: Build the final diagnostic query
cx metrics query 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
```
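To illustrate the step-alignment advice for range queries, here is a small sketch that reuses one value for both the `[window]` and `--step`, so each output point covers its own window. The helper name `aligned_range_cmd` is hypothetical, and it only prints the command rather than running it.

```bash
# Hypothetical helper: prints a range-query command whose --step equals the
# [window] used inside the temporal function.
# Arguments: <function> <metric> <window> <start>
aligned_range_cmd() {
  printf "cx metrics query-range '%s(%s[%s])' --start %s --end now --step %s\n" \
    "$1" "$2" "$3" "$4" "$3"
}
```

Calling `aligned_range_cmd max_over_time cpu_usage 1d now-7d` prints a command with both `[1d]` and `--step 1d`.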

5. Retry Logic

If a query returns no results or an error:
  1. Check metric name - run `cx metrics search --name '*<keyword>*'` with a broader term
  2. Check label names - run `cx metrics get-labels <metric>` to verify filter keys
  3. Widen the time range or shorten the rate window
  4. If filtering on a label that may be empty, exclude empty values: `{label!=""}`
  5. Try an alternative metric name or structure
Maximum 5 retry attempts per query, each with a concrete improvement.
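The retry cap can be sketched as a loop over progressively adjusted candidate expressions. This is only an illustration under stated assumptions: the names `first_nonempty` and `run_instant` are hypothetical, and an empty stdout or non-zero exit is treated as "no results", which may not match how the CLI actually reports them.

```bash
# Hypothetical helper: tries candidate PromQL expressions in order and stops at
# the first one that produces output. Usage: first_nonempty <runner> <expr>...
# Assumption: empty stdout or a non-zero exit from the runner means "no results".
MAX_ATTEMPTS=5

first_nonempty() {
  runner=$1
  shift
  attempt=0
  for expr in "$@"; do
    attempt=$((attempt + 1))
    # Stop once the documented limit of 5 attempts is exceeded.
    if [ "$attempt" -gt "$MAX_ATTEMPTS" ]; then
      break
    fi
    if out=$("$runner" "$expr" 2>/dev/null) && [ -n "$out" ]; then
      printf 'attempt %d: %s\n%s\n' "$attempt" "$expr" "$out"
      return 0
    fi
  done
  return 1
}

# Runner that wraps the instant-query command from this document.
run_instant() { cx metrics query "$1"; }
```

A call might look like `first_nonempty run_instant 'rate(http_requests_total{status=~"5.."}[5m])' 'rate(http_requests_total[5m])' 'http_requests_total'`, where each candidate applies one of the improvements listed above.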

6. Pattern Recognition & Root Cause Analysis


After collecting results:
  • Correlate across metrics (e.g., does an error spike coincide with a CPU spike?)
  • Look for temporal patterns - recurring peaks, sudden step changes
  • Cross-layer analysis: app → services → infrastructure → dependencies
  • Provide actionable next steps, not just data

7. Summarize Frequently


PromQL results can be large. After every few queries, summarize:
  • Key findings so far
  • Queries already run
  • Next planned queries
  • Ask to continue if more investigation is needed


Common Investigation Patterns


HTTP Errors

  1. Check error rate: `sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))`
  2. Compare to total RPS: `sum by (service) (rate(http_requests_total[5m]))`
  3. Check pod/deployment health metrics
  4. Check dependency latency
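Steps 1 and 2 can be combined into a single ratio by dividing the error rate by the total rate. The sketch below builds that expression; the helper name `error_ratio_expr` is hypothetical, and it assumes the metric carries `status` and `service` labels as in the queries above.

```bash
# Hypothetical helper: prints a PromQL expression for the share of 5xx requests
# per service. Arguments: <metric> [window], window defaults to 5m.
# Assumption: the metric has `status` and `service` labels.
error_ratio_expr() {
  metric=$1
  window=${2:-5m}
  printf 'sum by (service) (rate(%s{status=~"5.."}[%s])) / sum by (service) (rate(%s[%s]))' \
    "$metric" "$window" "$metric" "$window"
}
```

The result can be passed to an instant query, for example `cx metrics query "$(error_ratio_expr http_requests_total)"`. Services with no 5xx traffic in the window simply do not appear in the result.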

Performance / Latency

  1. Check p95/p99 latency via histograms: `histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))`
  2. Check resource saturation: CPU, memory, disk
  3. Check autoscaling metrics
  4. Check dependency response times
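Since step 1 mentions both p95 and p99 but shows only the 0.95 form, a small sketch can parameterize the quantile. The helper name `latency_quantile_expr` is hypothetical; the default metric and window are taken from the expression above.

```bash
# Hypothetical helper: prints a histogram_quantile expression.
# Arguments: <quantile> [bucket_metric] [window]
latency_quantile_expr() {
  printf 'histogram_quantile(%s, sum by (le, service) (rate(%s[%s])))' \
    "$1" "${2:-http_request_duration_seconds_bucket}" "${3:-5m}"
}
```

For p99, `cx metrics query "$(latency_quantile_expr 0.99)"` runs the same query with the quantile swapped. The `le` label must stay in the `by` clause, because `histogram_quantile` needs the bucket boundaries.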

Availability

  1. Check the `up` metric across services: `cx metrics query 'up'`
  2. Check pod restart counts
  3. Check node health
  4. Check service discovery metrics
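Because `up` is 1 for a target whose last scrape succeeded and 0 otherwise, step 1 can be narrowed to failing targets only. The sketch below is one way to phrase this; the variable names and the `instant_cmd` helper are hypothetical, and grouping by `job` assumes that label is present on `up`.

```bash
# PromQL expressions kept in variables so they can be reused across commands.
DOWN_TARGETS='up == 0'                  # one series per target that failed its last scrape
DOWN_BY_JOB='count by (job) (up == 0)'  # number of down targets per job

# Hypothetical helper: prints the instant-query command for an expression.
instant_cmd() {
  printf "cx metrics query '%s'\n" "$1"
}
```

An empty result for `cx metrics query "$DOWN_TARGETS"` indicates that no scraped target is currently reported as down.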

Key Principles

  • Discover before querying: always search for metric names first
  • Instant over range: prefer instant queries unless the question requires a time series
  • Align step with window: when using `max_over_time(metric[1d])`, set `--step 1d`
  • Filter empty labels: if results have blank label values, add `{label!=""}` to the filter
  • Aggregate early: use `sum by (...)` to reduce cardinality before further operations

Additional Resources


Reference Files

  • `references/promql-guidelines.md` - Complete PromQL reference: value types, counter vs gauge, histograms, common tasks, gotchas, and a mini cheat-sheet