Metrics Analysis

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for `GRAFANA_API_KEY` or `PROMETHEUS_URL` in environment variables, because they won't be visible to you. Just run the scripts directly; authentication is handled transparently.

Core Principle: USE & RED Methods

USE Method (for infrastructure):

Utilization - How busy is the resource?
Saturation - How much work is queued?
Errors - Are there error events?

RED Method (for services):

Rate - Requests per second
Errors - Error rate
Duration - Latency distribution
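
As a rough illustration, the three RED signals can be written as PromQL strings for a single service. The sketch below assumes the metric names used elsewhere in this document (`http_requests_total`, `http_request_duration_seconds_bucket`) and `service` / `status` labels. The helper name `red_queries` is hypothetical, and your metrics may be named or labeled differently.

```python
def red_queries(service: str, window: str = "5m") -> dict:
    """Build PromQL strings for the RED signals of one service (illustrative only)."""
    selector = f'service="{service}"'  # assumed label name
    return {
        # Rate: requests per second across all instances
        "rate": f"sum(rate(http_requests_total{{{selector}}}[{window}]))",
        # Errors: fraction of requests that returned a 5xx status
        "errors": (
            f'sum(rate(http_requests_total{{{selector}, status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        ),
        # Duration: p95 latency, aggregated by bucket boundary (le)
        "duration": (
            "histogram_quantile(0.95, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])))"
        ),
    }


if __name__ == "__main__":
    for name, query in red_queries("api").items():
        print(f"{name}: {query}")
```

Each resulting string could be passed to `query_prometheus.py --query`.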

Available Scripts

All scripts are in `.claude/skills/metrics-analysis/scripts/`.

query_prometheus.py - Execute PromQL Queries

```bash
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query PROMQL [--time-range MINUTES] [--step STEP]
```

Examples:

```bash
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "up"
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "rate(http_requests_total[5m])" --time-range 60
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
```
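
The script's internals are not shown here. As a hedged sketch, this is how `--time-range` and `--step` might map onto the `query`, `start`, `end`, and `step` parameters of Prometheus's `/api/v1/query_range` endpoint. The helper name `build_range_params` and the `60s` default step are assumptions, not the script's actual behavior.

```python
import time
from typing import Optional


def build_range_params(query: str, time_range_minutes: int, step: str = "60s",
                       now: Optional[float] = None) -> dict:
    """Return query-string parameters for a Prometheus /api/v1/query_range call."""
    if time_range_minutes <= 0:
        raise ValueError("time_range_minutes must be positive")
    end = time.time() if now is None else now
    start = end - time_range_minutes * 60  # look back N minutes from `end`
    return {"query": query, "start": start, "end": end, "step": step}


if __name__ == "__main__":
    print(build_range_params("rate(http_requests_total[5m])", 60))
```

Passing `now` explicitly keeps the function deterministic, which makes it easier to test.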

list_dashboards.py - Find Grafana Dashboards

```bash
python .claude/skills/metrics-analysis/scripts/list_dashboards.py [--query SEARCH_TERM]
```

Examples:

```bash
python .claude/skills/metrics-analysis/scripts/list_dashboards.py
python .claude/skills/metrics-analysis/scripts/list_dashboards.py --query "api"
```

get_alerts.py - Check Firing Alerts

```bash
python .claude/skills/metrics-analysis/scripts/get_alerts.py [--state STATE]
```

Examples:

```bash
python .claude/skills/metrics-analysis/scripts/get_alerts.py
python .claude/skills/metrics-analysis/scripts/get_alerts.py --state alerting
```

---

PromQL Quick Reference

Basic Queries

```promql
# Instant vector - current value
http_requests_total{service="api"}

# Range vector - values over time (for rate calculations)
http_requests_total{service="api"}[5m]

# Rate of increase per second
rate(http_requests_total{service="api"}[5m])
```
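
To show why `rate()` needs a range vector, here is a simplified sketch of the calculation over the samples inside one window. It is not Prometheus's exact algorithm: the real `rate()` also extrapolates to the window boundaries, which is omitted here. The function name `simple_rate` is hypothetical, and timestamps are assumed to be strictly increasing.

```python
def simple_rate(samples: list) -> float:
    """Approximate per-second increase from (timestamp, counter_value) samples.

    Simplification: real PromQL rate() also extrapolates to the window
    boundaries; this sketch only handles counter resets.
    """
    if len(samples) < 2:
        raise ValueError("rate needs at least two samples in the window")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop in a counter means it was reset (e.g. a process restart),
        # so the whole current value counts as new increase.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    print(simple_rate([(0, 100), (15, 130), (30, 160)]))
```

A single sample carries no information about change over time, which is why an instant vector cannot be passed to `rate()`.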

Common Operators

```promql
# Rate (counter → gauge, per second)
rate(http_requests_total[5m])

# Increase (total increase over time range)
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Histogram quantile (p95, p99)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
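
For intuition about what `histogram_quantile` does with `_bucket` series, the following is a simplified sketch of linear interpolation over cumulative buckets. It assumes classic histograms with an `le` upper bound per bucket and skips edge cases that Prometheus handles. The Python function is an illustration, not the server's implementation.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with math.inf, mirroring
    the `le` label on *_bucket series.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how finely the bucket boundaries are chosen.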

Aggregations

```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Group by label
sum by (service) (rate(http_requests_total[5m]))

# Average by label
avg by (instance) (cpu_usage)

# Top 5 by value
topk(5, sum by (service) (rate(http_requests_total[5m])))
```

Label Matching

```promql
# Exact match
http_requests_total{status="500"}

# Regex match
http_requests_total{status=~"5.."}

# Not equal
http_requests_total{status!="200"}

# Multiple labels
http_requests_total{service="api", status=~"5.."}
```
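
One detail worth illustrating: PromQL regex matchers are fully anchored, so the pattern must match the entire label value. The sketch below uses Python's `re.fullmatch` as a stand-in. Prometheus uses RE2 syntax, so this only approximates behavior for simple patterns, and `label_matches` is a hypothetical helper.

```python
import re


def label_matches(pattern: str, value: str) -> bool:
    """Mimic PromQL's =~ matcher, which must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    for value in ("500", "503", "5000", "200"):
        print(value, label_matches("5..", value))
```

This is why a pod selector such as `api-.` matches only names with exactly one character after the dash, while `api-.*` matches any suffix.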

---

Investigation Workflows

1. Latency Investigation

```bash
# Step 1: Check overall latency trend
python query_prometheus.py --query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))' --time-range 60

# Step 2: Compare p50 vs p99
python query_prometheus.py --query 'histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{service="api"}[5m]))'
python query_prometheus.py --query 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="api"}[5m]))'

# Step 3: Break down by endpoint (keep the le label so the quantile can be computed)
python query_prometheus.py --query 'histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))'
```

2. Error Rate Investigation

```bash
# Step 1: Overall error rate
python query_prometheus.py --query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Step 2: Errors by status code
python query_prometheus.py --query 'sum by (status) (rate(http_requests_total{status=~"[45].."}[5m]))'

# Step 3: Errors by service
python query_prometheus.py --query 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
```
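
The output format of `query_prometheus.py` is not documented here. Assuming it returns (or can be made to return) the raw JSON of a Prometheus instant query, a breakdown such as "errors by status code" could be turned into a plain mapping roughly like this. The helper name `values_by_label` is hypothetical.

```python
import json


def values_by_label(response_text: str, label: str) -> dict:
    """Map one label's value to the sample value for each series in an instant-vector result."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = {}
    for series in payload["data"]["result"]:
        # Each instant-vector entry carries "value": [timestamp, "value-as-string"].
        _, raw_value = series["value"]
        result[series["metric"].get(label, "")] = float(raw_value)
    return result


if __name__ == "__main__":
    sample = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"status": "500"}, "value": [1700000000, "0.25"]},
            {"metric": {"status": "404"}, "value": [1700000000, "1.5"]},
        ]},
    })
    print(values_by_label(sample, "status"))
```

Sorting the resulting mapping by value gives a quick view of which status code or service dominates the errors.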

3. Resource Investigation (CPU/Memory)

```bash
# CPU usage
python query_prometheus.py --query 'avg by (instance) (rate(container_cpu_usage_seconds_total{pod=~"api-.*"}[5m]))'

# Memory usage as a fraction of the limit (multiply by 100 for a percentage)
python query_prometheus.py --query 'container_memory_usage_bytes{pod=~"api-.*"} / container_spec_memory_limit_bytes{pod=~"api-.*"}'
```

---

Quick Commands Reference

| Goal | Command |
|------|---------|
| Request rate | `query_prometheus.py --query "sum(rate(http_requests_total[5m]))"` |
| Error rate | `query_prometheus.py --query "sum(rate(http_requests_total{status=~'5..'}[5m]))"` |
| P95 latency | `query_prometheus.py --query "histogram_quantile(0.95, ...)"` |
| CPU usage | `query_prometheus.py --query "rate(container_cpu_usage_seconds_total[5m])"` |
| Find dashboards | `list_dashboards.py --query "api"` |
| Check alerts | `get_alerts.py --state alerting` |

Common Metric Patterns

Request Metrics

```promql
http_requests_total                    # Counter
http_request_duration_seconds_bucket   # Histogram
http_requests_in_flight                # Gauge
```

Kubernetes Metrics

```promql
container_cpu_usage_seconds_total
container_memory_usage_bytes
kube_pod_container_status_restarts_total
kube_pod_status_phase
```

Anti-Patterns to Avoid

❌ Using `rate()` without a range vector - Always include `[5m]` or similar
❌ Comparing counters directly - Use `rate()` or `increase()` first
❌ Wrong quantile math - `histogram_quantile` requires `_bucket` metrics
❌ Missing label filters - Queries without filters return all series
❌ Too-short time ranges - Use at least 2x your scrape interval for `rate()`
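
The last rule can be turned into a small check. `rate()` needs at least two samples inside the window, so the window must span at least two scrape intervals. The sketch below derives the shortest range selector from a scrape interval. The helper name `min_rate_window` and its `multiplier` parameter are hypothetical, and some setups may prefer a larger multiple to tolerate a missed scrape.

```python
import math


def min_rate_window(scrape_interval_seconds: float, multiplier: int = 2) -> str:
    """Smallest whole-second range selector that spans `multiplier` scrape intervals."""
    if scrape_interval_seconds <= 0 or multiplier < 2:
        raise ValueError("need a positive interval and a multiplier of at least 2")
    return f"{math.ceil(scrape_interval_seconds * multiplier)}s"


if __name__ == "__main__":
    window = min_rate_window(15)
    print(f"rate(http_requests_total[{window}])")
```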
