metrics-analysis

Metrics Analysis

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for GRAFANA_API_KEY or PROMETHEUS_URL in environment variables. They won't be visible to you. Just run the scripts directly; authentication is handled transparently.

Core Principle: USE & RED Methods

USE Method (for infrastructure):
  • Utilization - How busy is the resource?
  • Saturation - How much work is queued?
  • Errors - Are there error events?
RED Method (for services):
  • Rate - Requests per second
  • Errors - Error rate
  • Duration - Latency distribution
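
As a rough illustration, the three RED signals can be written as one PromQL query each. This is a sketch only: the helper name `red_queries` is hypothetical, and it assumes the `http_requests_total` and `http_request_duration_seconds_bucket` metrics with `service` and `status` labels used elsewhere in this document.

```python
def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Build one PromQL query per RED signal for a service (illustrative only)."""
    sel = f'service="{service}"'
    return {
        # Rate: requests per second
        "rate": f"sum(rate(http_requests_total{{{sel}}}[{window}]))",
        # Errors: share of requests that returned a 5xx status
        "errors": (
            f'sum(rate(http_requests_total{{{sel}, status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{sel}}}[{window}]))"
        ),
        # Duration: p95 latency, keeping the le label that histogram_quantile needs
        "duration_p95": (
            "histogram_quantile(0.95, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{sel}}}[{window}])))"
        ),
    }


for name, query in red_queries("api").items():
    print(f"{name}: {query}")
```

Each resulting string can be passed to query_prometheus.py via --query.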

Available Scripts

All scripts are in .claude/skills/metrics-analysis/scripts/

query_prometheus.py - Execute PromQL Queries

bash
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query PROMQL [--time-range MINUTES] [--step STEP]

Examples:

python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "up"
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "rate(http_requests_total[5m])" --time-range 60
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
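
The script's internals are not shown here, so the following is only a guess at how --time-range and --step might translate into a Prometheus range query. It builds the `query`, `start`, `end`, and `step` parameters of the Prometheus HTTP API endpoint `/api/v1/query_range`; the function name and the default values are assumptions, not the script's actual behavior.

```python
import time
from urllib.parse import urlencode


def range_query_params(query, time_range_minutes=60, step="1m", now=None):
    """Map a look-back window in minutes onto query_range parameters (sketch)."""
    end = time.time() if now is None else now
    start = end - time_range_minutes * 60
    return {
        "query": query,
        "start": f"{start:.0f}",  # Unix timestamp, seconds
        "end": f"{end:.0f}",
        "step": step,             # duration string or seconds
    }


params = range_query_params("up", time_range_minutes=60, now=1_700_000_000)
print("/api/v1/query_range?" + urlencode(params))
```

No credentials or base URL appear in the sketch because, as noted above, the proxy layer supplies them.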

list_dashboards.py - Find Grafana Dashboards

bash
python .claude/skills/metrics-analysis/scripts/list_dashboards.py [--query SEARCH_TERM]

Examples:

python .claude/skills/metrics-analysis/scripts/list_dashboards.py
python .claude/skills/metrics-analysis/scripts/list_dashboards.py --query "api"

get_alerts.py - Check Firing Alerts

bash
python .claude/skills/metrics-analysis/scripts/get_alerts.py [--state STATE]

Examples:

python .claude/skills/metrics-analysis/scripts/get_alerts.py
python .claude/skills/metrics-analysis/scripts/get_alerts.py --state alerting

---

PromQL Quick Reference

Basic Queries

promql

Instant vector - current value

http_requests_total{service="api"}

Range vector - values over time (for rate calculations)

http_requests_total{service="api"}[5m]

Rate of increase per second

rate(http_requests_total{service="api"}[5m])
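
To make the per-second rate concrete, here is a simplified Python sketch of what rate() computes from the raw counter samples in a window. It handles counter resets but deliberately omits the extrapolation to the window boundaries that the real function performs, so treat it as an approximation. The sample values are made up.

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) pairs.

    Samples must be ordered oldest first. Returns None when fewer than two
    samples are available, since no rate can be computed from a single point.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (e.g. a restart),
        # so the new value itself is the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped every 15s, with a reset before t=45
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

This is also why raw counter values should not be compared directly: a reset makes the raw number drop even though traffic continued.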

Common Operators

promql

Rate (counter → gauge, per second)

rate(http_requests_total[5m])

Increase (total increase over time range)

increase(http_requests_total[1h])

Average over time

avg_over_time(cpu_usage[5m])

Histogram quantile (p95, p99)

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
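
The quantile is estimated from cumulative bucket counts rather than from individual request timings. The sketch below shows the idea for a classic histogram: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplification (function name and bucket data are hypothetical), but it illustrates why the result is only as precise as the bucket boundaries.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a math.inf bucket,
    mirroring the le labels of a _bucket metric.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Nothing to interpolate against: report the last finite bound
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: le=0.1, 0.5, 1.0, +Inf
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(bucket_quantile(0.95, buckets))  # approximately 0.8125
```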

Aggregations

promql

Sum across all instances

sum(rate(http_requests_total[5m]))

Group by label

sum by (service) (rate(http_requests_total[5m]))

Average by label

avg by (instance) (cpu_usage)

Top 5 by value

topk(5, sum by (service) (rate(http_requests_total[5m])))
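
For readers less familiar with PromQL aggregation, the same grouping logic can be sketched in plain Python: `sum by (label)` collapses all series that share a label value, and `topk` keeps the largest results. The helper names and the sample series are made up for illustration.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum values of all series that share the same value for `label`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Return the k (label_value, total) pairs with the highest totals."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical per-instance request rates (requests per second)
series = [
    ({"service": "api", "instance": "a"}, 10.0),
    ({"service": "api", "instance": "b"}, 5.0),
    ({"service": "web", "instance": "a"}, 7.0),
]
per_service = sum_by(series, "service")
print(per_service)
print(topk(1, per_service))
```

Note that every label not named in `by (...)` is dropped from the result, which matters when a later function depends on a specific label.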

Label Matching

promql

Exact match

http_requests_total{status="500"}

Regex match

http_requests_total{status=~"5.."}

Not equal

http_requests_total{status!="200"}

Multiple labels

http_requests_total{service="api", status=~"5.."}
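
One detail worth keeping in mind: Prometheus regex matchers are fully anchored, so the pattern must match the entire label value. The sketch below approximates this with Python's `re.fullmatch`. Prometheus uses RE2 syntax, which agrees with Python for simple patterns like these; the helper name is hypothetical.

```python
import re


def label_matches(value: str, pattern: str) -> bool:
    """Approximate a =~ matcher: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None


print(label_matches("503", "5.."))          # whole value is a 5xx code
print(label_matches("1500", "5.."))         # not anchored match, so rejected
print(label_matches("api-7d9f", "api-.*"))  # any suffix after the dash
print(label_matches("api-7d9f", "api-."))   # only one character allowed after the dash
```

The last two lines show why a pod selector usually needs `.*` rather than a single `.`.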

---

Investigation Workflows

1. Latency Investigation

bash

Step 1: Check overall latency trend

python query_prometheus.py --query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))' --time-range 60

Step 2: Compare p50 vs p99

python query_prometheus.py --query 'histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{service="api"}[5m]))'
python query_prometheus.py --query 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="api"}[5m]))'

Step 3: Break down by endpoint

python query_prometheus.py --query 'histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))'

2. Error Rate Investigation

bash

Step 1: Overall error rate

python query_prometheus.py --query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

Step 2: Errors by status code

python query_prometheus.py --query 'sum by (status) (rate(http_requests_total{status=~"[45].."}[5m]))'

Step 3: Errors by service

python query_prometheus.py --query 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'

3. Resource Investigation (CPU/Memory)

bash

CPU usage

python query_prometheus.py --query 'avg by (instance) (rate(container_cpu_usage_seconds_total{pod=~"api-.*"}[5m]))'

Memory usage percentage

python query_prometheus.py --query 'container_memory_usage_bytes{pod=~"api-.*"} / container_spec_memory_limit_bytes{pod=~"api-.*"}'

---

Quick Commands Reference

| Goal | Command |
|---|---|
| Request rate | query_prometheus.py --query "sum(rate(http_requests_total[5m]))" |
| Error rate | query_prometheus.py --query "sum(rate(http_requests_total{status=~'5..'}[5m]))" |
| P95 latency | query_prometheus.py --query "histogram_quantile(0.95, ...)" |
| CPU usage | query_prometheus.py --query "rate(container_cpu_usage_seconds_total[5m])" |
| Find dashboards | list_dashboards.py --query "api" |
| Check alerts | get_alerts.py --state alerting |

Common Metric Patterns

Request Metrics

promql
http_requests_total                    # Counter
http_request_duration_seconds_bucket   # Histogram
http_requests_in_flight               # Gauge

Kubernetes Metrics

promql
container_cpu_usage_seconds_total
container_memory_usage_bytes
kube_pod_container_status_restarts_total
kube_pod_status_phase

Anti-Patterns to Avoid

  1. Using rate() without a range vector - Always include [5m] or similar
  2. Comparing counters directly - Use rate() or increase() first
  3. Wrong quantile math - histogram_quantile requires _bucket metrics
  4. Missing label filters - Queries without filters return all series
  5. Too-short time ranges - Use at least 2x your scrape interval for rate()
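
Some of these mistakes can be caught before a query is sent. The following is a rough pre-flight check, not a PromQL parser: the function name is hypothetical, it only covers the first and third anti-patterns, and it can be fooled by parentheses inside quoted label values.

```python
import re


def lint_promql(query: str) -> list[str]:
    """Return warnings for a few common PromQL mistakes (heuristic only)."""
    warnings = []
    # rate()/increase() need a range vector argument such as metric[5m]
    for match in re.finditer(r"\b(rate|increase)\(", query):
        depth, i = 1, match.end()
        # Walk forward to the parenthesis that closes this call
        while i < len(query) and depth:
            depth += {"(": 1, ")": -1}.get(query[i], 0)
            i += 1
        argument = query[match.end():i - 1]
        if "[" not in argument:
            warnings.append(f"{match.group(1)}() called without a range vector")
    # histogram_quantile() works on _bucket series
    if "histogram_quantile(" in query and "_bucket" not in query:
        warnings.append("histogram_quantile() expects a _bucket metric")
    return warnings


print(lint_promql("rate(http_requests_total)"))
print(lint_promql('sum(rate(http_requests_total{status=~"5.."}[5m]))'))
```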