metrics-analysis
Metrics Analysis
Authentication
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for `GRAFANA_API_KEY` or `PROMETHEUS_URL` in environment variables - they won't be visible to you. Just run the scripts directly; authentication is handled transparently.
Core Principle: USE & RED Methods
USE Method (for infrastructure):
- Utilization - How busy is the resource?
- Saturation - How much work is queued?
- Errors - Are there error events?
RED Method (for services):
- Rate - Requests per second
- Errors - Error rate
- Duration - Latency distribution
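The three RED signals map naturally onto PromQL. The sketch below is one possible mapping, not a prescribed one: `red_queries` is a hypothetical helper, and the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) and the `service` and `status` labels are assumed to match the examples used elsewhere in this document.

```python
def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Build one PromQL expression per RED signal for a single service."""
    selector = f'{{service="{service}"}}'
    errors = f'{{service="{service}", status=~"5.."}}'
    return {
        # Rate: requests per second
        "rate": f"sum(rate(http_requests_total{selector}[{window}]))",
        # Errors: share of requests answered with a 5xx status
        "errors": (
            f"sum(rate(http_requests_total{errors}[{window}])) / "
            f"sum(rate(http_requests_total{selector}[{window}]))"
        ),
        # Duration: p95 latency estimated from histogram buckets
        "duration": (
            "histogram_quantile(0.95, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{selector}[{window}])))"
        ),
    }


for signal, query in red_queries("api").items():
    print(f"{signal}: {query}")
```

Each resulting string can be passed to the query script's `--query` argument.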
Available Scripts
All scripts are in `.claude/skills/metrics-analysis/scripts/`.
query_prometheus.py - Execute PromQL Queries

```bash
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query PROMQL [--time-range MINUTES] [--step STEP]

# Examples:
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "up"
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "rate(http_requests_total[5m])" --time-range 60
python .claude/skills/metrics-analysis/scripts/query_prometheus.py --query "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
```
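The script's internals are not shown here, so the following is only a guess at how `--time-range` and `--step` might translate into a Prometheus range query. The `/api/v1/query_range` endpoint and its `query`, `start`, `end` and `step` parameters are part of the Prometheus HTTP API; the function name, the default values and the `PROMETHEUS_BASE` placeholder are assumptions made for illustration.

```python
import time
from urllib.parse import urlencode


def range_query_params(query, time_range_minutes=60, step="1m", now=None):
    """Translate CLI-style arguments into query_range parameters."""
    end = time.time() if now is None else now
    start = end - time_range_minutes * 60
    return {"query": query, "start": start, "end": end, "step": step}


params = range_query_params("rate(http_requests_total[5m])", 60, "1m", now=1_700_000_000)
# PROMETHEUS_BASE is a placeholder; the real URL is supplied by the proxy layer.
PROMETHEUS_BASE = "http://prometheus.example"
print(f"{PROMETHEUS_BASE}/api/v1/query_range?{urlencode(params)}")
```

A smaller step returns more points per series, so longer time ranges usually pair with a coarser step.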
list_dashboards.py - Find Grafana Dashboards

```bash
python .claude/skills/metrics-analysis/scripts/list_dashboards.py [--query SEARCH_TERM]

# Examples:
python .claude/skills/metrics-analysis/scripts/list_dashboards.py
python .claude/skills/metrics-analysis/scripts/list_dashboards.py --query "api"
```
get_alerts.py - Check Firing Alerts

```bash
python .claude/skills/metrics-analysis/scripts/get_alerts.py [--state STATE]

# Examples:
python .claude/skills/metrics-analysis/scripts/get_alerts.py
python .claude/skills/metrics-analysis/scripts/get_alerts.py --state alerting
```
---

PromQL Quick Reference
Basic Queries

```promql
# Instant vector - current value
http_requests_total{service="api"}

# Range vector - values over time (for rate calculations)
http_requests_total{service="api"}[5m]

# Rate of increase per second
rate(http_requests_total{service="api"}[5m])
```
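To make the arithmetic behind `rate()` concrete, here is a simplified sketch of a per-second rate over the samples in a window, including the usual handling of counter resets. It is not Prometheus's actual implementation: the real function also extrapolates to the window boundaries, which this version omits. `simple_rate` is a hypothetical name.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of (timestamp, value) samples."""
    if len(samples) < 2:
        raise ValueError("rate needs at least two samples in the window")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count the new value as growth from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Three scrapes, 60 seconds apart, 60 new requests between each pair
print(simple_rate([(0, 100), (60, 160), (120, 220)]))
```

The two-sample minimum is why the range in brackets has to span at least two scrapes.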
Common Operators

```promql
# Rate (counter → gauge, per second)
rate(http_requests_total[5m])

# Increase (total increase over time range)
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Histogram quantile (p95, p99)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
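`histogram_quantile` estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below illustrates that idea on cumulative `(upper_bound, count)` pairs; it is a simplified model with hypothetical bucket values, it assumes `0 < q < 1` and at least one finite bucket, and it is not the Prometheus source.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries cover the latency range of interest.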
Aggregations

```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Group by label
sum by (service) (rate(http_requests_total[5m]))

# Average by label
avg by (instance) (cpu_usage)

# Top 5 by value
topk(5, sum by (service) (rate(http_requests_total[5m])))
```
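What `sum by (...)` and `topk` do to a set of series can be mimicked in a few lines. This is a toy model over hypothetical in-memory data, meant only to show the grouping semantics; it does not query Prometheus.

```python
from collections import defaultdict


def sum_by(series, label):
    """Group (labels, value) pairs by one label and add up the values."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


def topk(k, totals):
    """Return the k largest (label_value, total) pairs, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


series = [
    ({"service": "api", "instance": "a"}, 12.0),
    ({"service": "api", "instance": "b"}, 8.0),
    ({"service": "web", "instance": "c"}, 5.0),
]
per_service = sum_by(series, "service")
print(topk(1, per_service))
```

Note that every label not named in the `by` clause (here `instance`) disappears from the result.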
Label Matching

```promql
# Exact match
http_requests_total{status="500"}

# Regex match
http_requests_total{status=~"5.."}

# Not equal
http_requests_total{status!="200"}

# Multiple labels
http_requests_total{service="api", status=~"5.."}
```
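One detail worth illustrating is that PromQL regex matchers are fully anchored: `status=~"5.."` matches `500` but not `5000`. The sketch below models the matchers with Python's `re.fullmatch`. Prometheus uses RE2 syntax, so this is an approximation for simple patterns only, and `matches` is a hypothetical helper.

```python
import re


def matches(matcher_op, pattern, value):
    """Evaluate one PromQL-style label matcher against a label value."""
    if matcher_op == "=":
        return value == pattern
    if matcher_op == "!=":
        return value != pattern
    if matcher_op == "=~":
        return re.fullmatch(pattern, value) is not None
    if matcher_op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher: {matcher_op}")


for status in ("200", "404", "503", "5030"):
    print(status, matches("=~", "5..", status))
```

To match a substring rather than the whole value, the pattern needs explicit wildcards such as `.*5..*`.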
---

Investigation Workflows
1. Latency Investigation

```bash
# Step 1: Check overall latency trend
python query_prometheus.py --query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="api"}[5m]))' --time-range 60

# Step 2: Compare p50 vs p99
python query_prometheus.py --query 'histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{service="api"}[5m]))'
python query_prometheus.py --query 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="api"}[5m]))'

# Step 3: Break down by endpoint (keep the le label so histogram_quantile can see the buckets)
python query_prometheus.py --query 'histogram_quantile(0.95, sum by (endpoint, le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))'
```
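PromQL strings are full of braces, quotes and parentheses, so shell quoting is an easy place to slip when running several quantiles in a row. A small sketch of generating the command lines with `shlex.join` follows; `latency_command` is a hypothetical helper, and aggregating with `sum by (le)` is a choice made here rather than something the workflow prescribes.

```python
import shlex


def latency_command(service, quantile=0.95, time_range=None):
    """Build the query_prometheus.py command line for one latency quantile."""
    query = (
        f"histogram_quantile({quantile}, sum by (le) "
        f'(rate(http_request_duration_seconds_bucket{{service="{service}"}}[5m])))'
    )
    argv = ["python", "query_prometheus.py", "--query", query]
    if time_range is not None:
        argv += ["--time-range", str(time_range)]
    # shlex.join quotes each argument so the shell passes it through unchanged.
    return shlex.join(argv)


for q in (0.50, 0.95, 0.99):
    print(latency_command("api", q))
```

A p99 that sits far above the p50 usually points at a slow tail (a few endpoints or instances) rather than a uniform slowdown.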
2. Error Rate Investigation

```bash
# Step 1: Overall error rate
python query_prometheus.py --query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

# Step 2: Errors by status code
python query_prometheus.py --query 'sum by (status) (rate(http_requests_total{status=~"[45].."}[5m]))'

# Step 3: Errors by service
python query_prometheus.py --query 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
```
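The Step 1 ratio needs some care at the edges: with zero traffic the division has no meaningful result, and when no 5xx series exist at all the query returns no data rather than 0. A minimal sketch of the same arithmetic on plain numbers, with a hypothetical 1% error budget used purely as an example threshold:

```python
def error_ratio(error_rate, total_rate):
    """Fraction of requests that failed, or None when there is no traffic."""
    if total_rate == 0:
        return None
    return error_rate / total_rate


def exceeds_budget(error_rate, total_rate, budget=0.01):
    """True when the error ratio is above a (hypothetical) 1% budget."""
    ratio = error_ratio(error_rate, total_rate)
    return ratio is not None and ratio > budget


# 2 errors/s out of 100 requests/s
print(error_ratio(2.0, 100.0), exceeds_budget(2.0, 100.0))
```

Reading the ratio alongside the absolute request rate helps avoid overreacting to a high percentage on very low traffic.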
3. Resource Investigation (CPU/Memory)

```bash
# CPU usage
python query_prometheus.py --query 'avg by (instance) (rate(container_cpu_usage_seconds_total{pod=~"api-.*"}[5m]))'

# Memory usage as a fraction of the limit (multiply by 100 for a percentage)
python query_prometheus.py --query 'container_memory_usage_bytes{pod=~"api-.*"} / container_spec_memory_limit_bytes{pod=~"api-.*"}'
```
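The memory query yields a fraction per container, which still has to be turned into a percentage and compared against a threshold. The sketch below does that on hypothetical numbers. It assumes a container without a configured limit reports a limit of 0, in which case the ratio is not meaningful and is skipped; the 90% threshold is likewise only an example.

```python
def memory_percent(usage_bytes, limit_bytes):
    """Memory usage as a percentage of the limit, or None without a limit."""
    if limit_bytes <= 0:
        return None
    return 100.0 * usage_bytes / limit_bytes


def near_limit(pods, threshold=90.0):
    """Names of pods whose usage is at or above `threshold` percent."""
    flagged = []
    for name, (usage, limit) in pods.items():
        pct = memory_percent(usage, limit)
        if pct is not None and pct >= threshold:
            flagged.append(name)
    return sorted(flagged)


pods = {"api-1": (950, 1000), "api-2": (100, 1000), "api-3": (5, 0)}
print(near_limit(pods))
```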
---

Quick Commands Reference
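| Goal | Command |
|---|---|
| Request rate | `query_prometheus.py --query 'sum(rate(http_requests_total[5m]))'` |
| Error rate | `query_prometheus.py --query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'` |
| P95 latency | `query_prometheus.py --query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'` |
| CPU usage | `query_prometheus.py --query 'avg by (instance) (rate(container_cpu_usage_seconds_total[5m]))'` |
| Find dashboards | `list_dashboards.py --query "api"` |
| Check alerts | `get_alerts.py --state alerting` |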
Common Metric Patterns
Request Metrics

```promql
http_requests_total                   # Counter
http_request_duration_seconds_bucket  # Histogram
http_requests_in_flight               # Gauge
```
Kubernetes Metrics

```promql
container_cpu_usage_seconds_total
container_memory_usage_bytes
kube_pod_container_status_restarts_total
kube_pod_status_phase
```

Anti-Patterns to Avoid
- ❌ Using `rate()` without range vector - Always include `[5m]` or similar
- ❌ Comparing counters directly - Use `rate()` or `increase()` first
- ❌ Wrong quantile math - `histogram_quantile` requires `_bucket` metrics
- ❌ Missing label filters - Queries without filters return all series
- ❌ Too-short time ranges - Use at least 2x your scrape interval for `rate()`, so the window always contains at least two samples
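Some of these anti-patterns can be caught with a quick string check before a query is run. The following is a rough heuristic sketch, not a PromQL parser: `lint_promql` is a hypothetical helper, it only looks for obvious textual signs, and it will miss or misreport more complex queries.

```python
import re


def lint_promql(query):
    """Return heuristic warnings for a PromQL string (not a real parser)."""
    warnings = []
    # rate()/increase() need a range selector such as [5m] in their argument.
    if re.search(r"\b(rate|increase)\(", query) and "[" not in query:
        warnings.append("rate()/increase() without a range selector")
    if "histogram_quantile(" in query and "_bucket" not in query:
        warnings.append("histogram_quantile() without a _bucket metric")
    if "{" not in query:
        warnings.append("no label filters; every series will match")
    return warnings


print(lint_promql("rate(http_requests_total)"))
print(lint_promql('rate(http_requests_total{service="api"}[5m])'))
```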