monitoring-expert
Monitoring Expert
Expert guidance for monitoring, observability, and alerting using Prometheus, Grafana, logging systems, and distributed tracing.
Core Concepts
The Three Pillars of Observability
- Metrics - Numerical measurements over time (Prometheus)
- Logs - Discrete events (ELK, Loki)
- Traces - Request flow through distributed systems (Jaeger, Tempo)
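A single request typically touches all three pillars. A toy sketch of the data shapes involved (no real backends, just in-memory stand-ins):

```python
import json
import time
import uuid

def handle_request(metrics: dict, logs: list, spans: list) -> None:
    """Emit one request's worth of telemetry to each pillar."""
    trace_id = uuid.uuid4().hex
    start = time.time()
    # ... the actual request work would happen here ...
    duration = time.time() - start

    # Metric: an aggregated number over time (what Prometheus scrapes)
    metrics['http_requests_total'] = metrics.get('http_requests_total', 0) + 1

    # Log: a discrete structured event (what Loki stores)
    logs.append(json.dumps({'level': 'info', 'msg': 'request done',
                            'trace_id': trace_id}))

    # Trace: a timed span keyed by trace_id (what Jaeger stores)
    spans.append({'trace_id': trace_id, 'name': 'handle_request',
                  'duration_s': duration})

metrics, logs, spans = {}, [], []
handle_request(metrics, logs, spans)
```

Note how the `trace_id` ties the log event to its span; that shared key is what makes cross-pillar correlation possible.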
Monitoring Fundamentals
- Golden Signals (Latency, Traffic, Errors, Saturation)
- RED Method (Rate, Errors, Duration)
- USE Method (Utilization, Saturation, Errors)
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Service Level Agreements (SLAs)
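The SLI/SLO relationship can be made concrete with a little arithmetic. A minimal sketch (the request counts are hypothetical):

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    return good_requests / total_requests

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    The error budget is 1 - SLO target; the spent portion is 1 - SLI.
    """
    budget = 1.0 - slo_target
    spent = 1.0 - sli
    return 1.0 - spent / budget

# 999,500 good requests out of 1,000,000 against a 99.9% SLO:
sli = availability_sli(999_500, 1_000_000)          # 0.9995
remaining = error_budget_remaining(sli, 0.999)      # 0.5 - half the budget left
```

The SLA is then the contractual layer on top: the consequences promised to customers if the SLO is missed.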
Key Components
- Metric collection (exporters, agents)
- Time-series database
- Visualization (dashboards)
- Alerting (rules, receivers)
- Log aggregation
- Distributed tracing
Prometheus
Installation (Docker)
```yaml
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:
```
Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alert rules
rule_files:
  - 'alerts.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          instance: 'server-1'
          env: 'production'

  # Application metrics
  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'app-1:8080'
          - 'app-2:8080'
          - 'app-3:8080'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Blackbox exporter (endpoint monitoring)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
Alert Rules

```yaml
# alerts.yml
groups:
  - name: app_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"

      # API latency
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API latency on {{ $labels.instance }}"
          description: "95th percentile latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      # Disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
          / node_filesystem_size_bytes{mountpoint="/"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # Pod restarts
      - alert: PodRestarting
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"
```
PromQL Queries

```promql
# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Success rate
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Average latency
rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m])

# CPU usage per pod
rate(container_cpu_usage_seconds_total{pod!=""}[5m])

# Memory usage percentage
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100

# QPS per endpoint
sum by(endpoint) (rate(http_requests_total[5m]))

# Top 5 slowest endpoints
topk(5, histogram_quantile(0.95,
  sum by(endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
))

# Predict disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```
Application Instrumentation

Node.js (Express)

```typescript
// Install: npm install prom-client express-prom-bundle
import express from 'express';
import promBundle from 'express-prom-bundle';
import { register, Counter, Histogram, Gauge } from 'prom-client';

const app = express();

// Automatic metrics for all endpoints
const metricsMiddleware = promBundle({
  includeMethod: true,
  includePath: true,
  includeStatusCode: true,
  includeUp: true,
  customLabels: { app: 'myapp' },
  promClient: { collectDefaultMetrics: {} },
});
app.use(metricsMiddleware);

// Custom metrics
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders',
  labelNames: ['status', 'payment_method'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
});

// Use metrics in your code
app.post('/orders', async (req, res) => {
  const order = await createOrder(req.body);
  ordersTotal.inc({ status: 'created', payment_method: order.paymentMethod });
  orderValue.observe(order.total);
  res.json(order);
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080, () => {
  console.log('Server running on :8080');
  console.log('Metrics available at http://localhost:8080/metrics');
});
```

Python (Flask)
```python
# Install: pip install prometheus-flask-exporter
from flask import Flask, request, jsonify
from prometheus_flask_exporter import PrometheusMetrics
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)
metrics = PrometheusMetrics(app)

# Custom metrics
orders_total = Counter(
    'orders_total',
    'Total number of orders',
    ['status', 'payment_method']
)

order_value = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 50, 100, 500, 1000, 5000]
)

active_users = Gauge(
    'active_users',
    'Number of active users'
)

@app.route('/orders', methods=['POST'])
def create_order():
    order = process_order(request.json)
    orders_total.labels(
        status='created',
        payment_method=order['payment_method']
    ).inc()
    order_value.observe(order['total'])
    return jsonify(order)

@app.route('/health')
def health():
    return {'status': 'healthy'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
    # Metrics available at /metrics
```
Go

```go
package main

import (
	"encoding/json"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	ordersTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "orders_total",
			Help: "Total number of orders",
		},
		[]string{"status", "payment_method"},
	)
	orderValue = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "order_value_dollars",
			Help:    "Order value in dollars",
			Buckets: []float64{10, 50, 100, 500, 1000, 5000},
		},
	)
	activeUsers = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_users",
			Help: "Number of active users",
		},
	)
)

func createOrderHandler(w http.ResponseWriter, r *http.Request) {
	order := processOrder(r.Body)
	ordersTotal.WithLabelValues(
		"created",
		order.PaymentMethod,
	).Inc()
	orderValue.Observe(order.Total)
	json.NewEncoder(w).Encode(order)
}

func main() {
	http.HandleFunc("/orders", createOrderHandler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Alertmanager

Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: slack
    # Database alerts
    - match_re:
        service: database
      receiver: database-team

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'password'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'
  - name: 'database-team'
    slack_configs:
      - channel: '#database-alerts'
    email_configs:
      - to: 'dba-team@example.com'

inhibit_rules:
  # Suppress warning if critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
Grafana

Dashboard Configuration (JSON)

```json
{
  "dashboard": {
    "title": "Application Metrics",
    "tags": ["app", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
```

Provisioning Data Sources
```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false
```

Logging with Loki

Loki Configuration
```yaml
# loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```
Promtail Configuration

```yaml
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Application logs
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log

  # Docker logs
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'

  # Kubernetes logs
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - docker: {}
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
```
LogQL Queries

```logql
# All logs for a job
{job="app"}

# Filter by level
{job="app"} |= "error"

# JSON parsing
{job="app"} | json | level="error"

# Rate of errors
rate({job="app"} |= "error" [5m])

# Count by pod
sum by (pod) (count_over_time({namespace="production"}[5m]))

# Extract and filter
{job="app"}
  | json
  | line_format "{{.timestamp}} {{.level}} {{.message}}"
  | level="error"

# Metrics from logs
sum(rate({job="app"} |= "status=500" [5m])) by (endpoint)
```
Distributed Tracing

Jaeger Setup

```yaml
# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "9411:9411"    # Zipkin compatible
    environment:
      - COLLECTOR_ZIPKIN_HTTP_PORT=9411
```
Application Instrumentation (Node.js)

```typescript
// Install: npm install jaeger-client opentracing
import { initTracer } from 'jaeger-client';

const config = {
  serviceName: 'my-app',
  sampler: {
    type: 'probabilistic',
    param: 1.0, // Sample 100% of traces
  },
  reporter: {
    logSpans: true,
    agentHost: 'localhost',
    agentPort: 6831,
  },
};

const tracer = initTracer(config);

// Trace an HTTP request
app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get_user');
  span.setTag('user_id', req.params.id);
  try {
    // Database query
    const dbSpan = tracer.startSpan('db_query', { childOf: span });
    const user = await db.user.findById(req.params.id);
    dbSpan.finish();

    // External API call
    const apiSpan = tracer.startSpan('external_api', { childOf: span });
    const profile = await fetchUserProfile(user.id);
    apiSpan.finish();

    span.setTag('http.status_code', 200);
    res.json({ user, profile });
  } catch (error) {
    span.setTag('error', true);
    span.setTag('http.status_code', 500);
    span.log({ event: 'error', message: error.message });
    res.status(500).json({ error: error.message });
  } finally {
    span.finish();
  }
});
```

Best Practices
Metric Naming

- Use descriptive names: http_requests_total, not requests
- Put units in the name: duration_seconds, bytes_total
- Use the _total suffix for counters
- Use the _bucket suffix for histograms
- Use consistent label names
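These conventions can be checked mechanically. A small sketch; the rules encoded here are a loose interpretation of the bullets above, not an official linter:

```python
import re

def naming_issues(name: str, metric_type: str) -> list:
    """Return violations of the naming conventions listed above."""
    issues = []
    if not re.fullmatch(r'[a-z_][a-z0-9_]*', name):
        issues.append('use lowercase snake_case')
    if metric_type == 'counter' and not name.endswith('_total'):
        issues.append('counters should end in _total')
    if not re.search(r'(_seconds|_bytes|_total|_ratio|_count)$', name):
        issues.append('name should carry a unit or conventional suffix')
    return issues

naming_issues('http_requests_total', 'counter')  # no issues
naming_issues('requests', 'counter')             # two issues
```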
Cardinality
- Avoid high-cardinality labels (user IDs, emails)
- Use bounded label values
- Aggregate when possible
- Monitor metric count
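The reason one unbounded label is so damaging: the worst-case series count for a metric is the product of its label cardinalities. A rough illustration (the label sets are hypothetical):

```python
from math import prod

def series_count(label_values: dict) -> int:
    """Worst-case number of time series for one metric:
    the product of each label's distinct-value count."""
    return prod(label_values.values())

# Bounded labels stay manageable:
bounded = series_count({'method': 5, 'status': 6, 'endpoint': 40})
# 1,200 series

# One unbounded label (e.g. user_id) multiplies everything:
unbounded = series_count({'method': 5, 'status': 6,
                          'endpoint': 40, 'user_id': 100_000})
# 120,000,000 series
```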
Alert Design
- Alert on symptoms, not causes
- Set appropriate thresholds
- Include actionable annotations
- Group related alerts
- Use inhibition rules
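"Alert on symptoms" often translates into burn-rate alerts against the SLO rather than raw resource thresholds. A hedged sketch of the usual multi-window check; the 14.4x factor follows the common SRE-workbook convention, and all thresholds here are assumptions to tune:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_ratio / (1.0 - slo_target)

def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float = 0.999, factor: float = 14.4) -> bool:
    """Page only when both a long and a short window burn fast,
    so brief blips and long-resolved incidents stay quiet."""
    return (burn_rate(long_window_ratio, slo_target) > factor and
            burn_rate(short_window_ratio, slo_target) > factor)

# 2% errors against a 99.9% SLO burns the budget 20x too fast:
should_page(0.02, 0.02)    # pages
should_page(0.0005, 0.02)  # does not page: long window still healthy
```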
Dashboard Design
- One purpose per dashboard
- Use consistent time ranges
- Include SLOs/SLIs
- Add context with annotations
- Use appropriate visualization types
Anti-Patterns to Avoid
❌ No SLOs: Define service level objectives
❌ Alert fatigue: Too many non-actionable alerts
❌ High cardinality: Labels with unbounded values
❌ Missing instrumentation: Instrument all critical paths
❌ No runbooks: Alerts should have clear remediation steps
❌ Ignoring trends: Monitor trends, not just current values
❌ No log structure: Use structured logging (JSON)
❌ Missing context: Include relevant labels and tags
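The structured-logging point is easy to demonstrate: emit one JSON object per line so a `| json` stage (as in the LogQL section) can parse fields. A minimal sketch using only the standard library; the `ctx` attribute name is this example's own convention:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            'timestamp': self.formatTime(record),
            'level': record.levelname.lower(),
            'message': record.getMessage(),
            **getattr(record, 'ctx', {}),  # extra context labels, if any
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger('app')
log.addHandler(handler)
log.setLevel(logging.INFO)

# `extra` attaches the ctx dict to the record for the formatter to merge in
log.info('order created', extra={'ctx': {'order_id': 'ord-1', 'status': 'created'}})
```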
Resources
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Loki: https://grafana.com/docs/loki/
- Jaeger: https://www.jaegertracing.io/docs/
- OpenTelemetry: https://opentelemetry.io/docs/
- SRE Book: https://sre.google/books/