monitoring-expert


Monitoring Expert


Expert guidance for monitoring, observability, and alerting using Prometheus, Grafana, logging systems, and distributed tracing.

Core Concepts


The Three Pillars of Observability


  1. Metrics - Numerical measurements over time (Prometheus)
  2. Logs - Discrete events (ELK, Loki)
  3. Traces - Request flow through distributed systems (Jaeger, Tempo)

Monitoring Fundamentals


  • Golden Signals (Latency, Traffic, Errors, Saturation)
  • RED Method (Rate, Errors, Duration)
  • USE Method (Utilization, Saturation, Errors)
  • Service Level Indicators (SLIs)
  • Service Level Objectives (SLOs)
  • Service Level Agreements (SLAs)
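These frameworks reduce to a handful of queries once a service exposes standard HTTP metrics. A sketch of the RED method in PromQL, assuming the `http_requests_total` and `http_request_duration_seconds` metric names used later in this document:

```promql
# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests failing
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: 95th-percentile latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```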

Key Components


  • Metric collection (exporters, agents)
  • Time-series database
  • Visualization (dashboards)
  • Alerting (rules, receivers)
  • Log aggregation
  • Distributed tracing

Prometheus


Installation (Docker)


docker-compose.yml

```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:
```

Prometheus Configuration

prometheus.yml

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alert rules
rule_files:
  - 'alerts.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          instance: 'server-1'
          env: 'production'

  # Application metrics
  - job_name: 'app'
    static_configs:
      - targets:
          - 'app-1:8080'
          - 'app-2:8080'
          - 'app-3:8080'
    metrics_path: '/metrics'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Blackbox exporter (endpoint monitoring)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com   # probe targets (example)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Alert Rules

alerts.yml

```yaml
groups:
  - name: app_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"

      # API latency
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API latency on {{ $labels.instance }}"
          description: "95th percentile latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      # Disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # Pod restarts
      - alert: PodRestarting
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"
```

PromQL Queries

```promql
# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Success rate
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Average latency
rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])

# CPU usage per pod
rate(container_cpu_usage_seconds_total{pod!=""}[5m])

# Memory usage percentage
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100

# QPS per endpoint
sum by(endpoint) (rate(http_requests_total[5m]))

# Top 5 slowest endpoints
topk(5,
  histogram_quantile(0.95,
    sum by(endpoint, le) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

# Predict disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```

Application Instrumentation


Node.js (Express)


```typescript
// Install: npm install prom-client express-prom-bundle
import express from 'express';
import promBundle from 'express-prom-bundle';
import { register, Counter, Histogram, Gauge } from 'prom-client';

const app = express();

// Automatic metrics for all endpoints
const metricsMiddleware = promBundle({
  includeMethod: true,
  includePath: true,
  includeStatusCode: true,
  includeUp: true,
  customLabels: { app: 'myapp' },
  promClient: { collectDefaultMetrics: {} },
});

app.use(metricsMiddleware);

// Custom metrics
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders',
  labelNames: ['status', 'payment_method'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
});

// Use metrics in your code
app.post('/orders', async (req, res) => {
  const order = await createOrder(req.body);

  ordersTotal.inc({ status: 'created', payment_method: order.paymentMethod });
  orderValue.observe(order.total);

  res.json(order);
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080, () => {
  console.log('Server running on :8080');
  console.log('Metrics available at http://localhost:8080/metrics');
});
```

Python (Flask)

```python
# Install: pip install prometheus-flask-exporter
from flask import Flask, request, jsonify
from prometheus_flask_exporter import PrometheusMetrics
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)
metrics = PrometheusMetrics(app)

# Custom metrics
orders_total = Counter(
    'orders_total',
    'Total number of orders',
    ['status', 'payment_method']
)

order_value = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 50, 100, 500, 1000, 5000]
)

active_users = Gauge(
    'active_users',
    'Number of active users'
)

@app.route('/orders', methods=['POST'])
def create_order():
    order = process_order(request.json)

    orders_total.labels(
        status='created',
        payment_method=order['payment_method']
    ).inc()

    order_value.observe(order['total'])

    return jsonify(order)

@app.route('/health')
def health():
    return {'status': 'healthy'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)  # Metrics available at /metrics
```

Go

```go
package main

import (
    "encoding/json"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    ordersTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_total",
            Help: "Total number of orders",
        },
        []string{"status", "payment_method"},
    )

    orderValue = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "order_value_dollars",
            Help:    "Order value in dollars",
            Buckets: []float64{10, 50, 100, 500, 1000, 5000},
        },
    )

    activeUsers = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_users",
            Help: "Number of active users",
        },
    )
)

func createOrderHandler(w http.ResponseWriter, r *http.Request) {
    order := processOrder(r.Body)

    ordersTotal.WithLabelValues(
        "created",
        order.PaymentMethod,
    ).Inc()

    orderValue.Observe(order.Total)

    json.NewEncoder(w).Encode(order)
}

func main() {
    http.HandleFunc("/orders", createOrderHandler)
    http.Handle("/metrics", promhttp.Handler())

    http.ListenAndServe(":8080", nil)
}
```

Alertmanager


Configuration

alertmanager.yml

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: slack

    # Database alerts
    - match_re:
        service: database
      receiver: database-team

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'password'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'database-team'
    slack_configs:
      - channel: '#database-alerts'
    email_configs:
      - to: 'dba-team@example.com'

# Suppress a warning if the matching critical alert is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
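Alertmanager can also POST grouped alerts to an arbitrary HTTP endpoint via `webhook_configs`. A minimal sketch of handling that payload (the function and the field subset shown are illustrative, following the documented webhook JSON format):

```typescript
// Shape of one alert in the Alertmanager webhook payload (subset).
interface WebhookAlert {
  status: string;                       // "firing" or "resolved"
  labels: Record<string, string>;       // includes alertname, severity, ...
  annotations: Record<string, string>;  // includes summary, description, ...
}

// Hypothetical formatter: turn a webhook payload into log lines.
function formatAlerts(payload: { alerts: WebhookAlert[] }): string[] {
  return payload.alerts.map(
    (a) => `[${a.status}] ${a.labels.alertname}: ${a.annotations.summary}`
  );
}

console.log(formatAlerts({
  alerts: [{
    status: 'firing',
    labels: { alertname: 'HighErrorRate' },
    annotations: { summary: 'High error rate on app-1' },
  }],
}));
// → [ '[firing] HighErrorRate: High error rate on app-1' ]
```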

Grafana


Dashboard Configuration (JSON)


```json
{
  "dashboard": {
    "title": "Application Metrics",
    "tags": ["app", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}
```

Provisioning Data Sources

grafana/provisioning/datasources/prometheus.yml

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # the prometheus service from the docker-compose stack above
    url: http://prometheus:9090
    isDefault: true
```

Logging with Loki


Loki Configuration

loki-config.yml

```yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```

Promtail Configuration

promtail-config.yml

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  # push endpoint of the loki service (standard path)
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Application logs
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log

  # Docker logs
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'

  # Kubernetes logs
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - docker: {}
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
```

LogQL Queries

```logql
# All logs for a job
{job="app"}

# Filter by level
{job="app"} |= "error"

# JSON parsing
{job="app"} | json | level="error"

# Rate of errors
rate({job="app"} |= "error" [5m])

# Count by pod
sum by (pod) (count_over_time({namespace="production"}[5m]))

# Extract and filter
{job="app"} | json | line_format "{{.timestamp}} {{.level}} {{.message}}" | level="error"

# Metrics from logs
sum(rate({job="app"} |= "status=500" [5m])) by (endpoint)
```

Distributed Tracing


Jaeger Setup


docker-compose.yml

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"   # UI
      - "14268:14268"   # Collector
      - "9411:9411"     # Zipkin compatible
    environment:
      - COLLECTOR_ZIPKIN_HTTP_PORT=9411
```

Application Instrumentation (Node.js)


```typescript
// Install: npm install jaeger-client opentracing
import { initTracer } from 'jaeger-client';

const config = {
  serviceName: 'my-app',
  sampler: {
    type: 'probabilistic',
    param: 1.0, // Sample 100% of traces
  },
  reporter: {
    logSpans: true,
    agentHost: 'localhost',
    agentPort: 6831,
  },
};

const tracer = initTracer(config);

// Trace HTTP request
app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get_user');
  span.setTag('user_id', req.params.id);

  try {
    // Database query
    const dbSpan = tracer.startSpan('db_query', { childOf: span });
    const user = await db.user.findById(req.params.id);
    dbSpan.finish();

    // External API call
    const apiSpan = tracer.startSpan('external_api', { childOf: span });
    const profile = await fetchUserProfile(user.id);
    apiSpan.finish();

    span.setTag('http.status_code', 200);
    res.json({ user, profile });
  } catch (error) {
    span.setTag('error', true);
    span.setTag('http.status_code', 500);
    span.log({ event: 'error', message: error.message });
    res.status(500).json({ error: error.message });
  } finally {
    span.finish();
  }
});
```

Best Practices


Metric Naming


  • Use descriptive names: `http_requests_total`, not `requests`
  • Put the unit in the name: `duration_seconds`, `bytes_total`
  • Use the `_total` suffix for counters
  • Use the `_bucket` suffix for histograms
  • Use consistent label names
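The rules above can be encoded as a quick lint check. This helper is hypothetical (not part of prom-client or any library) and only covers the conventions listed:

```typescript
// Hypothetical lint check for Prometheus metric names.
function checkMetricName(name: string, kind: 'counter' | 'histogram' | 'gauge'): string[] {
  const problems: string[] = [];
  // Valid metric names: letters, digits, underscores, colons
  if (!/^[a-zA-Z_:][a-zA-Z0-9_:]*$/.test(name)) {
    problems.push('invalid characters');
  }
  // Counters should end in _total
  if (kind === 'counter' && !name.endsWith('_total')) {
    problems.push('counter missing _total suffix');
  }
  // Names should carry a recognizable base unit where one applies
  const units = ['seconds', 'bytes', 'total', 'ratio', 'info'];
  if (!units.some((u) => name.includes(u))) {
    problems.push('no recognizable unit in name');
  }
  return problems;
}

console.log(checkMetricName('http_requests_total', 'counter')); // []
console.log(checkMetricName('requests', 'counter'));            // flags two problems
```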

Cardinality


  • Avoid high-cardinality labels (user IDs, emails)
  • Use bounded label values
  • Aggregate when possible
  • Monitor metric count
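A common way to keep label values bounded is to normalize raw values before they become labels, e.g. collapsing URL path segments that embed IDs. A sketch (the ID patterns are assumptions about your routes):

```typescript
// Collapse unbounded path segments (numeric IDs, UUIDs) into
// placeholders so the resulting label set stays bounded.
function normalizePath(path: string): string {
  return path
    .split('/')
    .map((seg) => {
      if (/^\d+$/.test(seg)) return ':id'; // numeric ID
      if (/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(seg)) {
        return ':uuid'; // UUID
      }
      return seg;
    })
    .join('/');
}

console.log(normalizePath('/users/12345/orders/987'));
// → /users/:id/orders/:id
```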

Alert Design


  • Alert on symptoms, not causes
  • Set appropriate thresholds
  • Include actionable annotations
  • Group related alerts
  • Use inhibition rules
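"Appropriate thresholds" are often derived from an SLO error budget: alert when the budget is being burned faster than the SLO period allows. A sketch of the arithmetic (the numbers are illustrative):

```typescript
// For a given SLO, the error budget is 1 - target. A burn rate of N
// means the whole budget would be exhausted in (period / N).
function burnRate(errorRate: number, sloTarget: number): number {
  const budget = 1 - sloTarget; // allowed error fraction
  return errorRate / budget;
}

// With a 99.9% SLO, a sustained 0.5% error rate burns budget ~5x too fast:
console.log(burnRate(0.005, 0.999));
```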

Dashboard Design


  • One purpose per dashboard
  • Use consistent time ranges
  • Include SLOs/SLIs
  • Add context with annotations
  • Use appropriate visualization types

Anti-Patterns to Avoid


  ❌ No SLOs: define service level objectives
  ❌ Alert fatigue: too many non-actionable alerts
  ❌ High cardinality: labels with unbounded values
  ❌ Missing instrumentation: instrument all critical paths
  ❌ No runbooks: alerts should have clear remediation steps
  ❌ Ignoring trends: monitor trends, not just current values
  ❌ No log structure: use structured logging (JSON)
  ❌ Missing context: include relevant labels and tags
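On the log-structure point: emitting one JSON object per line is what lets Loki's `| json` stage (shown earlier) parse fields. A minimal sketch, with illustrative field names:

```typescript
// Minimal structured logger: one JSON object per line, with a
// timestamp, level, message, and arbitrary context fields.
function logEvent(
  level: 'info' | 'warn' | 'error',
  message: string,
  fields: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
  console.log(line);
  return line;
}

logEvent('error', 'order failed', { order_id: 'abc123', status: 500 });
```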

Resources
