
Infrastructure Monitoring


Overview


Implement comprehensive infrastructure monitoring to track system health, performance metrics, and resource utilization with alerting and visualization across your entire stack.

When to Use


  • Real-time performance monitoring
  • Capacity planning and trends
  • Incident detection and alerting
  • Service health tracking
  • Resource utilization analysis
  • Performance troubleshooting
  • Compliance and audit trails
  • Historical data analysis

Implementation Examples


1. Prometheus Configuration


```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-monitor'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Rule files
rule_files:
  - 'alerts.yml'
  - 'rules.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1.internal:9100'
          - 'node2.internal:9100'
          - 'node3.internal:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # Docker container metrics
  - job_name: 'docker'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9323']

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Application metrics
  - job_name: 'application'
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s
    static_configs:
      - targets:
          - 'app1.internal:8080'
          - 'app2.internal:8080'
          - 'app3.internal:8080'

  # PostgreSQL metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter.internal:9187']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter.internal:9121']

  # RabbitMQ metrics
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq.internal:15692']
```
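The `rule_files` list above loads `rules.yml` alongside `alerts.yml`, but that file is not shown. A plausible sketch of what it might contain — recording rules that precompute the expressions the dashboards and alerts reuse (the rule names here are illustrative, not a standard):

```yaml
# rules.yml (illustrative) - precompute expensive queries
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Per-instance CPU usage, shared by dashboards and the CPU alert
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Per-job HTTP error ratio over 5 minutes
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

Recording rules trade a little storage for much cheaper dashboard queries; the `level:metric:operations` naming pattern follows the convention Prometheus documents for recorded series.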

2. Alert Rules


```yaml
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service has been unreachable for 1 minute"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"} / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is {{ $value }}%"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes node {{ $labels.node }} is not ready"
          description: "Node has been unready for 5 minutes"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in 15 minutes"
```
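The `rate()` calls in these expressions all work the same way: take a monotonically increasing counter, compute its increase over the lookback window, and divide by the window length in seconds. A minimal sketch of that arithmetic with made-up sample values (the real `rate()` additionally handles counter resets and extrapolates to the window edges):

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second rate of a counter: increase over the window divided
    by elapsed seconds. samples are (timestamp_seconds, counter_value)
    pairs inside the window; values here are made up for illustration."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A 5xx counter over a 5-minute window rose from 1200 to 1230:
# 30 errors / 300 s = 0.1 errors per second.
errors = [(0.0, 1200.0), (300.0, 1230.0)]
error_rate = simple_rate(errors)
print(error_rate)         # 0.1
print(error_rate > 0.05)  # True -> HighErrorRate fires once sustained for 5m
```

Note the threshold of `0.05` is a rate in errors per second, not a percentage; for a ratio-style alert you would divide the 5xx rate by the total request rate.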

3. Alertmanager Configuration


```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

# Template files
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing tree
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: 'critical-team'
      continue: true
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts
    - match:
        severity: warning
      receiver: 'warning-channel'
      group_wait: 1m

# Receivers
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-team'
    slack_configs:
      - channel: '#critical-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
    email_configs:
      - to: 'oncall@mycompany.com'
        from: 'alertmanager@mycompany.com'
        smarthost: 'smtp.mycompany.com:587'
        auth_username: 'alertmanager@mycompany.com'
        auth_password: 'secret'

  - name: 'warning-channel'
    slack_configs:
      - channel: '#warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
```
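One refinement worth considering: when a critical alert such as ServiceDown fires, related warnings on the same instance mostly add noise. Alertmanager's `inhibit_rules` can suppress them while the critical alert is active — a sketch, where the `equal` labels are an assumption about how your alerts are labeled:

```yaml
# Optional: mute warning-level notifications while a critical alert
# sharing the same instance is already firing.
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['instance']
```

Inhibition only silences notifications; the warning alerts still show as firing in Prometheus, so no state is lost.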

4. Grafana Dashboard


```json
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ],
        "type": "graph",
        "alert": {
          "name": "CPU Usage Alert",
          "conditions": [
            {
              "evaluator": {
                "type": "gt",
                "params": [80]
              }
            }
          ]
        }
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time P95",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Disk Usage",
        "targets": [
          {
            "expr": "(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
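Hand-edited dashboard JSON breaks easily, and a panel whose `expr` was lost simply renders empty. Before importing the file (for instance via Grafana's `POST /api/dashboards/db` endpoint), a small sanity check like this catches the common mistakes; the validation rules here are a minimal sketch, not part of Grafana:

```python
import json

def check_dashboard(raw: str) -> list[str]:
    """Return a list of problems: panels missing a title or a PromQL expr."""
    dashboard = json.loads(raw)["dashboard"]
    problems = []
    for i, panel in enumerate(dashboard.get("panels", [])):
        if not panel.get("title"):
            problems.append(f"panel {i}: missing title")
        if not any(t.get("expr") for t in panel.get("targets", [])):
            problems.append(f"panel {i}: no target has an 'expr'")
    return problems

# A panel with a query passes; an empty panel is flagged twice:
raw = ('{"dashboard": {"title": "t", "panels": ['
       '{"title": "CPU", "targets": [{"expr": "up"}]},'
       '{"title": "", "targets": []}]}}')
print(check_dashboard(raw))
# ['panel 1: missing title', "panel 1: no target has an 'expr'"]
```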

5. Monitoring Deployment


```bash
#!/bin/bash
# deploy-monitoring.sh - Deploy Prometheus and Grafana

set -euo pipefail

NAMESPACE="monitoring"
PROMETHEUS_VERSION="v2.40.0"
GRAFANA_VERSION="9.3.2"

echo "Creating monitoring namespace..."
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# Deploy Prometheus
echo "Deploying Prometheus..."
kubectl apply -f prometheus-configmap.yaml -n "$NAMESPACE"
kubectl apply -f prometheus-deployment.yaml -n "$NAMESPACE"
kubectl apply -f prometheus-service.yaml -n "$NAMESPACE"

# Deploy Alertmanager
echo "Deploying Alertmanager..."
kubectl apply -f alertmanager-configmap.yaml -n "$NAMESPACE"
kubectl apply -f alertmanager-deployment.yaml -n "$NAMESPACE"
kubectl apply -f alertmanager-service.yaml -n "$NAMESPACE"

# Deploy Grafana
echo "Deploying Grafana..."
kubectl apply -f grafana-deployment.yaml -n "$NAMESPACE"
kubectl apply -f grafana-service.yaml -n "$NAMESPACE"

# Wait for deployments
echo "Waiting for deployments to be ready..."
kubectl rollout status deployment/prometheus -n "$NAMESPACE" --timeout=5m
kubectl rollout status deployment/alertmanager -n "$NAMESPACE" --timeout=5m
kubectl rollout status deployment/grafana -n "$NAMESPACE" --timeout=5m

# Port forward for access
echo "Port forwarding to services..."
kubectl port-forward -n "$NAMESPACE" svc/prometheus 9090:9090 &
kubectl port-forward -n "$NAMESPACE" svc/grafana 3000:3000 &

echo "Monitoring stack deployed successfully!"
echo "Prometheus: http://localhost:9090"
echo "Grafana: http://localhost:3000"
```
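`kubectl rollout status` confirms the Deployments are up, but the port-forwards can still race against Prometheus actually answering requests. The generic poll-until-ready pattern behind that kind of wait can be sketched like this (the real probe would hit Prometheus's `/-/ready` endpoint; the injectable hooks here exist only to make the sketch testable):

```python
import time

def wait_until(probe, timeout=300.0, interval=5.0,
               _sleep=time.sleep, _now=time.monotonic):
    """Call probe() until it returns True or timeout seconds elapse.
    Returns True on success, False if the deadline passes first."""
    deadline = _now() + timeout
    while _now() < deadline:
        if probe():
            return True
        _sleep(interval)
    return False

# Example with a fake probe that succeeds on the third attempt:
attempts = {"n": 0}
def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until(fake_probe, timeout=10, interval=0, _sleep=lambda s: None))
# True
```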

Monitoring Best Practices


✅ DO


  • Monitor key business metrics
  • Set appropriate alert thresholds
  • Use consistent naming conventions
  • Implement dashboards for visualization
  • Keep data retention reasonable
  • Use labels for better querying
  • Test alerting paths regularly
  • Document alert meanings

❌ DON'T


  • Alert on every metric change
  • Ignore alert noise
  • Store too much unnecessary data
  • Set unrealistic thresholds
  • Mix metrics from different sources
  • Forget to test alert routing
  • Alert without runbooks
  • Over-instrument without purpose

Resources
