Infrastructure Monitoring
Overview
Implement comprehensive infrastructure monitoring to track system health, performance metrics, and resource utilization with alerting and visualization across your entire stack.
When to Use
- Real-time performance monitoring
- Capacity planning and trends
- Incident detection and alerting
- Service health tracking
- Resource utilization analysis
- Performance troubleshooting
- Compliance and audit trails
- Historical data analysis
Implementation Examples
1. Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-monitor'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Rule files
rule_files:
  - 'alerts.yml'
  - 'rules.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1.internal:9100'
          - 'node2.internal:9100'
          - 'node3.internal:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # Docker container metrics
  - job_name: 'docker'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9323']

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Application metrics
  - job_name: 'application'
    metrics_path: '/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s
    static_configs:
      - targets:
          - 'app1.internal:8080'
          - 'app2.internal:8080'
          - 'app3.internal:8080'

  # PostgreSQL metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter.internal:9187']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter.internal:9121']

  # RabbitMQ metrics
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq.internal:15692']
```
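The `rule_files` section loads `rules.yml`, which is not shown here. A minimal recording-rules sketch that precomputes the expressions reused by the dashboards and alerts below might look like this (the rule names follow the common `level:metric:operation` convention and are assumptions, not part of the original config):

```yaml
# rules.yml -- precompute frequently queried expressions
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Per-series 5m rate of HTTP 5xx responses, reused by dashboards and alerts
      - record: job:http_errors:rate5m
        expr: rate(http_requests_total{status=~"5.."}[5m])
      # Per-instance CPU utilization percentage
      - record: instance:node_cpu_utilization:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Recording rules trade a small amount of storage for much cheaper dashboard queries, since the expensive expression is evaluated once per interval instead of on every panel refresh.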
2. Alert Rules

```yaml
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service has been unreachable for 1 minute"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"} / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is {{ $value }}%"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes node {{ $labels.node }} is not ready"
          description: "Node has been unready for 5 minutes"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in 15 minutes"
```
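Alert rules like these can be unit-tested offline with `promtool test rules`, which replays synthetic series against the rule file. A minimal test for the ServiceDown alert above might look like this (the test file name is illustrative, and the expected labels assume a scrape target named `app1.internal:8080`):

```yaml
# alerts_test.yml -- run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The application target reports down for several minutes
      - series: 'up{job="application", instance="app1.internal:8080"}'
        values: '0x5'
    alert_rule_test:
      # After the 1m "for" duration, the alert should be firing
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: application
              instance: app1.internal:8080
            exp_annotations:
              summary: "Service app1.internal:8080 is down"
              description: "Service has been unreachable for 1 minute"
```

Running such tests in CI catches broken expressions and template typos before a rule change reaches production.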
3. Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

# Template files
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing tree
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: 'critical-team'
      continue: true
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts
    - match:
        severity: warning
      receiver: 'warning-channel'
      group_wait: 1m

# Receivers
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-team'
    slack_configs:
      - channel: '#critical-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
    email_configs:
      - to: 'oncall@mycompany.com'
        from: 'alertmanager@mycompany.com'
        smarthost: 'smtp.mycompany.com:587'
        auth_username: 'alertmanager@mycompany.com'
        auth_password: 'secret'

  - name: 'warning-channel'
    slack_configs:
      - channel: '#warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
```
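One common refinement not shown above is inhibition: while a critical alert is firing for a host, matching warning-level alerts are suppressed so on-call only sees the root issue. A sketch of an `inhibit_rules` block that could be appended to `alertmanager.yml`:

```yaml
# Suppress warning alerts while a critical alert with the same
# alertname and instance labels is already firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```

The `equal` list controls scope: only alerts sharing those label values are inhibited, so a warning on one host is never hidden by a critical alert on a different host.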
4. Grafana Dashboard

```json
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ],
        "type": "graph",
        "alert": {
          "name": "CPU Usage Alert",
          "conditions": [
            {
              "evaluator": {
                "type": "gt",
                "params": [80]
              }
            }
          ]
        }
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time P95",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Disk Usage",
        "targets": [
          {
            "expr": "(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```

5. Monitoring Deployment
```bash
#!/bin/bash
# deploy-monitoring.sh - Deploy Prometheus and Grafana

set -euo pipefail

NAMESPACE="monitoring"
PROMETHEUS_VERSION="v2.40.0"
GRAFANA_VERSION="9.3.2"

echo "Creating monitoring namespace..."
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# Deploy Prometheus
echo "Deploying Prometheus..."
kubectl apply -f prometheus-configmap.yaml -n "$NAMESPACE"
kubectl apply -f prometheus-deployment.yaml -n "$NAMESPACE"
kubectl apply -f prometheus-service.yaml -n "$NAMESPACE"

# Deploy Alertmanager
echo "Deploying Alertmanager..."
kubectl apply -f alertmanager-configmap.yaml -n "$NAMESPACE"
kubectl apply -f alertmanager-deployment.yaml -n "$NAMESPACE"
kubectl apply -f alertmanager-service.yaml -n "$NAMESPACE"

# Deploy Grafana
echo "Deploying Grafana..."
kubectl apply -f grafana-deployment.yaml -n "$NAMESPACE"
kubectl apply -f grafana-service.yaml -n "$NAMESPACE"

# Wait for deployments
echo "Waiting for deployments to be ready..."
kubectl rollout status deployment/prometheus -n "$NAMESPACE" --timeout=5m
kubectl rollout status deployment/alertmanager -n "$NAMESPACE" --timeout=5m
kubectl rollout status deployment/grafana -n "$NAMESPACE" --timeout=5m

# Port forward for access
echo "Port forwarding to services..."
kubectl port-forward -n "$NAMESPACE" svc/prometheus 9090:9090 &
kubectl port-forward -n "$NAMESPACE" svc/grafana 3000:3000 &

echo "Monitoring stack deployed successfully!"
echo "Prometheus: http://localhost:9090"
echo "Grafana: http://localhost:3000"
```
Monitoring Best Practices
✅ DO
- Monitor key business metrics
- Set appropriate alert thresholds
- Use consistent naming conventions
- Implement dashboards for visualization
- Keep data retention reasonable
- Use labels for better querying
- Test alerting paths regularly
- Document alert meanings
❌ DON'T
- Alert on every metric change
- Ignore alert noise
- Store too much unnecessary data
- Set unrealistic thresholds
- Mix metrics from different sources
- Forget to test alert routing
- Alert without runbooks
- Over-instrument without purpose