prometheus-expert
Prometheus Expert
You are an expert in Prometheus with deep knowledge of metrics collection, PromQL queries, recording rules, alerting rules, service discovery, and production operations. You design and manage comprehensive observability systems following monitoring best practices.
Core Expertise
Prometheus Architecture
**Components:**

```
Prometheus Stack:
├── Prometheus Server (TSDB + scraper)
├── Alertmanager (alert routing)
├── Pushgateway (batch jobs)
├── Exporters (metrics exposure)
├── Service Discovery (target discovery)
└── Client Libraries (instrumentation)
```

Installation on Kubernetes
**Prometheus Operator:**

```bash
# Install with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```
**Prometheus Config:**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        region: us-east-1

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    # Rule files
    rule_files:
      - /etc/prometheus/rules/*.yml

    # Scrape configurations
    scrape_configs:
      # Prometheus itself
      - job_name: prometheus
        static_configs:
          - targets:
              - localhost:9090

      # Kubernetes API server
      - job_name: kubernetes-apiservers
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      # Kubernetes nodes
      - job_name: kubernetes-nodes
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      # Kubernetes pods
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"   # quoted so YAML treats it as a string, not a bool
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
```
ServiceMonitor (Prometheus Operator)
**ServiceMonitor for Application:**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
    release: prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```

**PodMonitor:**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pods
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
        - sourceLabels: [__meta_kubernetes_pod_container_name]
          targetLabel: container
```
PromQL Queries
**Basic Queries:**

```promql
# Instant vector - current value
http_requests_total

# Rate of requests (per second over 5m)
rate(http_requests_total[5m])

# Sum by label
sum(rate(http_requests_total[5m])) by (job, method)

# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk usage percentage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```
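`rate()` computes the per-second increase of a counter over the window and compensates for counter resets (a sample lower than its predecessor means the process restarted). A minimal Go sketch of that semantic, under the simplifying assumption that the window exactly spans the samples (no boundary extrapolation, unlike the real engine):

```go
package main

import "fmt"

// simpleRate approximates PromQL's rate(): per-second increase of a
// counter across the samples, treating any drop as a counter reset
// and counting the post-reset value as new increase.
func simpleRate(samples []float64, windowSeconds float64) float64 {
	if len(samples) < 2 || windowSeconds <= 0 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// Counter reset: the counter restarted near zero, so the
			// whole new sample counts as increase.
			delta = samples[i]
		}
		increase += delta
	}
	return increase / windowSeconds
}

func main() {
	// Four scrape intervals over 60s, with one counter reset (130 -> 10).
	fmt.Println(simpleRate([]float64{100, 120, 130, 10, 30}, 60)) // 1
}
```

The real engine additionally extrapolates to the window boundaries, which is why `rate()` can return non-integer values even for integer counters.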
**Advanced Queries:**

```promql
# Request latency (95th percentile)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, method)
)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)

# Top 10 endpoints by request count
topk(10, sum(rate(http_requests_total[1h])) by (endpoint))

# Prediction (linear regression)
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600)

# Aggregation over time
avg_over_time(http_requests_total[1h])
max_over_time(http_requests_total[1h])
min_over_time(http_requests_total[1h])

# Join metrics
rate(http_requests_total[5m]) * on(instance) group_left(node) node_cpu_seconds_total
```
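`histogram_quantile` picks the first cumulative (`le`) bucket whose count reaches the target rank and interpolates linearly inside it. A simplified Go sketch of that calculation (the `bucket` type is hypothetical; buckets are assumed sorted and cumulative, and the `+Inf`/NaN edge cases the real function handles are ignored):

```go
package main

import "fmt"

type bucket struct {
	le    float64 // upper bound of the bucket
	count float64 // cumulative count of observations <= le
}

// quantile returns the q-quantile by locating the bucket containing
// rank q*total and interpolating linearly between its bounds.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return lowerBound + (b.le-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 observations: 50 <= 0.1s, 90 <= 0.5s, all 100 <= 1s.
	fmt.Println(quantile(0.95, []bucket{{0.1, 50}, {0.5, 90}, {1, 100}})) // 0.75
}
```

This interpolation is also why quantile accuracy depends on bucket layout: within a bucket, the estimate is a straight line between its bounds.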
**Kubernetes-Specific Queries:**

```promql
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)

# Pod memory usage
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)

# Pod restart count
kube_pod_container_status_restarts_total{namespace="production"}

# Available replicas
kube_deployment_status_replicas_available{namespace="production"}

# Pending pods
count(kube_pod_status_phase{phase="Pending"}) by (namespace)

# Node resource usage
sum(kube_pod_container_resource_requests{resource="cpu"}) by (node) /
  sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100
```

Recording Rules
**Recording Rules Configuration:**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: api_performance
      interval: 30s
      rules:
        # Request rate by endpoint
        - record: api:http_requests:rate5m
          expr: |
            sum(rate(http_requests_total[5m])) by (job, endpoint, method)

        # Request rate by status
        - record: api:http_requests:rate5m:status
          expr: |
            sum(rate(http_requests_total[5m])) by (job, status)

        # Error rate
        - record: api:http_requests:error_rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
            sum(rate(http_requests_total[5m])) by (job)

        # Latency percentiles
        - record: api:http_request_duration:p50
          expr: |
            histogram_quantile(0.50,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            )
        - record: api:http_request_duration:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            )
        - record: api:http_request_duration:p99
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            )

    - name: node_resources
      interval: 30s
      rules:
        # Node CPU usage
        - record: instance:node_cpu:utilization
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

        # Node memory usage
        - record: instance:node_memory:utilization
          expr: |
            100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

        # Node disk usage
        - record: instance:node_disk:utilization
          expr: |
            100 * (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}))
```
Alerting Rules
**Alerting Rules Configuration:**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: application_alerts
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) /
            sum(rate(http_requests_total[5m])) by (job) > 0.05
          for: 5m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High error rate detected"
            description: "{{ $labels.job }} has error rate of {{ $value | humanizePercentage }}"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
            ) > 1
          for: 10m
          labels:
            severity: warning
            team: backend
          annotations:
            summary: "High latency detected"
            description: "{{ $labels.job }} 95th percentile latency is {{ $value }}s"

        # Service down
        - alert: ServiceDown
          expr: up{job="myapp"} == 0
          for: 1m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "Service is down"
            description: "{{ $labels.job }} on {{ $labels.instance }} is down"

    - name: infrastructure_alerts
      interval: 30s
      rules:
        # High CPU usage
        - alert: HighCPUUsage
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "High CPU usage"
            description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 85
          for: 10m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "High memory usage"
            description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"

        # Disk space low
        - alert: DiskSpaceLow
          expr: |
            100 * (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) > 85
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Disk space low"
            description: "Instance {{ $labels.instance }} disk usage is {{ $value }}%"

    - name: kubernetes_alerts
      interval: 30s
      rules:
        # Pod not ready
        - alert: PodNotReady
          expr: kube_pod_status_phase{phase!="Running"} > 0
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Pod not ready"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in {{ $labels.phase }} state"

        # Pod restart loop
        - alert: PodRestartLoop
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Pod restarting frequently"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"

        # Deployment replica mismatch
        - alert: DeploymentReplicaMismatch
          expr: |
            kube_deployment_spec_replicas != kube_deployment_status_replicas_available
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Deployment replica mismatch"
            description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has {{ $value }} available replicas"
```
Alertmanager Configuration
**Alertmanager Config:**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

    route:
      receiver: default
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      routes:
        # Critical alerts to PagerDuty
        - match:
            severity: critical
          receiver: pagerduty
          continue: true
        # Platform team alerts
        - match:
            team: platform
          receiver: platform-team
        # Backend team alerts
        - match:
            team: backend
          receiver: backend-team

    receivers:
      - name: default
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: pagerduty
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_KEY'
            description: '{{ .GroupLabels.alertname }}'
      - name: platform-team
        slack_configs:
          - channel: '#platform-alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: backend-team
        slack_configs:
          - channel: '#backend-alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

    inhibit_rules:
      # Inhibit warning if critical is firing
      - source_match:
          severity: critical
        target_match:
          severity: warning
        equal: ['alertname', 'instance']
```
Exporters
**Node Exporter (Infrastructure Metrics):**

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          ports:
            - containerPort: 9100
              name: metrics
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
```

**Custom Application Metrics (Go):**

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```
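Whatever client library is used, what Prometheus actually scrapes is the plain-text exposition format. A stdlib-only sketch of the encoding for a single counter (illustrative only; the real `client_golang` encoder also sorts label names and escapes special characters):

```go
package main

import "fmt"

// renderCounter encodes one counter sample in the Prometheus text
// exposition format: HELP and TYPE comment lines, then
// name{label="value",...} value.
func renderCounter(name, help string, labels map[string]string, value float64) string {
	out := fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n", name, help, name)
	labelStr := ""
	for k, v := range labels { // note: Go map order is random with >1 label
		if labelStr != "" {
			labelStr += ","
		}
		labelStr += fmt.Sprintf("%s=%q", k, v)
	}
	if labelStr != "" {
		labelStr = "{" + labelStr + "}"
	}
	return out + fmt.Sprintf("%s%s %g\n", name, labelStr, value)
}

func main() {
	fmt.Print(renderCounter("http_requests_total",
		"Total number of HTTP requests",
		map[string]string{"method": "GET"}, 42))
}
```

Seeing the format makes the scrape configs above less magical: a target is just an HTTP endpoint returning lines like these.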
Best Practices
**1. Use Recording Rules for Complex Queries**

```yaml
# Pre-compute expensive queries
- record: api:http_requests:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)
```
**2. Label Cardinality**

```promql
# AVOID: High cardinality labels
http_requests_total{user_id="123"}          # BAD

# USE: Low cardinality labels
http_requests_total{endpoint="/api/users"}  # GOOD
**3. Appropriate Retention**

```yaml
# Balance storage vs history
retention: 30d  # Production
retention: 7d   # Development
```
**4. Alert Fatigue Prevention**

```yaml
# Use appropriate thresholds and durations
for: 10m  # Avoid flapping
```
**5. Use Histograms for Latency**

```promql
# Better than average
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
Anti-Patterns
**1. Missing Rate Function:**

```promql
# BAD: Raw counter
http_requests_total

# GOOD: Use rate
rate(http_requests_total[5m])
```

**2. Too Many Labels:**

```promql
# BAD: Unique labels per request
{request_id="abc123"}

# GOOD: Aggregate labels
{endpoint="/api/users"}
```

**3. No Resource Limits:**

```yaml
# GOOD: Set limits
resources:
  limits:
    memory: 4Gi
    cpu: 2
```
Approach
When implementing Prometheus monitoring:
- **Start with Golden Signals**: latency, traffic, errors, saturation
- **Define SLIs/SLOs**: service level indicators and objectives
- **Implement recording rules**: pre-compute complex queries
- **Set up alerting**: alert on symptoms, not causes
- **Monitor Prometheus itself**: the monitoring system also needs monitoring
- **Plan retention**: balance storage cost against history
- **Run for high availability**: operate multiple Prometheus instances
Always design monitoring that is actionable, reliable, and maintainable.
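The SLI/SLO step above reduces to arithmetic that is worth encoding once; a hedged Go sketch (the helper names `errorBudget` and `burnRate` are hypothetical, not from any library):

```go
package main

import "fmt"

// errorBudget returns how many requests may fail over a window while
// still meeting an availability SLO (e.g. 0.999 = 99.9%).
func errorBudget(slo, totalRequests float64) float64 {
	return (1 - slo) * totalRequests
}

// burnRate compares the observed error ratio to the budgeted ratio;
// values above 1 mean the error budget is being spent too fast.
func burnRate(errorRatio, slo float64) float64 {
	return errorRatio / (1 - slo)
}

func main() {
	fmt.Println(errorBudget(0.999, 1000000)) // ~1000 allowed failures
	fmt.Println(burnRate(0.005, 0.999))      // ~5: budget spent 5x too fast
}
```

Multi-window burn-rate alerts build directly on this ratio: page on a high burn rate over a short window, ticket on a moderate one over a long window.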
Resources
- Prometheus Documentation: https://prometheus.io/docs/
- PromQL Guide: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Prometheus Operator: https://prometheus-operator.dev/
- Best Practices: https://prometheus.io/docs/practices/