prometheus-configuration
Prometheus Configuration
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
Purpose
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
When to Use
- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery
Prometheus Architecture
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
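To make the top of the diagram concrete, here is a minimal sketch of an application exposing a `/metrics` endpoint in the Prometheus text exposition format. It uses only the Python standard library for illustration; a real service would normally use an official client library instead. The counter store `REQUEST_COUNTS`, the metric name `http_requests_total`, and port 8000 are assumptions made for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter store: {(method, status): count}.
REQUEST_COUNTS = {("GET", "200"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered counters on /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; call this from the application's entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Prometheus would then scrape this process like any other target, for example with a static target of `localhost:8000`.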
Installation
Kubernetes with Helm
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# The chart sizes the Prometheus volume through storageSpec.volumeClaimTemplate.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
Docker Compose
yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
volumes:
  prometheus-data:
Configuration File
prometheus.yml:
yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      # Strip the port so that `instance` holds only the host name
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
**Reference:** See `assets/prometheus.yml.template`
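The `replace` relabel steps in the `node-exporter` and `kubernetes-pods` jobs can be hard to read at first. The sketch below imitates one such step in Python: the source label values are joined with `;`, the regex must match the whole joined string, and capture groups are substituted into the replacement. It is a simplified model for reasoning about a rule, not Prometheus code; the helper name `relabel_replace` and the sample addresses are hypothetical, and it relies on Python's `re` accepting the same patterns, which holds for the ones used here but not for every RE2 expression.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label,
                    separator=";"):
    """Apply one `replace` relabel step and return a new label dict."""
    # Missing source labels are treated as empty strings.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate Prometheus-style $1 / ${1} references into Python's \g<1>.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


# Usage: the port annotation overrides the port in the discovered address.
PORT = "__meta_kubernetes_pod_annotation_prometheus_io_port"
pod = {"__address__": "10.0.0.5:8080", PORT: "9102"}
rewritten = relabel_replace(
    pod, ["__address__", PORT], r"([^:]+)(?::\d+)?;(\d+)", "$1:$2", "__address__"
)
```

If the pod has no port annotation, the joined string ends in `;` with nothing after it, the regex does not match, and the address is left as discovered.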
Scrape Configurations
Static Targets
yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"
File-based Service Discovery
yaml
scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
targets/production.json:
json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
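Target files like `targets/production.json` are usually generated by a script or an inventory tool rather than written by hand. The sketch below shows one way to do that; the helper names `build_target_groups` and `write_targets` are hypothetical. It writes to a temporary file and renames it into place, on the assumption that a rename within one directory is atomic, so Prometheus should never see a half-written file.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port, labels):
    """Return the list-of-groups structure used by file-based discovery."""
    return [{"targets": [f"{host}:{port}" for host in hosts], "labels": dict(labels)}]


def write_targets(path, groups):
    """Write the target file atomically via a temp file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the temp file outside the *.json / *.yml globs.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(groups, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file as 0600; make it readable by the Prometheus user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Usage: reproduces the example file shown above.
groups = build_target_groups(
    ["app1", "app2"], 9090, {"env": "production", "service": "api"}
)
```

Calling `write_targets("/etc/prometheus/targets/production.json", groups)` would then publish the file; Prometheus picks up changes on its own, with `refresh_interval` acting as a periodic re-read.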
Kubernetes Service Discovery
yaml
scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
**Reference:** See `references/scrape-configs.md`
Recording Rules
Create pre-computed metrics for frequently queried expressions:
yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
**Reference:** See `references/recording-rules.md`
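The P95 rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The following sketch reproduces that idea in Python to show why the result depends on the bucket boundaries. It is a simplified model, assuming classic histograms with positive bucket bounds and `0 < q <= 1`; the real function handles further edge cases, and the bucket values used here are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, including +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    # The lowest bucket is assumed to start at 0.
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Usage: 100 requests, 90 of them at or below 0.5s, all at or below 1s.
p95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)])
```

In this example the rank 95 lies halfway through the (0.5, 1.0] bucket, so the estimate is 0.75 seconds. Coarser buckets around the latencies of interest give coarser estimates.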
Alert Rules
yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
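The `for:` clause means an alert must stay true across consecutive evaluations before it fires; until then it is only pending. The sketch below models that timer for a single alert. It is a simplified illustration, assuming one evaluation per group interval and ignoring details such as `keep_firing_for` and state restoration after a restart; the function name `alert_states` is hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp_seconds, expr_is_true) evaluations to alert states."""
    states = []
    active_since = None
    for timestamp, is_true in evaluations:
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Usage: ServiceDown with `for: 1m`, evaluated every 30s, with one recovery blip.
timeline = alert_states(
    [(0, True), (30, False), (60, True), (90, True), (120, True)], for_seconds=60
)
```

Here the brief recovery at 30s resets the timer, so the alert only fires at 120s. This is why a `for:` duration shorter than the evaluation interval adds little protection against flapping.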
Validation
bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
**Reference:** See `scripts/validate-prometheus.sh`
Best Practices
- Use consistent naming for metrics (prefix_name_unit)
- Set appropriate scrape intervals (15-60s typical)
- Use recording rules for expensive queries
- Implement high availability (multiple Prometheus instances)
- Configure retention based on storage capacity
- Use relabeling for metric cleanup
- Monitor Prometheus itself
- Implement federation for large deployments
- Use Thanos/Cortex for long-term storage
- Document custom metrics
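For the retention and scrape-interval items above, a rough disk estimate helps when choosing values. The sketch below applies the rule of thumb from the Prometheus storage documentation: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample. The default of 2 bytes per sample is the upper end of the commonly quoted 1 to 2 byte range, and the series count in the usage line is a made-up figure.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_seconds,
                        bytes_per_sample=2.0):
    """Rough TSDB disk estimate: retention * samples/s * bytes per sample."""
    # Each active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_seconds
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Usage: 30d retention, a hypothetical 1.5 million series scraped every 15s.
needed = estimate_disk_bytes(30, 1_500_000, 15)
print(f"{needed / 1e9:.1f} GB")
```

Doubling the scrape interval halves the estimate, which is one reason to keep high-frequency scraping for the jobs that actually need it.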
Troubleshooting
Check scrape targets:
bash
curl http://localhost:9090/api/v1/targets
Check configuration:
bash
curl http://localhost:9090/api/v1/status/config
Test query:
bash
curl 'http://localhost:9090/api/v1/query?query=up'
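The targets endpoint returns a large JSON document, so it can help to filter it down to the targets that are failing. The sketch below does that with the standard library, relying on the `activeTargets`, `health`, `scrapeUrl`, and `lastError` fields of the response. The function names are hypothetical, and `http://localhost:9090` is the same address assumed by the curl commands above.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Call the targets endpoint and return the decoded JSON response."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every active target not 'up'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]
```

A typical use is `for job, url, err in unhealthy_targets(fetch_targets()): print(job, url, err)`; the `lastError` text usually points directly at the cause, such as a refused connection or a TLS failure.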
Reference Files
- assets/prometheus.yml.template - Complete configuration template
- references/scrape-configs.md - Scrape configuration patterns
- references/recording-rules.md - Recording rule examples
- scripts/validate-prometheus.sh - Validation script
Related Skills
- grafana-dashboards - For visualization
- slo-implementation - For SLO monitoring
- distributed-tracing - For request tracing