prometheus-monitoring
Prometheus Monitoring
Overview
Set up Prometheus to collect, store, and query time-series metrics from applications and infrastructure.
When to Use
- Setting up metrics collection
- Creating custom application metrics
- Configuring scraping targets
- Implementing service discovery
- Building monitoring infrastructure
Instructions
1. Prometheus Configuration
yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'api-service'
    scrape_interval: 10s
    # Targets are host:port only; the path comes from metrics_path (default /metrics)
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Override the scrape path when prometheus.io/path is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
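Prometheus expects each entry under `targets` to be a `host:port` address, with the URL path configured separately through `metrics_path` (default `/metrics`). As a minimal sketch, assuming you start from full endpoint URLs, the hypothetical helper `to_scrape_job` below (not part of Prometheus) splits a URL into those pieces using only the Python standard library:

```python
from urllib.parse import urlsplit


def to_scrape_job(job_name: str, url: str) -> dict:
    """Split a full metrics URL into the fields a scrape job expects.

    Hypothetical helper for illustration; the returned dict mirrors the
    shape of one entry under scrape_configs.
    """
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"URL has no host: {url!r}")
    # Fall back to the scheme's default port when none is given
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return {
        "job_name": job_name,
        "scheme": parts.scheme or "http",
        "metrics_path": parts.path or "/metrics",
        "static_configs": [{"targets": [f"{parts.hostname}:{port}"]}],
    }


if __name__ == "__main__":
    print(to_scrape_job("api-service", "http://localhost:8080/metrics"))
```

The resulting dict could be dumped to YAML with whichever serializer you already use; that step is left out to keep the sketch dependency-free.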
2. Node.js Metrics Implementation
javascript
// metrics.js
const express = require('express');
const promClient = require('prom-client');

// Dedicated registry so only the metrics defined here are exposed
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const requestsTotal = new promClient.Counter({
  name: 'requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

const app = express();

// register.metrics() returns a Promise in recent prom-client versions (v13+)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Express middleware: record duration and count once the response is sent
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Use the matched route pattern, not the raw path, to keep label cardinality bounded
    const route = req.route ? req.route.path : 'unmatched';
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    requestsTotal.labels(req.method, route, res.statusCode).inc();
  });
  next();
});

module.exports = { register, httpRequestDuration, requestsTotal };
3. Python Prometheus Integration
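python
import time

from flask import Flask, g, request
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

# status_code is included so that 5xx filters in alert rules can match
request_count = Counter(
    'requests_total', 'Total requests',
    ['method', 'endpoint', 'status_code'])
request_duration = Histogram(
    'request_duration_seconds', 'Request duration in seconds',
    ['method', 'endpoint'])

@app.before_request
def before():
    # Keep per-request state on flask.g rather than on the request object
    g.start_time = time.time()

@app.after_request
def after(response):
    duration = time.time() - g.start_time
    # request.endpoint is the view name, which is bounded; raw paths are not
    endpoint = request.endpoint or 'unmatched'
    request_count.labels(request.method, endpoint, response.status_code).inc()
    request_duration.labels(request.method, endpoint).observe(duration)
    return response

if __name__ == '__main__':
    # Metrics are served on port 8000, separate from the app on port 5000
    start_http_server(8000)
    app.run(port=5000)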
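To make it clearer what a scraper reads from the metrics port, here is a rough standard-library sketch of the Prometheus text exposition format for a labelled counter. `render_counter` is a hypothetical helper, not part of `prometheus_client`, and it assumes label values need no escaping.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render counter samples in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Hypothetical helper; label values are assumed to need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    samples = {
        (("method", "GET"), ("endpoint", "index")): 42,
        (("method", "POST"), ("endpoint", "login")): 7,
    }
    print(render_counter("requests_total", "Total requests", samples), end="")
```

In practice the client library produces this output for you; the sketch only illustrates the shape of each sample line (metric name, label set, value).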
4. Alert Rules
yaml
# /etc/prometheus/alert_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # Per-second rate of 5xx responses, averaged over 5 minutes
        expr: rate(requests_total{status_code=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value }}"
      - alert: HighLatency
        # histogram_quantile needs the rate of the _bucket series, grouped by le
        expr: histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency: {{ $value }}s"
      - alert: HighMemoryUsage
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory: {{ $value }}"
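The `rate(...[5m])` expressions in these rules turn a monotonically increasing counter into a per-second value. The sketch below approximates that idea in plain Python, including the handling of counter resets. It is a simplification under stated assumptions: the real `rate()` also extrapolates to the edges of the window, which is omitted here, and `simple_rate` is a hypothetical name.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate the per-second rate of a counter over a window.

    `samples` is a time-ordered list of (timestamp_seconds, value).
    A drop in value is treated as a counter reset, so the post-reset
    value is counted as the increase. Unlike Prometheus, no
    extrapolation to the window edges is done.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


if __name__ == "__main__":
    # Counter scraped every 15 s; the process restarts before the fourth sample
    window = [(0, 100), (15, 103), (30, 106), (45, 2), (60, 5)]
    print(simple_rate(window))
```

Treating a drop as a reset is why counters must only ever increase: a value that legitimately goes down would be misread as a restart.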
5. Docker Compose Setup
yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
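After bringing the stack up, it can help to wait until Prometheus reports ready before running anything that depends on it. The sketch below polls the `/-/ready` endpoint that Prometheus serves. The URL assumes the port mapping shown in this Compose file, and `http_ok` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, attempts: int = 30, delay: float = 1.0,
                     probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Assumes the Compose stack publishes Prometheus on localhost:9090
    ready = wait_until_ready("http://localhost:9090/-/ready")
    print("Prometheus ready" if ready else "Prometheus not ready")
```

The `probe` parameter exists so the polling logic can be exercised without a running server.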
Best Practices
✅ DO
- Use consistent metric naming conventions
- Add labels that support filtering and aggregation, keeping each label's set of values bounded
- Set appropriate scrape intervals (10-60s)
- Implement retention policies
- Monitor Prometheus itself
- Test alert rules before deployment
- Document metric meanings
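The naming advice above can be partly automated. The following is a small sketch of a name linter, assuming the commonly documented conventions (lowercase snake_case, a `_total` suffix on counters, base units such as seconds and bytes). `lint_metric_name` and the list of discouraged suffixes are illustrative choices, not an official tool.

```python
import re

# Classic metric-name character set; newer Prometheus versions may accept more
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# Illustrative suffixes for non-base units that the conventions discourage
NON_BASE_UNITS = ("_milliseconds", "_ms", "_kilobytes", "_megabytes", "_minutes")


def lint_metric_name(name: str, metric_type: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append("contains characters not allowed in metric names")
    if name != name.lower():
        problems.append("should be lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if name.endswith(NON_BASE_UNITS):
        problems.append("use base units such as seconds or bytes")
    return problems


if __name__ == "__main__":
    for name, kind in [("requests_total", "counter"),
                       ("requestCount", "counter"),
                       ("latency_ms", "histogram")]:
        print(name, lint_metric_name(name, kind))
```

A check like this could run in CI against the metric definitions in a codebase, so naming drift is caught before it reaches dashboards and alerts.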
❌ DON'T
- Add unbounded cardinality labels
- Scrape too frequently (< 10s)
- Ignore metric naming conventions
- Create alerts without runbooks
- Store raw event data in Prometheus
- Use counters for gauge-like values
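Unbounded label cardinality often comes from putting raw request paths into a label, because every user ID or UUID creates a new time series. One way to avoid that, sketched below under the assumption that IDs appear as whole path segments, is to replace ID-like segments with placeholders before using the path as a label value. `normalize_path` is a hypothetical helper.

```python
import re

# Patterns for path segments that are typically unbounded identifiers
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$")
_NUMERIC = re.compile(r"^\d+$")


def normalize_path(path: str) -> str:
    """Replace ID-like segments with placeholders so the label stays bounded."""
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment):
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)


if __name__ == "__main__":
    print(normalize_path("/users/123/orders/456"))
    print(normalize_path("/files/123e4567-e89b-12d3-a456-426614174000"))
```

Where the web framework exposes the matched route template, using that directly is usually simpler and more reliable than pattern matching on segments.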
Key Prometheus Queries
promql
rate(requests_total[5m])                                                           # Request rate (per second)
histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))  # p95 latency
rate(requests_total{status_code=~"5.."}[5m])                                       # 5xx error rate (per second)
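`histogram_quantile` works on cumulative bucket counts (the `_bucket` series with an `le` label) and interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea in plain Python to show where a p95 figure comes from. It is a simplified approximation, not the Prometheus implementation: it assumes sorted buckets ending in `+Inf`, non-negative bounds, and a quantile between 0 and 1.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by
    upper bound and ending with the +Inf bucket. Simplified sketch:
    assumes 0 <= q <= 1 and non-negative finite bounds.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one): no interpolation possible
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Cumulative counts for the bucket bounds 0.1, 0.5, 1, 2, 5 and +Inf
    observed = [(0.1, 50), (0.5, 80), (1, 95), (2, 99), (5, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, observed))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket bounds bracket the latencies you care about.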