prometheus-monitoring

Prometheus Monitoring

Overview

Implement comprehensive Prometheus monitoring infrastructure for collecting, storing, and querying time-series metrics from applications and infrastructure.

When to Use

  • Setting up metrics collection
  • Creating custom application metrics
  • Configuring scraping targets
  • Implementing service discovery
  • Building monitoring infrastructure

Instructions

1. Prometheus Configuration

yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'api-service'
    scrape_interval: 10s
    # Targets are host:port only; the path is set with metrics_path
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Override the metrics path only when prometheus.io/path is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
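To make the two `relabel_configs` steps of the `kubernetes-pods` job more concrete, here is a simplified Python model of the `keep` and `replace` actions. It is a sketch for illustration, not Prometheus's implementation: the function `apply_relabel` and the sample pod dictionaries are hypothetical, it covers only these two actions with the default `;` separator, and it assumes the replace rule writes to `__metrics_path__` with the regex `(.+)`.

```python
import re

def apply_relabel(labels, rules):
    """Apply simplified `keep`/`replace` relabel rules to one target's labels.

    Returns the new label dict, or None if the target is dropped.
    """
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';' and the regex is fully anchored.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if rule["action"] == "keep":
            if match is None:
                return None  # target is not scraped
        elif rule["action"] == "replace":
            if match is not None:
                labels[rule["target_label"]] = match.group(1)
    return labels

RULES = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
     "action": "replace", "regex": "(.+)", "target_label": "__metrics_path__"},
]

# Hypothetical discovered pods with the meta labels service discovery attaches.
annotated = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_path": "/internal/metrics",
}
unannotated = {"__meta_kubernetes_pod_name": "batch-job"}

kept = apply_relabel(annotated, RULES)       # kept, with a custom metrics path
dropped = apply_relabel(unannotated, RULES)  # None: no scrape annotation
```

In this model a pod without the `prometheus.io/scrape` annotation is dropped by the first rule, and a pod that sets only the scrape annotation keeps whatever metrics path it already had.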

2. Node.js Metrics Implementation

javascript
// metrics.js
const promClient = require('prom-client');
const register = new promClient.Registry();

promClient.collectDefaultMetrics({ register });

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const requestsTotal = new promClient.Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Express middleware
const express = require('express');
const app = express();

app.get('/metrics', async (req, res) => {
  // register.metrics() returns a Promise in prom-client v13 and later
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

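app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Label by the matched route pattern (e.g. /users/:id), not the raw path,
    // so the `route` label keeps a bounded set of values
    const route = req.route ? req.route.path : 'unmatched';
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    requestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  next();
});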

// Export the app as well so the caller can add routes and call app.listen()
module.exports = { app, register, httpRequestDuration, requestsTotal };

3. Python Prometheus Integration

python
from prometheus_client import Counter, Histogram, start_http_server
from flask import Flask, request
import time

app = Flask(__name__)

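# status_code is included so that queries such as requests_total{status_code=~"5.."} match
request_count = Counter('requests_total', 'Total requests', ['method', 'endpoint', 'status_code'])
request_duration = Histogram('request_duration_seconds', 'Request duration', ['method', 'endpoint'])

@app.before_request
def before():
    request.start_time = time.time()

@app.after_request
def after(response):
    duration = time.time() - request.start_time
    # Label by the matched URL rule, not the raw path, to keep cardinality bounded
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    request_count.labels(request.method, endpoint, response.status_code).inc()
    request_duration.labels(request.method, endpoint).observe(duration)
    return response

if __name__ == '__main__':
    # Metrics are served on a separate port (8000); the application listens on 5000
    start_http_server(8000)
    app.run(port=5000)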
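To show roughly what Prometheus reads when it scrapes the metrics port, the following standard-library sketch renders counter samples in the text exposition format. The helper names `escape_label_value` and `render_counter` are hypothetical and are not part of `prometheus_client`, whose real output contains additional lines; the sample values are made up.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def render_counter(name, help_text, samples):
    """Render a counter's samples as text exposition lines.

    `samples` maps a tuple of (label, value) pairs to the counter's current value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_pairs, value in samples.items():
        labels = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in label_pairs)
        lines.append(f"{name}{{{labels}}} {float(value)}")
    return "\n".join(lines) + "\n"

# Made-up sample: three GET requests to /users.
text = render_counter(
    "requests_total",
    "Total requests",
    {(("method", "GET"), ("endpoint", "/users")): 3},
)
print(text)
```

Each distinct combination of label values becomes its own line, and therefore its own time series, which is why label values should come from a small, fixed set.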

4. Alert Rules

yaml
# /etc/prometheus/alert_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # 5xx responses per second, evaluated per series
        expr: rate(requests_total{status_code=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value }}"

      - alert: HighLatency
        # histogram_quantile needs per-bucket rates aggregated by the `le` label
        expr: histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency: {{ $value }}s"

      - alert: HighMemoryUsage
        # Fires when less than 10% of memory is available
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory: {{ $value }}"
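The `HighErrorRate` rule depends on `rate()` turning an ever-increasing counter into a per-second value. The sketch below illustrates the core idea, including how a counter reset (for example after a process restart) is handled. It is a simplified model under stated assumptions: the function `simple_rate` is hypothetical, the samples are made up, and unlike PromQL's `rate()` it does not extrapolate to the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase of a counter over `samples`, a list of (timestamp, value).

    A drop in value is treated as a counter reset, so the value seen after
    the reset counts as new increase.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed

# Made-up scrapes every 15s; the counter restarts from zero before t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
errors_per_second = simple_rate(scrapes)  # (30 + 30 + 10 + 30) / 60
```

Because resets are absorbed this way, alert expressions should apply `rate()` to the raw counter rather than compare counter values directly.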

5. Docker Compose Setup

yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:

Best Practices

✅ DO

  • Use consistent metric naming conventions
  • Add labels that are useful for filtering, keeping each label's set of values bounded
  • Set appropriate scrape intervals (10-60s)
  • Implement retention policies
  • Monitor Prometheus itself
  • Test alert rules before deployment
  • Document metric meanings

❌ DON'T

  • Add unbounded cardinality labels
  • Scrape too frequently (< 10s)
  • Ignore metric naming conventions
  • Create alerts without runbooks
  • Store raw event data in Prometheus
  • Use counters for gauge-like values
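The warning about unbounded cardinality can be made concrete with a little arithmetic: the number of time series for one metric is bounded by the product of the distinct values of each label. The sketch below computes that bound and shows one way to collapse variable path segments before using a path as a label value. Both helpers, `series_count` and `normalize_path`, and the label counts are hypothetical illustrations.

```python
import re
from math import prod

def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

def normalize_path(path):
    """Collapse UUID-like and numeric path segments so the label stays bounded."""
    path = re.sub(r"/[0-9a-fA-F]{8}-[0-9a-fA-F-]{27}(?=/|$)", "/:uuid", path)
    return re.sub(r"/\d+(?=/|$)", "/:id", path)

# Assumed label sets: 4 methods, 10 route patterns, 5 status codes.
bounded = series_count({
    "method": range(4),
    "route": range(10),
    "status_code": range(5),
})  # 4 * 10 * 5 = 200

# Using raw paths with user IDs as the route label instead: 100,000 users.
unbounded = series_count({
    "method": range(4),
    "route": range(100_000),
    "status_code": range(5),
})  # 2,000,000

label = normalize_path("/users/42/orders/7")  # "/users/:id/orders/:id"
```

For a histogram the count is multiplied again, since each label combination produces one series per bucket plus the `_sum` and `_count` series.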

Key Prometheus Queries

promql
rate(requests_total[5m])  # Request rate
histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))  # p95 latency
rate(requests_total{status_code=~"5.."}[5m])  # Error rate
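The p95 query operates on the histogram's cumulative `_bucket` series, one per `le` upper bound. The following sketch shows the estimation idea for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It is an illustrative model rather than Prometheus's code; the Python function `histogram_quantile` here is a hypothetical stand-in, the bucket counts are made up, and it assumes at least one finite bucket plus a final `+Inf` bucket.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets [(upper_bound, count), ...]."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[i - 1] if i > 0 else (0.0, 0.0)
    if count == below:
        return upper
    # Linear interpolation within the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Made-up cumulative counts for buckets 0.1, 0.5, 1, 2, 5 and +Inf.
observed = [(0.1, 50), (0.5, 80), (1, 90), (2, 98), (5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, observed)  # 1 + (2 - 1) * (95 - 90) / (98 - 90)
```

The estimate can never be more precise than the bucket layout allows, so bucket bounds are best placed around the latencies that matter for alerting.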