prometheus-monitoring

Prometheus Monitoring

Overview

Set up Prometheus monitoring infrastructure to collect, store, and query time-series metrics from applications and infrastructure.

When to Use

Setting up metrics collection
Creating custom application metrics
Configuring scraping targets
Implementing service discovery
Building monitoring infrastructure

Instructions

1. Prometheus Configuration

yaml

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'api-service'
    # Targets are host:port only; the path is set with metrics_path (default: /metrics)
    metrics_path: /metrics
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Use the prometheus.io/path annotation, when present, as the scrape path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
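
The `kubernetes-pods` job relies on relabeling to decide which pods are scraped and on which path. The following Python sketch imitates the two rules on plain dictionaries to show the effect; it is a simplified model, not Prometheus code. `apply_relabel` and `RULES` are hypothetical names, and the sketch assumes the replace rule writes to `__metrics_path__`, the internal label Prometheus reads the scrape path from.

```python
import re

# Simplified stand-ins for the two relabel rules of the kubernetes-pods job
RULES = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
     "action": "replace", "regex": "(.+)", "target_label": "__metrics_path__"},
]


def apply_relabel(labels, rules):
    """Apply keep/replace rules in order; return None when the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';' (the Prometheus default separator)
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors relabel regexes at both ends, hence fullmatch
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if rule["action"] == "keep":
            if match is None:
                return None
        elif rule["action"] == "replace" and match is not None:
            # Default replacement is the first capture group
            labels[rule["target_label"]] = match.expand(rule.get("replacement", r"\1"))
    return labels


annotated_pod = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_path": "/custom/metrics",
}
print(apply_relabel(annotated_pod, RULES)["__metrics_path__"])
```

A pod without the `prometheus.io/scrape: "true"` annotation yields `None` (dropped), and a pod without the path annotation keeps the default scrape path because the replace rule does not match an empty value.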

2. Node.js Metrics Implementation

javascript

// metrics.js
const promClient = require('prom-client');
const register = new promClient.Registry();

promClient.collectDefaultMetrics({ register });

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const requestsTotal = new promClient.Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Express middleware
const express = require('express');
const app = express();

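app.get('/metrics', async (req, res) => {
  // register.metrics() returns a Promise in prom-client v13 and later
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Label by the matched route pattern (e.g. /users/:id) rather than the raw
    // path, so that IDs in URLs do not create unbounded label cardinality
    const route = req.route ? req.route.path : 'unmatched';
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    requestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  next();
});

module.exports = { app, register, httpRequestDuration, requestsTotal };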
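
To make the `buckets` setting of the histogram concrete, the Python sketch below (Python is used only for illustration and is not part of the Node.js service) shows how observations are counted into cumulative buckets: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all of them. `cumulative_counts` is a hypothetical helper.

```python
import math

# Upper bounds in seconds, matching the histogram definition above
BUCKETS = [0.1, 0.5, 1, 2, 5]


def cumulative_counts(observations, bounds=BUCKETS):
    """Return {upper_bound: count of observations <= upper_bound}, including +Inf."""
    edges = list(bounds) + [math.inf]
    return {le: sum(1 for value in observations if value <= le) for le in edges}


# Five request durations in seconds (made-up values)
counts = cumulative_counts([0.05, 0.3, 0.3, 1.5, 7.0])
for le, count in counts.items():
    print(f"le={le}: {count}")
```

Because the 7.0 s request only lands in the `+Inf` bucket, any quantile that falls there cannot be estimated more precisely than "above 5 s", which is one reason to choose bucket bounds that bracket the latencies you expect.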

3. Python Prometheus Integration

python

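from prometheus_client import Counter, Histogram, start_http_server
from flask import Flask, g, request
import time

app = Flask(__name__)

# status_code is included so that 5xx responses can be selected in queries and alerts
request_count = Counter('requests_total', 'Total requests', ['method', 'endpoint', 'status_code'])
request_duration = Histogram('request_duration_seconds', 'Request duration', ['method', 'endpoint'])

@app.before_request
def before():
    # flask.g is the per-request scratch space
    g.start_time = time.time()

@app.after_request
def after(response):
    duration = time.time() - g.start_time
    # Label by the matched URL rule rather than the raw path to keep cardinality bounded
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    request_count.labels(request.method, endpoint, response.status_code).inc()
    request_duration.labels(request.method, endpoint).observe(duration)
    return response

if __name__ == '__main__':
    # Metrics are served on port 8000, separately from the application on port 5000
    start_http_server(8000)
    app.run(port=5000)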
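
For orientation, this is roughly what Prometheus receives when it scrapes the metrics port. The sketch below hand-renders a counter in the text exposition format using only the standard library; it is for illustration, and a real service should let the client library produce this output. `render_counter` and `escape` are hypothetical helpers, and the sample value is made up.

```python
def escape(value):
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        rendered = ",".join(f'{key}="{escape(val)}"' for key, val in labels)
        lines.append(f"{name}{{{rendered}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "requests_total",
    "Total requests",
    {(("method", "GET"), ("endpoint", "/users")): 3.0},
)
print(text, end="")
```

Each distinct combination of label values becomes its own line, and therefore its own time series, in the scrape output.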

4. Alert Rules

yaml

# /etc/prometheus/alert_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # Fraction of requests answered with a 5xx status over the last 5 minutes
        expr: |
          sum(rate(requests_total{status_code=~"5.."}[5m]))
            / sum(rate(requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value }}"

      - alert: HighLatency
        # histogram_quantile needs the per-bucket rates, aggregated by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency: {{ $value }}s"

      - alert: HighMemoryUsage
        # Fires when less than 10% of memory is available
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory: {{ $value }}"
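
The error-rate alert is built on `rate()`, which turns an ever-growing counter into a per-second increase and tolerates counter resets after a restart. The Python sketch below shows that idea on hand-made samples. It is deliberately simplified (Prometheus also extrapolates to the window boundaries), and `simple_rate` is a hypothetical helper, not a Prometheus API.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Simplified model of rate(): handles counter resets, skips extrapolation.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as increase since the reset
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up samples scraped every 15 s; the process restarts between t=15 and t=30
errors = [(0, 10), (15, 13), (30, 2), (45, 5), (60, 8)]
total = [(0, 100), (15, 130), (30, 20), (45, 50), (60, 80)]

error_ratio = simple_rate(errors) / simple_rate(total)
print(f"errors/s: {simple_rate(errors):.3f}, error ratio: {error_ratio:.2f}")
```

Dividing the error rate by the total request rate gives the fraction of failing requests, which is usually a more meaningful quantity to compare against a threshold such as 0.05 than a raw per-second error count.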

5. Docker Compose Setup

yaml

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
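
Once the stack is up, a quick way to confirm that scraping works is to inspect the response of Prometheus's `/api/v1/targets` HTTP API endpoint. The sketch below parses such a response and lists targets that are not healthy. `unhealthy_targets` is a hypothetical helper, and `SAMPLE_RESPONSE` is a hand-written, abbreviated payload rather than real output.

```python
import json

# Abbreviated, hand-written example of an /api/v1/targets response
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus"}, "scrapeUrl": "http://localhost:9090/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "node"}, "scrapeUrl": "http://localhost:9100/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
})


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, last_error) for every active target that is not up."""
    payload = json.loads(targets_response)
    if payload.get("status") != "success":
        raise ValueError("Prometheus API returned a non-success status")
    return [
        (target["labels"].get("job", ""), target["scrapeUrl"], target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


for job, url, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"{job}: {url} is down ({error})")
```

Against a running instance, the response body could be fetched with `urllib.request.urlopen("http://localhost:9090/api/v1/targets")`, assuming the port mapping from the Compose file.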

Best Practices

✅ DO

Use consistent metric naming conventions
Add labels that support filtering and aggregation, keeping each label's set of values bounded
Set appropriate scrape intervals (10-60s)
Implement retention policies
Monitor Prometheus itself
Test alert rules before deployment
Document metric meanings
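
Naming conventions are easier to keep consistent when they are checked automatically. The sketch below flags a few common problems; the rules are a small subset of the Prometheus naming guidance picked for illustration, and `naming_problems` is a hypothetical helper.

```python
import re

SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")
# Non-base units that usually indicate a value should be converted to seconds or bytes
NON_BASE_UNIT = re.compile(r"_(ms|millis|milliseconds|kb|mb)(_|$)")


def naming_problems(name, metric_type):
    """Return a list of convention violations for a metric name (empty if none)."""
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append("use lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total suffix is reserved for counters")
    if NON_BASE_UNIT.search(name):
        problems.append("use base units such as seconds or bytes")
    return problems


for name, kind in [
    ("http_request_duration_seconds", "histogram"),
    ("requests_total", "counter"),
    ("request_latency_ms", "histogram"),
]:
    print(name, naming_problems(name, kind) or "ok")
```

A check like this can run in a unit test over all metric definitions so that naming drift is caught before deployment.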

❌ DON'T

Add unbounded cardinality labels
Scrape too frequently (< 10s)
Ignore metric naming conventions
Create alerts without runbooks
Store raw event data in Prometheus
Use counters for gauge-like values
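
The warning about unbounded cardinality can be quantified: the number of time series a metric may produce is bounded by the product of the distinct values of each label. The sketch below compares a bounded label set with one that adds a per-user label; `max_series` is a hypothetical helper and all label values are made up.

```python
from math import prod


def max_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "route": ["/users", "/users/:id", "/orders"],
    "status_code": ["200", "400", "404", "500"],
}
# Adding a label whose values grow with the user base multiplies the series count
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])

print(max_series(bounded))    # 4 * 3 * 4 = 48
print(max_series(unbounded))  # 48 * 10000 = 480000
```

For a histogram, each label combination additionally produces one series per bucket plus the `_sum` and `_count` series, so the effect is amplified further.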

Key Prometheus Queries

promql

rate(requests_total[5m])  # Request rate (requests per second)
histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))  # p95 latency
rate(requests_total{status_code=~"5.."}[5m])  # 5xx error rate (errors per second)
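
To clarify what the p95 query returns, the Python sketch below estimates a quantile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. It mirrors the general idea behind `histogram_quantile` but is a simplified illustration, not a reimplementation; the function name is reused for readability and the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: fall back to the
                # highest finite bound, as no upper edge is known
                return lower_bound
            # Assume observations are spread evenly within the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative counts for 100 requests (made-up values)
buckets = [(0.1, 50), (0.5, 80), (1, 90), (2, 98), (5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"p95 is approximately {p95:.3f}s")
```

With these counts the 95th request falls in the 1 s to 2 s bucket, so the estimate is interpolated between those bounds (about 1.625 s). The result is only as precise as the bucket layout, which is why bucket bounds should bracket the latency thresholds you alert on.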
