monitoring-setup


Monitoring Setup


When to Use


Activate this skill when:
  • Setting up structured logging for a Python/FastAPI application
  • Configuring Prometheus metrics collection and custom counters/histograms
  • Implementing health check endpoints (liveness and readiness)
  • Designing alert rules and thresholds for production services
  • Creating Grafana dashboards for service monitoring
  • Integrating Sentry for error tracking and performance monitoring
  • Implementing distributed tracing with OpenTelemetry
  • Reviewing or improving existing observability coverage
Output: Write an observability configuration summary to monitoring-config.md documenting what was set up (metrics, alerts, dashboards, health checks).
Do NOT use this skill for:
  • Responding to active production incidents (use incident-response)
  • Deploying monitoring infrastructure (use deployment-pipeline)
  • Writing application business logic (use python-backend-expert)
  • Docker container configuration (use docker-best-practices)

Instructions


Four Pillars of Observability


Every production service must implement all four pillars.
┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                                │
├────────────────┬───────────────┬──────────────┬────────────────┤
│    METRICS     │   LOGGING     │   TRACING    │   ALERTING     │
│                │               │              │                │
│  Prometheus    │  structlog    │ OpenTelemetry│  Alert rules   │
│  counters,     │  structured   │ distributed  │  thresholds,   │
│  histograms,   │  JSON logs,   │ trace spans, │  notification  │
│  gauges        │  context      │ correlation  │  channels      │
├────────────────┴───────────────┴──────────────┴────────────────┤
│                    DASHBOARDS (Grafana)                         │
│        Visualize metrics, logs, and traces in one place        │
└─────────────────────────────────────────────────────────────────┘

Pillar 1: Metrics (Prometheus)


Use the RED method for request-driven services and the USE method for infrastructure resources.
RED Method (for every API endpoint):
  • Rate -- Requests per second
  • Errors -- Failed requests per second
  • Duration -- Request latency distribution
USE Method (for infrastructure resources):
  • Utilization -- Percentage of resource used (CPU, memory, disk)
  • Saturation -- Work queued or waiting (connection pool, queue depth)
  • Errors -- Error events (OOM kills, connection failures)
Key metrics to instrument:

```python
from prometheus_client import Counter, Histogram, Gauge, Info

# RED metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status_code"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    labelnames=["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# USE metrics
DB_POOL_USAGE = Gauge(
    "db_connection_pool_usage",
    "Database connection pool utilization",
    labelnames=["pool_name"],
)
DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Database connection pool max size",
    labelnames=["pool_name"],
)
REDIS_CONNECTIONS = Gauge(
    "redis_active_connections",
    "Active Redis connections",
)

# Business metrics
ACTIVE_USERS = Gauge(
    "active_users_total",
    "Currently active users",
)
APP_INFO = Info(
    "app",
    "Application metadata",
)
```

**FastAPI middleware for automatic metrics:**

```python
import time
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        endpoint = request.url.path
        start_time = time.perf_counter()

        response = await call_next(request)

        duration = time.perf_counter() - start_time
        status_code = str(response.status_code)

        REQUEST_COUNT.labels(
            method=method, endpoint=endpoint, status_code=status_code
        ).inc()

        REQUEST_DURATION.labels(
            method=method, endpoint=endpoint
        ).observe(duration)

        return response
```

See references/metrics-config-template.py for the complete setup.

Pillar 2: Logging (structlog)


Use structured JSON logging with contextual information. Never use print() or unstructured logging in production.
Logging principles:
  1. Structured -- JSON format, machine-parseable
  2. Contextual -- Include request ID, user ID, trace ID in every log
  3. Leveled -- Use appropriate log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  4. Actionable -- Every WARNING/ERROR log should indicate what to investigate
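The "structured" and "contextual" principles can be sketched with only the standard library (a minimal stand-in for illustration; the JsonFormatter class is ours, and the structlog setup below provides this and much more):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one machine-parseable JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Merge structured context passed via the `extra=` kwarg.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line carrying the event name plus context fields.
logger.info("user_created", extra={"context": {"user_id": 42}})
```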
Log levels and when to use them:
| Level | When to Use | Example |
| --- | --- | --- |
| DEBUG | Detailed diagnostic info, disabled in production | Processing item 42 of 100 |
| INFO | Normal operations, significant events | User created, Payment processed |
| WARNING | Unexpected but handled situation | Retry attempt 2 of 3, Cache miss |
| ERROR | Operation failed, needs attention | Database query failed, External API timeout |
| CRITICAL | System-level failure, immediate action | Cannot connect to database, Out of memory |
structlog setup:
```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)
```
Adding request context:
```python
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware

class LoggingContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            method=request.method,
            path=request.url.path,
        )
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
```
See references/logging-config-template.py for the complete setup.

Pillar 3: Tracing (OpenTelemetry)


Distributed tracing connects logs and metrics across service boundaries.
Trace setup for FastAPI:
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing(app, service_name: str = "backend"):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI, SQLAlchemy, Redis
    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()
```
Custom spans for business logic:
```python
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_order"):
            await validate_order(order_id)

        with tracer.start_as_current_span("charge_payment"):
            result = await charge_payment(order_id)
            span.set_attribute("payment.status", result.status)

        with tracer.start_as_current_span("send_confirmation"):
            await send_confirmation(order_id)
```

Pillar 4: Alerting


Alerts must be actionable. Every alert should indicate what is broken and what to do.
Alert design principles:
  1. Page only for user-impacting issues -- Do not page for non-urgent warnings
  2. Set thresholds based on SLOs -- Not arbitrary numbers
  3. Avoid alert fatigue -- If an alert fires often without action, fix or remove it
  4. Include runbook links -- Every alert should link to a remediation guide
  5. Use multi-window burn rates -- Detect issues faster without false positives
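Principle 5 can be made concrete with a little arithmetic. A sketch of the burn-rate calculation (the 14.4x fast-burn threshold follows the multiwindow convention popularized by Google's SRE workbook; the function name is ours):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning.

    A burn rate of 1.0 exhausts the budget exactly at the end of the
    SLO window; higher values exhaust it proportionally sooner.
    """
    error_budget = 1.0 - slo  # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_rate / error_budget

# Classic multiwindow pairing: page on a fast burn (14.4x over 1h,
# i.e. ~2% of a 30-day budget consumed in one hour), confirmed by a
# short 5-minute window so the alert clears quickly once fixed.
fast_burn = burn_rate(error_rate=0.0144, slo=0.999)  # ~14.4
```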
Alert thresholds for a typical FastAPI application:
| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| High error rate | http_requests_total{status=~"5.."} > 5% of total for 5 min | SEV2 | Check logs, consider rollback |
| High latency | http_request_duration_seconds p99 > 2s for 5 min | SEV3 | Check DB queries, dependencies |
| Service down | Health check fails for 2 min | SEV1 | Restart, check logs, escalate |
| DB connections high | Pool usage > 80% for 5 min | SEV3 | Check for connection leaks |
| DB connections critical | Pool usage > 95% for 2 min | SEV2 | Restart app, investigate |
| Memory high | Container memory > 85% for 10 min | SEV3 | Check for memory leaks |
| Disk space low | Disk usage > 85% | SEV3 | Clean logs, expand volume |
| Certificate expiry | SSL cert expires in < 14 days | SEV4 | Renew certificate |

See references/alert-rules-template.yml for Prometheus alerting rules.

Health Check Endpoints


Every service must expose two health endpoints.
Liveness (/health): Is the process running? Returns 200 if the application is alive.
Readiness (/health/ready): Can the service handle requests? Checks all dependencies.
```python
import time
from datetime import datetime, timezone

from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

# settings, get_db, and redis come from your application modules
# (config, DB session dependency, Redis client).

router = APIRouter(tags=["health"])

@router.get("/health")
async def liveness():
    """Liveness probe -- is the process running?"""
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": settings.APP_VERSION,
    }

@router.get("/health/ready")
async def readiness(db: AsyncSession = Depends(get_db)):
    """Readiness probe -- can we handle traffic?"""
    checks = {}

    # Check database (measure round-trip latency)
    try:
        start = time.perf_counter()
        await db.execute(text("SELECT 1"))
        latency = (time.perf_counter() - start) * 1000
        checks["database"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["database"] = {"status": "error", "error": str(e)}

    # Check Redis
    try:
        start = time.perf_counter()
        await redis.ping()
        latency = (time.perf_counter() - start) * 1000
        checks["redis"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["redis"] = {"status": "error", "error": str(e)}

    all_ok = all(c["status"] == "ok" for c in checks.values())
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={
            "status": "ready" if all_ok else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```

Error Tracking with Sentry


Sentry captures unhandled exceptions and performance data.
Setup:
```python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

def filter_health_checks(event, hint):
    """Do not send health check errors to Sentry."""
    if "request" in event and event["request"].get("url", "").endswith("/health"):
        return None
    return event

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.APP_ENV,
    release=settings.APP_VERSION,
    traces_sample_rate=0.1,  # 10% of requests for performance monitoring
    profiles_sample_rate=0.1,
    integrations=[
        FastApiIntegration(),
        SqlalchemyIntegration(),
    ],
    # Do not send PII
    send_default_pii=False,
    # Filter out health check noise (function defined above, so the
    # name exists when init runs)
    before_send=filter_health_checks,
)
```

Dashboard Design


Grafana dashboards should follow a consistent layout pattern.
Standard dashboard sections:
  1. Overview row -- Key SLIs at a glance (error rate, latency, throughput)
  2. RED metrics row -- Rate, Errors, Duration for each endpoint
  3. Infrastructure row -- CPU, memory, disk, network
  4. Dependencies row -- Database, Redis, external API health
  5. Business metrics row -- Application-specific KPIs
Dashboard layout:
┌─────────────────────────────────────────────────────────┐
│  Service Overview                                       │
│  [Error Rate %] [p99 Latency] [Requests/s] [Uptime]    │
├────────────────────────┬────────────────────────────────┤
│  Request Rate          │  Error Rate                    │
│  (by endpoint)         │  (by endpoint, status code)    │
├────────────────────────┼────────────────────────────────┤
│  Latency (p50/p95/p99) │  Active Connections            │
│  (by endpoint)         │  (DB pool, Redis)              │
├────────────────────────┴────────────────────────────────┤
│  Infrastructure                                         │
│  [CPU %] [Memory %] [Disk %] [Network IO]              │
├─────────────────────────────────────────────────────────┤
│  Dependencies                                           │
│  [DB Latency] [Redis Latency] [External API Status]    │
└─────────────────────────────────────────────────────────┘
See references/dashboard-template.json for a complete Grafana dashboard template.
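For reference, the overview panels above are typically backed by PromQL over the metrics instrumented earlier. Illustrative queries (the panel names and the PANEL_QUERIES mapping are ours):

```python
# PromQL behind the overview panels, using the http_requests_total and
# http_request_duration_seconds metrics defined in the metrics section.
PANEL_QUERIES = {
    "request_rate": 'sum(rate(http_requests_total[5m])) by (endpoint)',
    "error_rate": (
        'sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total[5m]))'
    ),
    "p99_latency": (
        'histogram_quantile(0.99,'
        ' sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
    ),
}
```

Each string can be pasted into a Grafana panel's query editor as-is.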

Uptime Monitoring


External uptime monitoring validates the service from a user's perspective.
What to monitor externally:
  • /health endpoint from multiple geographic regions
  • Key user-facing pages (login, dashboard, API docs)
  • SSL certificate validity and expiration
  • DNS resolution time
Recommended check intervals:
| Check | Interval | Timeout | Regions |
| --- | --- | --- | --- |
| Health endpoint | 30 seconds | 10 seconds | 3+ regions |
| Key pages | 1 minute | 15 seconds | 2+ regions |
| SSL certificate | 6 hours | 30 seconds | 1 region |
| DNS resolution | 5 minutes | 5 seconds | 3+ regions |
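Once a probe has retrieved the certificate's notAfter timestamp, the SSL check in the table reduces to a date comparison. A stdlib sketch (cert_expiry_alert is a hypothetical helper name, matching the SEV4 threshold from the alert table):

```python
from datetime import datetime, timedelta, timezone

def cert_expiry_alert(not_after: datetime, now: datetime,
                      threshold_days: int = 14) -> bool:
    """True when the certificate expires within threshold_days
    (the SEV4 'Certificate expiry' alert condition)."""
    return not_after - now < timedelta(days=threshold_days)

now = datetime.now(timezone.utc)
cert_expiry_alert(now + timedelta(days=10), now)  # True -> alert
cert_expiry_alert(now + timedelta(days=60), now)  # False -> healthy
```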

Quick Reference


See references/ for complete templates: logging-config-template.py, metrics-config-template.py, alert-rules-template.yml, dashboard-template.json.

Monitoring Checklist for New Services


  • structlog configured with JSON output
  • Request logging middleware with request ID correlation
  • Prometheus metrics endpoint exposed at /metrics
  • RED metrics instrumented (request count, errors, duration)
  • Health check endpoints implemented (/health, /health/ready)
  • Sentry SDK initialized with environment and release tags
  • Alert rules defined for error rate, latency, and availability
  • Grafana dashboard created with standard sections
  • External uptime monitoring configured
  • Log retention policy defined (default: 30 days)

Output File


Write the monitoring configuration summary to monitoring-config.md:

Monitoring Configuration: [Service Name]


Metrics


| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, endpoint, status | RED: Request rate |
| http_request_duration_seconds | Histogram | method, endpoint | RED: Latency |

Alerts


| Alert | Condition | Severity | Runbook |
| --- | --- | --- | --- |
| HighErrorRate | error_rate > 5% for 5m | SEV2 | docs/runbooks/high-error-rate.md |

Health Checks


  • /health -- Liveness probe
  • /health/ready -- Readiness probe (checks DB, Redis)

Dashboards


  • Grafana: Service Overview (imported from references/dashboard-template.json)

Next Steps


  • Run /deployment-pipeline to deploy with monitoring enabled
  • Run /incident-response if alerts fire