monitoring-setup


Monitoring Setup


When to Use


Activate this skill when:
  • Setting up structured logging for a Python/FastAPI application
  • Configuring Prometheus metrics collection and custom counters/histograms
  • Implementing health check endpoints (liveness and readiness)
  • Designing alert rules and thresholds for production services
  • Creating Grafana dashboards for service monitoring
  • Integrating Sentry for error tracking and performance monitoring
  • Implementing distributed tracing with OpenTelemetry
  • Reviewing or improving existing observability coverage
Output: Write an observability configuration summary to monitoring-config.md documenting what was set up (metrics, alerts, dashboards, health checks).
Do NOT use this skill for:
  • Responding to active production incidents (use incident-response)
  • Deploying monitoring infrastructure (use deployment-pipeline)
  • Writing application business logic (use python-backend-expert)
  • Docker container configuration (use docker-best-practices)

Instructions


Four Pillars of Observability


Every production service must implement all four pillars.
┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                                │
├────────────────┬───────────────┬──────────────┬────────────────┤
│    METRICS     │   LOGGING     │   TRACING    │   ALERTING     │
│                │               │              │                │
│  Prometheus    │  structlog    │ OpenTelemetry│  Alert rules   │
│  counters,     │  structured   │ distributed  │  thresholds,   │
│  histograms,   │  JSON logs,   │ trace spans, │  notification  │
│  gauges        │  context      │ correlation  │  channels      │
├────────────────┴───────────────┴──────────────┴────────────────┤
│                    DASHBOARDS (Grafana)                         │
│        Visualize metrics, logs, and traces in one place        │
└─────────────────────────────────────────────────────────────────┘

Pillar 1: Metrics (Prometheus)


Use the RED method for request-driven services and the USE method for infrastructure resources.
RED Method (for every API endpoint):
  • Rate -- Requests per second
  • Errors -- Failed requests per second
  • Duration -- Request latency distribution
USE Method (for infrastructure resources):
  • Utilization -- Percentage of resource used (CPU, memory, disk)
  • Saturation -- Work queued or waiting (connection pool, queue depth)
  • Errors -- Error events (OOM kills, connection failures)
Key metrics to instrument:

```python
from prometheus_client import Counter, Histogram, Gauge, Info

# RED metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status_code"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    labelnames=["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# USE metrics
DB_POOL_USAGE = Gauge(
    "db_connection_pool_usage",
    "Database connection pool utilization",
    labelnames=["pool_name"],
)
DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Database connection pool max size",
    labelnames=["pool_name"],
)
REDIS_CONNECTIONS = Gauge(
    "redis_active_connections",
    "Active Redis connections",
)

# Business metrics
ACTIVE_USERS = Gauge(
    "active_users_total",
    "Currently active users",
)
APP_INFO = Info(
    "app",
    "Application metadata",
)
```

**FastAPI middleware for automatic metrics:**

```python
import time
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        endpoint = request.url.path
        start_time = time.perf_counter()

        response = await call_next(request)

        duration = time.perf_counter() - start_time
        status_code = str(response.status_code)

        REQUEST_COUNT.labels(
            method=method, endpoint=endpoint, status_code=status_code
        ).inc()

        REQUEST_DURATION.labels(
            method=method, endpoint=endpoint
        ).observe(duration)

        return response
```

See references/metrics-config-template.py for the complete setup.

Pillar 2: Logging (structlog)


Use structured JSON logging with contextual information. Never use print() or unstructured logging in production.
Logging principles:
  1. Structured -- JSON format, machine-parseable
  2. Contextual -- Include request ID, user ID, trace ID in every log
  3. Leveled -- Use appropriate log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  4. Actionable -- Every WARNING/ERROR log should indicate what to investigate
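The "structured" and "contextual" principles can be sketched with only the standard library (a minimal stand-in for illustration; the JsonFormatter class is ours, and the structlog setup below provides this and much more):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one machine-parseable JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Merge structured context passed via the `extra=` kwarg.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line carrying the event name plus context fields.
logger.info("user_created", extra={"context": {"user_id": 42}})
```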
Log levels and when to use them:
| Level | When to Use | Example |
| --- | --- | --- |
| DEBUG | Detailed diagnostic info, disabled in production | Processing item 42 of 100 |
| INFO | Normal operations, significant events | User created, Payment processed |
| WARNING | Unexpected but handled situation | Retry attempt 2 of 3, Cache miss |
| ERROR | Operation failed, needs attention | Database query failed, External API timeout |
| CRITICAL | System-level failure, immediate action | Cannot connect to database, Out of memory |
structlog setup:
```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)
```
Adding request context:
```python
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware

class LoggingContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            method=request.method,
            path=request.url.path,
        )
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
```
See references/logging-config-template.py for the complete setup.

Pillar 3: Tracing (OpenTelemetry)


Distributed tracing connects logs and metrics across service boundaries.
Trace setup for FastAPI:
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing(app, service_name: str = "backend"):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI, SQLAlchemy, Redis
    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()
```
Custom spans for business logic:
```python
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_order"):
            await validate_order(order_id)

        with tracer.start_as_current_span("charge_payment"):
            result = await charge_payment(order_id)
            span.set_attribute("payment.status", result.status)

        with tracer.start_as_current_span("send_confirmation"):
            await send_confirmation(order_id)
```

Pillar 4: Alerting


Alerts must be actionable. Every alert should indicate what is broken and what to do.
Alert design principles:
  1. Page only for user-impacting issues -- Do not page for non-urgent warnings
  2. Set thresholds based on SLOs -- Not arbitrary numbers
  3. Avoid alert fatigue -- If an alert fires often without action, fix or remove it
  4. Include runbook links -- Every alert should link to a remediation guide
  5. Use multi-window burn rates -- Detect issues faster without false positives
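Principle 5 can be made concrete with a little arithmetic. A sketch of the burn-rate calculation (the 14.4x fast-burn threshold follows the multiwindow convention popularized by Google's SRE workbook; the function name is ours):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning.

    A burn rate of 1.0 exhausts the budget exactly at the end of the
    SLO window; higher values exhaust it proportionally sooner.
    """
    error_budget = 1.0 - slo  # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_rate / error_budget

# Classic multiwindow pairing: page on a fast burn (14.4x over 1h,
# i.e. ~2% of a 30-day budget consumed in one hour), confirmed by a
# short 5-minute window so the alert clears quickly once fixed.
fast_burn = burn_rate(error_rate=0.0144, slo=0.999)  # ~14.4
```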
Alert thresholds for a typical FastAPI application:
| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| High error rate | http_requests_total{status=~"5.."} > 5% of total for 5 min | SEV2 | Check logs, consider rollback |
| High latency | http_request_duration_seconds p99 > 2s for 5 min | SEV3 | Check DB queries, dependencies |
| Service down | Health check fails for 2 min | SEV1 | Restart, check logs, escalate |
| DB connections high | Pool usage > 80% for 5 min | SEV3 | Check for connection leaks |
| DB connections critical | Pool usage > 95% for 2 min | SEV2 | Restart app, investigate |
| Memory high | Container memory > 85% for 10 min | SEV3 | Check for memory leaks |
| Disk space low | Disk usage > 85% | SEV3 | Clean logs, expand volume |
| Certificate expiry | SSL cert expires in < 14 days | SEV4 | Renew certificate |

See references/alert-rules-template.yml for Prometheus alerting rules.

Health Check Endpoints


Every service must expose two health endpoints.
Liveness (/health): Is the process running? Returns 200 if the application is alive.
Readiness (/health/ready): Can the service handle requests? Checks all dependencies.
```python
import time
from datetime import datetime, timezone

from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

# settings, get_db, and redis come from your application modules
# (config, DB session dependency, Redis client).

router = APIRouter(tags=["health"])

@router.get("/health")
async def liveness():
    """Liveness probe -- is the process running?"""
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": settings.APP_VERSION,
    }

@router.get("/health/ready")
async def readiness(db: AsyncSession = Depends(get_db)):
    """Readiness probe -- can we handle traffic?"""
    checks = {}

    # Check database (measure round-trip latency)
    try:
        start = time.perf_counter()
        await db.execute(text("SELECT 1"))
        latency = (time.perf_counter() - start) * 1000
        checks["database"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["database"] = {"status": "error", "error": str(e)}

    # Check Redis
    try:
        start = time.perf_counter()
        await redis.ping()
        latency = (time.perf_counter() - start) * 1000
        checks["redis"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["redis"] = {"status": "error", "error": str(e)}

    all_ok = all(c["status"] == "ok" for c in checks.values())
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={
            "status": "ready" if all_ok else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```

Error Tracking with Sentry


Sentry captures unhandled exceptions and performance data.
Setup:
```python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

def filter_health_checks(event, hint):
    """Do not send health check errors to Sentry."""
    if "request" in event and event["request"].get("url", "").endswith("/health"):
        return None
    return event

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.APP_ENV,
    release=settings.APP_VERSION,
    traces_sample_rate=0.1,  # 10% of requests for performance monitoring
    profiles_sample_rate=0.1,
    integrations=[
        FastApiIntegration(),
        SqlalchemyIntegration(),
    ],
    # Do not send PII
    send_default_pii=False,
    # Filter out health check noise (function defined above, so the
    # name exists when init runs)
    before_send=filter_health_checks,
)
```

Dashboard Design


Grafana dashboards should follow a consistent layout pattern.
Standard dashboard sections:
  1. Overview row -- Key SLIs at a glance (error rate, latency, throughput)
  2. RED metrics row -- Rate, Errors, Duration for each endpoint
  3. Infrastructure row -- CPU, memory, disk, network
  4. Dependencies row -- Database, Redis, external API health
  5. Business metrics row -- Application-specific KPIs
Dashboard layout:
┌─────────────────────────────────────────────────────────┐
│  Service Overview                                       │
│  [Error Rate %] [p99 Latency] [Requests/s] [Uptime]    │
├────────────────────────┬────────────────────────────────┤
│  Request Rate          │  Error Rate                    │
│  (by endpoint)         │  (by endpoint, status code)    │
├────────────────────────┼────────────────────────────────┤
│  Latency (p50/p95/p99) │  Active Connections            │
│  (by endpoint)         │  (DB pool, Redis)              │
├────────────────────────┴────────────────────────────────┤
│  Infrastructure                                         │
│  [CPU %] [Memory %] [Disk %] [Network IO]              │
├─────────────────────────────────────────────────────────┤
│  Dependencies                                           │
│  [DB Latency] [Redis Latency] [External API Status]    │
└─────────────────────────────────────────────────────────┘
See references/dashboard-template.json for a complete Grafana dashboard template.
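For reference, the overview panels above are typically backed by PromQL over the metrics instrumented earlier. Illustrative queries (the panel names and the PANEL_QUERIES mapping are ours):

```python
# PromQL behind the overview panels, using the http_requests_total and
# http_request_duration_seconds metrics defined in the metrics section.
PANEL_QUERIES = {
    "request_rate": 'sum(rate(http_requests_total[5m])) by (endpoint)',
    "error_rate": (
        'sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total[5m]))'
    ),
    "p99_latency": (
        'histogram_quantile(0.99,'
        ' sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
    ),
}
```

Each string can be pasted into a Grafana panel's query editor as-is.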

Uptime Monitoring


External uptime monitoring validates the service from a user's perspective.
What to monitor externally:
  • /health endpoint from multiple geographic regions
  • Key user-facing pages (login, dashboard, API docs)
  • SSL certificate validity and expiration
  • DNS resolution time
Recommended check intervals:
| Check | Interval | Timeout | Regions |
| --- | --- | --- | --- |
| Health endpoint | 30 seconds | 10 seconds | 3+ regions |
| Key pages | 1 minute | 15 seconds | 2+ regions |
| SSL certificate | 6 hours | 30 seconds | 1 region |
| DNS resolution | 5 minutes | 5 seconds | 3+ regions |
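Once a probe has retrieved the certificate's notAfter timestamp, the SSL check in the table reduces to a date comparison. A stdlib sketch (cert_expiry_alert is a hypothetical helper name, matching the SEV4 threshold from the alert table):

```python
from datetime import datetime, timedelta, timezone

def cert_expiry_alert(not_after: datetime, now: datetime,
                      threshold_days: int = 14) -> bool:
    """True when the certificate expires within threshold_days
    (the SEV4 'Certificate expiry' alert condition)."""
    return not_after - now < timedelta(days=threshold_days)

now = datetime.now(timezone.utc)
cert_expiry_alert(now + timedelta(days=10), now)  # True -> alert
cert_expiry_alert(now + timedelta(days=60), now)  # False -> healthy
```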

Quick Reference


See references/ for complete templates: logging-config-template.py, metrics-config-template.py, alert-rules-template.yml, dashboard-template.json.

Monitoring Checklist for New Services


  • structlog configured with JSON output
  • Request logging middleware with request ID correlation
  • Prometheus metrics endpoint exposed at /metrics
  • RED metrics instrumented (request count, errors, duration)
  • Health check endpoints implemented (/health, /health/ready)
  • Sentry SDK initialized with environment and release tags
  • Alert rules defined for error rate, latency, and availability
  • Grafana dashboard created with standard sections
  • External uptime monitoring configured
  • Log retention policy defined (default: 30 days)

Output File


Write the monitoring configuration summary to monitoring-config.md:

Monitoring Configuration: [Service Name]


Metrics


| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, endpoint, status | RED: Request rate |
| http_request_duration_seconds | Histogram | method, endpoint | RED: Latency |

Alerts


| Alert | Condition | Severity | Runbook |
| --- | --- | --- | --- |
| HighErrorRate | error_rate > 5% for 5m | SEV2 | docs/runbooks/high-error-rate.md |

Health Checks


  • /health -- Liveness probe
  • /health/ready -- Readiness probe (checks DB, Redis)

Dashboards


  • Grafana: Service Overview (imported from references/dashboard-template.json)

Next Steps


  • Run /deployment-pipeline to deploy with monitoring enabled
  • Run /incident-response if alerts fire