monitoring-setup
Monitoring Setup
When to Use
Activate this skill when:
- Setting up structured logging for a Python/FastAPI application
- Configuring Prometheus metrics collection and custom counters/histograms
- Implementing health check endpoints (liveness and readiness)
- Designing alert rules and thresholds for production services
- Creating Grafana dashboards for service monitoring
- Integrating Sentry for error tracking and performance monitoring
- Implementing distributed tracing with OpenTelemetry
- Reviewing or improving existing observability coverage
Output: Write an observability configuration summary to monitoring-config.md documenting what was set up (metrics, alerts, dashboards, health checks).
Do NOT use this skill for:
- Responding to active production incidents (use incident-response)
- Deploying monitoring infrastructure (use deployment-pipeline)
- Writing application business logic (use python-backend-expert)
- Docker container configuration (use docker-best-practices)
Instructions
Four Pillars of Observability
Every production service must implement all four pillars.
┌─────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY │
├────────────────┬───────────────┬──────────────┬────────────────┤
│ METRICS │ LOGGING │ TRACING │ ALERTING │
│ │ │ │ │
│ Prometheus │ structlog │ OpenTelemetry│ Alert rules │
│ counters, │ structured │ distributed │ thresholds, │
│ histograms, │ JSON logs, │ trace spans, │ notification │
│ gauges │ context │ correlation │ channels │
├────────────────┴───────────────┴──────────────┴────────────────┤
│ DASHBOARDS (Grafana) │
│ Visualize metrics, logs, and traces in one place │
└─────────────────────────────────────────────────────────────────┘
Pillar 1: Metrics (Prometheus)
Use the RED method for request-driven services and USE method for resources.
RED Method (for every API endpoint):
- Rate -- Requests per second
- Errors -- Failed requests per second
- Duration -- Request latency distribution
USE Method (for infrastructure resources):
- Utilization -- Percentage of resource used (CPU, memory, disk)
- Saturation -- Work queued or waiting (connection pool, queue depth)
- Errors -- Error events (OOM kills, connection failures)
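To make the RED definitions concrete before instrumenting them, here is a stdlib-only sketch that computes rate, error rate, and a latency percentile from a window of request records (the record shape, the percentile index math, and the window length are illustrative assumptions, not part of any library API):

```python
def red_summary(records, window_s):
    """Compute Rate, Errors, and Duration (p95) over a time window.

    records: iterable of (timestamp_s, status_code, duration_s) tuples.
    """
    records = list(records)
    rate = len(records) / window_s                                   # Rate: requests per second
    errors = sum(1 for _, status, _ in records if status >= 500) / window_s  # Errors per second
    durations = sorted(d for _, _, d in records)
    # Nearest-rank p95 over the observed durations
    p95 = durations[int(0.95 * (len(durations) - 1))] if durations else 0.0
    return {"rate_rps": rate, "errors_rps": errors, "p95_s": p95}


reqs = [(0, 200, 0.03), (1, 200, 0.05), (2, 500, 0.40), (3, 200, 0.04)]
print(red_summary(reqs, window_s=60))
```

In production these numbers come from Prometheus queries over the counters and histograms defined next, not from in-process bookkeeping.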
Key metrics to instrument:
```python
from prometheus_client import Counter, Histogram, Gauge, Info

# RED metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status_code"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    labelnames=["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# USE metrics
DB_POOL_USAGE = Gauge(
    "db_connection_pool_usage",
    "Database connection pool utilization",
    labelnames=["pool_name"],
)
DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Database connection pool max size",
    labelnames=["pool_name"],
)
REDIS_CONNECTIONS = Gauge(
    "redis_active_connections",
    "Active Redis connections",
)

# Business metrics
ACTIVE_USERS = Gauge(
    "active_users_total",
    "Currently active users",
)
APP_INFO = Info(
    "app",
    "Application metadata",
)
```

**FastAPI middleware for automatic metrics:**

```python
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request


class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        endpoint = request.url.path
        start_time = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start_time
        status_code = str(response.status_code)
        REQUEST_COUNT.labels(
            method=method, endpoint=endpoint, status_code=status_code
        ).inc()
        REQUEST_DURATION.labels(
            method=method, endpoint=endpoint
        ).observe(duration)
        return response
```

See references/metrics-config-template.py for the complete setup.

Pillar 2: Logging (structlog)
Use structured JSON logging with contextual information. Never use print() or unstructured logging in production.
Logging principles:
- Structured -- JSON format, machine-parseable
- Contextual -- Include request ID, user ID, trace ID in every log
- Leveled -- Use appropriate log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Actionable -- Every WARNING/ERROR log should indicate what to investigate
Log levels and when to use them:
| Level | When to Use | Example |
|---|---|---|
| DEBUG | Detailed diagnostic info, disabled in production | |
| INFO | Normal operations, significant events | |
| WARNING | Unexpected but handled situation | |
| ERROR | Operation failed, needs attention | |
| CRITICAL | System-level failure, immediate action | |
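structlog (configured below) produces such records out of the box; as a stdlib-only sketch of the same idea -- one JSON line per event, leveled, with contextual fields merged in -- consider this hypothetical formatter:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one machine-parseable JSON line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Merge contextual fields passed via logger(..., extra={"context": {...}})
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("payment retry", extra={"context": {"request_id": "abc123", "attempt": 2}})
```

This is only an illustration of the target log shape; the structlog setup below is what the skill actually installs.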
structlog setup:

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)
```

Adding request context:

```python
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware


class LoggingContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            method=request.method,
            path=request.url.path,
        )
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
```

See references/logging-config-template.py for the complete setup.

Pillar 3: Tracing (OpenTelemetry)
Distributed tracing connects logs and metrics across service boundaries.
Trace setup for FastAPI:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_tracing(app, service_name: str = "backend"):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Auto-instrument FastAPI, SQLAlchemy, Redis
    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()
```

Custom spans for business logic:

```python
tracer = trace.get_tracer(__name__)


async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("validate_order"):
            await validate_order(order_id)
        with tracer.start_as_current_span("charge_payment"):
            result = await charge_payment(order_id)
            span.set_attribute("payment.status", result.status)
        with tracer.start_as_current_span("send_confirmation"):
            await send_confirmation(order_id)
```

Pillar 4: Alerting
Alerts must be actionable. Every alert should indicate what is broken and what to do.
Alert design principles:
- Page only for user-impacting issues -- Do not page for non-urgent warnings
- Set thresholds based on SLOs -- Not arbitrary numbers
- Avoid alert fatigue -- If an alert fires often without action, fix or remove it
- Include runbook links -- Every alert should link to a remediation guide
- Use multi-window burn rates -- Detect issues faster without false positives
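The multi-window burn-rate idea can be sketched in a few lines: a burn rate of 1 means the error budget is consumed exactly over the SLO period, and paging only when both a long and a short window burn fast catches real incidents while ignoring blips. A minimal sketch (the 14.4x threshold is the commonly cited fast-burn factor for a 1-hour window against a 30-day budget; treat both it and the function names as illustrative assumptions):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is burning."""
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget


def should_page(long_window_er: float, short_window_er: float, slo: float = 0.999) -> bool:
    """Page only when BOTH windows (e.g. 1h and 5m) burn faster than 14.4x."""
    return burn_rate(long_window_er, slo) >= 14.4 and burn_rate(short_window_er, slo) >= 14.4


# 2% errors sustained in both windows against a 99.9% SLO: burn rate 20x
print(should_page(0.02, 0.02))
```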
Alert thresholds for a typical FastAPI application:
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High error rate | | SEV2 | Check logs, consider rollback |
| High latency | | SEV3 | Check DB queries, dependencies |
| Service down | Health check fails for 2 min | SEV1 | Restart, check logs, escalate |
| DB connections high | Pool usage > 80% for 5 min | SEV3 | Check for connection leaks |
| DB connections critical | Pool usage > 95% for 2 min | SEV2 | Restart app, investigate |
| Memory high | Container memory > 85% for 10 min | SEV3 | Check for memory leaks |
| Disk space low | Disk usage > 85% | SEV3 | Clean logs, expand volume |
| Certificate expiry | SSL cert expires in < 14 days | SEV4 | Renew certificate |
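As one concrete instance, the high-error-rate row could be expressed as a Prometheus rule like the sketch below, assuming the http_requests_total metric defined earlier; the exact expression, threshold, and runbook path are placeholders to adapt:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          runbook: "docs/runbooks/high-error-rate.md"
```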
See references/alert-rules-template.yml for Prometheus alerting rules.
Health Check Endpoints
Every service must expose two health endpoints.
Liveness (/health): Is the process running? Returns 200 if the application is alive.
Readiness (/health/ready): Can the service handle requests? Checks all dependencies.

```python
import time
from datetime import datetime, timezone

from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

# settings, get_db, and redis are provided elsewhere by the application

router = APIRouter(tags=["health"])


@router.get("/health")
async def liveness():
    """Liveness probe -- is the process running?"""
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": settings.APP_VERSION,
    }


@router.get("/health/ready")
async def readiness(db: AsyncSession = Depends(get_db)):
    """Readiness probe -- can we handle traffic?"""
    checks = {}

    # Check database
    try:
        start = time.perf_counter()
        await db.execute(text("SELECT 1"))
        latency = (time.perf_counter() - start) * 1000
        checks["database"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["database"] = {"status": "error", "error": str(e)}

    # Check Redis
    try:
        start = time.perf_counter()
        await redis.ping()
        latency = (time.perf_counter() - start) * 1000
        checks["redis"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["redis"] = {"status": "error", "error": str(e)}

    all_ok = all(c["status"] == "ok" for c in checks.values())
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={
            "status": "ready" if all_ok else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```

Error Tracking with Sentry
Sentry captures unhandled exceptions and performance data.
Setup:
```python
import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration


def filter_health_checks(event, hint):
    """Do not send health check errors to Sentry."""
    if "request" in event and event["request"].get("url", "").endswith("/health"):
        return None
    return event


sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.APP_ENV,
    release=settings.APP_VERSION,
    traces_sample_rate=0.1,  # 10% of requests for performance monitoring
    profiles_sample_rate=0.1,
    integrations=[
        FastApiIntegration(),
        SqlalchemyIntegration(),
    ],
    # Do not send PII
    send_default_pii=False,
    # Filter out health check noise (defined above, before init)
    before_send=filter_health_checks,
)
```

Dashboard Design
Grafana dashboards should follow a consistent layout pattern.
Standard dashboard sections:
- Overview row -- Key SLIs at a glance (error rate, latency, throughput)
- RED metrics row -- Rate, Errors, Duration for each endpoint
- Infrastructure row -- CPU, memory, disk, network
- Dependencies row -- Database, Redis, external API health
- Business metrics row -- Application-specific KPIs
Dashboard layout:
┌─────────────────────────────────────────────────────────┐
│ Service Overview │
│ [Error Rate %] [p99 Latency] [Requests/s] [Uptime] │
├────────────────────────┬────────────────────────────────┤
│ Request Rate │ Error Rate │
│ (by endpoint) │ (by endpoint, status code) │
├────────────────────────┼────────────────────────────────┤
│ Latency (p50/p95/p99) │ Active Connections │
│ (by endpoint) │ (DB pool, Redis) │
├────────────────────────┴────────────────────────────────┤
│ Infrastructure │
│ [CPU %] [Memory %] [Disk %] [Network IO] │
├─────────────────────────────────────────────────────────┤
│ Dependencies │
│ [DB Latency] [Redis Latency] [External API Status] │
└─────────────────────────────────────────────────────────┘
See references/dashboard-template.json for a complete Grafana dashboard template.
Uptime Monitoring
External uptime monitoring validates the service from a user's perspective.
What to monitor externally:
- /health endpoint from multiple geographic regions
- Key user-facing pages (login, dashboard, API docs)
- SSL certificate validity and expiration
- DNS resolution time
Recommended check intervals:
| Check | Interval | Timeout | Regions |
|---|---|---|---|
| Health endpoint | 30 seconds | 10 seconds | 3+ regions |
| Key pages | 1 minute | 15 seconds | 2+ regions |
| SSL certificate | 6 hours | 30 seconds | 1 region |
| DNS resolution | 5 minutes | 5 seconds | 3+ regions |
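The certificate check in particular is easy to script. A stdlib-only sketch that works on the OpenSSL-style notAfter string a TLS endpoint reports (the helper names are hypothetical; the 14-day threshold mirrors the alert table above):

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str) -> float:
    """not_after is the OpenSSL-style string, e.g. 'Jun 09 12:00:00 2030 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400


def cert_alert(not_after: str, threshold_days: int = 14) -> bool:
    """True when the certificate is within the renewal window (or already expired)."""
    return days_until_expiry(not_after) < threshold_days


print(cert_alert("Jun 09 12:00:00 2030 GMT"))
```

In practice the notAfter string comes from `ssl.SSLSocket.getpeercert()` on a live connection; a hosted uptime service does the equivalent from multiple regions.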
Quick Reference
See references/ for complete templates: logging-config-template.py, metrics-config-template.py, alert-rules-template.yml, dashboard-template.json.
Monitoring Checklist for New Services
- structlog configured with JSON output
- Request logging middleware with request ID correlation
- Prometheus metrics endpoint exposed at /metrics
- RED metrics instrumented (request count, errors, duration)
- Health check endpoints implemented (/health, /health/ready)
- Sentry SDK initialized with environment and release tags
- Alert rules defined for error rate, latency, and availability
- Grafana dashboard created with standard sections
- External uptime monitoring configured
- Log retention policy defined (default: 30 days)
Output File
Write the monitoring configuration summary to monitoring-config.md, using the following markdown template:
Monitoring Configuration: [Service Name]
Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| http_requests_total | Counter | method, endpoint, status | RED: Request rate |
| http_request_duration_seconds | Histogram | method, endpoint | RED: Latency |
Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| HighErrorRate | error_rate > 5% for 5m | SEV2 | docs/runbooks/high-error-rate.md |
Health Checks
- /health -- Liveness probe
- /health/ready -- Readiness probe (checks DB, Redis)
Dashboards
- Grafana: Service Overview (imported from references/dashboard-template.json)
Next Steps
- Run /deployment-pipeline to deploy with monitoring enabled
- Run /incident-response if alerts fire