Monitoring Setup Skill

Overview

This skill helps you implement comprehensive monitoring for applications. It covers metrics collection, dashboard creation, alerting strategies, health checks, and observability best practices.

Monitoring Philosophy

Four Golden Signals

  1. Latency: Time to serve a request
  2. Traffic: Request volume
  3. Errors: Failed request rate
  4. Saturation: Resource utilization
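
As a rough illustration of how the four signals relate to raw request data, the sketch below derives them from one window of request samples. The `RequestSample` shape, the `summarize` function, and the choice of CPU utilization as the saturation figure are hypothetical, picked only for this example.

```typescript
// Hypothetical sample shape; not part of any library.
interface RequestSample {
  durationMs: number; // time taken to serve the request
  status: number; // HTTP status code
}

interface GoldenSignals {
  latencyP95Ms: number; // Latency
  requestsPerSec: number; // Traffic
  errorRate: number; // Errors, as a ratio in [0, 1]
  saturation: number; // Saturation, as a ratio in [0, 1]
}

// Nearest-rank percentile over a list of numbers.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(
  samples: RequestSample[],
  windowSeconds: number,
  cpuUtilization: number // assumed to be supplied by the host, 0..1
): GoldenSignals {
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    latencyP95Ms: percentile(samples.map((s) => s.durationMs), 95),
    requestsPerSec: samples.length / windowSeconds,
    errorRate: samples.length === 0 ? 0 : errors / samples.length,
    saturation: cpuUtilization,
  };
}
```

In practice a metrics backend computes these from counters and histograms rather than from raw samples, but the definitions stay the same.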

Observability Pillars

  • Metrics: Numeric measurements over time
  • Logs: Discrete events with context
  • Traces: Request flow across services

Health Check Endpoints

Comprehensive Health Check

typescript
// src/app/api/health/route.ts
import { NextResponse } from 'next/server';
import { createClient } from '@supabase/supabase-js';
import Redis from 'ioredis';

interface HealthCheck {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  version: string;
  uptime: number;
  checks: {
    database: CheckResult;
    redis: CheckResult;
    external: CheckResult;
  };
}

interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

async function checkDatabase(): Promise<CheckResult> {
  const start = Date.now();
  try {
    const supabase = createClient(
      process.env.SUPABASE_URL!,
      process.env.SUPABASE_SERVICE_ROLE_KEY!
    );
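    // supabase-js returns query failures in `error` instead of throwing, so
    // check it explicitly; otherwise this probe would always report 'pass'
    const { error } = await supabase.from('health_check').select('*').limit(1);
    if (error) throw new Error(error.message);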
    return {
      status: 'pass',
      latency: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'fail',
      message: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}

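async function checkRedis(): Promise<CheckResult> {
  const start = Date.now();
  // Fail fast instead of retrying for a long time on a dead Redis
  const redis = new Redis(process.env.REDIS_URL!, {
    maxRetriesPerRequest: 1,
    connectTimeout: 2000,
  });
  try {
    await redis.ping();
    return {
      status: 'pass',
      latency: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'fail',
      message: error instanceof Error ? error.message : 'Unknown error',
    };
  } finally {
    // Always release the connection, including on failure
    redis.disconnect();
  }
}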

async function checkExternal(): Promise<CheckResult> {
  const start = Date.now();
  try {
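    // Placeholder URL: point this at a status endpoint your provider
    // documents. The timeout keeps a slow third party from stalling the check.
    const response = await fetch('https://api.stripe.com/v1/health', {
      method: 'HEAD',
      signal: AbortSignal.timeout(5000),
    });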
    return {
      status: response.ok ? 'pass' : 'fail',
      latency: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'fail',
      message: 'External service unavailable',
    };
  }
}

const startTime = Date.now();

export async function GET() {
  const [database, redis, external] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkExternal(),
  ]);

  const checks = { database, redis, external };

  // Only the database is treated as critical here (an assumption; adjust to
  // your own dependencies). Any other failed check marks the service degraded.
  const allPassed = Object.values(checks).every((c) => c.status === 'pass');
  const criticalFailed = database.status === 'fail';

  const health: HealthCheck = {
    status: allPassed ? 'healthy' : criticalFailed ? 'unhealthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.VERCEL_GIT_COMMIT_SHA || 'local',
    uptime: Math.floor((Date.now() - startTime) / 1000),
    checks,
  };

  return NextResponse.json(health, {
    // A degraded service can still take traffic, so only 'unhealthy' returns 503
    status: health.status === 'unhealthy' ? 503 : 200,
    headers: {
      'Cache-Control': 'no-store',
    },
  });
}
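
A dependency that hangs can stall the whole health endpoint, because `Promise.all` waits for every check to settle. One possible guard, sketched below, races each check against a timer. `withTimeout` and the 2000 ms default are hypothetical; `CheckResult` is re-declared from the listing above so the snippet stands alone.

```typescript
interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

// Resolves with the check's own result, or with a 'fail' result if the
// check throws or does not settle within `ms` milliseconds.
async function withTimeout(
  check: () => Promise<CheckResult>,
  ms = 2000
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<CheckResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ status: 'fail', message: `Timed out after ${ms}ms` }),
      ms
    );
  });
  try {
    return await Promise.race([check(), timeout]);
  } catch (error) {
    return {
      status: 'fail',
      message: error instanceof Error ? error.message : 'Unknown error',
    };
  } finally {
    // Stop the timer so it does not keep the process alive
    clearTimeout(timer);
  }
}
```

Under these assumptions, the handler could call `withTimeout(checkDatabase)` in place of `checkDatabase()` inside `Promise.all`. Note that the underlying check keeps running after the timeout; the wrapper only stops waiting for it.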

Kubernetes-Style Probes

typescript
// src/app/api/health/live/route.ts
// Liveness probe - is the app running?
export async function GET() {
  return new Response('OK', { status: 200 });
}

// src/app/api/health/ready/route.ts
// Readiness probe - can the app handle traffic?
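export async function GET() {
  // checkDatabase() (from the health check above) reports failure through its
  // return value rather than by throwing, so inspect the status explicitly
  const database = await checkDatabase();
  if (database.status !== 'pass') {
    return new Response('Not Ready', { status: 503 });
  }
  return new Response('OK', { status: 200 });
}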

Metrics Collection

Custom Metrics with Prometheus Client

typescript
// src/lib/metrics.ts
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

export const registry = new Registry();

// HTTP request metrics
export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [registry],
});

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [registry],
});

// Business metrics
export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of currently active users',
  registers: [registry],
});

export const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [registry],
});

// Database metrics
export const dbQueryDuration = new Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['operation', 'table'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1],
  registers: [registry],
});
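
The `buckets` arrays above are upper bounds in seconds. Prometheus histograms are cumulative: an observation increments every bucket whose bound is greater than or equal to the value, plus an implicit `+Inf` bucket. The sketch below imitates that bookkeeping in plain TypeScript to show which counters move. `HistogramState`, `createHistogram`, and `observe` are hypothetical names, and this is not a description of how `prom-client` is implemented internally.

```typescript
// Upper bounds (seconds) copied from httpRequestDuration above.
const bounds = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

interface HistogramState {
  upperBounds: number[]; // finite `le` bounds, ascending
  counts: number[]; // cumulative count per bound; the last entry is +Inf
  sum: number; // running total of observed values
}

function createHistogram(upperBounds: number[]): HistogramState {
  return {
    upperBounds,
    counts: upperBounds.map(() => 0).concat([0]),
    sum: 0,
  };
}

function observe(h: HistogramState, seconds: number): void {
  h.upperBounds.forEach((le, i) => {
    if (seconds <= le) h.counts[i] += 1;
  });
  h.counts[h.counts.length - 1] += 1; // +Inf counts every observation
  h.sum += seconds;
}

const h = createHistogram(bounds);
[0.03, 0.2, 0.2, 3].forEach((s) => observe(h, s));
console.log(h.counts.join(',')); // 0,1,1,3,3,3,4,4
```

Because the counts are cumulative, bucket bounds are worth choosing around the latencies you alert on: a quantile can only be located to within the bucket it falls in.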

Metrics Endpoint

typescript
// src/app/api/metrics/route.ts
import { registry } from '@/lib/metrics';

export async function GET(request: Request) {
  // Bearer-token protection. Reject everything when METRICS_TOKEN is unset;
  // otherwise the literal header "Bearer undefined" would be accepted.
  const token = process.env.METRICS_TOKEN;
  const authHeader = request.headers.get('authorization');
  if (!token || authHeader !== `Bearer ${token}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  const metrics = await registry.metrics();

  return new Response(metrics, {
    headers: {
      'Content-Type': registry.contentType,
    },
  });
}

Middleware for Request Metrics

typescript
// src/middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';
import { httpRequestsTotal, httpRequestDuration } from '@/lib/metrics';

export async function middleware(request: NextRequest) {
  const start = Date.now();

  const response = NextResponse.next();

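  // Caveat: NextResponse.next() only signals "continue"; it returns before the
  // route handler runs, so `status` below is always 200 and the measured
  // duration covers just this middleware. Middleware also runs on the Edge
  // runtime by default, where a Node.js library such as prom-client may not
  // work. Treat this as a sketch rather than a working request timer.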
  const route = request.nextUrl.pathname;
  const method = request.method;
  const status = response.status.toString();

  httpRequestsTotal.inc({ method, route, status });
  httpRequestDuration.observe(
    { method, route },
    (Date.now() - start) / 1000
  );

  return response;
}
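
Since `NextResponse.next()` returns before the route handler has produced a response, one alternative is to record metrics around the handler itself. The sketch below wraps a handler so the counter and histogram are updated once the real status is known. `withMetrics` and the `MetricsSink` interface are hypothetical; in a real project the sink would presumably delegate to `httpRequestsTotal.inc` and `httpRequestDuration.observe` from `src/lib/metrics.ts`.

```typescript
// Hypothetical sink; keeps the sketch free of external dependencies.
interface MetricsSink {
  count(labels: { method: string; route: string; status: string }): void;
  duration(labels: { method: string; route: string }, seconds: number): void;
}

// Structural types so the wrapper accepts the standard Request/Response
// as well as plain test objects.
function withMetrics<Req extends { method: string }, Res extends { status: number }>(
  route: string,
  sink: MetricsSink,
  handler: (request: Req) => Promise<Res>
): (request: Req) => Promise<Res> {
  return async (request: Req) => {
    const start = Date.now();
    let status = '500'; // assumed status when the handler throws
    try {
      const response = await handler(request);
      status = String(response.status);
      return response;
    } finally {
      const method = request.method;
      sink.count({ method, route, status });
      sink.duration({ method, route }, (Date.now() - start) / 1000);
    }
  };
}
```

Passing the route template (for example `/api/orders/[id]`) rather than the raw pathname keeps the number of distinct `route` label values bounded.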

Alerting Configuration

Alert Rules (Prometheus/Grafana)

yaml
# alerts.yml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: "95th percentile latency is {{ $value | humanizeDuration }}"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Service is down
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      # Database connection pool exhausted
      - alert: DatabaseConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Database connection pool nearly exhausted
          description: "{{ $value }} connections in use"

  - name: infrastructure
    rules:
      # High CPU: node_cpu_seconds_total is a counter, so alert on the busy
      # percentage derived from the idle rate rather than on the raw value
      - alert: HighCPU
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage

      # Low disk space: kept as a 0-1 ratio because humanizePercentage expects one
      - alert: LowDiskSpace
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Low disk space
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
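
The `HighLatency` rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. The sketch below reproduces that estimate in simplified form for classic histograms. `histogramQuantile` and `Bucket` are hypothetical names, the sample counts are made up, and edge cases such as an out-of-range `q` are handled more simply than in Prometheus itself.

```typescript
interface Bucket {
  le: number; // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative count (or rate) of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank.
  const rank = q * total;
  let i = 0;
  while (sorted[i].count < rank) i++;

  // Rank lands in +Inf: the best available answer is the highest finite bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const upper = sorted[i].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le; // first bucket starts at 0
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return lower;

  // Assume observations are spread evenly within the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

const example: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: 2, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, example)); // ≈ 0.8125
```

Here the 95th-percentile rank (95 of 100) falls in the 0.5 to 1 bucket, which holds 8 observations, so the estimate is 0.5 + 0.5 × 5/8. The result is only as precise as the bucket layout, which is why a threshold such as `> 2` works best when 2 is itself a bucket bound.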

Vercel/Uptime Monitoring

typescript
// scripts/uptime-check.ts
// Run via cron or external monitoring service

const ENDPOINTS = [
  { name: 'Health', url: 'https://myapp.com/api/health' },
  { name: 'Homepage', url: 'https://myapp.com' },
  { name: 'API', url: 'https://myapp.com/api/status' },
];

const WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL;

async function checkEndpoint(endpoint: typeof ENDPOINTS[0]) {
  const start = Date.now();
  try {
    const response = await fetch(endpoint.url, {
      method: 'GET',
      signal: AbortSignal.timeout(10000),
    });

    return {
      name: endpoint.name,
      url: endpoint.url,
      status: response.status,
      latency: Date.now() - start,
      healthy: response.ok,
    };
  } catch (error) {
    return {
      name: endpoint.name,
      url: endpoint.url,
      status: 0,
      latency: Date.now() - start,
      healthy: false,
      error: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}

async function notifySlack(message: string) {
  if (!WEBHOOK_URL) return;

  await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  });
}

async function runChecks() {
  const results = await Promise.all(ENDPOINTS.map(checkEndpoint));

  const unhealthy = results.filter((r) => !r.healthy);

  if (unhealthy.length > 0) {
    const message = `🚨 *Uptime Alert*\n${unhealthy
      .map((r) => `${r.name}: ${r.error || `Status ${r.status}`}`)
      .join('\n')}`;

    await notifySlack(message);
  }

  console.log(JSON.stringify(results, null, 2));
}

runChecks();
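
A single failed probe is often transient. One common way to reduce noisy notifications is to alert only after several consecutive failures, and once more on recovery. The sketch below tracks that state in memory. `FailureTracker`, its method names, and the threshold of 3 are hypothetical, and a cron-driven script would need to persist the counters between runs.

```typescript
type Transition = 'down' | 'recovered' | null;

class FailureTracker {
  private failures: { [name: string]: number } = {};

  constructor(private readonly threshold = 3) {}

  // Records one probe result and reports whether an alert-worthy
  // state change happened on this call.
  record(name: string, healthy: boolean): Transition {
    const previous = this.failures[name] ?? 0;
    if (healthy) {
      this.failures[name] = 0;
      // Only announce recovery if a 'down' alert was sent earlier
      return previous >= this.threshold ? 'recovered' : null;
    }
    this.failures[name] = previous + 1;
    // Fire exactly once, when the threshold is first reached
    return previous + 1 === this.threshold ? 'down' : null;
  }
}
```

Under these assumptions, `runChecks` would call `tracker.record(r.name, r.healthy)` for each result and notify Slack only when the return value is not `null`.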

Dashboard Configuration

Grafana Dashboard JSON

json
{
  "title": "Application Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (route)",
          "legendFormat": "{{ route }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "value": 0, "color": "green" },
              { "value": 1, "color": "yellow" },
              { "value": 5, "color": "red" }
            ]
          }
        }
      }
    },
    {
      "title": "Response Time (p95)",
      "type": "gauge",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "thresholds": {
            "steps": [
              { "value": 0, "color": "green" },
              { "value": 0.5, "color": "yellow" },
              { "value": 2, "color": "red" }
            ]
          }
        }
      }
    },
    {
      "title": "Active Users",
      "type": "stat",
      "targets": [
        {
          "expr": "active_users"
        }
      ]
    }
  ]
}

Vercel Analytics Integration

typescript
// src/app/layout.tsx
import { Analytics } from '@vercel/analytics/react';
import { SpeedInsights } from '@vercel/speed-insights/next';

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en">
      <body>
        {children}
        <Analytics />
        <SpeedInsights />
      </body>
    </html>
  );
}

Status Page

Simple Status Page

typescript
// src/app/status/page.tsx
import { Suspense } from 'react';

interface ServiceStatus {
  name: string;
  status: 'operational' | 'degraded' | 'outage';
  lastChecked: string;
}

async function getStatus(): Promise<ServiceStatus[]> {
  const response = await fetch(
    `${process.env.NEXT_PUBLIC_APP_URL}/api/health`,
    { next: { revalidate: 60 } }
  );

  if (!response.ok) {
    return [
      { name: 'API', status: 'outage', lastChecked: new Date().toISOString() },
    ];
  }

  const health = await response.json();

  return [
    {
      name: 'API',
      status: health.status === 'healthy' ? 'operational' : 'degraded',
      lastChecked: health.timestamp,
    },
    {
      name: 'Database',
      status: health.checks.database.status === 'pass' ? 'operational' : 'outage',
      lastChecked: health.timestamp,
    },
    {
      name: 'Cache',
      status: health.checks.redis.status === 'pass' ? 'operational' : 'degraded',
      lastChecked: health.timestamp,
    },
  ];
}

function StatusBadge({ status }: { status: ServiceStatus['status'] }) {
  const colors = {
    operational: 'bg-green-500',
    degraded: 'bg-yellow-500',
    outage: 'bg-red-500',
  };

  return (
    <span className={`inline-block w-3 h-3 rounded-full ${colors[status]}`} />
  );
}

export default async function StatusPage() {
  const services = await getStatus();
  const allOperational = services.every((s) => s.status === 'operational');

  return (
    <div className="max-w-2xl mx-auto p-8">
      <h1 className="text-2xl font-bold mb-8">System Status</h1>

      <div className={`p-4 rounded-lg mb-8 ${
        allOperational ? 'bg-green-100' : 'bg-yellow-100'
      }`}>
        <p className="font-medium">
          {allOperational
            ? 'All systems operational'
            : 'Some systems experiencing issues'}
        </p>
      </div>

      <div className="space-y-4">
        {services.map((service) => (
          <div
            key={service.name}
            className="flex items-center justify-between p-4 border rounded"
          >
            <div className="flex items-center gap-3">
              <StatusBadge status={service.status} />
              <span className="font-medium">{service.name}</span>
            </div>
            <span className="text-sm text-gray-500 capitalize">
              {service.status}
            </span>
          </div>
        ))}
      </div>

      <p className="mt-8 text-sm text-gray-500">
        Last updated: {new Date().toLocaleString()}
      </p>
    </div>
  );
}

Monitoring Checklist

Application Monitoring

  • Health check endpoint
  • Request latency metrics
  • Error rate tracking
  • Active user count
  • Business metrics (orders, signups, etc.)

Infrastructure Monitoring

  • CPU/Memory utilization
  • Disk space
  • Network I/O
  • Database connections
  • Cache hit rate

Alerting

  • Error rate thresholds
  • Latency thresholds
  • Uptime monitoring
  • Resource alerts
  • On-call rotation configured

Dashboards

  • Overview dashboard
  • API performance
  • Database metrics
  • Business KPIs
  • Status page (public)

When to Use This Skill

Invoke this skill when:
  • Setting up monitoring for a new project
  • Creating health check endpoints
  • Implementing metrics collection
  • Configuring alerting rules
  • Building monitoring dashboards
  • Setting up status pages
  • Debugging performance issues
  • Planning capacity