Observability - Monitoring, Logging & Tracing

Production patterns for the three pillars: metrics, logs, and traces

When to Use This Skill

Use this skill when:
  • Setting up application monitoring
  • Implementing structured logging
  • Adding error tracking (Sentry, Bugsnag)
  • Configuring distributed tracing
  • Building health check endpoints
  • Creating alerting rules
Don't use this skill when:
  • Debugging only in development or on a local machine
  • Running on a managed platform with built-in observability

Critical Patterns

Pattern 1: Structured Logging

When: Creating queryable, parseable logs
typescript
// ✅ GOOD: Structured logging with pino
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      hostname: bindings.hostname,
      service: process.env.SERVICE_NAME,
    }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Create request-scoped logger
function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({
    requestId,
    userId,
  });
}

// Usage in handlers
async function handleRequest(req: Request) {
  const log = createRequestLogger(req.id, req.user?.id);

  log.info({ path: req.url, method: req.method }, 'Request started');

  try {
    const result = await processRequest(req);
    log.info({ duration: Date.now() - req.startTime }, 'Request completed');
    return result;
  } catch (error) {
    // pino's default `err` serializer records the error type, message, and stack
    log.error({
      err: error,
      duration: Date.now() - req.startTime,
    }, 'Request failed');
    throw error;
  }
}

// ❌ BAD: Unstructured console.log
console.log('User ' + userId + ' did ' + action); // Can't query!
console.log(error); // Object won't serialize properly

Pattern 2: Error Tracking

When: Capturing and analyzing production errors
typescript
// ✅ GOOD: Sentry integration
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA,
  tracesSampleRate: 0.1, // 10% of transactions
  integrations: [
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Prisma({ client: prisma }),
  ],
  beforeSend(event, hint) {
    // Filter out expected errors
    const error = hint.originalException;
    if (error instanceof AppError && error.isOperational) {
      return null; // Don't send operational errors
    }
    return event;
  },
});

// Capture with context
function captureError(error: Error, context?: Record<string, any>) {
  Sentry.withScope((scope) => {
    if (context) {
      scope.setExtras(context);
    }
    if (context?.userId) {
      scope.setUser({ id: context.userId });
    }
    Sentry.captureException(error);
  });
}

// Usage in error handler
app.use((err, req, res, next) => {
  captureError(err, {
    requestId: req.id,
    userId: req.user?.id,
    path: req.path,
    method: req.method,
    // Do not attach req.body here: it may contain passwords, tokens, or PII
  });

  res.status(500).json({ error: 'Internal server error' });
});

// ❌ BAD: No error tracking
app.use((err, req, res, next) => {
  console.error(err); // Lost when container restarts!
  res.status(500).json({ error: err.message }); // Exposes internals
});
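
The `beforeSend` filter above relies on an `AppError` class with an `isOperational` flag, which the source does not define. A minimal sketch of what such a class might look like follows; the constructor parameters and the `shouldReport` helper are assumptions for illustration, not part of Sentry's API.

```typescript
// Hypothetical application error type; the source only implies its shape.
class AppError extends Error {
  constructor(
    message: string,
    public readonly statusCode = 500,
    // Operational errors are expected failures (bad input, missing record),
    // as opposed to programmer bugs.
    public readonly isOperational = true,
  ) {
    super(message);
    this.name = 'AppError';
    // Keep instanceof working when compiling to older targets
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Mirrors the beforeSend condition: report everything except operational AppErrors.
function shouldReport(error: unknown): boolean {
  return !(error instanceof AppError && error.isOperational);
}
```

With this shape, a handler could throw `new AppError('Order not found', 404)` for expected failures, and only unexpected errors would reach the tracker.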

Pattern 3: Health Checks

When: Monitoring application and dependency health
typescript
// ✅ GOOD: Comprehensive health check
// app/api/health/route.ts
import { NextResponse } from 'next/server';

interface HealthCheck {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number;
}

interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

async function checkDatabase(): Promise<CheckResult> {
  const start = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    return { status: 'pass', latency: Date.now() - start };
  } catch (error) {
    return { status: 'fail', message: error.message };
  }
}

async function checkRedis(): Promise<CheckResult> {
  const start = Date.now();
  try {
    await redis.ping();
    return { status: 'pass', latency: Date.now() - start };
  } catch (error) {
    return { status: 'fail', message: error.message };
  }
}

async function checkExternalAPI(): Promise<CheckResult> {
  const start = Date.now();
  try {
    const res = await fetch(process.env.EXTERNAL_API_URL + '/health', {
      signal: AbortSignal.timeout(5000),
    });
    return {
      status: res.ok ? 'pass' : 'fail',
      latency: Date.now() - start,
    };
  } catch (error) {
    return { status: 'fail', message: error.message };
  }
}

export async function GET() {
  // Run the checks concurrently so one slow dependency does not delay the rest
  const [database, redis, externalApi] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkExternalAPI(),
  ]);
  const checks = { database, redis, externalApi };

  const allPassing = Object.values(checks).every(c => c.status === 'pass');
  // Assumption: only the database is critical; failures elsewhere mean 'degraded'
  const criticalFailing = database.status === 'fail';

  const health: HealthCheck = {
    status: allPassing ? 'healthy' : criticalFailing ? 'unhealthy' : 'degraded',
    checks,
    version: process.env.GIT_SHA || 'unknown',
    uptime: process.uptime(),
  };

  return NextResponse.json(health, {
    status: health.status === 'healthy' ? 200 : 503,
  });
}

// ❌ BAD: Simple health check that always passes
export async function GET() {
  return NextResponse.json({ status: 'ok' }); // Doesn't check anything!
}
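
The external API check above sets a five-second timeout, but the database and Redis checks have no time limit, so a hung connection could stall the whole endpoint. One possible way to bound any check is sketched below. `withTimeout` is a hypothetical helper, and the `CheckResult` interface is repeated only so the block stands alone.

```typescript
// Repeated from the health check above so this sketch is self-contained
interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

// Race a check against a timer. A check that does not settle in time,
// or that rejects, is reported as a failure instead of hanging the endpoint.
async function withTimeout(
  check: () => Promise<CheckResult>,
  timeoutMs: number,
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<CheckResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ status: 'fail', message: `timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([check(), timeout]);
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { status: 'fail', message };
  } finally {
    // Clear the timer so it does not keep the process alive
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A call such as `withTimeout(checkDatabase, 2000)` would then replace the bare `checkDatabase()` call. Note that this only stops waiting; it does not cancel the underlying query.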

Pattern 4: Distributed Tracing

When: Tracking requests across services
typescript
// ✅ GOOD: OpenTelemetry tracing
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.SERVICE_NAME,
});

sdk.start();

// Manual span creation
import { context, propagation, trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      // Child spans are automatically linked
      const order = await fetchOrder(orderId);
      span.setAttribute('order.total', order.total);

      await tracer.startActiveSpan('validateInventory', async (childSpan) => {
        try {
          await validateInventory(order.items);
        } finally {
          childSpan.end(); // End the span even if validation throws
        }
      });

      await tracer.startActiveSpan('processPayment', async (childSpan) => {
        try {
          await processPayment(order);
        } finally {
          childSpan.end(); // End the span even if payment fails
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Propagate trace context to external services
async function callExternalService(data: any) {
  const headers: Record<string, string> = {};

  // Inject trace context into headers
  propagation.inject(context.active(), headers);

  return fetch(EXTERNAL_SERVICE_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...headers, // Includes traceparent header
    },
    body: JSON.stringify(data),
  });
}
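
The `traceparent` header mentioned above follows the W3C Trace Context layout: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte, joined by hyphens. The sketch below parses that layout to show what the receiving service sees. `parseTraceParent` is a hypothetical helper that assumes the version `00` field layout; in practice the OpenTelemetry propagator does this work.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid under the specification
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The lowest flag bit marks the trace as sampled
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentId, sampled };
}

const parsed = parseTraceParent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
```

Here `parsed.traceId` is the identifier shared by every span in the request, which is what lets a tracing backend stitch spans from different services together.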

Pattern 5: Metrics Collection

When: Measuring application performance and business metrics
typescript
// ✅ GOOD: Prometheus metrics
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// HTTP request metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [registry],
});

// Business metrics
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders',
  labelNames: ['status'],
  registers: [registry],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of currently active users',
  registers: [registry],
});

// Middleware to track requests
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Use the route template; raw paths (IDs, 404 probes) would create unbounded label values
    const path = req.route?.path || 'unmatched';

    httpRequestsTotal.inc({
      method: req.method,
      path,
      status: res.statusCode,
    });

    httpRequestDuration.observe(
      { method: req.method, path },
      duration
    );
  });

  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});

// Track business events
async function createOrder(data: CreateOrderInput) {
  const order = await db.order.create({ data });
  ordersTotal.inc({ status: 'created' });
  return order;
}
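
The `buckets` array in the histogram above lists upper bounds in seconds. Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its bound, and an implicit `+Inf` bucket counts all of them. The sketch below reproduces that counting with plain arrays to show how a set of durations lands in those buckets. `cumulativeBuckets` and the sample durations are made up for illustration and are not part of `prom-client`.

```typescript
// Count observations into cumulative buckets keyed by upper bound.
function cumulativeBuckets(
  bounds: number[],
  observations: number[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bound of [...bounds].sort((a, b) => a - b)) {
    counts[String(bound)] = observations.filter((value) => value <= bound).length;
  }
  // The +Inf bucket always equals the total number of observations
  counts['+Inf'] = observations.length;
  return counts;
}

const bounds = [0.01, 0.05, 0.1, 0.5, 1, 5];
const durations = [0.004, 0.03, 0.03, 0.2, 0.7, 7.5]; // seconds, illustrative
const bucketCounts = cumulativeBuckets(bounds, durations);
// bucketCounts: { '0.01': 1, '0.05': 3, '0.1': 3, '0.5': 4, '1': 5, '5': 5, '+Inf': 6 }
```

Because quantiles are estimated from these counts, bounds work best when they bracket the latencies you care about; a 7.5-second request here is only visible in `+Inf`.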

Code Examples

For complete, production-ready examples, see references/examples.md:
  • Request Logging Middleware
  • Dashboard Alerts Configuration (Prometheus)
  • Error Boundary with Reporting
  • Custom Metrics (Prometheus)

Anti-Patterns

Don't: Log Sensitive Data

typescript
// ❌ BAD: Logging passwords, tokens, PII
logger.info({ user: { email, password } }, 'User login');
logger.info({ authorization: req.headers.authorization }, 'Request');

// ✅ GOOD: Log identifiers, not secrets or PII
logger.info({ userId: user.id }, 'User login');
logger.info({ hasAuth: !!req.headers.authorization }, 'Request');
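
Choosing safe fields by hand at each call site is easy to get wrong, so a central redaction step can act as a backstop. The sketch below shows one possible approach; `redactFields` and the `SENSITIVE_KEYS` list are hypothetical names for illustration. Pino also ships a built-in `redact` option that takes key paths, which may be a better fit when the sensitive paths are known up front.

```typescript
type LogFields = Record<string, unknown>;

// Hypothetical key list; extend it to match the fields your service handles.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'secret']);

// Return a copy of the fields with sensitive values replaced,
// recursing into nested plain objects. The input is not mutated.
function redactFields(fields: LogFields): LogFields {
  const out: LogFields = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = '[REDACTED]';
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = redactFields(value as LogFields);
    } else {
      out[key] = value;
    }
  }
  return out;
}

const safeFields = redactFields({
  userId: 'u_123',
  user: { plan: 'pro', password: 'hunter2' },
  Authorization: 'Bearer abc',
});
```

The result could then be passed to the logger, for example `logger.info(safeFields, 'User login')`. Key matching is by name only, so values that leak secrets under an unlisted key are not caught.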

Don't: Sample Everything at 100%

typescript
// ❌ BAD: Trace every request
tracesSampleRate: 1.0, // Very expensive at scale!

// ✅ GOOD: Sample appropriately
tracesSampleRate: 0.1, // 10% of transactions
// Or use dynamic sampling based on endpoint
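
Dynamic sampling by endpoint can be expressed as a small lookup from request path to sample rate. The sketch below shows one possible shape; the rule list, the rates, and `sampleRateFor` are hypothetical. Sentry accepts a `tracesSampler` function that returns a rate between 0 and 1, so a function like this could back it, but how the request path is read from the sampling context depends on the SDK version and is left out here.

```typescript
interface SamplingRule {
  prefix: string;
  rate: number; // fraction of requests to trace, from 0 to 1
}

// Hypothetical rules; the first matching prefix wins, so list specific paths first.
const SAMPLING_RULES: SamplingRule[] = [
  { prefix: '/health', rate: 0 },       // never trace health probes
  { prefix: '/api/checkout', rate: 1 }, // always trace the critical path
  { prefix: '/api', rate: 0.2 },
];

const DEFAULT_RATE = 0.05;

function sampleRateFor(path: string): number {
  const rule = SAMPLING_RULES.find((r) => path.startsWith(r.prefix));
  return rule ? rule.rate : DEFAULT_RATE;
}
```

This keeps tracing cost concentrated on the endpoints where a missing trace would hurt most, while noisy, low-value routes contribute little or nothing.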

Don't: Ignore Alert Fatigue

typescript
// ❌ BAD: Alert on every error
if (error) sendAlert(error); // Gets ignored due to noise

// ✅ GOOD: Alert on actionable thresholds
// Only alert when error rate exceeds 5% for 5 minutes
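
The threshold rule above can be sketched as a check over per-minute samples: alert only when the error rate has stayed above 5% in each of the last five minutes. `MinuteSample` and `shouldAlert` are hypothetical names, and the sketch assumes samples arrive once per minute, ordered oldest to newest. In a Prometheus setup the same idea is usually written as an alerting rule with a `for` duration.

```typescript
interface MinuteSample {
  requests: number; // requests seen during that minute
  errors: number;   // failed requests during that minute
}

// Fire only when every one of the last `minutes` samples exceeds the threshold.
// A single bad minute, or too little history, does not alert.
function shouldAlert(
  samples: MinuteSample[],
  threshold = 0.05,
  minutes = 5,
): boolean {
  if (samples.length < minutes) return false;
  return samples
    .slice(-minutes)
    .every((s) => s.requests > 0 && s.errors / s.requests > threshold);
}

const sustained: MinuteSample[] = Array.from({ length: 5 }, () => ({ requests: 200, errors: 20 }));
const oneSpike: MinuteSample[] = [
  { requests: 200, errors: 2 },
  { requests: 200, errors: 2 },
  { requests: 200, errors: 40 },
  { requests: 200, errors: 2 },
  { requests: 200, errors: 2 },
];
```

With these inputs, `shouldAlert(sustained)` is true and `shouldAlert(oneSpike)` is false, which is the behaviour that keeps short spikes from paging anyone.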

Quick Reference

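| Pillar  | Tool               | Use Case                    |
|---------|--------------------|-----------------------------|
| Logs    | Pino, Winston      | Structured application logs |
| Metrics | Prometheus         | Request counts, latencies   |
| Traces  | OpenTelemetry      | Distributed request flow    |
| Errors  | Sentry             | Exception tracking          |
| APM     | Datadog, New Relic | Full observability suite    |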

Resources

Related Skills:
  • error-handling: Exception patterns
  • performance: Optimization metrics
  • ci-cd: Deployment monitoring

Keywords

observability, monitoring, logging, tracing, metrics, sentry, prometheus, opentelemetry, health-check, alerting