observability

Observability - Monitoring, Logging & Tracing

Production patterns for the three pillars: metrics, logs, and traces

When to Use This Skill

Use this skill when:

Setting up application monitoring
Implementing structured logging
Adding error tracking (Sentry, Bugsnag)
Configuring distributed tracing
Building health check endpoints
Creating alerting rules

Don't use this skill when:

Debugging only in development or on a local machine
Using managed platforms with built-in observability

Critical Patterns

Pattern 1: Structured Logging

When: Creating queryable, parseable logs

typescript

// ✅ GOOD: Structured logging with pino
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      hostname: bindings.hostname,
      service: process.env.SERVICE_NAME,
    }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Create request-scoped logger
function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({
    requestId,
    userId,
  });
}

// Usage in handlers
async function handleRequest(req: Request) {
  const log = createRequestLogger(req.id, req.user?.id);

  log.info({ path: req.url, method: req.method }, 'Request started');

  try {
    const result = await processRequest(req);
    log.info({ duration: Date.now() - req.startTime }, 'Request completed');
    return result;
  } catch (error) {
    // Pass the error under `err` so pino's standard serializer records its type, message, and stack
    log.error({
      err: error,
      duration: Date.now() - req.startTime,
    }, 'Request failed');
    throw error;
  }
}

// ❌ BAD: Unstructured console.log
console.log('User ' + userId + ' did ' + action); // Can't query!
console.log(error); // Object won't serialize properly
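
To illustrate why structured output matters, the sketch below filters newline-delimited JSON log lines by field. The sample lines and the `queryLogs` helper are hypothetical and only approximate the shape of pino's output; in practice a log backend would do this filtering.

```typescript
// Hypothetical sample of newline-delimited JSON logs (shape is illustrative)
const rawLogs = [
  '{"level":"info","requestId":"req-1","msg":"Request started"}',
  '{"level":"error","requestId":"req-1","msg":"Request failed"}',
  '{"level":"info","requestId":"req-2","msg":"Request started"}',
].join('\n');

type LogEntry = Record<string, unknown>;

// Parse each line as JSON and keep entries whose fields match every key in `filter`
function queryLogs(raw: string, filter: LogEntry): LogEntry[] {
  return raw
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as LogEntry)
    .filter((entry) =>
      Object.entries(filter).every(([key, value]) => entry[key] === value),
    );
}

const failures = queryLogs(rawLogs, { requestId: 'req-1', level: 'error' });
console.log(failures.length); // 1
```

The same question asked of concatenated `console.log` strings would need a fragile regular expression per message format.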

Pattern 2: Error Tracking

When: Capturing and analyzing production errors

typescript

// ✅ GOOD: Sentry integration
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA,
  tracesSampleRate: 0.1, // 10% of transactions
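  // The class-based integrations below are the @sentry/node v7 API; v8 uses functional integrations instead
  integrations: [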
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Prisma({ client: prisma }),
  ],
  beforeSend(event, hint) {
    // Filter out expected errors
    const error = hint.originalException;
    if (error instanceof AppError && error.isOperational) {
      return null; // Don't send operational errors
    }
    return event;
  },
});

// Capture with context
function captureError(error: Error, context?: Record<string, any>) {
  Sentry.withScope((scope) => {
    if (context) {
      scope.setExtras(context);
    }
    if (context?.userId) {
      scope.setUser({ id: context.userId });
    }
    Sentry.captureException(error);
  });
}

// Usage in error handler
app.use((err, req, res, next) => {
  captureError(err, {
    requestId: req.id,
    userId: req.user?.id,
    path: req.path,
    method: req.method,
    // req.body is deliberately omitted: it can contain passwords, tokens, or PII
  });

  res.status(500).json({ error: 'Internal server error' });
});

// ❌ BAD: No error tracking
app.use((err, req, res, next) => {
  console.error(err); // Lost when container restarts!
  res.status(500).json({ error: err.message }); // Exposes internals
});
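
The `beforeSend` filter above relies on an `AppError` class that is not shown. A minimal sketch of what it could look like follows; the class shape, the `isOperational` default, and the `shouldReport` helper are assumptions for illustration and are not part of Sentry.

```typescript
// Hypothetical application error type: operational errors are expected failures
// (validation, not-found), as opposed to programmer bugs
class AppError extends Error {
  readonly statusCode: number;
  readonly isOperational: boolean;

  constructor(message: string, statusCode = 500, isOperational = true) {
    super(message);
    this.name = 'AppError';
    this.statusCode = statusCode;
    this.isOperational = isOperational;
    // Keeps `instanceof AppError` working when compiling to older targets
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Mirrors the beforeSend condition: report everything except operational AppErrors
function shouldReport(error: unknown): boolean {
  return !(error instanceof AppError && error.isOperational);
}

console.log(shouldReport(new AppError('Order not found', 404))); // false
console.log(shouldReport(new TypeError('x is undefined')));      // true
```

Keeping the decision in one small function makes it easy to unit test which errors reach the tracker.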

Pattern 3: Health Checks

When: Monitoring application and dependency health

typescript

// ✅ GOOD: Comprehensive health check
// app/api/health/route.ts
import { NextResponse } from 'next/server';

interface HealthCheck {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number;
}

interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

async function checkDatabase(): Promise<CheckResult> {
  const start = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    return { status: 'pass', latency: Date.now() - start };
  } catch (error) {
    return { status: 'fail', message: error instanceof Error ? error.message : String(error) };
  }
}

async function checkRedis(): Promise<CheckResult> {
  const start = Date.now();
  try {
    await redis.ping();
    return { status: 'pass', latency: Date.now() - start };
  } catch (error) {
    return { status: 'fail', message: error instanceof Error ? error.message : String(error) };
  }
}

async function checkExternalAPI(): Promise<CheckResult> {
  const start = Date.now();
  try {
    const res = await fetch(process.env.EXTERNAL_API_URL + '/health', {
      signal: AbortSignal.timeout(5000),
    });
    return {
      status: res.ok ? 'pass' : 'fail',
      latency: Date.now() - start,
    };
  } catch (error) {
    return { status: 'fail', message: error instanceof Error ? error.message : String(error) };
  }
}

export async function GET() {
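  // Run the checks in parallel so the endpoint takes as long as the slowest check, not the sum
  const [database, redis, externalApi] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkExternalAPI(),
  ]);
  const checks = { database, redis, externalApi };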

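  const allPassing = Object.values(checks).every(c => c.status === 'pass');
  // Assumption: the database is the only critical dependency. A failure there is
  // 'unhealthy'; any other failure is 'degraded'. Without this split, 'degraded'
  // could never be returned because every check is either 'pass' or 'fail'.
  const criticalFailing = checks.database.status === 'fail';

  const health: HealthCheck = {
    status: allPassing ? 'healthy' : criticalFailing ? 'unhealthy' : 'degraded',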
    checks,
    version: process.env.GIT_SHA || 'unknown',
    uptime: process.uptime(),
  };

  return NextResponse.json(health, {
    status: health.status === 'healthy' ? 200 : 503,
  });
}

// ❌ BAD: Simple health check that always passes
export async function GET() {
  return NextResponse.json({ status: 'ok' }); // Doesn't check anything!
}

Pattern 4: Distributed Tracing

When: Tracking requests across services

typescript

// ✅ GOOD: OpenTelemetry tracing
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.SERVICE_NAME,
});

sdk.start();

// Manual span creation
import { context, propagation, trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      // Child spans are automatically linked
      const order = await fetchOrder(orderId);
      span.setAttribute('order.total', order.total);

      await tracer.startActiveSpan('validateInventory', async (childSpan) => {
        try {
          await validateInventory(order.items);
        } finally {
          childSpan.end(); // End the span even if validation throws
        }
      });

      await tracer.startActiveSpan('processPayment', async (childSpan) => {
        try {
          await processPayment(order);
        } finally {
          childSpan.end(); // End the span even if payment throws
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
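    } catch (error) {
      // Caught values are typed `unknown`; normalize to an Error before reading `.message`
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message,
      });
      span.recordException(err);
      throw error;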
    } finally {
      span.end();
    }
  });
}

// Propagate trace context to external services
async function callExternalService(data: any) {
  const headers: Record<string, string> = {};

  // Inject trace context into headers
  propagation.inject(context.active(), headers);

  return fetch(EXTERNAL_SERVICE_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...headers, // Includes traceparent header
    },
    body: JSON.stringify(data),
  });
}
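
With the W3C Trace Context propagator, the injected `traceparent` header has the form `version-traceId-parentId-flags`. The parser below is a small sketch of what travels over the wire; `parseTraceparent` is a hypothetical helper, and the header value is a sample string rather than output from a real trace.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the calling span's ID
  sampled: boolean; // lowest bit of the flags byte
}

// Parse a traceparent header; returns null when the value is malformed.
// This sketch checks only the layout and does not reject all-zero IDs.
function parseTraceparent(header: string): TraceParent | null {
  const pattern = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
  const match = pattern.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(parsed?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(parsed?.sampled); // true
```

A downstream service that extracts this header continues the same trace ID, which is what links spans across process boundaries.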

Pattern 5: Metrics Collection

When: Measuring application performance and business metrics

typescript

// ✅ GOOD: Prometheus metrics
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// HTTP request metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [registry],
});

// Business metrics
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders',
  labelNames: ['status'],
  registers: [registry],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of currently active users',
  registers: [registry],
});

// Middleware to track requests
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
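    // Use the route template (e.g. '/users/:id'); raw URLs as label values create unbounded cardinality
    const path = req.route?.path ?? 'unmatched';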

    httpRequestsTotal.inc({
      method: req.method,
      path,
      status: res.statusCode,
    });

    httpRequestDuration.observe(
      { method: req.method, path },
      duration
    );
  });

  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});

// Track business events
async function createOrder(data: CreateOrderInput) {
  const order = await db.order.create({ data });
  ordersTotal.inc({ status: 'created' });
  return order;
}
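
The `buckets` array above defines upper bounds in seconds, and Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its bound. The sketch below reproduces that counting in plain TypeScript as an aid for choosing bounds; `cumulativeBucketCounts` is a hypothetical helper, not part of prom-client, and the durations are made up.

```typescript
// Count observations per cumulative bucket, including the implicit +Inf bucket
function cumulativeBucketCounts(
  upperBounds: number[],
  observations: number[],
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const bound of upperBounds) {
    counts.set(String(bound), observations.filter((v) => v <= bound).length);
  }
  counts.set('+Inf', observations.length);
  return counts;
}

const buckets = [0.01, 0.05, 0.1, 0.5, 1, 5];
const durations = [0.004, 0.03, 0.03, 0.2, 0.7, 8]; // seconds, illustrative

const counts = cumulativeBucketCounts(buckets, durations);
console.log(counts.get('0.05')); // 3
console.log(counts.get('5'));    // 5
console.log(counts.get('+Inf')); // 6
```

Here the 8-second request lands only in `+Inf`, a hint that the largest bound may be too low for a slow endpoint.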

Code Examples

For complete, production-ready examples, see references/examples.md:

Request Logging Middleware
Dashboard Alerts Configuration (Prometheus)
Error Boundary with Reporting
Custom Metrics (Prometheus)

Anti-Patterns

Don't: Log Sensitive Data

typescript

// ❌ BAD: Logging passwords, tokens, PII
logger.info({ user: { email, password } }, 'User login');
logger.info({ authorization: req.headers.authorization }, 'Request');

// ✅ GOOD: Redact sensitive fields
logger.info({ userId: user.id }, 'User login'); // Log an opaque ID; the email address is PII
logger.info({ hasAuth: !!req.headers.authorization }, 'Request');
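
Redaction can also be centralized so that individual call sites cannot forget it. pino offers a built-in `redact` option for this; the sketch below shows the same idea in plain TypeScript for JSON-like data. The `redact` helper, the key list, and the `[REDACTED]` placeholder are assumptions for illustration.

```typescript
// Hypothetical list of keys to mask; extend it to match your own payloads
const SENSITIVE_KEYS = new Set(['password', 'authorization', 'token', 'email']);

// Return a deep copy with sensitive keys masked (key match is case-insensitive)
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([key, inner]): [string, unknown] =>
          SENSITIVE_KEYS.has(key.toLowerCase())
            ? [key, '[REDACTED]']
            : [key, redact(inner)],
      ),
    );
  }
  return value;
}

const safe = redact({
  userId: 'u_123',
  user: { email: 'a@example.com', password: 'hunter2' },
  headers: { Authorization: 'Bearer abc' },
});
console.log(JSON.stringify(safe));
// {"userId":"u_123","user":{"email":"[REDACTED]","password":"[REDACTED]"},"headers":{"Authorization":"[REDACTED]"}}
```

Masking by key name is a coarse safety net; it does not catch secrets embedded in free-text messages.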

Don't: Sample Everything at 100%

typescript

// ❌ BAD: Trace every request
tracesSampleRate: 1.0, // Very expensive at scale!

// ✅ GOOD: Sample appropriately
tracesSampleRate: 0.1, // 10% of transactions
// Or use dynamic sampling based on endpoint
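
Dynamic sampling can be expressed as a function from request path to sample rate. The rules and rates below are illustrative assumptions; with Sentry, a function like this could be called from the `tracesSampler` option, which takes the place of a fixed `tracesSampleRate`.

```typescript
interface SamplingRule {
  pattern: RegExp;
  rate: number; // 0 = never trace, 1 = always trace
}

// Hypothetical rules, checked in order; the first match wins
const SAMPLING_RULES: SamplingRule[] = [
  { pattern: /^\/(health|metrics)$/, rate: 0 }, // probes add noise, not insight
  { pattern: /^\/api\/checkout/, rate: 0.5 },   // lower volume, higher value
  { pattern: /^\/api\//, rate: 0.1 },
];
const DEFAULT_RATE = 0.01;

function sampleRateFor(path: string): number {
  const rule = SAMPLING_RULES.find((r) => r.pattern.test(path));
  return rule ? rule.rate : DEFAULT_RATE;
}

console.log(sampleRateFor('/health'));           // 0
console.log(sampleRateFor('/api/checkout/pay')); // 0.5
console.log(sampleRateFor('/api/users/42'));     // 0.1
console.log(sampleRateFor('/static/logo.png'));  // 0.01
```

Ordering the rules from most to least specific keeps the broad `/api/` rule from shadowing the checkout rule.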

Don't: Ignore Alert Fatigue

typescript

// ❌ BAD: Alert on every error
if (error) sendAlert(error); // Gets ignored due to noise

// ✅ GOOD: Alert on actionable thresholds
// Only alert when error rate exceeds 5% for 5 minutes
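
The threshold rule in the comment above can be sketched as a small evaluator over per-minute request counts. The `MinuteBucket` shape and `shouldAlert` are hypothetical names; in a Prometheus setup the same condition would normally live in an alerting rule rather than in application code.

```typescript
interface MinuteBucket {
  requests: number;
  errors: number;
}

// Alert only when the error rate exceeded the threshold in each of the last
// `windowMinutes` buckets, so a single spike does not page anyone
function shouldAlert(
  history: MinuteBucket[], // oldest first, one entry per minute
  threshold = 0.05,
  windowMinutes = 5,
): boolean {
  if (history.length < windowMinutes) return false;
  return history
    .slice(-windowMinutes)
    .every((b) => b.requests > 0 && b.errors / b.requests > threshold);
}

const spike: MinuteBucket[] = [
  { requests: 200, errors: 2 },
  { requests: 200, errors: 2 },
  { requests: 200, errors: 60 }, // one bad minute
  { requests: 200, errors: 2 },
  { requests: 200, errors: 2 },
];
const sustained: MinuteBucket[] = Array.from({ length: 5 }, () => ({
  requests: 200,
  errors: 16, // 8% error rate
}));

console.log(shouldAlert(spike));     // false
console.log(shouldAlert(sustained)); // true
```

Requiring every minute in the window to breach the threshold is one reading of "for 5 minutes"; averaging over the window is another and would alert on the spike above.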

Quick Reference

Pillar	Tool	Use Case
Logs	Pino, Winston	Structured application logs
Metrics	Prometheus	Request counts, latencies
Traces	OpenTelemetry	Distributed request flow
Errors	Sentry	Exception tracking
APM	Datadog, New Relic	Full observability suite

Resources

Keywords

observability

monitoring

logging

tracing

metrics

sentry

prometheus

opentelemetry

health-check

alerting
