monitoring-observability
Monitoring & Observability
Overview
This skill covers the three pillars of observability -- traces, metrics, and logs -- along with alerting, dashboards, and health check patterns. It focuses on OpenTelemetry as the vendor-neutral standard, structured logging for queryability, distributed tracing for microservice debugging, and SLO-based alerting to reduce noise.
Use this skill when instrumenting applications for production visibility, setting up monitoring infrastructure, debugging distributed systems, configuring alerts that matter, or building dashboards for operations and product teams.
Core Principles
- Instrument at boundaries - Trace every external call (HTTP, database, queue, cache). Internal function tracing adds noise; boundary tracing reveals system behavior.
- Structured over unstructured - Every log entry must be JSON with correlation IDs, service name, and context. Unstructured logs are unsearchable at scale.
- Alert on symptoms, not causes - Alert when users are affected (error rate, latency SLO breach), not when a specific server metric spikes. Symptom-based alerting reduces noise by 80%.
- Correlate across signals - A trace ID should connect logs, traces, and metrics for a single request. Without correlation, debugging distributed issues requires guesswork.
- Budget your error rate - Define SLOs (99.9% availability = 43 minutes/month downtime budget). Alert when the error budget burn rate is too fast, not on individual errors.
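The budget arithmetic behind the last principle can be sketched directly. This is a minimal illustration of the numbers, not any particular library; `burnRate` is a hypothetical helper name:

```typescript
// Error-budget math for a 99.9% availability SLO.
// A 30-day month has 43,200 minutes; 0.1% of that is ~43.2 minutes of downtime.
const SLO = 0.999;
const errorBudget = 1 - SLO; // fraction of requests allowed to fail

const monthMinutes = 30 * 24 * 60; // 43200
const downtimeBudgetMinutes = monthMinutes * errorBudget; // ~43.2

// Burn rate: how fast the observed error ratio consumes the budget.
// A sustained burn rate of 14.4 exhausts a 30-day budget in about 50 hours,
// which is why 14.4x is a common paging threshold.
function burnRate(observedErrorRatio: number, slo: number = SLO): number {
  return observedErrorRatio / (1 - slo);
}

const hoursToExhaustion = (30 * 24) / burnRate(0.0144); // ~50 hours at 14.4x
```

Alerting on burn rate rather than raw error counts means a single bad minute does not page anyone, but a sustained burn that threatens the monthly budget does.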
Key Patterns
Pattern 1: OpenTelemetry Instrumentation (Node.js)
When to use: Any production Node.js service that needs traces, metrics, and log correlation.
Implementation:
```typescript
// tracing.ts - Must be imported BEFORE any other module
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
ATTR_DEPLOYMENT_ENVIRONMENT,
} from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? "my-service",
[ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? "0.0.0",
[ATTR_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? "development",
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/metrics",
}),
exportIntervalMillis: 30_000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Disable noisy fs instrumentation
"@opentelemetry/instrumentation-fs": { enabled: false },
// Configure HTTP to capture request/response headers
"@opentelemetry/instrumentation-http": {
requestHook: (span, request) => {
span.setAttribute("http.request.header.x-request-id",
request.headers?.["x-request-id"] ?? "unknown"
);
},
},
}),
],
});
sdk.start();
process.on("SIGTERM", () => {
sdk.shutdown().then(() => process.exit(0));
});
```

```typescript
// Custom span creation for business logic
import { trace, SpanStatusCode, context } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
async function processOrder(orderId: string): Promise<Order> {
return tracer.startActiveSpan("process_order", async (span) => {
try {
span.setAttribute("order.id", orderId);
const order = await tracer.startActiveSpan("fetch_order", async (fetchSpan) => {
const result = await db.orders.findUnique({ where: { id: orderId } });
fetchSpan.setAttribute("order.total", result?.total ?? 0);
fetchSpan.end();
return result;
});
if (!order) {
span.setStatus({ code: SpanStatusCode.ERROR, message: "Order not found" });
throw new OrderNotFoundError(orderId);
}
await tracer.startActiveSpan("charge_payment", async (paymentSpan) => {
paymentSpan.setAttribute("payment.amount", order.total);
await paymentService.charge(order);
paymentSpan.end();
});
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
```

Why: OpenTelemetry provides vendor-neutral instrumentation. Auto-instrumentation captures HTTP, database, and gRPC calls automatically. Custom spans add business context (order IDs, payment amounts) that make traces actionable for debugging.
Pattern 2: Structured Logging with Correlation
When to use: Every application that produces logs (which is every application).
Implementation:
```typescript
import pino from "pino";
import { context, trace } from "@opentelemetry/api";
// Create logger with trace correlation
const logger = pino({
level: process.env.LOG_LEVEL ?? "info",
formatters: {
level: (label) => ({ level: label }),
},
mixin() {
// Automatically inject trace context into every log line
const span = trace.getSpan(context.active());
if (span) {
const spanContext = span.spanContext();
return {
traceId: spanContext.traceId,
spanId: spanContext.spanId,
traceFlags: spanContext.traceFlags,
};
}
return {};
},
// Redact sensitive fields
redact: ["req.headers.authorization", "password", "token", "apiKey"],
});
// Usage - always log structured data, not string interpolation
logger.info({ orderId, userId, total: order.total }, "Order processed successfully");
// NOT this:
// logger.info(`Order ${orderId} processed for user ${userId} with total ${order.total}`);
// Error logging with context
logger.error(
{
err: error,
orderId,
operation: "payment_charge",
paymentProvider: "stripe",
},
"Payment processing failed"
);
// Child loggers for request-scoped context
function createRequestLogger(req: Request) {
return logger.child({
requestId: req.headers.get("x-request-id") ?? crypto.randomUUID(),
path: req.url,
method: req.method,
userAgent: req.headers.get("user-agent"),
});
}
```

Why: Structured logs are queryable. You can filter by `orderId`, correlate with traces via `traceId`, and aggregate error rates by `operation`. String-interpolated logs require regex to extract fields, which breaks at scale.
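With the mixin above in place, a single `logger.info` call produces one self-describing JSON line; every field value shown here is illustrative:

```json
{
  "level": "info",
  "time": 1718000000000,
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "traceFlags": 1,
  "orderId": "ord_123",
  "userId": "usr_456",
  "total": 99.5,
  "msg": "Order processed successfully"
}
```

Because each key is a real field in the log store, filtering on `traceId` lands on exactly the lines belonging to one request.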
Pattern 3: SLO-Based Alerting
When to use: Setting up production alerting that reduces noise and focuses on user impact.
Implementation:
```yaml
# Prometheus alerting rules based on SLOs
groups:
  - name: slo-alerts
    rules:
      # Availability SLO: 99.9% success rate
      # Alert when burning through error budget too fast
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is 14.4x (will exhaust in 1 hour)"
          dashboard: "https://grafana.internal/d/slo-overview"

      # Latency SLO: 99% of requests under 500ms
      - alert: HighLatencyBudgetBurn
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))
          ) < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO breach: >1% of requests exceeding 500ms"

      # Multi-window multi-burn-rate alert (Google SRE book pattern)
      - alert: SLOBreach_MultiWindow
        expr: |
          (
            error_ratio:rate1h > (14.4 * 0.001)
            and
            error_ratio:rate5m > (14.4 * 0.001)
          )
          or
          (
            error_ratio:rate6h > (6 * 0.001)
            and
            error_ratio:rate30m > (6 * 0.001)
          )
        labels:
          severity: critical
```
```typescript
// Application-level SLO tracking with custom metrics
import { metrics } from "@opentelemetry/api";
const meter = metrics.getMeter("slo-metrics");
const requestCounter = meter.createCounter("http.requests.total", {
description: "Total HTTP requests",
});
const requestDuration = meter.createHistogram("http.request.duration", {
description: "HTTP request duration in seconds",
unit: "s",
});
// Middleware to track SLO metrics
function sloMiddleware(req: Request, res: Response, next: NextFunction) {
const start = performance.now();
res.on("finish", () => {
const duration = (performance.now() - start) / 1000;
const attributes = {
method: req.method,
route: req.route?.path ?? "unknown",
status: String(res.statusCode),
success: String(res.statusCode < 500),
};
requestCounter.add(1, attributes);
requestDuration.record(duration, attributes);
});
next();
}
```

Why: Traditional threshold-based alerts (CPU > 80%, errors > 10) generate noise. SLO-based alerting asks "are users affected?" and "how fast are we burning our error budget?" This approach pages only when action is needed.
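The multi-window rule can be restated as a small decision function. This is a sketch: `BurnWindows` and its field names are illustrative, and the thresholds assume a 99.9% SLO (0.001 error budget):

```typescript
// Multi-window, multi-burn-rate paging decision (Google SRE workbook pattern).
// Page only when both a long and a short window agree, so brief spikes do not
// page but sustained burns are caught quickly.
interface BurnWindows {
  rate5m: number;
  rate30m: number;
  rate1h: number;
  rate6h: number;
}

function shouldPage(w: BurnWindows, budget: number = 0.001): boolean {
  const fastBurn = w.rate1h > 14.4 * budget && w.rate5m > 14.4 * budget;
  const slowBurn = w.rate6h > 6 * budget && w.rate30m > 6 * budget;
  return fastBurn || slowBurn;
}
```

The short window acts as a reset: once the incident ends, `rate5m` drops immediately and the alert stops firing even though the hourly average is still elevated.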
Pattern 4: Health Checks and Readiness Probes
When to use: Any service deployed to Kubernetes or behind a load balancer.
Implementation:
```typescript
import { Router } from "express";
interface HealthCheckResult {
status: "healthy" | "degraded" | "unhealthy";
checks: Record<string, {
status: "pass" | "fail" | "warn";
latencyMs: number;
message?: string;
}>;
uptime: number;
version: string;
}
const healthRouter = Router();
// Liveness probe - is the process alive?
// Should NEVER check dependencies. Only checks if the process can respond.
healthRouter.get("/healthz", (req, res) => {
res.status(200).json({ status: "alive" });
});
// Readiness probe - can this instance serve traffic?
// Checks critical dependencies.
healthRouter.get("/readyz", async (req, res) => {
const checks: HealthCheckResult["checks"] = {};
// Check database
const dbStart = performance.now();
try {
await db.$queryRaw`SELECT 1`;
checks.database = { status: "pass", latencyMs: performance.now() - dbStart };
} catch (err) {
checks.database = {
status: "fail",
latencyMs: performance.now() - dbStart,
message: (err as Error).message,
};
}
// Check Redis
const redisStart = performance.now();
try {
await redis.ping();
checks.redis = { status: "pass", latencyMs: performance.now() - redisStart };
} catch (err) {
checks.redis = {
status: "fail",
latencyMs: performance.now() - redisStart,
message: (err as Error).message,
};
}
const allPassing = Object.values(checks).every((c) => c.status === "pass");
const anyFailing = Object.values(checks).some((c) => c.status === "fail");
const result: HealthCheckResult = {
status: anyFailing ? "unhealthy" : allPassing ? "healthy" : "degraded",
checks,
uptime: process.uptime(),
version: process.env.SERVICE_VERSION ?? "unknown",
};
res.status(anyFailing ? 503 : 200).json(result);
});
```

Why: Kubernetes uses liveness probes to restart stuck processes and readiness probes to stop sending traffic to unready instances. Getting these wrong causes cascading failures: a liveness probe that checks the database will restart healthy pods during a database outage, making things worse.
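On the Kubernetes side, the two endpoints wire into the pod spec roughly like this. A sketch only: the container name, port, and timing values are assumptions, not taken from any specific deployment:

```yaml
# Pod spec fragment: liveness hits /healthz (process only),
# readiness hits /readyz (dependency checks).
containers:
  - name: my-service
    ports:
      - containerPort: 3000
    livenessProbe:
      httpGet:
        path: /healthz
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz
        port: 3000
      periodSeconds: 5
      failureThreshold: 2
```

Note the asymmetry: the liveness probe is deliberately slower to trip (it restarts the pod), while the readiness probe reacts faster (it only removes the pod from the load balancer).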
Grafana Dashboard Quick Reference
```json
{
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"targets": [{ "expr": "sum(rate(http_requests_total[5m])) by (status)" }]
},
{
"title": "Error Rate (%)",
"type": "stat",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
}],
"thresholds": [
{ "value": 0, "color": "green" },
{ "value": 0.1, "color": "yellow" },
{ "value": 1, "color": "red" }
]
},
{
"title": "P99 Latency",
"type": "timeseries",
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))"
}]
}
]
}
```

Anti-Patterns
| Anti-Pattern | Why It's Bad | Better Approach |
|---|---|---|
| Logging with `console.log` | No structure, no correlation, no levels | Use pino or winston with JSON output |
| Alerting on every 5xx error | Alert fatigue, team ignores alerts | Alert on SLO breach / error budget burn rate |
| Liveness probe checks database | Restarts pods during DB outage (cascade) | Liveness checks process only; readiness checks deps |
| No trace context propagation | Cannot follow requests across services | Use W3C traceparent header, inject in all clients |
| Sampling 100% of traces | Storage costs explode at scale | Head-based sampling (10-20%) or tail-based for errors |
| Logging PII (emails, IPs) | GDPR/privacy violation | Redact sensitive fields in logger config |
| Dashboard with 50 panels | Information overload, slow to load | Four golden signals: rate, errors, duration, saturation |
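The `traceparent` header mentioned above has a fixed `version-traceid-spanid-flags` shape. A minimal parser sketch, simplified relative to the full W3C Trace Context spec (which also covers future versions and the `tracestate` header):

```typescript
// W3C traceparent: 2-hex version, 32-hex trace ID, 16-hex span ID, 2-hex flags.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string) {
  const m = TRACEPARENT.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Example value from the Trace Context specification
const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
```

In practice the OpenTelemetry auto-instrumentation propagates this header for you; parsing it by hand is mainly useful for clients or middleware outside the instrumented path.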
Checklist
- OpenTelemetry SDK initialized before all other imports
- Auto-instrumentation enabled for HTTP, database, and queue clients
- Custom spans added at business-logic boundaries with relevant attributes
- Structured JSON logging with trace ID correlation
- Sensitive fields redacted in logger configuration
- Liveness probe: checks process only (no dependency checks)
- Readiness probe: checks all critical dependencies
- SLOs defined for availability and latency
- Alerts based on error budget burn rate, not raw thresholds
- Dashboard with four golden signals per service
- Trace sampling strategy configured for production scale
- Log aggregation pipeline shipping to centralized store
Related Resources
- Skills: `performance-engineering` (latency optimization), `application-security` (security logging)
- Rules: `docs/reference/stacks/fullstack-nextjs-nestjs.md` (NestJS instrumentation patterns)
- Rules: `docs/reference/tooling/troubleshooting.md` (debugging with logs)