monitoring-observability
Monitoring & Observability
Overview
This skill covers the three pillars of observability -- traces, metrics, and logs -- along with alerting, dashboards, and health check patterns. It focuses on OpenTelemetry as the vendor-neutral standard, structured logging for queryability, distributed tracing for microservice debugging, and SLO-based alerting to reduce noise.
Use this skill when instrumenting applications for production visibility, setting up monitoring infrastructure, debugging distributed systems, configuring alerts that matter, or building dashboards for operations and product teams.
Core Principles
- Instrument at boundaries - Trace every external call (HTTP, database, queue, cache). Internal function tracing adds noise; boundary tracing reveals system behavior.
- Structured over unstructured - Every log entry must be JSON with correlation IDs, service name, and context. Unstructured logs are unsearchable at scale.
- Alert on symptoms, not causes - Alert when users are affected (error rate, latency SLO breach), not when a specific server metric spikes. Symptom-based alerting reduces noise by 80%.
- Correlate across signals - A trace ID should connect logs, traces, and metrics for a single request. Without correlation, debugging distributed issues requires guesswork.
- Budget your error rate - Define SLOs (99.9% availability = 43 minutes/month downtime budget). Alert when the error budget burn rate is too fast, not on individual errors.
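The budget arithmetic behind the last principle can be sketched directly. This is a minimal illustration of the numbers, not any particular library; `burnRate` is a hypothetical helper name:

```typescript
// Error-budget math for a 99.9% availability SLO.
// A 30-day month has 43,200 minutes; 0.1% of that is ~43.2 minutes of downtime.
const SLO = 0.999;
const errorBudget = 1 - SLO; // fraction of requests allowed to fail

const monthMinutes = 30 * 24 * 60; // 43200
const downtimeBudgetMinutes = monthMinutes * errorBudget; // ~43.2

// Burn rate: how fast the observed error ratio consumes the budget.
// A sustained burn rate of 14.4 exhausts a 30-day budget in about 50 hours,
// which is why 14.4x is a common paging threshold.
function burnRate(observedErrorRatio: number, slo: number = SLO): number {
  return observedErrorRatio / (1 - slo);
}

const hoursToExhaustion = (30 * 24) / burnRate(0.0144); // ~50 hours at 14.4x
```

Alerting on burn rate rather than raw error counts means a single bad minute does not page anyone, but a sustained burn that threatens the monthly budget does.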
Key Patterns
Pattern 1: OpenTelemetry Instrumentation (Node.js)
When to use: Any production Node.js service that needs traces, metrics, and log correlation.
Implementation:
```typescript
// tracing.ts - Must be imported BEFORE any other module
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
ATTR_DEPLOYMENT_ENVIRONMENT,
} from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? "my-service",
[ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? "0.0.0",
[ATTR_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? "development",
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/metrics",
}),
exportIntervalMillis: 30_000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Disable noisy fs instrumentation
"@opentelemetry/instrumentation-fs": { enabled: false },
// Configure HTTP to capture request/response headers
"@opentelemetry/instrumentation-http": {
requestHook: (span, request) => {
span.setAttribute("http.request.header.x-request-id",
request.headers?.["x-request-id"] ?? "unknown"
);
},
},
}),
],
});
sdk.start();
process.on("SIGTERM", () => {
sdk.shutdown().then(() => process.exit(0));
});
```

```typescript
// Custom span creation for business logic
import { trace, SpanStatusCode, context } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
async function processOrder(orderId: string): Promise<Order> {
return tracer.startActiveSpan("process_order", async (span) => {
try {
span.setAttribute("order.id", orderId);
const order = await tracer.startActiveSpan("fetch_order", async (fetchSpan) => {
const result = await db.orders.findUnique({ where: { id: orderId } });
fetchSpan.setAttribute("order.total", result?.total ?? 0);
fetchSpan.end();
return result;
});
if (!order) {
span.setStatus({ code: SpanStatusCode.ERROR, message: "Order not found" });
throw new OrderNotFoundError(orderId);
}
await tracer.startActiveSpan("charge_payment", async (paymentSpan) => {
paymentSpan.setAttribute("payment.amount", order.total);
await paymentService.charge(order);
paymentSpan.end();
});
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
```

Why: OpenTelemetry provides vendor-neutral instrumentation. Auto-instrumentation captures HTTP, database, and gRPC calls automatically. Custom spans add business context (order IDs, payment amounts) that make traces actionable for debugging.
Pattern 2: Structured Logging with Correlation
When to use: Every application that produces logs (which is every application).
Implementation:
```typescript
import pino from "pino";
import { context, trace } from "@opentelemetry/api";
// Create logger with trace correlation
const logger = pino({
level: process.env.LOG_LEVEL ?? "info",
formatters: {
level: (label) => ({ level: label }),
},
mixin() {
// Automatically inject trace context into every log line
const span = trace.getSpan(context.active());
if (span) {
const spanContext = span.spanContext();
return {
traceId: spanContext.traceId,
spanId: spanContext.spanId,
traceFlags: spanContext.traceFlags,
};
}
return {};
},
// Redact sensitive fields
redact: ["req.headers.authorization", "password", "token", "apiKey"],
});
// Usage - always log structured data, not string interpolation
logger.info({ orderId, userId, total: order.total }, "Order processed successfully");
// NOT this:
// logger.info(`Order ${orderId} processed for user ${userId} with total ${order.total}`);
// Error logging with context
logger.error(
{
err: error,
orderId,
operation: "payment_charge",
paymentProvider: "stripe",
},
"Payment processing failed"
);
// Child loggers for request-scoped context
function createRequestLogger(req: Request) {
return logger.child({
requestId: req.headers.get("x-request-id") ?? crypto.randomUUID(),
path: req.url,
method: req.method,
userAgent: req.headers.get("user-agent"),
});
}
```

Why: Structured logs are queryable. You can filter by `orderId`, correlate with traces via `traceId`, and aggregate error rates by `operation`. String-interpolated logs require regex to extract fields, which breaks at scale.
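With the mixin above in place, a single `logger.info` call produces one self-describing JSON line; every field value shown here is illustrative:

```json
{
  "level": "info",
  "time": 1718000000000,
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "traceFlags": 1,
  "orderId": "ord_123",
  "userId": "usr_456",
  "total": 99.5,
  "msg": "Order processed successfully"
}
```

Because each key is a real field in the log store, filtering on `traceId` lands on exactly the lines belonging to one request.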
Pattern 3: SLO-Based Alerting
When to use: Setting up production alerting that reduces noise and focuses on user impact.
Implementation:
```yaml
# Prometheus alerting rules based on SLOs
groups:
  - name: slo-alerts
    rules:
      # Availability SLO: 99.9% success rate
      # Alert when burning through error budget too fast
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is 14.4x (will exhaust in 1 hour)"
          dashboard: "https://grafana.internal/d/slo-overview"

      # Latency SLO: 99% of requests under 500ms
      - alert: HighLatencyBudgetBurn
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))
          ) < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO breach: >1% of requests exceeding 500ms"

      # Multi-window multi-burn-rate alert (Google SRE book pattern)
      - alert: SLOBreach_MultiWindow
        expr: |
          (
            error_ratio:rate1h > (14.4 * 0.001)
            and
            error_ratio:rate5m > (14.4 * 0.001)
          )
          or
          (
            error_ratio:rate6h > (6 * 0.001)
            and
            error_ratio:rate30m > (6 * 0.001)
          )
        labels:
          severity: critical
```
```typescript
// Application-level SLO tracking with custom metrics
import { metrics } from "@opentelemetry/api";
const meter = metrics.getMeter("slo-metrics");
const requestCounter = meter.createCounter("http.requests.total", {
description: "Total HTTP requests",
});
const requestDuration = meter.createHistogram("http.request.duration", {
description: "HTTP request duration in seconds",
unit: "s",
});
// Middleware to track SLO metrics
function sloMiddleware(req: Request, res: Response, next: NextFunction) {
const start = performance.now();
res.on("finish", () => {
const duration = (performance.now() - start) / 1000;
const attributes = {
method: req.method,
route: req.route?.path ?? "unknown",
status: String(res.statusCode),
success: String(res.statusCode < 500),
};
requestCounter.add(1, attributes);
requestDuration.record(duration, attributes);
});
next();
}
```

Why: Traditional threshold-based alerts (CPU > 80%, errors > 10) generate noise. SLO-based alerting asks "are users affected?" and "how fast are we burning our error budget?" This approach pages only when action is needed.
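The multi-window rule can be restated as a small decision function. This is a sketch: `BurnWindows` and its field names are illustrative, and the thresholds assume a 99.9% SLO (0.001 error budget):

```typescript
// Multi-window, multi-burn-rate paging decision (Google SRE workbook pattern).
// Page only when both a long and a short window agree, so brief spikes do not
// page but sustained burns are caught quickly.
interface BurnWindows {
  rate5m: number;
  rate30m: number;
  rate1h: number;
  rate6h: number;
}

function shouldPage(w: BurnWindows, budget: number = 0.001): boolean {
  const fastBurn = w.rate1h > 14.4 * budget && w.rate5m > 14.4 * budget;
  const slowBurn = w.rate6h > 6 * budget && w.rate30m > 6 * budget;
  return fastBurn || slowBurn;
}
```

The short window acts as a reset: once the incident ends, `rate5m` drops immediately and the alert stops firing even though the hourly average is still elevated.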
Pattern 4: Health Checks and Readiness Probes
When to use: Any service deployed to Kubernetes or behind a load balancer.
Implementation:
```typescript
import { Router } from "express";
interface HealthCheckResult {
status: "healthy" | "degraded" | "unhealthy";
checks: Record<string, {
status: "pass" | "fail" | "warn";
latencyMs: number;
message?: string;
}>;
uptime: number;
version: string;
}
const healthRouter = Router();
// Liveness probe - is the process alive?
// Should NEVER check dependencies. Only checks if the process can respond.
healthRouter.get("/healthz", (req, res) => {
res.status(200).json({ status: "alive" });
});
// Readiness probe - can this instance serve traffic?
// Checks critical dependencies.
healthRouter.get("/readyz", async (req, res) => {
const checks: HealthCheckResult["checks"] = {};
// Check database
const dbStart = performance.now();
try {
await db.$queryRaw`SELECT 1`;
checks.database = { status: "pass", latencyMs: performance.now() - dbStart };
} catch (err) {
checks.database = {
status: "fail",
latencyMs: performance.now() - dbStart,
message: (err as Error).message,
};
}
// Check Redis
const redisStart = performance.now();
try {
await redis.ping();
checks.redis = { status: "pass", latencyMs: performance.now() - redisStart };
} catch (err) {
checks.redis = {
status: "fail",
latencyMs: performance.now() - redisStart,
message: (err as Error).message,
};
}
const allPassing = Object.values(checks).every((c) => c.status === "pass");
const anyFailing = Object.values(checks).some((c) => c.status === "fail");
const result: HealthCheckResult = {
status: anyFailing ? "unhealthy" : allPassing ? "healthy" : "degraded",
checks,
uptime: process.uptime(),
version: process.env.SERVICE_VERSION ?? "unknown",
};
res.status(anyFailing ? 503 : 200).json(result);
});
```

Why: Kubernetes uses liveness probes to restart stuck processes and readiness probes to stop sending traffic to unready instances. Getting these wrong causes cascading failures: a liveness probe that checks the database will restart healthy pods during a database outage, making things worse.
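On the Kubernetes side, the two endpoints wire into the pod spec roughly like this. A sketch only: the container name, port, and timing values are assumptions, not taken from any specific deployment:

```yaml
# Pod spec fragment: liveness hits /healthz (process only),
# readiness hits /readyz (dependency checks).
containers:
  - name: my-service
    ports:
      - containerPort: 3000
    livenessProbe:
      httpGet:
        path: /healthz
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz
        port: 3000
      periodSeconds: 5
      failureThreshold: 2
```

Note the asymmetry: the liveness probe is deliberately slower to trip (it restarts the pod), while the readiness probe reacts faster (it only removes the pod from the load balancer).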
Grafana Dashboard Quick Reference
```json
{
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"targets": [{ "expr": "sum(rate(http_requests_total[5m])) by (status)" }]
},
{
"title": "Error Rate (%)",
"type": "stat",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
}],
"thresholds": [
{ "value": 0, "color": "green" },
{ "value": 0.1, "color": "yellow" },
{ "value": 1, "color": "red" }
]
},
{
"title": "P99 Latency",
"type": "timeseries",
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))"
}]
}
]
}
```

Anti-Patterns
| Anti-Pattern | Why It's Bad | Better Approach |
|---|---|---|
| Logging with `console.log` | No structure, no correlation, no levels | Use pino or winston with JSON output |
| Alerting on every 5xx error | Alert fatigue, team ignores alerts | Alert on SLO breach / error budget burn rate |
| Liveness probe checks database | Restarts pods during DB outage (cascade) | Liveness checks process only; readiness checks deps |
| No trace context propagation | Cannot follow requests across services | Use W3C traceparent header, inject in all clients |
| Sampling 100% of traces | Storage costs explode at scale | Head-based sampling (10-20%) or tail-based for errors |
| Logging PII (emails, IPs) | GDPR/privacy violation | Redact sensitive fields in logger config |
| Dashboard with 50 panels | Information overload, slow to load | Four golden signals: rate, errors, duration, saturation |
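The `traceparent` header mentioned above has a fixed `version-traceid-spanid-flags` shape. A minimal parser sketch, simplified relative to the full W3C Trace Context spec (which also covers future versions and the `tracestate` header):

```typescript
// W3C traceparent: 2-hex version, 32-hex trace ID, 16-hex span ID, 2-hex flags.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string) {
  const m = TRACEPARENT.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Example value from the Trace Context specification
const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
```

In practice the OpenTelemetry auto-instrumentation propagates this header for you; parsing it by hand is mainly useful for clients or middleware outside the instrumented path.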
Checklist
- OpenTelemetry SDK initialized before all other imports
- Auto-instrumentation enabled for HTTP, database, and queue clients
- Custom spans added at business-logic boundaries with relevant attributes
- Structured JSON logging with trace ID correlation
- Sensitive fields redacted in logger configuration
- Liveness probe: checks process only (no dependency checks)
- Readiness probe: checks all critical dependencies
- SLOs defined for availability and latency
- Alerts based on error budget burn rate, not raw thresholds
- Dashboard with four golden signals per service
- Trace sampling strategy configured for production scale
- Log aggregation pipeline shipping to centralized store
Related Resources
- Skills: `performance-engineering` (latency optimization), `application-security` (security logging)
- Rules: `docs/reference/stacks/fullstack-nextjs-nestjs.md` (NestJS instrumentation patterns)
- Rules: `docs/reference/tooling/troubleshooting.md` (debugging with logs)