observability
Observability - Monitoring, Logging & Tracing
Production patterns for the three pillars: metrics, logs, and traces
When to Use This Skill
Use this skill when:
- Setting up application monitoring
- Implementing structured logging
- Adding error tracking (Sentry, Bugsnag)
- Configuring distributed tracing
- Building health check endpoints
- Creating alerting rules
Don't use this skill when:
- Development/local debugging only
- Using managed platforms with built-in observability
Critical Patterns
Pattern 1: Structured Logging
When: Creating queryable, parseable logs
```typescript
// ✅ GOOD: Structured logging with pino
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    // Emit the level as a label ("info") instead of pino's numeric default
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      hostname: bindings.hostname,
      service: process.env.SERVICE_NAME,
    }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Create request-scoped logger
function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({ requestId, userId });
}

// Usage in handlers
// `Request` is the application's own request type (with id, user, startTime),
// and `processRequest` is the application's handler logic.
async function handleRequest(req: Request) {
  const log = createRequestLogger(req.id, req.user?.id);
  log.info({ path: req.url, method: req.method }, 'Request started');
  try {
    const result = await processRequest(req);
    log.info({ duration: Date.now() - req.startTime }, 'Request completed');
    return result;
  } catch (error) {
    // Caught values are `unknown` in strict TypeScript, so narrow before reading fields
    const err = error instanceof Error ? error : new Error(String(error));
    log.error({
      error: err.message,
      stack: err.stack,
      duration: Date.now() - req.startTime,
    }, 'Request failed');
    throw error;
  }
}

// ❌ BAD: Unstructured console.log
console.log('User ' + userId + ' did ' + action); // Can't query!
console.log(error); // Object won't serialize properly
```
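The remark that an error object won't serialize properly follows from how JavaScript defines `Error`: `message` and `stack` are non-enumerable, so `JSON.stringify` skips them. A minimal sketch of the effect, using a hypothetical `serializeError` helper (an illustration only, not part of pino, which ships its own error serializer):

```typescript
// Hypothetical helper for illustration; the name and shape are assumptions.
interface SerializedError {
  name: string;
  message: string;
  stack: string | undefined;
}

function serializeError(error: unknown): SerializedError {
  if (error instanceof Error) {
    // Copy the non-enumerable fields explicitly so they survive JSON encoding.
    return { name: error.name, message: error.message, stack: error.stack };
  }
  // Non-Error throwables (strings, numbers, plain objects) are stringified.
  return { name: 'NonError', message: String(error), stack: undefined };
}

const failure = new Error('connection refused');

// The naive form loses message and stack because they are non-enumerable.
const naive = JSON.stringify(failure);

// The explicit form keeps the fields a log query would need.
const structured = JSON.stringify(serializeError(failure));
```

Passing the explicit form (or the raw error under a key the logger knows how to serialize) keeps the message and stack searchable in the log backend.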
Pattern 2: Error Tracking
When: Capturing and analyzing production errors
```typescript
// ✅ GOOD: Sentry integration
// Assumes `prisma`, `AppError`, and the Express `app` are defined elsewhere.
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA,
  tracesSampleRate: 0.1, // 10% of transactions
  // Class-based integrations as in @sentry/node v7; newer major versions
  // use function-style integrations instead.
  integrations: [
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Prisma({ client: prisma }),
  ],
  beforeSend(event, hint) {
    // Filter out expected errors
    const error = hint.originalException;
    if (error instanceof AppError && error.isOperational) {
      return null; // Don't send operational errors
    }
    return event;
  },
});

// Capture with context
function captureError(error: Error, context?: Record<string, any>) {
  Sentry.withScope((scope) => {
    if (context) {
      scope.setExtras(context);
    }
    if (context?.userId) {
      scope.setUser({ id: context.userId });
    }
    Sentry.captureException(error);
  });
}

// Usage in error handler
app.use((err, req, res, next) => {
  captureError(err, {
    requestId: req.id,
    userId: req.user?.id,
    path: req.path,
    method: req.method,
    body: req.body, // Redact passwords and tokens first (see Anti-Patterns)
  });
  res.status(500).json({ error: 'Internal server error' });
});

// ❌ BAD: No error tracking
app.use((err, req, res, next) => {
  console.error(err); // Lost when container restarts!
  res.status(500).json({ error: err.message }); // Exposes internals
});
```
Pattern 3: Health Checks
When: Monitoring application and dependency health
```typescript
// ✅ GOOD: Comprehensive health check
// app/api/health/route.ts
// Assumes `prisma` and `redis` clients are created elsewhere in the app.
import { NextResponse } from 'next/server';

interface HealthCheck {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number;
}

interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

// Caught values are `unknown` in strict TypeScript, so narrow before reading `.message`
function errorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

async function checkDatabase(): Promise<CheckResult> {
  const start = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    return { status: 'pass', latency: Date.now() - start };
  } catch (error) {
    return { status: 'fail', message: errorMessage(error) };
  }
}

async function checkRedis(): Promise<CheckResult> {
  const start = Date.now();
  try {
    await redis.ping();
    return { status: 'pass', latency: Date.now() - start };
  } catch (error) {
    return { status: 'fail', message: errorMessage(error) };
  }
}

async function checkExternalAPI(): Promise<CheckResult> {
  const start = Date.now();
  try {
    const res = await fetch(process.env.EXTERNAL_API_URL + '/health', {
      signal: AbortSignal.timeout(5000), // Give up after 5 seconds
    });
    return {
      status: res.ok ? 'pass' : 'fail',
      latency: Date.now() - start,
    };
  } catch (error) {
    return { status: 'fail', message: errorMessage(error) };
  }
}

export async function GET() {
  // The checks are independent, so run them in parallel
  const [database, redisResult, externalApi] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkExternalAPI(),
  ]);
  const checks: Record<string, CheckResult> = { database, redis: redisResult, externalApi };

  const allPassing = Object.values(checks).every(c => c.status === 'pass');
  const anyFailing = Object.values(checks).some(c => c.status === 'fail');

  // With only pass/fail results, any failure yields 'unhealthy';
  // 'degraded' would need a third result state or optional dependencies.
  const health: HealthCheck = {
    status: allPassing ? 'healthy' : anyFailing ? 'unhealthy' : 'degraded',
    checks,
    version: process.env.GIT_SHA || 'unknown',
    uptime: process.uptime(),
  };

  return NextResponse.json(health, {
    status: health.status === 'healthy' ? 200 : 503,
  });
}

// ❌ BAD: Simple health check that always passes
export async function GET() {
  return NextResponse.json({ status: 'ok' }); // Doesn't check anything!
}
```
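In the handler above every check is either `pass` or `fail`, so the `degraded` branch is never reached. One possible way to make it meaningful is to mark each dependency as critical or optional. The sketch below assumes a hypothetical `critical` flag and `aggregateStatus` function, neither of which appears in the original code:

```typescript
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface DependencyCheck {
  status: 'pass' | 'fail';
  // Assumption for this sketch: the service cannot work without critical dependencies.
  critical: boolean;
}

function aggregateStatus(checks: Record<string, DependencyCheck>): OverallStatus {
  const results = Object.values(checks);
  // A failing critical dependency makes the whole service unhealthy.
  if (results.some((c) => c.critical && c.status === 'fail')) return 'unhealthy';
  // Only optional dependencies are failing: the service still works, with reduced function.
  if (results.some((c) => c.status === 'fail')) return 'degraded';
  return 'healthy';
}
```

Whether `degraded` should map to HTTP 200 or 503 depends on whether the load balancer should keep routing traffic to the instance; that choice is left open here.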
Pattern 4: Distributed Tracing
When: Tracking requests across services
```typescript
// ✅ GOOD: OpenTelemetry tracing
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // Note: a `url` passed in code is typically used as-is, so it may need
    // the full signal path (for example one ending in /v1/traces).
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.SERVICE_NAME,
});
sdk.start();

// Manual span creation
const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // Child spans are automatically linked
      const order = await fetchOrder(orderId);
      span.setAttribute('order.total', order.total);

      await tracer.startActiveSpan('validateInventory', async (childSpan) => {
        try {
          await validateInventory(order.items);
        } finally {
          childSpan.end(); // End the span even if validation throws
        }
      });

      await tracer.startActiveSpan('processPayment', async (childSpan) => {
        try {
          await processPayment(order);
        } finally {
          childSpan.end(); // End the span even if payment throws
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      // Caught values are `unknown` in strict TypeScript, so narrow first
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Propagate trace context to external services
// `EXTERNAL_SERVICE_URL` is assumed to be defined elsewhere.
async function callExternalService(data: unknown) {
  const headers: Record<string, string> = {};
  // Inject trace context into headers
  propagation.inject(context.active(), headers);
  return fetch(EXTERNAL_SERVICE_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...headers, // Includes traceparent header
    },
    body: JSON.stringify(data),
  });
}
```
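The `traceparent` header injected above follows the W3C Trace Context format: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a flags byte, joined by hyphens. To show what a downstream service receives, here is a small sketch of a parser. The `parseTraceparent` name is hypothetical, and the sketch covers only the four-field layout of version `00`; in practice the OpenTelemetry propagator does this work.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (!version || !traceId || !parentId || !flags) return null;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The lowest bit of the flags byte marks the trace as sampled.
  const sampled = (parseInt(flags, 16) & 0x01) === 1;
  return { version, traceId, parentId, sampled };
}

const example = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
```

Because the trace ID is carried unchanged across hops, spans created in different services can be joined into one trace by the backend.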
Pattern 5: Metrics Collection
When: Measuring application performance and business metrics
```typescript
// ✅ GOOD: Prometheus metrics
// Assumes the Express `app`, `db`, and `CreateOrderInput` are defined elsewhere.
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// HTTP request metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [registry],
});

// Business metrics
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders',
  labelNames: ['status'],
  registers: [registry],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of currently active users',
  registers: [registry],
});

// Middleware to track requests
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000; // milliseconds to seconds
    // Prefer the route pattern (e.g. /users/:id); raw paths create one label value per distinct URL
    const path = req.route?.path || req.path;
    httpRequestsTotal.inc({
      method: req.method,
      path,
      status: res.statusCode,
    });
    httpRequestDuration.observe({ method: req.method, path }, duration);
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});

// Track business events
async function createOrder(data: CreateOrderInput) {
  const order = await db.order.create({ data });
  ordersTotal.inc({ status: 'created' });
  return order;
}
```
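The middleware above falls back to `req.path` when no route matched, which can produce a new `path` label value for every distinct URL. One way to bound that is to collapse variable segments before using the path as a label. The following is a sketch only; `normalizePath` and its placeholder names are assumptions, not something prom-client provides:

```typescript
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

// Replace ID-like path segments with fixed placeholders so the label set stays small.
function normalizePath(rawPath: string): string {
  return rawPath
    .split('/')
    .map((segment) => {
      if (NUMERIC_SEGMENT.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}
```

With this in place, the fallback could read `req.route?.path || normalizePath(req.path)`, so that unmatched requests such as 404s do not inflate the number of time series.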
Code Examples
For complete, production-ready examples, see references/examples.md:
- Request Logging Middleware
- Dashboard Alerts Configuration (Prometheus)
- Error Boundary with Reporting
- Custom Metrics (Prometheus)
Anti-Patterns
Don't: Log Sensitive Data
```typescript
// ❌ BAD: Logging passwords, tokens, PII
logger.info({ user: { email, password } }, 'User login');
logger.info({ authorization: req.headers.authorization }, 'Request');

// ✅ GOOD: Redact sensitive fields
logger.info({ userId: user.id }, 'User login'); // Email omitted: it is PII too
logger.info({ hasAuth: !!req.headers.authorization }, 'Request');
```
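Choosing safe fields by hand at every call site is easy to forget, so redaction is often centralized. pino offers a built-in `redact` option for this; the dependency-free sketch below shows the underlying idea instead. The `redactSensitive` function and the key list are hypothetical and would need adjusting to the fields an application actually handles.

```typescript
// Hypothetical deny-list; extend it to match the application's own sensitive fields.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'secret']);

// Return a copy of the value with sensitive keys replaced, at any nesting depth.
// Limitation of this sketch: it only handles plain objects, arrays, and primitives.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSensitive(item));
  }
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      result[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : redactSensitive(inner);
    }
    return result;
  }
  return value;
}
```

A call such as `logger.info(redactSensitive({ headers: req.headers }), 'Request')` would then log the header names while masking the `authorization` value.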
Don't: Sample Everything at 100%
```typescript
// ❌ BAD: Trace every request
tracesSampleRate: 1.0, // Very expensive at scale!

// ✅ GOOD: Sample appropriately
tracesSampleRate: 0.1, // 10% of transactions
// Or use dynamic sampling based on endpoint
```
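Dynamic sampling by endpoint can be reduced to a function from request path to sample rate. The sketch below is SDK-independent; the rule list, the rates, and the `sampleRateFor` name are illustrative assumptions. Sentry accepts a `tracesSampler` callback in place of `tracesSampleRate`, but the shape of the context it receives differs between SDK versions, so the wiring is left out.

```typescript
interface SamplingRule {
  prefix: string;
  rate: number; // Probability between 0 and 1
}

// Hypothetical rules: the first matching prefix wins.
const SAMPLING_RULES: SamplingRule[] = [
  { prefix: '/health', rate: 0 },         // Probes add noise, not insight
  { prefix: '/metrics', rate: 0 },        // Scrapes are frequent and uniform
  { prefix: '/api/checkout', rate: 0.5 }, // Business-critical flow, sampled more heavily
];

const DEFAULT_RATE = 0.1;

function sampleRateFor(path: string): number {
  const rule = SAMPLING_RULES.find((r) => path.startsWith(r.prefix));
  return rule ? rule.rate : DEFAULT_RATE;
}
```

Keeping the rules in a plain list makes it straightforward to raise the rate for one endpoint during an investigation and lower it again afterwards.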
Don't: Ignore Alert Fatigue
```typescript
// ❌ BAD: Alert on every error
if (error) sendAlert(error); // Gets ignored due to noise

// ✅ GOOD: Alert on actionable thresholds
// Only alert when error rate exceeds 5% for 5 minutes
```
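A threshold rule like the one described is normally written in the alerting system itself (for example as a Prometheus alert rule). To make the logic concrete, here is an in-process sketch that computes the error rate over a sliding five-minute window. The `ErrorRateWindow` class is hypothetical, and it simplifies the rule: it checks the rate over the last five minutes rather than requiring the condition to hold continuously for five minutes.

```typescript
interface RequestSample {
  timestampMs: number;
  failed: boolean;
}

class ErrorRateWindow {
  private samples: RequestSample[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  record(sample: RequestSample): void {
    this.samples.push(sample);
  }

  // True when the share of failed requests inside the window exceeds the threshold.
  shouldAlert(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop samples that have aged out of the window.
    this.samples = this.samples.filter((s) => s.timestampMs > cutoff);
    if (this.samples.length === 0) return false;
    const failures = this.samples.filter((s) => s.failed).length;
    return failures / this.samples.length > this.threshold;
  }
}

// 5-minute window, 5% threshold, matching the rule described above.
const errorWindow = new ErrorRateWindow(5 * 60 * 1000, 0.05);
```

A single failing request among hundreds of successes stays below the threshold, which is the behavior that keeps alerts actionable.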
Quick Reference
| Pillar | Tool | Use Case |
|---|---|---|
| Logs | Pino, Winston | Structured application logs |
| Metrics | Prometheus | Request counts, latencies |
| Traces | OpenTelemetry | Distributed request flow |
| Errors | Sentry | Exception tracking |
| APM | Datadog, New Relic | Full observability suite |
Resources
Related Skills:
- error-handling: Exception patterns
- performance: Optimization metrics
- ci-cd: Deployment monitoring
Keywords
observability, monitoring, logging, tracing, metrics, sentry, prometheus, opentelemetry, health-check, alerting