distributed-tracing-logs

Distributed Tracing with Logs

Implement distributed tracing using logs by propagating trace context, creating span logs, using correlation IDs, and integrating with OpenTelemetry standards to enable end-to-end request tracing across distributed systems.

When to use me

Use this skill when:
  • Building or maintaining distributed systems (microservices, serverless functions)
  • Tracing requests across multiple service boundaries
  • Debugging issues that span multiple components or services
  • Implementing observability for complex workflows
  • Correlating logs from different services for a single user request
  • Setting up OpenTelemetry or other tracing standards
  • Analyzing latency and performance across service boundaries
  • Implementing request context propagation
  • Building audit trails for business transactions

What I do

1. Trace Context Propagation

  • Generate trace and span IDs for request initiation
  • Propagate context through HTTP headers across services
  • Maintain context through async operations (queues, background jobs, callbacks)
  • Handle context in batch processing and streaming systems
  • Implement context extraction and injection middleware
  • Manage sampling decisions for trace collection

2. Span Logging

  • Create span start/end logs with timing information
  • Log span attributes and events during execution
  • Capture parent-child relationships between spans
  • Record span status and errors for failed operations
  • Include business context in span logs
  • Implement span baggage for custom key-value propagation

3. Correlation & Context Management

  • Generate correlation IDs for business transactions
  • Link logs to traces through trace_id fields
  • Maintain user/session context across service boundaries
  • Propagate business identifiers (order_id, transaction_id, etc.)
  • Handle context in distributed transactions
  • Implement context storage and retrieval for long-running operations
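Linking logs to traces through a trace_id field can be sketched with the standard library alone. The example below is a minimal illustration, not a prescribed implementation: the names trace_id_var, TraceIdFilter, and make_logger are hypothetical, and the trace ID is assumed to be stored in a context variable by whatever middleware extracts it from the request.

```python
import contextvars
import logging

# Hypothetical holder for the current request's trace ID; "-" means "no trace".
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only enrich it


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger whose lines always carry a trace_id field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Because the filter reads a context variable, the same logger can serve concurrent requests without passing the trace ID through every function call.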

4. OpenTelemetry Integration

  • Implement OpenTelemetry SDKs for various languages
  • Configure trace exporters (Jaeger, Zipkin, OTEL Collector, etc.)
  • Set up automatic instrumentation for common frameworks
  • Define custom spans and attributes for business logic
  • Configure sampling strategies for production environments
  • Integrate with existing logging infrastructure

5. Trace Analysis & Visualization

  • Extract trace information from logs for analysis
  • Calculate trace duration and latency across services
  • Identify critical paths and bottlenecks
  • Correlate traces with business metrics
  • Create trace visualizations and dependency graphs
  • Set up trace-based alerting for performance degradation

Trace Context Propagation

W3C Trace Context Standard

The W3C Trace Context specification defines standard HTTP headers for trace propagation:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Header format:
  • traceparent: 00-{trace-id}-{span-id}-{trace-flags}, where 00 is the version, trace-id is 32 lowercase hex characters, span-id (called parent-id in the specification) is 16 lowercase hex characters, and trace-flags is 2 hex characters (01 means sampled)
  • tracestate: Vendor-specific trace state information
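To make the traceparent layout concrete, here is a minimal sketch that parses and formats the header using only the standard library. It handles version 00 only, and the names TraceContext, parse_traceparent, format_traceparent, and new_root_context are hypothetical helpers introduced for illustration.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00" layout: 2 hex - 32 hex - 16 hex - 2 hex, all lowercase.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str  # 16 bytes as 32 lowercase hex characters
    span_id: str   # 8 bytes as 16 lowercase hex characters
    sampled: bool  # lowest bit of trace-flags


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialise a context back into a version-00 traceparent value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def new_root_context(sampled: bool = True) -> TraceContext:
    """Start a new trace with random IDs when no incoming header exists."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)
```

Returning None for a malformed header lets the caller fall back to new_root_context() rather than propagate a broken trace.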

Propagation Methods

HTTP Headers (Synchronous calls)

http
GET /api/users HTTP/1.1
Host: api.example.com
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
X-Correlation-Id: tx-123456
X-Request-Id: req-789012
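A service that receives these headers usually keeps the trace ID, creates a new span ID for its own outgoing call, and forwards the correlation ID unchanged. The sketch below shows that step under simple assumptions: outgoing_headers is a hypothetical helper, validation is deliberately light, and only the header names from the request above are handled.

```python
import secrets
from typing import Dict, Mapping


def outgoing_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call from the incoming request headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    parts = lowered.get("traceparent", "").split("-")

    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        # Continue the caller's trace and keep its sampling flags.
        trace_id, flags = parts[1], parts[3]
    else:
        # No usable context: start a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"

    span_id = secrets.token_hex(8)  # new span ID for the outgoing call
    headers = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

    # Forward vendor state and the correlation ID unchanged when present.
    if "tracestate" in lowered:
        headers["tracestate"] = lowered["tracestate"]
    if "x-correlation-id" in lowered:
        headers["X-Correlation-Id"] = lowered["x-correlation-id"]
    return headers
```

The returned dictionary can be passed as the headers argument of whichever HTTP client the service already uses.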

Message Queues (Asynchronous)

json
{
  "headers": {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "correlation_id": "tx-123456"
  },
  "body": {
    "order_id": "ord-789",
    "amount": 99.99
  }
}
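For queues, the same context travels inside the message envelope instead of HTTP headers. The following sketch mirrors the envelope shape shown above; wrap_message and unwrap_message are hypothetical names, and the broker client itself is left out because it depends on the queue in use.

```python
import json
from typing import Any, Dict, Tuple


def wrap_message(body: Dict[str, Any], traceparent: str, correlation_id: str) -> str:
    """Producer side: serialise the body together with its trace headers."""
    envelope = {
        "headers": {"traceparent": traceparent, "correlation_id": correlation_id},
        "body": body,
    }
    return json.dumps(envelope)


def unwrap_message(raw: str) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Consumer side: return (headers, body); headers may be empty."""
    envelope = json.loads(raw)
    return envelope.get("headers", {}), envelope["body"]
```

A consumer would read traceparent from the returned headers before starting its own span, so the asynchronous work stays attached to the original trace.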

Database Operations

sql
-- Include trace context in audit fields
INSERT INTO orders (id, amount, trace_id, span_id, created_at)
VALUES ('ord-789', 99.99, '0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331', NOW());
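The same insert can be issued from application code with bound parameters. This is a minimal sketch using an in-memory SQLite database as a stand-in for the real store; the table layout follows the statement above, and open_db, insert_order, and orders_for_trace are hypothetical helpers.

```python
import sqlite3
from typing import List


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with trace columns on the orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE orders (
            id TEXT PRIMARY KEY,
            amount REAL NOT NULL,
            trace_id TEXT,
            span_id TEXT,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def insert_order(conn: sqlite3.Connection, order_id: str, amount: float,
                 trace_id: str, span_id: str) -> None:
    """Store the order together with the trace context that created it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (id, amount, trace_id, span_id) VALUES (?, ?, ?, ?)",
            (order_id, amount, trace_id, span_id),
        )


def orders_for_trace(conn: sqlite3.Connection, trace_id: str) -> List[str]:
    """Audit query: which orders were written during this trace?"""
    rows = conn.execute(
        "SELECT id FROM orders WHERE trace_id = ? ORDER BY id", (trace_id,)
    )
    return [row[0] for row in rows]
```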

Span Logging Patterns

Basic Span Logging

json
{
  "timestamp": "2026-02-26T18:00:00Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_start",
  "duration_ms": 0,
  "attributes": {
    "order_id": "ord-789",
    "payment_method": "credit_card",
    "amount": 99.99
  }
}
json
{
  "timestamp": "2026-02-26T18:00:00.123Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_end",
  "duration_ms": 123,
  "status": "OK",
  "attributes": {
    "order_id": "ord-789",
    "payment_id": "pay-456",
    "gateway_response": "success"
  }
}

Error Span Logging

json
{
  "timestamp": "2026-02-26T18:00:05.123Z",
  "level": "ERROR",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_end",
  "duration_ms": 5123,
  "status": "ERROR",
  "error_code": "PAYMENT_GATEWAY_TIMEOUT",
  "error_message": "Payment gateway timeout after 5000ms",
  "stack_trace": "...",
  "attributes": {
    "order_id": "ord-789",
    "retry_count": 3,
    "gateway": "stripe"
  }
}
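The start, end, and error records above can be produced by one small wrapper. The sketch below is one possible shape, assuming JSON-serialisable attributes; logged_span and the logger name span_logs are hypothetical, and only a subset of the fields shown above is emitted.

```python
import json
import logging
import time
from contextlib import contextmanager
from datetime import datetime, timezone

logger = logging.getLogger("span_logs")  # hypothetical logger name


def _emit(level, **fields):
    """Write one span event as a single JSON log line."""
    fields["timestamp"] = datetime.now(timezone.utc).isoformat()
    fields["level"] = logging.getLevelName(level)
    logger.log(level, json.dumps(fields))


@contextmanager
def logged_span(span_name, trace_id, span_id, **attributes):
    """Log span_start on entry and span_end (OK or ERROR) on exit."""
    common = {"trace_id": trace_id, "span_id": span_id, "span_name": span_name}
    started = time.monotonic()  # monotonic clock is safe for durations
    _emit(logging.INFO, event="span_start", attributes=attributes, **common)
    try:
        yield attributes  # callers may add attributes while the span runs
    except Exception as exc:
        duration_ms = round((time.monotonic() - started) * 1000)
        _emit(logging.ERROR, event="span_end", status="ERROR",
              duration_ms=duration_ms, error_message=str(exc),
              attributes=attributes, **common)
        raise  # the span only observes the failure, it does not swallow it
    else:
        duration_ms = round((time.monotonic() - started) * 1000)
        _emit(logging.INFO, event="span_end", status="OK",
              duration_ms=duration_ms, attributes=attributes, **common)
```

The logger must be configured at INFO level or lower for the span_start and successful span_end lines to appear.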

Nested Span Logging

json
{
  "timestamp": "2026-02-26T18:00:00Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "c8be7c825a934b7d",
  "parent_span_id": "b7ad6b7169203331",
  "span_name": "charge_card",
  "span_kind": "INTERNAL",
  "event": "span_start",
  "duration_ms": 0,
  "attributes": {
    "order_id": "ord-789",
    "card_last4": "4242"
  }
}

OpenTelemetry Integration

Manual Instrumentation

python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def process_payment(order_id, amount):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)
        
        try:
            # Business logic (charge_credit_card is assumed to be defined elsewhere)
            result = charge_credit_card(order_id, amount)
            span.set_status(Status(StatusCode.OK))
            span.set_attribute("payment_id", result.payment_id)
            return result
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise

Automatic Instrumentation

Configuration for automatic instrumentation of common frameworks:
yaml
opentelemetry:
  instrumentations:
    - name: "opentelemetry-instrumentation-flask"
      enabled: true
    - name: "opentelemetry-instrumentation-sqlalchemy"
      enabled: true
    - name: "opentelemetry-instrumentation-requests"
      enabled: true
  
  sampling:
    type: "parentbased_traceidratio"
    ratio: 0.1  # Sample 10% of traces in production
  
  exporters:
    - type: "otlp"
      endpoint: "http://otel-collector:4317"
    - type: "logging"  # Also log spans for local debugging
  
  resource:
    attributes:
      service.name: "payment-service"
      service.version: "1.2.3"
      deployment.environment: "production"
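The parentbased_traceidratio setting combines two ideas: respect the caller's sampling decision when one exists, and otherwise sample a fixed fraction of trace IDs deterministically. The sketch below illustrates that logic in simplified form. It is not the SDK's implementation, and ratio_sampled and parent_based_sampled are hypothetical names.

```python
from typing import Optional

_MASK_64 = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gives the same answer."""
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & _MASK_64) < bound


def parent_based_sampled(trace_id_hex: str, ratio: float,
                         parent_sampled: Optional[bool]) -> bool:
    """Follow the parent's decision when there is one; otherwise use the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)
```

Deriving the decision from the trace ID rather than a random draw means every service that evaluates the same trace reaches the same verdict, which keeps traces complete.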

Examples

bash
# Generate trace context for new request
npm run tracing:generate-context -- --service payment-service --output context.json

# Propagate trace context through HTTP call
npm run tracing:propagate -- --trace-id abc123 --span-id def456 --target http://api.example.com

# Analyze trace from logs
npm run tracing:analyze -- --trace-id abc123 --sources "app.log,api.log,db.log" --output trace.json

# Set up OpenTelemetry instrumentation
npm run tracing:setup-otel -- --language nodejs --exporter jaeger --sampling-ratio 0.1

# Extract trace timeline from logs
npm run tracing:timeline -- --trace-id abc123 --output timeline.html

Output format

Trace Context Configuration:

yaml
tracing:
  standard: "W3C TraceContext"
  headers:
    traceparent: "traceparent"
    tracestate: "tracestate"
    correlation_id: "X-Correlation-Id"
    request_id: "X-Request-Id"
  
  propagation:
    http: true
    messaging: true
    database: true
    rpc: true
    
  sampling:
    strategy: "probability"
    rate: 0.1  # 10% sampling in production
    decision_deferred: false
    
  span_logging:
    enabled: true
    format: "json"
    include_fields:
      - trace_id
      - span_id
      - parent_span_id
      - span_name
      - span_kind
      - event
      - duration_ms
      - status
    events:
      - span_start
      - span_end
      - span_event
      - span_error
      
  correlation:
    business_ids:
      - order_id
      - user_id
      - transaction_id
      - session_id

Trace Analysis Report:

Distributed Trace Analysis
─────────────────────────
Trace ID: 0af7651916cd43dd8448eb211c80319c
Start Time: 2026-02-26T18:00:00Z
Duration: 5.890s
Status: ERROR (partial failure)

Services Involved:
1. api-gateway (entry point)
2. auth-service (authentication)
3. payment-service (payment processing)
4. notification-service (notifications)
5. database (persistence)

Span Timeline:
0.000s - api-gateway: request_received (span_start)
0.123s - api-gateway: auth_check (span_start)
0.234s - auth-service: validate_token (span_start)
0.345s - auth-service: validate_token (span_end) [OK]
0.456s - api-gateway: auth_check (span_end) [OK]
0.567s - payment-service: process_payment (span_start)
1.234s - payment-service: charge_card (span_start)
5.678s - payment-service: charge_card (span_end) [ERROR: timeout]
5.789s - payment-service: process_payment (span_end) [ERROR]
5.890s - api-gateway: request_completed (span_end) [ERROR]

Critical Path Analysis:
- Total duration: 5.890s
- Payment processing: 5.222s (89% of total time)
- Card charging: 4.444s (within payment processing)
- Card charging timed out at 5.678s

Error Analysis:
- Root cause: Payment gateway timeout
- Impact: Payment failed, user notified
- Recovery: Automatic retry scheduled
- Alternative flows: None configured

Performance Insights:
- Slowest service: payment-service (5.222s)
- Fastest service: auth-service (0.111s)
- Bottleneck: External payment gateway call
- Recommendation: Implement circuit breaker for payment gateway

Business Context:
- User ID: user-123
- Order ID: ord-789
- Amount: $99.99
- Payment method: credit_card
- Outcome: Failed (gateway timeout)
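The per-span figures in a report like this come from pairing each span_start with its span_end. The sketch below shows that calculation under stated assumptions: events are (offset_ms, span, event) tuples with integer millisecond offsets, span names are unique within the trace, and span_durations and slowest_span are hypothetical helpers. The sample values in the usage note are illustrative.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str, str]  # (offset in ms, span name, "span_start" | "span_end")


def span_durations(events: Iterable[Event]) -> Dict[str, int]:
    """Pair start and end events by span name and return milliseconds per span."""
    starts: Dict[str, int] = {}
    durations: Dict[str, int] = {}
    for offset_ms, span, event in sorted(events):
        if event == "span_start":
            starts[span] = offset_ms
        elif event == "span_end" and span in starts:
            durations[span] = offset_ms - starts.pop(span)
    return durations


def slowest_span(durations: Dict[str, int]) -> Tuple[str, int]:
    """Return the (span, duration_ms) pair with the longest duration."""
    return max(durations.items(), key=lambda item: item[1])
```

Feeding in a start at 567 ms and an end at 5789 ms for payment-service/process_payment gives a duration of 5222 ms. Spans that never log a span_end are simply absent from the result, which is itself a useful signal when a service crashed mid-request.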

Notes

  • Trace context should be propagated consistently across all service boundaries
  • Sampling is essential in production to manage volume and cost
  • Span logs should include business context for meaningful analysis
  • Trace visualization requires complete context from all services
  • Consider trace storage and retention policies for compliance
  • Monitor trace collection and processing for reliability
  • Implement trace-based alerting for performance degradation detection
  • Test trace propagation in all communication patterns (sync, async, batch)
  • Document trace standards for development teams
  • Regularly review trace sampling rates based on volume and importance