distributed-tracing-logs
Distributed Tracing with Logs
Implement distributed tracing using logs by propagating trace context, creating span logs, using correlation IDs, and integrating with OpenTelemetry standards to enable end-to-end request tracing across distributed systems.
When to use me
Use this skill when:
- Building or maintaining distributed systems (microservices, serverless functions)
- Tracing requests across multiple service boundaries
- Debugging issues that span multiple components or services
- Implementing observability for complex workflows
- Correlating logs from different services for a single user request
- Setting up OpenTelemetry or other tracing standards
- Analyzing latency and performance across service boundaries
- Implementing request context propagation
- Building audit trails for business transactions
What I do
1. Trace Context Propagation
- Generate trace and span IDs for request initiation
- Propagate context through HTTP headers across services
- Maintain context through async operations (queues, background jobs, callbacks)
- Handle context in batch processing and streaming systems
- Implement context extraction and injection middleware
- Manage sampling decisions for trace collection
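Keeping context alive across async work is the item in this list that tends to break first. The following is a minimal sketch, assuming an asyncio-based service and Python's standard `contextvars` module. The names `start_trace`, `child_context`, `background_job`, and `handle_request` are hypothetical and do not come from any tracing library.

```python
import asyncio
import contextvars
import secrets

# Hypothetical context variable holding the active trace context.
_trace_ctx = contextvars.ContextVar("trace_ctx", default=None)


def start_trace():
    """Create a root context: 16-byte trace ID and 8-byte span ID, hex encoded."""
    ctx = {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}
    _trace_ctx.set(ctx)
    return ctx


def child_context():
    """Derive a child span context that keeps the current trace ID."""
    parent = _trace_ctx.get()
    if parent is None:
        return start_trace()
    child = {
        "trace_id": parent["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent["span_id"],
    }
    _trace_ctx.set(child)
    return child


async def background_job(name):
    # An asyncio task runs in a copy of its creator's context, so the
    # parent trace context is visible here without being passed in.
    ctx = child_context()
    await asyncio.sleep(0)
    return name, ctx


async def handle_request():
    root = start_trace()
    jobs = await asyncio.gather(
        asyncio.create_task(background_job("email")),
        asyncio.create_task(background_job("audit")),
    )
    return root, dict(jobs)
```

Calling `asyncio.run(handle_request())` returns the root context and one child context per job. Each child shares the root's `trace_id` and points back at it through `parent_span_id`. Work handed to a queue or another process does not inherit this context and needs explicit injection into the message.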
2. Span Logging
- Create span start/end logs with timing information
- Log span attributes and events during execution
- Capture parent-child relationships between spans
- Record span status and errors for failed operations
- Include business context in span logs
- Implement span baggage for custom key-value propagation
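Span start/end logging with timing and parent-child links can be sketched as a small context manager. This is an illustration only, using the standard library; `log_span` and its `emit` parameter are hypothetical, and the field names follow the span log examples used elsewhere in this document (timestamp and level are left out for brevity).

```python
import json
import secrets
import time
from contextlib import contextmanager


@contextmanager
def log_span(name, trace_id, parent_span_id=None, emit=print, **attributes):
    """Emit span_start and span_end records as JSON lines (hypothetical helper)."""
    span_id = secrets.token_hex(8)
    base = {"trace_id": trace_id, "span_id": span_id, "span_name": name}
    if parent_span_id:
        base["parent_span_id"] = parent_span_id
    started = time.perf_counter()
    emit(json.dumps({**base, "event": "span_start", "attributes": attributes}))
    record = {**base, "event": "span_end", "status": "OK"}
    try:
        # Hand the span ID to the caller so nested spans can reference it.
        yield span_id
    except Exception as exc:
        record.update(status="ERROR", error_message=str(exc))
        raise
    finally:
        # The end record is written on success and on failure alike.
        record["duration_ms"] = round((time.perf_counter() - started) * 1000, 3)
        emit(json.dumps(record))
```

Nesting one `with log_span(...)` inside another and passing the outer span ID as `parent_span_id` produces the parent-child chain. Passing `emit=some_list.append` instead of the default `print` makes the output easy to inspect in tests.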
3. Correlation & Context Management
- Generate correlation IDs for business transactions
- Link logs to traces through trace_id fields
- Maintain user/session context across service boundaries
- Propagate business identifiers (order_id, transaction_id, etc.)
- Handle context in distributed transactions
- Implement context storage and retrieval for long-running operations
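Linking ordinary application logs to traces usually comes down to stamping every record with the current identifiers. A minimal sketch with the standard `logging` module follows. `request_context`, `TraceContextFilter`, `JsonFormatter`, and `build_logger` are hypothetical names, and the sketch assumes a middleware has already stored the identifiers in the context variable.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request context; a middleware would set these values.
request_context = contextvars.ContextVar("request_context", default={})


class TraceContextFilter(logging.Filter):
    """Copy trace and business identifiers onto every log record."""

    FIELDS = ("trace_id", "span_id", "correlation_id", "order_id")

    def filter(self, record):
        ctx = request_context.get()
        for field in self.FIELDS:
            setattr(record, field, ctx.get(field))
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render the record as one JSON object, omitting unset identifiers."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in TraceContextFilter.FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("payment-service")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger
```

With the filter attached to the handler, existing `logger.info(...)` calls need no changes; searching the log store for a `trace_id` or `correlation_id` then returns every line written for that request.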
4. OpenTelemetry Integration
- Implement OpenTelemetry SDKs for various languages
- Configure trace exporters (Jaeger, Zipkin, OTEL Collector, etc.)
- Set up automatic instrumentation for common frameworks
- Define custom spans and attributes for business logic
- Configure sampling strategies for production environments
- Integrate with existing logging infrastructure
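To make the sampling item concrete, here is a rough sketch of how a parent-based, trace-ID-ratio decision can work. It is an illustration of the idea, not the OpenTelemetry SDK's code; `should_sample` and `parent_based` are hypothetical names. The key property is that the decision depends only on the trace ID, so every service reaches the same answer for the same trace.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministic head-sampling decision derived from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniformly random number
    # and keep the trace when it falls below the ratio threshold.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < round(ratio * (1 << 64))


def parent_based(parent_sampled, trace_id_hex, ratio):
    """Follow the caller's decision if there is one, else decide by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return should_sample(trace_id_hex, ratio)
```

A service at the edge calls `parent_based(None, trace_id, 0.1)` to make the initial decision; downstream services pass the sampled flag they received, which keeps a trace either fully recorded or fully dropped.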
5. Trace Analysis & Visualization
- Extract trace information from logs for analysis
- Calculate trace duration and latency across services
- Identify critical paths and bottlenecks
- Correlate traces with business metrics
- Create trace visualizations and dependency graphs
- Set up trace-based alerting for performance degradation
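Bottleneck detection from span logs can be sketched as follows, assuming JSON log lines that carry `trace_id`, `span_id`, `parent_span_id`, `span_name`, `event`, `duration_ms`, and `status`. The function name `find_bottleneck` is hypothetical. The sketch also assumes child spans run one after another; with parallel children, subtracting their summed durations would understate the parent's own time.

```python
import json
from collections import defaultdict


def find_bottleneck(log_lines, trace_id):
    """Return the span with the largest self time in one trace, or None."""
    spans = {}
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(record, dict):
            continue
        if record.get("trace_id") == trace_id and record.get("event") == "span_end":
            spans[record["span_id"]] = record
    if not spans:
        return None

    # Sum the time each span spent inside its direct children.
    child_time = defaultdict(float)
    for span in spans.values():
        parent = span.get("parent_span_id")
        if parent in spans:
            child_time[parent] += span["duration_ms"]

    # Self time is a span's duration minus the time attributed to children.
    self_time = {
        sid: span["duration_ms"] - child_time[sid] for sid, span in spans.items()
    }
    worst = max(self_time, key=self_time.get)
    return {
        "span_name": spans[worst]["span_name"],
        "self_time_ms": self_time[worst],
        "errors": sorted(
            s["span_name"] for s in spans.values() if s.get("status") == "ERROR"
        ),
    }
```

The result names the span where time was actually spent, which is usually more useful for alerting than the span with the longest total duration (that is almost always the root).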
Trace Context Propagation
W3C Trace Context Standard
The W3C Trace Context specification defines standard HTTP headers for trace propagation:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Header format:
- traceparent: 00-{trace-id}-{span-id}-{trace-flags}
- tracestate: Vendor-specific trace state information
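The traceparent layout can be checked and produced with a few lines of code. This is a minimal sketch that handles only the version 00 layout (two hex digits of version, 32 of trace ID, 16 of span ID, two of flags); the function names `parse_traceparent`, `format_traceparent`, and `start_child` are hypothetical. A production service would normally rely on a tracing library's propagator instead.

```python
import re
import secrets

# Version 00 layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header):
    """Return the fields of a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero trace or span IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }


def format_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def start_child(header):
    """Continue the incoming trace with a fresh span ID, or start a new trace."""
    parent = parse_traceparent(header)
    if parent is None:
        return format_traceparent(secrets.token_hex(16), secrets.token_hex(8))
    return format_traceparent(
        parent["trace_id"], secrets.token_hex(8), parent["sampled"]
    )
```

For the example header above, `parse_traceparent` yields trace ID `0af7651916cd43dd8448eb211c80319c`, span ID `b7ad6b7169203331`, and `sampled` set to true. An unparseable header is treated as absent, so the service starts a new trace rather than failing the request.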
Propagation Methods
HTTP Headers (Synchronous calls)
http
GET /api/users HTTP/1.1
Host: api.example.com
Traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
X-Correlation-Id: tx-123456
X-Request-Id: req-789012
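On the service side, HTTP propagation means reading these headers on the way in and writing them on the way out. A minimal sketch, assuming headers are available as a plain dictionary; `extract_context` and `inject_context` are hypothetical helpers, not functions from a web framework or tracing library.

```python
import secrets


def extract_context(headers):
    """Read trace context from incoming headers (names are case-insensitive)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    parts = lowered.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_span_id, flags = parts[1], parts[2], parts[3]
    else:
        # No usable header: this service becomes the start of a new trace.
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": secrets.token_hex(8),  # span for the work done here
        "flags": flags,
        "correlation_id": lowered.get("x-correlation-id"),
    }


def inject_context(ctx, headers=None):
    """Return a copy of the outgoing headers with trace context added."""
    headers = dict(headers or {})
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-{ctx['flags']}"
    if ctx.get("correlation_id"):
        headers["X-Correlation-Id"] = ctx["correlation_id"]
    return headers
```

The outgoing `traceparent` carries this service's own span ID, so the next service records it as its parent. The correlation ID is forwarded unchanged because it identifies the business transaction, not a single hop.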
Message Queues (Asynchronous)
json
{
  "headers": {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "correlation_id": "tx-123456"
  },
  "body": {
    "order_id": "ord-789",
    "amount": 99.99
  }
}
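A producer and consumer that use this envelope could look roughly like the sketch below. It is broker-agnostic: `publish` only builds the serialized message and `consume` only reads it, and both names are hypothetical. The actual send and receive calls depend on the queue client in use.

```python
import json
import secrets


def publish(body, ctx):
    """Wrap a payload in an envelope carrying the producer's trace context."""
    envelope = {
        "headers": {
            "traceparent": f"00-{ctx['trace_id']}-{ctx['span_id']}-01",
            "correlation_id": ctx.get("correlation_id"),
        },
        "body": body,
    }
    return json.dumps(envelope)


def consume(raw_message):
    """Restore the trace context on the consumer and start a child span."""
    envelope = json.loads(raw_message)
    _, trace_id, producer_span_id, _ = envelope["headers"]["traceparent"].split("-")
    ctx = {
        "trace_id": trace_id,
        "parent_span_id": producer_span_id,  # links back to the producer
        "span_id": secrets.token_hex(8),
        "correlation_id": envelope["headers"].get("correlation_id"),
    }
    return ctx, envelope["body"]
```

Because the context travels inside the message, the link survives however long the message waits in the queue, which is the main difference from in-process context propagation.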
Database Operations
sql
-- Include trace context in audit fields
INSERT INTO orders (id, amount, trace_id, span_id, created_at)
VALUES ('ord-789', 99.99, '0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331', NOW());
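From application code, the same audit fields are best written with a parameterized query. The sketch below uses the standard-library `sqlite3` module purely so it runs anywhere, which is why it uses `CURRENT_TIMESTAMP` instead of `NOW()`. The helper names `create_orders_table`, `save_order`, and `orders_for_trace` are hypothetical, as is the table definition.

```python
import sqlite3


def create_orders_table(conn):
    # Hypothetical schema mirroring the columns in the INSERT statement above.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id TEXT PRIMARY KEY, amount REAL, "
        "trace_id TEXT, span_id TEXT, created_at TEXT)"
    )


def save_order(conn, order_id, amount, ctx):
    """Insert an order together with the trace context that created it."""
    conn.execute(
        "INSERT INTO orders (id, amount, trace_id, span_id, created_at) "
        "VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)",
        (order_id, amount, ctx["trace_id"], ctx["span_id"]),
    )
    conn.commit()


def orders_for_trace(conn, trace_id):
    """Find the rows written while handling a given trace."""
    rows = conn.execute(
        "SELECT id FROM orders WHERE trace_id = ? ORDER BY id", (trace_id,)
    )
    return [row[0] for row in rows]
```

Storing the trace ID next to the row lets an investigation go in both directions: from a trace to the data it changed, and from a suspicious row back to the request that wrote it.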
Span Logging Patterns
Basic Span Logging
json
{
  "timestamp": "2026-02-26T18:00:00Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_start",
  "duration_ms": 0,
  "attributes": {
    "order_id": "ord-789",
    "payment_method": "credit_card",
    "amount": 99.99
  }
}
json
{
  "timestamp": "2026-02-26T18:00:00.123Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_end",
  "duration_ms": 123,
  "status": "OK",
  "attributes": {
    "order_id": "ord-789",
    "payment_id": "pay-456",
    "gateway_response": "success"
  }
}
Error Span Logging
json
{
  "timestamp": "2026-02-26T18:00:05.123Z",
  "level": "ERROR",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "span_name": "process_payment",
  "span_kind": "SERVER",
  "event": "span_end",
  "duration_ms": 5123,
  "status": "ERROR",
  "error_code": "PAYMENT_GATEWAY_TIMEOUT",
  "error_message": "Payment gateway timeout after 5000ms",
  "stack_trace": "...",
  "attributes": {
    "order_id": "ord-789",
    "retry_count": 3,
    "gateway": "stripe"
  }
}
Nested Span Logging
json
{
  "timestamp": "2026-02-26T18:00:00Z",
  "level": "INFO",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "c8be7c825a934b7d",
  "parent_span_id": "b7ad6b7169203331",
  "span_name": "charge_card",
  "span_kind": "INTERNAL",
  "event": "span_start",
  "duration_ms": 0,
  "attributes": {
    "order_id": "ord-789",
    "card_last4": "4242"
  }
}
OpenTelemetry Integration
Manual Instrumentation
python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


def process_payment(order_id, amount):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order_id", order_id)
        span.set_attribute("amount", amount)
        try:
            # Business logic; charge_credit_card is assumed to be defined elsewhere
            result = charge_credit_card(order_id, amount)
            span.set_status(Status(StatusCode.OK))
            span.set_attribute("payment_id", result.payment_id)
            return result
        except Exception as e:
            # Attach the exception to the span, mark it failed, then re-raise
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
Automatic Instrumentation
Configuration for automatic instrumentation of common frameworks:
yaml
opentelemetry:
  instrumentations:
    - name: "opentelemetry-instrumentation-flask"
      enabled: true
    - name: "opentelemetry-instrumentation-sqlalchemy"
      enabled: true
    - name: "opentelemetry-instrumentation-requests"
      enabled: true
  sampling:
    type: "parentbased_traceidratio"
    ratio: 0.1  # Sample 10% of traces in production
  exporters:
    - type: "otlp"
      endpoint: "http://otel-collector:4317"
    - type: "logging"  # Also log spans for local debugging
  resource:
    attributes:
      service.name: "payment-service"
      service.version: "1.2.3"
      deployment.environment: "production"
Examples
bash
# Generate trace context for new request
npm run tracing:generate-context -- --service payment-service --output context.json
# Propagate trace context through HTTP call
npm run tracing:propagate -- --trace-id abc123 --span-id def456 --target http://api.example.com
# Analyze trace from logs
npm run tracing:analyze -- --trace-id abc123 --sources "app.log,api.log,db.log" --output trace.json
# Set up OpenTelemetry instrumentation
npm run tracing:setup-otel -- --language nodejs --exporter jaeger --sampling-ratio 0.1
# Extract trace timeline from logs
npm run tracing:timeline -- --trace-id abc123 --output timeline.html
Output format
Trace Context Configuration:
yaml
tracing:
  standard: "W3C TraceContext"
  headers:
    traceparent: "traceparent"
    tracestate: "tracestate"
    correlation_id: "X-Correlation-Id"
    request_id: "X-Request-Id"
  propagation:
    http: true
    messaging: true
    database: true
    rpc: true
  sampling:
    strategy: "probability"
    rate: 0.1  # 10% sampling in production
    decision_deferred: false
  span_logging:
    enabled: true
    format: "json"
    include_fields:
      - trace_id
      - span_id
      - parent_span_id
      - span_name
      - span_kind
      - event
      - duration_ms
      - status
    events:
      - span_start
      - span_end
      - span_event
      - span_error
  correlation:
    business_ids:
      - order_id
      - user_id
      - transaction_id
      - session_id
Trace Analysis Report:
Distributed Trace Analysis
─────────────────────────
Trace ID: 0af7651916cd43dd8448eb211c80319c
Start Time: 2026-02-26T18:00:00Z
Duration: 5.890s
Status: ERROR (partial failure)
Services Involved:
1. api-gateway (entry point)
2. auth-service (authentication)
3. payment-service (payment processing)
4. notification-service (notifications)
5. database (persistence)
Span Timeline:
00.000s - api-gateway: request_received (span_start)
00.123s - api-gateway: auth_check (span_start)
00.234s - auth-service: validate_token (span_start)
00.345s - auth-service: validate_token (span_end) [OK]
00.456s - api-gateway: auth_check (span_end) [OK]
00.567s - payment-service: process_payment (span_start)
01.234s - payment-service: charge_card (span_start)
05.678s - payment-service: charge_card (span_end) [ERROR: timeout]
05.789s - payment-service: process_payment (span_end) [ERROR]
05.890s - api-gateway: request_completed (span_end) [ERROR]
Critical Path Analysis:
- Total duration: 5.890s
- Payment processing: 5.222s (89% of total time)
- Card charging: 4.444s (within payment processing)
- Card charging ended in a timeout at 05.678s
Error Analysis:
- Root cause: Payment gateway timeout
- Impact: Payment failed, user notified
- Recovery: Automatic retry scheduled
- Alternative flows: None configured
Performance Insights:
- Slowest service: payment-service (5.222s)
- Fastest service: auth-service (0.111s)
- Bottleneck: External payment gateway call
- Recommendation: Implement circuit breaker for payment gateway
Business Context:
- User ID: user-123
- Order ID: ord-789
- Amount: $99.99
- Payment method: credit_card
- Outcome: Failed (gateway timeout)
Notes
- Trace context should be propagated consistently across all service boundaries
- Sampling is essential in production to manage volume and cost
- Span logs should include business context for meaningful analysis
- Trace visualization requires complete context from all services
- Consider trace storage and retention policies for compliance
- Monitor trace collection and processing for reliability
- Implement trace-based alerting for performance degradation detection
- Test trace propagation in all communication patterns (sync, async, batch)
- Document trace standards for development teams
- Regularly review trace sampling rates based on volume and importance