observability-distributed-tracing

Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

Do not use this skill when

  • The task is unrelated to distributed tracing
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

Use this skill when

  • Debug latency issues
  • Understand service dependencies
  • Identify bottlenecks
  • Trace error propagation
  • Analyze request paths

Distributed Tracing Concepts

Trace Structure

Trace (Request ID: abc123)
└→ Span (frontend) [100ms]
   └→ Span (api-gateway) [80ms]
      ├→ Span (auth-service) [10ms]
      └→ Span (user-service) [60ms]
         └→ Span (database) [40ms]

Key Components

  • Trace - End-to-end request journey
  • Span - Single operation within a trace
  • Context - Metadata propagated between services
  • Tags - Key-value pairs for filtering
  • Logs - Timestamped events within a span
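How these pieces relate can be sketched with plain data structures — an illustrative model only, not a real tracing SDK's types (the span IDs and field names here are made up for the example):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One operation in a trace: identified by span_id, linked to its
    parent, carrying tags for filtering and timestamped log events."""
    trace_id: str                 # shared by every span in the request
    span_id: str                  # unique to this operation
    parent_id: Optional[str]      # None for the root span
    operation: str
    duration_ms: int
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)  # (timestamp, message) events

# The request tree from the diagram above, flattened into spans:
trace = [
    Span("abc123", "s1", None, "frontend", 100),
    Span("abc123", "s2", "s1", "api-gateway", 80),
    Span("abc123", "s3", "s2", "auth-service", 10),
    Span("abc123", "s4", "s2", "user-service", 60, tags={"user.count": 100}),
    Span("abc123", "s5", "s4", "database", 40, tags={"db.system": "postgresql"}),
]

# Context is the subset that must cross each service boundary:
context = {"trace_id": "abc123", "parent_id": "s2"}
```

The tree is recoverable from the flat list via parent_id links, which is exactly how tracing backends reassemble spans arriving out of order from different services.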

Jaeger Setup

Kubernetes Deployment

bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF

Docker Compose

yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "14250:14250"  # gRPC
      - "9411:9411"    # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411

Reference: See references/jaeger-setup.md

Application Instrumentation

OpenTelemetry (Recommended)

Python (Flask)

python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()

Node.js (Express)

javascript
const { trace } = require('@opentelemetry/api');
const { Resource } = require('@opentelemetry/resources');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'my-service' })
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');

  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});

Go

go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}

Reference: See references/instrumentation.md

Context Propagation

HTTP Headers

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
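The traceparent value packs four dash-separated fields — version, trace ID, parent span ID, and trace flags — as defined by the W3C Trace Context specification. A quick sketch of pulling them apart (you would normally let the SDK's propagator do this):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,      # "00" is the only version defined so far
        "trace_id": trace_id,    # 16-byte trace ID, hex-encoded (32 chars)
        "parent_id": parent_id,  # 8-byte span ID of the caller (16 chars)
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit = sampled flag
    }

ctx = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
# ctx["trace_id"] → "0af7651916cd43dd8448eb211c80319c"; ctx["sampled"] → True
```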

Propagation in HTTP Requests

Python

python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context

response = requests.get('http://downstream-service/api', headers=headers)

Node.js

javascript
const axios = require('axios');
const { propagation, context } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });

Tempo Setup (Grafana)

Kubernetes Deployment

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: tempo
        image: grafana/tempo:latest
        args:
          - -config.file=/etc/tempo/tempo.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/tempo
      volumes:
      - name: config
        configMap:
          name: tempo-config

Reference: See assets/jaeger-config.yaml.template

Sampling Strategies

Probabilistic Sampling

yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01

Rate Limiting Sampling

yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100

Parent-Based Ratio Sampling

python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic); child spans inherit the
# decision already made by their parent
sampler = ParentBased(root=TraceIdRatioBased(0.01))
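TraceIdRatioBased makes its keep/drop choice from the trace ID itself, which is what makes the decision deterministic: every service that sees the same trace ID reaches the same verdict, so traces are never half-sampled across a call chain. The idea can be sketched without the SDK (illustrative only, not the SDK's exact algorithm):

```python
def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic sampling: keep the trace when the low 64 bits of its
    trace ID fall below ratio * 2**64. Same ID in, same decision out, no
    matter which service evaluates it."""
    bound = int(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound
```

With ratio=0.01, roughly 1% of uniformly random trace IDs land below the bound, matching the probabilistic sampler's rate while staying consistent per trace.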

Trace Analysis

Finding Slow Requests

Jaeger Query:
service=my-service
duration > 1s

Finding Errors

Jaeger Query:
service=my-service
error=true
tags.http.status_code >= 500
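The same duration filter can be applied client-side to spans fetched from Jaeger's HTTP query API. A sketch assuming Jaeger's JSON field names, where span durations are reported in microseconds:

```python
def slow_spans(spans, threshold_us=1_000_000):
    """Return spans longer than the threshold (default 1s), slowest first."""
    hits = [s for s in spans if s["duration"] > threshold_us]
    return sorted(hits, key=lambda s: s["duration"], reverse=True)

# Example spans as they appear in Jaeger's JSON output (durations in µs):
spans = [
    {"operationName": "get_users", "duration": 1_500_000},
    {"operationName": "auth",      "duration": 12_000},
]
# slow_spans(spans) → only the 1.5s "get_users" span
```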

Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:
  • Service relationships
  • Request rates
  • Error rates
  • Average latencies

Best Practices

  1. Sample appropriately (1-10% in production)
  2. Add meaningful tags (user_id, request_id)
  3. Propagate context across all service boundaries
  4. Log exceptions in spans
  5. Use consistent naming for operations
  6. Monitor tracing overhead (<1% CPU impact)
  7. Set up alerts for trace errors
  8. Implement distributed context (baggage)
  9. Use span events for important milestones
  10. Document instrumentation standards
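Practice 5 (consistent naming) is worth pinning down with a convention. For HTTP server spans, "METHOD route-template" — never a raw URL containing IDs — keeps spans aggregating cleanly in the Jaeger UI. A trivial helper (the function name is our own, not from any SDK):

```python
def span_name(method: str, route_template: str) -> str:
    """Build a low-cardinality span name: 'GET /api/users/{id}' rather
    than 'GET /api/users/8472', so identical operations group together."""
    return f"{method.upper()} {route_template}"

# span_name("get", "/api/users/{id}") → "GET /api/users/{id}"
```

Raw URLs with embedded IDs produce one operation name per user, which defeats latency percentiles and the dependency graph alike.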

Integration with Logging

Correlated Logs

python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )

Troubleshooting

No traces appearing:
  • Check collector endpoint
  • Verify network connectivity
  • Check sampling configuration
  • Review application logs

High latency overhead:
  • Reduce sampling rate
  • Use batch span processor
  • Check exporter configuration
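For the "check collector endpoint / verify network connectivity" steps, a minimal TCP reachability probe you can run from inside a pod (the jaeger hostname and collector port 14268 follow the examples above and are deployment-specific):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_reach("jaeger", 14268)  -- collector HTTP endpoint
```

Note this only confirms the socket is open; a sampling rate of 0 or a misconfigured exporter will still produce no traces even when the collector is reachable.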

Reference Files

  • references/jaeger-setup.md - Jaeger installation
  • references/instrumentation.md - Instrumentation patterns
  • assets/jaeger-config.yaml.template - Jaeger configuration

Related Skills

  • prometheus-configuration - For metrics
  • grafana-dashboards - For visualization
  • slo-implementation - For latency SLOs