distributed-tracing


Distributed Tracing


Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

Purpose


Track requests across distributed systems to understand latency, dependencies, and failure points.

When to Use


  • Debug latency issues
  • Understand service dependencies
  • Identify bottlenecks
  • Trace error propagation
  • Analyze request paths

Distributed Tracing Concepts


Trace Structure


Trace (Request ID: abc123)
└→ Span (frontend) [100ms]
    └→ Span (api-gateway) [80ms]
        ├→ Span (auth-service) [10ms]
        └→ Span (user-service) [60ms]
            └→ Span (database) [40ms]

Key Components


  • Trace - End-to-end request journey
  • Span - Single operation within a trace
  • Context - Metadata propagated between services
  • Tags - Key-value pairs for filtering
  • Logs - Timestamped events within a span
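How these pieces fit together can be sketched with a minimal stdlib-only model. The `Span` class and its fields below are illustrative, not part of any tracing SDK: every span carries the shared trace ID, a parent link, tags, and timestamped logs.

```python
import time
import uuid


class Span:
    """Illustrative span: one operation within a trace."""

    def __init__(self, trace_id, name, parent=None):
        self.trace_id = trace_id          # shared by every span in the trace
        self.span_id = uuid.uuid4().hex[:16]
        self.name = name
        self.parent = parent
        self.tags = {}                    # key-value pairs for filtering
        self.logs = []                    # timestamped events within the span
        self.children = []

    def set_tag(self, key, value):
        self.tags[key] = value

    def log(self, message):
        self.logs.append((time.time(), message))

    def child(self, name):
        span = Span(self.trace_id, name, parent=self)
        self.children.append(span)
        return span


# One trace = the end-to-end journey, rooted at the entry-point span
trace_id = uuid.uuid4().hex
root = Span(trace_id, "frontend")
gateway = root.child("api-gateway")
auth = gateway.child("auth-service")
auth.set_tag("auth.method", "jwt")
auth.log("token validated")

# Every span shares the trace ID; parent links form the dependency tree
assert auth.trace_id == root.trace_id
assert auth.parent is gateway
```

Real SDKs add timing, status codes, and export, but the tree-of-spans-sharing-one-trace-ID shape is the same.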

Jaeger Setup


Kubernetes Deployment


bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF

Docker Compose


yaml
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686" # UI
      - "14268:14268" # Collector
      - "14250:14250" # gRPC
      - "9411:9411" # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
Reference: See
references/jaeger-setup.md

Application Instrumentation


OpenTelemetry (Recommended)


Python (Flask)


python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()

Node.js (Express)


javascript
const { trace } = require("@opentelemetry/api");
const { Resource } = require("@opentelemetry/resources");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const {
  ExpressInstrumentation,
} = require("@opentelemetry/instrumentation-express");

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({ "service.name": "my-service" }),
});

const exporter = new JaegerExporter({
  endpoint: "http://jaeger:14268/api/traces",
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});

const express = require("express");
const app = express();

app.get("/api/users", async (req, res) => {
  const tracer = trace.getTracer("my-service");
  const span = tracer.startSpan("get_users");

  try {
    const users = await fetchUsers();
    span.setAttributes({ "user.count": users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});

Go


go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}
Reference: See
references/instrumentation.md

Context Propagation


HTTP Headers


traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
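The traceparent value follows the W3C Trace Context layout: `version-trace_id-parent_id-flags`. A stdlib-only parser shows how a downstream service recovers the IDs; the helper name is illustrative, real services would use their SDK's propagator instead.

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,        # "00" is the current version
        "trace_id": trace_id,      # 16 bytes, hex-encoded (32 chars)
        "parent_id": parent_id,    # 8 bytes, hex-encoded (16 chars)
        "sampled": int(flags, 16) & 0x01 == 0x01,  # bit 0 = sampled flag
    }


ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
# A child span started here reuses trace_id and records parent_id as its parent
assert ctx["trace_id"] == "0af7651916cd43dd8448eb211c80319c"
assert ctx["sampled"] is True
```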

Propagation in HTTP Requests


Python


python
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context

response = requests.get('http://downstream-service/api', headers=headers)

Node.js


javascript
const { context, propagation } = require("@opentelemetry/api");

const headers = {};
propagation.inject(context.active(), headers);

axios.get("http://downstream-service/api", { headers });

Tempo Setup (Grafana)


Kubernetes Deployment


yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
Reference: See
assets/jaeger-config.yaml.template

Sampling Strategies


Probabilistic Sampling


yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01

Rate Limiting Sampling


yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100

Adaptive Sampling


python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic)
sampler = ParentBased(root=TraceIdRatioBased(0.01))
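Trace-ID ratio sampling is deterministic: the decision is a pure function of the trace ID, so every service that sees the same ID makes the same choice without coordination. The stdlib-only sketch below mirrors the idea (compare the ID's low bits against a threshold); the exact arithmetic is an assumption, not the SDK's implementation.

```python
MAX_ID = 2 ** 64  # decide on the lower 8 bytes of the 16-byte trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Sample iff the trace ID's low bits fall below ratio * MAX_ID."""
    return (trace_id & (MAX_ID - 1)) < int(ratio * MAX_ID)


tid = 0x0AF7651916CD43DD8448EB211C80319C

# Deterministic: repeated calls for the same ID always agree
assert should_sample(tid, 0.01) == should_sample(tid, 0.01)
# ratio=1.0 keeps everything, ratio=0.0 drops everything
assert should_sample(tid, 1.0) is True
assert should_sample(tid, 0.0) is False
```

Because the decision travels with the ID, a trace is either kept end to end or dropped end to end, never half-recorded.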

Trace Analysis


Finding Slow Requests


Jaeger Query:
service=my-service
duration > 1s

Finding Errors


Jaeger Query:
service=my-service
error=true
tags.http.status_code >= 500
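The same lookups can be scripted against Jaeger's HTTP query API (`/api/traces` with `service`, `minDuration`, and `limit` parameters on the query service). The host name below is an assumption for a Kubernetes deployment; adjust it to wherever port 16686 is exposed.

```python
from urllib.parse import urlencode

JAEGER_QUERY = "http://jaeger-query:16686/api/traces"  # assumed query-service host


def slow_request_url(service: str, min_duration: str, limit: int = 20) -> str:
    """Build a query URL for traces slower than min_duration."""
    params = urlencode({
        "service": service,
        "minDuration": min_duration,  # e.g. "1s" or "500ms"
        "limit": limit,
    })
    return f"{JAEGER_QUERY}?{params}"


url = slow_request_url("my-service", "1s")
assert "minDuration=1s" in url and "service=my-service" in url
# Fetch with any HTTP client; traces are under the "data" key of the JSON response
```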

Service Dependency Graph


Jaeger automatically generates service dependency graphs showing:
  • Service relationships
  • Request rates
  • Error rates
  • Average latencies

Best Practices


  1. Sample appropriately (1-10% in production)
  2. Add meaningful tags (user_id, request_id)
  3. Propagate context across all service boundaries
  4. Log exceptions in spans
  5. Use consistent naming for operations
  6. Monitor tracing overhead (<1% CPU impact)
  7. Set up alerts for trace errors
  8. Implement distributed context (baggage)
  9. Use span events for important milestones
  10. Document instrumentation standards

Integration with Logging


Correlated Logs


python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
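Rather than passing the trace ID at every call site, a `logging.Filter` can stamp it onto every record automatically. This stdlib-only sketch stores the ID in a `contextvars.ContextVar` set once per request; in a real service the filter would read it from the tracing SDK's current span instead.

```python
import contextvars
import io
import logging

current_trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("traced")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Set once per request (e.g. in middleware), then log normally anywhere below
current_trace_id.set("0af7651916cd43dd8448eb211c80319c")
logger.info("Processing request")

assert stream.getvalue().startswith("0af7651916cd43dd8448eb211c80319c")
```

With the ID on every record, the log backend can link any log line to its trace in one query.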

Troubleshooting


No traces appearing:
  • Check collector endpoint
  • Verify network connectivity
  • Check sampling configuration
  • Review application logs
High latency overhead:
  • Reduce sampling rate
  • Use batch span processor
  • Check exporter configuration

Reference Files


  • references/jaeger-setup.md
    - Jaeger installation
  • references/instrumentation.md
    - Instrumentation patterns
  • assets/jaeger-config.yaml.template
    - Jaeger configuration

Related Skills


  • prometheus-configuration
    - For metrics
  • grafana-dashboards
    - For visualization
  • slo-implementation
    - For latency SLOs