distributed-tracing
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseDistributed Tracing
分布式追踪
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
使用Jaeger和Tempo实现分布式追踪,以跨微服务实现请求流可见性。
Purpose
目的
Track requests across distributed systems to understand latency, dependencies, and failure points.
跨分布式系统跟踪请求,以了解延迟、依赖关系和故障点。
When to Use
使用场景
- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths
- 调试延迟问题
- 了解服务依赖关系
- 识别性能瓶颈
- 跟踪错误传播
- 分析请求路径
Distributed Tracing Concepts
分布式追踪概念
Trace Structure
追踪结构
Trace (Request ID: abc123)
↓
Span (frontend) [100ms]
↓
Span (api-gateway) [80ms]
├→ Span (auth-service) [10ms]
└→ Span (user-service) [60ms]
└→ Span (database) [40ms]Trace (Request ID: abc123)
↓
Span (frontend) [100ms]
↓
Span (api-gateway) [80ms]
├→ Span (auth-service) [10ms]
└→ Span (user-service) [60ms]
└→ Span (database) [40ms]Key Components
核心组件
- Trace - End-to-end request journey
- Span - Single operation within a trace
- Context - Metadata propagated between services
- Tags - Key-value pairs for filtering
- Logs - Timestamped events within a span
- Trace - 端到端请求链路
- Span - 追踪中的单个操作
- Context - 在服务间传播的元数据
- Tags - 用于过滤的键值对
- Logs - Span内的带时间戳事件
Jaeger Setup
Jaeger 部署
Kubernetes Deployment
Kubernetes 部署
bash
undefinedbash
undefinedDeploy Jaeger Operator
Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability
Deploy Jaeger instance
Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger
namespace: observability
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: http://elasticsearch:9200
ingress:
enabled: true
EOF
undefinedkubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger
namespace: observability
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: http://elasticsearch:9200
ingress:
enabled: true
EOF
undefinedDocker Compose
Docker Compose
yaml
version: "3.8"
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "5775:5775/udp"
- "6831:6831/udp"
- "6832:6832/udp"
- "5778:5778"
- "16686:16686" # UI
- "14268:14268" # Collector
- "14250:14250" # gRPC
- "9411:9411" # Zipkin
environment:
- COLLECTOR_ZIPKIN_HOST_PORT=:9411Reference: See
references/jaeger-setup.mdyaml
version: "3.8"
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "5775:5775/udp"
- "6831:6831/udp"
- "6832:6832/udp"
- "5778:5778"
- "16686:16686" # UI
- "14268:14268" # Collector
- "14250:14250" # gRPC
- "9411:9411" # Zipkin
environment:
- COLLECTOR_ZIPKIN_HOST_PORT=:9411参考: 查看
references/jaeger-setup.mdApplication Instrumentation
应用程序埋点
OpenTelemetry (Recommended)
OpenTelemetry(推荐)
Python (Flask)
Python(Flask)
python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flaskpython
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import FlaskInitialize tracer
Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
Instrument Flask
Instrument Flask
app = Flask(name)
FlaskInstrumentor().instrument_app(app)
@app.route('/api/users')
def get_users():
tracer = trace.get_tracer(name)
with tracer.start_as_current_span("get_users") as span:
span.set_attribute("user.count", 100)
# Business logic
users = fetch_users_from_db()
return {"users": users}def fetch_users_from_db():
tracer = trace.get_tracer(name)
with tracer.start_as_current_span("database_query") as span:
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.statement", "SELECT * FROM users")
# Database query
return query_database()undefinedapp = Flask(name)
FlaskInstrumentor().instrument_app(app)
@app.route('/api/users')
def get_users():
tracer = trace.get_tracer(name)
with tracer.start_as_current_span("get_users") as span:
span.set_attribute("user.count", 100)
# Business logic
users = fetch_users_from_db()
return {"users": users}def fetch_users_from_db():
tracer = trace.get_tracer(name)
with tracer.start_as_current_span("database_query") as span:
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.statement", "SELECT * FROM users")
# Database query
return query_database()undefinedNode.js (Express)
Node.js(Express)
javascript
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const {
ExpressInstrumentation,
} = require("@opentelemetry/instrumentation-express");
// Initialize tracer
const provider = new NodeTracerProvider({
resource: { attributes: { "service.name": "my-service" } },
});
const exporter = new JaegerExporter({
endpoint: "http://jaeger:14268/api/traces",
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
// Instrument libraries
registerInstrumentations({
instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});
const express = require("express");
const app = express();
app.get("/api/users", async (req, res) => {
const tracer = trace.getTracer("my-service");
const span = tracer.startSpan("get_users");
try {
const users = await fetchUsers();
span.setAttributes({ "user.count": users.length });
res.json({ users });
} finally {
span.end();
}
});javascript
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const {
ExpressInstrumentation,
} = require("@opentelemetry/instrumentation-express");
// Initialize tracer
const provider = new NodeTracerProvider({
resource: { attributes: { "service.name": "my-service" } },
});
const exporter = new JaegerExporter({
endpoint: "http://jaeger:14268/api/traces",
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
// Instrument libraries
registerInstrumentations({
instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});
const express = require("express");
const app = express();
app.get("/api/users", async (req, res) => {
const tracer = trace.getTracer("my-service");
const span = tracer.startSpan("get_users");
try {
const users = await fetchUsers();
span.setAttributes({ "user.count": users.length });
res.json({ users });
} finally {
span.end();
}
});Go
Go
go
package main
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/jaeger"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
func initTracer() (*sdktrace.TracerProvider, error) {
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
))
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String("my-service"),
)),
)
otel.SetTracerProvider(tp)
return tp, nil
}
func getUsers(ctx context.Context) ([]User, error) {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "get_users")
defer span.End()
span.SetAttributes(attribute.String("user.filter", "active"))
users, err := fetchUsersFromDB(ctx)
if err != nil {
span.RecordError(err)
return nil, err
}
span.SetAttributes(attribute.Int("user.count", len(users)))
return users, nil
}Reference: See
references/instrumentation.mdgo
package main
import (
"context"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/jaeger"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
func initTracer() (*sdktrace.TracerProvider, error) {
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
))
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String("my-service"),
)),
)
otel.SetTracerProvider(tp)
return tp, nil
}
func getUsers(ctx context.Context) ([]User, error) {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "get_users")
defer span.End()
span.SetAttributes(attribute.String("user.filter", "active"))
users, err := fetchUsersFromDB(ctx)
if err != nil {
span.RecordError(err)
return nil, err
}
span.SetAttributes(attribute.Int("user.count", len(users)))
return users, nil
}参考: 查看
references/instrumentation.mdContext Propagation
上下文传播
HTTP Headers
HTTP 头
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzEtraceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzEPropagation in HTTP Requests
HTTP请求中的上下文传播
Python
Python
python
from opentelemetry.propagate import inject
headers = {}
inject(headers) # Injects trace context
response = requests.get('http://downstream-service/api', headers=headers)python
from opentelemetry.propagate import inject
headers = {}
inject(headers) # Injects trace context
response = requests.get('http://downstream-service/api', headers=headers)Node.js
Node.js
javascript
const { propagation } = require("@opentelemetry/api");
const headers = {};
propagation.inject(context.active(), headers);
axios.get("http://downstream-service/api", { headers });javascript
const { propagation } = require("@opentelemetry/api");
const headers = {};
propagation.inject(context.active(), headers);
axios.get("http://downstream-service/api", { headers });Tempo Setup (Grafana)
Tempo 部署(Grafana)
Kubernetes Deployment
Kubernetes 部署
yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tempo-config
data:
tempo.yaml: |
server:
http_listen_port: 3200
distributor:
receivers:
jaeger:
protocols:
thrift_http:
grpc:
otlp:
protocols:
http:
grpc:
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
querier:
frontend_worker:
frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: tempo
spec:
replicas: 1
template:
spec:
containers:
- name: tempo
image: grafana/tempo:latest
args:
- -config.file=/etc/tempo/tempo.yaml
volumeMounts:
- name: config
mountPath: /etc/tempo
volumes:
- name: config
configMap:
name: tempo-configReference: See
assets/jaeger-config.yaml.templateyaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tempo-config
data:
tempo.yaml: |
server:
http_listen_port: 3200
distributor:
receivers:
jaeger:
protocols:
thrift_http:
grpc:
otlp:
protocols:
http:
grpc:
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
querier:
frontend_worker:
frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: tempo
spec:
replicas: 1
template:
spec:
containers:
- name: tempo
image: grafana/tempo:latest
args:
- -config.file=/etc/tempo/tempo.yaml
volumeMounts:
- name: config
mountPath: /etc/tempo
volumes:
- name: config
configMap:
name: tempo-config参考: 查看
assets/jaeger-config.yaml.templateSampling Strategies
采样策略
Probabilistic Sampling
概率采样
yaml
undefinedyaml
undefinedSample 1% of traces
Sample 1% of traces
sampler:
type: probabilistic
param: 0.01
undefinedsampler:
type: probabilistic
param: 0.01
undefinedRate Limiting Sampling
限流采样
yaml
undefinedyaml
undefinedSample max 100 traces per second
Sample max 100 traces per second
sampler:
type: ratelimiting
param: 100
undefinedsampler:
type: ratelimiting
param: 100
undefinedAdaptive Sampling
自适应采样
python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBasedpython
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBasedSample based on trace ID (deterministic)
Sample based on trace ID (deterministic)
sampler = ParentBased(root=TraceIdRatioBased(0.01))
undefinedsampler = ParentBased(root=TraceIdRatioBased(0.01))
undefinedTrace Analysis
追踪分析
Finding Slow Requests
查找慢请求
Jaeger Query:
service=my-service
duration > 1sJaeger 查询:
service=my-service
duration > 1sFinding Errors
查找错误
Jaeger Query:
service=my-service
error=true
tags.http.status_code >= 500Jaeger 查询:
service=my-service
error=true
tags.http.status_code >= 500Service Dependency Graph
服务依赖图
Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies
Jaeger会自动生成服务依赖图,展示:
- 服务关系
- 请求速率
- 错误率
- 平均延迟
Best Practices
最佳实践
- Sample appropriately (1-10% in production)
- Add meaningful tags (user_id, request_id)
- Propagate context across all service boundaries
- Log exceptions in spans
- Use consistent naming for operations
- Monitor tracing overhead (<1% CPU impact)
- Set up alerts for trace errors
- Implement distributed context (baggage)
- Use span events for important milestones
- Document instrumentation standards
- 合理设置采样率(生产环境1-10%)
- 添加有意义的标签(user_id、request_id)
- 在所有服务边界传播上下文
- 在Span中记录异常
- 使用一致的操作命名
- 监控追踪开销(CPU影响<1%)
- 为追踪错误设置告警
- 实现分布式上下文(baggage)
- 为重要里程碑使用Span事件
- 记录埋点标准
Integration with Logging
与日志系统集成
Correlated Logs
关联日志
python
import logging
from opentelemetry import trace
logger = logging.getLogger(__name__)
def process_request():
span = trace.get_current_span()
trace_id = span.get_span_context().trace_id
logger.info(
"Processing request",
extra={"trace_id": format(trace_id, '032x')}
)python
import logging
from opentelemetry import trace
logger = logging.getLogger(__name__)
def process_request():
span = trace.get_current_span()
trace_id = span.get_span_context().trace_id
logger.info(
"Processing request",
extra={"trace_id": format(trace_id, '032x')}
)Troubleshooting
故障排查
No traces appearing:
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs
High latency overhead:
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
无追踪数据显示:
- 检查收集器端点
- 验证网络连通性
- 检查采样配置
- 查看应用程序日志
高延迟开销:
- 降低采样率
- 使用批量Span处理器
- 检查导出器配置
Reference Files
参考文件
- - Jaeger installation
references/jaeger-setup.md - - Instrumentation patterns
references/instrumentation.md - - Jaeger configuration
assets/jaeger-config.yaml.template
- - Jaeger安装
references/jaeger-setup.md - - 埋点模式
references/instrumentation.md - - Jaeger配置
assets/jaeger-config.yaml.template
Related Skills
相关技能
- - For metrics
prometheus-configuration - - For visualization
grafana-dashboards - - For latency SLOs
slo-implementation
- - 指标监控
prometheus-configuration - - 可视化
grafana-dashboards - - 延迟SLO实现
slo-implementation