observability-distributed-tracing
Distributed Tracing
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
Do not use this skill when
- The task is unrelated to distributed tracing
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.
Purpose
Track requests across distributed systems to understand latency, dependencies, and failure points.
Use this skill when
- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths
Distributed Tracing Concepts
Trace Structure
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
Key Components
- Trace - End-to-end request journey
- Span - Single operation within a trace
- Context - Metadata propagated between services
- Tags - Key-value pairs for filtering
- Logs - Timestamped events within a span
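How these components relate can be sketched with plain Python data structures. This is an illustrative model only, not the OpenTelemetry or Jaeger data model: the `Span` class and the `self_time_ms` and `walk` helpers are hypothetical names, and the durations mirror the example trace in Trace Structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Span:
    """Illustrative span: one timed operation inside a trace."""
    name: str
    duration_ms: int
    tags: Dict[str, str] = field(default_factory=dict)         # key-value pairs for filtering
    logs: List[Tuple[int, str]] = field(default_factory=list)  # (offset_ms, message) events
    children: List["Span"] = field(default_factory=list)


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another; overlapping children
    would need interval merging instead of a plain sum.
    """
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# Rebuild the example trace from the diagram.
database = Span("database", 40, tags={"db.system": "postgresql"})
user_service = Span("user-service", 60, children=[database])
auth_service = Span("auth-service", 10, logs=[(2, "token validated")])
api_gateway = Span("api-gateway", 80, children=[auth_service, user_service])
frontend = Span("frontend", 100, children=[api_gateway])

for depth, span in walk(frontend):
    indent = "  " * depth
    print(f"{indent}{span.name}: total={span.duration_ms}ms self={self_time_ms(span)}ms")
```

Self time is often a more useful bottleneck signal than total duration, because a parent span's total includes everything its children did.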
Jaeger Setup
Kubernetes Deployment
bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
Docker Compose
yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "14250:14250"  # gRPC
      - "9411:9411"    # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
Reference: See references/jaeger-setup.md
Application Instrumentation
OpenTelemetry (Recommended)
Python (Flask)
python
from opentelemetry import trace
# Note: the Jaeger-specific exporters are deprecated in newer OpenTelemetry
# releases in favor of OTLP; check which exporter your SDK version supports.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        # Business logic
        users = fetch_users_from_db()
        # Record the actual result size rather than a hard-coded value
        span.set_attribute("user.count", len(users))
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query (query_database is a placeholder for your data access code)
        return query_database()
Node.js (Express)
javascript
const { trace } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: { attributes: { 'service.name': 'my-service' } }
});
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');
  try {
    // fetchUsers is a placeholder for your data access code
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    // Always end the span, even if the handler throws
    span.end();
  }
});
Go
go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// initTracer configures a batching tracer provider that exports to the Jaeger collector.
func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

// getUsers wraps the lookup in a span. User and fetchUsersFromDB are
// placeholders for your own type and data access code.
func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))

	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
Reference: See references/instrumentation.md
Context Propagation
HTTP Headers
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
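The `traceparent` value packs four dash-separated hex fields: version, trace ID, parent span ID, and trace flags. The sketch below decodes it with the standard library only. It assumes the four-field layout defined for version `00` of the W3C Trace Context format; `TraceParent` and `parse_traceparent` are hypothetical names, and in practice the OpenTelemetry propagators do this parsing for you.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the decoded header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C format.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


parsed = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
print(parsed)
```

A downstream service that receives this header continues the same trace ID and uses the parent ID as the parent of its own first span.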
Propagation in HTTP Requests
Python
python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context
response = requests.get('http://downstream-service/api', headers=headers)
Node.js
javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);  // Injects trace context
axios.get('http://downstream-service/api', { headers });
Tempo Setup (Grafana)
Kubernetes Deployment
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
      - name: tempo
        image: grafana/tempo:latest
        args:
          - -config.file=/etc/tempo/tempo.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/tempo
      volumes:
      - name: config
        configMap:
          name: tempo-config
Reference: See assets/jaeger-config.yaml.template
Sampling Strategies
Probabilistic Sampling
yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
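One common way to implement a probabilistic decision is to compare part of the trace ID against a threshold, so every service reaches the same verdict for the same trace without coordinating. The sketch below illustrates that idea with the standard library; `should_sample` is a hypothetical helper, and real samplers may compute the threshold differently.

```python
import random

_MASK_64 = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (trace_id & _MASK_64) < bound


# Simulate 100,000 random 128-bit trace IDs at a 1% sampling ratio.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(tid, 0.01) for tid in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces ({kept / len(trace_ids):.2%})")
```

Because the decision depends only on the trace ID, repeating the call for the same trace always gives the same answer.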
Rate Limiting Sampling
yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
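A rate-limiting sampler can be approximated with a token bucket: tokens refill at the configured rate, and each sampled trace spends one. This is a minimal single-threaded sketch under that assumption, not Jaeger's implementation; `RateLimitingSampler` and its injectable `clock` parameter are hypothetical and exist here mainly to make the behavior easy to test.

```python
import time
from typing import Callable


class RateLimitingSampler:
    """Token bucket that admits at most `max_traces_per_second` on average."""

    def __init__(
        self,
        max_traces_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = max_traces_per_second
        self._capacity = max(max_traces_per_second, 1.0)  # burst size
        self._tokens = self._capacity
        self._clock = clock
        self._last = clock()

    def should_sample(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# Drive the sampler with a fake clock so the demo is deterministic.
fake_now = [0.0]
sampler = RateLimitingSampler(100, clock=lambda: fake_now[0])

burst = sum(sampler.should_sample() for _ in range(250))
fake_now[0] = 0.5  # half a second later, half the budget has refilled
after_refill = sum(sampler.should_sample() for _ in range(250))
print(burst, after_refill)
```

Unlike probabilistic sampling, this caps trace volume during traffic spikes, at the cost of sampling a smaller fraction of requests when load is high.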
Adaptive Sampling
python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic)
sampler = ParentBased(root=TraceIdRatioBased(0.01))
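The parent-based wrapper makes a span follow its caller's sampling decision, and only root spans consult the ratio sampler. The sketch below shows that decision rule in plain Python as an illustration; `parent_based_decision` and `one_percent` are hypothetical names, and the real SDK sampler handles more cases (for example, remote versus local parents).

```python
from typing import Callable, Optional


def one_percent(trace_id: int) -> bool:
    """Root sampler: keep roughly 1% of traces, decided by trace ID."""
    return (trace_id & ((1 << 64) - 1)) < round(0.01 * (1 << 64))


def parent_based_decision(
    parent_sampled: Optional[bool],
    trace_id: int,
    root_sampler: Callable[[int], bool],
) -> bool:
    """Follow the parent's decision if there is one, else ask the root sampler.

    parent_sampled is None for a root span (no incoming trace context).
    """
    if parent_sampled is not None:
        return parent_sampled
    return root_sampler(trace_id)


# A child of a sampled parent is always sampled, whatever its trace ID.
print(parent_based_decision(True, (1 << 64) - 1, one_percent))
# A root span falls back to the ratio-based decision.
print(parent_based_decision(None, 0, one_percent))
```

Following the parent keeps traces complete: either every span of a request is recorded or none is, which avoids traces with missing segments.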
Trace Analysis
Finding Slow Requests
Jaeger Query:
service=my-service
duration > 1s
Finding Errors
Jaeger Query:
service=my-service
error=true
tags.http.status_code >= 500
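The same two filters can be applied offline to spans that have already been exported. The sketch below assumes a Jaeger-style JSON shape in which each span has a `duration` in microseconds and a `tags` list of `{"key", "value"}` objects; treat that shape as an assumption and check it against your own export. `slow_spans`, `error_spans`, and `tag_value` are hypothetical helpers.

```python
from typing import Any, Dict, List, Optional

Span = Dict[str, Any]


def tag_value(span: Span, key: str) -> Optional[Any]:
    """Return the value of the first tag with the given key, if any."""
    for tag in span.get("tags", []):
        if tag.get("key") == key:
            return tag.get("value")
    return None


def slow_spans(spans: List[Span], min_duration_us: int = 1_000_000) -> List[Span]:
    """Spans that took longer than the threshold (default: 1 second)."""
    return [s for s in spans if s.get("duration", 0) > min_duration_us]


def error_spans(spans: List[Span], min_status: int = 500) -> List[Span]:
    """Spans tagged error=true whose HTTP status is at or above min_status."""
    hits = []
    for span in spans:
        is_error = tag_value(span, "error") in (True, "true")
        status = tag_value(span, "http.status_code")
        if is_error and isinstance(status, int) and status >= min_status:
            hits.append(span)
    return hits


# Hand-written sample data for illustration only.
sample = [
    {"operationName": "get_users", "duration": 1_500_000,
     "tags": [{"key": "http.status_code", "value": 200}]},
    {"operationName": "create_user", "duration": 20_000,
     "tags": [{"key": "error", "value": True},
              {"key": "http.status_code", "value": 503}]},
    {"operationName": "health", "duration": 1_000, "tags": []},
]

print([s["operationName"] for s in slow_spans(sample)])
print([s["operationName"] for s in error_spans(sample)])
```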
Service Dependency Graph
Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies
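The service relationships in such a graph can be derived from parent-child links between spans: whenever a span's parent belongs to a different service, that is one call along an edge. The sketch below illustrates the idea on hand-written data; the `span_id`, `parent_id`, and `service` field names and the `dependency_edges` helper are assumptions for this example, and Jaeger's own computation may differ.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Span = Dict[str, Optional[str]]


def dependency_edges(spans: List[Span]) -> Counter:
    """Count caller -> callee calls between distinct services."""
    service_by_span = {s["span_id"]: s["service"] for s in spans}
    edges: Counter = Counter()
    for span in spans:
        parent_service = service_by_span.get(span.get("parent_id"))
        # Skip root spans and calls that stay inside one service.
        if parent_service is not None and parent_service != span["service"]:
            edges[(parent_service, span["service"])] += 1
    return edges


# The example trace from Trace Structure, flattened into span records.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "api-gateway"},
    {"span_id": "c", "parent_id": "b", "service": "auth-service"},
    {"span_id": "d", "parent_id": "b", "service": "user-service"},
    {"span_id": "e", "parent_id": "d", "service": "database"},
]

for (caller, callee), calls in sorted(dependency_edges(spans).items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```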
Best Practices
- Sample appropriately (1-10% in production)
- Add meaningful tags (user_id, request_id)
- Propagate context across all service boundaries
- Log exceptions in spans
- Use consistent naming for operations
- Monitor tracing overhead (<1% CPU impact)
- Set up alerts for trace errors
- Implement distributed context (baggage)
- Use span events for important milestones
- Document instrumentation standards
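Consistent operation naming usually means keeping span names low-cardinality: identifiers belong in tags, not in the name. The sketch below shows one possible normalization that replaces numeric and UUID path segments with a placeholder. `span_name` is a hypothetical helper and the `{id}` placeholder is an assumed convention; a framework's route template, when available, is generally a better source.

```python
import re

# A path segment that is purely numeric or a UUID is treated as an identifier.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /api/users/{id}'."""
    path_only = path.split("?", 1)[0]  # drop the query string
    segments = [
        "{id}" if _ID_SEGMENT.match(segment) else segment
        for segment in path_only.strip("/").split("/")
    ]
    return f"{method.upper()} /" + "/".join(segments)


print(span_name("get", "/api/users/42?verbose=1"))
print(span_name("GET", "/orders/123e4567-e89b-12d3-a456-426614174000/items"))
```

With names normalized this way, all requests for a given route aggregate under one operation, and the concrete ID can still be attached as a tag such as user_id.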
Integration with Logging
Correlated Logs
python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
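Passing `extra` on every call is easy to forget. One alternative is a `logging.Filter` that stamps the trace ID onto every record. The sketch below uses only the standard library and takes the ID from an injected callable, so it runs without a tracing SDK; `TraceIdFilter` and `get_trace_id` are hypothetical names, and the fixed ID in the demo is illustrative.

```python
import io
import logging
from typing import Callable, Optional


class TraceIdFilter(logging.Filter):
    """Attach a 32-character hex trace_id attribute to every log record."""

    def __init__(self, get_trace_id: Callable[[], Optional[int]]) -> None:
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id = self._get_trace_id()
        # A missing or zero trace ID means there is no active trace.
        record.trace_id = format(trace_id, "032x") if trace_id else "-"
        return True  # never drop the record, only enrich it


# Demo: log into an in-memory stream with a fixed trace ID.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter(lambda: 0x0AF7651916CD43DD8448EB211C80319C))

demo_logger = logging.getLogger("tracing-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

demo_logger.info("Processing request")
print(stream.getvalue(), end="")
```

In an instrumented application, the callable could read the ID the same way the snippet above does, for example `lambda: trace.get_current_span().get_span_context().trace_id`.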
Troubleshooting
No traces appearing:
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs
High latency overhead:
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
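The first two checks for missing traces can be scripted as a plain TCP connection test against the collector. This is a minimal sketch; `can_connect` is a hypothetical helper, and it only covers TCP ports (such as 14268 or 16686 from the Docker Compose example), not the UDP agent ports. The demo uses a temporary local listener so it runs anywhere.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or DNS failure
        return False


# Demo against a throwaway local listener instead of a real collector.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

reachable_while_open = can_connect("127.0.0.1", port)
listener.close()
reachable_after_close = can_connect("127.0.0.1", port)
print(reachable_while_open, reachable_after_close)
```

Run from inside the application's container or pod, a call such as `can_connect("jaeger", 14268)` shows whether the collector endpoint is reachable from where the exporter actually runs.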
Reference Files
- references/jaeger-setup.md - Jaeger installation
- references/instrumentation.md - Instrumentation patterns
- assets/jaeger-config.yaml.template - Jaeger configuration
Related Skills
- prometheus-configuration - For metrics
- grafana-dashboards - For visualization
- slo-implementation - For latency SLOs