# Grafana Tempo - Distributed Tracing Backend
Grafana Tempo is an open-source, high-scale distributed tracing backend. It is:
- Cost-efficient: only requires object storage (S3, GCS, Azure) to operate
- Deeply integrated: with Grafana, Mimir, Prometheus, Loki, and Pyroscope
- Protocol-agnostic: accepts OTLP, Jaeger, Zipkin, OpenCensus, Kafka
## Quick Reference Links
- TraceQL Language Reference - query syntax, operators, examples, metrics functions
- Configuration Reference - all YAML config blocks with defaults
- Architecture and Operations - components, deployment, tuning
- Metrics from Traces - span metrics, service graphs, TraceQL metrics
- API Reference - HTTP endpoints, ingestion, search, metrics queries
## What is Distributed Tracing?
A trace represents the lifecycle of a request as it passes through multiple services. It consists of:
- Spans: Individual units of work with start time, duration, attributes, and status
- Trace ID: Shared identifier across all spans in a request
- Parent-child relationships: Spans form a tree showing causality
Traces enable:
- Root cause analysis for service outages
- Understanding service dependencies
- Identifying latency bottlenecks
- Correlating events across microservices
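The span/trace relationship described above can be sketched in a few lines of Python. The `Span` shape here is purely illustrative, not Tempo's internal schema:

```python
# Minimal sketch of the trace concepts above: spans sharing a trace ID,
# linked parent -> child, each with a start time and duration.
# Field names are illustrative, not Tempo's storage format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ns: int
    duration_ns: int

spans = [
    Span("abc123", "s1", None, "GET /checkout", 0, 500_000_000),
    Span("abc123", "s2", "s1", "auth.verify",   10_000_000, 40_000_000),
    Span("abc123", "s3", "s1", "db.query",      60_000_000, 380_000_000),
]

# All spans in one request share the trace ID.
assert len({s.trace_id for s in spans}) == 1

# The root span has no parent; children point at it, forming a tree.
root = next(s for s in spans if s.parent_id is None)
children = [s for s in spans if s.parent_id == root.span_id]

# A latency bottleneck is simply the child consuming most of the root's time.
bottleneck = max(children, key=lambda s: s.duration_ns)
print(bottleneck.name)  # db.query dominates this trace
```

This is exactly the structure a tracing backend searches over: a shared ID, a causal tree, and per-span timing.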
## Architecture Overview

```
Applications
    |
    | (OTLP 4317/4318, Jaeger 14250/14268, Zipkin 9411)
    v
[Distributor] ---- hashes traceID, routes to N ingesters
    |
    |---> [Ingester] (WAL + Parquet block assembly, flush to object store)
    |
    |---> [Metrics Generator] (optional: derives RED metrics -> Prometheus)

Query path:
Grafana --> [Query Frontend] (shards queries)
                   |
             [Querier pool]
              /          \
      [Ingesters]   [Object Storage]
       (recent)     (historical blocks)
```

## Core Components
| Component | Role | Default Ports |
|---|---|---|
| Distributor | Receives spans, routes by traceID hash | 4317 (gRPC), 4318 (HTTP) |
| Ingester | Buffers in memory, flushes to storage | - |
| Query Frontend | Query orchestrator, shards across queriers | 3200 (HTTP) |
| Querier | Executes search jobs against storage | - |
| Compactor | Merges blocks, enforces retention | - |
| Metrics Generator | Derives RED metrics from spans | - |
## TraceQL - The Query Language
TraceQL queries filter traces by span properties. Structure:

```
{ filters } | pipeline
```

### Attribute Scopes
```traceql
span.http.status_code   # span-level attribute
resource.service.name   # resource attribute (from SDK)
name                    # intrinsic: span operation name
status                  # intrinsic: ok | error | unset
duration                # intrinsic: span duration
kind                    # intrinsic: server | client | producer | consumer | internal
traceDuration           # intrinsic: entire trace duration
rootServiceName         # intrinsic: service of the root span
rootName                # intrinsic: operation name of the root span
```

### Operators
```
=  !=  >  <  >=  <=   # comparison
=~  !~                # regex match (Go RE2)
&&  ||  !             # logical
```

### Essential Examples
All errors:

```traceql
{ status = error }
```

Slow requests from a service:

```traceql
{ resource.service.name = "frontend" && duration > 1s }
```

HTTP 5xx errors:

```traceql
{ span.http.status_code >= 500 }
```

Count errors per trace (more than 2):

```traceql
{ status = error } | count() >= 2
```

Group by service:

```traceql
{ status = error } | by(resource.service.name)
```

Average latency grouped by service:

```traceql
{ kind = server } | avg(duration) by(resource.service.name)
```

Select specific fields:

```traceql
{ status = error } | select(span.http.url, duration, resource.service.name)
```

Structural: server span with a downstream error:

```traceql
{ kind = server } >> { status = error }
```

Both conditions present (any relationship):

```traceql
{ span.db.system = "redis" } && { span.db.system = "postgresql" }
```

Find most recent (deterministic):

```traceql
{ resource.service.name = "api" } with (most_recent=true)
```

### TraceQL Metrics
Error rate per service:

```traceql
{ status = error } | rate() by (resource.service.name)
```

P99 latency:

```traceql
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name)
```

With exemplars:

```traceql
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name) with (exemplars=true)
```

## Deployment
### Quick Start (Docker Compose)
```bash
git clone https://github.com/grafana/tempo.git
cd tempo/example/docker-compose/local
mkdir tempo-data
docker compose up -d
```

Grafana at http://localhost:3000, Tempo API at http://localhost:3200.
### Minimal Single-Node Config
```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  lifecycler:
    ring:
      replication_factor: 1

compactor:
  compaction:
    block_retention: 336h  # 14 days

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

memberlist:
  abort_if_cluster_join_fails: false
  join_members: []
```

### Production (S3 + Microservices)
```yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
      # Use IRSA/IAM roles (preferred over access keys)

compactor:
  compaction:
    block_retention: 336h  # Override per-tenant in the overrides section

memberlist:
  join_members:
    - tempo-1:7946
    - tempo-2:7946
    - tempo-3:7946

ingester:
  lifecycler:
    ring:
      replication_factor: 3
```

### Kubernetes (Helm)
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo-distributed \
  --set storage.trace.backend=s3 \
  --set storage.trace.s3.bucket=my-tempo-bucket \
  --set storage.trace.s3.region=us-east-1
```

## Sending Traces to Tempo
### Via Grafana Alloy (Recommended)
```alloy
// alloy.river
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}
```

### Via OpenTelemetry Collector
```yaml
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
    # For multi-tenancy:
    headers:
      x-scope-orgid: my-tenant

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

### Direct HTTP (OTLP)
```bash
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:4318/v1/traces \
  -d '{"resourceSpans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-service"}}]}, "scopeSpans": [{"spans": [{"traceId": "5B8EFFF798038103D269B633813FC700", "spanId": "EEE19B7EC3C1B100", "name": "my-op", "startTimeUnixNano": 1689969302000000000, "endTimeUnixNano": 1689969302500000000, "kind": 2}]}]}]}'
```
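The same OTLP/JSON payload can be built programmatically instead of inlined in a shell command. A minimal standard-library sketch; the helper name `otlp_trace_payload` is ours, and the endpoint mirrors the curl example:

```python
# Build the OTLP/JSON payload from the curl example, then prepare the POST.
# The helper is illustrative; the JSON shape follows the OTLP/HTTP trace
# format accepted at /v1/traces.
import json
import urllib.request

def otlp_trace_payload(service, trace_id, span_id, name, start_ns, end_ns, kind=2):
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}}
            ]},
            "scopeSpans": [{"spans": [{
                "traceId": trace_id,
                "spanId": span_id,
                "name": name,
                "startTimeUnixNano": start_ns,
                "endTimeUnixNano": end_ns,
                "kind": kind,  # 2 = SPAN_KIND_SERVER
            }]}],
        }]
    }

payload = otlp_trace_payload(
    "my-service", "5B8EFFF798038103D269B633813FC700", "EEE19B7EC3C1B100",
    "my-op", 1689969302000000000, 1689969302500000000,
)

req = urllib.request.Request(
    "http://localhost:4318/v1/traces",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment when a Tempo OTLP receiver is listening
```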
## Metrics from Traces
### Enable Metrics Generator
```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]
```

### Processor Types
Service Graphs: visualizes service topology and latency
- Output: `traces_service_graph_request_total`, `traces_service_graph_request_failed_total`, duration histograms

Span Metrics: RED metrics per span
- Output: `traces_spanmetrics_calls_total`, `traces_spanmetrics_duration_seconds_*`
- Labels: `service`, `span_name`, `span_kind`, `status_code`, plus custom dimensions

Local Blocks: enables TraceQL metrics queries on recent data
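To make span metrics concrete, here is a minimal sketch of the aggregation the span-metrics processor performs conceptually: per `(service, span_name)` call counts, error counts, and summed duration, the raw inputs for RED metrics. The code and field names are illustrative, not the generator's implementation:

```python
# Conceptual sketch of span-metrics derivation: fold a stream of spans
# into RED counters keyed like the labels listed above.
from collections import defaultdict

spans = [
    {"service": "api", "name": "GET /users", "duration_s": 0.120, "error": False},
    {"service": "api", "name": "GET /users", "duration_s": 0.450, "error": True},
    {"service": "db",  "name": "SELECT",     "duration_s": 0.015, "error": False},
]

calls = defaultdict(int)       # ~ traces_spanmetrics_calls_total
errors = defaultdict(int)      # ~ calls_total with status_code=error
duration = defaultdict(float)  # ~ the _sum of a duration histogram

for s in spans:
    key = (s["service"], s["name"])
    calls[key] += 1
    errors[key] += s["error"]
    duration[key] += s["duration_s"]

key = ("api", "GET /users")
print(calls[key], errors[key], round(duration[key], 3))  # 2 1 0.57
```

Rate is then `calls` per unit time, error ratio is `errors / calls`, and latency quantiles come from the full duration histogram rather than the sum shown here.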
## Multi-Tenancy
Enable in the Tempo config:

```yaml
multitenancy_enabled: true
```

All requests then require the `X-Scope-OrgID` header.

### OpenTelemetry Collector
```yaml
exporters:
  otlp:
    headers:
      x-scope-orgid: tenant-id
```
### Grafana datasource
```yaml
jsonData:
  httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
  httpHeaderValue1: "tenant-id"
```
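Any client querying a multi-tenant Tempo must attach the same header on every API call. A minimal standard-library sketch; the base URL and tenant name are placeholders:

```python
# Attach the X-Scope-OrgID header that a multi-tenant Tempo requires.
# Base URL and tenant are placeholders for illustration.
import urllib.request

def tempo_request(path, tenant, base="http://tempo:3200"):
    return urllib.request.Request(
        base + path,
        headers={"X-Scope-OrgID": tenant},
    )

req = tempo_request("/api/search/tags", "tenant-id")
# urllib normalizes stored header keys with str.capitalize()
assert req.get_header("X-scope-orgid") == "tenant-id"
# urllib.request.urlopen(req)  # uncomment against a live Tempo
```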
## Grafana Integration
### Data Source Configuration
```yaml
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      # Link traces to logs
      tracesToLogsV2:
        datasourceUid: loki-uid
        filterByTraceID: true
        tags: [{key: "service.name", value: "app"}]
      # Link traces to metrics
      tracesToMetrics:
        datasourceUid: prometheus-uid
        tags: [{key: "service.name", value: "service"}]
        queries:
          - name: Error Rate
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags, status_code="STATUS_CODE_ERROR"}[5m]))'
      # Link traces to profiles (Pyroscope)
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{key: "service.name", value: "service_name"}]
      # Service map from span metrics
      serviceMap:
        datasourceUid: prometheus-uid
```

### Key Grafana Features
- Explore > Tempo: search by TraceQL, trace ID, or tag filters
- Service Graph tab: visual service topology with RED metrics
- Traces Drilldown (`/a/grafana-exploretraces-app`): no TraceQL required
- Exemplars: click a metric spike -> jump directly to the responsible trace
- Derived fields in Loki: click a trace ID in a log line -> jump to the trace in Tempo
## API Quick Reference
Search traces:

```
GET /api/search?q={status=error}&limit=20&start=<unix>&end=<unix>
```

Get trace by ID:

```
GET /api/traces/<traceID>
GET /api/v2/traces/<traceID>
```

List all tag names:

```
GET /api/search/tags
```

Get values for a tag:

```
GET /api/search/tag/service.name/values
```

TraceQL metrics (time series):

```
GET /api/metrics/query_range?q={status=error}|rate()&start=...&end=...&step=60
```

Health check:

```
GET /ready
```
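These endpoints compose naturally into a small client. A sketch that only builds properly encoded URLs; the base address is an assumption:

```python
# Compose the Tempo HTTP API endpoints listed above into URLs.
# Only builds URLs; the base address is a placeholder.
from urllib.parse import urlencode

BASE = "http://tempo:3200"

def search_url(traceql, limit=20, start=None, end=None):
    # TraceQL braces and spaces must be percent-encoded in the query string.
    params = {"q": traceql, "limit": limit}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{BASE}/api/search?{urlencode(params)}"

def trace_url(trace_id, v2=True):
    prefix = "/api/v2/traces/" if v2 else "/api/traces/"
    return BASE + prefix + trace_id

url = search_url("{ status = error }", limit=5)
print(url)  # query is percent-encoded, e.g. q=%7B+status+%3D+error+%7D
```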
## Performance Tuning Summary
| Problem | Solution |
|---|---|
| Slow searches | Scale queriers horizontally; scale compactors to reduce block count |
| High memory on queriers | Reduce query concurrency and shard sizes |
| High memory on ingesters | Reduce max block size/duration so blocks flush sooner |
| Slow attribute queries | Add dedicated Parquet columns for frequent attributes |
| Cache miss rate high | Increase cache size; tune cache TTLs |
| Rate limited (429) | Raise per-tenant ingestion rate and burst limits in overrides |
| Memcached connection errors | Increase the memcached connection limit |
## Best Practices
### Instrumentation
- Follow OpenTelemetry semantic conventions for attribute names
- Use the `span.` prefix for span attributes and `resource.` for process context
- Keep attributes meaningful: avoid stuffing metrics or logs into span attributes
- Limit attributes to roughly 128 per span (the OTel default)
- Use span linking for batch processing (instead of huge fan-out traces)
- Create spans for external calls, significant loops, and operations with variable latency
- Avoid creating a span for every function call
### Deployment
- Use replication factor 3 for production HA
- Object storage is required for distributed deployments (not local disk)
- Enable dedicated attribute columns for your most-queried attributes
- Set appropriate block retention per tenant via overrides
- Monitor `tempo_ingester_live_traces` to detect memory pressure early
### Querying
- Use time bounds (`start`/`end`) to limit search scope
- Use structural operators for root-cause-analysis patterns
- Prefer `attribute != nil` for existence checks
- Use `with (most_recent=true)` when you need deterministic recent results
- Scope tag discovery with a TraceQL query to reduce noise
## Ports Reference
| Port | Protocol | Purpose |
|---|---|---|
| 3200 | HTTP | Tempo API (queries, search, health) |
| 9095 | gRPC | Internal component communication |
| 4317 | gRPC | OTLP trace ingestion |
| 4318 | HTTP | OTLP trace ingestion |
| 14268 | HTTP | Jaeger Thrift HTTP ingestion |
| 14250 | gRPC | Jaeger gRPC ingestion |
| 6831 | UDP | Jaeger Thrift Compact |
| 6832 | UDP | Jaeger Thrift Binary |
| 9411 | HTTP | Zipkin ingestion |
| 7946 | TCP/UDP | Memberlist gossip |