# Grafana Tempo - Distributed Tracing Backend
Grafana Tempo is an open-source, high-scale distributed tracing backend. It is:
- Cost-efficient: only requires object storage (S3, GCS, Azure) to operate
- Deeply integrated: with Grafana, Mimir, Prometheus, Loki, and Pyroscope
- Protocol-agnostic: accepts OTLP, Jaeger, Zipkin, OpenCensus, Kafka
## Quick Reference Links
- TraceQL Language Reference - query syntax, operators, examples, metrics functions
- Configuration Reference - all YAML config blocks with defaults
- Architecture and Operations - components, deployment, tuning
- Metrics from Traces - span metrics, service graphs, TraceQL metrics
- API Reference - HTTP endpoints, ingestion, search, metrics queries
## What is Distributed Tracing?
A trace represents the lifecycle of a request as it passes through multiple services. It consists of:
- Spans: Individual units of work with start time, duration, attributes, and status
- Trace ID: Shared identifier across all spans in a request
- Parent-child relationships: Spans form a tree showing causality
Traces enable:
- Root cause analysis for service outages
- Understanding service dependencies
- Identifying latency bottlenecks
- Correlating events across microservices
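The span/trace relationship described above can be sketched in a few lines of Python. The `Span` shape here is purely illustrative, not Tempo's internal schema:

```python
# Minimal sketch of the trace concepts above: spans sharing a trace ID,
# linked parent -> child, each with a start time and duration.
# Field names are illustrative, not Tempo's storage format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ns: int
    duration_ns: int

spans = [
    Span("abc123", "s1", None, "GET /checkout", 0, 500_000_000),
    Span("abc123", "s2", "s1", "auth.verify",   10_000_000, 40_000_000),
    Span("abc123", "s3", "s1", "db.query",      60_000_000, 380_000_000),
]

# All spans in one request share the trace ID.
assert len({s.trace_id for s in spans}) == 1

# The root span has no parent; children point at it, forming a tree.
root = next(s for s in spans if s.parent_id is None)
children = [s for s in spans if s.parent_id == root.span_id]

# A latency bottleneck is simply the child consuming most of the root's time.
bottleneck = max(children, key=lambda s: s.duration_ns)
print(bottleneck.name)  # db.query dominates this trace
```

This is exactly the structure a tracing backend searches over: a shared ID, a causal tree, and per-span timing.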
## Architecture Overview

```
Applications
    |
    | (OTLP 4317/4318, Jaeger 14250/14268, Zipkin 9411)
    v
[Distributor] ---- hashes traceID, routes to N ingesters
    |
    |---> [Ingester] (WAL + Parquet block assembly, flush to object store)
    |
    |---> [Metrics Generator] (optional: derives RED metrics -> Prometheus)

Query path:
Grafana --> [Query Frontend] (shards queries)
                   |
             [Querier pool]
              /          \
      [Ingesters]   [Object Storage]
       (recent)     (historical blocks)
```

## Core Components
| Component | Role | Default Ports |
|---|---|---|
| Distributor | Receives spans, routes by traceID hash | 4317 (gRPC), 4318 (HTTP) |
| Ingester | Buffers in memory, flushes to storage | - |
| Query Frontend | Query orchestrator, shards across queriers | 3200 (HTTP) |
| Querier | Executes search jobs against storage | - |
| Compactor | Merges blocks, enforces retention | - |
| Metrics Generator | Derives RED metrics from spans | - |
## TraceQL - The Query Language
TraceQL queries filter traces by span properties. Structure:

```
{ filters } | pipeline
```

### Attribute Scopes
```traceql
span.http.status_code   # span-level attribute
resource.service.name   # resource attribute (from SDK)
name                    # intrinsic: span operation name
status                  # intrinsic: ok | error | unset
duration                # intrinsic: span duration
kind                    # intrinsic: server | client | producer | consumer | internal
traceDuration           # intrinsic: entire trace duration
rootServiceName         # intrinsic: service of the root span
rootName                # intrinsic: operation name of the root span
```

### Operators
```
=  !=  >  <  >=  <=   # comparison
=~  !~                # regex match (Go RE2)
&&  ||  !             # logical
```

### Essential Examples
All errors:

```traceql
{ status = error }
```

Slow requests from a service:

```traceql
{ resource.service.name = "frontend" && duration > 1s }
```

HTTP 5xx errors:

```traceql
{ span.http.status_code >= 500 }
```

Count errors per trace (more than 2):

```traceql
{ status = error } | count() >= 2
```

Group by service:

```traceql
{ status = error } | by(resource.service.name)
```

Average latency grouped by service:

```traceql
{ kind = server } | avg(duration) by(resource.service.name)
```

Select specific fields:

```traceql
{ status = error } | select(span.http.url, duration, resource.service.name)
```

Structural: server span with a downstream error:

```traceql
{ kind = server } >> { status = error }
```

Both conditions present (any relationship):

```traceql
{ span.db.system = "redis" } && { span.db.system = "postgresql" }
```

Find most recent (deterministic):

```traceql
{ resource.service.name = "api" } with (most_recent=true)
```

### TraceQL Metrics
Error rate per service:

```traceql
{ status = error } | rate() by (resource.service.name)
```

P99 latency:

```traceql
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name)
```

With exemplars:

```traceql
{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name) with (exemplars=true)
```

## Deployment
### Quick Start (Docker Compose)
```bash
git clone https://github.com/grafana/tempo.git
cd tempo/example/docker-compose/local
mkdir tempo-data
docker compose up -d
```

Grafana at http://localhost:3000, Tempo API at http://localhost:3200.
### Minimal Single-Node Config
```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  lifecycler:
    ring:
      replication_factor: 1

compactor:
  compaction:
    block_retention: 336h  # 14 days

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

memberlist:
  abort_if_cluster_join_fails: false
  join_members: []
```

### Production (S3 + Microservices)
```yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
      # Use IRSA/IAM roles (preferred over access keys)

compactor:
  compaction:
    block_retention: 336h  # Override per-tenant in the overrides section

memberlist:
  join_members:
    - tempo-1:7946
    - tempo-2:7946
    - tempo-3:7946

ingester:
  lifecycler:
    ring:
      replication_factor: 3
```

### Kubernetes (Helm)
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo-distributed \
  --set storage.trace.backend=s3 \
  --set storage.trace.s3.bucket=my-tempo-bucket \
  --set storage.trace.s3.region=us-east-1
```

## Sending Traces to Tempo
### Via Grafana Alloy (Recommended)
```alloy
// alloy.river
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}
```

### Via OpenTelemetry Collector
```yaml
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
    # For multi-tenancy:
    headers:
      x-scope-orgid: my-tenant

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

### Direct HTTP (OTLP)
```bash
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:4318/v1/traces \
  -d '{"resourceSpans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-service"}}]}, "scopeSpans": [{"spans": [{"traceId": "5B8EFFF798038103D269B633813FC700", "spanId": "EEE19B7EC3C1B100", "name": "my-op", "startTimeUnixNano": 1689969302000000000, "endTimeUnixNano": 1689969302500000000, "kind": 2}]}]}]}'
```
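The same OTLP/JSON payload can be built programmatically instead of inlined in a shell command. A minimal standard-library sketch; the helper name `otlp_trace_payload` is ours, and the endpoint mirrors the curl example:

```python
# Build the OTLP/JSON payload from the curl example, then prepare the POST.
# The helper is illustrative; the JSON shape follows the OTLP/HTTP trace
# format accepted at /v1/traces.
import json
import urllib.request

def otlp_trace_payload(service, trace_id, span_id, name, start_ns, end_ns, kind=2):
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}}
            ]},
            "scopeSpans": [{"spans": [{
                "traceId": trace_id,
                "spanId": span_id,
                "name": name,
                "startTimeUnixNano": start_ns,
                "endTimeUnixNano": end_ns,
                "kind": kind,  # 2 = SPAN_KIND_SERVER
            }]}],
        }]
    }

payload = otlp_trace_payload(
    "my-service", "5B8EFFF798038103D269B633813FC700", "EEE19B7EC3C1B100",
    "my-op", 1689969302000000000, 1689969302500000000,
)

req = urllib.request.Request(
    "http://localhost:4318/v1/traces",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment when a Tempo OTLP receiver is listening
```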
## Metrics from Traces
### Enable Metrics Generator
```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]
```

### Processor Types
Service Graphs: visualizes service topology and latency
- Output: `traces_service_graph_request_total`, `traces_service_graph_request_failed_total`, duration histograms

Span Metrics: RED metrics per span
- Output: `traces_spanmetrics_calls_total`, `traces_spanmetrics_duration_seconds_*`
- Labels: `service`, `span_name`, `span_kind`, `status_code`, plus custom dimensions

Local Blocks: enables TraceQL metrics queries on recent data
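To make span metrics concrete, here is a minimal sketch of the aggregation the span-metrics processor performs conceptually: per `(service, span_name)` call counts, error counts, and summed duration, the raw inputs for RED metrics. The code and field names are illustrative, not the generator's implementation:

```python
# Conceptual sketch of span-metrics derivation: fold a stream of spans
# into RED counters keyed like the labels listed above.
from collections import defaultdict

spans = [
    {"service": "api", "name": "GET /users", "duration_s": 0.120, "error": False},
    {"service": "api", "name": "GET /users", "duration_s": 0.450, "error": True},
    {"service": "db",  "name": "SELECT",     "duration_s": 0.015, "error": False},
]

calls = defaultdict(int)       # ~ traces_spanmetrics_calls_total
errors = defaultdict(int)      # ~ calls_total with status_code=error
duration = defaultdict(float)  # ~ the _sum of a duration histogram

for s in spans:
    key = (s["service"], s["name"])
    calls[key] += 1
    errors[key] += s["error"]
    duration[key] += s["duration_s"]

key = ("api", "GET /users")
print(calls[key], errors[key], round(duration[key], 3))  # 2 1 0.57
```

Rate is then `calls` per unit time, error ratio is `errors / calls`, and latency quantiles come from the full duration histogram rather than the sum shown here.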
## Multi-Tenancy
Enable in the Tempo config:

```yaml
multitenancy_enabled: true
```

All requests then require the `X-Scope-OrgID` header.

### OpenTelemetry Collector
```yaml
exporters:
  otlp:
    headers:
      x-scope-orgid: tenant-id
```
### Grafana datasource
```yaml
jsonData:
  httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
  httpHeaderValue1: "tenant-id"
```
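Any client querying a multi-tenant Tempo must attach the same header on every API call. A minimal standard-library sketch; the base URL and tenant name are placeholders:

```python
# Attach the X-Scope-OrgID header that a multi-tenant Tempo requires.
# Base URL and tenant are placeholders for illustration.
import urllib.request

def tempo_request(path, tenant, base="http://tempo:3200"):
    return urllib.request.Request(
        base + path,
        headers={"X-Scope-OrgID": tenant},
    )

req = tempo_request("/api/search/tags", "tenant-id")
# urllib normalizes stored header keys with str.capitalize()
assert req.get_header("X-scope-orgid") == "tenant-id"
# urllib.request.urlopen(req)  # uncomment against a live Tempo
```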
## Grafana Integration
### Data Source Configuration
```yaml
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      # Link traces to logs
      tracesToLogsV2:
        datasourceUid: loki-uid
        filterByTraceID: true
        tags: [{key: "service.name", value: "app"}]
      # Link traces to metrics
      tracesToMetrics:
        datasourceUid: prometheus-uid
        tags: [{key: "service.name", value: "service"}]
        queries:
          - name: Error Rate
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags, status_code="STATUS_CODE_ERROR"}[5m]))'
      # Link traces to profiles (Pyroscope)
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{key: "service.name", value: "service_name"}]
      # Service map from span metrics
      serviceMap:
        datasourceUid: prometheus-uid
```

### Key Grafana Features
- Explore > Tempo: search by TraceQL, trace ID, or tag filters
- Service Graph tab: visual service topology with RED metrics
- Traces Drilldown (`/a/grafana-exploretraces-app`): no TraceQL required
- Exemplars: click a metric spike -> jump directly to the responsible trace
- Derived fields in Loki: click a trace ID in a log line -> jump to the trace in Tempo
## API Quick Reference
Search traces:

```
GET /api/search?q={status=error}&limit=20&start=<unix>&end=<unix>
```

Get trace by ID:

```
GET /api/traces/<traceID>
GET /api/v2/traces/<traceID>
```

List all tag names:

```
GET /api/search/tags
```

Get values for a tag:

```
GET /api/search/tag/service.name/values
```

TraceQL metrics (time series):

```
GET /api/metrics/query_range?q={status=error}|rate()&start=...&end=...&step=60
```

Health check:

```
GET /ready
```
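These endpoints compose naturally into a small client. A sketch that only builds properly encoded URLs; the base address is an assumption:

```python
# Compose the Tempo HTTP API endpoints listed above into URLs.
# Only builds URLs; the base address is a placeholder.
from urllib.parse import urlencode

BASE = "http://tempo:3200"

def search_url(traceql, limit=20, start=None, end=None):
    # TraceQL braces and spaces must be percent-encoded in the query string.
    params = {"q": traceql, "limit": limit}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return f"{BASE}/api/search?{urlencode(params)}"

def trace_url(trace_id, v2=True):
    prefix = "/api/v2/traces/" if v2 else "/api/traces/"
    return BASE + prefix + trace_id

url = search_url("{ status = error }", limit=5)
print(url)  # query is percent-encoded, e.g. q=%7B+status+%3D+error+%7D
```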
## Performance Tuning Summary
| Problem | Solution |
|---|---|
| Slow searches | Scale queriers horizontally; scale compactors to reduce block count |
| High memory on queriers | Reduce query concurrency and shard sizes |
| High memory on ingesters | Reduce max block size/duration so blocks flush sooner |
| Slow attribute queries | Add dedicated Parquet columns for frequent attributes |
| Cache miss rate high | Increase cache size; tune cache TTLs |
| Rate limited (429) | Raise per-tenant ingestion rate and burst limits in overrides |
| Memcached connection errors | Increase the memcached connection limit |
## Best Practices
### Instrumentation
- Follow OpenTelemetry semantic conventions for attribute names
- Use the `span.` prefix for span attributes and `resource.` for process context
- Keep attributes meaningful: avoid stuffing metrics or logs into span attributes
- Limit attributes to roughly 128 per span (the OTel default)
- Use span linking for batch processing (instead of huge fan-out traces)
- Create spans for external calls, significant loops, and operations with variable latency
- Avoid creating a span for every function call
### Deployment
- Use replication factor 3 for production HA
- Object storage is required for distributed deployments (not local disk)
- Enable dedicated attribute columns for your most-queried attributes
- Set appropriate block retention per tenant via overrides
- Monitor `tempo_ingester_live_traces` to detect memory pressure early
### Querying
- Use time bounds (`start`/`end`) to limit search scope
- Use structural operators for root-cause-analysis patterns
- Prefer `attribute != nil` for existence checks
- Use `with (most_recent=true)` when you need deterministic recent results
- Scope tag discovery with a TraceQL query to reduce noise
## Ports Reference
| Port | Protocol | Purpose |
|---|---|---|
| 3200 | HTTP | Tempo API (queries, search, health) |
| 9095 | gRPC | Internal component communication |
| 4317 | gRPC | OTLP trace ingestion |
| 4318 | HTTP | OTLP trace ingestion |
| 14268 | HTTP | Jaeger Thrift HTTP ingestion |
| 14250 | gRPC | Jaeger gRPC ingestion |
| 6831 | UDP | Jaeger Thrift Compact |
| 6832 | UDP | Jaeger Thrift Binary |
| 9411 | HTTP | Zipkin ingestion |
| 7946 | TCP/UDP | Memberlist gossip |