Grafana Tempo - Distributed Tracing Backend

Grafana Tempo is an open-source, high-scale distributed tracing backend. It is:
  • Cost-efficient: only requires object storage (S3, GCS, Azure) to operate
  • Deeply integrated: with Grafana, Mimir, Prometheus, Loki, and Pyroscope
  • Protocol-agnostic: accepts OTLP, Jaeger, Zipkin, OpenCensus, Kafka

Quick Reference Links

  • TraceQL Language Reference - query syntax, operators, examples, metrics functions
  • Configuration Reference - all YAML config blocks with defaults
  • Architecture and Operations - components, deployment, tuning
  • Metrics from Traces - span metrics, service graphs, TraceQL metrics
  • API Reference - HTTP endpoints, ingestion, search, metrics queries

What is Distributed Tracing?

A trace represents the lifecycle of a request as it passes through multiple services. It consists of:
  • Spans: Individual units of work with start time, duration, attributes, and status
  • Trace ID: Shared identifier across all spans in a request
  • Parent-child relationships: Spans form a tree showing causality
Traces enable:
  • Root cause analysis for service outages
  • Understanding service dependencies
  • Identifying latency bottlenecks
  • Correlating events across microservices
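
The span/trace relationship above can be sketched with a toy data model. This is purely illustrative (the names and numbers are invented), not Tempo's internal representation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by every span in the request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int

# One request crossing three services: every span carries the same
# trace ID, and parent_id links form the causality tree.
spans = [
    Span("abc123", "s1", None, "GET /checkout", 0, 120),
    Span("abc123", "s2", "s1", "auth.verify", 5, 20),
    Span("abc123", "s3", "s1", "db.query", 30, 80),
]

root = next(s for s in spans if s.parent_id is None)
bottleneck = max((s for s in spans if s.parent_id), key=lambda s: s.duration_ms)
print(root.name, bottleneck.name)  # entry-point operation and slowest child
```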

Architecture Overview

Applications
    |
    | (OTLP 4317/4318, Jaeger 14250/14268, Zipkin 9411)
    v
[Distributor]  ----  hashes traceID, routes to N ingesters
    |
    |---> [Ingester]  (WAL + Parquet block assembly, flush to object store)
    |
    |---> [Metrics Generator]  (optional: derives RED metrics -> Prometheus)
    
Query path:
Grafana  -->  [Query Frontend]  (shards queries)
                    |
              [Querier pool]
              /           \
    [Ingesters]     [Object Storage]
    (recent)        (historical blocks)
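
The distributor's hash-and-route step can be sketched as below. This is a deliberately simplified stand-in for Tempo's consistent-hash ring (ingester names and the replication factor here are illustrative):

```python
import hashlib

INGESTERS = ["ingester-0", "ingester-1", "ingester-2"]

def route(trace_id: str, ingesters=INGESTERS, replication_factor=2):
    """Deterministically map a trace ID to replication_factor ingesters."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    start = h % len(ingesters)
    return [ingesters[(start + i) % len(ingesters)] for i in range(replication_factor)]

# Every span of one trace hashes to the same ingesters, so the whole
# trace is assembled in one place before being flushed to storage.
targets = route("5B8EFFF798038103D269B633813FC700")
print(targets)
```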

Core Components

核心组件

| Component | Role | Default Ports |
|---|---|---|
| Distributor | Receives spans, routes by traceID hash | 4317 (gRPC), 4318 (HTTP) |
| Ingester | Buffers in memory, flushes to storage | - |
| Query Frontend | Query orchestrator, shards across queriers | 3200 (HTTP) |
| Querier | Executes search jobs against storage | - |
| Compactor | Merges blocks, enforces retention | - |
| Metrics Generator | Derives RED metrics from spans | - |

TraceQL - The Query Language

TraceQL queries filter traces by span properties. Structure:
{ filters } | pipeline

Attribute Scopes

traceql
span.http.status_code        # span-level attribute
resource.service.name        # resource attribute (from SDK)
name                         # intrinsic: span operation name
status                       # intrinsic: ok | error | unset
duration                     # intrinsic: span duration
kind                         # intrinsic: server | client | producer | consumer | internal
traceDuration                # intrinsic: entire trace duration
rootServiceName              # intrinsic: service of the root span
rootName                     # intrinsic: operation name of the root span

Operators

=   !=   >   <   >=   <=      # comparison
=~  !~                         # regex match (Go RE2)
&&  ||  !                      # logical

Essential Examples

All errors

{ status = error }

Slow requests from a service

{ resource.service.name = "frontend" && duration > 1s }

HTTP 5xx errors

{ span.http.status_code >= 500 }

Count errors per trace (two or more)

{ status = error } | count() >= 2

Group by service

{ status = error } | by(resource.service.name)

Average duration by service

{ kind = server } | avg(duration) by(resource.service.name)

Select specific fields

{ status = error } | select(span.http.url, duration, resource.service.name)

Structural: server span with downstream error

{ kind = server } >> { status = error }

Both conditions present (any relationship)

{ span.db.system = "redis" } && { span.db.system = "postgresql" }

Find most recent (deterministic)

{ resource.service.name = "api" } with (most_recent=true)

TraceQL Metrics

Error rate per service

{ status = error } | rate() by (resource.service.name)

P99 latency

{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name)

With exemplars

{ kind = server } | quantile_over_time(duration, .99) by (resource.service.name) with (exemplars=true)

---

Deployment

Quick Start (Docker Compose)

bash
git clone https://github.com/grafana/tempo.git
cd tempo/example/docker-compose/local
mkdir tempo-data
docker compose up -d

Minimal Single-Node Config

yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  lifecycler:
    ring:
      replication_factor: 1

compactor:
  compaction:
    block_retention: 336h    # 14 days

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

memberlist:
  abort_if_cluster_join_fails: false
  join_members: []

Production (S3 + Microservices)

yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
      # Use IRSA/IAM roles (preferred over access keys)

compactor:
  compaction:
    block_retention: 336h    # Override per-tenant in overrides section

memberlist:
  join_members:
    - tempo-1:7946
    - tempo-2:7946
    - tempo-3:7946

ingester:
  lifecycler:
    ring:
      replication_factor: 3

Kubernetes (Helm)

bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo-distributed \
  --set storage.trace.backend=s3 \
  --set storage.trace.s3.bucket=my-tempo-bucket \
  --set storage.trace.s3.region=us-east-1

Sending Traces to Tempo

Via Grafana Alloy (Recommended)

alloy
// alloy.river
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}

Via OpenTelemetry Collector

yaml
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
    # For multi-tenancy:
    headers:
      x-scope-orgid: my-tenant

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

Direct HTTP (OTLP)

bash
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:4318/v1/traces \
  -d '{"resourceSpans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-service"}}]}, "scopeSpans": [{"spans": [{"traceId": "5B8EFFF798038103D269B633813FC700", "spanId": "EEE19B7EC3C1B100", "name": "my-op", "startTimeUnixNano": 1689969302000000000, "endTimeUnixNano": 1689969302500000000, "kind": 2}]}]}]}'
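
The same OTLP/HTTP call can be issued from code. A standard-library Python sketch that builds the JSON payload with current timestamps; the trace/span IDs, service name, and endpoint reuse the placeholders from the curl example:

```python
import json
import time
import urllib.request

now_ns = time.time_ns()
payload = {
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "my-service"}}
        ]},
        "scopeSpans": [{"spans": [{
            "traceId": "5B8EFFF798038103D269B633813FC700",
            "spanId": "EEE19B7EC3C1B100",
            "name": "my-op",
            "startTimeUnixNano": now_ns - 500_000_000,  # started 500 ms ago
            "endTimeUnixNano": now_ns,
            "kind": 2,  # SPAN_KIND_SERVER
        }]}],
    }]
}

req = urllib.request.Request(
    "http://localhost:4318/v1/traces",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with a reachable OTLP endpoint
```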

Metrics from Traces

Enable Metrics Generator

yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]

Processor Types

Service Graphs: Visualizes service topology and latency
  • Output: traces_service_graph_request_total, traces_service_graph_request_failed_total, duration histograms
Span Metrics: RED metrics per span
  • Output: traces_spanmetrics_calls_total, traces_spanmetrics_duration_seconds_*
  • Labels: service, span_name, span_kind, status_code + custom dimensions
Local Blocks: Enables TraceQL metrics queries on recent data
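
Conceptually, the span-metrics processor aggregates finished spans into RED counters keyed by service and span name. A simplified offline sketch with invented sample data, not the generator's actual implementation:

```python
from collections import defaultdict

# (service, span_name, status, duration_seconds) of finished spans
finished = [
    ("frontend", "GET /checkout", "ok", 0.120),
    ("frontend", "GET /checkout", "error", 0.450),
    ("payments", "charge", "ok", 0.080),
]

calls = defaultdict(int)           # -> traces_spanmetrics_calls_total
errors = defaultdict(int)          # -> calls labeled with error status
duration_sum = defaultdict(float)  # -> traces_spanmetrics_duration_seconds_sum

for service, span_name, status, seconds in finished:
    key = (service, span_name)
    calls[key] += 1
    duration_sum[key] += seconds
    if status == "error":
        errors[key] += 1

checkout = ("frontend", "GET /checkout")
print(calls[checkout], errors[checkout], round(duration_sum[checkout], 3))
```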

Multi-Tenancy

Enable in Tempo config

yaml
multitenancy_enabled: true

All requests require `X-Scope-OrgID` header.

OpenTelemetry Collector

yaml
exporters:
  otlp:
    headers:
      x-scope-orgid: tenant-id

Grafana datasource

yaml
jsonData:
  httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
  httpHeaderValue1: "tenant-id"
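
Any other API client must send the tenant header too; a minimal standard-library sketch (host and tenant ID are placeholders):

```python
import urllib.request

# Once multitenancy_enabled: true, every request needs X-Scope-OrgID.
req = urllib.request.Request(
    "http://tempo:3200/api/search/tags",
    headers={"X-Scope-OrgID": "my-tenant"},
)
# urllib.request.urlopen(req)  # uncomment against a live multi-tenant Tempo
```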

---

Grafana Integration

Data Source Configuration

yaml
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      # Link traces to logs
      tracesToLogsV2:
        datasourceUid: loki-uid
        filterByTraceID: true
        tags: [{key: "service.name", value: "app"}]

      # Link traces to metrics
      tracesToMetrics:
        datasourceUid: prometheus-uid
        tags: [{key: "service.name", value: "service"}]
        queries:
          - name: Error Rate
            query: 'sum(rate(traces_spanmetrics_calls_total{$$__tags, status_code="STATUS_CODE_ERROR"}[5m]))'

      # Link traces to profiles (Pyroscope)
      tracesToProfiles:
        datasourceUid: pyroscope-uid
        tags: [{key: "service.name", value: "service_name"}]

      # Service map from span metrics
      serviceMap:
        datasourceUid: prometheus-uid

Key Grafana Features

  • Explore > Tempo: Search by TraceQL, trace ID, or tag filters
  • Service Graph tab: Visual service topology with RED metrics
  • Traces Drilldown: /a/grafana-exploretraces-app - no TraceQL required
  • Exemplars: Click metric spike -> jump directly to responsible trace
  • Derived fields in Loki: Click trace ID in log -> jump to trace in Tempo

API Quick Reference

Search traces

GET /api/search?q={status=error}&limit=20&start=<unix>&end=<unix>

Get trace by ID

GET /api/traces/<traceID>
GET /api/v2/traces/<traceID>

List all tag names

GET /api/search/tags

Get values for a tag

GET /api/search/tag/service.name/values

TraceQL metrics (time series)

GET /api/metrics/query_range?q={status=error}|rate()&start=...&end=...&step=60

Health check

GET /ready
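
TraceQL expressions contain characters that must be URL-encoded (`{`, `}`, `|`, spaces). A small Python sketch that builds the search and query_range URLs above with `urllib.parse`; host and time window are illustrative:

```python
import time
import urllib.parse

base = "http://localhost:3200"
end = int(time.time())
start = end - 3600  # query the last hour

# urlencode percent-escapes the braces, spaces, and pipes in TraceQL.
search_url = f"{base}/api/search?" + urllib.parse.urlencode(
    {"q": "{ status = error }", "limit": 20, "start": start, "end": end}
)
metrics_url = f"{base}/api/metrics/query_range?" + urllib.parse.urlencode(
    {"q": "{ status = error } | rate()", "start": start, "end": end, "step": 60}
)
print(search_url)
print(metrics_url)
```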

---

Performance Tuning Summary

| Problem | Solution |
|---|---|
| Slow searches | Scale queriers horizontally; scale compactors to reduce block count |
| High memory on queriers | Reduce `max_concurrent_queries`; lower `target_bytes_per_job` |
| High memory on ingesters | Reduce `max_block_bytes`; lower per-tenant trace limits |
| Slow attribute queries | Add dedicated Parquet columns for frequent attributes |
| High cache miss rate | Increase cache size; tune `cache_min_compaction_level` |
| Rate limited (429) | Raise `max_outstanding_per_tenant` or increase per-tenant ingestion limits |
| Memcached connection errors | Increase memcached connection limit (`-c 4096`) |

Best Practices

Instrumentation

  • Follow OpenTelemetry semantic conventions for attribute names
  • Use `span.` prefix for span attributes, `resource.` for process context
  • Keep attributes meaningful - avoid metrics/logs as span attributes
  • Limit attributes to max ~128 per span (OTel default)
  • Use span linking for batch processing (instead of huge fan-out traces)
  • Create spans for: external calls, significant loops, operations with variable latency
  • Avoid creating spans for every function call

Deployment

  • Use replication factor 3 for production HA
  • Object storage required for distributed deployments (not local)
  • Enable dedicated attribute columns for your most-queried attributes
  • Set appropriate block retention per tenant via overrides
  • Monitor `tempo_ingester_live_traces` to detect memory pressure early

Querying

  • Use time bounds (`start`/`end`) to limit search scope
  • Use structural operators for root cause analysis patterns
  • Prefer `attribute != nil` for existence checks
  • Use `with (most_recent=true)` when you need deterministic recent results
  • Scope tag discovery with a TraceQL query to reduce noise

Ports Reference

| Port | Protocol | Purpose |
|---|---|---|
| 3200 | HTTP | Tempo API (queries, search, health) |
| 9095 | gRPC | Internal component communication |
| 4317 | gRPC | OTLP trace ingestion |
| 4318 | HTTP | OTLP trace ingestion |
| 14268 | HTTP | Jaeger Thrift HTTP ingestion |
| 14250 | gRPC | Jaeger gRPC ingestion |
| 6831 | UDP | Jaeger Thrift Compact |
| 6832 | UDP | Jaeger Thrift Binary |
| 9411 | HTTP | Zipkin ingestion |
| 7946 | TCP/UDP | Memberlist gossip |