otel-instrumentation

OpenTelemetry Instrumentation for Honeycomb

SDK setup, custom spans, attributes, span events, sampling, and layered telemetry. For conceptual foundations (why wide events matter, how attributes connect to investigation), see the observability-fundamentals skill.

OTLP Configuration and SDK Setup

Every OTel SDK needs three environment variables to send data to Honeycomb: `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`, and `OTEL_SERVICE_NAME`.
For the env var values, language-specific dependencies, and setup code (Go, Python, Node.js, Java, Ruby, .NET, Rust), see `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/sdk-setup-by-language.md`.
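As a quick preflight, a service could confirm that all three variables are set before the SDK starts. The sketch below is illustrative only: `missing_otel_env` and `REQUIRED_OTEL_VARS` are hypothetical names, not part of any OTel SDK, and the check says nothing about whether the values themselves are correct.

```python
import os

# The three variables named above.
REQUIRED_OTEL_VARS = (
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_EXPORTER_OTLP_HEADERS",
    "OTEL_SERVICE_NAME",
)


def missing_otel_env(environ=None):
    """Return the required OTel variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_OTEL_VARS if not env.get(name)]
```

Calling `missing_otel_env()` at startup and logging the result is usually enough to catch a deployment that would otherwise silently send nothing.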

Custom Instrumentation

Adding Attributes to Existing Spans (Highest Impact)

Add business context to auto-instrumented spans; no new spans are needed. Get the current span from context and call `SetAttributes` (Go), `set_attribute` (Python), or `setAttribute` (Node.js) with user, tenant, business, and deployment context.
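A minimal Python sketch of this pattern follows. It assumes `span` is whatever `opentelemetry.trace.get_current_span()` returns inside a request handler; the helper name `add_business_context` and the attribute names in the usage comment are illustrative, not taken from the reference files.

```python
def add_business_context(span, context):
    """Copy business context onto an existing span as attributes.

    `span` is any object with a `set_attribute(key, value)` method,
    such as the current OpenTelemetry span.
    """
    for key, value in context.items():
        # Skip missing values instead of recording None on the span.
        if value is not None:
            span.set_attribute(key, value)


# Hypothetical usage inside a handler:
#   from opentelemetry import trace
#   add_business_context(trace.get_current_span(), {
#       "user.id": user.id,
#       "tenant.id": tenant.id,
#       "app.build_id": BUILD_ID,
#   })
```

Passing the span in as an argument keeps the helper easy to unit test with a stand-in object.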

Creating Custom Spans

Wrap important business operations for visibility in the trace waterfall. Use `tracer.Start(ctx, "operation-name")` (Go), `tracer.start_as_current_span("operation-name")` (Python), or `tracer.startActiveSpan("operation-name", callback)` (Node.js).
For full code examples in all languages, consult `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/custom-instrumentation.md`.
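As a rough Python illustration, the sketch below wraps a payment call in a custom span. It assumes `tracer` is an OpenTelemetry tracer (for example from `trace.get_tracer(__name__)`); the function `charge_order`, the `charge` callback, and the attribute names are hypothetical.

```python
def charge_order(tracer, order_id, amount_cents, charge):
    """Run `charge` inside a custom span named after the business operation."""
    with tracer.start_as_current_span("process-payment") as span:
        # Attributes make the span aggregable: group by outcome, filter by order.
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        outcome = charge(order_id, amount_cents)
        span.set_attribute("payment.outcome", outcome)
        return outcome
```

The span name stays static while the variable details go into attributes, which keeps the name usable for grouping.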

When to Create a Span

Not every function needs a span. Two questions determine whether a span is worth creating:
  1. Is it interesting? — Does the work meaningfully impact performance (latency or failures) for the overall request?
  2. Is it aggregable? — If you group this span by name and attributes, will it produce useful trends and comparisons?
| Operation | Interesting? | Aggregable? | Create a Span? |
| --- | --- | --- | --- |
| HTTP request handler | Yes: variable latency, can fail | Yes: group by route, method, status | Yes |
| Database query | Yes: I/O bound, failure-prone | Yes: group by query type, table | Yes |
| External API call | Yes: network latency, dependencies | Yes: group by endpoint, status | Yes |
| Cache lookup | Yes: fast vs slow path | Yes: group by cache name, hit/miss | Yes |
| Message queue pub/consume | Yes: async boundary, delays | Yes: group by queue, message type | Yes |
| Business logic transaction | Yes: meaningful state change | Yes: group by type, outcome | Yes |
| Private helper function | No: trivial CPU, predictable | No: too granular | No |
| Loop iteration | Maybe: if slow | No: unbounded cardinality | No |
| Getter/setter | No: no meaningful duration | No: nothing to group by | No |
| Input validation (pure CPU) | No: fast, predictable | Maybe | No |
| Business logic orchestration | No: just calls instrumented code | No: duration is sum of children | No |
Common mistakes:
  • Too many spans: A trace with millions of 2ms spans is far too detailed and rarely actionable. Roll them up — combine into a single span, or capture the detail as an attribute on the parent span instead.
  • Too few spans: Collapsing hours of work into a single opaque handler leaves you guessing about where time is spent.
When in doubt, prefer attributes on existing spans over creating new child spans.
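One way to roll up a loop, sketched in Python under a few assumptions: `span` is the already-active parent span, `items` is a list, and the `batch.*` attribute names are made up for illustration. Instead of one span per iteration, the loop is summarized on the parent.

```python
import time


def process_batch(span, items, handle):
    """Handle every item and summarize the loop on the parent span."""
    slowest_ms = 0.0
    start = time.monotonic()
    for item in items:
        item_start = time.monotonic()
        handle(item)
        slowest_ms = max(slowest_ms, (time.monotonic() - item_start) * 1000)
    # Three attributes replace what would otherwise be len(items) child spans.
    span.set_attribute("batch.item_count", len(items))
    span.set_attribute("batch.duration_ms", (time.monotonic() - start) * 1000)
    span.set_attribute("batch.slowest_item_ms", slowest_ms)
```

The slowest-item attribute preserves the one detail per-iteration spans are most often used to find.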

Timing Attributes (measure sub-operations without child spans)

Record important sub-operation durations as attributes on the parent span. These are easier to query than child spans and work directly with BubbleUp.

```go
// Go: time auth and record on the existing span
span := trace.SpanFromContext(r.Context())
authStart := time.Now()
user, err := authenticate(r)
span.SetAttributes(attribute.Float64("auth.duration_ms", float64(time.Since(authStart).Milliseconds())))
```

```python
# Python: time auth and record on the existing span
span = trace.get_current_span()
auth_start = time.monotonic()
user = authenticate(request)
span.set_attribute("auth.duration_ms", (time.monotonic() - auth_start) * 1000)
```
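If several sub-operations need timing, the pattern can be wrapped in a small context manager. This is a sketch rather than an established helper: `timed_attribute` is a hypothetical name, and `span` is assumed to be the current OpenTelemetry span or any object with `set_attribute`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_attribute(span, name):
    """Record the wrapped block's duration in milliseconds as a span attribute."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs even when the block raises, so failed attempts are timed too.
        span.set_attribute(name, (time.monotonic() - start) * 1000)


# Hypothetical usage:
#   with timed_attribute(span, "auth.duration_ms"):
#       user = authenticate(request)
```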

Exception Slugs (tag each error site with a static identifier)

Tag each error throw site with a unique static string (`exception.slug`). This creates a low-cardinality, greppable identifier that connects dashboards directly to code.

```go
// Go: static slug, greppable and safe for GROUP BY
span.SetAttributes(
    attribute.String("exception.slug", "err-stripe-charge-failed"),
    attribute.Bool("error", true),
)
span.RecordError(err)
```

```python
# Python: static slug, greppable and safe for GROUP BY
span.set_attribute("exception.slug", "err-stripe-card-error")
span.set_attribute("error", True)
span.record_exception(e)
```

Find unhandled errors (missing slugs): `WHERE error = true AND exception.slug does-not-exist`.

For extended examples in all languages, see `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/custom-instrumentation.md`.
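To keep slugs consistent across a codebase, the three calls can be bundled into one helper. The sketch below is an assumption-laden example: `record_failure` and `SLUG_PATTERN` are hypothetical, and the regular expression only rejects obviously dynamic strings (spaces, uppercase, interpolated messages); it cannot prove a slug is static.

```python
import re

# Hypothetical convention: "err-" followed by lowercase kebab-case words.
SLUG_PATTERN = re.compile(r"err(-[a-z0-9]+)+")


def record_failure(span, slug, exc):
    """Tag the span with a static slug, mark it as an error, and record the exception."""
    if not SLUG_PATTERN.fullmatch(slug):
        raise ValueError(f"exception.slug should be a static kebab-case id, got {slug!r}")
    span.set_attribute("exception.slug", slug)
    span.set_attribute("error", True)
    span.record_exception(exc)
```

Because the slug is a literal at each call site, searching the code for the string shown on a dashboard leads straight to the failing branch.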

What to Instrument

High Value (Instrument First)

  • API entry points (HTTP handlers, gRPC methods)
  • Database queries (auto-instrumented by most SDKs)
  • External HTTP calls (auto-instrumented by most SDKs)
  • Message queue producers/consumers
These are typically auto-instrumented by OTel SDKs and form the skeleton of your traces.

Medium Value (Add Next)

  • Business logic operations (checkout, payment, fulfillment)
  • Cache operations (hits, misses, evictions)
  • Authentication and authorization checks
  • Background job execution
These are your business logic. Without custom spans here, you can see that a request was slow but not why — the trace waterfall has gaps where the important work happens invisibly.

Attributes to Add

Attributes are the dimensions BubbleUp uses during investigations. Every attribute you add is a new axis BubbleUp can diff on to find what's different about outlier requests. For the complete catalog organized by category with rationale and example queries, see `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/wide-event-attributes.md`.
For why attributes matter conceptually, see the observability-fundamentals skill.

Span Events and Span Links

  • Span events: Record point-in-time occurrences within a span (errors, retries, state changes). Use `span.add_event("event_name", {attributes})`.
  • Span links: Connect spans across different trace hierarchies (async processing, fan-out/fan-in, cross-system correlation). Create a `Link` to the related span context.
See `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/custom-instrumentation.md` for full examples of both patterns.
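As one possible use of span events, the Python sketch below records each failed attempt of a retried operation on the enclosing span. It assumes `span` exposes `add_event(name, attributes)` as OpenTelemetry spans do; `call_with_retries`, the event name, and the `retry.*` attribute names are hypothetical.

```python
def call_with_retries(span, operation, max_attempts=3):
    """Call `operation`, adding a span event for every failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # A point-in-time marker inside the span, not a new child span.
            span.add_event(
                "attempt_failed",
                {"retry.attempt": attempt, "retry.error_type": type(exc).__name__},
            )
            if attempt == max_attempts:
                raise
```

The events show when each failure happened within the span's duration, while the span itself remains a single row for aggregation.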

Sampling

Sampling Strategy

Sampling is about tradeoffs — there is no free lunch:
  • Head sampling favors cost over debuggability. You save resources, but a 0.1% error at 1% sampling becomes effectively invisible. Head sampling is oblivious to what happens downstream.
  • Tail sampling favors fidelity over simplicity. You keep interesting traces but need infrastructure (Refinery or Collector) to buffer and evaluate complete traces.
The math matters: if an error occurs 0.1% of the time and you head-sample at 1%, you keep only about 1 in 100 of those errors, so roughly 1 in 100,000 requests ends up as a sampled error trace. At moderate traffic, that error may never appear in your data.
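To make the arithmetic concrete, here is a small Python sketch. It assumes every request fails independently with the same probability and that the head-sampling decision is independent of whether the request fails; both function names are illustrative.

```python
def sampled_error_fraction(error_rate, sample_rate):
    """Fraction of all requests that are both an error and kept by head sampling."""
    return error_rate * sample_rate


def chance_of_seeing_none(requests, error_rate, sample_rate):
    """Probability that no sampled error trace shows up across `requests` requests."""
    return (1 - sampled_error_fraction(error_rate, sample_rate)) ** requests
```

With a 0.1% error rate and 1% sampling, `sampled_error_fraction(0.001, 0.01)` is about 0.00001, or 1 in 100,000 requests. Under the same assumptions, `chance_of_seeing_none(100_000, 0.001, 0.01)` comes out near 0.37, so a service handling 100,000 requests has roughly a one-in-three chance of recording no example of the error at all.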

Head Sampling (SDK-level)

Decides whether to sample a trace at creation time. Simple but can miss interesting traces.
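  • Configure via the `OTEL_TRACES_SAMPLER` env var
  • Common values: `always_on`, `always_off`, and `traceidratio` (e.g., sample 10% by setting `OTEL_TRACES_SAMPLER_ARG=0.1`). The default in the OTel specification is `parentbased_always_on`.
  • `parentbased_traceidratio` respects parent sampling decisions
  • Best for: Very high-throughput services where you can tolerate missing rare events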
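The idea behind ratio-based head sampling can be sketched in a few lines of Python. This is a simplified illustration of the technique, not the code of any particular SDK, and the names `ratio_sampled` and `TRACE_ID_MASK` are made up here: the decision is a pure function of the trace ID, so every service that sees the same trace ID at the same ratio reaches the same answer.

```python
TRACE_ID_MASK = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id, ratio):
    """Keep the trace if its ID falls inside the ratio's share of the ID space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound
```

Because no randomness is involved at decision time, the approach needs no coordination between services, but it also cannot know whether the trace will later turn out to be slow or failed.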

Tail Sampling (Collector/Refinery)

Decides after the trace is complete. Keeps interesting traces (errors, slow requests).
  • Use Honeycomb's Refinery for production tail sampling
  • Or configure the OTel Collector's `tail_sampling` processor
  • Can sample based on: latency, error status, specific attributes, trace duration
  • Best for: Services where debuggability matters — keeps errors and outliers while sampling routine traffic
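A tail-sampling rule set can be pictured as a function over the finished trace. The Python sketch below is a toy model under stated assumptions, not how Refinery or the Collector is implemented: `SpanSummary`, `tail_sample_rate`, `keep_trace`, and the thresholds are all hypothetical, and the modulo check assumes trace IDs are uniformly distributed.

```python
from dataclasses import dataclass


@dataclass
class SpanSummary:
    duration_ms: float
    error: bool = False


def tail_sample_rate(spans, slow_ms=1000.0, routine_rate=100):
    """Return the sample rate for a finished trace: 1 keeps it, N keeps 1 in N."""
    if any(span.error for span in spans):
        return 1  # always keep traces containing an error
    if any(span.duration_ms >= slow_ms for span in spans):
        return 1  # always keep slow traces
    return routine_rate  # routine traffic is thinned out


def keep_trace(trace_id, spans, slow_ms=1000.0, routine_rate=100):
    """Return (keep, sample_rate) using a deterministic choice on the trace ID."""
    rate = tail_sample_rate(spans, slow_ms, routine_rate)
    return trace_id % rate == 0, rate
```

Returning the rate alongside the decision matters: a kept routine trace stands in for `rate` similar traces, and downstream tooling needs that number to weight counts correctly.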

Sampling Impact on Honeycomb

  • Sampling reduces data volume and cost
  • SLOs, BubbleUp, and query results adjust for sampling rate automatically, provided each event carries the sample rate it was kept at
  • Trace completeness may be affected — missing spans if not all services sample consistently
  • Start with no sampling, then add as needed for cost management
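The adjustment for sampling amounts to weighting each kept event by its sample rate. The sketch below shows the general idea only; it is not Honeycomb's implementation, and the `sample_rate` key and the function name are assumptions made for the example.

```python
def estimated_count(events):
    """Estimate the pre-sampling event count from the events that were kept."""
    # An event kept at 1-in-N stands in for N original events;
    # events without a recorded rate are treated as unsampled.
    return sum(event.get("sample_rate", 1) for event in events)
```

For example, one routine event kept at a rate of 100 plus two error events kept at a rate of 1 yields an estimate of 102 original events.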

Layered Telemetry

OpenTelemetry is "trace-first" — context propagation is the glue that correlates all signals. But effective observability layers multiple signal types for different purposes.
A three-question test for choosing the right signal:
  1. What needs causality and full-request context? → Traces (spans)
  2. What needs inexpensive long-term storage and fast alerting? → Metrics
  3. What is rare vs. common, and what are the audit requirements? → Logs / events
The histogram-alongside-spans pattern: For high-throughput HTTP services, emit both a span and a histogram metric for each handled request. This lets you head-sample traces for cost while histograms provide last-ditch alerting — and exemplars link outlier metric points back to specific traces for deeper investigation.
The technique is layering (not duplication) because each signal provides a different view at a different level of detail.
For architectural patterns where layering is essential (streaming, async jobs, ETL), see `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/architectural-patterns.md`.
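The histogram-alongside-spans pattern might look like the following in Python. This is a sketch with assumptions: `tracer` is an OpenTelemetry tracer, `latency_histogram` is a histogram instrument with a `record(amount, attributes)` method (for example one created via a meter's `create_histogram`), and `handle_request` and the attribute names are illustrative.

```python
import time


def handle_request(tracer, latency_histogram, method, route, handler):
    """Serve one request, emitting both a span and a latency histogram point."""
    start = time.monotonic()
    with tracer.start_as_current_span(f"HTTP {method} {route}") as span:
        span.set_attribute("http.method", method)
        span.set_attribute("http.route", route)
        try:
            return handler()
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # Recorded for every request, even when the trace is sampled out.
            latency_histogram.record(
                elapsed_ms, {"http.method": method, "http.route": route}
            )
```

The histogram carries only low-cardinality attributes, so it stays cheap, while the span carries the wide context for the requests that are kept.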

Logs in Honeycomb

OTel can send logs too. If you have existing log infrastructure, the OTel Collector can ingest logs and forward them to Honeycomb as structured events:
  • OTel SDK log bridge: Captures logs from your existing logging library (`slog` in Go, `logging` in Python, `winston`/`pino` in Node.js) and exports them as OTel log records.
  • OTel Collector `filelog` receiver: Reads log files, parses them, exports as OTLP.
Logs sent through OTel arrive in Honeycomb as structured events with the same query capabilities as spans.

Naming Conventions

  • Span names: Describe the operation (`HTTP GET /api/users`, `db.query SELECT`, `process-payment`)
  • Attribute names: Use dot-separated namespaces (`user.id`, `order.total`, `cache.hit`)
  • Follow OTel semantic conventions where applicable (`http.method`, `db.system`, `rpc.service`)
  • Custom attributes: Use your own namespace (`app.`, `checkout.`, `mycompany.`)
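A team could enforce the attribute naming convention with a small check in tests or code review tooling. The pattern below is one possible reading of "dot-separated namespaces" (lowercase segments, at least one dot) and is an assumption of this sketch, as are the names `ATTRIBUTE_NAME` and `is_namespaced`.

```python
import re

# Hypothetical rule: two or more lowercase segments separated by dots.
ATTRIBUTE_NAME = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+")


def is_namespaced(name):
    """Return True if the attribute name uses dot-separated lowercase namespaces."""
    return ATTRIBUTE_NAME.fullmatch(name) is not None
```

Names such as `user.id` and `app.cart.item_count` pass, while `userId` and a bare `duration` do not.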

Additional Resources

Reference Files

  • `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/sdk-setup-by-language.md`: OTLP configuration and SDK setup for Go, Python, Node.js, Java, Ruby, .NET, Rust
  • `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/custom-instrumentation.md`: Custom instrumentation patterns with full code examples (timing attributes, exception slugs, async request summaries)
  • `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/collector-config.md`: OTel Collector configuration for format conversion, processing, and sampling
  • `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/wide-event-attributes.md`: Canonical attribute catalog organized by category with example queries
  • `${CLAUDE_PLUGIN_ROOT}/skills/otel-instrumentation/references/architectural-patterns.md`: Trace design patterns for streaming, async, ETL, and serverless architectures

Cross-References

  • For conceptual foundations of why wide events and attributes matter: observability-fundamentals skill
  • After instrumenting, use the query-patterns skill to verify data is arriving