golang-observability

Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
  • Coding / instrumentation (default): Add observability to new or existing code — declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
  • Review mode — reviewing a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Sequential.
  • Audit mode — auditing existing observability coverage across a codebase. Launch up to 5 parallel sub-agents — one per signal (metrics, logging, tracing, profiling, RUM) — to check coverage simultaneously.
Community default. A company skill that explicitly supersedes `samber/cc-skills-golang@golang-observability` takes precedence.

Go Observability Best Practices


Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.

Best Practices Summary


  1. Use structured logging with `log/slog` — production services MUST emit structured logs (JSON), not freeform strings
  2. Choose the right log level — Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
  3. Log with context — use `slog.InfoContext(ctx, ...)` to correlate logs with traces
  4. Prefer Histogram over Summary for latency metrics — Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
  5. Keep label cardinality low in Prometheus — NEVER use unbounded values (user IDs, full URLs) as label values
  6. Track percentiles (P50, P90, P99, P99.9) using Histograms + `histogram_quantile()` in PromQL
  7. Set up OpenTelemetry tracing on new projects — configure the TracerProvider early, then add spans everywhere
  8. Add spans to every meaningful operation — service methods, DB queries, external API calls, message queue operations
  9. Propagate context everywhere — context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
  10. Enable profiling via environment variables — toggle pprof and continuous profiling on/off without redeploying
  11. Correlate signals — inject trace_id into logs, use exemplars to link metrics to traces
  12. A feature is not done until it is observable — declare metrics, add proper logging, create spans
  13. Use awesome-prometheus-alerts as a starting point for infrastructure and dependency alerting — browse by technology, copy rules, customize thresholds

Cross-References


See the `samber/cc-skills-golang@golang-error-handling` skill for the single handling rule. See the `samber/cc-skills-golang@golang-troubleshooting` skill for using observability signals to diagnose production issues. See the `samber/cc-skills-golang@golang-security` skill for protecting pprof endpoints and avoiding PII in logs. See the `samber/cc-skills-golang@golang-context` skill for propagating trace context across service boundaries. See the `samber/cc-skills@promql-cli` skill for querying and exploring PromQL expressions against Prometheus from the CLI.

The Five Signals


| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | `log/slog` | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |

Detailed Guides


Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
  • Structured Logging — Why structured logging matters for log aggregation at scale. Covers `log/slog` setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with `slog.InfoContext`, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
  • Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, and Summary). Deep dive: why Histograms beat Summaries (server-side aggregation, support for `histogram_quantile` in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
  • Distributed Tracing — When and how to use the OpenTelemetry SDK to trace request flows across services. Covers spans (creation, attributes, status recording), `otelhttp` middleware for HTTP instrumentation, error recording with `span.RecordError()`, trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
  • Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles): how to enable it in production, secure it with auth, and toggle it via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
  • Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and a privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
  • Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using `irate` instead of `rate`, missing `for:` duration to avoid flapping).
  • Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.

Correlating Signals


Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.

Logs + Traces: the `otelslog` bridge


```go
import "go.opentelemetry.io/contrib/bridges/otelslog"

// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
```

Metrics + Traces: Exemplars


```go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace.
// WithLabelValues returns a prometheus.Observer; cast it to ExemplarObserver
// to attach the exemplar alongside the observation.
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
    ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
```

Migrating Legacy Loggers


If the project currently uses `zap`, `logrus`, or `zerolog`, migrate to `log/slog`. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
  1. Add `slog` as the new logger with `slog.SetDefault()`
  2. Use bridge handlers during migration to route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
  3. Gradually replace all `zap.L().Info(...)` / `logrus.Info(...)` / `log.Info().Msg(...)` calls with `slog.Info(...)`
  4. Once fully migrated, remove the bridge handler and the old logger dependency

Definition of Done for Observability


A feature is not production-ready until it is observable. Before marking a feature as done, verify:
  • Metrics declared — counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
  • Logging is proper — structured key-value pairs with `slog`, context variants used (`slog.InfoContext`), no PII in logs, errors MUST be either logged OR returned (NEVER both).
  • Spans created — every service method, DB query, and external API call has a span with relevant attributes, errors recorded with `span.RecordError()`.
  • Dashboards and alerts exist — the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Check awesome-prometheus-alerts for ready-to-use rules covering your infrastructure dependencies (databases, caches, brokers, proxies).
  • RUM events tracked — key business events tracked server-side (PostHog/Segment), identity key is `user_id` (not email), consent checked before tracking.

Common Mistakes


```go
// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
    slog.Error("query failed", "error", err)
    return fmt.Errorf("query: %w", err)
}

// ✓ Good — return with context, log once at the top level
if err != nil {
    return fmt.Errorf("querying users: %w", err)
}
```

```go
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
```

```go
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
```

```go
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: prometheus.DefBuckets,
})
```