golang-observability
Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
- Coding / instrumentation (default): Add observability to new or existing code — declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
- Review mode — reviewing a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Sequential.
- Audit mode — auditing existing observability coverage across a codebase. Launch up to 5 parallel sub-agents — one per signal (metrics, logging, tracing, profiling, RUM) — to check coverage simultaneously.
Community default. A company skill that explicitly supersedes samber/cc-skills-golang@golang-observability takes precedence.
Go Observability Best Practices
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.
Best Practices Summary
- Use structured logging with `log/slog` — production services MUST emit structured logs (JSON), not freeform strings
- Choose the right log level — Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
- Log with context — use `slog.InfoContext(ctx, ...)` to correlate logs with traces
- Prefer Histogram over Summary for latency metrics — Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
- Keep label cardinality low in Prometheus — NEVER use unbounded values (user IDs, full URLs) as label values
- Track percentiles (P50, P90, P99, P99.9) using Histograms + `histogram_quantile()` in PromQL
- Set up OpenTelemetry tracing on new projects — configure the TracerProvider early, then add spans everywhere
- Add spans to every meaningful operation — service methods, DB queries, external API calls, message queue operations
- Propagate context everywhere — context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
- Enable profiling via environment variables — toggle pprof and continuous profiling on/off without redeploying
- Correlate signals — inject trace_id into logs, use exemplars to link metrics to traces
- A feature is not done until it is observable — declare metrics, add proper logging, create spans
- Use awesome-prometheus-alerts as a starting point for infrastructure and dependency alerting — browse by technology, copy rules, customize thresholds
Cross-References
See skill samber/cc-skills-golang@golang-error-handling for the single error-handling rule. See skill samber/cc-skills-golang@golang-troubleshooting for using observability signals to diagnose production issues. See skill samber/cc-skills-golang@golang-security for protecting pprof endpoints and avoiding PII in logs. See skill samber/cc-skills-golang@golang-context for propagating trace context across service boundaries. See skill samber/cc-skills@promql-cli for querying and exploring PromQL expressions against Prometheus from the CLI.
The Five Signals
| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |
Detailed Guides
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
- Structured Logging — Why structured logging matters for log aggregation at scale. Covers `log/slog` setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with `slog.InfoContext`, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
- Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation). Deep dive: why Histograms beat Summaries (server-side aggregation, supports `histogram_quantile` in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
- Distributed Tracing — When and how to use the OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), `otelhttp` middleware for HTTP instrumentation, error recording with `span.RecordError()`, trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
- Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles) — how to enable it in production, secure it with auth, and toggle via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
- Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and a privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
- Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using `irate` instead of `rate`, missing `for:` duration to avoid flapping).
- Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.
Correlating Signals
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
Logs + Traces: otelslog bridge

```go
import "go.opentelemetry.io/contrib/bridges/otelslog"

// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
```

Metrics + Traces: Exemplars
```go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
```

Migrating Legacy Loggers
If the project currently uses zap, logrus, or zerolog, migrate to `log/slog`. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
- Add `slog` as the new logger with `slog.SetDefault()`
- Use bridge handlers during migration to route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
- Gradually replace all `zap.L().Info(...)` / `logrus.Info(...)` / `log.Info().Msg(...)` calls with `slog.Info(...)`
- Once fully migrated, remove the bridge handler and the old logger dependency
Definition of Done for Observability
A feature is not production-ready until it is observable. Before marking a feature as done, verify:
- Metrics declared — counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
- Logging is proper — structured key-value pairs with `slog`, context variants used (`slog.InfoContext`), no PII in logs, errors MUST be either logged OR returned (NEVER both).
- Spans created — every service method, DB query, and external API call has a span with relevant attributes, errors recorded with `span.RecordError()`.
- Dashboards and alerts exist — the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Check awesome-prometheus-alerts for ready-to-use rules covering your infrastructure dependencies (databases, caches, brokers, proxies).
- RUM events tracked — key business events tracked server-side (PostHog/Segment), identity key is `user_id` (not email), consent checked before tracking.
Common Mistakes
```go
// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
	slog.Error("query failed", "error", err)
	return fmt.Errorf("query: %w", err)
}

// ✓ Good — return with context, log once at the top level
if err != nil {
	return fmt.Errorf("querying users: %w", err)
}
```

```go
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
```

```go
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
```

```go
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "http_request_duration_seconds",
	Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Buckets: prometheus.DefBuckets,
})
```