golang-observability

Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
  • Coding / instrumentation (default): Add observability to new or existing code — declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
  • Review mode — reviewing a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Sequential.
  • Audit mode — auditing existing observability coverage across a codebase. Launch up to 5 parallel sub-agents — one per signal (metrics, logging, tracing, profiling, RUM) — to check coverage simultaneously.
Community default. A company skill that explicitly supersedes `samber/cc-skills-golang@golang-observability` takes precedence.

Go Observability Best Practices


Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.

Best Practices Summary


  1. Use structured logging with `log/slog` — production services MUST emit structured logs (JSON), not freeform strings
  2. Choose the right log level — Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
  3. Log with context — use `slog.InfoContext(ctx, ...)` to correlate logs with traces
  4. Prefer Histogram over Summary for latency metrics — Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
  5. Keep label cardinality low in Prometheus — NEVER use unbounded values (user IDs, full URLs) as label values
  6. Track percentiles (P50, P90, P99, P99.9) using Histograms + `histogram_quantile()` in PromQL
  7. Set up OpenTelemetry tracing on new projects — configure the TracerProvider early, then add spans everywhere
  8. Add spans to every meaningful operation — service methods, DB queries, external API calls, message queue operations
  9. Propagate context everywhere — context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
  10. Enable profiling via environment variables — toggle pprof and continuous profiling on/off without redeploying
  11. Correlate signals — inject trace_id into logs, use exemplars to link metrics to traces
  12. A feature is not done until it is observable — declare metrics, add proper logging, create spans
  13. Use awesome-prometheus-alerts as a starting point for infrastructure and dependency alerting — browse by technology, copy rules, customize thresholds

Cross-References


See the `samber/cc-skills-golang@golang-error-handling` skill for the single handling rule. See the `samber/cc-skills-golang@golang-troubleshooting` skill for using observability signals to diagnose production issues. See the `samber/cc-skills-golang@golang-security` skill for protecting pprof endpoints and avoiding PII in logs. See the `samber/cc-skills-golang@golang-context` skill for propagating trace context across service boundaries. See the `samber/cc-skills@promql-cli` skill for querying and exploring PromQL expressions against Prometheus from the CLI.

The Five Signals


| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | `log/slog` | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |

Detailed Guides


Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
  • Structured Logging — Why structured logging matters for log aggregation at scale. Covers `log/slog` setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with `slog.InfoContext`, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
  • Metrics Collection — Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, and Summary). Deep dive: why Histograms beat Summaries (server-side aggregation, support for `histogram_quantile` in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
  • Distributed Tracing — When and how to use the OpenTelemetry SDK to trace request flows across services. Covers spans (creation, attributes, status recording), `otelhttp` middleware for HTTP instrumentation, error recording with `span.RecordError()`, trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
  • Profiling — On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles): how to enable it in production, secure it with auth, and toggle it via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
  • Real User Monitoring — Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and a privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
  • Alerting — Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using `irate` instead of `rate`, missing `for:` duration to avoid flapping).
  • Grafana Dashboards — Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and when each dashboard answers a different operational question.

Correlating Signals


Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.

Logs + Traces: the `otelslog` bridge


```go
import "go.opentelemetry.io/contrib/bridges/otelslog"

// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))

// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
```

Metrics + Traces: Exemplars


```go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace.
// WithLabelValues returns a prometheus.Observer; cast it to ExemplarObserver
// to attach the exemplar alongside the observation.
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
    ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
```

Migrating Legacy Loggers


If the project currently uses `zap`, `logrus`, or `zerolog`, migrate to `log/slog`. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
  1. Add `slog` as the new logger with `slog.SetDefault()`
  2. Use bridge handlers during migration to route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
  3. Gradually replace all `zap.L().Info(...)` / `logrus.Info(...)` / `log.Info().Msg(...)` calls with `slog.Info(...)`
  4. Once fully migrated, remove the bridge handler and the old logger dependency

Definition of Done for Observability


A feature is not production-ready until it is observable. Before marking a feature as done, verify:
  • Metrics declared — counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
  • Logging is proper — structured key-value pairs with `slog`, context variants used (`slog.InfoContext`), no PII in logs, errors MUST be either logged OR returned (NEVER both).
  • Spans created — every service method, DB query, and external API call has a span with relevant attributes, errors recorded with `span.RecordError()`.
  • Dashboards and alerts exist — the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Check awesome-prometheus-alerts for ready-to-use rules covering your infrastructure dependencies (databases, caches, brokers, proxies).
  • RUM events tracked — key business events tracked server-side (PostHog/Segment), identity key is `user_id` (not email), consent checked before tracking.

Common Mistakes


```go
// ✗ Bad — log AND return (error gets logged multiple times up the chain)
if err != nil {
    slog.Error("query failed", "error", err)
    return fmt.Errorf("query: %w", err)
}

// ✓ Good — return with context, log once at the top level
if err != nil {
    return fmt.Errorf("querying users: %w", err)
}
```

```go
// ✗ Bad — high-cardinality label (unbounded user IDs)
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()

// ✓ Good — bounded label values only
httpRequests.WithLabelValues(r.Method, routePattern).Inc()
```

```go
// ✗ Bad — not passing context (breaks trace propagation)
result, err := db.Query("SELECT ...")

// ✓ Good — context flows through, trace continues
result, err := db.QueryContext(ctx, "SELECT ...")
```

```go
// ✗ Bad — using Summary for latency (can't aggregate across instances)
prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Objectives: map[float64]float64{0.99: 0.001},
})

// ✓ Good — use Histogram (aggregatable, supports histogram_quantile)
prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Buckets: prometheus.DefBuckets,
})
```