Go Observability with OpenTelemetry
Overview
Modern Go applications require comprehensive observability through the three pillars: traces, metrics, and logs. OpenTelemetry provides vendor-neutral instrumentation for distributed tracing, Prometheus offers powerful metrics collection, and Go's slog package (1.21+) delivers structured logging with minimal overhead.
Key Features:
- 🔍 OpenTelemetry: Distributed tracing with context propagation
- 📊 Prometheus: Metrics collection with /metrics endpoint
- 📝 Structured Logging: slog with JSON formatting and correlation IDs
- 🎯 Auto-Instrumentation: HTTP/gRPC middleware patterns
- 💚 Health Checks: Kubernetes-ready readiness/liveness probes
- 🔄 Graceful Shutdown: Clean exporter shutdown and signal handling
When to Use This Skill
Activate this skill when:
- Instrumenting microservices for production observability
- Setting up distributed tracing across service boundaries
- Creating operational dashboards with Prometheus/Grafana
- Debugging production performance issues or bottlenecks
- Implementing SLOs and monitoring SLIs
- Adding observability to existing Go applications
- Correlating logs, traces, and metrics for debugging
Core Observability Principles
The Three Pillars
- Traces: Understand request flow across distributed systems
- Metrics: Measure system behavior and performance over time
- Logs: Record discrete events for debugging and audit
Correlation Strategy
All three pillars must share common identifiers:
- Trace ID: Links all operations in a request
- Span ID: Identifies specific operation within trace
- Request ID: Correlates logs with traces and metrics
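The request ID part of this strategy can be sketched with the standard library alone. The example below is a minimal illustration, assuming an `X-Request-ID` header convention; the names `requestIDKey`, `withRequestID`, `requestIDFrom`, and `echoID` are hypothetical helpers, not part of any library. Trace and span IDs would be attached in the same way once a tracing SDK is in place.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requestIDKey is an unexported context key type, so other packages cannot collide with it.
type requestIDKey struct{}

// newRequestID returns 16 random bytes encoded as a 32-character hex string.
func newRequestID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// withRequestID reuses the caller's X-Request-ID header or generates a new ID,
// stores it in the request context, and echoes it on the response.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			id = newRequestID()
		}
		w.Header().Set("X-Request-ID", id)
		ctx := context.WithValue(r.Context(), requestIDKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// requestIDFrom returns the ID stored by withRequestID, or "" if none is present.
func requestIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}

// echoID sends one in-process request through the middleware and
// returns the ID that the wrapped handler observed.
func echoID(incoming string) string {
	var seen string
	h := withRequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = requestIDFrom(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/orders", nil)
	if incoming != "" {
		req.Header.Set("X-Request-ID", incoming)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Println(echoID("req-123"))  // the caller's ID is preserved
	fmt.Println(len(echoID("")))    // a generated ID is 32 hex characters long
}
```

Any handler or helper that receives the request context can then call `requestIDFrom(ctx)` and attach the value to its log lines or span attributes.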
OpenTelemetry Integration
Installation
```bash
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk
go get go.opentelemetry.io/otel/exporters/jaeger
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
```
Basic Setup
```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer builds a tracer provider that exports to a local Jaeger collector
// and registers it as the global provider.
func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
	// Create Jaeger exporter.
	// Note: the Jaeger exporter module is deprecated upstream in favor of OTLP exporters.
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://localhost:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}

	// Create resource with service name.
	// Merge fails if the two resources carry conflicting schema URLs,
	// so keep the semconv version aligned with the SDK version.
	res, err := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName(serviceName),
			semconv.ServiceVersion("1.0.0"),
		),
	)
	if err != nil {
		return nil, err
	}

	// Create tracer provider
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		// Use a probability sampler in production, e.g. sdktrace.TraceIDRatioBased.
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	tp, err := initTracer("order-service")
	if err != nil {
		log.Fatal(err)
	}
	// Shutdown flushes any spans still buffered by the batcher.
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("Error shutting down tracer: %v", err)
		}
	}()

	// Application code...
}
```
Creating Spans
```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// Order and fulfillOrder are assumed to be defined elsewhere in the service.

func ProcessOrder(ctx context.Context, order Order) error {
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "ProcessOrder")
	defer span.End()

	// Add attributes
	span.SetAttributes(
		attribute.String("order.id", order.ID),
		attribute.Int("order.items", len(order.Items)),
		attribute.Float64("order.total", order.Total),
	)

	// Validate order (creates child span)
	if err := validateOrder(ctx, order); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "validation failed")
		return err
	}

	// Fulfill order
	if err := fulfillOrder(ctx, order); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "fulfillment failed")
		return err
	}

	// The description is only recorded for codes.Error, so it is left empty here.
	span.SetStatus(codes.Ok, "")
	return nil
}

func validateOrder(ctx context.Context, order Order) error {
	// Starting from ctx makes this span a child of ProcessOrder.
	_, span := otel.Tracer("order-service").Start(ctx, "validateOrder")
	defer span.End()

	// Validation logic...
	return nil
}
```
HTTP Middleware Instrumentation
```go
import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func main() {
	// Wrap handler with automatic tracing
	handler := http.HandlerFunc(orderHandler)
	wrappedHandler := otelhttp.NewHandler(handler, "order-handler")
	http.Handle("/orders", wrappedHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// Manual instrumentation for more control.
// fetchOrder is assumed to be defined elsewhere in the service.
func orderHandler(w http.ResponseWriter, r *http.Request) {
	// r.Context() already carries the span created by otelhttp,
	// so the span started here becomes its child.
	ctx := r.Context()
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "orderHandler")
	defer span.End()

	// Extract order ID from request
	orderID := r.URL.Query().Get("id")
	span.SetAttributes(attribute.String("order.id", orderID))

	// Process order with propagated context
	order, err := fetchOrder(ctx, orderID)
	if err != nil {
		span.RecordError(err)
		http.Error(w, "Order not found", http.StatusNotFound)
		return
	}

	_ = order // ... handle response
}
```
Prometheus Metrics
Installation
```bash
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
```
Metric Types and Patterns
```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter: Monotonically increasing value
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	// Gauge: Value that can go up or down
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)

	// Histogram: Observations bucketed by value
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets, // [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
		},
		[]string{"method", "path"},
	)

	// Summary: Similar to histogram but calculates quantiles on the client.
	// Objectives maps each quantile to its allowed absolute error.
	dbQueryDuration = promauto.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "db_query_duration_seconds",
			Help:       "Database query duration",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"query_type"},
	)
)
```
Metrics Middleware
```go
import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metrics middleware that instruments all HTTP handlers
func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Track in-flight requests (the gauge is named active_connections)
		activeConnections.Inc()
		defer activeConnections.Dec()

		// Wrap response writer to capture status code
		rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

		// Call next handler
		next.ServeHTTP(rw, r)

		// Record metrics.
		// Caution: r.URL.Path is only a safe label if paths are bounded;
		// paths that embed IDs should be normalized to a route template first.
		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rw.statusCode)).Inc()
	})
}

// responseWriter records the status code written by the wrapped handler.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

// Expose metrics endpoint
func main() {
	http.Handle("/metrics", promhttp.Handler())

	handler := MetricsMiddleware(http.HandlerFunc(orderHandler))
	http.Handle("/orders", handler)

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Custom Metrics Example
```go
// db and Payment are assumed to be defined elsewhere in the service.
func ProcessPayment(ctx context.Context, payment Payment) error {
	// NewTimer observes the elapsed time into the summary when ObserveDuration runs.
	timer := prometheus.NewTimer(dbQueryDuration.WithLabelValues("payment_insert"))
	defer timer.ObserveDuration()

	// Process payment
	if err := db.InsertPayment(payment); err != nil {
		httpRequestsTotal.WithLabelValues("POST", "/payments", "500").Inc()
		return err
	}

	httpRequestsTotal.WithLabelValues("POST", "/payments", "200").Inc()
	return nil
}
```
Structured Logging with slog
Basic Setup (Go 1.21+)
```go
package main

import (
	"log/slog"
	"os"
)

func initLogger() *slog.Logger {
	// JSON logger for production
	handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level:     slog.LevelInfo,
		AddSource: true, // Include file:line information
	})

	logger := slog.New(handler)
	slog.SetDefault(logger) // Set as default logger
	return logger
}

func main() {
	logger := initLogger()

	logger.Info("service starting",
		"service", "order-service",
		"version", "1.0.0",
		"port", 8080,
	)
}
```
Context-Aware Logging
```go
import (
	"context"
	"log/slog"
	"time"

	"go.opentelemetry.io/otel/trace"
)

// Add trace context to logger.
// If ctx carries no span, the IDs are logged as all zeros.
func LoggerWithTrace(ctx context.Context) *slog.Logger {
	span := trace.SpanFromContext(ctx)
	spanCtx := span.SpanContext()

	return slog.With(
		"trace_id", spanCtx.TraceID().String(),
		"span_id", spanCtx.SpanID().String(),
	)
}

// Request and processRequest are assumed to be defined elsewhere in the service.
func HandleRequest(ctx context.Context, req Request) error {
	logger := LoggerWithTrace(ctx)

	logger.Info("processing request",
		"request_id", req.ID,
		"method", req.Method,
		"path", req.Path,
	)

	if err := processRequest(ctx, req); err != nil {
		logger.Error("request failed",
			"error", err,
			"duration_ms", time.Since(req.StartTime).Milliseconds(),
		)
		return err
	}

	logger.Info("request completed successfully",
		"duration_ms", time.Since(req.StartTime).Milliseconds(),
	)
	return nil
}
```
Log Levels and Structured Fields
```go
// ErrEmptyOrder and fulfillOrder are assumed to be defined elsewhere in the service.
func ProcessOrder(ctx context.Context, order Order) error {
	// Fields added with With appear on every line this logger writes.
	logger := LoggerWithTrace(ctx).With(
		"order_id", order.ID,
		"user_id", order.UserID,
	)

	logger.Debug("validating order", "items", len(order.Items))

	if len(order.Items) == 0 {
		logger.Warn("empty order received")
		return ErrEmptyOrder
	}

	logger.Info("order validation passed")

	if err := fulfillOrder(ctx, order); err != nil {
		logger.Error("fulfillment failed",
			"error", err,
			slog.Group("order_details",
				"total", order.Total,
				"items", len(order.Items),
			),
		)
		return err
	}

	logger.Info("order processed successfully",
		"total", order.Total,
	)
	return nil
}
```
Health Checks and Graceful Shutdown
Health Check Endpoints
```go
import (
	"context"
	"database/sql"
	"encoding/json"
	"net/http"
	"time"
)

type HealthChecker struct {
	db *sql.DB
	// Add other dependencies
}

type HealthStatus struct {
	Status    string            `json:"status"`
	Version   string            `json:"version"`
	Checks    map[string]string `json:"checks"`
	Timestamp time.Time         `json:"timestamp"`
}

// Liveness probe - is the app running?
func (hc *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{
		"status": "alive",
	})
}

// Readiness probe - is the app ready to serve traffic?
func (hc *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
	defer cancel()

	status := HealthStatus{
		Status:    "ready",
		Version:   "1.0.0",
		Checks:    make(map[string]string),
		Timestamp: time.Now(),
	}
	code := http.StatusOK

	// Check database
	if err := hc.db.PingContext(ctx); err != nil {
		status.Status = "not_ready"
		status.Checks["database"] = "unhealthy: " + err.Error()
		code = http.StatusServiceUnavailable
	} else {
		status.Checks["database"] = "healthy"
	}

	// Add more dependency checks (Redis, external APIs, etc.)

	// Headers must be set before WriteHeader; anything set afterwards is dropped.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(status)
}
```
Graceful Shutdown
go
import (
"context"
"net/http"
"os"
"os/signal"
"syscall"
"time"
)
func main() {
// Initialize tracer
tp, err := initTracer("order-service")
if err != nil {
log.Fatal(err)
}
// Setup HTTP server
server := &http.Server{
Addr: ":8080",
Handler: setupRoutes(),
}
// Channel for shutdown signals
shutdown := make(chan os.Signal, 1)
signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
// Start server in goroutine
go func() {
slog.Info("server starting", "port", 8080)
if err := server.ListenAndServe(); err != http.ErrServerClosed {
log.Fatal(err)
}
}()
// Wait for shutdown signal
<-shutdown
slog.Info("shutdown signal received")
// Create shutdown context with timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Shutdown HTTP server
slog.Info("shutting down HTTP server")
if err := server.Shutdown(ctx); err != nil {
slog.Error("HTTP server shutdown error", "error", err)
}
// Shutdown tracer provider (flush spans)
slog.Info("shutting down tracer")
if err := tp.Shutdown(ctx); err != nil {
slog.Error("tracer shutdown error", "error", err)
}
slog.Info("shutdown complete")
}go
import (
"context"
"net/http"
"os"
"os/signal"
"syscall"
"time"
)
func main() {
// Initialize tracer
tp, err := initTracer("order-service")
if err != nil {
log.Fatal(err)
}
// Setup HTTP server
server := &http.Server{
Addr: ":8080",
Handler: setupRoutes(),
}
// Channel for shutdown signals
shutdown := make(chan os.Signal, 1)
signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)
// Start server in goroutine
go func() {
slog.Info("server starting", "port", 8080)
if err := server.ListenAndServe(); err != http.ErrServerClosed {
log.Fatal(err)
}
}()
// Wait for shutdown signal
<-shutdown
slog.Info("shutdown signal received")
// Create shutdown context with timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Shutdown HTTP server
slog.Info("shutting down HTTP server")
if err := server.Shutdown(ctx); err != nil {
slog.Error("HTTP server shutdown error", "error", err)
}
// Shutdown tracer provider (flush spans)
slog.Info("shutting down tracer")
if err := tp.Shutdown(ctx); err != nil {
slog.Error("tracer shutdown error", "error", err)
}
slog.Info("shutdown complete")
}Complete Instrumentation Example
完整埋点示例
```go
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"log/slog"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

type Server struct {
	db     *sql.DB
	logger *slog.Logger
}

func (s *Server) orderHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	// Get tracer and create span
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "orderHandler")
	defer span.End()

	// Create context-aware logger with trace ID
	logger := s.logger.With(
		"trace_id", span.SpanContext().TraceID().String(),
		"request_id", r.Header.Get("X-Request-ID"),
	)

	orderID := r.URL.Query().Get("id")
	span.SetAttributes(attribute.String("order.id", orderID))

	logger.Info("fetching order", "order_id", orderID)

	// Fetch order from database
	order, err := s.fetchOrder(ctx, orderID)
	if err != nil {
		span.RecordError(err)
		logger.Error("failed to fetch order", "error", err)
		http.Error(w, "Order not found", http.StatusNotFound)
		return
	}

	logger.Info("order fetched successfully",
		"order_id", orderID,
		"items", len(order.Items),
	)

	// Return order as JSON
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(order)
}

func (s *Server) fetchOrder(ctx context.Context, orderID string) (*Order, error) {
	// Keep the returned ctx so the query runs under the fetchOrder span.
	ctx, span := otel.Tracer("order-service").Start(ctx, "fetchOrder")
	defer span.End()

	// Time database query
	start := time.Now()
	var order Order
	// database/sql cannot scan a row into a struct, so the columns are listed explicitly.
	// The column names are assumptions about the schema; Items would be loaded separately.
	err := s.db.QueryRowContext(ctx,
		"SELECT id, total FROM orders WHERE id = ?", orderID,
	).Scan(&order.ID, &order.Total)
	dbQueryDuration.WithLabelValues("select_order").Observe(time.Since(start).Seconds())
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	return &order, nil
}

func setupRoutes(s *Server, hc *HealthChecker) http.Handler {
	mux := http.NewServeMux()

	// Health endpoints (no tracing needed)
	mux.HandleFunc("/health", hc.LivenessHandler)
	mux.HandleFunc("/ready", hc.ReadinessHandler)
	mux.Handle("/metrics", promhttp.Handler())

	// Business endpoints (with tracing)
	orderHandler := http.HandlerFunc(s.orderHandler)
	mux.Handle("/orders", otelhttp.NewHandler(orderHandler, "orders"))

	// Wrap everything with metrics middleware
	return MetricsMiddleware(mux)
}
```
Decision Trees
When to Use OpenTelemetry
Use OpenTelemetry When:
- Building distributed systems with multiple services
- Need to trace requests across service boundaries
- Debugging performance issues in microservices
- Want vendor-neutral observability (switch backends easily)
- Require correlation between traces, metrics, and logs
Don't Use OpenTelemetry When:
- Building simple monolithic applications
- Performance overhead is critical (consider sampling)
- Team lacks observability infrastructure (Jaeger, Zipkin)
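Tracing across service boundaries, as listed above, depends on each service passing trace context to the next one, commonly in the W3C `traceparent` HTTP header. The sketch below parses a version-00 header by hand purely to show what travels on the wire; `traceParent` and `parseTraceParent` are hypothetical names, and in a real service the OpenTelemetry propagators would be expected to do this work.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields of a version-00 W3C traceparent header.
type traceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters (the caller's span)
	Sampled bool   // lowest bit of the trace-flags byte
}

var errBadTraceParent = errors.New("malformed traceparent header")

// isLowerHex reports whether s has exactly n characters, all in [0-9a-f].
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceParent parses "00-<trace-id>-<parent-id>-<trace-flags>".
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errBadTraceParent
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return traceParent{}, errBadTraceParent
	}
	// All-zero trace and span IDs are defined as invalid.
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return traceParent{}, errBadTraceParent
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return traceParent{}, errBadTraceParent
	}
	return traceParent{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01,
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.SpanID, tp.Sampled, err)
}
```

The receiving service starts its own span with the same trace ID and the incoming span ID as parent, which is what links spans from different processes into one trace.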
When to Use Prometheus
Use Prometheus When:
- Need time-series metrics for monitoring and alerting
- Building operational dashboards (Grafana)
- Measuring SLIs for SLO compliance
- Tracking business metrics (requests/sec, conversion rates)
- Kubernetes/containerized environments
Don't Use Prometheus When:
- Need high-cardinality metrics (Prometheus has limits)
- Require long-term metric storage (use Thanos/Cortex)
- Need push-based metrics (Prometheus is pull-based)
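The high-cardinality limit noted above is often hit by accident when raw URL paths are used as label values. One way to keep the label set bounded is to collapse identifier-like path segments before recording a metric. The sketch below is a simple heuristic with hypothetical names (`isDynamicSegment`, `normalizePath`); where the router exposes the matched route pattern, that pattern is usually the better label.

```go
package main

import (
	"fmt"
	"strings"
)

// isDynamicSegment reports whether a path segment looks like an identifier:
// either all digits, or shaped like a UUID (36 characters with hyphens at
// positions 8, 13, 18 and 23).
func isDynamicSegment(seg string) bool {
	if seg == "" {
		return false
	}
	allDigits := true
	for _, r := range seg {
		if r < '0' || r > '9' {
			allDigits = false
			break
		}
	}
	if allDigits {
		return true
	}
	return len(seg) == 36 &&
		seg[8] == '-' && seg[13] == '-' && seg[18] == '-' && seg[23] == '-'
}

// normalizePath replaces identifier-like segments with ":id" so that
// /orders/1 and /orders/2 map to the same label value.
func normalizePath(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if isDynamicSegment(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	fmt.Println(normalizePath("/orders/12345/items/7"))
	fmt.Println(normalizePath("/users/123e4567-e89b-12d3-a456-426614174000"))
	fmt.Println(normalizePath("/metrics"))
}
```

With this in place, a metrics middleware would pass `normalizePath(r.URL.Path)` as the `path` label instead of the raw path.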
When to Use slog
Use slog When:
- Go 1.21+ projects (standard library, zero dependencies)
- Need structured logging with JSON output
- Want high-performance logging with minimal allocations
- Integrating with log aggregation systems (Loki, ELK)
Don't Use slog When:
- Go < 1.21 (use zap or zerolog instead)
- Need complex log routing or filtering (use zap)
- Require very specific features (audit trails, etc.)
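As a small illustration of the structured JSON output mentioned above, the sketch below configures a slog JSON handler with a `ReplaceAttr` hook that masks values for a deny-list of keys. The deny-list (`sensitiveKeys`) and the helper names `redact` and `renderLogin` are assumptions made for this example. The timestamp is dropped only to keep the output deterministic; a production logger would normally keep it.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// sensitiveKeys is a hypothetical deny-list; adapt it to your own field names.
var sensitiveKeys = map[string]bool{"password": true, "token": true, "email": true}

// redact masks sensitive attributes and drops the top-level timestamp.
func redact(groups []string, a slog.Attr) slog.Attr {
	if a.Key == slog.TimeKey && len(groups) == 0 {
		return slog.Attr{} // a zero Attr tells slog to discard the attribute
	}
	if sensitiveKeys[a.Key] {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// renderLogin logs one record to an in-memory buffer and returns the JSON line.
func renderLogin(userID, password string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, &slog.HandlerOptions{ReplaceAttr: redact}))
	logger.Info("user login", "user_id", userID, "password", password)
	return buf.String()
}

func main() {
	fmt.Print(renderLogin("u-42", "hunter2"))
	// prints: {"level":"INFO","msg":"user login","user_id":"u-42","password":"[REDACTED]"}
}
```

A hook like this is a safety net rather than a substitute for not logging secrets in the first place; it only catches keys that are on the list.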
Sampling Strategy Decision
Always Sample When:
- Development/staging environments
- Total traffic < 100 requests/sec
- Debugging specific issues
Probabilistic Sampling When:
- Production with moderate traffic (100-10K req/sec)
- Sample rate: 1-10% typically
Tail-Based Sampling When:
- High traffic production (>10K req/sec)
- Only sample errors and slow requests
- Requires tail-sampling processor (OpenTelemetry Collector)
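To make the probabilistic option above concrete, the sketch below derives a keep/drop decision from the trace ID itself, so every service that sees the same trace reaches the same decision without coordination. It is a simplified illustration written for this document (`shouldSample` is a hypothetical name), similar in spirit to ratio-based samplers but not a copy of any SDK's implementation.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample makes a deterministic decision from a 32-character hex trace ID.
// ratio is the fraction of traces to keep, in [0, 1].
func shouldSample(traceIDHex string, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	id, err := hex.DecodeString(traceIDHex)
	if err != nil || len(id) != 16 {
		return false // malformed IDs are dropped in this sketch
	}
	// Interpret the last 8 bytes as an unsigned integer and drop the top bit,
	// giving a value in [0, 2^63).
	x := binary.BigEndian.Uint64(id[8:16]) >> 1
	// Keep the trace if the value falls below ratio * 2^63.
	threshold := uint64(ratio * float64(1<<63))
	return x < threshold
}

func main() {
	id := "4bf92f3577b34da6a3ce929d0e0e4736"
	for _, ratio := range []float64{0.01, 0.1, 0.5, 1.0} {
		fmt.Printf("ratio=%.2f sampled=%v\n", ratio, shouldSample(id, ratio))
	}
}
```

Because the decision depends only on the trace ID and the ratio, raising the ratio never drops a trace that a lower ratio would have kept.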
Anti-Patterns to Avoid
❌ Not Propagating Context
WRONG: Breaking trace context
```go
func processOrder(order Order) error {
	// Creates new context, loses trace!
	ctx := context.Background()
	return validateOrder(ctx, order)
}
```
CORRECT: Propagate context through call chain
```go
func processOrder(ctx context.Context, order Order) error {
	// Propagates trace context
	return validateOrder(ctx, order)
}
```
❌ Cardinality Explosion
WRONG: Unbounded label values
go
// user_id can have millions of values!
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()CORRECT: Use bounded labels
go
// Only method and path (bounded values)
httpRequests.WithLabelValues(r.Method, r.URL.Path).Inc()
// Track user-specific metrics separately if needed错误示例:无界标签值
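A raw URL path can itself be unbounded when it embeds IDs, so one way to keep the path label bounded is to normalize it first. The sketch below assumes that numeric and UUID-shaped segments are identifiers; normalizePath and isID are hypothetical helpers, and a router that exposes the matched route template would be a more precise source.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePath is a hypothetical helper that replaces path segments that
// look like identifiers with a placeholder, so /users/123 and /users/456
// produce the same label value.
func normalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, seg := range segments {
		if isID(seg) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

// isID reports whether seg is all digits, or is 36 characters of hex digits
// and dashes (the textual shape of a UUID). This is a heuristic, not a parser.
func isID(seg string) bool {
	if seg == "" {
		return false
	}
	allDigits := true
	uuidLike := len(seg) == 36
	for _, r := range seg {
		if r < '0' || r > '9' {
			allDigits = false
		}
		if !strings.ContainsRune("0123456789abcdefABCDEF-", r) {
			uuidLike = false
		}
	}
	return allDigits || uuidLike
}

func main() {
	fmt.Println(normalizePath("/users/123/orders/9"))
	fmt.Println(normalizePath("/items/550e8400-e29b-41d4-a716-446655440000"))
	fmt.Println(normalizePath("/health"))
}
```

With the counter from the example above, the call would become httpRequests.WithLabelValues(r.Method, normalizePath(r.URL.Path)).Inc().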
❌ Logging Sensitive Data
WRONG: Exposing PII and secrets
go
logger.Info("user login",
    "email", user.Email,       // PII!
    "password", user.Password, // CRITICAL!
    "token", authToken,        // SECRET!
)
CORRECT: Omit or redact sensitive information
go
logger.Info("user login",
    "user_id", user.ID,   // Safe identifier
    "method", "password", // Authentication method, not the password itself
)
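Omitting sensitive fields at each call site relies on discipline, so a logger-level safety net can help. The sketch below uses slog's HandlerOptions.ReplaceAttr hook to mask values by key; the key list, the "[REDACTED]" marker, and the helper names are assumptions for illustration, and a real deny-list would need to match the service's own field names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"strings"
)

// sensitiveKeys is an illustrative deny-list, not an exhaustive one.
var sensitiveKeys = map[string]bool{"password": true, "token": true, "email": true}

// newRedactingLogger returns a JSON logger whose ReplaceAttr hook masks the
// value of any attribute whose key appears in sensitiveKeys.
func newRedactingLogger(w io.Writer) *slog.Logger {
	opts := &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if sensitiveKeys[strings.ToLower(a.Key)] {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(w, opts))
}

// loginFields logs a login event that (wrongly) includes the password and
// returns the decoded JSON record so the effect of the hook is visible.
func loginFields(userID int, password string) (map[string]any, error) {
	var buf bytes.Buffer
	newRedactingLogger(&buf).Info("user login", "user_id", userID, "password", password)

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := loginFields(7, "hunter2")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["user_id"], fields["password"])
}
```

This complements, rather than replaces, the practice of not passing secrets to the logger in the first place: a key-based filter cannot catch secrets embedded in message strings or logged under unexpected keys.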
❌ Not Closing Spans
WRONG: Span leaks memory
go
func processOrder(ctx context.Context) error {
    ctx, span := tracer.Start(ctx, "processOrder")
    // Missing defer span.End()!
    if err := validate(ctx); err != nil {
        return err // Span never ended, so it is never exported!
    }
    return nil
}
CORRECT: Always defer span.End()
go
func processOrder(ctx context.Context) error {
    ctx, span := tracer.Start(ctx, "processOrder")
    defer span.End() // Runs on every return path
    if err := validate(ctx); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error()) // RecordError alone does not set the span status
        return err
    }
    return nil
}
❌ Synchronous Span Export
WRONG: Blocking requests with synchronous span export
go
// Synchronous export blocks the HTTP handler until each span is sent
// (the Jaeger exporter shown here is deprecated upstream in favor of OTLP)
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(...))
if err != nil {
    log.Fatal(err)
}
tp := sdktrace.NewTracerProvider(
    sdktrace.WithSyncer(exporter), // BAD: Synchronous!
)
CORRECT: Use batching for async export
go
// Batching exports asynchronously
tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter), // GOOD: Async batching
)
❌ Missing Graceful Shutdown
WRONG: Losing traces on shutdown
go
func main() {
    tp, _ := initTracer("service")
    // Missing shutdown - spans lost!
    http.ListenAndServe(":8080", nil)
}
CORRECT: Shutdown exporters properly
go
func main() {
    tp, err := initTracer("service")
    if err != nil {
        log.Fatal(err)
    }
    // Flushes buffered spans; runs only if main returns normally
    defer tp.Shutdown(context.Background())
    // Handle signals and graceful shutdown so that main actually returns
    server.ListenAndServe()
}
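The "handle signals" step can be sketched with the standard library alone. In the example below, serve and runDemo are hypothetical helpers, the flush callback stands in for tp.Shutdown, and the 100ms parent timeout exists only so the demo exits on its own; a real service would pass context.Background() and wait for SIGINT or SIGTERM.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// serve runs srv until ctx is cancelled, then drains in-flight requests and
// calls flush (a stand-in for tp.Shutdown) so buffered telemetry is sent
// before the process exits.
func serve(ctx context.Context, srv *http.Server, flush func(context.Context) error) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.ListenAndServe() }()

	var serveErr error
	select {
	case err := <-errCh:
		// The server failed to start or stopped unexpectedly.
		if !errors.Is(err, http.ErrServerClosed) {
			serveErr = err
		}
	case <-ctx.Done():
		// A shutdown signal arrived, or the parent context ended.
	}

	// Bound the time spent draining requests and flushing exporters.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	shutdownErr := srv.Shutdown(shutdownCtx)
	flushErr := flush(shutdownCtx)
	return errors.Join(serveErr, shutdownErr, flushErr)
}

// runDemo wires serve to OS signals and reports whether flush was called.
func runDemo() (flushed bool, err error) {
	// Demo only: expire after 100ms so the example terminates by itself.
	parent, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// ctx is cancelled on SIGINT/SIGTERM, or when the demo timeout fires.
	ctx, stop := signal.NotifyContext(parent, os.Interrupt, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: "127.0.0.1:0"} // port 0 picks any free port
	flush := func(context.Context) error {
		flushed = true // placeholder for tp.Shutdown(ctx)
		return nil
	}
	err = serve(ctx, srv, flush)
	return flushed, err
}

func main() {
	flushed, err := runDemo()
	fmt.Println("flushed:", flushed, "err:", err)
}
```

Shutting the HTTP server down before flushing means spans created by in-flight requests are finished before the exporter is closed.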
Best Practices
- Context Propagation: Always pass context.Context through call chains
- Bounded Labels: Keep metric label cardinality under 1000 combinations
- Sampling: Use probabilistic sampling in high-traffic production
- Correlation IDs: Include trace_id in logs for correlation
- Health Checks: Implement both /health (liveness) and /ready (readiness)
- Graceful Shutdown: Flush traces and metrics before exit
- Error Recording: Use span.RecordError() for automatic error tracking
- Metric Naming: Follow Prometheus naming conventions (_total, _seconds)
- Log Levels: Use appropriate levels (Debug, Info, Warn, Error)
- Auto-Instrumentation: Use middleware for HTTP/gRPC when possible
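For the health-check item in the list above, a minimal sketch of the two endpoints might look like this. The handler names match the ones used in the Quick Reference; the ready flag, the response bodies, and the probe helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready flips to true once dependencies (database, caches, ...) are usable.
var ready atomic.Bool

// livenessHandler reports only that the process is up and serving HTTP.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}

// readinessHandler returns 503 until the service can accept traffic, which
// lets an orchestrator hold back requests without restarting the process.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if !ready.Load() {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ready")
}

// probe calls h in-process and returns the HTTP status code it wrote.
func probe(h http.HandlerFunc, path string) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(livenessHandler, "/health"))
	fmt.Println(probe(readinessHandler, "/ready"))
	ready.Store(true) // e.g. after the database connection succeeds
	fmt.Println(probe(readinessHandler, "/ready"))
}
```

Keeping liveness free of dependency checks avoids restart loops when a downstream system is briefly unavailable; readiness is the place to reflect such outages.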
Metric Naming Conventions
Follow Prometheus best practices:
Counter Metrics (always increasing):
- http_requests_total (not http_requests)
- payment_transactions_total
- errors_total
Gauge Metrics (can go up or down):
- active_connections
- queue_size
- memory_usage_bytes
Histogram/Summary Metrics (observations):
- http_request_duration_seconds (not _milliseconds)
- db_query_duration_seconds
- response_size_bytes
Label Naming:
- Use method, not http_method
- Use status or status_code, not http_status
- Use snake_case, not camelCase
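These conventions are easy to check mechanically, for example in a unit test over registered metric names. The sketch below is a hypothetical lint helper covering only the rules listed above (snake_case, the _total suffix for counters, base units instead of milliseconds); checkMetricName and its messages are assumptions, not part of any Prometheus library.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// snakeCase matches lowercase snake_case names such as http_requests_total.
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

// checkMetricName returns the convention violations found in name. kind is
// "counter", "gauge", or "histogram". An empty result means the name
// passes the checks implemented here.
func checkMetricName(kind, name string) []string {
	var problems []string
	if !snakeCase.MatchString(name) {
		problems = append(problems, "use snake_case")
	}
	if kind == "counter" && !strings.HasSuffix(name, "_total") {
		problems = append(problems, "counters should end in _total")
	}
	if strings.HasSuffix(name, "_milliseconds") || strings.HasSuffix(name, "_ms") {
		problems = append(problems, "use base units (_seconds), not milliseconds")
	}
	return problems
}

func main() {
	for _, m := range [][2]string{
		{"counter", "http_requests_total"},
		{"counter", "http_requests"},
		{"histogram", "http_request_duration_milliseconds"},
		{"gauge", "activeConnections"},
	} {
		fmt.Printf("%-40s %v\n", m[1], checkMetricName(m[0], m[1]))
	}
}
```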
Resources
Official Documentation:
- OpenTelemetry Go: https://opentelemetry.io/docs/instrumentation/go/
- Prometheus Client Library: https://github.com/prometheus/client_golang
- Go slog Package: https://pkg.go.dev/log/slog
Recent Guides (2025):
- "Observability in Go: What Real Engineers Are Saying in 2025" (Quesma Blog)
- "Monitoring Go Apps with OpenTelemetry Metrics" (Better Stack, 2025)
- Prometheus Best Practices: https://prometheus.io/docs/practices/naming/
Related Skills:
- golang-web-frameworks: HTTP server patterns and middleware
- golang-testing-strategies: Testing instrumented code
- verification-before-completion: Validating observability setup
Quick Reference
Initialize OpenTelemetry
go
tp, _ := initTracer("service-name")
defer tp.Shutdown(context.Background())
Create Spans
go
ctx, span := otel.Tracer("name").Start(ctx, "operation")
defer span.End()
span.SetAttributes(attribute.String("key", "value"))
Define Metrics
go
counter := promauto.NewCounterVec(counterOpts, []string{"label"})       // counterOpts is a prometheus.CounterOpts
histogram := promauto.NewHistogramVec(histogramOpts, []string{"label"}) // histogramOpts is a prometheus.HistogramOpts
Structured Logging
go
logger := slog.With("trace_id", traceID)
logger.Info("message", "key", value)
Health Checks
go
http.HandleFunc("/health", livenessHandler)
http.HandleFunc("/ready", readinessHandler)
Token Estimate: ~5,000 tokens (entry point + full content)
Version: 1.0.0
Last Updated: 2025-12-03