golang-observability-opentelemetry

Go Observability with OpenTelemetry

Overview

Production Go services are usually observed through three complementary signals, often called the three pillars: traces, metrics, and logs. OpenTelemetry provides vendor-neutral instrumentation for distributed tracing, Prometheus collects metrics by scraping an HTTP endpoint, and Go's log/slog package (in the standard library since Go 1.21) provides structured logging with low overhead.
Key Features:
  • 🔍 OpenTelemetry: Distributed tracing with context propagation
  • 📊 Prometheus: Metrics collection with /metrics endpoint
  • 📝 Structured Logging: slog with JSON formatting and correlation IDs
  • 🎯 Auto-Instrumentation: HTTP/gRPC middleware patterns
  • 💚 Health Checks: Kubernetes-ready readiness/liveness probes
  • 🔄 Graceful Shutdown: Clean exporter shutdown and signal handling

When to Use This Skill

Activate this skill when:
  • Instrumenting microservices for production observability
  • Setting up distributed tracing across service boundaries
  • Creating operational dashboards with Prometheus/Grafana
  • Debugging production performance issues or bottlenecks
  • Implementing SLOs and monitoring SLIs
  • Adding observability to existing Go applications
  • Correlating logs, traces, and metrics for debugging

Core Observability Principles

The Three Pillars

  1. Traces: Understand request flow across distributed systems
  2. Metrics: Measure system behavior and performance over time
  3. Logs: Record discrete events for debugging and audit

Correlation Strategy

The three pillars are most useful when they can be joined on common identifiers:
  • Trace ID: Links all operations in a request
  • Span ID: Identifies a specific operation within a trace
  • Request ID: Correlates log lines with the trace and metrics for the same request
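
A minimal sketch of how a request ID could be introduced at the edge of a service and carried through the context, using only the standard library. The X-Request-ID header matches the one read later in this guide; RequestID, RequestIDFrom, LoggerFor, and roundTrip are hypothetical helper names introduced here for illustration.

```go
package main

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "log/slog"
    "net/http"
    "net/http/httptest"
)

// ctxKey is an unexported type so the context key cannot collide with keys
// defined in other packages.
type ctxKey struct{}

// newRequestID returns 16 random bytes encoded as 32 hex characters.
func newRequestID() string {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        return "unknown"
    }
    return hex.EncodeToString(b)
}

// RequestID reuses the caller's X-Request-ID header when present, otherwise
// generates one, then stores it in the request context and echoes it back.
func RequestID(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.Header.Get("X-Request-ID")
        if id == "" {
            id = newRequestID()
        }
        w.Header().Set("X-Request-ID", id)
        ctx := context.WithValue(r.Context(), ctxKey{}, id)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// RequestIDFrom returns the ID stored by RequestID, or "" if none is set.
func RequestIDFrom(ctx context.Context) string {
    id, _ := ctx.Value(ctxKey{}).(string)
    return id
}

// LoggerFor returns a logger that stamps every record with the request ID.
func LoggerFor(ctx context.Context) *slog.Logger {
    return slog.Default().With("request_id", RequestIDFrom(ctx))
}

// roundTrip sends one in-memory request through the middleware and reports
// the ID seen by the handler and the ID echoed in the response header.
func roundTrip(incomingID string) (seen, echoed string) {
    h := RequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        seen = RequestIDFrom(r.Context())
    }))
    req := httptest.NewRequest(http.MethodGet, "/orders", nil)
    if incomingID != "" {
        req.Header.Set("X-Request-ID", incomingID)
    }
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, req)
    return seen, rec.Header().Get("X-Request-ID")
}
```

Reusing an incoming ID keeps one identifier across service hops, while generating one at the edge guarantees that every log line has something to join on.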

OpenTelemetry Integration

Installation

bash
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk
go get go.opentelemetry.io/otel/exporters/jaeger
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp

Basic Setup

go
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
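    // Create Jaeger exporter. Note: this exporter module is deprecated upstream;
    // Jaeger accepts OTLP directly, so new code should prefer an OTLP exporter.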
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://localhost:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

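    // Create resource with service name. Merge returns an error when the two
    // schema URLs conflict, so keep the semconv version in step with the SDK.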
    res, err := resource.Merge(
        resource.Default(),
        resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Create tracer provider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
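        // AlwaysSample records every trace; in production, prefer a ratio-based
        // sampler such as sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)).
        sdktrace.WithSampler(sdktrace.AlwaysSample()),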
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func main() {
    tp, err := initTracer("order-service")
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Printf("Error shutting down tracer: %v", err)
        }
    }()

    // Application code...
}
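
The setup above samples every trace and suggests a probability sampler for production. The idea behind ratio-based sampling can be sketched in plain Go: derive the decision from the trace ID itself, so that every service handling the same trace reaches the same verdict. This is an illustration of the general technique, not the SDK's code; shouldSample is a hypothetical name, and in a real service the SDK's own sampler options should be used instead.

```go
package main

import "encoding/binary"

// shouldSample reports whether a 16-byte trace ID falls inside the sampled
// fraction. The decision depends only on the ID, so it is deterministic and
// consistent across services.
func shouldSample(traceID [16]byte, fraction float64) bool {
    if fraction >= 1 {
        return true
    }
    if fraction <= 0 {
        return false
    }
    // Treat the low 8 bytes as a 63-bit number and compare it against a
    // threshold proportional to the requested fraction.
    x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
    bound := uint64(fraction * (1 << 63))
    return x < bound
}
```

Because trace IDs are random, roughly the requested fraction of traces falls below the threshold, and no coordination between services is needed.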

Creating Spans

go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func ProcessOrder(ctx context.Context, order Order) error {
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "ProcessOrder")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        attribute.String("order.id", order.ID),
        attribute.Int("order.items", len(order.Items)),
        attribute.Float64("order.total", order.Total),
    )

    // Validate order (creates child span)
    if err := validateOrder(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "validation failed")
        return err
    }

    // Fulfill order
    if err := fulfillOrder(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "fulfillment failed")
        return err
    }

    span.SetStatus(codes.Ok, "") // the description is only recorded for codes.Error
    return nil
}

func validateOrder(ctx context.Context, order Order) error {
    _, span := otel.Tracer("order-service").Start(ctx, "validateOrder")
    defer span.End()

    // Validation logic...
    return nil
}

HTTP Middleware Instrumentation

go
import (
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func main() {
    // Wrap handler with automatic tracing
    handler := http.HandlerFunc(orderHandler)
    wrappedHandler := otelhttp.NewHandler(handler, "order-handler")

    http.Handle("/orders", wrappedHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

// Manual instrumentation for more control
func orderHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    tracer := otel.Tracer("order-service")

    ctx, span := tracer.Start(ctx, "orderHandler")
    defer span.End()

    // Extract order ID from request
    orderID := r.URL.Query().Get("id")
    span.SetAttributes(attribute.String("order.id", orderID))

    // Process order with propagated context
    order, err := fetchOrder(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        http.Error(w, "Order not found", http.StatusNotFound)
        return
    }

    _ = order // ... write the order to the response
}

Prometheus Metrics

Installation

bash
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp

Metric Types and Patterns

go
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter: Monotonically increasing value
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    // Gauge: Value that can go up or down
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Histogram: Observations bucketed by value
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets, // [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        },
        []string{"method", "path"},
    )

    // Summary: Like a histogram, but quantiles are computed client-side and
    // cannot be aggregated across instances
    dbQueryDuration = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "db_query_duration_seconds",
            Help:       "Database query duration",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"query_type"},
    )
)

Metrics Middleware

go
import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metrics middleware that instruments all HTTP handlers
func MetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Track active connections
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Wrap response writer to capture status code
        rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

        // Call next handler
        next.ServeHTTP(rw, r)

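        // Record metrics. Note: using the raw URL path as a label creates one
        // time series per distinct path; prefer the route pattern when paths
        // contain IDs.
        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rw.statusCode)).Inc()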
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// Expose metrics endpoint
func main() {
    http.Handle("/metrics", promhttp.Handler())

    handler := MetricsMiddleware(http.HandlerFunc(orderHandler))
    http.Handle("/orders", handler)

    http.ListenAndServe(":8080", nil)
}
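
The middleware above uses r.URL.Path as a label value. When paths embed identifiers (for example /orders/123), every distinct ID produces a new time series. One possible mitigation, sketched here under the assumption that IDs are purely numeric path segments, is to normalize the path before using it as a label. normalizePath and isNumeric are hypothetical helpers, not part of any Prometheus library.

```go
package main

import "strings"

// normalizePath replaces purely numeric path segments with ":id" so that
// /orders/1 and /orders/2 share a single label value.
func normalizePath(p string) string {
    segs := strings.Split(p, "/")
    for i, s := range segs {
        if isNumeric(s) {
            segs[i] = ":id"
        }
    }
    return strings.Join(segs, "/")
}

// isNumeric reports whether s is non-empty and consists only of ASCII digits.
func isNumeric(s string) bool {
    if s == "" {
        return false
    }
    for _, r := range s {
        if r < '0' || r > '9' {
            return false
        }
    }
    return true
}
```

In the middleware, normalizePath(r.URL.Path) would then stand in for r.URL.Path in both WithLabelValues calls. Services whose IDs are UUIDs or slugs would need a different rule, or the router's matched pattern if the router exposes one.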

Custom Metrics Example

go
func ProcessPayment(ctx context.Context, payment Payment) error {
    timer := prometheus.NewTimer(dbQueryDuration.WithLabelValues("payment_insert"))
    defer timer.ObserveDuration()

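    // Process payment. If MetricsMiddleware already wraps this route, the
    // manual http_requests_total increments below count each request twice.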
    if err := db.InsertPayment(payment); err != nil {
        httpRequestsTotal.WithLabelValues("POST", "/payments", "500").Inc()
        return err
    }

    httpRequestsTotal.WithLabelValues("POST", "/payments", "200").Inc()
    return nil
}

Structured Logging with slog

Basic Setup (Go 1.21+)

go
package main

import (
    "log/slog"
    "os"
)

func initLogger() *slog.Logger {
    // JSON logger for production
    handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level:     slog.LevelInfo,
        AddSource: true, // Include file:line information
    })

    logger := slog.New(handler)
    slog.SetDefault(logger) // Set as default logger
    return logger
}

func main() {
    logger := initLogger()

    logger.Info("service starting",
        "service", "order-service",
        "version", "1.0.0",
        "port", 8080,
    )
}

Context-Aware Logging

go
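import (
    "context"
    "log/slog"
    "time"

    "go.opentelemetry.io/otel/trace"
)

// LoggerWithTrace returns a logger carrying the active span's IDs.
// Without a valid span it returns the default logger, so all-zero IDs are
// never logged.
func LoggerWithTrace(ctx context.Context) *slog.Logger {
    spanCtx := trace.SpanContextFromContext(ctx)
    if !spanCtx.IsValid() {
        return slog.Default()
    }

    return slog.With(
        "trace_id", spanCtx.TraceID().String(),
        "span_id", spanCtx.SpanID().String(),
    )
}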

func HandleRequest(ctx context.Context, req Request) error {
    logger := LoggerWithTrace(ctx)

    logger.Info("processing request",
        "request_id", req.ID,
        "method", req.Method,
        "path", req.Path,
    )

    if err := processRequest(ctx, req); err != nil {
        logger.Error("request failed",
            "error", err,
            "duration_ms", time.Since(req.StartTime).Milliseconds(),
        )
        return err
    }

    logger.Info("request completed successfully",
        "duration_ms", time.Since(req.StartTime).Milliseconds(),
    )
    return nil
}
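
An alternative to building a logger per request is to wrap the slog.Handler so that identifiers are pulled from the context on every call to InfoContext, ErrorContext, and so on. The sketch below uses only the standard library and a request ID stored under a hypothetical context key; the same shape would apply to trace and span IDs. ctxHandler, requestIDKey, and logLine are names introduced here for illustration.

```go
package main

import (
    "bytes"
    "context"
    "log/slog"
)

// requestIDKey is a hypothetical context key holding the request ID.
type requestIDKey struct{}

// ctxHandler wraps another slog.Handler and adds attributes found in the
// context passed to the logging call.
type ctxHandler struct {
    slog.Handler
}

// Handle appends request_id when the context carries one, then delegates.
func (h ctxHandler) Handle(ctx context.Context, r slog.Record) error {
    if id, ok := ctx.Value(requestIDKey{}).(string); ok {
        r.AddAttrs(slog.String("request_id", id))
    }
    return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the result; otherwise logger.With(...)
// would return the inner handler and silently drop the context lookup.
func (h ctxHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
    return ctxHandler{h.Handler.WithAttrs(attrs)}
}

func (h ctxHandler) WithGroup(name string) slog.Handler {
    return ctxHandler{h.Handler.WithGroup(name)}
}

// logLine logs one record through the wrapper and returns the JSON output.
// The time attribute is dropped so the output is stable.
func logLine(requestID string) string {
    var buf bytes.Buffer
    inner := slog.NewJSONHandler(&buf, &slog.HandlerOptions{
        ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
            if a.Key == slog.TimeKey && len(groups) == 0 {
                return slog.Attr{}
            }
            return a
        },
    })
    logger := slog.New(ctxHandler{inner})

    ctx := context.Background()
    if requestID != "" {
        ctx = context.WithValue(ctx, requestIDKey{}, requestID)
    }
    logger.InfoContext(ctx, "processing request")
    return buf.String()
}
```

With this approach, call sites only need to pass ctx to the logging call; they do not have to remember to construct a per-request logger first.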

Log Levels and Structured Fields

go
func ProcessOrder(ctx context.Context, order Order) error {
    logger := LoggerWithTrace(ctx).With(
        "order_id", order.ID,
        "user_id", order.UserID,
    )

    logger.Debug("validating order", "items", len(order.Items))

    if len(order.Items) == 0 {
        logger.Warn("empty order received")
        return ErrEmptyOrder
    }

    logger.Info("order validation passed")

    if err := fulfillOrder(ctx, order); err != nil {
        logger.Error("fulfillment failed",
            "error", err,
            slog.Group("order_details",
                "total", order.Total,
                "items", len(order.Items),
            ),
        )
        return err
    }

    logger.Info("order processed successfully",
        "total", order.Total,
    )
    return nil
}

Health Checks and Graceful Shutdown

Health Check Endpoints

go
import (
    "context"
    "database/sql"
    "encoding/json"
    "net/http"
    "time"
)

type HealthChecker struct {
    db *sql.DB
    // Add other dependencies
}

type HealthStatus struct {
    Status    string            `json:"status"`
    Version   string            `json:"version"`
    Checks    map[string]string `json:"checks"`
    Timestamp time.Time         `json:"timestamp"`
}

// Liveness probe - is the app running?
func (hc *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{
        "status": "alive",
    })
}

// Readiness probe - is the app ready to serve traffic?
func (hc *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    status := HealthStatus{
        Status:    "ready",
        Version:   "1.0.0",
        Checks:    make(map[string]string),
        Timestamp: time.Now(),
    }

    // Check database
    code := http.StatusOK
    if err := hc.db.PingContext(ctx); err != nil {
        status.Status = "not_ready"
        status.Checks["database"] = "unhealthy: " + err.Error()
        code = http.StatusServiceUnavailable
    } else {
        status.Checks["database"] = "healthy"
    }

    // Add more dependency checks (Redis, external APIs, etc.)

    // Set headers before WriteHeader; headers set afterwards are ignored
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(code)
    json.NewEncoder(w).Encode(status)
}
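
The readiness handler above leaves room for more dependency checks. One way to keep that extensible, sketched here with the standard library only, is to register named check functions and let a single handler run them all. Check, Readiness, alwaysOK, alwaysFail, and probe are hypothetical names; alwaysOK and alwaysFail stand in for real checks such as db.PingContext.

```go
package main

import (
    "context"
    "encoding/json"
    "errors"
    "net/http"
    "net/http/httptest"
    "time"
)

// Check reports whether one dependency is usable.
type Check func(ctx context.Context) error

// Readiness runs every named check under a shared timeout and answers
// 200 when all pass, 503 otherwise.
func Readiness(checks map[string]Check, timeout time.Duration) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), timeout)
        defer cancel()

        results := make(map[string]string, len(checks))
        code := http.StatusOK
        for name, check := range checks {
            if err := check(ctx); err != nil {
                results[name] = "unhealthy"
                code = http.StatusServiceUnavailable
                continue
            }
            results[name] = "healthy"
        }

        // Headers first, then the status code, then the body.
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(code)
        json.NewEncoder(w).Encode(results)
    }
}

// alwaysOK and alwaysFail are stand-ins for real dependency checks.
func alwaysOK(context.Context) error   { return nil }
func alwaysFail(context.Context) error { return errors.New("connection refused") }

// probe calls the handler in memory and returns the status code and body.
func probe(checks map[string]Check) (int, string) {
    rec := httptest.NewRecorder()
    req := httptest.NewRequest(http.MethodGet, "/ready", nil)
    Readiness(checks, time.Second)(rec, req)
    return rec.Code, rec.Body.String()
}
```

The response body here reports only healthy or unhealthy per dependency; whether to expose the underlying error text on an unauthenticated endpoint is a judgment call for each service.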

Graceful Shutdown

go
import (
    "context"
    "errors"
    "log"
    "log/slog"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // Initialize tracer
    tp, err := initTracer("order-service")
    if err != nil {
        log.Fatal(err)
    }

    // Setup HTTP server
    server := &http.Server{
        Addr:    ":8080",
        Handler: setupRoutes(), // placeholder: returns the application's http.Handler
    }

    // Channel for shutdown signals
    shutdown := make(chan os.Signal, 1)
    signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)

    // Start server in goroutine
    go func() {
        slog.Info("server starting", "port", 8080)
        // ListenAndServe returns http.ErrServerClosed after a clean Shutdown
        if err := server.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
            log.Fatal(err)
        }
    }()

    // Wait for shutdown signal
    <-shutdown
    slog.Info("shutdown signal received")

    // Create shutdown context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Shutdown HTTP server
    slog.Info("shutting down HTTP server")
    if err := server.Shutdown(ctx); err != nil {
        slog.Error("HTTP server shutdown error", "error", err)
    }

    // Shutdown tracer provider (flush spans)
    slog.Info("shutting down tracer")
    if err := tp.Shutdown(ctx); err != nil {
        slog.Error("tracer shutdown error", "error", err)
    }

    slog.Info("shutdown complete")
}
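
The same shutdown sequence can be packaged as a function driven by a context, which makes it testable without sending real signals. This is a sketch under stated assumptions: serve, signalContext, and simulateSignal are hypothetical names, and flush stands in for whatever cleanup is needed (for example the tracer provider's Shutdown method).

```go
package main

import (
    "context"
    "errors"
    "net"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

// serve runs srv on ln until ctx is cancelled, then drains in-flight
// requests and calls flush, both bounded by the grace period.
func serve(ctx context.Context, srv *http.Server, ln net.Listener, grace time.Duration, flush func(context.Context) error) error {
    errCh := make(chan error, 1)
    go func() { errCh <- srv.Serve(ln) }()

    select {
    case err := <-errCh:
        return err // the server failed before shutdown was requested
    case <-ctx.Done():
    }

    shutdownCtx, cancel := context.WithTimeout(context.Background(), grace)
    defer cancel()

    err := srv.Shutdown(shutdownCtx)
    // After Shutdown, Serve is expected to return http.ErrServerClosed.
    if serveErr := <-errCh; !errors.Is(serveErr, http.ErrServerClosed) {
        err = errors.Join(err, serveErr)
    }
    return errors.Join(err, flush(shutdownCtx))
}

// signalContext returns a context that is cancelled on SIGINT or SIGTERM.
func signalContext() (context.Context, context.CancelFunc) {
    return signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
}

// simulateSignal exercises serve with an already-cancelled context and
// reports whether flush ran.
func simulateSignal() (bool, error) {
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        return false, err
    }
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // stands in for a delivered signal

    flushed := false
    err = serve(ctx, &http.Server{}, ln, time.Second, func(context.Context) error {
        flushed = true
        return nil
    })
    return flushed, err
}
```

In main, the context from signalContext would be passed to serve, with tp.Shutdown as the flush function. Flushing after the HTTP server has drained means spans from the last in-flight requests are still exported.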

Complete Instrumentation Example

go
package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "log/slog"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

type Server struct {
    db     *sql.DB
    logger *slog.Logger
}

func (s *Server) orderHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()

    // Get tracer and create span
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "orderHandler")
    defer span.End()

    // Create context-aware logger with trace ID
    logger := s.logger.With(
        "trace_id", span.SpanContext().TraceID().String(),
        "request_id", r.Header.Get("X-Request-ID"),
    )

    orderID := r.URL.Query().Get("id")
    span.SetAttributes(attribute.String("order.id", orderID))

    logger.Info("fetching order", "order_id", orderID)

    // Fetch order from database
    order, err := s.fetchOrder(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        logger.Error("failed to fetch order", "error", err)
        http.Error(w, "Order not found", http.StatusNotFound)
        return
    }

    logger.Info("order fetched successfully",
        "order_id", orderID,
        "items", len(order.Items),
    )

    // Return order as JSON
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(order)
}

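func (s *Server) fetchOrder(ctx context.Context, orderID string) (*Order, error) {
    ctx, span := otel.Tracer("order-service").Start(ctx, "fetchOrder")
    defer span.End()

    // Time database query
    start := time.Now()

    // Scan needs one destination per selected column, so list the columns
    // explicitly. The column names here are assumed for illustration.
    var order Order
    err := s.db.QueryRowContext(ctx, "SELECT id, total FROM orders WHERE id = ?", orderID).
        Scan(&order.ID, &order.Total)

    duration := time.Since(start).Seconds()
    dbQueryDuration.WithLabelValues("select_order").Observe(duration)

    if err != nil {
        span.RecordError(err)
        return nil, err
    }
    return &order, nil
}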

func setupRoutes(s *Server, hc *HealthChecker) http.Handler {
    mux := http.NewServeMux()

    // Health endpoints (no tracing needed)
    mux.HandleFunc("/health", hc.LivenessHandler)
    mux.HandleFunc("/ready", hc.ReadinessHandler)
    mux.Handle("/metrics", promhttp.Handler())

    // Business endpoints (with tracing)
    orderHandler := http.HandlerFunc(s.orderHandler)
    mux.Handle("/orders", otelhttp.NewHandler(orderHandler, "orders"))

    // Wrap everything with metrics middleware
    return MetricsMiddleware(mux)
}

Decision Trees

When to Use OpenTelemetry

Use OpenTelemetry When:
  • Building distributed systems with multiple services
  • Need to trace requests across service boundaries
  • Debugging performance issues in microservices
  • Want vendor-neutral observability (switch backends easily)
  • Require correlation between traces, metrics, and logs
Don't Use OpenTelemetry When:
  • Building a simple monolith where logs and metrics already answer your questions
  • Per-request overhead is critical and even sampled tracing is too costly
  • No tracing backend (such as Jaeger or Zipkin) is available to receive the data

When to Use Prometheus

Use Prometheus When:
  • Need time-series metrics for monitoring and alerting
  • Building operational dashboards (Grafana)
  • Measuring SLIs for SLO compliance
  • Tracking business metrics (requests/sec, conversion rates)
  • Kubernetes/containerized environments
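Don't Use Prometheus When:
  • Need high-cardinality metrics (every unique label combination becomes its own time series)
  • Require long-term metric storage from Prometheus alone (pair it with Thanos or Cortex)
  • Need push-based metrics (Prometheus scrapes its targets; short-lived jobs need the Pushgateway)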

When to Use slog

Use slog When:
  • Go 1.21+ projects (standard library, zero dependencies)
  • Need structured logging with JSON output
  • Want high-performance logging with minimal allocations
  • Integrating with log aggregation systems (Loki, ELK)
Don't Use slog When:
  • Go < 1.21 (use zap or zerolog instead)
  • Need complex log routing or filtering (use zap)
  • Require very specific features (audit trails, etc.)

Sampling Strategy Decision

Always Sample When:
  • Development/staging environments
  • Total traffic < 100 requests/sec
  • Debugging specific issues
Probabilistic Sampling When:
  • Production with moderate traffic (100-10K req/sec)
  • Sample rate: 1-10% typically
Tail-Based Sampling When:
  • High traffic production (>10K req/sec)
  • Only sample errors and slow requests
  • Requires tail-sampling processor (OpenTelemetry Collector)
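The thresholds above can be encoded as a small decision helper. This is only a sketch: `Strategy`, its constants, and `chooseStrategy` are hypothetical names, not part of any library, and the cut-offs are simply the ones listed in this section.

```go
package main

import "fmt"

// Strategy names one of the sampling approaches listed above.
// The type and its constants are hypothetical, for illustration only.
type Strategy string

const (
    AlwaysSample  Strategy = "always"
    Probabilistic Strategy = "probabilistic"
    TailBased     Strategy = "tail-based"
)

// chooseStrategy encodes the thresholds from the list above.
// production is false for development and staging environments;
// reqPerSec is the total request rate of the service.
func chooseStrategy(production bool, reqPerSec float64) Strategy {
    switch {
    case !production || reqPerSec < 100:
        return AlwaysSample
    case reqPerSec <= 10_000:
        return Probabilistic
    default:
        return TailBased
    }
}

func main() {
    fmt.Println(chooseStrategy(false, 50_000)) // staging: always
    fmt.Println(chooseStrategy(true, 5_000))   // moderate traffic: probabilistic
    fmt.Println(chooseStrategy(true, 50_000))  // high traffic: tail-based
}
```

In the OpenTelemetry Go SDK, the first two strategies correspond to `sdktrace.AlwaysSample()` and `sdktrace.TraceIDRatioBased(ratio)`, usually wrapped in `sdktrace.ParentBased` so child spans follow the parent's decision. Tail-based sampling happens in the Collector, not in the SDK.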

Anti-Patterns to Avoid

❌ Not Propagating Context

WRONG: Breaking trace context
go
func processOrder(order Order) error {
    // Creates new context, loses trace!
    ctx := context.Background()
    return validateOrder(ctx, order)
}
CORRECT: Propagate context through call chain
go
func processOrder(ctx context.Context, order Order) error {
    // Propagates trace context
    return validateOrder(ctx, order)
}

❌ Cardinality Explosion

WRONG: Unbounded label values
go
// user_id can have millions of values!
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()
CORRECT: Use bounded labels
go
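// Only method and path (bounded values). If paths embed IDs
// (for example /orders/42), label with the route template instead of r.URL.Path.
httpRequests.WithLabelValues(r.Method, r.URL.Path).Inc()
// Keep per-user detail in logs or span attributes, not in metric labels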
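Note that `r.URL.Path` is only bounded when paths carry no identifiers. One way to keep the path label bounded is to normalize it before use. A minimal sketch, assuming IDs are purely numeric path segments; `normalizePath` and `isDigits` are hypothetical helpers, and a router that exposes the matched route template would be a better source when available.

```go
package main

import (
    "fmt"
    "strings"
)

// normalizePath replaces purely numeric path segments with ":id" so the
// resulting label has a bounded set of values. Hypothetical helper.
func normalizePath(path string) string {
    segments := strings.Split(path, "/")
    for i, seg := range segments {
        if seg != "" && isDigits(seg) {
            segments[i] = ":id"
        }
    }
    return strings.Join(segments, "/")
}

// isDigits reports whether s consists only of ASCII digits.
func isDigits(s string) bool {
    for _, r := range s {
        if r < '0' || r > '9' {
            return false
        }
    }
    return true
}

func main() {
    fmt.Println(normalizePath("/orders/42/items/7")) // /orders/:id/items/:id
    fmt.Println(normalizePath("/orders"))            // /orders
}
```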

❌ Logging Sensitive Data

WRONG: Exposing PII and secrets
go
logger.Info("user login",
    "email", user.Email,        // PII!
    "password", user.Password,  // CRITICAL!
    "token", authToken,         // SECRET!
)
CORRECT: Redact sensitive information
go
logger.Info("user login",
    "user_id", user.ID,  // Safe identifier
    "method", "password",
)

❌ Not Closing Spans

WRONG: Span leaks memory
go
func processOrder(ctx context.Context) error {
    ctx, span := tracer.Start(ctx, "processOrder")
    // Missing defer span.End()!

    if err := validate(); err != nil {
        return err  // Span never closed!
    }

    return nil
}
CORRECT: Always defer span.End()
go
func processOrder(ctx context.Context) error {
    ctx, span := tracer.Start(ctx, "processOrder")
    defer span.End()  // Always runs

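    if err := validate(); err != nil {
        span.RecordError(err)
        // RecordError does not change the span status; set it explicitly
        // (codes is go.opentelemetry.io/otel/codes).
        span.SetStatus(codes.Error, err.Error())
        return err
    }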

    return nil
}

❌ Synchronous Span Export

WRONG: Blocking requests with synchronous span export
go
// WithSyncer exports each span as it ends, so the HTTP handler waits on the exporter.
// Note: the Jaeger exporter shown here has been deprecated in OpenTelemetry Go in favor of OTLP.
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(...))  // handle err
tp := sdktrace.NewTracerProvider(
    sdktrace.WithSyncer(exporter),  // BAD: Synchronous!
)
CORRECT: Use batching for async export
go
// Batching queues finished spans and exports them in the background
tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter),  // GOOD: Async batching
)

❌ Missing Graceful Shutdown

WRONG: Losing traces on shutdown
go
func main() {
    tp, _ := initTracer("service")
    // Missing shutdown - spans lost!
    http.ListenAndServe(":8080", nil)
}
CORRECT: Shutdown exporters properly
go
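func main() {
    tp, err := initTracer("service")
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    // Flush buffered spans when main returns.
    defer tp.Shutdown(context.Background())

    // Handle signals and shut the server down gracefully so that
    // main returns and the deferred Shutdown actually runs.
    server := &http.Server{Addr: ":8080"}
    server.ListenAndServe()
}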

Best Practices

  1. Context Propagation: Always pass context.Context through call chains
  2. Bounded Labels: Keep metric label cardinality under 1000 combinations
  3. Sampling: Use probabilistic or tail-based sampling in high-traffic production
  4. Correlation IDs: Include trace_id in logs for correlation
  5. Health Checks: Implement both /health (liveness) and /ready (readiness)
  6. Graceful Shutdown: Flush traces and metrics before exit
  7. Error Recording: Use span.RecordError() to attach errors to spans, and set the span status to Error (RecordError alone does not change it)
  8. Metric Naming: Follow Prometheus naming conventions (_total, _seconds)
  9. Log Levels: Use appropriate levels (Debug, Info, Warn, Error)
  10. Auto-Instrumentation: Use middleware for HTTP/gRPC when possible

Metric Naming Conventions

Follow Prometheus best practices:
Counter Metrics (always increasing):
  • http_requests_total (not http_requests)
  • payment_transactions_total
  • errors_total
Gauge Metrics (can go up or down):
  • active_connections
  • queue_size
  • memory_usage_bytes
Histogram/Summary Metrics (observations):
  • http_request_duration_seconds (not _milliseconds)
  • db_query_duration_seconds
  • response_size_bytes
Label Naming:
  • Use method, not http_method
  • Use status, not status_code or http_status
  • Use snake_case, not camelCase
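These conventions are regular enough to check mechanically, for instance in a unit test over the metric names a service registers. A minimal lint sketch covering only the rules listed here; `checkMetricName` and the `kind` strings are hypothetical and not part of the Prometheus client.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// snakeCase matches lowercase words separated by single underscores.
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

// checkMetricName is a hypothetical lint helper.
// kind is "counter", "gauge", or "histogram".
func checkMetricName(kind, name string) error {
    if !snakeCase.MatchString(name) {
        return fmt.Errorf("%q is not snake_case", name)
    }
    if kind == "counter" && !strings.HasSuffix(name, "_total") {
        return fmt.Errorf("counter %q should end in _total", name)
    }
    if strings.HasSuffix(name, "_milliseconds") {
        return fmt.Errorf("%q should use base units (_seconds)", name)
    }
    return nil
}

func main() {
    names := []struct{ kind, name string }{
        {"counter", "http_requests_total"},
        {"counter", "http_requests"},
        {"histogram", "http_request_duration_milliseconds"},
        {"gauge", "activeConnections"},
    }
    for _, n := range names {
        fmt.Printf("%s %s: %v\n", n.kind, n.name, checkMetricName(n.kind, n.name))
    }
}
```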

Resources

Official Documentation:
Recent Guides (2025):
  • "Observability in Go: What Real Engineers Are Saying in 2025" (Quesma Blog)
  • "Monitoring Go Apps with OpenTelemetry Metrics" (Better Stack, 2025)
  • Prometheus Best Practices: https://prometheus.io/docs/practices/naming/
Related Skills:
  • golang-web-frameworks: HTTP server patterns and middleware
  • golang-testing-strategies: Testing instrumented code
  • verification-before-completion: Validating observability setup

Quick Reference

Initialize OpenTelemetry

go
tp, _ := initTracer("service-name")
defer tp.Shutdown(context.Background())

Create Spans

go
ctx, span := otel.Tracer("name").Start(ctx, "operation")
defer span.End()
span.SetAttributes(attribute.String("key", "value"))

Define Metrics

go
counter := promauto.NewCounterVec(opts, []string{"label"})
histogram := promauto.NewHistogramVec(opts, []string{"label"})

Structured Logging

go
logger := slog.With("trace_id", traceID)
logger.Info("message", "key", value)

Health Checks

go
http.HandleFunc("/health", livenessHandler)
http.HandleFunc("/ready", readinessHandler)
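The two handlers registered above are not defined in this quick reference. A minimal sketch of what they might look like, assuming readiness is tracked with a single flag; the `ready` variable and the `probe` helper are hypothetical, and a real readiness check would usually ping the database or other dependencies.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "sync/atomic"
)

// ready is set to true once dependencies (database, caches) are usable.
var ready atomic.Bool

// livenessHandler reports that the process is up and serving.
func livenessHandler(w http.ResponseWriter, _ *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

// readinessHandler returns 503 until the service can take traffic.
func readinessHandler(w http.ResponseWriter, _ *http.Request) {
    if !ready.Load() {
        http.Error(w, "not ready", http.StatusServiceUnavailable)
        return
    }
    fmt.Fprintln(w, "ready")
}

// probe calls h in-process and returns the HTTP status code.
func probe(h http.HandlerFunc) int {
    rec := httptest.NewRecorder()
    h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
    return rec.Code
}

func main() {
    fmt.Println(probe(livenessHandler), probe(readinessHandler)) // 200 503
    ready.Store(true)
    fmt.Println(probe(livenessHandler), probe(readinessHandler)) // 200 200
}
```

Keeping liveness independent of downstream dependencies avoids Kubernetes restarting a healthy pod because a database is briefly unavailable.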

Token Estimate: ~5,000 tokens (entry point + full content) Version: 1.0.0 Last Updated: 2025-12-03