Monitoring & Observability


Overview


This skill provides comprehensive guidance for monitoring and observability workflows, including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
  • Setting up monitoring for new services
  • Designing alerts and dashboards
  • Troubleshooting performance issues
  • Implementing SLO tracking and error budgets
  • Choosing between monitoring tools
  • Integrating OpenTelemetry instrumentation
  • Analyzing metrics, logs, and traces
  • Optimizing Datadog costs and finding waste
  • Migrating from Datadog to open-source stack


Core Workflow: Observability Implementation


Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
    ├─ YES → Go to "9. Troubleshooting & Analysis"
    └─ NO → Are you improving existing monitoring?
        ├─ Alerts → Go to "3. Alert Design"
        ├─ Dashboards → Go to "4. Dashboard & Visualization"
        ├─ SLOs → Go to "5. SLO & Error Budgets"
        ├─ Tool selection → Read references/tool_comparison.md
        └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"


1. Design Metrics Strategy


Start with The Four Golden Signals


Every service should monitor:
  1. Latency: Response time (p50, p95, p99)
  2. Traffic: Requests per second
  3. Errors: Failure rate
  4. Saturation: Resource utilization
For request-driven services, use the RED Method:
  • Rate: Requests/sec
  • Errors: Error rate
  • Duration: Response time
For infrastructure resources, use the USE Method:
  • Utilization: % time busy
  • Saturation: Queue depth
  • Errors: Error count
Quick Start - Web Application Example:

```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Deep Dive: Metric Design


For comprehensive metric design guidance including:
  • Metric types (counter, gauge, histogram, summary)
  • Cardinality best practices
  • Naming conventions
  • Dashboard design principles
→ Read: references/metrics_design.md
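On cardinality specifically, the cost of a bad label is multiplicative: the number of time series is the product of each label's distinct values. A quick sketch with illustrative counts:

```python
from math import prod

# Distinct values per label (illustrative numbers)
labels_ok = {"service": 20, "endpoint": 50, "status": 5}
labels_bad = dict(labels_ok, user_id=100_000)  # per-user tag — avoid

print(prod(labels_ok.values()))   # series with bounded labels
print(prod(labels_bad.values()))  # same metric once a user_id label is added
```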

Automated Metric Analysis


Detect anomalies and trends in your metrics:

```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24

# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
```

**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)

---

2. Log Aggregation & Analysis


Structured Logging Checklist


Every log entry should include:
  • ✅ Timestamp (ISO 8601 format)
  • ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
  • ✅ Message (human-readable)
  • ✅ Service name
  • ✅ Request ID (for tracing)
Example structured log (JSON):
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```
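A minimal stdlib-only formatter that emits the checklist fields as JSON might look like this (the hard-coded service name and the `request_id` passed via `extra` are illustrative conventions, not a fixed API):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit the checklist fields: timestamp, level, message, service, request_id
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.error("Payment processing failed",
             extra={"request_id": "550e8400-e29b-41d4-a716-446655440000"})
```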

Log Aggregation Patterns


ELK Stack (Elasticsearch, Logstash, Kibana):
  • Best for: Deep log analysis, complex queries
  • Cost: High (infrastructure + operations)
  • Complexity: High
Grafana Loki:
  • Best for: Cost-effective logging, Kubernetes
  • Cost: Low
  • Complexity: Medium
CloudWatch Logs:
  • Best for: AWS-centric applications
  • Cost: Medium
  • Complexity: Low

Log Analysis


Analyze logs for errors, patterns, and anomalies:
```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log

# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors

# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```

**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)

Deep Dive: Logging


For comprehensive logging guidance including:
  • Structured logging implementation examples (Python, Node.js, Go, Java)
  • Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
  • Query patterns and best practices
  • PII redaction and security
  • Sampling and rate limiting
→ Read: references/logging_guide.md


3. Alert Design


Alert Design Principles


  1. Every alert must be actionable - If you can't do something, don't alert
  2. Alert on symptoms, not causes - Alert on user experience, not components
  3. Tie alerts to SLOs - Connect to business impact
  4. Reduce noise - Only page for critical issues

Alert Severity Levels


| Severity | Response Time | Example |
|----------|---------------|---------|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |

Multi-Window Burn Rate Alerting


Alert when error budget is consumed too quickly:

```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
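The thresholds come from a simple relationship: burn rate is the observed error rate divided by the allowed rate (1 − SLO), and a burn rate of 14.4 exhausts a 30-day budget in roughly two days. A small sketch of that arithmetic:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # How many times faster than "allowed" the budget is being consumed
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    # At this burn rate, the whole period's budget is gone in this many days
    return period_days / rate

r = burn_rate(error_rate=0.0144, slo=0.999)  # 1.44% errors vs a 0.1% budget
print(r, days_to_exhaustion(r))
```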

Alert Quality Checker


Audit your alert rules against best practices:
```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml

# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```

**Checks for**:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- `for` clause to prevent flapping

**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)

Alert Templates


Production-ready alert rule templates:
→ Templates:
  • assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
  • assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts

Deep Dive: Alerting


For comprehensive alerting guidance including:
  • Alert design patterns (multi-window, rate of change, threshold with hysteresis)
  • Alert annotation best practices
  • Alert routing (severity-based, team-based, time-based)
  • Inhibition rules
  • Runbook structure
  • On-call best practices
→ Read: references/alerting_best_practices.md

Runbook Template


Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md


4. Dashboard & Visualization


Dashboard Design Principles


  1. Top-down layout: Most important metrics first
  2. Color coding: Red (critical), yellow (warning), green (healthy)
  3. Consistent time windows: All panels use same time range
  4. Limit panels: 8-12 panels per dashboard maximum
  5. Include context: Show related metrics together

Recommended Dashboard Structure


```
┌─────────────────────────────────────┐
│  Overall Health (Single Stats)      │
│  [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Request Rate & Errors (Graphs)     │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Latency Distribution (Graphs)      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Resource Usage (Graphs)            │
└─────────────────────────────────────┘
```

Generate Grafana Dashboards


Automatically generate dashboards from templates:

```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json

# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json

# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
```

**Supports**:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)

**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)

---

5. SLO & Error Budgets


SLO Fundamentals


SLI (Service Level Indicator): Measurement of service quality
  • Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
  • Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
  • Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
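The 43.2-minute figure follows directly from the definition. A one-liner you can adapt (assumes a 30-day month):

```python
def error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    # Allowed downtime = (100% - SLO) × minutes in the period
    return (1 - slo_pct / 100) * days * 24 * 60

print(error_budget_minutes(99.9))   # ≈ 43.2 minutes/month
print(error_budget_minutes(99.99))  # ≈ 4.32 minutes/month
```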

Common SLO Targets


| Availability | Downtime/Month | Use Case |
|--------------|----------------|----------|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |

SLO Calculator


Calculate compliance, error budgets, and burn rates:

```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table

# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30

# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
```

**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)

Deep Dive: SLO/SLA


For comprehensive SLO/SLA guidance including:
  • Choosing appropriate SLIs
  • Setting realistic SLO targets
  • Error budget policies
  • Burn rate alerting
  • SLA structure and contracts
  • Monthly reporting templates
→ Read: references/slo_sla_guide.md


6. Distributed Tracing


When to Use Tracing


Use distributed tracing when you need to:
  • Debug performance issues across services
  • Understand request flow through microservices
  • Identify bottlenecks in distributed systems
  • Find N+1 query problems

OpenTelemetry Implementation


Python example:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)

    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```

Sampling Strategies


  • Development: 100% (ALWAYS_ON)
  • Staging: 50-100%
  • Production: 1-10% (or error-based sampling)
Error-based sampling (always sample errors, 1% of successes):
```python
class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, **kwargs):
        attributes = kwargs.get('attributes', {})

        if attributes.get('error', False):
            return Decision.RECORD_AND_SAMPLE

        if trace_id & 0xFF < 3:  # ~1%
            return Decision.RECORD_AND_SAMPLE

        return Decision.DROP
```
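Note that `trace_id & 0xFF < 3` keeps 3 of every 256 low-byte values, i.e. roughly 1.2% rather than exactly 1%. A quick check of the mask:

```python
# Enumerate all 256 possible low bytes and count how many the mask keeps
kept = sum(1 for trace_id in range(256) if trace_id & 0xFF < 3)
print(kept, kept / 256)
```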

OTel Collector Configuration


Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
  • Receives OTLP, Prometheus, and host metrics
  • Batching and memory limiting
  • Tail sampling (error-based, latency-based, probabilistic)
  • Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)

Deep Dive: Tracing


For comprehensive tracing guidance including:
  • OpenTelemetry instrumentation (Python, Node.js, Go, Java)
  • Span attributes and semantic conventions
  • Context propagation (W3C Trace Context)
  • Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
  • Analysis patterns (finding slow traces, N+1 queries)
  • Integration with logs
→ Read: references/tracing_guide.md


7. Datadog Cost Optimization & Migration


Scenario 1: I'm Using Datadog and Costs Are Too High


If your Datadog bill is growing out of control, start by identifying waste:

Cost Analysis Script


Automatically analyze your Datadog usage and find cost optimization opportunities:
```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY

# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
```

**What it checks**:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities

**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)

Common Cost Optimization Strategies


1. Custom Metrics Optimization (typical savings: 20-40%):
  • Remove high-cardinality tags (user IDs, request IDs)
  • Delete unused custom metrics
  • Aggregate metrics before sending
  • Use metric prefixes to identify teams/services
2. Log Management (typical savings: 30-50%):
  • Implement log sampling for high-volume services
  • Use exclusion filters for debug/trace logs in production
  • Archive cold logs to S3/GCS after 7 days
  • Set log retention policies (15 days instead of 30)
3. APM Optimization (typical savings: 15-25%):
  • Reduce trace sampling rates (10% → 5% in prod)
  • Use head-based sampling instead of complete sampling
  • Remove APM from non-critical services
  • Use trace search with lower retention
4. Infrastructure Monitoring (typical savings: 10-20%):
  • Switch from VM-based to container-based pricing where possible
  • Remove agents from ephemeral instances
  • Use Datadog's host reduction strategies
  • Consolidate staging environments
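The log-management savings are straightforward to estimate before committing. A back-of-envelope sketch (the per-GB price and volumes are placeholders; check your actual rate card):

```python
def monthly_log_cost(gb_per_day: float, price_per_gb: float = 0.10,
                     sample_rate: float = 1.0) -> float:
    # 30-day ingestion cost, scaled by the fraction of logs kept
    return gb_per_day * 30 * price_per_gb * sample_rate

before = monthly_log_cost(500)                  # ingest everything
after = monthly_log_cost(500, sample_rate=0.5)  # sample 50% of volume
print(before, after, 1 - after / before)
```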

Scenario 2: Migrating Away from Datadog


If you're considering migrating to a more cost-effective open-source stack:

Migration Overview


From Datadog to the open-source stack:
  • Metrics: Datadog → Prometheus + Grafana
  • Logs: Datadog Logs → Grafana Loki
  • Traces: Datadog APM → Tempo or Jaeger
  • Dashboards: Datadog → Grafana
  • Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for 100-host environment)

Migration Strategy


Phase 1: Run Parallel (Month 1-2):
  • Deploy open-source stack alongside Datadog
  • Migrate metrics first (lowest risk)
  • Validate data accuracy
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
  • Convert Datadog dashboards to Grafana
  • Translate alert rules (use DQL → PromQL guide below)
  • Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4):
  • Set up Loki for log aggregation
  • Deploy Tempo/Jaeger for tracing
  • Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5):
  • Confirm all functionality migrated
  • Cancel Datadog subscription

Query Translation: DQL → PromQL


When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL. Quick examples:

```
# Average CPU
Datadog:    avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})

# Request rate
Datadog:    sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))

# P95 latency
Datadog:    p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
Datadog:    (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```

**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)
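For bulk dashboard conversion, the mechanical part of these rewrites can be scripted. A toy sketch that handles only the simplest shape shown above (a single aggregator over a wildcard scope, optionally `.as_rate()`); the dots-to-underscores metric-name mapping is a naive assumption, and anything more complex should go through the full guide:

```python
import re

def dql_to_promql(q: str, window: str = "5m") -> str:
    # Match e.g. "sum:requests.count{*}.as_rate()" or "avg:system.cpu.user{*}"
    m = re.fullmatch(r"(avg|sum|min|max):([\w.]+)\{\*\}(\.as_rate\(\))?", q)
    if not m:
        raise ValueError(f"unsupported query: {q}")
    agg, metric, as_rate = m.groups()
    metric = metric.replace(".", "_")  # naive name mapping — verify per metric
    inner = f"rate({metric}[{window}])" if as_rate else metric
    return f"{agg}({inner})"

print(dql_to_promql("sum:requests.count{*}.as_rate()"))
# e.g. sum(rate(requests_count[5m]))
```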

Cost Comparison


Example: 100-host infrastructure

| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|-----------|------------------|----------------------|---------|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |

Deep Dive: Datadog Migration


For comprehensive migration guidance including:
  • Detailed cost comparison and ROI calculations
  • Step-by-step migration instructions
  • Infrastructure sizing recommendations (CPU, RAM, storage)
  • Dashboard conversion tools and examples
  • Alert rule translation patterns
  • Application instrumentation changes (DogStatsD → Prometheus client)
  • Python scripts for exporting Datadog dashboards and monitors
  • Common challenges and solutions
→ Read: references/datadog_migration.md


8. Tool Selection & Comparison


Decision Matrix


Choose Prometheus + Grafana if:
  • ✅ Using Kubernetes
  • ✅ Want control and customization
  • ✅ Have ops capacity
  • ✅ Budget-conscious
Choose Datadog if:
  • ✅ Want ease of use
  • ✅ Need full observability now
  • ✅ Budget allows ($8k+/month for 100 hosts)
Choose Grafana Stack (LGTM) if:
  • ✅ Want open source full stack
  • ✅ Cost-effective solution
  • ✅ Cloud-native architecture
Choose ELK Stack if:
  • ✅ Heavy log analysis needs
  • ✅ Need powerful search
  • ✅ Have dedicated ops team
Choose Cloud Native (CloudWatch/etc) if:
  • ✅ Single cloud provider
  • ✅ Simple needs
  • ✅ Want minimal setup

Cost Comparison (100 hosts, 1TB logs/month)


| Solution | Monthly Cost | Setup | Ops Burden |
|----------|--------------|-------|------------|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |

Deep Dive: Tool Comparison


For comprehensive tool comparison including:
  • Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
  • Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
  • Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
  • Full-stack observability comparison
  • Recommendations by company size
→ Read: references/tool_comparison.md


9. Troubleshooting & Analysis


Health Check Validation


Validate health check endpoints against best practices:

```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health

# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
```

**Checks for**:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching

**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
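On the service side, a handler that passes these checks only needs a small, predictable payload. A sketch of the response-building half (dependency probes are stubbed booleans here, and returning 503 when degraded is one common convention, not part of the checklist):

```python
import json

def build_health_response(version, deps):
    # deps: mapping of dependency name -> probe result (True = healthy)
    body = {
        "status": "ok" if all(deps.values()) else "degraded",
        "version": version,
        "dependencies": deps,
    }
    headers = {
        "Content-Type": "application/json",   # returns JSON format
        "Cache-Control": "no-store",          # disables caching
    }
    status_code = 200 if body["status"] == "ok" else 503
    return status_code, headers, json.dumps(body)

code, headers, payload = build_health_response("1.4.2", {"db": True, "cache": True})
print(code, payload)
```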

Common Troubleshooting Workflows


High Latency Investigation:
  1. Check dashboards for latency spike
  2. Query traces for slow operations
  3. Check database slow query log
  4. Check external API response times
  5. Review recent deployments
  6. Check resource utilization (CPU, memory)
High Error Rate Investigation:
  1. Check error logs for patterns
  2. Identify affected endpoints
  3. Check dependency health
  4. Review recent deployments
  5. Check resource limits
  6. Verify configuration
Service Down Investigation:
  1. Check if pods/instances are running
  2. Check health check endpoint
  3. Review recent deployments
  4. Check resource availability
  5. Check network connectivity
  6. Review logs for startup errors


Quick Reference Commands


Prometheus Queries


promql
undefined
promql
undefined

Request rate

请求速率

sum(rate(http_requests_total[5m]))
sum(rate(http_requests_total[5m]))

Error rate

错误率

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

P95 latency

P95延迟

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le) )
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le) )

CPU usage

CPU使用率

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage

内存使用率

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
undefined
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
undefined
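The P95 query above operates on cumulative histogram buckets; a simplified re-implementation of `histogram_quantile`'s linear interpolation (ignoring PromQL's edge cases) makes the mechanics concrete:

```python
def histogram_quantile(q, buckets):
    """buckets: list of (le, cumulative_count) sorted by le, last le = +inf.
    Interpolate linearly inside the bucket containing the q-th rank,
    mirroring PromQL's uniform-within-bucket assumption."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls in the open-ended bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # 0.75: interpolated inside the 0.5-1.0 bucket
```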

Kubernetes Commands

Check pod status:

```bash
kubectl get pods -n <namespace>
```

View pod logs:

```bash
kubectl logs -f <pod-name> -n <namespace>
```

Check pod resources:

```bash
kubectl top pods -n <namespace>
```

Describe pod for events:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

Check recent deployments:

```bash
kubectl rollout history deployment/<name> -n <namespace>
```

Log Queries

Elasticsearch:

```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

Loki (LogQL):

```logql
{job="app", level="error"} |= "error" | json
```

CloudWatch Insights:

```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```

Resources Summary

Scripts (automation and analysis)

  • analyze_metrics.py
    - Detect anomalies in Prometheus/CloudWatch metrics
  • alert_quality_checker.py
    - Audit alert rules against best practices
  • slo_calculator.py
    - Calculate SLO compliance and error budgets
  • log_analyzer.py
    - Parse logs for errors and patterns
  • dashboard_generator.py
    - Generate Grafana dashboards from templates
  • health_check_validator.py
    - Validate health check endpoints
  • datadog_cost_analyzer.py
    - Analyze Datadog usage and find cost waste

References (deep-dive documentation)

  • metrics_design.md
    - Four Golden Signals, RED/USE methods, metric types
  • alerting_best_practices.md
    - Alert design, runbooks, on-call practices
  • logging_guide.md
    - Structured logging, aggregation patterns
  • tracing_guide.md
    - OpenTelemetry, distributed tracing
  • slo_sla_guide.md
    - SLI/SLO/SLA definitions, error budgets
  • tool_comparison.md
    - Comprehensive comparison of monitoring tools
  • datadog_migration.md
    - Complete guide for migrating from Datadog to OSS stack
  • dql_promql_translation.md
    - Datadog Query Language to PromQL translation reference

Templates (ready-to-use configurations)

  • prometheus-alerts/webapp-alerts.yml
    - Production-ready web app alerts
  • prometheus-alerts/kubernetes-alerts.yml
    - Kubernetes monitoring alerts
  • otel-config/collector-config.yaml
    - OpenTelemetry Collector configuration
  • runbooks/incident-runbook-template.md
    - Incident response template

Best Practices

Metrics

  • Start with Four Golden Signals
  • Use appropriate metric types (counter, gauge, histogram)
  • Keep cardinality low (avoid high-cardinality labels)
  • Follow naming conventions
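"Keep cardinality low" is about multiplication: the number of time series one metric creates is the product of its label value counts, so a single unbounded label (user ID, request ID) dominates everything else. A back-of-the-envelope check:

```python
def series_count(*label_cardinalities):
    """Worst-case time-series count for one metric: the product of the
    number of distinct values each label can take."""
    n = 1
    for c in label_cardinalities:
        n *= c
    return n

# method x status x endpoint: manageable
assert series_count(5, 8, 40) == 1_600
# adding a user-id label multiplies by every user ever seen
assert series_count(5, 8, 40, 100_000) == 160_000_000
```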

Logging

  • Use structured logging (JSON)
  • Include request IDs for tracing
  • Set appropriate log levels
  • Redact PII before logging
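The first three bullets combine naturally in one formatter; a minimal sketch using Python's stdlib `logging` (the email-only redaction is illustrative — real PII scrubbing needs more patterns):

```python
import json
import logging
import re

class JsonFormatter(logging.Formatter):
    """One JSON object per line, with a request ID and basic PII redaction."""
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": self.EMAIL.sub("<redacted>", record.getMessage()),
            "request_id": getattr(record, "request_id", None),
        })
```

Attach it with `handler.setFormatter(JsonFormatter())` and pass `extra={"request_id": ...}` on each log call so the ID rides along on the record.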

Alerting

  • Make every alert actionable
  • Alert on symptoms, not causes
  • Use multi-window burn rate alerts
  • Include runbook links
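"Multi-window burn rate" compares the observed error rate to the rate that would spend the budget exactly on schedule; a sketch of the arithmetic (the 14.4x/1h paging threshold follows the Google SRE Workbook convention):

```python
def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' the error budget
    is being spent; 1.0 means the budget lasts the whole SLO window."""
    return observed_error_rate / (1 - slo)

# 99.9% SLO leaves a 0.1% budget; 1% observed errors burn it 10x too fast
assert abs(burn_rate(0.01, 0.999) - 10) < 1e-6

# common paging threshold: a 14.4x burn over 1h spends 2% of a 30-day budget
assert abs(14.4 / (30 * 24) - 0.02) < 1e-9
```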

Tracing

  • Sample appropriately (1-10% in production)
  • Always record errors
  • Use semantic conventions
  • Propagate context between services
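The first two tracing bullets interact: uniform head sampling at 1-10% would drop most error traces, so errors are typically forced through. A deterministic sketch (hashing on the trace ID keeps the decision consistent across services in the same call chain):

```python
def should_sample(trace_id: str, rate: float = 0.05, is_error: bool = False) -> bool:
    """Always keep error traces; sample the rest at `rate`, deterministically
    by hex trace ID so every service makes the same decision."""
    if is_error:
        return True
    # bucket 0-9999 derived from the ID; slightly non-uniform, fine for a sketch
    return int(trace_id, 16) % 10_000 < rate * 10_000
```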

SLOs

  • Start with current performance
  • Set realistic targets
  • Define error budget policies
  • Review and adjust quarterly
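"Define error budget policies" starts from knowing what the budget actually buys; the arithmetic, as a sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days ~= 43.2 minutes of downtime
assert abs(error_budget_minutes(0.999) - 43.2) < 0.01

# 99.99% leaves only ~4.3 minutes: a single slow rollback can spend it
assert abs(error_budget_minutes(0.9999) - 4.32) < 0.01
```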