Monitoring & Observability


Overview


This skill provides comprehensive guidance for monitoring and observability workflows, including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
  • Setting up monitoring for new services
  • Designing alerts and dashboards
  • Troubleshooting performance issues
  • Implementing SLO tracking and error budgets
  • Choosing between monitoring tools
  • Integrating OpenTelemetry instrumentation
  • Analyzing metrics, logs, and traces
  • Optimizing Datadog costs and finding waste
  • Migrating from Datadog to open-source stack


Core Workflow: Observability Implementation


Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
    ├─ YES → Go to "9. Troubleshooting & Analysis"
    └─ NO → Are you improving existing monitoring?
        ├─ Alerts → Go to "3. Alert Design"
        ├─ Dashboards → Go to "4. Dashboard & Visualization"
        ├─ SLOs → Go to "5. SLO & Error Budgets"
        ├─ Tool selection → Read references/tool_comparison.md
        └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"


1. Design Metrics Strategy


Start with The Four Golden Signals


Every service should monitor:
  1. Latency: Response time (p50, p95, p99)
  2. Traffic: Requests per second
  3. Errors: Failure rate
  4. Saturation: Resource utilization
For request-driven services, use the RED Method:
  • Rate: Requests/sec
  • Errors: Error rate
  • Duration: Response time
For infrastructure resources, use the USE Method:
  • Utilization: % time busy
  • Saturation: Queue depth
  • Errors: Error count
Quick Start - Web Application Example:

```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Deep Dive: Metric Design


For comprehensive metric design guidance including:
  • Metric types (counter, gauge, histogram, summary)
  • Cardinality best practices
  • Naming conventions
  • Dashboard design principles
→ Read: references/metrics_design.md
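On cardinality specifically, the cost of a bad label is multiplicative: the number of time series is the product of each label's distinct values. A quick sketch with illustrative counts:

```python
from math import prod

# Distinct values per label (illustrative numbers)
labels_ok = {"service": 20, "endpoint": 50, "status": 5}
labels_bad = dict(labels_ok, user_id=100_000)  # per-user tag — avoid

print(prod(labels_ok.values()))   # series with bounded labels
print(prod(labels_bad.values()))  # same metric once a user_id label is added
```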

Automated Metric Analysis


Detect anomalies and trends in your metrics:

```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24

# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
```

**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)

---

2. Log Aggregation & Analysis


Structured Logging Checklist


Every log entry should include:
  • ✅ Timestamp (ISO 8601 format)
  • ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
  • ✅ Message (human-readable)
  • ✅ Service name
  • ✅ Request ID (for tracing)
Example structured log (JSON):
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```
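A minimal stdlib-only formatter that emits the checklist fields as JSON might look like this (the hard-coded service name and the `request_id` passed via `extra` are illustrative conventions, not a fixed API):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit the checklist fields: timestamp, level, message, service, request_id
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.error("Payment processing failed",
             extra={"request_id": "550e8400-e29b-41d4-a716-446655440000"})
```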

Log Aggregation Patterns


ELK Stack (Elasticsearch, Logstash, Kibana):
  • Best for: Deep log analysis, complex queries
  • Cost: High (infrastructure + operations)
  • Complexity: High
Grafana Loki:
  • Best for: Cost-effective logging, Kubernetes
  • Cost: Low
  • Complexity: Medium
CloudWatch Logs:
  • Best for: AWS-centric applications
  • Cost: Medium
  • Complexity: Low

Log Analysis


Analyze logs for errors, patterns, and anomalies:
```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log

# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors

# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```

**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)

Deep Dive: Logging


For comprehensive logging guidance including:
  • Structured logging implementation examples (Python, Node.js, Go, Java)
  • Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
  • Query patterns and best practices
  • PII redaction and security
  • Sampling and rate limiting
→ Read: references/logging_guide.md


3. Alert Design


Alert Design Principles


  1. Every alert must be actionable - If you can't do something, don't alert
  2. Alert on symptoms, not causes - Alert on user experience, not components
  3. Tie alerts to SLOs - Connect to business impact
  4. Reduce noise - Only page for critical issues

Alert Severity Levels


| Severity | Response Time | Example |
|----------|---------------|---------|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |

Multi-Window Burn Rate Alerting


Alert when error budget is consumed too quickly:

```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
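The thresholds come from a simple relationship: burn rate is the observed error rate divided by the allowed rate (1 − SLO), and a burn rate of 14.4 exhausts a 30-day budget in roughly two days. A small sketch of that arithmetic:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # How many times faster than "allowed" the budget is being consumed
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    # At this burn rate, the whole period's budget is gone in this many days
    return period_days / rate

r = burn_rate(error_rate=0.0144, slo=0.999)  # 1.44% errors vs a 0.1% budget
print(r, days_to_exhaustion(r))
```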

Alert Quality Checker


Audit your alert rules against best practices:
```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml

# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```

**Checks for**:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- `for` clause to prevent flapping

**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)

Alert Templates


Production-ready alert rule templates:
→ Templates:
  • assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
  • assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts

Deep Dive: Alerting


For comprehensive alerting guidance including:
  • Alert design patterns (multi-window, rate of change, threshold with hysteresis)
  • Alert annotation best practices
  • Alert routing (severity-based, team-based, time-based)
  • Inhibition rules
  • Runbook structure
  • On-call best practices
→ Read: references/alerting_best_practices.md

Runbook Template


Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md


4. Dashboard & Visualization


Dashboard Design Principles


  1. Top-down layout: Most important metrics first
  2. Color coding: Red (critical), yellow (warning), green (healthy)
  3. Consistent time windows: All panels use same time range
  4. Limit panels: 8-12 panels per dashboard maximum
  5. Include context: Show related metrics together

Recommended Dashboard Structure


```
┌─────────────────────────────────────┐
│  Overall Health (Single Stats)      │
│  [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Request Rate & Errors (Graphs)     │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Latency Distribution (Graphs)      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Resource Usage (Graphs)            │
└─────────────────────────────────────┘
```

Generate Grafana Dashboards


Automatically generate dashboards from templates:

```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json

# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json

# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
```

**Supports**:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)

**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)

---

5. SLO & Error Budgets


SLO Fundamentals


SLI (Service Level Indicator): Measurement of service quality
  • Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
  • Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
  • Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
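The 43.2-minute figure follows directly from the definition. A one-liner you can adapt (assumes a 30-day month):

```python
def error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    # Allowed downtime = (100% - SLO) × minutes in the period
    return (1 - slo_pct / 100) * days * 24 * 60

print(error_budget_minutes(99.9))   # ≈ 43.2 minutes/month
print(error_budget_minutes(99.99))  # ≈ 4.32 minutes/month
```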

Common SLO Targets


| Availability | Downtime/Month | Use Case |
|--------------|----------------|----------|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |

SLO Calculator


Calculate compliance, error budgets, and burn rates:

```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table

# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30

# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
```

**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)

Deep Dive: SLO/SLA


For comprehensive SLO/SLA guidance including:
  • Choosing appropriate SLIs
  • Setting realistic SLO targets
  • Error budget policies
  • Burn rate alerting
  • SLA structure and contracts
  • Monthly reporting templates
→ Read: references/slo_sla_guide.md


6. Distributed Tracing


When to Use Tracing


Use distributed tracing when you need to:
  • Debug performance issues across services
  • Understand request flow through microservices
  • Identify bottlenecks in distributed systems
  • Find N+1 query problems

OpenTelemetry Implementation


Python example:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)

    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```

Sampling Strategies


  • Development: 100% (ALWAYS_ON)
  • Staging: 50-100%
  • Production: 1-10% (or error-based sampling)
Error-based sampling (always sample errors, 1% of successes):
```python
class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, **kwargs):
        attributes = kwargs.get('attributes', {})

        if attributes.get('error', False):
            return Decision.RECORD_AND_SAMPLE

        if trace_id & 0xFF < 3:  # ~1%
            return Decision.RECORD_AND_SAMPLE

        return Decision.DROP
```
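Note that `trace_id & 0xFF < 3` keeps 3 of every 256 low-byte values, i.e. roughly 1.2% rather than exactly 1%. A quick check of the mask:

```python
# Enumerate all 256 possible low bytes and count how many the mask keeps
kept = sum(1 for trace_id in range(256) if trace_id & 0xFF < 3)
print(kept, kept / 256)
```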

OTel Collector Configuration


Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
  • Receives OTLP, Prometheus, and host metrics
  • Batching and memory limiting
  • Tail sampling (error-based, latency-based, probabilistic)
  • Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)

Deep Dive: Tracing


For comprehensive tracing guidance including:
  • OpenTelemetry instrumentation (Python, Node.js, Go, Java)
  • Span attributes and semantic conventions
  • Context propagation (W3C Trace Context)
  • Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
  • Analysis patterns (finding slow traces, N+1 queries)
  • Integration with logs
→ Read: references/tracing_guide.md


7. Datadog Cost Optimization & Migration


Scenario 1: I'm Using Datadog and Costs Are Too High


If your Datadog bill is growing out of control, start by identifying waste:

Cost Analysis Script


Automatically analyze your Datadog usage and find cost optimization opportunities:
```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY

# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
```

**What it checks**:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities

**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)

Common Cost Optimization Strategies


1. Custom Metrics Optimization (typical savings: 20-40%):
  • Remove high-cardinality tags (user IDs, request IDs)
  • Delete unused custom metrics
  • Aggregate metrics before sending
  • Use metric prefixes to identify teams/services
2. Log Management (typical savings: 30-50%):
  • Implement log sampling for high-volume services
  • Use exclusion filters for debug/trace logs in production
  • Archive cold logs to S3/GCS after 7 days
  • Set log retention policies (15 days instead of 30)
3. APM Optimization (typical savings: 15-25%):
  • Reduce trace sampling rates (10% → 5% in prod)
  • Use head-based sampling instead of complete sampling
  • Remove APM from non-critical services
  • Use trace search with lower retention
4. Infrastructure Monitoring (typical savings: 10-20%):
  • Switch from VM-based to container-based pricing where possible
  • Remove agents from ephemeral instances
  • Use Datadog's host reduction strategies
  • Consolidate staging environments
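The log-management savings are straightforward to estimate before committing. A back-of-envelope sketch (the per-GB price and volumes are placeholders; check your actual rate card):

```python
def monthly_log_cost(gb_per_day: float, price_per_gb: float = 0.10,
                     sample_rate: float = 1.0) -> float:
    # 30-day ingestion cost, scaled by the fraction of logs kept
    return gb_per_day * 30 * price_per_gb * sample_rate

before = monthly_log_cost(500)                  # ingest everything
after = monthly_log_cost(500, sample_rate=0.5)  # sample 50% of volume
print(before, after, 1 - after / before)
```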

Scenario 2: Migrating Away from Datadog


If you're considering migrating to a more cost-effective open-source stack:

Migration Overview


From Datadog to the open-source stack:
  • Metrics: Datadog → Prometheus + Grafana
  • Logs: Datadog Logs → Grafana Loki
  • Traces: Datadog APM → Tempo or Jaeger
  • Dashboards: Datadog → Grafana
  • Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for 100-host environment)

Migration Strategy


Phase 1: Run Parallel (Month 1-2):
  • Deploy open-source stack alongside Datadog
  • Migrate metrics first (lowest risk)
  • Validate data accuracy
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
  • Convert Datadog dashboards to Grafana
  • Translate alert rules (use DQL → PromQL guide below)
  • Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4):
  • Set up Loki for log aggregation
  • Deploy Tempo/Jaeger for tracing
  • Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5):
  • Confirm all functionality migrated
  • Cancel Datadog subscription

Query Translation: DQL → PromQL


When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL. Quick examples:

```
# Average CPU
Datadog:    avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})

# Request rate
Datadog:    sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))

# P95 latency
Datadog:    p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
Datadog:    (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```

**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)
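For bulk dashboard conversion, the mechanical part of these rewrites can be scripted. A toy sketch that handles only the simplest shape shown above (a single aggregator over a wildcard scope, optionally `.as_rate()`); the dots-to-underscores metric-name mapping is a naive assumption, and anything more complex should go through the full guide:

```python
import re

def dql_to_promql(q: str, window: str = "5m") -> str:
    # Match e.g. "sum:requests.count{*}.as_rate()" or "avg:system.cpu.user{*}"
    m = re.fullmatch(r"(avg|sum|min|max):([\w.]+)\{\*\}(\.as_rate\(\))?", q)
    if not m:
        raise ValueError(f"unsupported query: {q}")
    agg, metric, as_rate = m.groups()
    metric = metric.replace(".", "_")  # naive name mapping — verify per metric
    inner = f"rate({metric}[{window}])" if as_rate else metric
    return f"{agg}({inner})"

print(dql_to_promql("sum:requests.count{*}.as_rate()"))
# e.g. sum(rate(requests_count[5m]))
```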

Cost Comparison


Example: 100-host infrastructure

| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|-----------|------------------|----------------------|---------|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |

Deep Dive: Datadog Migration


For comprehensive migration guidance including:
  • Detailed cost comparison and ROI calculations
  • Step-by-step migration instructions
  • Infrastructure sizing recommendations (CPU, RAM, storage)
  • Dashboard conversion tools and examples
  • Alert rule translation patterns
  • Application instrumentation changes (DogStatsD → Prometheus client)
  • Python scripts for exporting Datadog dashboards and monitors
  • Common challenges and solutions
→ Read: references/datadog_migration.md


8. Tool Selection & Comparison


Decision Matrix


Choose Prometheus + Grafana if:
  • ✅ Using Kubernetes
  • ✅ Want control and customization
  • ✅ Have ops capacity
  • ✅ Budget-conscious
Choose Datadog if:
  • ✅ Want ease of use
  • ✅ Need full observability now
  • ✅ Budget allows ($8k+/month for 100 hosts)
Choose Grafana Stack (LGTM) if:
  • ✅ Want open source full stack
  • ✅ Cost-effective solution
  • ✅ Cloud-native architecture
Choose ELK Stack if:
  • ✅ Heavy log analysis needs
  • ✅ Need powerful search
  • ✅ Have dedicated ops team
Choose Cloud Native (CloudWatch/etc) if:
  • ✅ Single cloud provider
  • ✅ Simple needs
  • ✅ Want minimal setup

Cost Comparison (100 hosts, 1TB logs/month)


| Solution | Monthly Cost | Setup | Ops Burden |
|----------|--------------|-------|------------|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |

Deep Dive: Tool Comparison


For comprehensive tool comparison including:
  • Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
  • Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
  • Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
  • Full-stack observability comparison
  • Recommendations by company size
→ Read: references/tool_comparison.md


9. Troubleshooting & Analysis


Health Check Validation


Validate health check endpoints against best practices:

```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health

# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
```

**Checks for**:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching

**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
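On the service side, a handler that passes these checks only needs a small, predictable payload. A sketch of the response-building half (dependency probes are stubbed booleans here, and returning 503 when degraded is one common convention, not part of the checklist):

```python
import json

def build_health_response(version, deps):
    # deps: mapping of dependency name -> probe result (True = healthy)
    body = {
        "status": "ok" if all(deps.values()) else "degraded",
        "version": version,
        "dependencies": deps,
    }
    headers = {
        "Content-Type": "application/json",   # returns JSON format
        "Cache-Control": "no-store",          # disables caching
    }
    status_code = 200 if body["status"] == "ok" else 503
    return status_code, headers, json.dumps(body)

code, headers, payload = build_health_response("1.4.2", {"db": True, "cache": True})
print(code, payload)
```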

Common Troubleshooting Workflows


High Latency Investigation:
  1. Check dashboards for latency spike
  2. Query traces for slow operations
  3. Check database slow query log
  4. Check external API response times
  5. Review recent deployments
  6. Check resource utilization (CPU, memory)
High Error Rate Investigation:
  1. Check error logs for patterns
  2. Identify affected endpoints
  3. Check dependency health
  4. Review recent deployments
  5. Check resource limits
  6. Verify configuration
Service Down Investigation:
  1. Check if pods/instances are running
  2. Check health check endpoint
  3. Review recent deployments
  4. Check resource availability
  5. Check network connectivity
  6. Review logs for startup errors


Quick Reference Commands


Prometheus Queries


promql
undefined
promql
undefined

Request rate

请求速率

sum(rate(http_requests_total[5m]))
sum(rate(http_requests_total[5m]))

Error rate

错误率

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

P95 latency

P95延迟

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le) )
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le) )

CPU usage

CPU使用率

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage

内存使用率

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
undefined
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
undefined
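The P95 query above operates on cumulative histogram buckets; a simplified re-implementation of `histogram_quantile`'s linear interpolation (ignoring PromQL's edge cases) makes the mechanics concrete:

```python
def histogram_quantile(q, buckets):
    """buckets: list of (le, cumulative_count) sorted by le, last le = +inf.
    Interpolate linearly inside the bucket containing the q-th rank,
    mirroring PromQL's uniform-within-bucket assumption."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls in the open-ended bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # 0.75: interpolated inside the 0.5-1.0 bucket
```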

Kubernetes Commands

Check pod status:

```bash
kubectl get pods -n <namespace>
```

View pod logs:

```bash
kubectl logs -f <pod-name> -n <namespace>
```

Check pod resources:

```bash
kubectl top pods -n <namespace>
```

Describe pod for events:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

Check recent deployments:

```bash
kubectl rollout history deployment/<name> -n <namespace>
```

Log Queries

Elasticsearch:

```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

Loki (LogQL):

```logql
{job="app", level="error"} |= "error" | json
```

CloudWatch Insights:

```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```

Resources Summary

Scripts (automation and analysis)

  • analyze_metrics.py
    - Detect anomalies in Prometheus/CloudWatch metrics
  • alert_quality_checker.py
    - Audit alert rules against best practices
  • slo_calculator.py
    - Calculate SLO compliance and error budgets
  • log_analyzer.py
    - Parse logs for errors and patterns
  • dashboard_generator.py
    - Generate Grafana dashboards from templates
  • health_check_validator.py
    - Validate health check endpoints
  • datadog_cost_analyzer.py
    - Analyze Datadog usage and find cost waste

References (deep-dive documentation)

  • metrics_design.md
    - Four Golden Signals, RED/USE methods, metric types
  • alerting_best_practices.md
    - Alert design, runbooks, on-call practices
  • logging_guide.md
    - Structured logging, aggregation patterns
  • tracing_guide.md
    - OpenTelemetry, distributed tracing
  • slo_sla_guide.md
    - SLI/SLO/SLA definitions, error budgets
  • tool_comparison.md
    - Comprehensive comparison of monitoring tools
  • datadog_migration.md
    - Complete guide for migrating from Datadog to OSS stack
  • dql_promql_translation.md
    - Datadog Query Language to PromQL translation reference

Templates (ready-to-use configurations)

  • prometheus-alerts/webapp-alerts.yml
    - Production-ready web app alerts
  • prometheus-alerts/kubernetes-alerts.yml
    - Kubernetes monitoring alerts
  • otel-config/collector-config.yaml
    - OpenTelemetry Collector configuration
  • runbooks/incident-runbook-template.md
    - Incident response template

Best Practices

Metrics

  • Start with Four Golden Signals
  • Use appropriate metric types (counter, gauge, histogram)
  • Keep cardinality low (avoid high-cardinality labels)
  • Follow naming conventions
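"Keep cardinality low" is about multiplication: the number of time series one metric creates is the product of its label value counts, so a single unbounded label (user ID, request ID) dominates everything else. A back-of-the-envelope check:

```python
def series_count(*label_cardinalities):
    """Worst-case time-series count for one metric: the product of the
    number of distinct values each label can take."""
    n = 1
    for c in label_cardinalities:
        n *= c
    return n

# method x status x endpoint: manageable
assert series_count(5, 8, 40) == 1_600
# adding a user-id label multiplies by every user ever seen
assert series_count(5, 8, 40, 100_000) == 160_000_000
```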

Logging

  • Use structured logging (JSON)
  • Include request IDs for tracing
  • Set appropriate log levels
  • Redact PII before logging
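The first three bullets combine naturally in one formatter; a minimal sketch using Python's stdlib `logging` (the email-only redaction is illustrative — real PII scrubbing needs more patterns):

```python
import json
import logging
import re

class JsonFormatter(logging.Formatter):
    """One JSON object per line, with a request ID and basic PII redaction."""
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": self.EMAIL.sub("<redacted>", record.getMessage()),
            "request_id": getattr(record, "request_id", None),
        })
```

Attach it with `handler.setFormatter(JsonFormatter())` and pass `extra={"request_id": ...}` on each log call so the ID rides along on the record.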

Alerting

  • Make every alert actionable
  • Alert on symptoms, not causes
  • Use multi-window burn rate alerts
  • Include runbook links
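"Multi-window burn rate" compares the observed error rate to the rate that would spend the budget exactly on schedule; a sketch of the arithmetic (the 14.4x/1h paging threshold follows the Google SRE Workbook convention):

```python
def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' the error budget
    is being spent; 1.0 means the budget lasts the whole SLO window."""
    return observed_error_rate / (1 - slo)

# 99.9% SLO leaves a 0.1% budget; 1% observed errors burn it 10x too fast
assert abs(burn_rate(0.01, 0.999) - 10) < 1e-6

# common paging threshold: a 14.4x burn over 1h spends 2% of a 30-day budget
assert abs(14.4 / (30 * 24) - 0.02) < 1e-9
```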

Tracing

  • Sample appropriately (1-10% in production)
  • Always record errors
  • Use semantic conventions
  • Propagate context between services
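The first two tracing bullets interact: uniform head sampling at 1-10% would drop most error traces, so errors are typically forced through. A deterministic sketch (hashing on the trace ID keeps the decision consistent across services in the same call chain):

```python
def should_sample(trace_id: str, rate: float = 0.05, is_error: bool = False) -> bool:
    """Always keep error traces; sample the rest at `rate`, deterministically
    by hex trace ID so every service makes the same decision."""
    if is_error:
        return True
    # bucket 0-9999 derived from the ID; slightly non-uniform, fine for a sketch
    return int(trace_id, 16) % 10_000 < rate * 10_000
```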

SLOs

  • Start with current performance
  • Set realistic targets
  • Define error budget policies
  • Review and adjust quarterly
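"Define error budget policies" starts from knowing what the budget actually buys; the arithmetic, as a sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days ~= 43.2 minutes of downtime
assert abs(error_budget_minutes(0.999) - 43.2) < 0.01

# 99.99% leaves only ~4.3 minutes: a single slow rollback can spend it
assert abs(error_budget_minutes(0.9999) - 4.32) < 0.01
```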