monitoring-observability
Monitoring & Observability
Overview
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to open-source stack
Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
   ├─ YES → Go to "9. Troubleshooting & Analysis"
   └─ NO → Are you improving existing monitoring?
      ├─ Alerts → Go to "3. Alert Design"
      ├─ Dashboards → Go to "4. Dashboard & Visualization"
      ├─ SLOs → Go to "5. SLO & Error Budgets"
      ├─ Tool selection → Read references/tool_comparison.md
      └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"

1. Design Metrics Strategy
Start with The Four Golden Signals
Every service should monitor:
- Latency: Response time (p50, p95, p99)
- Traffic: Requests per second
- Errors: Failure rate
- Saturation: Resource utilization
For request-driven services, use the RED Method:
- Rate: Requests/sec
- Errors: Error rate
- Duration: Response time
For infrastructure resources, use the USE Method:
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Quick Start - Web Application Example:
```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```

Deep Dive: Metric Design
For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary)
- Cardinality best practices
- Naming conventions
- Dashboard design principles
→ Read: references/metrics_design.md
Automated Metric Analysis
Detect anomalies and trends in your metrics:
```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24

# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
```

**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)

---

2. Log Aggregation & Analysis
Structured Logging Checklist
Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)
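These fields can be emitted with a small JSON formatter — a sketch using only Python's standard `logging` and `json` modules; the service name and extra field names are illustrative:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with the required fields."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
            "request_id": getattr(record, "request_id", None),
        }
        # Merge any extra structured fields attached via `extra=`
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"request_id": str(uuid.uuid4()),
                    "fields": {"order_id": "ORD-456", "duration_ms": 5000}})
```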
Example structured log (JSON):
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```

Log Aggregation Patterns
ELK Stack (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High
Grafana Loki:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium
CloudWatch Logs:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low
Log Analysis
Analyze logs for errors, patterns, and anomalies:
```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log

# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors

# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```

**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)

Deep Dive: Logging
For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting
→ Read: references/logging_guide.md
3. Alert Design
Alert Design Principles
- Every alert must be actionable - If you can't do something, don't alert
- Alert on symptoms, not causes - Alert on user experience, not components
- Tie alerts to SLOs - Connect to business impact
- Reduce noise - Only page for critical issues
Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Multi-Window Burn Rate Alerting
Alert when error budget is consumed too quickly:
```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
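The multipliers (14.4 and 6) are burn rates: how many times faster than the SLO allows the error budget is being spent. A quick arithmetic check, assuming a 30-day budget period:

```python
def budget_consumed(burn_rate, window_hours, period_days=30):
    """Fraction of the period's error budget used if this burn rate holds for the window."""
    return burn_rate * window_hours / (period_days * 24)

# A 14.4x burn sustained for 1h consumes ~2% of a 30-day error budget
print(budget_consumed(14.4, 1))
# A 6x burn sustained for 6h consumes ~5%
print(budget_consumed(6, 6))
```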
Alert Quality Checker
Audit your alert rules against best practices:
```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml

# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```

**Checks for**:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping

**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)

Alert Templates
Production-ready alert rule templates:
→ Templates:
- assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
- assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts
Deep Dive: Alerting
For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices
→ Read: references/alerting_best_practices.md
Runbook Template
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
4. Dashboard & Visualization
Dashboard Design Principles
- Top-down layout: Most important metrics first
- Color coding: Red (critical), yellow (warning), green (healthy)
- Consistent time windows: All panels use same time range
- Limit panels: 8-12 panels per dashboard maximum
- Include context: Show related metrics together
Recommended Dashboard Structure
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘

Generate Grafana Dashboards
Automatically generate dashboards from templates:
```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json

# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json

# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
```

**Supports**:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)

**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)

---

5. SLO & Error Budgets
SLO Fundamentals
SLI (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
Common SLO Targets
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
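The downtime figures in this table follow directly from the error budget formula; a quick check in Python:

```python
def downtime_budget_minutes(slo_percent, period_days=30):
    """Allowed downtime per period for an availability SLO."""
    return (1 - slo_percent / 100) * period_days * 24 * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    # 99.0% -> 432.0 min (7.2 h), 99.9% -> 43.2 min, 99.95% -> 21.6 min, 99.99% -> ~4.3 min
    print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min/month")
```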
SLO Calculator
Calculate compliance, error budgets, and burn rates:
```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table

# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30

# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
```

**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)

Deep Dive: SLO/SLA
For comprehensive SLO/SLA guidance including:
- Choosing appropriate SLIs
- Setting realistic SLO targets
- Error budget policies
- Burn rate alerting
- SLA structure and contracts
- Monthly reporting templates
→ Read: references/slo_sla_guide.md
6. Distributed Tracing
When to Use Tracing
Use distributed tracing when you need to:
- Debug performance issues across services
- Understand request flow through microservices
- Identify bottlenecks in distributed systems
- Find N+1 query problems
OpenTelemetry Implementation
Python example:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```

Sampling Strategies
- Development: 100% (ALWAYS_ON)
- Staging: 50-100%
- Production: 1-10% (or error-based sampling)
Error-based sampling (always sample errors, 1% of successes):
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, **kwargs):
        attributes = kwargs.get('attributes', {})
        # Always keep traces flagged as errors
        if attributes.get('error', False):
            return Decision.RECORD_AND_SAMPLE
        # Keep ~1% of the rest (3/256 of trace IDs)
        if trace_id & 0xFF < 3:
            return Decision.RECORD_AND_SAMPLE
        return Decision.DROP
```

OTel Collector Configuration
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
- Receives OTLP, Prometheus, and host metrics
- Batching and memory limiting
- Tail sampling (error-based, latency-based, probabilistic)
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
Deep Dive: Tracing
For comprehensive tracing guidance including:
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- Span attributes and semantic conventions
- Context propagation (W3C Trace Context)
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- Analysis patterns (finding slow traces, N+1 queries)
- Integration with logs
→ Read: references/tracing_guide.md
7. Datadog Cost Optimization & Migration
Scenario 1: I'm Using Datadog and Costs Are Too High
If your Datadog bill is growing out of control, start by identifying waste:
Cost Analysis Script
Automatically analyze your Datadog usage and find cost optimization opportunities:
```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY

# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
```

**What it checks**:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities

**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)

Common Cost Optimization Strategies
1. Custom Metrics Optimization (typical savings: 20-40%):
- Remove high-cardinality tags (user IDs, request IDs)
- Delete unused custom metrics
- Aggregate metrics before sending
- Use metric prefixes to identify teams/services
2. Log Management (typical savings: 30-50%):
- Implement log sampling for high-volume services
- Use exclusion filters for debug/trace logs in production
- Archive cold logs to S3/GCS after 7 days
- Set log retention policies (15 days instead of 30)
3. APM Optimization (typical savings: 15-25%):
- Reduce trace sampling rates (10% → 5% in prod)
- Use head-based sampling instead of complete sampling
- Remove APM from non-critical services
- Use trace search with lower retention
4. Infrastructure Monitoring (typical savings: 10-20%):
- Switch from VM-based to container-based pricing where possible
- Remove agents from ephemeral instances
- Use Datadog's host reduction strategies
- Consolidate staging environments
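As one illustration of log sampling (strategy 2), a small filter can keep every WARN-and-above record while shipping only a fraction of INFO/DEBUG noise — a sketch using Python's standard `logging` module; the 5% rate is an assumption to tune per service:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARN+ records, but only a sampled fraction of lower-severity noise."""
    def __init__(self, sample_rate=0.05):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.sample_rate

logger = logging.getLogger("high-volume-service")
logger.addFilter(SamplingFilter(sample_rate=0.05))  # ship ~5% of INFO logs
```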
Scenario 2: Migrating Away from Datadog
If you're considering migrating to a more cost-effective open-source stack:
Migration Overview
From Datadog → To Open Source Stack:
- Metrics: Datadog → Prometheus + Grafana
- Logs: Datadog Logs → Grafana Loki
- Traces: Datadog APM → Tempo or Jaeger
- Dashboards: Datadog → Grafana
- Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for 100-host environment)
Migration Strategy
Phase 1: Run Parallel (Month 1-2):
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
- Convert Datadog dashboards to Grafana
- Translate alert rules (use DQL → PromQL guide below)
- Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4):
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5):
- Confirm all functionality migrated
- Cancel Datadog subscription
Query Translation: DQL → PromQL
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
```
# Average CPU
Datadog:    avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})

# Request rate
Datadog:    sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))

# P95 latency
Datadog:    p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
Datadog:    (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```

**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)

Cost Comparison
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
Deep Dive: Datadog Migration
For comprehensive migration guidance including:
- Detailed cost comparison and ROI calculations
- Step-by-step migration instructions
- Infrastructure sizing recommendations (CPU, RAM, storage)
- Dashboard conversion tools and examples
- Alert rule translation patterns
- Application instrumentation changes (DogStatsD → Prometheus client)
- Python scripts for exporting Datadog dashboards and monitors
- Common challenges and solutions
→ Read: references/datadog_migration.md
8. Tool Selection & Comparison
Decision Matrix
Choose Prometheus + Grafana if:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
Choose Datadog if:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
Choose Grafana Stack (LGTM) if:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
Choose ELK Stack if:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
Choose Cloud Native (CloudWatch/etc) if:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
Cost Comparison (100 hosts, 1TB logs/month)
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
Deep Dive: Tool Comparison
For comprehensive tool comparison including:
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- Full-stack observability comparison
- Recommendations by company size
→ Read: references/tool_comparison.md
9. Troubleshooting & Analysis
Health Check Validation
Validate health check endpoints against best practices:
```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health

# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
```

**Checks for**:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching

**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
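A health endpoint that passes all of these checks can be quite small. A sketch using only Python's standard library — the version string and dependency statuses are placeholders for real build info and real checks:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({
            "status": "ok",
            "version": "1.2.3",  # placeholder build info
            "uptime_s": round(time.time() - START, 1),
            "dependencies": {"database": "ok", "cache": "ok"},  # placeholder checks
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Cache-Control", "no-store")  # disable caching
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

# To serve standalone: HTTPServer(("", 8080), HealthHandler).serve_forever()
```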
Common Troubleshooting Workflows
High Latency Investigation:
- Check dashboards for latency spike
- Query traces for slow operations
- Check database slow query log
- Check external API response times
- Review recent deployments
- Check resource utilization (CPU, memory)
High Error Rate Investigation:
- Check error logs for patterns
- Identify affected endpoints
- Check dependency health
- Review recent deployments
- Check resource limits
- Verify configuration
Service Down Investigation:
- Check if pods/instances are running
- Check health check endpoint
- Review recent deployments
- Check resource availability
- Check network connectivity
- Review logs for startup errors
Quick Reference Commands
Prometheus Queries
```promql
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
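The P95 query relies on `histogram_quantile`, which estimates the quantile by linear interpolation inside the bucket containing the requested rank. A minimal Python sketch of that estimation (cumulative buckets as `(upper_bound, count)` pairs), a simplification of what Prometheus actually does:

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets, Prometheus-style.

    buckets: sorted list of (upper_bound, cumulative_count), last bound = +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open +Inf bucket
            # linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

p95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)])  # 0.75
```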
Kubernetes Commands
```bash
# Check pod status
kubectl get pods -n <namespace>

# View pod logs
kubectl logs -f <pod-name> -n <namespace>

# Check pod resources
kubectl top pods -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
```
Log Queries
Elasticsearch:
```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

Loki (LogQL):
```logql
{job="app", level="error"} |= "error" | json
```

CloudWatch Insights:
```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```

Resources Summary
Scripts (automation and analysis)
- analyze_metrics.py - Detect anomalies in Prometheus/CloudWatch metrics
- alert_quality_checker.py - Audit alert rules against best practices
- slo_calculator.py - Calculate SLO compliance and error budgets
- log_analyzer.py - Parse logs for errors and patterns
- dashboard_generator.py - Generate Grafana dashboards from templates
- health_check_validator.py - Validate health check endpoints
- datadog_cost_analyzer.py - Analyze Datadog usage and find cost waste
References (deep-dive documentation)
- metrics_design.md - Four Golden Signals, RED/USE methods, metric types
- alerting_best_practices.md - Alert design, runbooks, on-call practices
- logging_guide.md - Structured logging, aggregation patterns
- tracing_guide.md - OpenTelemetry, distributed tracing
- slo_sla_guide.md - SLI/SLO/SLA definitions, error budgets
- tool_comparison.md - Comprehensive comparison of monitoring tools
- datadog_migration.md - Complete guide for migrating from Datadog to an OSS stack
- dql_promql_translation.md - Datadog Query Language to PromQL translation reference
Templates (ready-to-use configurations)
- prometheus-alerts/webapp-alerts.yml - Production-ready web app alerts
- prometheus-alerts/kubernetes-alerts.yml - Kubernetes monitoring alerts
- otel-config/collector-config.yaml - OpenTelemetry Collector configuration
- runbooks/incident-runbook-template.md - Incident response template
Best Practices
Metrics
- Start with Four Golden Signals
- Use appropriate metric types (counter, gauge, histogram)
- Keep cardinality low (avoid high-cardinality labels)
- Follow naming conventions
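On the cardinality point: a metric produces one time series per distinct label combination, so the series count is the product of each label's cardinality, and one unbounded label (user IDs, request IDs) multiplies everything else. A quick illustration with made-up labels:

```python
def series_count(label_values):
    """Worst-case time series for one metric: product of label cardinalities."""
    n = 1
    for values in label_values.values():
        n *= len(values)
    return n

low = series_count({"method": ["GET", "POST"], "status": ["200", "404", "500"]})               # 6
high = series_count({"method": ["GET", "POST"], "user_id": [str(i) for i in range(100_000)]})  # 200,000
```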
Logging
- Use structured logging (JSON)
- Include request IDs for tracing
- Set appropriate log levels
- Redact PII before logging
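A minimal sketch combining the four points (the single email regex here stands in for real PII redaction, which needs more patterns):

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def log_event(level, message, request_id, **fields):
    """Emit one structured JSON log line, redacting email-shaped values."""
    redacted = {k: EMAIL.sub("[REDACTED]", v) if isinstance(v, str) else v
                for k, v in fields.items()}
    return json.dumps({"level": level, "message": message,
                       "request_id": request_id, **redacted})

line = log_event("error", "payment failed", "req-123", customer="alice@example.com")
```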
Alerting
- Make every alert actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts
- Include runbook links
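The multi-window burn rate idea in code terms: burn rate is how many times faster than allowed the error budget is being spent, and a page fires only when both a long and a short window exceed the threshold, so the alert stops firing once the problem is fixed. A sketch assuming a 99.9% SLO (the 14.4x-over-1h threshold is the common SRE-workbook convention, which burns about 2% of a 30-day budget):

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def page_worthy(err_1h, err_5m, slo_target=0.999, threshold=14.4):
    """Page only if both windows burn fast: the long window for significance,
    the short window so the alert clears quickly after recovery."""
    return (burn_rate(err_1h, slo_target) >= threshold
            and burn_rate(err_5m, slo_target) >= threshold)
```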
Tracing
- Sample appropriately (1-10% in production)
- Always record errors
- Use semantic conventions
- Propagate context between services
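One way to get a stable 1-10% head-sampling rate is to hash the trace ID, so every service in the call path makes the same keep/drop decision without coordination. A sketch of the idea (not the OpenTelemetry SDK's actual sampler):

```python
import hashlib

def keep_trace(trace_id: str, rate_pct: float = 5.0, is_error: bool = False) -> bool:
    """Always keep errors; otherwise keep a stable rate_pct% of traces by ID hash."""
    if is_error:
        return True
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate_pct * 100   # rate_pct% of the 0..9999 bucket space
```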
SLOs
- Start with current performance
- Set realistic targets
- Define error budget policies
- Review and adjust quarterly
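The arithmetic behind error budgets is worth keeping at hand: a 99.9% SLO over a 30-day window allows 0.1% of 43,200 minutes, about 43.2 minutes of downtime. A small sketch:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the window."""
    return (1.0 - slo_target) * window_days * 24 * 60

def budget_remaining(slo_target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = budget blown)."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo_target, window_days)
```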