
Datadog Observability

Overview

Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.
Core Capabilities:
  • APM: Distributed tracing with automatic instrumentation for 8+ languages
  • Infrastructure: Host, container, and cloud service monitoring
  • Logs: Centralized collection with processing pipelines and 15-month retention
  • Metrics: Custom metrics via DogStatsD with cardinality management
  • Synthetics: Proactive API and browser testing from 29+ global locations
  • RUM: Frontend performance with Core Web Vitals and session replay

When to Use This Skill

Activate when:
  • Setting up production monitoring and observability
  • Implementing distributed tracing across microservices
  • Configuring log aggregation and analysis pipelines
  • Creating custom metrics and dashboards
  • Setting up alerting and anomaly detection
  • Optimizing Datadog costs
Do not use when:
  • Building with open-source stack (use Prometheus/Grafana instead)
  • Cost is primary concern and budget is limited
  • Need maximum customization over managed solution

Quick Start

1. Install Datadog Agent
Docker (simplest):

```bash
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
```

Kubernetes (Helm):

```bash
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true
```

2. Instrument Your Application

Python:

```python
from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries
patch_all()

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here
```

Node.js:

```javascript
// Must be first import
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});
```

3. Verify in Datadog UI

  1. Go to Infrastructure > Host Map to verify agent
  2. Go to APM > Services to see traced services
  3. Go to Logs > Search to verify log collection

Core Concepts

Tagging Strategy

Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.
Required Tags:

| Tag | Purpose | Example |
|-----|---------|---------|
| env | Environment | env:production |
| service | Service name | service:api-gateway |
| version | Deployment version | version:1.2.3 |
| team | Owning team | team:platform |

Avoid High-Cardinality Tags:
  • User IDs, request IDs, timestamps
  • Pod IDs in Kubernetes
  • Build numbers, commit hashes
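The tagging rules above can be sketched as a small validation helper to run before emitting telemetry. This is pure Python with no Datadog dependency, and the exact required/banned key sets are illustrative assumptions, not a Datadog API:

```python
# Enforce the tagging policy: required low-cardinality keys must be present,
# known high-cardinality keys are rejected before a metric is emitted.
REQUIRED_KEYS = {"env", "service", "version", "team"}
HIGH_CARDINALITY_KEYS = {"user_id", "request_id", "pod_id", "commit"}  # assumed naming

def validate_tags(tags):
    """Return the tag list unchanged if it satisfies the policy, else raise."""
    keys = {t.split(":", 1)[0] for t in tags}
    missing = REQUIRED_KEYS - keys
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    banned = keys & HIGH_CARDINALITY_KEYS
    if banned:
        raise ValueError(f"high-cardinality tags not allowed: {sorted(banned)}")
    return tags

tags = validate_tags([
    "env:production", "service:api-gateway",
    "version:1.2.3", "team:platform",
])
```

A check like this in a shared telemetry wrapper catches cardinality mistakes at development time instead of on the next Datadog bill.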

Unified Observability

Datadog correlates metrics, traces, and logs automatically:
  • Traces include span tags that link to metrics
  • Logs inject trace IDs for correlation
  • Dashboards combine all data sources
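The log-to-trace correlation can be illustrated with a minimal sketch of the mechanism that Datadog's log-injection feature automates: stamping every log record with the active trace ID. The hard-coded ID below is a stand-in for the value a real tracer would supply:

```python
# Sketch of trace-ID log injection: a logging.Filter adds the current
# trace ID to each record so logs can later be joined with traces.
import logging

CURRENT_TRACE_ID = "abc123"  # stand-in for the active span's trace ID

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s trace_id=%(trace_id)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())

logger.warning("payment failed")  # emits: payment failed trace_id=abc123
```

With ddtrace, enabling the built-in log injection does this automatically and the trace ID matches the span visible in APM.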

Best Practices

Start Simple

  1. Install Agent with basic configuration
  2. Enable automatic instrumentation
  3. Verify data in Datadog UI
  4. Add custom spans/metrics as needed

Progressive Enhancement

Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM

Key Instrumentation Points

  • HTTP entry/exit points
  • Database queries
  • External service calls
  • Message queue operations
  • Business-critical flows
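The pattern behind all of these points is the same: wrap the call site in a span. As a dependency-free sketch (this mimics, but is not, the ddtrace API), a span-like context manager records name, tags, duration, and error status:

```python
# Sketch of the instrumentation pattern: wrap each call site in a
# span-like context manager that records what an APM span captures.
import time
from contextlib import contextmanager

SPANS = []  # collected span records; a real tracer would ship these to the Agent

@contextmanager
def span(name, **tags):
    start = time.monotonic()
    record = {"name": name, "tags": tags, "error": False}
    try:
        yield record
    except Exception:
        record["error"] = True
        raise
    finally:
        record["duration_s"] = time.monotonic() - start
        SPANS.append(record)

# e.g. an external service call
with span("http.request", method="GET", host="api.example.com"):
    time.sleep(0.01)  # stand-in for the actual call
```

With ddtrace the equivalent is `tracer.trace(...)` as shown in the Quick Start; the value of choosing these particular points is that they sit at trust boundaries, where latency and errors originate.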

Common Mistakes

  1. High-cardinality tags: Using user IDs or request IDs as tags creates millions of unique metrics
  2. Missing log index quotas: Leads to unexpected bills from log volume spikes
  3. Over-alerting: Creates alert fatigue; alert on symptoms, not causes
  4. Missing service tags: Prevents correlation between metrics, traces, and logs
  5. No sampling for high-volume traces: Ingests everything, causing cost explosion
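For mistake 5, one common remedy is head-based sampling configured on the tracer via environment variables (shown here for ddtrace; the 10% rate and 100/s cap are illustrative values, not recommendations):

```shell
export DD_TRACE_SAMPLE_RATE=0.1   # keep roughly 10% of traces at the head
export DD_TRACE_RATE_LIMIT=100    # hard cap on traces/sec per service instance
# then start the service under the tracer, e.g.:
# ddtrace-run python app.py
```

Datadog also applies retention filters server-side, so a sampled-down stream still surfaces errors and high-latency traces.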

Navigation

For detailed implementation:
  • Agent Installation: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
  • APM Instrumentation: Python, Node.js, Go, Java instrumentation with code examples
  • Log Management: Pipelines, Grok parsing, standard attributes, archives
  • Custom Metrics: DogStatsD patterns, metric types, tagging best practices
  • Alerting: Monitor types, anomaly detection, alert hygiene
  • Cost Optimization: Metrics without Limits, sampling, index quotas
  • Kubernetes: DaemonSet, Cluster Agent, autodiscovery

Complementary Skills

When using this skill, consider these related skills (if deployed):
  • docker: Container instrumentation patterns
  • kubernetes: K8s-native monitoring patterns
  • python/nodejs/go: Language-specific APM setup

Resources
