cloudwatch

AWS CloudWatch

Amazon CloudWatch provides monitoring and observability for AWS resources and applications. It collects metrics, logs, and events, enabling you to monitor, troubleshoot, and optimize your AWS environment.

Core Concepts

Metrics

Time-ordered data points published to CloudWatch. Key components:
  • Namespace: Container for metrics (e.g., `AWS/Lambda`)
  • Metric name: Name of the measurement (e.g., `Invocations`)
  • Dimensions: Name-value pairs for filtering (e.g., `FunctionName=MyFunc`)
  • Statistics: Aggregations over a period (Sum, Average, Min, Max, SampleCount, and percentiles such as p99)
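
To make these pieces concrete, the sketch below models a metric's identity as namespace, name, and dimensions, and computes the basic statistics over a few data points. It is plain Python with made-up sample values; CloudWatch computes these aggregations itself for each period, and percentiles are left out to keep the example small.

```python
from statistics import mean

# Identity of one metric: namespace + metric name + dimensions.
metric = {
    "Namespace": "AWS/Lambda",
    "MetricName": "Invocations",
    "Dimensions": {"FunctionName": "MyFunc"},
}


def summarize(values):
    """Compute the basic statistics for the data points of one period."""
    return {
        "SampleCount": len(values),
        "Sum": sum(values),
        "Average": mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }


# Made-up data points reported during a single period.
stats = summarize([3, 5, 4, 8])
print(metric["MetricName"], stats)
```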

Logs

Log data from AWS services and applications:
  • Log groups: Collections of log streams
  • Log streams: Sequences of log events from the same source
  • Log events: Individual log entries, each with a timestamp and a message

Alarms

Automated actions based on metric thresholds:
  • States: OK, ALARM, INSUFFICIENT_DATA
  • Actions: SNS notifications, Auto Scaling, EC2 actions
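
The three states can be illustrated with a deliberately simplified evaluation function. This is only a sketch of the idea for a `GreaterThanThreshold` alarm: it assumes every one of the most recent periods must breach, and it treats any missing period as insufficient data. Real alarms also support M-out-of-N evaluation and configurable missing-data handling, neither of which is modeled here.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Simplified GreaterThanThreshold evaluation over the most recent periods.

    datapoints: one aggregated value per period, oldest first;
    None marks a period with no data.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in recent):
        return "ALARM"
    return "OK"


print(evaluate_alarm([70, 85, 90], threshold=80, evaluation_periods=2))  # ALARM
print(evaluate_alarm([85, 90, 75], threshold=80, evaluation_periods=2))  # OK
print(evaluate_alarm([85, None], threshold=80, evaluation_periods=2))    # INSUFFICIENT_DATA
```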

Common Patterns

Create a Metric Alarm

**AWS CLI:**

```bash
# CPU utilization alarm for EC2
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-i-1234567890abcdef0" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:alerts
```

**boto3:**

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='HighCPU-i-1234567890abcdef0',
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Statistic='Average',
    Period=300,
    Threshold=80.0,
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=2,
    Dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts'],
    OKActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)
```

Lambda Error Rate Alarm

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "LambdaErrorRate-MyFunction" \
  --metrics '[
    {
      "Id": "errors",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Errors",
          "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "invocations",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Invocations",
          "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "errorRate",
      "Expression": "errors/invocations*100",
      "Label": "Error Rate",
      "ReturnData": true
    }
  ]' \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```
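
A boto3 version of the same metric math alarm might look like the sketch below. The helper names (`error_rate_alarm_params`, `create_alarm`) are made up for illustration; the parameters mirror the CLI call, and `boto3` is imported only inside `create_alarm` so the request can be built and inspected without AWS credentials.

```python
def error_rate_alarm_params(function_name, topic_arn, threshold=5.0):
    """Build put_metric_alarm keyword arguments for an errors/invocations alarm."""

    def lambda_sum(metric_id, metric_name):
        # One input metric: per-minute Sum for the given Lambda function.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"LambdaErrorRate-{function_name}",
        "Metrics": [
            lambda_sum("errors", "Errors"),
            lambda_sum("invocations", "Invocations"),
            {
                # Only the expression returns data, so the alarm watches it.
                "Id": "errorRate",
                "Expression": "errors/invocations*100",
                "Label": "Error Rate",
                "ReturnData": True,
            },
        ],
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 3,
        "AlarmActions": [topic_arn],
    }


def create_alarm(params):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


params = error_rate_alarm_params(
    "MyFunction", "arn:aws:sns:us-east-1:123456789012:alerts"
)
print(params["AlarmName"])
```

Calling `create_alarm(params)` would then create or update the alarm.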

Query Logs with Insights

```bash
# Find errors in Lambda logs
# (date -d is GNU date syntax; on macOS/BSD use: date -v-1H +%s)
aws logs start-query \
  --log-group-name /aws/lambda/MyFunction \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 50
  '

# Get query results
aws logs get-query-results --query-id <query-id>
```

**boto3:**

```python
import boto3
import time

logs = boto3.client('logs')

# Start query over the last hour
now = int(time.time())
response = logs.start_query(
    logGroupName='/aws/lambda/MyFunction',
    startTime=now - 3600,
    endTime=now,
    queryString='''
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 50
    '''
)
query_id = response['queryId']

# Wait for results; stop on any terminal status, not only 'Complete',
# so a failed or cancelled query does not loop forever
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] not in ('Scheduled', 'Running'):
        break
    time.sleep(1)

if result['status'] != 'Complete':
    raise RuntimeError(f"Query ended with status {result['status']}")

for row in result['results']:
    print(row)
```
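
Each row in `result['results']` is a list of `{'field': ..., 'value': ...}` pairs, which is awkward to work with directly. A small helper (hypothetical name `rows_to_dicts`) can flatten the rows into plain dictionaries; the sample data below is hand-written to match that shape and is not real query output.

```python
def rows_to_dicts(results):
    """Convert Logs Insights rows ([{field, value}, ...]) into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


# Hand-written sample shaped like result['results'].
sample = [
    [
        {"field": "@timestamp", "value": "2024-01-01 12:00:00.000"},
        {"field": "@message", "value": "ERROR something failed"},
    ],
]

for record in rows_to_dicts(sample):
    print(record["@timestamp"], record["@message"])
```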

Create Metric Filter

Extract metrics from log patterns:

```bash
# Create metric filter for error count
aws logs put-metric-filter \
  --log-group-name /aws/lambda/MyFunction \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0
```
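
The same filter can be created from Python. The sketch below builds the arguments for the boto3 `put_metric_filter` call with the values from the CLI example; the helper names are made up for illustration, and `boto3` is imported lazily so the arguments can be checked without AWS access. Note that `metricValue` is passed as a string while `defaultValue` is a number.

```python
def error_count_filter_params(log_group_name):
    """Build put_metric_filter keyword arguments for an ERROR counter."""
    return {
        "logGroupName": log_group_name,
        "filterName": "ErrorCount",
        "filterPattern": "ERROR",
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": "MyApp",
                "metricValue": "1",   # emitted for each matching log event
                "defaultValue": 0.0,  # emitted when nothing matches
            }
        ],
    }


def create_filter(params):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3

    boto3.client("logs").put_metric_filter(**params)


params = error_count_filter_params("/aws/lambda/MyFunction")
print(params["filterName"])
```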

Publish Custom Metrics

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Value': 1,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'Production'},
                {'Name': 'OrderType', 'Value': 'Standard'}
            ]
        }
    ]
)
```

Create Dashboard

```bash
cat > dashboard.json << 'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Lambda Invocations",
        "metrics": [
          ["AWS/Lambda", "Invocations", "FunctionName", "MyFunction"]
        ],
        "period": 60,
        "stat": "Sum",
        "region": "us-east-1"
      }
    },
    {
      "type": "log",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/lambda/MyFunction' | filter @message like /ERROR/ | limit 20",
        "region": "us-east-1"
      }
    }
  ]
}
EOF

aws cloudwatch put-dashboard \
  --dashboard-name MyAppDashboard \
  --dashboard-body file://dashboard.json
```
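
When the dashboard should be generated per function rather than kept as a static file, the body can be built in Python and passed to boto3's `put_dashboard`. This is a sketch under the same widget layout as above; `lambda_dashboard_body` and `put_dashboard` are hypothetical helper names.

```python
import json


def lambda_dashboard_body(function_name, region="us-east-1"):
    """Return the dashboard JSON (as a string) for one Lambda function."""
    widgets = [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda Invocations",
                "metrics": [
                    ["AWS/Lambda", "Invocations", "FunctionName", function_name]
                ],
                "period": 60,
                "stat": "Sum",
                "region": region,
            },
        },
        {
            "type": "log",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Recent Errors",
                "query": (
                    f"SOURCE '/aws/lambda/{function_name}' "
                    "| filter @message like /ERROR/ | limit 20"
                ),
                "region": region,
            },
        },
    ]
    return json.dumps({"widgets": widgets})


def put_dashboard(name, body):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)


body = lambda_dashboard_body("MyFunction")
print(body[:60])
```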

CLI Reference

Metrics Commands

| Command | Description |
|---------|-------------|
| `aws cloudwatch put-metric-data` | Publish custom metrics |
| `aws cloudwatch get-metric-data` | Retrieve metric values |
| `aws cloudwatch get-metric-statistics` | Get aggregated statistics |
| `aws cloudwatch list-metrics` | List available metrics |

Alarms Commands

| Command | Description |
|---------|-------------|
| `aws cloudwatch put-metric-alarm` | Create or update alarm |
| `aws cloudwatch describe-alarms` | List alarms |
| `aws cloudwatch set-alarm-state` | Manually set alarm state |
| `aws cloudwatch delete-alarms` | Delete alarms |

Logs Commands

| Command | Description |
|---------|-------------|
| `aws logs create-log-group` | Create log group |
| `aws logs put-log-events` | Write log events |
| `aws logs filter-log-events` | Search log events |
| `aws logs start-query` | Start Insights query |
| `aws logs put-metric-filter` | Create metric filter |
| `aws logs put-retention-policy` | Set log retention |

Best Practices

Metrics

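  • Use dimensions wisely: each unique combination of dimension values is a separate metric, so too many dimensions cause a metric explosion
  • Aggregate before publishing: batch custom metrics instead of sending one request per data point
  • Use high-resolution metrics (1-second) only when needed
  • Set meaningful units for custom metrics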
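
One way to aggregate before publishing is to collect observations locally and send them as a single statistic set through the `StatisticValues` field of `put_metric_data`. The sketch below assumes a hypothetical `RequestLatency` metric and made-up observations; the helper names are illustrative, and `boto3` is imported lazily so the aggregation can be tried offline.

```python
def to_statistic_set(metric_name, values, unit="Count"):
    """Collapse raw observations into one MetricData entry with StatisticValues."""
    if not values:
        raise ValueError("need at least one observation")
    return {
        "MetricName": metric_name,
        "Unit": unit,
        "StatisticValues": {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
        },
    }


def publish(entries, namespace="MyApp"):
    """Send all entries in one request; needs boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=entries)


# Made-up latency observations collected locally over one interval.
entry = to_statistic_set("RequestLatency", [120, 95, 310, 101], unit="Milliseconds")
print(entry)
```

Four observations become one entry here, so `publish([entry])` would make a single API call instead of four.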

Alarms

  • Use composite alarms for complex conditions
  • Set appropriate evaluation periods to avoid flapping
  • Include OK actions to track recovery
  • Use anomaly detection for dynamic thresholds
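
As an illustration of the composite alarm suggestion, the sketch below combines the two alarms defined earlier into one rule that fires only when both are in ALARM. The composite alarm name `ServiceDegraded` and the helper names are made up; the rule string uses the `ALARM("name")` form accepted by boto3's `put_composite_alarm`.

```python
def all_in_alarm(alarm_names):
    """Build an AlarmRule that fires only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite(rule, topic_arn, name="ServiceDegraded"):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_composite_alarm(
        AlarmName=name, AlarmRule=rule, AlarmActions=[topic_arn]
    )


rule = all_in_alarm(["HighCPU-i-1234567890abcdef0", "LambdaErrorRate-MyFunction"])
print(rule)
```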

Logs

  • Set retention policies: log groups keep events indefinitely unless a retention period is set
  • Use structured logging (JSON) for better querying
  • Create metric filters for key events
  • Use Contributor Insights for top-N analysis
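
For the structured logging suggestion, a minimal JSON formatter for Python's standard `logging` module could look like this. The field names (`level`, `message`, `logger`) and the `context` convention for extra fields are assumptions chosen for the example, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Assumed convention: callers pass extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order processed", extra={"context": {"orderId": "o-123", "latencyMs": 42}})
```

With events in this shape, a Logs Insights query can filter on the JSON fields directly, for example `filter level = "ERROR"`, instead of pattern-matching on the raw message.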

Cost Optimization

  • Delete unused dashboards
  • Reduce log retention for non-critical logs
  • Avoid high-resolution metrics unless necessary
  • Use log subscription filters instead of polling

Troubleshooting

Missing Metrics

Causes:
  • Service not publishing yet (wait 1-5 minutes)
  • Wrong namespace/dimensions
  • Detailed monitoring not enabled (EC2 then reports at 5-minute rather than 1-minute intervals)
Debug:

```bash
# List metrics for a namespace
aws cloudwatch list-metrics \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=MyFunction
```

Alarm Stuck in INSUFFICIENT_DATA

Causes:
  • Metric not being published
  • Dimensions mismatch
  • Evaluation period too short
Debug:

```bash
# Check if metric has data
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=MyFunction \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum
```

Log Events Not Appearing

Causes:
  • IAM permissions missing
  • CloudWatch Logs agent not running
  • Log group doesn't exist
Debug:

```bash
# Check log streams
aws logs describe-log-streams \
  --log-group-name /aws/lambda/MyFunction \
  --order-by LastEventTime \
  --descending \
  --limit 5
```

High CloudWatch Costs

Check usage:

```bash
# Get daily ingested log volume (IncomingBytes) for one log group over the past week
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value=/aws/lambda/MyFunction \
  --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Sum
```
