AWS CloudWatch
Amazon CloudWatch provides monitoring and observability for AWS resources and applications. It collects metrics, logs, and events, enabling you to monitor, troubleshoot, and optimize your AWS environment.
Core Concepts
Metrics
Time-ordered data points published to CloudWatch. Key components:
- Namespace: Container for metrics (e.g., `AWS/Lambda`)
- Metric name: Name of the measurement (e.g., `Invocations`)
- Dimensions: Name-value pairs for filtering (e.g., `FunctionName=MyFunc`)
- Statistics: Aggregations (Sum, Average, Min, Max, SampleCount, pN)
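To make these pieces concrete, here is a minimal sketch. It assumes a hypothetical `MetricKey` helper (not part of boto3) that bundles namespace, metric name, and dimensions, and turns them into the keyword arguments that `get_metric_statistics` accepts. CloudWatch treats each distinct combination of dimensions as a separate metric, which the comparison below illustrates.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MetricKey:
    """Hypothetical helper: the three parts that identify a CloudWatch metric."""
    namespace: str
    name: str
    dimensions: tuple = ()  # tuple of (name, value) pairs

    def stats_kwargs(self, statistic="Sum", period=60, minutes=60):
        """Build keyword arguments for cloudwatch.get_metric_statistics()."""
        end = datetime.now(timezone.utc)
        return {
            "Namespace": self.namespace,
            "MetricName": self.name,
            "Dimensions": [{"Name": n, "Value": v} for n, v in self.dimensions],
            "StartTime": end - timedelta(minutes=minutes),
            "EndTime": end,
            "Period": period,
            "Statistics": [statistic],
        }


per_function = MetricKey("AWS/Lambda", "Invocations", (("FunctionName", "MyFunc"),))
account_wide = MetricKey("AWS/Lambda", "Invocations")

# Same namespace and name, different dimensions: two different metrics.
print(per_function == account_wide)

# With credentials configured, the kwargs could be passed straight to boto3:
#   boto3.client("cloudwatch").get_metric_statistics(**per_function.stats_kwargs())
```

Note that the API spells the basic statistics `Minimum` and `Maximum`, and percentile statistics such as `p99` go in the separate `ExtendedStatistics` parameter.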
Logs
Log data from AWS services and applications:
- Log groups: Collections of log streams
- Log streams: Sequences of log events from the same source
- Log events: Individual log entries with timestamp and message
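As an illustration of how events relate to streams and groups, the sketch below builds the `logEvents` list that `put_log_events` takes. The helper name `build_log_events` and the group and stream names are assumptions for the example. Timestamps are expressed in milliseconds since the epoch and the events are sorted oldest first, which is the order the API expects within a batch.

```python
import time


def build_log_events(entries):
    """Convert (epoch_seconds, message) pairs into the logEvents list that
    logs.put_log_events() expects: millisecond timestamps, oldest first."""
    events = [
        {"timestamp": int(seconds * 1000), "message": message}
        for seconds, message in entries
    ]
    return sorted(events, key=lambda event: event["timestamp"])


now = time.time()
events = build_log_events([(now, "request finished"), (now - 2, "request started")])

# One log group holds many streams; each stream holds events from one source.
# With credentials configured, and assuming the group and stream already exist:
#   boto3.client("logs").put_log_events(
#       logGroupName="/my-app/web",      # hypothetical names
#       logStreamName="host-1",
#       logEvents=events,
#   )
```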
Alarms
Automated actions based on metric thresholds:
- States: OK, ALARM, INSUFFICIENT_DATA
- Actions: SNS notifications, Auto Scaling, EC2 actions
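The three states can be pictured with a deliberately simplified model. The function below is a sketch only: it assumes a `GreaterThanThreshold` alarm where every period in the evaluation window must breach, and it treats any missing period as insufficient data. Real CloudWatch evaluation also supports "M out of N" datapoints and configurable missing-data handling, which this does not reproduce.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Simplified model of a GreaterThanThreshold alarm.

    datapoints: most recent values, oldest first; None marks a missing period.
    Returns "ALARM", "OK", or "INSUFFICIENT_DATA".
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods or any(v is None for v in window):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in window):
        return "ALARM"
    return "OK"


print(evaluate_alarm([50, 85, 90], threshold=80, evaluation_periods=2))  # ALARM
print(evaluate_alarm([85, 90, 70], threshold=80, evaluation_periods=2))  # OK
print(evaluate_alarm([85, None], threshold=80, evaluation_periods=2))    # INSUFFICIENT_DATA
```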
Common Patterns
Create a Metric Alarm
**AWS CLI:**
```bash
# CPU utilization alarm for EC2
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-i-1234567890abcdef0" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:alerts
```
**boto3:**
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='HighCPU-i-1234567890abcdef0',
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Statistic='Average',
    Period=300,
    Threshold=80.0,
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=2,
    Dimensions=[
        {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts'],
    OKActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)
```
Lambda Error Rate Alarm
```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "LambdaErrorRate-MyFunction" \
  --metrics '[
    {
      "Id": "errors",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Errors",
          "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "invocations",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Invocations",
          "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "errorRate",
      "Expression": "errors/invocations*100",
      "Label": "Error Rate",
      "ReturnData": true
    }
  ]' \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```
Query Logs with Insights
```bash
# Find errors in Lambda logs
# (date -d is GNU date syntax; on macOS/BSD use: date -v-1H +%s)
aws logs start-query \
  --log-group-name /aws/lambda/MyFunction \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 50
  '

# Get query results
aws logs get-query-results --query-id <query-id>
```
**boto3:**
```python
import boto3
import time

logs = boto3.client('logs')

# Start query
response = logs.start_query(
    logGroupName='/aws/lambda/MyFunction',
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString='''
        fields @timestamp, @message
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 50
    '''
)
query_id = response['queryId']

# Wait until the query leaves the Scheduled/Running states,
# so a failed or cancelled query does not loop forever
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] not in ('Scheduled', 'Running'):
        break
    time.sleep(1)

if result['status'] != 'Complete':
    raise RuntimeError(f"Query ended with status {result['status']}")

for row in result['results']:
    print(row)
```
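Each row returned by `get_query_results` is a list of `{'field': ..., 'value': ...}` cells rather than a plain mapping, which makes the printed output awkward to read. The sketch below flattens rows into dictionaries. The helper name `rows_to_dicts` is an assumption for this example, and the sample data is made up to match the response shape.

```python
def rows_to_dicts(results):
    """Flatten Logs Insights rows into {field: value} dicts, dropping the
    internal @ptr field that accompanies each row."""
    return [
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in results
    ]


# Sample shaped like result['results'] from get_query_results (values made up).
sample = [
    [
        {"field": "@timestamp", "value": "2024-01-01 12:00:00.000"},
        {"field": "@message", "value": "ERROR something failed"},
        {"field": "@ptr", "value": "opaque-pointer"},
    ]
]

for record in rows_to_dicts(sample):
    print(record["@timestamp"], record["@message"])
```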
Create Metric Filter
Extract metrics from log patterns:
```bash
# Create metric filter for error count
aws logs put-metric-filter \
  --log-group-name /aws/lambda/MyFunction \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0
```
Publish Custom Metrics
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Value': 1,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'Production'},
                {'Name': 'OrderType', 'Value': 'Standard'}
            ]
        }
    ]
)
```
Create Dashboard
```bash
cat > dashboard.json << 'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Lambda Invocations",
        "metrics": [
          ["AWS/Lambda", "Invocations", "FunctionName", "MyFunction"]
        ],
        "period": 60,
        "stat": "Sum",
        "region": "us-east-1"
      }
    },
    {
      "type": "log",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/lambda/MyFunction' | filter @message like /ERROR/ | limit 20",
        "region": "us-east-1"
      }
    }
  ]
}
EOF

aws cloudwatch put-dashboard \
  --dashboard-name MyAppDashboard \
  --dashboard-body file://dashboard.json
```
CLI Reference
Metrics Commands
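| Command | Description |
|---|---|
| `aws cloudwatch put-metric-data` | Publish custom metrics |
| `aws cloudwatch get-metric-data` | Retrieve metric values |
| `aws cloudwatch get-metric-statistics` | Get aggregated statistics |
| `aws cloudwatch list-metrics` | List available metrics |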
Alarms Commands
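| Command | Description |
|---|---|
| `aws cloudwatch put-metric-alarm` | Create or update alarm |
| `aws cloudwatch describe-alarms` | List alarms |
| `aws cloudwatch set-alarm-state` | Manually set alarm state |
| `aws cloudwatch delete-alarms` | Delete alarms |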
Logs Commands
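| Command | Description |
|---|---|
| `aws logs create-log-group` | Create log group |
| `aws logs put-log-events` | Write log events |
| `aws logs filter-log-events` | Search log events |
| `aws logs start-query` | Start Insights query |
| `aws logs put-metric-filter` | Create metric filter |
| `aws logs put-retention-policy` | Set log retention |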
Best Practices
Metrics
- Use dimensions wisely: every unique combination of dimension values is a separate metric, so too many dimensions cause a metric explosion
- Aggregate before publishing: batch custom metrics into fewer requests
- Use high-resolution metrics (1-second) only when needed
- Set meaningful units for custom metrics
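One way to aggregate before publishing is to send a statistic set instead of one datum per observation. The sketch below assumes two hypothetical helpers, `aggregate_datum` and `chunked`, and an illustrative metric name. It uses the `StatisticValues` field of `put_metric_data`, and the batch size shown in the comment is an assumption, not a documented limit.

```python
def aggregate_datum(metric_name, values, unit="Count", dimensions=()):
    """Collapse raw observations into one MetricData entry using StatisticValues,
    so a single datum carries the count, sum, min, and max."""
    if not values:
        raise ValueError("need at least one observation")
    return {
        "MetricName": metric_name,
        "Unit": unit,
        "Dimensions": [{"Name": n, "Value": v} for n, v in dimensions],
        "StatisticValues": {
            "SampleCount": float(len(values)),
            "Sum": float(sum(values)),
            "Minimum": float(min(values)),
            "Maximum": float(max(values)),
        },
    }


def chunked(items, size):
    """Yield successive batches so each put_metric_data call stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


latencies_ms = [120, 95, 310, 101]
datum = aggregate_datum(
    "OrderLatency",  # hypothetical metric name
    latencies_ms,
    unit="Milliseconds",
    dimensions=[("Environment", "Production")],
)

# With credentials configured:
#   for batch in chunked([datum], 100):   # batch size is an assumption
#       boto3.client("cloudwatch").put_metric_data(Namespace="MyApp", MetricData=batch)
```

The trade-off is that a statistic set keeps Sum, Average, Min, Max, and SampleCount accurate but gives up per-observation detail.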
Alarms
- Use composite alarms for complex conditions
- Set appropriate evaluation periods to avoid flapping
- Include OK actions to track recovery
- Use anomaly detection for dynamic thresholds
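A composite alarm combines the states of other alarms through a rule expression. As a rough sketch, the helper below (a hypothetical name, `all_in_alarm`) builds a rule that fires only when every listed child alarm is in ALARM. The child alarm names are taken from the examples in this document, and the composite alarm name in the comment is made up.

```python
def all_in_alarm(*alarm_names):
    """Build a composite AlarmRule that fires only when every child is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm("HighCPU-i-1234567890abcdef0", "LambdaErrorRate-MyFunction")
print(rule)

# With credentials configured:
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="ServiceDegraded",          # hypothetical name
#       AlarmRule=rule,
#       AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
#   )
```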
Logs
- Set retention policies: log groups keep events indefinitely unless you set one
- Use structured logging (JSON) for better querying
- Create metric filters for key events
- Use Contributor Insights for top-N analysis
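For structured logging, one possible approach with the standard library is a formatter that emits a single JSON object per line. This is a sketch, not a prescribed setup: the `JsonFormatter` class, the `context` attribute convention, and the logger name are all assumptions. Logs Insights discovers fields in JSON log events, so a query such as `filter level = "ERROR"` can then replace a regex match on the raw message.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged in.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment failed", extra={"context": {"orderId": "A-1001"}})
```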
Cost Optimization
- Delete unused dashboards
- Reduce log retention for non-critical logs
- Avoid high-resolution metrics unless necessary
- Use log subscription filters instead of polling
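When reducing retention, note that `put_retention_policy` only accepts specific values for `retentionInDays`. The sketch below rounds a desired retention up to the next accepted value. The list of values is an assumption based on the API reference at the time of writing and should be checked against the current documentation; `snap_retention` is a hypothetical helper.

```python
import bisect

# Assumed set of accepted retentionInDays values; verify before relying on it.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def snap_retention(days):
    """Return the smallest allowed retention period that is >= days."""
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError(f"{days} days exceeds the longest allowed retention")
    return ALLOWED_RETENTION_DAYS[index]


print(snap_retention(10))  # 14
print(snap_retention(30))  # 30

# With credentials configured:
#   boto3.client("logs").put_retention_policy(
#       logGroupName="/aws/lambda/MyFunction",
#       retentionInDays=snap_retention(10),
#   )
```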
Troubleshooting
Missing Metrics
Causes:
- Service not publishing yet (wait 1-5 minutes)
- Wrong namespace/dimensions
- Detailed monitoring not enabled (EC2)
Debug:
```bash
# List metrics for a namespace
aws cloudwatch list-metrics \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=MyFunction
```
Alarm Stuck in INSUFFICIENT_DATA
Causes:
- Metric not being published
- Dimensions mismatch
- Evaluation period too short
Debug:
```bash
# Check if metric has data
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=MyFunction \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Sum
```
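If the metric has data but the alarm still reports INSUFFICIENT_DATA, a dimension mismatch is a likely suspect, because the alarm's dimensions must match the published metric exactly, including case. The sketch below compares an alarm's configuration against entries shaped like the `Metrics` list returned by `list_metrics`. The helper name `matching_metrics` is an assumption and the sample data is made up.

```python
def matching_metrics(listed, namespace, metric_name, dimensions):
    """Return entries from list_metrics()['Metrics'] whose namespace, name, and
    full dimension set equal what the alarm was configured with."""
    wanted = {(d["Name"], d["Value"]) for d in dimensions}
    return [
        m for m in listed
        if m["Namespace"] == namespace
        and m["MetricName"] == metric_name
        and {(d["Name"], d["Value"]) for d in m["Dimensions"]} == wanted
    ]


# Sample shaped like list_metrics()["Metrics"]; the values are made up.
listed = [
    {
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}],
    },
]

good = [{"Name": "FunctionName", "Value": "MyFunction"}]
typo = [{"Name": "FunctionName", "Value": "myfunction"}]  # wrong case

print(len(matching_metrics(listed, "AWS/Lambda", "Invocations", good)))  # 1
print(len(matching_metrics(listed, "AWS/Lambda", "Invocations", typo)))  # 0
```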
Log Events Not Appearing
Causes:
- IAM permissions missing
- CloudWatch Logs agent not running
- Log group doesn't exist
Debug:
```bash
# Check log streams
aws logs describe-log-streams \
  --log-group-name /aws/lambda/MyFunction \
  --order-by LastEventTime \
  --descending \
  --limit 5
```
High CloudWatch Costs
Check usage:
```bash
# Get PutLogEvents usage
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value=/aws/lambda/MyFunction \
  --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Sum
```