# Log Analysis Methodology

## Core Philosophy: Partition-First
NEVER start by reading raw log samples.
Logs can be overwhelming. The partition-first approach prevents:
- Missing the forest for the trees
- Wasting time on irrelevant data
- Overwhelming context with noise
## The 4-Step Process

### Step 1: Get Statistics
Before ANY log search, understand the landscape:

**CloudWatch Insights**:

```
# How many errors?
filter @message like /ERROR/
| stats count(*) as total

# Error rate over time
filter @message like /ERROR/
| stats count(*) by bin(5m)

# What types of errors?
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type
| sort count desc
```
**Datadog**:

```
# Error distribution by service
service:* status:error | stats count by service

# Error types
service:myapp status:error | stats count by @error.kind
```
**Questions to answer**:
- What's the total error volume?
- Is it increasing, stable, or decreasing?
- What are the unique error types?
- Which services/hosts are affected?
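The same partition-first counts can be sketched offline in Python. This is a minimal sketch: the regex mirrors the `parse` pattern above, and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Mirrors: parse @message /(?<error_type>[\w.]+Exception)/ | stats count(*) by error_type
ERROR_TYPE = re.compile(r"(?P<error_type>[\w.]+Exception)")

def error_stats(lines):
    """Return (total error count, Counter of exception types) for raw log lines."""
    errors = [line for line in lines if "ERROR" in line]
    types = Counter(
        m.group("error_type")
        for line in errors
        if (m := ERROR_TYPE.search(line))
    )
    return len(errors), types

# Invented sample lines for illustration
logs = [
    "2024-05-01T10:00:01 INFO request ok",
    "2024-05-01T10:00:02 ERROR java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:00:03 ERROR java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:00:04 ERROR db.ConnectionException: pool exhausted",
]
total, by_type = error_stats(logs)
# total error volume and per-type counts, without reading every line
```

The point is the shape of the answer: a handful of aggregate numbers, not a wall of raw samples.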
### Step 2: Identify Patterns
Look for correlations:

**Temporal patterns**:
- Did errors start at a specific time?
- Is there periodicity (every hour, every day)?
- Correlation with deployments or traffic spikes?

**Service patterns**:
- Is one service the source?
- Is the error propagating across services?

**Error patterns**:
- What's the most frequent error?
- Are errors clustered or distributed?
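Temporal patterns usually fall out of the 5-minute binning from Step 1. A minimal Python sketch, with made-up timestamps, of spotting when errors started:

```python
from collections import Counter
from datetime import datetime

def errors_per_5m(timestamps):
    """Bucket ISO error timestamps into 5-minute bins, like `stats count(*) by bin(5m)`."""
    bins = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        # Round down to the enclosing 5-minute boundary
        start = t.replace(minute=t.minute - t.minute % 5, second=0, microsecond=0)
        bins[start.isoformat()] += 1
    return bins

# Invented timestamps: quiet baseline, then a spike after 10:15
errors = [
    "2024-05-01T10:01:00",
    "2024-05-01T10:03:30",
    "2024-05-01T10:16:00",
    "2024-05-01T10:17:10",
    "2024-05-01T10:19:59",
]
bins = errors_per_5m(errors)
# The 10:15 bucket jumping above baseline is the signal to check deploys near that time
```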
### Step 3: Sample Strategically
Only NOW read actual log samples:

**Sample from anomalies**:
- Get logs from the peak error time
- Get logs from normal time for comparison

**Sample by error type**:
- Get examples of each distinct error type
- Limit to 5-10 per type

**Sample around events**:
- Logs before/after a deployment
- Logs around a specific incident timestamp
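The per-type cap can be sketched as a small stratified sampler. The event dicts and their `error_type` field are hypothetical, standing in for whatever structure your logs have:

```python
from collections import defaultdict

def sample_by_type(events, per_type=5):
    """Keep at most `per_type` sample events per error type."""
    samples = defaultdict(list)
    for event in events:
        bucket = samples[event["error_type"]]
        if len(bucket) < per_type:  # cap each stratum, drop the rest
            bucket.append(event)
    return dict(samples)

# Invented events: 20 timeouts, 1 auth failure
events = [{"error_type": "TimeoutError", "msg": f"call {i} timed out"} for i in range(20)]
events.append({"error_type": "AuthError", "msg": "token expired"})
samples = sample_by_type(events)
# 5 TimeoutError examples survive out of 20; the rare AuthError is still represented
```

Capping per type rather than overall is what keeps rare-but-important error types from being drowned out by the dominant one.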
### Step 4: Correlate with Events
Connect logs to system changes:

```
# Use git_log to find recent deployments
git_log --since="2 hours ago"

# Use get_deployment_history for K8s
get_deployment_history deployment=api-server

# Compare log patterns before/after changes
```
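A minimal sketch of the before/after comparison, assuming the deployment time is already known from `git_log` or `get_deployment_history` (the timestamps here are invented):

```python
from datetime import datetime

def split_at_deploy(error_timestamps, deploy_at):
    """Count errors before vs after a deployment timestamp."""
    deploy = datetime.fromisoformat(deploy_at)
    before = sum(1 for ts in error_timestamps if datetime.fromisoformat(ts) < deploy)
    return before, len(error_timestamps) - before

errors = [
    "2024-05-01T09:50:00",
    "2024-05-01T10:05:00",
    "2024-05-01T10:06:00",
    "2024-05-01T10:07:00",
]
before, after = split_at_deploy(errors, "2024-05-01T10:00:00")
# 1 error before the deploy vs 3 after: the deploy becomes the leading suspect
```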
## Platform-Specific Tips

### CloudWatch Insights
**Best practices**:

```
# Always include time filter
filter @timestamp > ago(1h)

# Use parse for structured extraction
parse @message /status=(?<status>\d+)/

# Aggregate before displaying
stats count(*) by status | sort count desc | limit 10
```
**Common queries**:

```
# Latency distribution
filter @type = "REPORT"
| stats avg(@duration) as avg,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99

# Error messages with context
filter @message like /ERROR/
| fields @timestamp, @message
| sort @timestamp desc
| limit 20
```
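For intuition about why avg and p95/p99 are reported together, here is a nearest-rank percentile sketch (one common definition; CloudWatch's `pct()` may compute its estimate differently). The durations are invented:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Invented latencies (ms): mostly fast, with two slow outliers
durations_ms = [12, 13, 13, 14, 14, 15, 15, 16, 200, 450]
avg = sum(durations_ms) / len(durations_ms)
p95 = percentile(durations_ms, 95)
# avg is pulled up moderately, but p95 exposes the full tail latency
```

This is why the tail percentiles, not the average, are what catch a small population of slow requests.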
### Datadog Logs
**Query syntax**:

```
# Filter by service and status
service:api-gateway status:error

# Field queries
@http.status_code:>=500

# Wildcard
@error.message:*timeout*

# Time comparison
service:api (now-1h TO now) vs (now-25h TO now-24h)
```
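These filters can be mimicked offline over structured events. A sketch assuming Datadog-style attribute dicts; the field names and sample events are illustrative, not Datadog's actual wire format:

```python
def matches(event, service=None, status=None, min_status_code=None):
    """Rough offline analogue of `service:X status:error @http.status_code:>=500`."""
    if service is not None and event.get("service") != service:
        return False
    if status is not None and event.get("status") != status:
        return False
    if min_status_code is not None:
        code = event.get("http", {}).get("status_code", 0)
        if code < min_status_code:
            return False
    return True

# Invented events for illustration
events = [
    {"service": "api-gateway", "status": "error", "http": {"status_code": 502}},
    {"service": "api-gateway", "status": "ok", "http": {"status_code": 200}},
    {"service": "billing", "status": "error", "http": {"status_code": 500}},
]
hits = [e for e in events
        if matches(e, service="api-gateway", status="error", min_status_code=500)]
# Only the gateway 502 survives all three filters
```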
### Kubernetes Logs

**Use get_pod_logs wisely**:
- Always specify `tail_lines` (default: 100)
- Filter to specific containers in multi-container pods
- Use `get_pod_events` first for crashes/restarts
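The bounding habit can be sketched as a wrapper. Here `fetch` is a hypothetical stand-in for `get_pod_logs`, not its real signature:

```python
def bounded_pod_logs(fetch, pod, container=None, tail_lines=100):
    """Fetch pod logs but always cap the line count, mirroring tail_lines."""
    lines = fetch(pod=pod, container=container)
    return lines[-tail_lines:]  # keep only the newest tail_lines lines

# Fake fetcher standing in for a real get_pod_logs call: returns 1000 lines
def fake_fetch(pod, container=None):
    return [f"line {i}" for i in range(1000)]

tail = bounded_pod_logs(fake_fetch, pod="api-server-abc123")
# 100 newest lines instead of 1000: bounded by default
```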
## Anti-Patterns to Avoid
- **Dumping all logs** - Never request unbounded log queries
- **Starting with samples** - Always get statistics first
- **Ignoring time windows** - Narrow to incident window
- **Missing correlation** - Always connect to deployments/changes
- **Single-service focus** - Check upstream/downstream services
## Investigation Template
```markdown
## Log Analysis Report

### Statistics
- Time window: [start] to [end]
- Total log volume: X events
- Error count: Y events (Z%)
- Error rate trend: [increasing/stable/decreasing]

### Top Error Types
- [ErrorType1]: N occurrences - [description]
- [ErrorType2]: M occurrences - [description]

### Temporal Pattern
- Errors started at: [timestamp]
- Correlation: [deployment X / traffic spike / external event]

### Sample Errors
[Quote 2-3 representative error messages]

### Root Cause Hypothesis
[Based on patterns, what's the likely cause?]
```