Log Analysis Methodology

Core Philosophy: Partition-First

NEVER start by reading raw log samples.
Logs can be overwhelming. The partition-first approach prevents:
  • Missing the forest for the trees
  • Wasting time on irrelevant data
  • Overwhelming context with noise

The 4-Step Process

Step 1: Get Statistics

Before ANY log search, understand the landscape:
CloudWatch Insights:

How many errors?

filter @message like /ERROR/ | stats count(*) as total

Error rate over time

filter @message like /ERROR/ | stats count(*) by bin(5m)

What types of errors?

filter @message like /ERROR/ | parse @message /(?<error_type>[\w.]+Exception)/ | stats count(*) by error_type | sort count desc

**Datadog**:

Error distribution by service

service:* status:error | stats count by service

Error types

service:myapp status:error | stats count by @error.kind

**Questions to answer**:
- What's the total error volume?
- Is it increasing, stable, or decreasing?
- What are the unique error types?
- Which services/hosts are affected?
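When the only artifact at hand is a raw log file, the same partition-first statistics can be computed locally before reading any samples. A minimal Python sketch of the count-and-group step (the log lines and their layout are illustrative assumptions, not a real format):

```python
import re
from collections import Counter

def error_stats(lines):
    """Count total errors and group them by exception type,
    the local equivalent of the stats queries above."""
    errors = [line for line in lines if "ERROR" in line]
    # Same idea as: parse @message /(?<error_type>[\w.]+Exception)/
    types = Counter(
        m.group(1)
        for line in errors
        if (m := re.search(r"([\w.]+Exception)", line))
    )
    return len(errors), types.most_common()

# Hypothetical sample lines
logs = [
    "2024-05-01T10:00:01 INFO request ok",
    "2024-05-01T10:00:02 ERROR db.TimeoutException query failed",
    "2024-05-01T10:00:03 ERROR db.TimeoutException query failed",
    "2024-05-01T10:00:04 ERROR auth.TokenException bad token",
]
total, by_type = error_stats(logs)
print(total, by_type)  # 3 errors; db.TimeoutException is most frequent
```

The point is the shape of the answer, not the parsing: you learn the volume and the dominant error type before looking at a single full message.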

Step 2: Identify Patterns

Look for correlations:
Temporal patterns:
  • Did errors start at a specific time?
  • Is there periodicity (every hour, every day)?
  • Correlation with deployments or traffic spikes?
Service patterns:
  • Is one service the source?
  • Is the error propagating across services?
Error patterns:
  • What's the most frequent error?
  • Are errors clustered or distributed?
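The temporal questions can be answered with the same binning trick as `stats count(*) by bin(5m)`. A sketch, with hypothetical error timestamps: bucket them into 5-minute bins and find where they cluster.

```python
from collections import Counter
from datetime import datetime

def bin_5m(ts: str) -> str:
    """Truncate an ISO timestamp to its 5-minute bin."""
    t = datetime.fromisoformat(ts)
    return t.replace(minute=t.minute - t.minute % 5, second=0).isoformat()

# Illustrative timestamps of ERROR events
error_times = [
    "2024-05-01T10:01:00", "2024-05-01T10:03:30",
    "2024-05-01T10:06:10", "2024-05-01T10:06:40",
    "2024-05-01T10:07:20", "2024-05-01T10:08:05",
]
bins = Counter(bin_5m(t) for t in error_times)
peak_bin, peak_count = bins.most_common(1)[0]
print(peak_bin, peak_count)  # errors cluster in the 10:05 bin
```

A clear peak bin points at a triggering event; a flat distribution suggests a chronic issue instead.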

Step 3: Sample Strategically

Only NOW read actual log samples:
Sample from anomalies:
  • Get logs from the peak error time
  • Get logs from normal time for comparison
Sample by error type:
  • Get examples of each distinct error type
  • Limit to 5-10 per type
Sample around events:
  • Logs before/after a deployment
  • Logs around a specific incident timestamp
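The per-type cap above can be sketched as a simple stratified sampler: keep the first few examples of each error type and discard the rest. The error tuples here are hypothetical.

```python
from collections import defaultdict

def sample_by_type(errors, per_type=5):
    """Keep at most `per_type` examples of each error type,
    instead of reading every occurrence."""
    samples = defaultdict(list)
    for error_type, message in errors:
        if len(samples[error_type]) < per_type:
            samples[error_type].append(message)
    return dict(samples)

# 40 identical timeouts plus one rare error (illustrative)
errors = [("TimeoutException", f"timeout #{i}") for i in range(40)]
errors.append(("TokenException", "bad token"))

samples = sample_by_type(errors, per_type=5)
print({k: len(v) for k, v in samples.items()})
# 41 raw errors reduced to 6 representative samples
```

Note that the rare error type survives the sampling, which is exactly what naive "first N lines" sampling would lose.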

Step 4: Correlate with Events

Connect logs to system changes:

Use git_log to find recent deployments

git_log --since="2 hours ago"

Use get_deployment_history for K8s

get_deployment_history deployment=api-server

Compare log patterns before/after changes
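The before/after comparison is just the error rate split at the change time. A minimal sketch, assuming a hypothetical deploy at 10:00 and zero-padded HH:MM timestamps (so string comparison orders them correctly):

```python
def error_rate(events):
    """Fraction of events at ERROR level."""
    errs = sum(1 for _, level in events if level == "ERROR")
    return errs / len(events)

# (timestamp, level) pairs around the change -- illustrative data
events = [
    ("09:58", "INFO"), ("09:59", "INFO"), ("09:59", "ERROR"),
    ("10:01", "ERROR"), ("10:02", "ERROR"), ("10:03", "INFO"),
]
deploy_at = "10:00"
before = [e for e in events if e[0] < deploy_at]
after = [e for e in events if e[0] >= deploy_at]
print(error_rate(before), error_rate(after))  # rate doubles after the deploy
```

A jump in the after-window rate makes the deployment the prime suspect; an unchanged rate sends you back to Step 2 to look for other correlations.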

Platform-Specific Tips

CloudWatch Insights

Best practices:

Always include time filter

filter @timestamp > ago(1h)

Use parse for structured extraction

parse @message /status=(?<status>\d+)/
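The `parse` pattern is an ordinary named-group regex, so a query can be prototyped offline with Python's `re` module before running it against CloudWatch (the sample message is illustrative):

```python
import re

# Same pattern as: parse @message /status=(?<status>\d+)/
# Python spells the named group (?P<status>...) rather than (?<status>...)
pattern = re.compile(r"status=(?P<status>\d+)")

m = pattern.search("GET /api/users status=503 latency=120ms")
print(m.group("status"))  # "503"
```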

Aggregate before displaying

stats count(*) by status | sort count desc | limit 10

**Common queries**:

Latency distribution

filter @type = "REPORT" | stats avg(@duration) as avg, pct(@duration, 95) as p95, pct(@duration, 99) as p99
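The avg/p95/p99 aggregation can be reproduced locally with a nearest-rank percentile. A sketch with synthetic durations (CloudWatch's `pct` may interpolate slightly differently; this is the simple textbook variant):

```python
import math

def pct(values, p):
    """Nearest-rank percentile: the value at rank ceil(p% * n)."""
    s = sorted(values)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[k]

durations = list(range(1, 101))  # synthetic durations: 1..100 ms
avg = sum(durations) / len(durations)
print(avg, pct(durations, 95), pct(durations, 99))  # 50.5 95 99
```

Comparing avg against p99 is the quick tail check: a p99 far above the average means a small fraction of slow requests, not uniform slowness.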

Error messages with context

filter @message like /ERROR/ | fields @timestamp, @message | sort @timestamp desc | limit 20

Datadog Logs

Query syntax:

Filter by service and status

service:api-gateway status:error

Field queries

@http.status_code:>=500

Wildcard

@error.message:*timeout*

Time comparison

service:api (now-1h TO now) vs (now-25h TO now-24h)

Kubernetes Logs

Use get_pod_logs wisely:
  • Always specify tail_lines (default: 100)
  • Filter to specific containers in multi-container pods
  • Use get_pod_events first for crashes/restarts

Anti-Patterns to Avoid

  1. Dumping all logs - Never request unbounded log queries
  2. Starting with samples - Always get statistics first
  3. Ignoring time windows - Narrow to the incident window
  4. Missing correlation - Always connect to deployments/changes
  5. Single-service focus - Check upstream/downstream services

Investigation Template

Log Analysis Report

Statistics

  • Time window: [start] to [end]
  • Total log volume: X events
  • Error count: Y events (Z%)
  • Error rate trend: [increasing/stable/decreasing]

Top Error Types

  1. [ErrorType1]: N occurrences - [description]
  2. [ErrorType2]: M occurrences - [description]

Temporal Pattern

  • Errors started at: [timestamp]
  • Correlation: [deployment X / traffic spike / external event]

Sample Errors

[Quote 2-3 representative error messages]

Root Cause Hypothesis

[Based on patterns, what's the likely cause?]