Log Analysis Methodology

Core Philosophy: Partition-First

NEVER start by reading raw log samples.
Logs can be overwhelming. The partition-first approach prevents:
  • Missing the forest for the trees
  • Wasting time on irrelevant data
  • Overwhelming context with noise

The 4-Step Process

Step 1: Get Statistics

Before ANY log search, understand the landscape:
CloudWatch Insights:

How many errors?

filter @message like /ERROR/ | stats count(*) as total

Error rate over time

filter @message like /ERROR/ | stats count(*) by bin(5m)

What types of errors?

filter @message like /ERROR/ | parse @message /(?<error_type>[\w.]+Exception)/ | stats count(*) by error_type | sort count desc

**Datadog**:

Error distribution by service

service:* status:error | stats count by service

Error types

service:myapp status:error | stats count by @error.kind

**Questions to answer**:
- What's the total error volume?
- Is it increasing, stable, or decreasing?
- What are the unique error types?
- Which services/hosts are affected?
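When the only artifact at hand is a raw log file, the same partition-first statistics can be computed locally before reading any samples. A minimal Python sketch of the count-and-group step (the log lines and their layout are illustrative assumptions, not a real format):

```python
import re
from collections import Counter

def error_stats(lines):
    """Count total errors and group them by exception type,
    the local equivalent of the stats queries above."""
    errors = [line for line in lines if "ERROR" in line]
    # Same idea as: parse @message /(?<error_type>[\w.]+Exception)/
    types = Counter(
        m.group(1)
        for line in errors
        if (m := re.search(r"([\w.]+Exception)", line))
    )
    return len(errors), types.most_common()

# Hypothetical sample lines
logs = [
    "2024-05-01T10:00:01 INFO request ok",
    "2024-05-01T10:00:02 ERROR db.TimeoutException query failed",
    "2024-05-01T10:00:03 ERROR db.TimeoutException query failed",
    "2024-05-01T10:00:04 ERROR auth.TokenException bad token",
]
total, by_type = error_stats(logs)
print(total, by_type)  # 3 errors; db.TimeoutException is most frequent
```

The point is the shape of the answer, not the parsing: you learn the volume and the dominant error type before looking at a single full message.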

Step 2: Identify Patterns

Look for correlations:
Temporal patterns:
  • Did errors start at a specific time?
  • Is there periodicity (every hour, every day)?
  • Correlation with deployments or traffic spikes?
Service patterns:
  • Is one service the source?
  • Is the error propagating across services?
Error patterns:
  • What's the most frequent error?
  • Are errors clustered or distributed?
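The temporal questions can be answered with the same binning trick as `stats count(*) by bin(5m)`. A sketch, with hypothetical error timestamps: bucket them into 5-minute bins and find where they cluster.

```python
from collections import Counter
from datetime import datetime

def bin_5m(ts: str) -> str:
    """Truncate an ISO timestamp to its 5-minute bin."""
    t = datetime.fromisoformat(ts)
    return t.replace(minute=t.minute - t.minute % 5, second=0).isoformat()

# Illustrative timestamps of ERROR events
error_times = [
    "2024-05-01T10:01:00", "2024-05-01T10:03:30",
    "2024-05-01T10:06:10", "2024-05-01T10:06:40",
    "2024-05-01T10:07:20", "2024-05-01T10:08:05",
]
bins = Counter(bin_5m(t) for t in error_times)
peak_bin, peak_count = bins.most_common(1)[0]
print(peak_bin, peak_count)  # errors cluster in the 10:05 bin
```

A clear peak bin points at a triggering event; a flat distribution suggests a chronic issue instead.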

Step 3: Sample Strategically

Only NOW read actual log samples:
Sample from anomalies:
  • Get logs from the peak error time
  • Get logs from normal time for comparison
Sample by error type:
  • Get examples of each distinct error type
  • Limit to 5-10 per type
Sample around events:
  • Logs before/after a deployment
  • Logs around a specific incident timestamp
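The per-type cap above can be sketched as a simple stratified sampler: keep the first few examples of each error type and discard the rest. The error tuples here are hypothetical.

```python
from collections import defaultdict

def sample_by_type(errors, per_type=5):
    """Keep at most `per_type` examples of each error type,
    instead of reading every occurrence."""
    samples = defaultdict(list)
    for error_type, message in errors:
        if len(samples[error_type]) < per_type:
            samples[error_type].append(message)
    return dict(samples)

# 40 identical timeouts plus one rare error (illustrative)
errors = [("TimeoutException", f"timeout #{i}") for i in range(40)]
errors.append(("TokenException", "bad token"))

samples = sample_by_type(errors, per_type=5)
print({k: len(v) for k, v in samples.items()})
# 41 raw errors reduced to 6 representative samples
```

Note that the rare error type survives the sampling, which is exactly what naive "first N lines" sampling would lose.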

Step 4: Correlate with Events

Connect logs to system changes:

Use git_log to find recent deployments

git_log --since="2 hours ago"

Use get_deployment_history for K8s

get_deployment_history deployment=api-server

Compare log patterns before/after changes
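The before/after comparison is just the error rate split at the change time. A minimal sketch, assuming a hypothetical deploy at 10:00 and zero-padded HH:MM timestamps (so string comparison orders them correctly):

```python
def error_rate(events):
    """Fraction of events at ERROR level."""
    errs = sum(1 for _, level in events if level == "ERROR")
    return errs / len(events)

# (timestamp, level) pairs around the change -- illustrative data
events = [
    ("09:58", "INFO"), ("09:59", "INFO"), ("09:59", "ERROR"),
    ("10:01", "ERROR"), ("10:02", "ERROR"), ("10:03", "INFO"),
]
deploy_at = "10:00"
before = [e for e in events if e[0] < deploy_at]
after = [e for e in events if e[0] >= deploy_at]
print(error_rate(before), error_rate(after))  # rate doubles after the deploy
```

A jump in the after-window rate makes the deployment the prime suspect; an unchanged rate sends you back to Step 2 to look for other correlations.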

Platform-Specific Tips

CloudWatch Insights

Best practices:

Always include time filter

filter @timestamp > ago(1h)

Use parse for structured extraction

parse @message /status=(?<status>\d+)/
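The `parse` pattern is an ordinary named-group regex, so a query can be prototyped offline with Python's `re` module before running it against CloudWatch (the sample message is illustrative):

```python
import re

# Same pattern as: parse @message /status=(?<status>\d+)/
# Python spells the named group (?P<status>...) rather than (?<status>...)
pattern = re.compile(r"status=(?P<status>\d+)")

m = pattern.search("GET /api/users status=503 latency=120ms")
print(m.group("status"))  # "503"
```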

Aggregate before displaying

stats count(*) by status | sort count desc | limit 10

**Common queries**:

Latency distribution

filter @type = "REPORT" | stats avg(@duration) as avg, pct(@duration, 95) as p95, pct(@duration, 99) as p99
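The avg/p95/p99 aggregation can be reproduced locally with a nearest-rank percentile. A sketch with synthetic durations (CloudWatch's `pct` may interpolate slightly differently; this is the simple textbook variant):

```python
import math

def pct(values, p):
    """Nearest-rank percentile: the value at rank ceil(p% * n)."""
    s = sorted(values)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[k]

durations = list(range(1, 101))  # synthetic durations: 1..100 ms
avg = sum(durations) / len(durations)
print(avg, pct(durations, 95), pct(durations, 99))  # 50.5 95 99
```

Comparing avg against p99 is the quick tail check: a p99 far above the average means a small fraction of slow requests, not uniform slowness.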

Error messages with context

filter @message like /ERROR/ | fields @timestamp, @message | sort @timestamp desc | limit 20

Datadog Logs

Query syntax:

Filter by service and status

service:api-gateway status:error

Field queries

@http.status_code:>=500

Wildcard

@error.message:*timeout*

Time comparison

service:api (now-1h TO now) vs (now-25h TO now-24h)

Kubernetes Logs

Use get_pod_logs wisely:
  • Always specify tail_lines (default: 100)
  • Filter to specific containers in multi-container pods
  • Use get_pod_events first for crashes/restarts

Anti-Patterns to Avoid

  1. Dumping all logs - Never request unbounded log queries
  2. Starting with samples - Always get statistics first
  3. Ignoring time windows - Narrow to the incident window
  4. Missing correlation - Always connect to deployments/changes
  5. Single-service focus - Check upstream/downstream services

Investigation Template

Log Analysis Report

Statistics

  • Time window: [start] to [end]
  • Total log volume: X events
  • Error count: Y events (Z%)
  • Error rate trend: [increasing/stable/decreasing]

Top Error Types

  1. [ErrorType1]: N occurrences - [description]
  2. [ErrorType2]: M occurrences - [description]

Temporal Pattern

  • Errors started at: [timestamp]
  • Correlation: [deployment X / traffic spike / external event]

Sample Errors

[Quote 2-3 representative error messages]

Root Cause Hypothesis

[Based on patterns, what's the likely cause?]