Intermittent Issue Debugging

Overview

Intermittent issues are among the most difficult to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.

When to Use

Sporadic errors in logs
Users report occasional issues
Flaky tests
Race conditions suspected
Timing-dependent bugs
Resource exhaustion issues

Instructions

1. Capturing Intermittent Issues

javascript

// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code

function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);

  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}

// Strategy 2: Correlation IDs
// Track requests across systems

const orderId = 123;  // example order
const correlationId = generateId();
logger.info({
  correlationId,
  action: 'payment_start',
  orderId
});

// Pass the ID downstream so other services can log it too
chargeCard(orderId, {headers: {correlationId}});

logger.info({
  correlationId,
  action: 'payment_end',
  status: 'success'
});

// Later, grep the logs by correlationId to see the full trace

// Strategy 3: Error Sampling
// Capture the full error context when an error occurs

window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize,  // non-standard, Chromium-only; undefined elsewhere
    timestamp: new Date().toISOString()
  };

  sendToMonitoring(errorData);  // Send to error tracking
});

2. Common Intermittent Issues

yaml

Issue: Race Condition

Symptom: Inconsistent behavior depending on timing

Example:
  Thread 1: Read count (5)
  Thread 2: Read count (5), increment to 6, write
  Thread 1: Increment to 6, write (overwrites Thread 2's update)
  Result: Should be 7, but is 6

Debug:
  1. Add detailed timestamps
  2. Log all operations
  3. Look for overlapping operations
  4. Check if order matters

Solution:
  - Use locks/mutexes
  - Use atomic operations
  - Use message queues
  - Ensure single writer

---

Issue: Timing-Dependent Bug

Symptom: Test sometimes passes and sometimes fails

Example:
  test_user_creation:
    1. Create user (sometimes slow)
    2. Check user exists
    3. Fails if create took too long

Debug:
  - Add timeout logging
  - Increase wait time
  - Add explicit waits
  - Mock slow operations

Solution:
  - Explicit wait for condition
  - Remove time-dependent assertions
  - Use proper test fixtures
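An explicit wait polls for the condition the test actually cares about instead of sleeping for a fixed time. This is a rough sketch of such a helper; `waitFor` and its option names are assumptions made for illustration, and many test frameworks ship an equivalent that should be preferred when available.

```javascript
// Poll `condition` until it returns a truthy value or `timeoutMs` elapses.
// `condition` may be synchronous or return a promise.
async function waitFor(condition, { timeoutMs = 5000, intervalMs = 50 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const value = await condition();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    // Sleep briefly before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In the `test_user_creation` example, step 2 could become something like `await waitFor(() => findUser(id))`, where `findUser` is a hypothetical lookup. The test then passes as soon as the user exists and fails with a clear timeout message if it never does.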

---

Issue: Resource Exhaustion

Symptom: Works fine at first, then fails after running for a while

Example:
  - Memory grows over time
  - Connection pool exhausted
  - Disk space fills up
  - Max open files reached

Debug:
  - Monitor resources continuously
  - Check for leaks (memory growth)
  - Monitor connection count
  - Check long-running processes

Solution:
  - Fix memory leak
  - Increase resource limits
  - Implement cleanup
  - Add monitoring/alerts
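To check for memory growth in a Node.js process, one option is to sample heap usage periodically and fit a trend line through the samples. The sketch below assumes Node.js (`process.memoryUsage()` is a Node API); `growthRate`, `startMemorySampler`, and their options are hypothetical names, and the sampling interval is an arbitrary starting point.

```javascript
// Least-squares slope of heap usage over time, in bytes per millisecond.
// A slope that stays positive across many samples suggests a leak.
function growthRate(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanTime = samples.reduce((sum, s) => sum + s.time, 0) / n;
  const meanHeap = samples.reduce((sum, s) => sum + s.heapUsed, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const s of samples) {
    numerator += (s.time - meanTime) * (s.heapUsed - meanHeap);
    denominator += (s.time - meanTime) ** 2;
  }
  return denominator === 0 ? 0 : numerator / denominator;
}

// Sample heap usage on an interval and report the current trend.
// Returns a function that stops the sampler.
function startMemorySampler({ intervalMs = 60000, maxSamples = 60, onSample = () => {} } = {}) {
  const samples = [];
  const timer = setInterval(() => {
    samples.push({ time: Date.now(), heapUsed: process.memoryUsage().heapUsed });
    if (samples.length > maxSamples) samples.shift(); // keep a bounded window
    onSample(growthRate(samples), samples);
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for sampling
  return () => clearInterval(timer);
}
```

Garbage collection makes individual samples noisy, so the trend over a long window is more informative than any single reading.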

---

Issue: Intermittent Network Failure

Symptom: API calls occasionally fail

Debug:
  - Check network logs
  - Identify timeout patterns
  - Check if time-of-day dependent
  - Check if load dependent

Solution:
  - Implement exponential backoff retry
  - Add circuit breaker
  - Increase timeout
  - Add redundancy
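Exponential backoff retry, the first solution listed above, doubles the wait after each failed attempt so a struggling service is not hit again immediately. The following is a sketch under stated assumptions: `backoffDelay`, `jitteredDelay`, `retryWithBackoff`, and the default values are illustrative, and the random "full jitter" is one common variant rather than the only option.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 100, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Full jitter: a random delay in [0, backoffDelay) so clients do not retry in lockstep.
function jitteredDelay(attempt, baseMs, maxMs, random = Math.random) {
  return Math.floor(random() * backoffDelay(attempt, baseMs, maxMs));
}

// Call `operation` until it succeeds or `retries` retries have been used up.
async function retryWithBackoff(
  operation,
  { retries = 4, baseMs = 100, maxMs = 5000, sleep = defaultSleep } = {}
) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      if (attempt >= retries) throw error; // out of retries: surface the last error
      await sleep(jitteredDelay(attempt, baseMs, maxMs));
    }
  }
}
```

Retries are only safe for operations that can be repeated without side effects. A payment call like `chargeCard` would need an idempotency mechanism before being wrapped this way.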

3. Systematic Investigation Process

yaml

Step 1: Understand the Pattern
  Questions:
    - How often does it occur? (1/100, 1/1000?)
    - When does it occur? (time of day, load, specific user?)
    - What are the conditions? (network, memory, load?)
    - Is it reproducible? (deterministic or random?)
    - Any recent changes?

  Analysis:
    - Review error logs
    - Check error rate trends
    - Identify patterns
    - Correlate with changes

Step 2: Reproduce Reliably
  Methods:
    - Increase test frequency (run 1000 times)
    - Stress test (heavy load)
    - Simulate poor conditions (network, memory)
    - Run on different machines
    - Run in production-like environment

  Goal: Make the issue occur consistently enough to analyze
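Running a suspect test many times turns "it fails sometimes" into a measurable failure rate. A small harness along these lines could do that; `runRepeatedly` is a hypothetical helper, not part of any test framework.

```javascript
// Run `testFn` repeatedly and record which iterations fail.
// `testFn` receives the iteration index and signals failure by throwing or rejecting.
async function runRepeatedly(testFn, iterations = 1000) {
  const failures = [];
  for (let i = 0; i < iterations; i++) {
    try {
      await testFn(i);
    } catch (error) {
      failures.push({ iteration: i, message: error.message });
    }
  }
  return {
    iterations,
    failureCount: failures.length,
    failureRate: failures.length / iterations,
    failures,
  };
}
```

The recorded failure rate also gives a baseline: after a fix, the same run should report zero failures over at least as many iterations.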

Step 3: Add Instrumentation
  - Add detailed logging
  - Add monitoring metrics
  - Add trace IDs
  - Capture errors fully
  - Log system state

Step 4: Capture the Issue
  - Recreate scenario
  - Capture full context
  - Note system state
  - Document conditions
  - Get reproduction case

Step 5: Analyze Data
  - Review logs
  - Look for patterns
  - Compare normal vs error cases
  - Check timing correlations
  - Identify root cause
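Comparing normal and error cases often comes down to grouping structured log records by one field and seeing where failures cluster. This sketch assumes each record is an object with a `status` field equal to `'error'` on failure. That shape, and the `failureRateBy` name, are assumptions for illustration.

```javascript
// Group records by `field` and compute the failure rate for each distinct value.
// Results are sorted so the most failure-prone value comes first.
function failureRateBy(records, field) {
  const groups = new Map();
  for (const record of records) {
    const key = String(record[field]);
    const group = groups.get(key) ?? { total: 0, failed: 0 };
    group.total += 1;
    if (record.status === 'error') group.failed += 1;
    groups.set(key, group);
  }
  return [...groups]
    .map(([value, { total, failed }]) => ({ value, total, failed, rate: failed / total }))
    .sort((a, b) => b.rate - a.rate);
}
```

Running it over several fields in turn (host, region, client version, hour of day) can show quickly whether failures cluster on one value or are spread evenly.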

Step 6: Implement Fix
  - Based on root cause
  - Verify with reproduction case
  - Test extensively
  - Add regression test

4. Monitoring & Prevention

yaml

Monitoring Strategy:

Real User Monitoring (RUM):
  - Error rates by feature
  - Latency percentiles
  - User impact
  - Trend analysis

Application Performance Monitoring (APM):
  - Request traces
  - Database query performance
  - External service calls
  - Resource usage

Synthetic Monitoring:
  - Regular test execution
  - Simulate user flows
  - Alert on failures
  - Trend tracking

---

Alerting:

Set up alerts for:
  - Error rate spike
  - Response time > threshold
  - Memory growth trend
  - Failed transactions
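An error rate spike alert can be approximated with a sliding window over recent requests. In practice a monitoring system would usually evaluate this rule. The sketch below only illustrates the logic, and `createErrorRateMonitor`, its thresholds, and the minimum request count are assumed values to tune.

```javascript
// Track request outcomes in a sliding time window and flag error rate spikes.
function createErrorRateMonitor({ windowMs = 60000, threshold = 0.05, minRequests = 20 } = {}) {
  const events = []; // { time, ok }, oldest first

  return {
    // Record one request outcome; `now` is injectable for testing.
    record(ok, now = Date.now()) {
      events.push({ time: now, ok });
    },
    // Drop events outside the window, then evaluate the alert condition.
    check(now = Date.now()) {
      while (events.length > 0 && events[0].time <= now - windowMs) {
        events.shift();
      }
      const total = events.length;
      const errors = events.filter((event) => !event.ok).length;
      const rate = total === 0 ? 0 : errors / total;
      // Require a minimum sample size so one failure out of two requests does not page anyone.
      return { total, errors, rate, alert: total >= minRequests && rate > threshold };
    },
  };
}
```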

---

Prevention Checklist:

[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked

---

Tools:

Monitoring:
  - New Relic / Datadog
  - Prometheus / Grafana
  - Sentry / Rollbar
  - Custom logging

Testing:
  - Load testing (k6, JMeter)
  - Chaos engineering (Gremlin)
  - Property-based testing (Hypothesis)
  - Fuzz testing

Debugging:
  - Distributed tracing (Jaeger)
  - Correlation IDs
  - Detailed logging
  - Debuggers

Key Points

Comprehensive logging is essential
Add correlation IDs for tracing
Monitor for patterns and trends
Stress test to reproduce
Use detailed error context
Implement exponential backoff for retries
Monitor resource exhaustion
Add circuit breakers for external services
Log system state with errors
Implement proper monitoring/alerting
