debug-training-logs

Distributed Training Log Debugger

You are debugging a distributed training job failure. The user will provide one or two directories:
  • Worker logs (stderr from SLURM/torchrun): `$ARGUMENTS[0]` (required)
  • AIStore daemon logs (proxy/target tarballs or extracted dirs): `$ARGUMENTS[1]` (optional)
Your goal: find the root cause, not just the symptom. There are often cascading failures — one root cause triggers many downstream errors. Trace backwards from the final crash to the original trigger.

CRITICAL: Verification discipline

You MUST double-check every conclusion before presenting it to the user. Distributed training failures have cascading effects that make it easy to mistake a symptom for the root cause. Follow these rules:
  1. Check ALL ranks, not a sample. Do not look at 5 ranks and assume the other 123 are the same. Use `sort -u` on the full output to find outliers. A single outlier rank can be the entire root cause.
  2. Verify NCCL state for EVERY rank. Extract `last enqueued work` and `last completed work` for all ranks and check for ANY rank that differs. A rank with `enqueued == completed` (no pending ops) is fundamentally different from ranks with `enqueued > completed` (ops pending) — it means that rank never entered the stuck collective.
  3. Distinguish "First PG on this rank to signal dumping" from "Observed flight recorder dump signal from another rank". The first is the initiator (its own watchdog fired). The second was notified (it may not even be in the collective). These are DIFFERENT failure modes.
  4. Before stating a root cause, re-verify it against the raw data. Re-read the actual log lines. Do not rely on earlier summaries you wrote — they may have been wrong.
  5. If your analysis changes during the investigation, explicitly state what was wrong before and why. Do not silently shift conclusions.

Phase 0: Obtain logs

If the user has not provided worker logs, ask them to download and provide the SLURM error logs. The logs are typically on the compute nodes or in a shared filesystem. Suggest:

From the user's machine, SCP the SLURM error logs:

```bash
scp <cluster>:/path/to/slurm/logs/error-<JOBID>-*.out ./training_logs/
```

Or if the logs are on a shared filesystem accessible from a login node:

```bash
mkdir -p ./training_logs
cp /path/to/slurm/error-<JOBID>-*.out ./training_logs/
```

The user needs to provide ALL per-node error files (e.g., `error-JOBID-0.out` through `error-JOBID-N.out`) — not just one node. Without all files you cannot identify which specific rank caused the failure.

Phase 1: Triage worker logs

1.1 Understand the job

Read the first 80 lines of a few log files to determine:
  • Framework (NeMo, Megatron, PyTorch Lightning, DeepSpeed, etc.)
  • Scale (GPU count, node count, ranks per node)
  • What the job is doing (training, fine-tuning, inference)
  • Whether it resumed from a checkpoint
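A minimal first-pass sketch, assuming the logs were copied to `./training_logs/` with the per-node naming from Phase 0:

```bash
# Skim the job header from a couple of nodes to identify framework, scale, and resume state
for f in ./training_logs/error-*-0.out ./training_logs/error-*-1.out; do
  echo "===== $f ====="
  head -n 80 "$f"
done
```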

1.2 Find the fatal errors

Search ALL log files in parallel for these patterns (priority order):
Tier 1 — Process killers:
NCCL.*timeout|Watchdog caught collective operation timeout
taking the entire process down
SIGTERM|SIGKILL|SIGABRT
CUDA error|CUDA out of memory|OOM
Tier 2 — Training loop crashes:
RuntimeError|Exception.*Error
AISBatchLoaderError|StopIteration
Traceback \(most recent call last\)
Tier 3 — Data loading / IO:
Connection reset|Connection broken|Connection refused
retrying [0-9]+/[0-9]+
timed out|deadline exceeded
broken pipe
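For example, a sweep over all files for these tiers might look like the following sketch (the `./training_logs/error-*.out` glob is an assumption from Phase 0; adjust to the actual layout):

```bash
# Tier 1 — process killers: show file, line number, and the matching text
grep -nE "NCCL.*timeout|Watchdog caught collective operation timeout|taking the entire process down|SIGTERM|SIGKILL|SIGABRT|CUDA error|CUDA out of memory|OOM" \
  ./training_logs/error-*.out | head -n 100

# Tier 2/3 — quick per-file counts to see which error class dominates
grep -cE "RuntimeError|Exception.*Error|AISBatchLoaderError|StopIteration|Traceback" ./training_logs/error-*.out
grep -cE "Connection reset|Connection broken|Connection refused|retrying [0-9]+/[0-9]+|timed out|deadline exceeded|broken pipe" ./training_logs/error-*.out
```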

1.3 Identify the initiator and the straggler

For NCCL timeouts, you MUST check ALL ranks — do not sample. Extract the full NCCL state for every rank:

Get watchdog state for all ranks that fired their own watchdog:

grep "failure detected by watchdog" error-.out | grep -o "Rank [0-9].last enqueued work: [0-9], last completed work: [0-9]*" | sort -u
grep "failure detected by watchdog" error-.out | grep -o "Rank [0-9].last enqueued work: [0-9], last completed work: [0-9]*" | sort -u

Get ranks that were NOTIFIED (did not fire their own watchdog):

grep "Observed flight recorder dump signal from another rank" error-.out | grep -o "Rank [0-9]"
grep "Observed flight recorder dump signal from another rank" error-.out | grep -o "Rank [0-9]"

Count total unique ranks found vs expected:

grep "failure detected by watchdog" error-.out | grep -o "Rank [0-9]" | sort -t' ' -k2 -n -u | wc -l

**The rank that "Observed" instead of "detected" is likely the straggler** — it had no pending NCCL operations because it never entered the collective. Check its `Last enqueued NCCL work` — if it equals `last completed NCCL work`, the rank was stuck OUTSIDE NCCL (in the training loop, data loading, etc.), not inside a collective.

For NCCL BROADCAST/ALLREDUCE timeouts, calculate when the collective started:
`start_time = timeout_time - timeout_ms` (usually 1800000ms = 30min)
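A small worked example with GNU `date` (the timestamp is hypothetical):

```bash
# Watchdog fired at 14:32:10 UTC with Timeout(ms)=1800000 (= 1800 s),
# so the stuck collective was enqueued roughly 30 minutes earlier:
date -u -d "2024-05-01 14:32:10 UTC - 1800 seconds" "+%Y-%m-%d %H:%M:%S"
# 2024-05-01 14:02:10
```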
grep "failure detected by watchdog" error-.out | grep -o "Rank [0-9]" | sort -t' ' -k2 -n -u | wc -l

**显示“Observed”而非“detected”的rank很可能是滞后rank**——它没有待处理的NCCL操作,因为从未进入集体通信。检查其`Last enqueued NCCL work`——若该值等于`last completed NCCL work`,则说明该rank在NCCL外部(训练循环、数据加载等环节)阻塞,而非在集体通信内部。

对于NCCL BROADCAST/ALLREDUCE超时,计算集体通信的开始时间:`start_time = timeout_time - timeout_ms`(通常为1800000ms = 30分钟)

1.4 Count and classify

  • Count `Connection reset` across all files (total and per-file)
  • Check retry patterns: are retries always 1/N (recovering) or do they escalate to N/N (exhausted)?
  • Count unique error types and affected ranks
  • Look for `AISBatchLoaderError` or similar batch loader errors — these indicate AIStore returned fewer objects than requested
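A sketch of this counting pass (file glob assumed from Phase 0):

```bash
# Connection resets: per-file counts, then the total
grep -c "Connection reset" ./training_logs/error-*.out
grep -o "Connection reset" ./training_logs/error-*.out | wc -l

# Retry pattern: list every retry counter seen; exhausted retries (e.g. 3/3) stand out here
grep -ohE "retrying [0-9]+/[0-9]+" ./training_logs/error-*.out | sort | uniq -c | sort -rn
```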

1.5 Build the timeline

Determine the causal chain: which error happened first, which are consequences. The pattern is usually:
Data loading error (root cause)
  -> Some ranks crash out of training loop
  -> Crashed ranks can't participate in NCCL collective
  -> NCCL collective hangs for timeout period (usually 30min)
  -> Watchdog kills all remaining ranks

1.6 Check NeMo-specific synchronization points

NeMo has per-step collective operations that can cause rank desynchronization:
  • PreemptionCallback (`nemo/utils/callbacks/preemption.py`): Calls `torch.distributed.broadcast(interrupted, 0)` at the end of EVERY training step via `on_train_batch_end`. If one rank's training step takes >30 min longer than others, this broadcast times out.
  • NeMoModelCheckpoint (`nemo/utils/callbacks/nemo_model_checkpoint.py`): Multiple `trainer.strategy.broadcast()` calls during checkpoint save/load.
  • DDP gradient all-reduce: Automatic per-step synchronization during backward pass.
  • `broadcast_buffers` (DDP default=True): Broadcasts model buffers (e.g., batch norm stats) from rank 0 at each forward pass.
If rank 0 is ahead of other ranks in NCCL SeqNum, check whether the gap matches the number of per-step collectives (PreemptionCallback broadcast + DDP allreduce + broadcast_buffers = ~3 ops per step), as in the quick check below.
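For example, if rank 0 reports `last enqueued work: 4521` while the stragglers report 4509, a gap of 12 over ~3 collectives per step puts rank 0 about 4 steps ahead (numbers hypothetical):

```bash
# Rough step-gap estimate: (rank0_enqueued - straggler_enqueued) / ops_per_step
echo $(( (4521 - 4509) / 3 ))   # -> 4 training steps ahead, approximately
```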

Phase 1.7: Request AIStore logs if not provided

If the analysis points to storage I/O issues (connection resets, data loading errors, timeouts) and the user did NOT provide AIS daemon logs, suggest downloading them. First check if the `ais` CLI is available (`which ais`). If it is, guide the user through:
  1. Set the cluster endpoint (ask the user for the endpoint URL):
     ```bash
     export AIS_ENDPOINT=https://<ais-cluster-endpoint>:<port>
     ```
  2. Set the auth token (ask the user for the token value):
     ```bash
     export AIS_AUTHN_TOKEN=<token>
     ```
  3. Handle TLS — skip cert verification or point to the CA cert:
     ```bash
     # Option A: skip verification
     ais config cli set cluster.skip_verify_crt=true
     # Option B: set CA cert
     export AIS_SERVER_CRT=/path/to/ca.crt
     ```
  4. Download all cluster logs to the current log directory:
     ```bash
     ais log get cluster <path-to-worker-logs-dir>/ais_logs
     ```
     This downloads TAR.GZ archives from all proxy and target nodes.
  5. Extract and analyze per Phase 2 below.
If the `ais` CLI is not installed, it can be built from the AIStore repository (`cd cmd/cli && go install .`) or downloaded as a binary. Alternatively, ask the user to download the logs manually from the AIS cluster.

Phase 2: Debug AIStore logs (if provided)

2.1 Extract and orient

If tarballs (`.tar.gz`), extract them:
```bash
mkdir -p extracted && cd extracted
for f in ../*.tar.gz; do
  name=$(basename "$f" .tar.gz)
  mkdir -p "$name" && tar xzf "$f" -C "$name"
done
```
AIS log file naming and time ranges:
AIS daemon logs follow the naming convention:
```
aistarget.ais-target-N.INFO.MMDD-HHMMSS.1    # target logs
aisproxy.ais-proxy-N.INFO.MMDD-HHMMSS.1      # proxy logs
```
A single daemon (target or proxy) may have multiple log files — each file covers a specific time range:
  • The `MMDD-HHMMSS` in the filename is the start time of that file
  • The file covers from its start time until the start time of the next file for the same daemon
  • If there is no next file, it covers until the daemon stopped or the logs were collected
  • A new file is created when the daemon restarts (crash, upgrade, maintenance cycle)
To find the correct file for a failure window:
  1. List all files for each daemon, sorted by the timestamp in the filename
  2. For each daemon, find the file whose start time is BEFORE the failure window AND whose next file's start time is AFTER the failure window (or there is no next file)
  3. A daemon may have been restarted DURING the failure window — check if any file starts within the window (this indicates a restart, which is itself significant)
  4. Log lines within a file only have TIME (HH:MM:SS), not dates. If a file spans midnight, the same time (e.g., "21:30") may appear twice — once for each day. Use surrounding context (stats counter values, known events) to disambiguate which day a log line belongs to.
Always check ALL files for a daemon, not just the latest. The latest file may only cover minutes after a restart, while the failure evidence is in an earlier file.
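A sketch for building that per-daemon file inventory (assumes the `extracted/` layout from the step above and GNU find):

```bash
# One line per daemon log file; lexical sort groups by daemon and orders by MMDD-HHMMSS start time
find extracted -type f \( -name "aistarget.*" -o -name "aisproxy.*" \) -printf "%f\n" | sort
```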
Timezone verification: AIS daemon logs and worker (SLURM/torchrun) logs may be on different machines in different timezones. NEVER assume they match. To verify:
  1. Look for AIS periodic timestamp markers: `common:NNN DD Mon YY HH:MM UTC =============` — these explicitly state the timezone (typically UTC).
  2. Look for absolute timestamps in worker logs: NeMo uses `YYYY-MM-DD HH:MM:SS`, SLURM uses `YYYY-MM-DDThh:mm:ss` — but neither includes timezone by default.
  3. Cross-reference a known event visible in both log sources. The best anchor is the job death: find the SLURM `CANCELLED` timestamp in worker logs, then find the exact moment `get.n` stops incrementing in AIS stats. If they align, the timezones match. If they're offset by a round number of hours, one system is in a different timezone.
  4. If timezones don't match, apply the offset consistently before correlating events.
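A sketch of that cross-check (paths and globs are assumptions; the grep patterns are approximate and should be verified against the actual log format):

```bash
# AIS side: periodic markers state the timezone explicitly (typically UTC)
grep -h "UTC ====" extracted/*/aistarget.* | head -n 3

# Worker side: when did SLURM cancel the job?
grep -h "CANCELLED" ./training_logs/error-*.out | head -n 3

# AIS side: when did GET traffic stop? (stats lines mentioning get.n near the end of the file)
grep -h "get.n" extracted/*/aistarget.* | tail -n 20
```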

2.2 Check target stats during failure window

AIStore targets emit periodic stats lines (every ~3 min). Extract key counters around the failure time:
Critical counters to track:
  • `err.get.n` — GET errors (should be stable; spikes indicate problems)
  • `err.getbatch.n` — batch GET errors
  • `err.http.write.n` — broken HTTP responses to clients
  • `err.put.n`, `err.head.n`, `err.lst.n` — other operation errors
  • `get.n` vs `err.get.n` — compute error rate
  • `getbatch.n`, `getbatch.obj.n` — batch operation counts
Compare values between consecutive stats lines to find the delta (new errors in that interval).
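A sketch of that extraction (the log path is hypothetical; pick the file covering the failure window per 2.1):

```bash
LOG=extracted/ais-target-0/aistarget.ais-target-0.INFO.0501-120000.1   # example path

# Pull the periodic stats lines and eyeball how the error counters grow between ~3-minute intervals
grep -nE "err\.get\.n|err\.getbatch\.n|err\.http\.write\.n" "$LOG" | tail -n 40
```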

2.3 Search for error-level messages

grep "^E " <logfile>          # Error messages
grep "^W " <logfile>          # Warning messages
Key error patterns in AIStore:
  • `x-get-batch.*out-of-bounds index` — batch GET lost objects during inter-target streaming
  • `shared-dm.*terminated.*broken pipe` — inter-target data mover stream failure
  • `shared-dm: xid.*not found, dropping recv` — objects dropped because batch job already aborted
  • `resource pressure: load=critical` — target under disk/memory/CPU pressure
  • `lcache.*hk.*dsk=critical` — disk at critical level, housekeeping skipped
  • `gc:.*free mem` / `oom:` — memory pressure / forced GC
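A combined sweep over one daemon's log, as a sketch (path hypothetical):

```bash
LOG=extracted/ais-target-0/aistarget.ais-target-0.INFO.0501-120000.1

# Count each known failure signature in the target log
for pat in "x-get-batch.*out-of-bounds index" \
           "shared-dm.*terminated.*broken pipe" \
           "shared-dm: xid.*not found, dropping recv" \
           "resource pressure: load=critical" \
           "lcache.*hk.*dsk=critical"; do
  printf "%-45s %s\n" "$pat" "$(grep -cE "$pat" "$LOG")"
done
```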
grep "^E " <logfile>          # 错误消息
grep "^W " <logfile>          # 警告消息
AIStore中的关键错误模式:
  • x-get-batch.*out-of-bounds index
    —— 批量GET在target间流传输时丢失对象
  • shared-dm.*terminated.*broken pipe
    —— target间数据传输流失败
  • shared-dm: xid.*not found, dropping recv
    —— 批量任务已终止,丢弃对象
  • resource pressure: load=critical
    —— target处于磁盘/内存/CPU压力临界状态
  • lcache.*hk.*dsk=critical
    —— 磁盘空间不足,跳过清理操作
  • gc:.*free mem
    /
    oom:
    —— 内存压力 / 强制垃圾回收

2.4 Check proxy logs

The proxy orchestrates batch GETs. Check proxy stats for:
  • `err.get.n` — should be near zero; high = proxy-level routing failures
  • `err.http.write.n` — proxy dropping client connections
If proxy errors are stable but target errors are spiking, the problem is target-side (disk, memory, inter-target networking).

2.5 Correlate the x-get-batch failure mechanism

The typical AIStore batch GET failure chain:
1. Target under resource pressure (dsk=critical, mem=low)
2. Inter-target shared-dm streams break (broken pipe)
3. x-get-batch gets out-of-bounds index (recv'd len=0)
4. Batch job aborted, subsequent objects dropped (xid not found)
5. Client receives fewer objects than requested
6. Lhotse/client batch loader raises error (iterator exhausted prematurely)
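Each link leaves a distinct grep-able trace, so the chain can be confirmed step by step — a sketch (globs are assumptions):

```bash
# Which target logs show each link of the chain?
grep -l "dsk=critical"                    extracted/*/aistarget.*   # 1. resource pressure
grep -l "shared-dm.*broken pipe"          extracted/*/aistarget.*   # 2. inter-target stream breakage
grep -l "x-get-batch.*out-of-bounds"      extracted/*/aistarget.*   # 3. lost objects in a batch
grep -l "xid.*not found, dropping recv"   extracted/*/aistarget.*   # 4. dropped objects after abort
grep -l "AISBatchLoaderError"             ./training_logs/error-*.out   # 5-6. client saw a short batch
```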

Phase 3: Synthesize

3.1 Write the report

Structure your output as:
  • Job Details — framework, scale, start time, data source
  • Timeline Table — chronological events with timestamps, source file:line
  • Root Cause Chain — numbered chain from trigger to final crash, with arrows
  • Key Files — which log files contain the critical evidence
  • Recommendations — actionable fixes (storage-side, client-side, training config)

3.2 Classify the failure

Common root cause categories:
  • Storage I/O: disk pressure, broken pipes, connection resets, batch object loss
  • Network: NCCL timeout without data errors, NIC failures, switch issues
  • GPU/CUDA: OOM, ECC errors, CUDA assertions
  • Data: corrupt/missing data files, manifest mismatches, schema errors
  • Software: version mismatches, config errors, OOM in Python process
  • Infrastructure: node failure, preemption, SLURM timeout
  • Data loading stall: single rank stuck in data loading (no timeout on read), blocking all other ranks at the next collective

Reference: AIStore error counter meanings

| Counter | Meaning |
| --- | --- |
| `err.get.n` | Individual object GET failures on this target |
| `err.getbatch.n` | Batch GET operation failures |
| `err.http.write.n` | Failed HTTP writes to client (connection dropped) |
| `err.append.n` | Append operation failures |
| `err.put.n` | PUT operation failures |
| `err.head.n` | HEAD (metadata) operation failures |
| `err.lst.n` | List operation failures |
| `err.kalive.n` | Keep-alive failures |
| `getbatch.n` | Total batch GET operations |
| `getbatch.obj.n` | Total objects served via batch GET |
| `getbatch.throttle.ns` | Time spent throttling batch GETs |

Reference: NCCL timeout anatomy

  • `Watchdog caught collective operation timeout` — the NCCL watchdog detected a stuck collective
  • `SeqNum=N, OpType=BROADCAST/ALLREDUCE` — which collective and sequence number
  • `last enqueued work: N, last completed work: M` — work M completed, work M+1 is stuck
  • `Timeout(ms)=1800000` — 30-minute timeout (default)
  • `First PG on this rank to signal dumping` — THIS rank initiated the cascade
  • `Observed flight recorder dump signal from another rank` — this rank is reacting to another's timeout
  • `To avoid data inconsistency, we are taking the entire process down` — watchdog kills the process

Distinguishing data loading stalls from GPU/NCCL hangs

The `last enqueued work` vs `last completed work` state is critical for determining whether the hang is in the CPU (data loading, training loop) or in the GPU (NCCL communication):
  • If `enqueued == completed` (no pending ops): This rank has NO NCCL work in-flight. It is stuck in the CPU-side training loop (data loading, audio decoding, batch prep) and never entered the collective. This is the straggler — the rank that caused the hang.
  • If `enqueued == completed + 1`: The rank has submitted exactly one operation that hasn't completed. It entered the collective but can't complete because the straggler rank hasn't joined.
  • If `enqueued > completed + 1`: Multiple operations are queued — the CPU has progressed past the stuck point and submitted additional operations asynchronously (e.g., DDP gradient allreduce via hooks). Still waiting for the straggler.
  • If rank 0 has higher `enqueued`/`completed` than others: Rank 0 (often the broadcast root) completed its side of a collective (send) but receivers can't complete (receive) because the straggler hasn't joined.
The key pattern for a data loading stall:
  • 1 rank: `enqueued == completed`, `active collectives: 0`, "Observed flight recorder dump signal from another rank" — this is the straggler
  • N-1 ranks: `enqueued > completed`, "failure detected by watchdog" — these are waiting for the straggler
  • Rank 0: may be further ahead if it's the broadcast root (can complete send without receivers)
The key pattern for a GPU fabric issue:
  • ALL ranks: `enqueued > completed` (all entered the collective), all show "First PG on this rank to signal dumping" — no straggler, the collective itself is broken
Compare `enqueued` counts across ALL ranks. Even ONE outlier changes the entire diagnosis.
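A sketch that applies this classification mechanically across all ranks (the regex follows the watchdog message format quoted above; verify it against the actual log lines):

```bash
# For every rank: enqueued, completed, and whether it was waiting inside a collective or stuck outside NCCL
grep -hoE "Rank [0-9]+.*last enqueued work: [0-9]+, last completed work: [0-9]+" ./training_logs/error-*.out \
  | sed -E 's/Rank ([0-9]+).*last enqueued work: ([0-9]+), last completed work: ([0-9]+)/\1 \2 \3/' \
  | sort -n -u \
  | awk '{ if ($2 == $3) print "rank " $1 ": enqueued == completed -> likely straggler (stuck outside NCCL)";
           else          print "rank " $1 ": enqueued " $2 " > completed " $3 " -> waiting in a collective" }'
```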

Reference: NeMo/Lhotse data loading pitfalls

Common data loading issues that cause single-rank stalls:
  • No timeout on Lhotse URL audio reads: `AudioSource._prepare_for_reading()` calls `f.read()` with no Lhotse-level timeout. A stalled download blocks the DataLoader worker indefinitely.
  • `fault_tolerant=True` silently drops failed audio: Failed audio files are skipped, reducing effective batch size per rank. Different ranks may have different failure rates depending on their shard assignments.
  • `.m4a` files lose extension in BytesIO: When audio is downloaded from URLs and wrapped in BytesIO, the file extension is lost. Lhotse's `CompositeAudioBackend` cannot use the fast-path for m4a (TorchaudioFFMPEGBackend) and falls through to expensive cascading backend trials.
  • Connection reset on idle keep-alive: AIStore closes idle HTTP connections after 30s (`DfltMaxIdleTimeout`). The Python SDK's urllib3 pool doesn't match this timeout, causing `Connection reset by peer` on stale pooled connections. These are caught and retried (always succeed on 1st retry) — they are noise, NOT a root cause.
  • Each rank gets disjoint data shards: Lhotse splits shards via `src[rank::world_size]`. One rank may get shards with more corrupt files, larger audio, or slower storage targets.