nemo-mbridge-multi-node-slurm

Multi-Node Slurm

Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

Two Approaches: srun-native vs uv run torch.distributed

| Approach | `ntasks-per-node` | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM pretrain_gpt.py |

Prefer srun-native: it is simpler and avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` from SLURM env vars (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during `initialize.py` distributed init, so you never need to set them manually.

Cluster Environment

Container

bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"

Standard Paths

bash
WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"

Tokens / Caches

bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
Important: `NEMO_HOME` must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs. The default (`/root/.cache/nemo`) is container-local and not shared across nodes. Without this, packed-sequence data files prepared on node 0 are invisible to other nodes, causing `TypeError: 'NoneType' object is not an iterator`.
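A pre-flight check can catch a container-local `NEMO_HOME` before a multi-node job is submitted. The sketch below is illustrative only: `check_nemo_home` and the `/lustre` prefix are hypothetical and not part of NeMo or Megatron Bridge, and the list of shared prefixes is an assumption to adapt to your cluster.

```python
import os

# Hypothetical list of mount prefixes that are shared across nodes on your cluster.
SHARED_PREFIXES = ("/lustre",)


def check_nemo_home(env, shared_prefixes=SHARED_PREFIXES):
    """Return (ok, message) describing whether NEMO_HOME looks node-shared."""
    nemo_home = env.get("NEMO_HOME")
    if not nemo_home:
        return False, "NEMO_HOME is unset; the default /root/.cache/nemo is container-local"
    path = os.path.normpath(nemo_home)
    for prefix in shared_prefixes:
        prefix = os.path.normpath(prefix)
        # Match the prefix itself or any path strictly below it.
        if path == prefix or path.startswith(prefix + os.sep):
            return True, f"NEMO_HOME={path} is under shared prefix {prefix}"
    return False, f"NEMO_HOME={path} is not under any of {list(shared_prefixes)}"


if __name__ == "__main__":
    ok, message = check_nemo_home(os.environ)
    print("OK" if ok else "WARNING", message)
```

Running this on the login node before `sbatch` only checks the path prefix; it does not prove the filesystem is actually mounted inside the container.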

Log Directory

text
<SHARED_FS>/logs/<job_name>_<suffix>

srun-native Approach (Preferred)

Slurm spawns all processes directly. No `torch.distributed.run`, no TRAIN_CMD escaping.

SBATCH Headers

bash
#SBATCH --job-name=<model>-<task>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=8          # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive

Build and Launch

Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.

bash
# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"
# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync && uv run --no-sync python <script.py> <args>"

srun-native Key Points

  • Phase 1 runs `uv sync` once on a single node/process, building all wheels into the shared cache on Lustre.
  • Phase 2's `uv sync` is a fast no-op (everything is cached), so it is safe to run on all ranks without sleep guards.
  • `initialize.py` + `common_utils.py` auto-set `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars.
  • Env vars like `HF_TOKEN`, `HF_HOME`, `UV_CACHE_DIR` exported at sbatch level are inherited by srun tasks.
  • Reference: `examples/models/glm/glm_45v/slurm_sft.sh`, `examples/models/minimax/minimax_m2/slurm_conversion.sh`

uv run torch.distributed Approach (Legacy)

Use when the script requires `torch.distributed.run` (e.g., MLM pretrain_gpt.py) or when Bridge's `initialize.py` is not in the call path.

1. Add SBATCH Headers

bash
#SBATCH --job-name=<model>-<framework>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1          # ALWAYS 1 — torchrun handles per-node spawning
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
Critical: use `--ntasks-per-node=1`, NOT 8. `uv run python -m torch.distributed.run --nproc_per_node=8` already spawns 8 processes per node. With `--ntasks-per-node=8`, each of the 8 Slurm tasks starts its own launcher, giving 8 tasks x 8 procs = 64 processes per node and EADDRINUSE port collisions.
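The arithmetic behind that warning can be checked before submitting. This is a minimal sketch with hypothetical helper names (`processes_per_node`, `check_launch`); it assumes one launcher per Slurm task and 8 GPUs per node, as in the headers above.

```python
def processes_per_node(ntasks_per_node, nproc_per_node):
    """Each Slurm task starts one launcher; each launcher spawns nproc_per_node workers."""
    return ntasks_per_node * nproc_per_node


def check_launch(ntasks_per_node, nproc_per_node, gpus_per_node=8):
    """Return a short verdict on whether the worker count matches the GPU count."""
    total = processes_per_node(ntasks_per_node, nproc_per_node)
    if total == gpus_per_node:
        return f"ok: {total} workers for {gpus_per_node} GPUs"
    return f"mismatch: {total} workers for {gpus_per_node} GPUs"


if __name__ == "__main__":
    # Legacy approach done right: one task per node, launcher spawns 8 workers.
    print(check_launch(ntasks_per_node=1, nproc_per_node=8))
    # The misconfiguration described above: 8 tasks, each spawning 8 workers.
    print(check_launch(ntasks_per_node=8, nproc_per_node=8))
```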

2. Convert to Multi-Node

Replace single-node:
bash
uv run python -m torch.distributed.run --nproc_per_node=8 \
  <script> <args>

With multi-node (inside `TRAIN_CMD` string):
bash
uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  <script> <args>

`MASTER_ADDR` and `MASTER_PORT` are auto-derived from SLURM env vars by `initialize.py` / `common_utils.py`, so there is no need to set them.

3. Wrap in TRAIN_CMD + two-phase srun

Use the same two-phase pattern: first a single-process srun to warm the uv cache, then the full run.
Environment exports go inside TRAIN_CMD (they must be set inside the container):
bash
TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<training command here>
"

4. Launch (two-phase)

bash
# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"
# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"

5. (Optional) Add Loss Extraction Footer

bash
echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25
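If the losses are needed as numbers rather than raw lines, the same filter can be sketched in Python. `extract_losses` is a hypothetical helper, and the sample lines are invented to resemble Megatron-style iteration logs; real log formats may differ, in which case the value comes back as `None` while the line is still kept.

```python
import re

# Same filters as the grep pipeline: "iteration" followed by whitespace
# (case-sensitive), then one of the loss keys (case-insensitive).
ITERATION_RE = re.compile(r"iteration\s+")
KEY_RE = re.compile(r"lm loss|reduced_train_loss", re.IGNORECASE)
VALUE_RE = re.compile(
    r"(?:lm loss|reduced_train_loss)\s*[:=]?\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)",
    re.IGNORECASE,
)

# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    " iteration       10/     100 | consumed samples: 80 | lm loss: 1.0512E+01 | grad norm: 2.1 |",
    "UserWarning: something unrelated",
    " iteration       20/     100 | consumed samples: 160 | lm loss: 9.8731E+00 | grad norm: 1.7 |",
]


def extract_losses(lines, limit=25):
    """Return up to `limit` (line, loss_or_None) pairs from an iterable of log lines."""
    results = []
    for line in lines:
        if not (ITERATION_RE.search(line) and KEY_RE.search(line)):
            continue
        match = VALUE_RE.search(line)
        value = float(match.group(1)) if match else None
        results.append((line.rstrip("\n"), value))
        if len(results) >= limit:
            break
    return results


if __name__ == "__main__":
    for line, loss in extract_losses(SAMPLE_LOG):
        print(loss, "<-", line.strip())
```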

Interactive GPU Allocation (`salloc` + `srun`)

For ad-hoc testing (inference, conversion debugging), always follow these 3 steps:

Step 1: Allocate the node

bash
salloc --account <YOUR_ACCOUNT> -N 1 \
  -J <YOUR_ACCOUNT>-debug \
  -p interactive --gpus-per-node=8 -t 240

Step 2: Launch container shell

bash
srun --mpi=pmix --no-kill \
  --container-image $CONTAINER_IMAGE \
  --container-mounts $CONTAINER_MOUNTS \
  --account <YOUR_ACCOUNT> -N 1 \
  -J <YOUR_ACCOUNT>-debug \
  --no-container-mount-home --gpus-per-node=8 \
  -p interactive --pty bash

Step 3: Set up environment inside container

bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
wandb login <YOUR_WANDB_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
uv sync
Then run commands with `uv run` (uses the synced virtualenv):
bash
uv run python -m torch.distributed.run --nproc_per_node=8 \
  examples/conversion/hf_to_megatron_generate_text.py \
  --hf_model_path <org>/<model> --prompt "What is AI?" --max_new_tokens 50 --ep 8
Pitfalls with interactive allocation:

| Error | Cause | Fix |
|---|---|---|
| `Cannot find GPU specification` | Missing `--gpus-per-node` | Always include `--gpus-per-node=8` in both `salloc` and `srun` |
| `invalid partition specified: pool0` | Wrong partition name | Use `interactive` for interactive, `batch` for sbatch. Check: `sinfo --summarize` |
| `Invalid account or account/partition combination` | Partition not available for account | Check combos: `sacctmgr -nP show assoc where user=$USER format=account,partition` |
| `Unable to create step for job... Requested node configuration is not available` | `-w <node>` conflicts with allocation | Remove the `-w` flag; the HF cache is on the shared filesystem, accessible from any node |
| `uv: command not found` inside container | Container doesn't have `uv` pre-installed | Use a container with `uv` pre-installed, or `pip install uv` |
| `No space left on device` during `uv` or `pip` | Container's `/root/.cache/` is full | Redirect: `export UV_CACHE_DIR=<SHARED_FS>/uv_cache` |
| `ModuleNotFoundError: No module named 'megatron.core.activations'` | Container's pre-installed megatron-core conflicts with local `3rdparty/Megatron-LM` | Install local: `pip install -e 3rdparty/Megatron-LM --no-deps --no-build-isolation` |

Debugging Multi-Node Failures

Quick Diagnosis

Check the log for these patterns (in order):

bash
# 1. Find the actual error (filter noise); -E is needed so that | means alternation
grep -aE 'Error|OOM|CUDA out of memory|FAILED|Killed' job.log \
  | grep -vE 'UserWarning|AllocatorConfig|transformer_engine|frame|srun: error'
# 2. Check which rank crashed first
grep -a 'Failures:' -A 20 job.log | head -25
# 3. Check for NCCL timeout
grep -aE 'ncclUniqueId|timeout|crash on rank 0' job.log | head -5
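The same three-step triage can be sketched in Python when the log needs to be processed programmatically. The patterns are copied from the grep commands in this section; `triage` is a hypothetical helper and the sample lines in the usage block are invented.

```python
import re

# Patterns taken from the grep commands in this section.
ERROR_RE = re.compile(r"Error|OOM|CUDA out of memory|FAILED|Killed")
NOISE_RE = re.compile(r"UserWarning|AllocatorConfig|transformer_engine|frame|srun: error")
NCCL_RE = re.compile(r"ncclUniqueId|timeout|crash on rank 0")


def triage(lines, nccl_limit=5):
    """Split log lines into likely real errors and NCCL timeout hints."""
    errors = [l for l in lines if ERROR_RE.search(l) and not NOISE_RE.search(l)]
    nccl = [l for l in lines if NCCL_RE.search(l)][:nccl_limit]
    return {"errors": errors, "nccl": nccl}


def triage_file(path):
    """Read a log leniently (like grep -a) and triage it."""
    with open(path, errors="replace") as handle:
        return triage(handle.read().splitlines())


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "UserWarning: SomeError raised inside transformer_engine",
        "torch.OutOfMemoryError: CUDA out of memory",
        "srun: error: node01: task 3: Killed",
        "[rank8] is setting up NCCL communicator and retrieving ncclUniqueId from [0]",
    ]
    print(triage(sample))
```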

Debugging Checklist

When a multi-node job fails:
  1. Check exit code: 1 = Python error, 9 = OOM killed (SIGKILL, which may also be reported as 137 or -9), 143 = SIGTERM (timeout or cascade)
  2. Find first failure: Which task/node crashed first? Others get SIGTERM (143) as cascade
  3. grep the actual error: Filter out UserWarnings, NCCL frame dumps
  4. Check rank 0 specifically: Most save/export errors happen on rank 0
  5. Verify EP sizing: For MoE models, ensure the `num_experts / EP` experts held by each GPU fit in GPU memory with headroom
  6. Try interactive first: Use `salloc -N 2 -p interactive` to iterate faster than the sbatch queue
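The exit-code step of the checklist can be sketched as a small lookup. `describe_exit_code` is a hypothetical helper; it assumes the three common encodings of "killed by signal N": the bare signal number, 128 + N (shell convention), and -N (Python `subprocess` convention).

```python
SIGKILL = 9
SIGTERM = 15


def describe_exit_code(code):
    """Map a raw exit status to the likely cause from the checklist."""
    # Normalise N, 128 + N, and -N to a signal number where applicable.
    if code < 0:
        sig = -code
    elif code > 128:
        sig = code - 128
    else:
        sig = code if code in (SIGKILL, SIGTERM) else None
    if sig == SIGKILL:
        return "SIGKILL: likely OOM-killed"
    if sig == SIGTERM:
        return "SIGTERM: timeout or cascade from another rank's failure"
    if code == 1:
        return "Python error: look for the first traceback"
    if code == 0:
        return "success"
    return "unclassified: inspect the log"


if __name__ == "__main__":
    for code in (0, 1, 9, 137, -9, 143):
        print(code, "->", describe_exit_code(code))
```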

NCCL Timeout at `dist.barrier()`: "crash on rank 0"

Symptom: All ranks on the second node and beyond (rank 8 and up with 8 GPUs per node) show:
text
[rank8] is setting up NCCL communicator and retrieving ncclUniqueId from [0]
... wait timeout after 600000ms
This may indicate a possible application crash on rank 0

Root causes (check in order):

| Cause | How to verify | Fix |
|---|---|---|
| `save_artifacts` hangs on rank 0 | Error is in `save_hf_weights` at `dist.barrier()` | Increase timeout: `init_process_group("nccl", timeout=timedelta(minutes=60))` |
| `ImportError` in custom model code | `grep ImportError job.log` | Catch `ImportError` in `save_artifacts` (see below) |
| Rank 0 OOM during export | `grep 'OutOfMemory' job.log` | Increase EP or nodes |
| Network issue between nodes | Error only on cross-node ranks | Check `sinfo`, try different nodes |

The `save_artifacts` problem: When `trust_remote_code=True`, rank 0 runs `save_artifacts()` (downloads tokenizer, config, custom modeling code) while all other ranks skip directly to `dist.barrier()`. If `save_artifacts` is slow or crashes, the other ranks time out.

Fix for ImportError in save_artifacts (`hf_pretrained/base.py`):
python
# Change:
except OSError: pass
# To:
except (OSError, ImportError): pass
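The effect of widening the `except` clause can be seen in isolation. The sketch below is not the code in `hf_pretrained/base.py`; `run_best_effort` and the two failing stand-in functions are hypothetical and only show that both error types are tolerated while anything else still propagates.

```python
def run_best_effort(step):
    """Run step(), tolerating OSError and ImportError.

    Returns True if the step succeeded and False if it was skipped.
    Any other exception still propagates so real bugs are not hidden.
    """
    try:
        step()
    except (OSError, ImportError) as exc:
        print(f"skipping optional step: {type(exc).__name__}: {exc}")
        return False
    return True


def broken_custom_code():
    # Stand-in for custom modeling code whose import fails on rank 0.
    raise ImportError("No module named 'some_remote_dependency'")


def missing_file():
    # Stand-in for a failed download; FileNotFoundError is a subclass of OSError.
    raise FileNotFoundError("config.json")


if __name__ == "__main__":
    print(run_best_effort(lambda: None))
    print(run_best_effort(broken_custom_code))
    print(run_best_effort(missing_file))
```

Because rank 0 returns instead of raising, it still reaches the barrier and the other ranks are not left waiting for the timeout.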

OOM for MoE Models

Symptom: `torch.OutOfMemoryError: CUDA out of memory` during model loading or forward pass.

Key insight: TP does NOT reduce expert memory. Only EP splits experts across GPUs.

Sizing formula:
text
experts_per_gpu = num_experts / EP
expert_memory_gb ≈ experts_per_gpu * expert_params * 2 / 1e9  (bf16)
total_per_gpu ≈ expert_memory_gb + attention_memory_gb + kv_cache_gb

MiniMax-M2 example (256 experts, ~230GB fp8 → ~460GB bf16):

| Config | Nodes | GPUs | Experts/GPU | Result |
|---|---|---|---|---|
| TP=2, EP=4 | 1 | 8 | 64 | OOM (too many experts) |
| TP=2, EP=8 | 2 | 16 | 32 | Works for roundtrip (weight-only), OOM for inference |
| TP=1, EP=16 | 2 | 16 | 16 | Works for inference |
| TP=2, EP=32 | 8 | 64 | 8 | Comfortable for training |

Rules of thumb:
  • Roundtrip (weight-only): can use more experts per GPU (~60GB model params OK)
  • Inference (forward pass + KV cache): needs headroom (~40GB model params max)
  • Training (activations + optimizer): needs even more headroom (~30GB model params max)
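The sizing formula and rules of thumb can be combined into a quick estimate. This is a rough sketch under a stated assumption: it treats the whole ~460 GB bf16 checkpoint as expert weights spread evenly over 256 experts, which overestimates expert memory slightly and ignores attention and KV cache. The helper names and the `BUDGET_GB` table are illustrative, not part of any library.

```python
def experts_per_gpu(num_experts, ep):
    """Experts held by each GPU; EP must divide the expert count evenly."""
    if num_experts % ep != 0:
        raise ValueError(f"EP={ep} does not divide num_experts={num_experts}")
    return num_experts // ep


def expert_memory_gb(num_experts, ep, total_expert_gb):
    """Approximate per-GPU expert weight memory, given the total across all experts."""
    return experts_per_gpu(num_experts, ep) * total_expert_gb / num_experts


# Per-GPU parameter budgets (GB) taken from the rules of thumb above.
BUDGET_GB = {"roundtrip": 60, "inference": 40, "training": 30}


def fits(workload, num_experts, ep, total_expert_gb):
    """True if the estimated expert memory is within the budget for the workload."""
    return expert_memory_gb(num_experts, ep, total_expert_gb) <= BUDGET_GB[workload]


if __name__ == "__main__":
    # Assumption: all ~460 GB (bf16) counted as expert weights.
    for ep in (4, 8, 16, 32):
        gb = expert_memory_gb(256, ep, 460)
        verdict = {w: fits(w, 256, ep, 460) for w in BUDGET_GB}
        print(f"EP={ep:>2}: {experts_per_gpu(256, ep):>2} experts/GPU, ~{gb:.1f} GB, {verdict}")
```

Under this assumption EP=8 lands just inside the roundtrip budget and outside the inference budget, which lines up with the MiniMax-M2 table.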

ModuleNotFoundError: No module named 'megatron.core.tensor_parallel'

Cause: Container's pre-installed megatron-core conflicts with local `3rdparty/Megatron-LM`.

Fix: Add `uv sync` before running:
bash
CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 10; fi && "
CMD="${CMD}uv run --no-sync python <script> <args>"

FP8 Weight Mismatch in Roundtrip

Symptom: Roundtrip completes but shows ❌ for all expert weights and raises `ValueError: Weight mismatch detected`.

Cause: Original HF weights are FP8, while Megatron stores them in BF16, so the exported weights are BF16. Comparing them against the original FP8 weights exceeds `atol=1e-1`.

This is expected for FP8 models. The conversion is correct; the comparison tolerance is too tight for the FP8→BF16 precision gap.

`WORLD_SIZE` Not Set with srun

Symptom: Script exits with "must be launched with torchrun".

Cause: Scripts check `os.environ.get("WORLD_SIZE")`, which torchrun sets but srun doesn't.

Fix: Also check `SLURM_NTASKS`:
python
import os
import sys
# Exit only when neither torchrun (WORLD_SIZE) nor srun (SLURM_NTASKS) launched the script
if os.environ.get("WORLD_SIZE") is None and os.environ.get("SLURM_NTASKS") is None:
    sys.exit(1)

Bridge's `common_utils.py` helpers (called by `initialize.py`) populate env vars from SLURM:
python
if "RANK" not in os.environ:
    os.environ["RANK"] = str(get_rank_safe())          # uses SLURM_PROCID
if "WORLD_SIZE" not in os.environ:
    os.environ["WORLD_SIZE"] = str(get_world_size_safe())  # uses SLURM_NTASKS
if "MASTER_ADDR" not in os.environ:
    os.environ["MASTER_ADDR"] = get_master_addr_safe()     # parses SLURM_NODELIST
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = str(get_master_port_safe()) # derives from SLURM_JOB_ID

Key Gotchas

  1. Two-phase srun for `uv sync`: Run a single-process srun first to warm the cache, then the full multi-node srun. The second `uv sync` is a fast no-op since everything is already cached on the shared filesystem.
  2. `--no-container-mount-home` is an `srun` flag, NOT an `#SBATCH` directive.
  3. Escaping inside TRAIN_CMD: Since `TRAIN_CMD` is a double-quoted string, escape the inner `$` for Slurm variables that must expand at runtime (not sbatch time):
    • `\${SLURM_PROCID}`, `\${SLURM_JOB_NUM_NODES}`, `\${SLURM_NODEID}`
    • Host-side variables like `$GH_TOKEN`, `$LOGDIR`, `$WORKDIR` expand at sbatch time, so no escaping is needed.
  4. Bridge `rm -rf nemo_experiments`: Add before training to avoid stale checkpoint auto-resume.
  5. MLM needs PYTHONPATH: For pretrain_gpt.py scripts, add inside TRAIN_CMD:
    bash
    PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \
  6. Node count heuristic: Total GPUs = `NNODES * 8`. Must satisfy `TP * PP * EP * DP = total_GPUs`, where `DP = total_GPUs / (TP * PP * EP)` is a whole number >= 1.
  7. `NEMO_HOME` on shared filesystem for multi-node SFT: The default nemo cache (`/root/.cache/nemo`) is container-local. Multi-node SFT with packed sequences prepares `.npy` files on one node that are invisible to the others. Set `export NEMO_HOME=<SHARED_FS>/cache/nemo` so packed data is shared. Without this, ranks on other nodes fail with `TypeError: 'NoneType' object is not an iterator`.
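The node count heuristic from the list above can be sketched as a small check. It takes the heuristic at face value (`DP = total_GPUs / (TP * PP * EP)` with 8 GPUs per node) and is not a statement about how Megatron lays out its process groups; `data_parallel_size` and `min_nodes` are hypothetical helpers.

```python
GPUS_PER_NODE = 8


def data_parallel_size(nnodes, tp=1, pp=1, ep=1, gpus_per_node=GPUS_PER_NODE):
    """Return DP = total_GPUs / (TP * PP * EP), or raise if the layout does not fit."""
    total_gpus = nnodes * gpus_per_node
    model_parallel = tp * pp * ep
    if model_parallel > total_gpus:
        raise ValueError(f"TP*PP*EP={model_parallel} needs more than {total_gpus} GPUs")
    if total_gpus % model_parallel != 0:
        raise ValueError(f"{total_gpus} GPUs is not a multiple of TP*PP*EP={model_parallel}")
    return total_gpus // model_parallel


def min_nodes(tp=1, pp=1, ep=1, gpus_per_node=GPUS_PER_NODE):
    """Smallest node count that gives DP >= 1 (ceiling division)."""
    return -(-(tp * pp * ep) // gpus_per_node)


if __name__ == "__main__":
    print(min_nodes(tp=2, ep=8), data_parallel_size(2, tp=2, ep=8))
    print(min_nodes(tp=2, ep=32), data_parallel_size(8, tp=2, ep=32))
```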

Full Templates and Command Bodies

For copyable sbatch scaffolding and Bridge/MLM-specific `TRAIN_CMD` bodies, read references/templates.md.