multi-node-slurm

Multi-Node Slurm

Convert single-node uv run python -m torch.distributed.run commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

Two Approaches: srun-native vs uv run torch.distributed

| Approach | ntasks-per-node | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | uv run python -m torch.distributed.run spawns 8 procs/node | MLM pretrain_gpt.py |

Prefer srun-native: it is simpler and avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT from SLURM env vars (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID, SLURM_NODELIST) via common_utils.py helpers called during initialize.py distributed init, so you never need to set them manually.
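As a rough illustration of the mapping described above, the sketch below derives the torch.distributed variables from a dictionary of Slurm variables. It is not Bridge's common_utils.py code: the function names, the simplified nodelist parsing, and the port formula are all assumptions made for illustration.

```python
import re


def first_host(nodelist: str) -> str:
    """Return the first hostname in a compact Slurm nodelist like 'gpu[003-006,010]'.

    Simplified on purpose: handles one bracket group or a plain comma list.
    On a real cluster, `scontrol show hostnames` is the robust way to expand it.
    """
    match = re.match(r"^([^\[,]+)(?:\[([^\]]+)\])?", nodelist)
    if match is None:
        raise ValueError(f"cannot parse nodelist: {nodelist!r}")
    prefix, ranges = match.group(1), match.group(2)
    if ranges is None:
        return prefix
    # Take the lower bound of the first range, e.g. '003' from '003-006,010'.
    first = ranges.split(",")[0].split("-")[0]
    return prefix + first


def derive_dist_env(slurm_env: dict) -> dict:
    """Map Slurm task variables onto the names torch.distributed expects."""
    job_id = int(slurm_env.get("SLURM_JOB_ID", "0"))
    return {
        "RANK": slurm_env["SLURM_PROCID"],
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"],
        "MASTER_ADDR": first_host(slurm_env["SLURM_NODELIST"]),
        # Hypothetical formula: any deterministic function of the job ID works,
        # as long as every rank computes the same port.
        "MASTER_PORT": str(15000 + job_id % 10000),
    }
```

Passing dict(os.environ) from inside an srun task would give the values for that task; every rank computes the same MASTER_ADDR and MASTER_PORT because they depend only on job-wide variables.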

Cluster Environment

Container

bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"

Standard Paths

bash
WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"

Tokens / Caches

bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
Important: NEMO_HOME must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs. The default (/root/.cache/nemo) is container-local and not shared across nodes. Without this, packed-sequence data files prepared on node 0 are invisible to other nodes, causing TypeError: 'NoneType' object is not an iterator.
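A quick pre-flight check can catch a cache variable that still points at a container-local path. The sketch below is only a path-prefix comparison under assumed names (vars_off_shared_fs and the /lustre/u root are hypothetical); it does not verify that the directory is actually mounted on every node.

```python
import os


def vars_off_shared_fs(env: dict, shared_root: str,
                       names=("HF_HOME", "UV_CACHE_DIR", "NEMO_HOME")) -> list:
    """Return the variable names that are unset or point outside shared_root."""
    root = os.path.abspath(shared_root)
    offenders = []
    for name in names:
        value = env.get(name)
        if not value:
            # Unset means the tool falls back to its container-local default.
            offenders.append(name)
            continue
        path = os.path.abspath(value)
        # commonpath equals root only when path lives at or below root.
        if os.path.commonpath([root, path]) != root:
            offenders.append(name)
    return offenders


if __name__ == "__main__":
    # '/lustre/u' is a placeholder for <SHARED_FS>.
    print(vars_off_shared_fs(dict(os.environ), "/lustre/u"))
```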

Log Directory

text
<SHARED_FS>/logs/<job_name>_<suffix>

srun-native Approach (Preferred)

Slurm spawns all processes directly: no torch.distributed.run and no TRAIN_CMD escaping.

SBATCH Headers

bash
#SBATCH --job-name=<model>-<task>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=8          # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive

Build and Launch

Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.
bash
# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync && uv run --no-sync python <script.py> <args>"
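If the sbatch script is generated programmatically, the two phases can be assembled as argument lists. The following is a minimal sketch that uses only the srun flags shown in this section; build_srun and two_phase are hypothetical helper names, not part of any library.

```python
import shlex


def build_srun(image: str, mounts: str, inner_cmd: str, warmup: bool = False) -> list:
    """Assemble an srun argument list using only the flags shown in this section."""
    args = ["srun", "--mpi=pmix"]
    if warmup:
        # Phase 1: one task on one node populates the shared uv cache.
        args += ["-N", "1", "--ntasks=1"]
    args += [
        f"--container-image={image}",
        f"--container-mounts={mounts}",
        "--no-container-mount-home",
        "bash", "-c", inner_cmd,
    ]
    return args


def two_phase(image: str, mounts: str, workdir: str, run_cmd: str) -> list:
    """Return the warm-up command followed by the full multi-node command."""
    sync = f"cd {shlex.quote(workdir)} && uv sync"
    return [
        build_srun(image, mounts, sync, warmup=True),
        build_srun(image, mounts, f"{sync} && uv run --no-sync {run_cmd}"),
    ]
```

Calling shlex.join on each returned list yields a shell-safe command line that can be written into the generated script.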

srun-native Key Points

  • Phase 1 runs uv sync once on a single node/process, building all wheels into the shared cache on Lustre.
  • Phase 2's uv sync is a fast no-op (everything is cached), so it is safe to run on all ranks without sleep guards.
  • initialize.py + common_utils.py auto-set RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT from SLURM env vars.
  • Env vars like HF_TOKEN, HF_HOME, UV_CACHE_DIR exported at sbatch level are inherited by srun tasks.
  • Reference: examples/models/glm/glm_45v/slurm_sft.sh, examples/models/minimax/minimax_m2/slurm_conversion.sh

uv run torch.distributed Approach (Legacy)

Use when the script requires torch.distributed.run (e.g., MLM pretrain_gpt.py) or when Bridge's initialize.py is not in the call path.

1. Add SBATCH Headers

bash
#SBATCH --job-name=<model>-<framework>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1          # ALWAYS 1 — torchrun handles per-node spawning
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
Critical: use --ntasks-per-node=1, NOT 8. uv run python -m torch.distributed.run --nproc_per_node=8 already spawns 8 processes per node, so ntasks-per-node=8 would start 8 launchers per node (8 tasks x 8 procs = 64 processes per node) and cause EADDRINUSE port collisions.
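The process-count arithmetic behind this warning can be checked before submitting. A minimal sketch, assuming 8 GPUs per node as in the headers above; both function names are hypothetical.

```python
def processes_per_node(ntasks_per_node: int, nproc_per_node: int) -> int:
    """Each Slurm task starts one launcher; each launcher spawns nproc_per_node workers."""
    return ntasks_per_node * nproc_per_node


def check_layout(ntasks_per_node: int, nproc_per_node: int, gpus_per_node: int = 8) -> None:
    """Raise if the launcher layout does not give exactly one process per GPU."""
    total = processes_per_node(ntasks_per_node, nproc_per_node)
    if total != gpus_per_node:
        raise ValueError(
            f"{ntasks_per_node} tasks x {nproc_per_node} procs = {total} processes "
            f"per node, expected {gpus_per_node}"
        )
```

With the legacy approach, check_layout(1, 8) passes, while check_layout(8, 8) raises because 64 processes would compete for 8 GPUs and the same ports.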

2. Convert to Multi-Node

Replace single-node:
bash
uv run python -m torch.distributed.run --nproc_per_node=8 \
  <script> <args>
With multi-node (inside TRAIN_CMD string):
bash
uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  <script> <args>
MASTER_ADDR and MASTER_PORT are auto-derived from SLURM env vars by initialize.py / common_utils.py, so there is no need to set them.

3. Wrap in TRAIN_CMD + two-phase srun

Use the same two-phase pattern: first a single-process srun to warm the uv cache, then the full run.
Environment exports go inside TRAIN_CMD (they must be set inside the container):
bash
TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<training command here>
"

4. Launch (two-phase)

bash
# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"

5. (Optional) Add Loss Extraction Footer

bash
echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25
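For post-processing outside the job script, the same filter can be expressed in Python. This sketch mirrors the grep pipeline above; the function name is hypothetical, and the exact log line layout depends on the framework, so only the two patterns from the footer are assumed.

```python
import re

# Same two patterns as the grep pipeline in the footer.
ITERATION = re.compile(r"iteration\s+")
LOSS = re.compile(r"lm loss|reduced_train_loss", re.IGNORECASE)


def extract_loss_lines(lines, limit: int = 25) -> list:
    """Return up to `limit` lines that match both patterns, in log order."""
    hits = []
    for line in lines:
        if ITERATION.search(line) and LOSS.search(line):
            hits.append(line.rstrip("\n"))
            if len(hits) == limit:
                break
    return hits
```

A log file can be passed directly, for example extract_loss_lines(open(path, errors="replace")), where errors="replace" tolerates stray binary bytes in the log.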

Interactive GPU Allocation (salloc + srun)
For ad-hoc testing (inference, conversion debugging), always follow these 3 steps:

Step 1: Allocate the node

bash
salloc --account <YOUR_ACCOUNT> -N 1 \
  -J <YOUR_ACCOUNT>-debug \
  -p interactive --gpus-per-node=8 -t 240

Step 2: Launch container shell

bash
srun --mpi=pmix --no-kill \
  --container-image $CONTAINER_IMAGE \
  --container-mounts $CONTAINER_MOUNTS \
  --account <YOUR_ACCOUNT> -N 1 \
  -J <YOUR_ACCOUNT>-debug \
  --no-container-mount-home --gpus-per-node=8 \
  -p interactive --pty bash

Step 3: Set up environment inside container

bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
wandb login <YOUR_WANDB_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
uv sync
Then run commands with uv run (uses the synced virtualenv):
bash
uv run python -m torch.distributed.run --nproc_per_node=8 \
  examples/conversion/hf_to_megatron_generate_text.py \
  --hf_model_path <org>/<model> --prompt "What is AI?" --max_new_tokens 50 --ep 8
Pitfalls with interactive allocation:
| Error | Cause | Fix |
|---|---|---|
| Cannot find GPU specification | Missing --gpus-per-node | Always include --gpus-per-node=8 in both salloc and srun |
| invalid partition specified: pool0 | Wrong partition name | Use interactive for interactive, batch for sbatch. Check: sinfo --summarize |
| Invalid account or account/partition combination | Partition not available for account | Check combos: sacctmgr -nP show assoc where user=$USER format=account,partition |
| Unable to create step for job... Requested node configuration is not available | -w <node> conflicts with allocation | Remove the -w flag; the HF cache is on the shared filesystem, accessible from any node |
| uv: command not found inside container | Container doesn't have uv pre-installed | Use a container with uv pre-installed, or pip install uv |
| No space left on device during uv or pip | Container's /root/.cache/ is full | Redirect: export UV_CACHE_DIR=<SHARED_FS>/uv_cache |
| ModuleNotFoundError: No module named 'megatron.core.activations' | Container's pre-installed megatron-core conflicts with local 3rdparty/Megatron-LM | Install local: pip install -e 3rdparty/Megatron-LM --no-deps --no-build-isolation |

Debugging Multi-Node Failures

Quick Diagnosis

Check the log for these patterns (in order):
bash
# 1. Find the actual error (filter noise)
grep -aE 'Error|OOM|CUDA out of memory|FAILED|Killed' job.log \
  | grep -vE 'UserWarning|AllocatorConfig|transformer_engine|frame|srun: error'

# 2. Check which rank crashed first
grep -a 'Failures:' -A 20 job.log | head -25

# 3. Check for NCCL timeout
grep -aE 'ncclUniqueId|timeout|crash on rank 0' job.log | head -5
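The first filter can also be applied in Python, for example when triaging many logs at once. A minimal sketch using the same include and exclude patterns as step 1; find_errors is a hypothetical name.

```python
import re

# Lines worth looking at, and known noise to drop, as in step 1 above.
ERRORS = re.compile(r"Error|OOM|CUDA out of memory|FAILED|Killed")
NOISE = re.compile(r"UserWarning|AllocatorConfig|transformer_engine|frame|srun: error")


def find_errors(lines) -> list:
    """Keep lines that match an error pattern and none of the noise patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if ERRORS.search(line) and not NOISE.search(line)
    ]
```

Opening the log with open(path, errors="replace") plays the role of grep -a, so binary bytes in the file do not abort the scan.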

Debugging Checklist

When a multi-node job fails:
  1. Check exit code: 1 = Python error, 9 (SIGKILL, also reported as 137 = 128 + 9) = OOM killed, 143 (128 + 15) = SIGTERM (timeout or cascade)
  2. Find first failure: Which task/node crashed first? Others get SIGTERM (143) as cascade
  3. grep the actual error: Filter out UserWarnings, NCCL frame dumps
  4. Check rank 0 specifically: Most save/export errors happen on rank 0
  5. Verify EP sizing: For MoE models, ensure num_experts / EP fits in GPU memory with headroom
  6. Try interactive first: Use salloc -N 2 -p interactive to iterate faster than sbatch queue
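Exit statuses for signal deaths appear in different forms depending on who reports them: shells use 128 + signal number, while Python's subprocess API uses a negative signal number. The sketch below decodes both forms with the standard signal module; describe_exit is a hypothetical helper, and the signal names assume a Linux host.

```python
import signal


def _signal_name(number: int) -> str:
    """Return the symbolic name for a signal number, or a generic label."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


def describe_exit(code: int) -> str:
    """Translate a process exit status into a short diagnosis hint."""
    if code == 0:
        return "success"
    if code < 0:
        # subprocess reports death-by-signal as a negative number.
        return f"killed by {_signal_name(-code)}"
    if code > 128:
        # Shells report death-by-signal as 128 + signal number.
        return f"killed by {_signal_name(code - 128)}"
    return "application error (check the Python traceback)"
```

A SIGKILL result usually points at the OOM killer, and a SIGTERM on most ranks usually means another rank failed first.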

NCCL Timeout at dist.barrier(): "crash on rank 0"

Symptom: All ranks on node 2+ show:
text
[rank8] is setting up NCCL communicator and retrieving ncclUniqueId from [0]
... wait timeout after 600000ms
This may indicate a possible application crash on rank 0
Root causes (check in order):
| Cause | How to verify | Fix |
|---|---|---|
| save_artifacts hangs on rank 0 | Error is in save_hf_weights, at dist.barrier() | Increase timeout: init_process_group("nccl", timeout=timedelta(minutes=60)) |
| ImportError in custom model code | grep ImportError job.log | Catch ImportError in save_artifacts (see below) |
| Rank 0 OOM during export | grep 'OutOfMemory' job.log | Increase EP or nodes |
| Network issue between nodes | Error only on cross-node ranks | Check sinfo, try different nodes |

The save_artifacts problem: When trust_remote_code=True, rank 0 runs save_artifacts() (downloads tokenizer, config, custom modeling code) while all other ranks skip directly to dist.barrier(). If save_artifacts is slow or crashes, the other ranks time out.
Fix for ImportError in save_artifacts (hf_pretrained/base.py):
python
# Change:
except OSError: pass
# To:
except (OSError, ImportError): pass

OOM for MoE Models

Symptom: torch.OutOfMemoryError: CUDA out of memory during model loading or forward pass.
Key insight: TP does NOT reduce expert memory. Only EP splits experts across GPUs.
Sizing formula:
text
experts_per_gpu = num_experts / EP
expert_memory_gb ≈ experts_per_gpu * expert_params * 2 / 1e9  (bf16)
total_per_gpu ≈ expert_memory_gb + attention_memory_gb + kv_cache_gb
MiniMax-M2 example (256 experts, ~230GB fp8 → ~460GB bf16):
| Config | Nodes | GPUs | Experts/GPU | Result |
|---|---|---|---|---|
| TP=2, EP=4 | 1 | 8 | 64 | OOM (too many experts) |
| TP=2, EP=8 | 2 | 16 | 32 | Works for roundtrip (weight-only), OOM for inference |
| TP=1, EP=16 | 2 | 16 | 16 | Works for inference |
| TP=2, EP=32 | 8 | 64 | 8 | Comfortable for training |

Rules of thumb:
  • Roundtrip (weight-only): can use more experts per GPU (~60GB model params OK)
  • Inference (forward pass + KV cache): needs headroom (~40GB model params max)
  • Training (activations + optimizer): needs even more headroom (~30GB model params max)
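The sizing formula is easy to script when comparing EP settings. A minimal sketch of the first two lines of the formula; the requirement that EP divides num_experts evenly is an assumption, and any expert_params value passed in is a placeholder to be replaced with the real per-expert parameter count of the model.

```python
def experts_per_gpu(num_experts: int, ep: int) -> int:
    """Number of experts each GPU holds under expert parallelism of size ep."""
    if num_experts % ep != 0:
        # Assumption: experts are split evenly across the EP group.
        raise ValueError("EP must divide num_experts evenly")
    return num_experts // ep


def expert_memory_gb(num_experts: int, ep: int, expert_params: float,
                     bytes_per_param: int = 2) -> float:
    """Approximate expert weight memory per GPU in GB (2 bytes per param for bf16)."""
    return experts_per_gpu(num_experts, ep) * expert_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for ep in (4, 8, 16, 32):
        print(ep, experts_per_gpu(256, ep))
```

For 256 experts this reproduces the Experts/GPU column of the table above; attention and KV cache memory still have to be added on top.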

ModuleNotFoundError: No module named 'megatron.core.tensor_parallel'

Cause: Container's pre-installed megatron-core conflicts with local 3rdparty/Megatron-LM.
Fix: Add uv sync before running:
bash
CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 10; fi && "
CMD="${CMD}uv run --no-sync python <script> <args>"
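To confirm which copy of a package Python would actually import, the standard importlib machinery can report the file location without running the training script. A small sketch; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name: str):
    """Return the file a module would be imported from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package (e.g. 'megatron') is missing or fails to import.
        return None
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print(module_origin("megatron.core"))
```

Run inside the container with uv run python, a path under the container's site-packages rather than 3rdparty/Megatron-LM would suggest the pre-installed copy is shadowing the local one. Namespace packages may report no file origin, so a None result is not conclusive on its own.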

FP8 Weight Mismatch in Roundtrip

Symptom: Roundtrip completes but shows ❌ for all expert weights and raises ValueError: Weight mismatch detected.
Cause: Original HF weights are FP8, Megatron stores in BF16. Exported weights are BF16. Comparison against original FP8 exceeds atol=1e-1.
This is expected for FP8 models. The conversion is correct; the comparison tolerance is insufficient for the FP8→BF16 precision gap.

WORLD_SIZE Not Set with srun

Symptom: Script exits with "must be launched with torchrun".
Cause: Scripts check os.environ.get("WORLD_SIZE"), which torchrun sets but srun doesn't.
Fix: Also check SLURM_NTASKS:
python
import os
import sys

if os.environ.get("WORLD_SIZE") is None and os.environ.get("SLURM_NTASKS") is None:
    sys.exit(1)
Bridge's common_utils.py helpers (called by initialize.py) populate env vars from SLURM:
python
if "RANK" not in os.environ:
    os.environ["RANK"] = str(get_rank_safe())          # uses SLURM_PROCID
if "WORLD_SIZE" not in os.environ:
    os.environ["WORLD_SIZE"] = str(get_world_size_safe())  # uses SLURM_NTASKS
if "MASTER_ADDR" not in os.environ:
    os.environ["MASTER_ADDR"] = get_master_addr_safe()     # parses SLURM_NODELIST
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = str(get_master_port_safe()) # derives from SLURM_JOB_ID

Key Gotchas

  1. Two-phase srun for uv sync: Run a single-process srun first to warm the cache, then the full multi-node srun. The second uv sync is a fast no-op since everything is already cached on the shared filesystem.
  2. --no-container-mount-home is an srun flag, NOT an #SBATCH directive.
  3. Escaping inside TRAIN_CMD: Since TRAIN_CMD is a double-quoted string, escape inner $ for Slurm variables that must expand at runtime (not sbatch time):
    • \${SLURM_PROCID}, \${SLURM_JOB_NUM_NODES}, \${SLURM_NODEID}
    • Host-side variables like $GH_TOKEN, $LOGDIR, $WORKDIR expand at sbatch time, so no escaping is needed.
  4. Bridge rm -rf nemo_experiments: Add before training to avoid stale checkpoint auto-resume.
  5. MLM needs PYTHONPATH: For pretrain_gpt.py scripts, add inside TRAIN_CMD:
    bash
    PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \
  6. Node count heuristic: Total GPUs = NNODES * 8. Must satisfy: TP * PP * EP * DP = total_GPUs, where DP = total_GPUs / (TP * PP * EP) is a whole number of at least 1.
  7. NEMO_HOME on shared filesystem for multi-node SFT: The default nemo cache (/root/.cache/nemo) is container-local. Multi-node SFT with packed sequences prepares .npy files on one node that are invisible to others. Set export NEMO_HOME=<SHARED_FS>/cache/nemo so packed data is shared. Without this, ranks on other nodes fail with TypeError: 'NoneType' object is not an iterator.
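The node count heuristic in item 6 can be validated before queueing a job. This sketch follows the document's formula for DP and assumes 8 GPUs per node; data_parallel_size is a hypothetical helper name.

```python
def data_parallel_size(nnodes: int, tp: int, pp: int, ep: int,
                       gpus_per_node: int = 8) -> int:
    """Return DP = total_GPUs / (TP * PP * EP), or raise if the layout does not fit."""
    total_gpus = nnodes * gpus_per_node
    model_parallel = tp * pp * ep
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"TP*PP*EP = {model_parallel} does not evenly divide {total_gpus} GPUs"
        )
    return total_gpus // model_parallel
```

For example, 2 nodes with TP=2, PP=1, EP=8 give DP=1, while the same parallelism on a single node raises because 16 GPUs would be required.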

Full Template

bash
#!/bin/bash
# ==============================================================================
# <MODEL_NAME> <pretrain|sft> - <Framework: MLM | Megatron Bridge>
#
# Default: TP<X> PP<Y> EP<Z>, NNODES=<N> (<N*8> GPUs), MBS=<M>, GBS=<G>
#
# Usage:
#   sbatch <script_name>.sh
# ==============================================================================

#SBATCH --job-name=<job-name>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive

# ── Container ────────────────────────────────────────────────────────────
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"

# ── Paths ────────────────────────────────────────────────────────────────
WORKDIR="/opt/Megatron-Bridge"
LOGDIR="<SHARED_FS>/logs/<logdir_name>"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"

# ── Parallelism ──────────────────────────────────────────────────────────
TP=1; PP=1; EP=1

# ── Training ─────────────────────────────────────────────────────────────
MBS=1; GBS=256
SEQ=4096
SEED=1234
TRAIN_ITERS=20

# ── Tokens / Caches ──────────────────────────────────────────────────────
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"

# ── Build training command ───────────────────────────────────────────────
TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
export NEMO_HOME=$NEMO_HOME && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<TRAINING_COMMAND_HERE>
"

echo "======================================"
echo "<MODEL_NAME> <Framework> Pretrain"
echo "Job: $SLURM_JOB_ID | Nodes: $SLURM_JOB_NUM_NODES"
echo "TP=$TP PP=$PP EP=$EP MBS=$MBS GBS=$GBS"
echo "======================================"

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"

echo ""
echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25

Bridge-Specific TRAIN_CMD Body

bash
rm -rf nemo_experiments && \
uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  scripts/training/run_recipe.py \
  --recipe <recipe_name> \
  model.tensor_model_parallel_size=$TP \
  model.pipeline_model_parallel_size=$PP \
  ...overrides...

MLM-Specific TRAIN_CMD Body

bash
PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \
uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  ...args...