Multi-Node Slurm

Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

Two Approaches: srun-native vs uv run torch.distributed

Approach	`ntasks-per-node`	Process spawning	Best for
srun-native (preferred)	8	Slurm spawns 8 tasks/node	Conversion, inference, Bridge scripts
uv run torch.distributed (legacy)	1	`uv run python -m torch.distributed.run` spawns 8 procs/node	MLM pretrain_gpt.py

Prefer srun-native: it is simpler and avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` from SLURM env vars (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during `initialize.py` distributed init, so you never need to set them manually.

Cluster Environment

Container

bash

CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"

Standard Paths

bash

WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"

Tokens / Caches

bash

export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"

Important: `NEMO_HOME` must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs. The default (`/root/.cache/nemo`) is container-local and not shared across nodes. Without this, packed-sequence data files prepared on node 0 are invisible to other nodes, causing `TypeError: 'NoneType' object is not an iterator`.
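
A pre-flight check can catch a container-local `NEMO_HOME` before the job is queued. The sketch below is only illustrative: the function name `nemo_home_problem` and the list of node-local roots are assumptions, not part of Megatron Bridge or NeMo.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: these directories live inside the container or on node-local disk.
CONTAINER_LOCAL_ROOTS = ("/root", "/tmp")

def nemo_home_problem(env: dict) -> Optional[str]:
    """Return a description of the problem, or None if NEMO_HOME looks shareable."""
    value = env.get("NEMO_HOME")
    if not value:
        return "NEMO_HOME is unset; the default /root/.cache/nemo is container-local"
    path = PurePosixPath(value)
    for root in CONTAINER_LOCAL_ROOTS:
        root_path = PurePosixPath(root)
        if path == root_path or root_path in path.parents:
            return f"NEMO_HOME={value} is under {root}, which is not shared across nodes"
    return None

print(nemo_home_problem({"NEMO_HOME": "/root/.cache/nemo"}))
print(nemo_home_problem({"NEMO_HOME": "/lustre/team/cache/nemo"}))
```

In a real script the argument would be `os.environ`; passing a plain dict keeps the check easy to try out. The `/lustre/team` path is a made-up stand-in for `<SHARED_FS>`.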

Log Directory

text

<SHARED_FS>/logs/<job_name>_<suffix>

srun-native Approach (Preferred)

Slurm spawns all processes directly. No `torch.distributed.run`, no TRAIN_CMD escaping.

SBATCH Headers

bash

#SBATCH --job-name=<model>-<task>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=8          # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive

Build and Launch

Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.

bash

# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync && uv run --no-sync python <script.py> <args>"

srun-native Key Points

- Phase 1 runs `uv sync` once on a single node/process, building all wheels into the shared cache on Lustre.
- Phase 2's `uv sync` is a fast no-op (everything is cached), so it is safe to run on all ranks without sleep guards.
- `initialize.py` and `common_utils.py` auto-set `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars.
- Env vars like `HF_TOKEN`, `HF_HOME`, `UV_CACHE_DIR` exported at sbatch level are inherited by srun tasks.
- Reference: `examples/models/glm/glm_45v/slurm_sft.sh`, `examples/models/minimax/minimax_m2/slurm_conversion.sh`

uv run torch.distributed Approach (Legacy)

Use when the script requires `torch.distributed.run` (e.g., MLM pretrain_gpt.py) or when Bridge's `initialize.py` is not in the call path.

1. Add SBATCH Headers

bash

#SBATCH --job-name=<model>-<framework>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1          # ALWAYS 1 — torchrun handles per-node spawning
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
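
The "ALWAYS 1" comment in the header follows from simple arithmetic: every Slurm task on a node starts its own launcher, and each launcher spawns `--nproc_per_node` workers. A minimal sketch of that arithmetic (the helper name `procs_per_node` is illustrative, not a Slurm or PyTorch API):

```python
def procs_per_node(ntasks_per_node: int, nproc_per_node: int = 1) -> int:
    """Worker processes started on one node.

    ntasks_per_node: value of the #SBATCH --ntasks-per-node directive.
    nproc_per_node:  workers spawned by each task's launcher
                     (1 when Slurm itself spawns the workers).
    """
    return ntasks_per_node * nproc_per_node

GPUS_PER_NODE = 8

# srun-native: Slurm starts 8 tasks, and each task is one worker.
assert procs_per_node(8) == GPUS_PER_NODE
# Legacy: one task per node, torch.distributed.run spawns 8 workers.
assert procs_per_node(1, nproc_per_node=8) == GPUS_PER_NODE
# Misconfigured legacy job: 8 launchers x 8 workers oversubscribes the node.
print(procs_per_node(8, nproc_per_node=8))  # 64
```

Only the two configurations that yield one worker per GPU are valid.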

Critical: `--ntasks-per-node=1`, NOT 8. `uv run python -m torch.distributed.run --nproc_per_node=8` spawns 8 processes per node. Using `ntasks-per-node=8` causes EADDRINUSE port collisions (8 tasks x 8 procs = 64 per node).

2. Convert to Multi-Node

Replace single-node:

bash

uv run python -m torch.distributed.run --nproc_per_node=8 \
  <script> <args>

With multi-node (inside `TRAIN_CMD` string):

bash

uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  <script> <args>

`MASTER_ADDR` and `MASTER_PORT` are auto-derived from SLURM env vars by `initialize.py` and `common_utils.py`, so there is no need to set them.

3. Wrap in TRAIN_CMD + two-phase srun

Use the same two-phase pattern: first a single-process srun to warm the uv cache, then the full run.

Environment exports go inside TRAIN_CMD (they must be set inside the container):

bash

TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<training command here>
"

4. Launch (two-phase)

bash

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"

5. (Optional) Add Loss Extraction Footer

bash

echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25
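
If the losses are needed as numbers rather than raw log lines, the same two filters can be expressed in Python. This is a sketch under an assumed log shape (` iteration N/ M | ... | lm loss: X |`, or `reduced_train_loss` followed by `:` or `=`); the regex and the name `extract_losses` are illustrative and may need adjusting to the actual log format.

```python
import re

# Assumed line shape, modelled on the grep patterns above:
#   " iteration       10/      20 | ... | lm loss: 1.234567E+01 | ..."
LOSS_RE = re.compile(
    r"iteration\s+(\d+)\s*/\s*\d+.*?(?:lm loss|reduced_train_loss)\s*[:=]\s*([0-9.eE+-]+)",
    re.IGNORECASE,
)

def extract_losses(log_text: str, limit: int = 25) -> list:
    """Return up to `limit` (iteration, loss) pairs in log order."""
    pairs = []
    for line in log_text.splitlines():
        match = LOSS_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
            if len(pairs) == limit:
                break
    return pairs

sample = (
    " iteration        1/      20 | consumed samples: 256 | lm loss: 1.083412E+01 |\n"
    "UserWarning: something noisy\n"
    " iteration        2/      20 | consumed samples: 512 | lm loss: 1.052077E+01 |\n"
)
print(extract_losses(sample))
```

The `limit` default mirrors the `head -25` in the shell footer.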

Interactive GPU Allocation (`salloc` + `srun`)

For ad-hoc testing (inference, conversion debugging), always follow these 3 steps:

Step 1: Allocate the node

bash

salloc --account <YOUR_ACCOUNT> -N 1 \
  -J <YOUR_ACCOUNT>-debug \
  -p interactive --gpus-per-node=8 -t 240

Step 2: Launch container shell

bash

srun --mpi=pmix --no-kill \
  --container-image $CONTAINER_IMAGE \
  --container-mounts $CONTAINER_MOUNTS \
  --account <YOUR_ACCOUNT> -N 1 \
  -J <YOUR_ACCOUNT>-debug \
  --no-container-mount-home --gpus-per-node=8 \
  -p interactive --pty bash

Step 3: Set up environment inside container

bash

export GH_TOKEN=<YOUR_GITHUB_TOKEN>
wandb login <YOUR_WANDB_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
uv sync

Then run commands with `uv run` (uses the synced virtualenv):

bash

uv run python -m torch.distributed.run --nproc_per_node=8 \
  examples/conversion/hf_to_megatron_generate_text.py \
  --hf_model_path <org>/<model> --prompt "What is AI?" --max_new_tokens 50 --ep 8

Pitfalls with interactive allocation:

Error	Cause	Fix
`Cannot find GPU specification`	Missing `--gpus-per-node`	Always include `--gpus-per-node=8` in both `salloc` and `srun`
`invalid partition specified: pool0`	Wrong partition name	Use `interactive` for interactive, `batch` for sbatch. Check: `sinfo --summarize`
`Invalid account or account/partition combination`	Partition not available for account	Check combos: `sacctmgr -nP show assoc where user=$USER format=account,partition`
`Unable to create step for job... Requested node configuration is not available`	`-w <node>` conflicts with allocation	Remove `-w` flag — HF cache is on shared filesystem, accessible from any node
`uv: command not found` inside container	Container doesn't have `uv` pre-installed	Use a container with `uv` pre-installed, or `pip install uv`
`No space left on device` during `uv` or `pip`	Container's `/root/.cache/` is full	Redirect: `export UV_CACHE_DIR=<SHARED_FS>/uv_cache`
`ModuleNotFoundError: No module named 'megatron.core.activations'`	Container's pre-installed megatron-core conflicts with local `3rdparty/Megatron-LM`	Install local: `pip install -e 3rdparty/Megatron-LM --no-deps --no-build-isolation`

Debugging Multi-Node Failures

Quick Diagnosis

Check the log for these patterns (in order):

bash

# 1. Find the actual error (filter noise)

# 2. Check which rank crashed first
grep -a 'Failures:' -A 20 job.log | head -25

# 3. Check for NCCL timeout (-E so that | means alternation)
grep -aE 'ncclUniqueId|timeout|crash on rank 0' job.log | head -5

Debugging Checklist

When a multi-node job fails:

- Check exit code: 1 = Python error, 9 = OOM killed, 143 = SIGTERM (timeout or cascade).
- Find first failure: Which task/node crashed first? The others get SIGTERM (143) as a cascade.
- grep the actual error: Filter out UserWarnings and NCCL frame dumps.
- Check rank 0 specifically: Most save/export errors happen on rank 0.
- Verify EP sizing: For MoE models, ensure `num_experts / EP` fits in GPU memory with headroom.
- Try interactive first: Use `salloc -N 2 -p interactive` to iterate faster than the sbatch queue.
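
The first two checklist items amount to separating root-cause exits from cascade SIGTERMs. A small sketch of that separation, assuming the per-rank exit codes have already been collected into a dict (how they are collected is left open; `likely_root_causes` is an illustrative name):

```python
CASCADE = 143  # SIGTERM: timeout, or killed after another rank failed

# Meanings as given in the checklist above.
MEANING = {1: "Python error", 9: "OOM killed", 143: "SIGTERM (timeout or cascade)"}

def likely_root_causes(exit_codes: dict) -> dict:
    """Map rank -> meaning for ranks that neither succeeded nor were cascade-killed."""
    return {
        rank: MEANING.get(code, f"exit code {code}")
        for rank, code in sorted(exit_codes.items())
        if code not in (0, CASCADE)
    }

print(likely_root_causes({0: 1, 1: 143, 2: 143, 3: 143}))
```

An empty result with every rank at 143 points to a job-level timeout rather than a single crashing rank.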

NCCL Timeout at `dist.barrier()`: "crash on rank 0"

Symptom: All ranks on node 2+ show:

text

[rank8] is setting up NCCL communicator and retrieving ncclUniqueId from [0]
... wait timeout after 600000ms
This may indicate a possible application crash on rank 0

Root causes (check in order):

Cause	How to verify	Fix
`save_artifacts` hangs on rank 0	Error is in `save_hf_weights` → `dist.barrier()`	Increase timeout: `init_process_group("nccl", timeout=timedelta(minutes=60))`
`ImportError` in custom model code	`grep ImportError job.log`	Catch `ImportError` in `save_artifacts` (see below)
Rank 0 OOM during export	`grep 'OutOfMemory' job.log`	Increase EP or nodes
Network issue between nodes	Error only on cross-node ranks	Check `sinfo` , try different nodes
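
The verification column can be turned into a rough triage helper. The sketch below is an assumption-laden illustration: it checks the most specific patterns first (which differs slightly from the table order, because an `ImportError` traceback may also pass through `save_artifacts`), and the noise markers are guesses.

```python
import re

# Patterns come from the "How to verify" column; most specific first.
ROOT_CAUSES = [
    (re.compile(r"ImportError"), "ImportError in custom model code"),
    (re.compile(r"OutOfMemory"), "Rank 0 OOM during export"),
    (re.compile(r"save_hf_weights|save_artifacts"), "save_artifacts hangs on rank 0"),
]
NOISE = re.compile(r"UserWarning|FutureWarning")  # assumed noise markers

def triage(log_text: str) -> str:
    """Return the first matching root cause, ignoring noisy lines."""
    lines = [ln for ln in log_text.splitlines() if not NOISE.search(ln)]
    for pattern, cause in ROOT_CAUSES:
        if any(pattern.search(ln) for ln in lines):
            return cause
    if any("ncclUniqueId" in ln for ln in lines):
        return "NCCL timeout without a rank-0 error: check the network or try different nodes"
    return "no known pattern"

print(triage("UserWarning: deprecated\nImportError: cannot import name 'X'\n"))
```

A helper like this only narrows the search; the traceback on rank 0 is still the authoritative evidence.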

The `save_artifacts` problem: When `trust_remote_code=True`, rank 0 runs `save_artifacts()` (downloads tokenizer, config, custom modeling code) while all other ranks skip directly to `dist.barrier()`. If `save_artifacts` is slow or crashes, the other ranks time out.

Fix for ImportError in save_artifacts (`hf_pretrained/base.py`):

python

# Change:
except OSError:
    pass

# To:
except (OSError, ImportError):
    pass

OOM for MoE Models

Symptom: `torch.OutOfMemoryError: CUDA out of memory` during model loading or forward pass.

Key insight: TP does NOT reduce expert memory. Only EP splits experts across GPUs.

Sizing formula:

text

experts_per_gpu = num_experts / EP
expert_memory_gb ≈ experts_per_gpu * expert_params * 2 / 1e9  (bf16)
total_per_gpu ≈ expert_memory_gb + attention_memory_gb + kv_cache_gb

MiniMax-M2 example (256 experts, ~230GB fp8 → ~460GB bf16):

Config	Nodes	GPUs	Experts/GPU	Result
TP=2, EP=4	1	8	64	OOM (too many experts)
TP=2, EP=8	2	16	32	Works for roundtrip (weight-only), OOM for inference
TP=1, EP=16	2	16	16	Works for inference
TP=2, EP=32	8	64	8	Comfortable for training

Rules of thumb:

Roundtrip (weight-only): can use more experts per GPU (~60GB model params OK)
Inference (forward pass + KV cache): needs headroom (~40GB model params max)
Training (activations + optimizer): needs even more headroom (~30GB model params max)
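
The sizing formula and the rules of thumb can be combined into a quick estimate. The sketch below covers only the expert term of the formula; the per-expert parameter count is an assumption made for illustration (it treats the whole ~460 GB bf16 checkpoint as expert weights), and the helper names are hypothetical.

```python
# Per-GPU parameter budgets from the rules of thumb above (GB, approximate).
BUDGET_GB = {"roundtrip": 60.0, "inference": 40.0, "training": 30.0}

def expert_memory_gb(num_experts: int, ep: int, expert_params: float,
                     bytes_per_param: int = 2) -> float:
    """Expert weights held by one GPU, in GB (bf16 = 2 bytes per parameter)."""
    if num_experts % ep != 0:
        raise ValueError("EP must divide num_experts evenly")
    experts_per_gpu = num_experts // ep
    return experts_per_gpu * expert_params * bytes_per_param / 1e9

def fits(workload: str, memory_gb: float) -> bool:
    """True if the estimate is within the rule-of-thumb budget for the workload."""
    return memory_gb <= BUDGET_GB[workload]

# Assumption: ~230e9 parameters spread evenly over 256 experts.
PARAMS_PER_EXPERT = 230e9 / 256

for ep in (4, 8, 16, 32):
    gb = expert_memory_gb(256, ep, PARAMS_PER_EXPERT)
    verdict = [w for w in BUDGET_GB if fits(w, gb)]
    print(f"EP={ep:<2} experts/GPU={256 // ep:<2} ~{gb:.1f} GB fits: {verdict or 'nothing'}")
```

Because `attention_memory_gb` and `kv_cache_gb` still come on top, a configuration that barely fits a budget here should be treated as too tight.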

ModuleNotFoundError: No module named 'megatron.core.tensor_parallel'

Cause: Container's pre-installed megatron-core conflicts with local `3rdparty/Megatron-LM`.

Fix: Add `uv sync` before running:

bash

CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 10; fi && "
CMD="${CMD}uv run --no-sync python <script> <args>"

FP8 Weight Mismatch in Roundtrip

Symptom: Roundtrip completes but shows ❌ for all expert weights and raises `ValueError: Weight mismatch detected`.

Cause: Original HF weights are FP8, while Megatron stores them in BF16, so the exported weights are BF16. Comparing them against the original FP8 weights exceeds `atol=1e-1`.

This is expected for FP8 models. The conversion is correct; the comparison tolerance is insufficient for the FP8→BF16 precision gap.

`WORLD_SIZE` Not Set with srun

Symptom: Script exits with "must be launched with torchrun".

Cause: Scripts check `os.environ.get("WORLD_SIZE")`, which torchrun sets but srun doesn't.

Fix: Also check `SLURM_NTASKS`:

python

if os.environ.get("WORLD_SIZE") is None and os.environ.get("SLURM_NTASKS") is None:
    sys.exit(1)

Bridge's `common_utils.py` helpers (called by `initialize.py`) populate env vars from SLURM:

python

if "RANK" not in os.environ:
    os.environ["RANK"] = str(get_rank_safe())          # uses SLURM_PROCID
if "WORLD_SIZE" not in os.environ:
    os.environ["WORLD_SIZE"] = str(get_world_size_safe())  # uses SLURM_NTASKS
if "MASTER_ADDR" not in os.environ:
    os.environ["MASTER_ADDR"] = get_master_addr_safe()     # parses SLURM_NODELIST
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = str(get_master_port_safe()) # derives from SLURM_JOB_ID
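
To make the mapping concrete, here is a self-contained sketch of how such helpers might derive the values. It is not Bridge's actual implementation: the nodelist parser handles only simple forms such as `gpu[003-006,010]`, and the port formula is an assumed example of "derives from SLURM_JOB_ID".

```python
import re

def first_host(nodelist: str) -> str:
    """First hostname of a compact Slurm nodelist (simple bracket forms only)."""
    match = re.match(r"^([^\[,]+)(?:\[([^\]]+)\])?", nodelist)
    prefix, group = match.group(1), match.group(2)
    if group is None:
        return prefix
    first_index = group.split(",")[0].split("-")[0]
    return prefix + first_index

def derive_dist_env(slurm: dict) -> dict:
    """Build torch.distributed-style variables from Slurm's per-task variables."""
    job_id = int(slurm["SLURM_JOB_ID"])
    return {
        "RANK": slurm["SLURM_PROCID"],
        "WORLD_SIZE": slurm["SLURM_NTASKS"],
        "LOCAL_RANK": slurm["SLURM_LOCALID"],
        "MASTER_ADDR": first_host(slurm["SLURM_NODELIST"]),
        # Assumed formula: a stable port in 15000-19999 shared by all ranks of a job.
        "MASTER_PORT": str(15000 + job_id % 5000),
    }

print(derive_dist_env({
    "SLURM_JOB_ID": "12345", "SLURM_PROCID": "9", "SLURM_NTASKS": "16",
    "SLURM_LOCALID": "1", "SLURM_NODELIST": "gpu[003-004]",
}))
```

For production use, `scontrol show hostnames "$SLURM_NODELIST"` is the robust way to expand a nodelist; the regex above is only for illustration.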

Key Gotchas

- Two-phase srun for `uv sync`: Run a single-process srun first to warm the cache, then the full multi-node srun. The second `uv sync` is a fast no-op since everything is already cached on the shared filesystem.
- `--no-container-mount-home` is an `srun` flag, NOT an `#SBATCH` directive.
- Escaping inside TRAIN_CMD: Since `TRAIN_CMD` is a double-quoted string, escape the inner `$` for Slurm variables that must expand at runtime (not sbatch time):
  - `\${SLURM_PROCID}`, `\${SLURM_JOB_NUM_NODES}`, `\${SLURM_NODEID}`
  - Host-side variables like `$GH_TOKEN`, `$LOGDIR`, `$WORKDIR` expand at sbatch time, so no escaping is needed.
- Bridge `rm -rf nemo_experiments`: Add before training to avoid stale checkpoint auto-resume.
- MLM needs PYTHONPATH: For pretrain_gpt.py scripts, add inside TRAIN_CMD: `PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \`
- Node count heuristic: Total GPUs = `NNODES * 8`. Must satisfy `TP * PP * EP * DP >= total_GPUs`, where `DP = total_GPUs / (TP * PP * EP)`.
- `NEMO_HOME` on shared filesystem for multi-node SFT: The default nemo cache (`/root/.cache/nemo`) is container-local. Multi-node SFT with packed sequences prepares `.npy` files on one node that are invisible to others. Set `export NEMO_HOME=<SHARED_FS>/cache/nemo` so packed data is shared. Without this, ranks on other nodes fail with `TypeError: 'NoneType' object is not an iterator`.
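
The node count heuristic can be checked before submitting a job. A minimal sketch, using the document's own formula `DP = total_GPUs / (TP * PP * EP)` and rejecting layouts where the division is not exact (the function name `data_parallel_size` is illustrative):

```python
def data_parallel_size(nnodes: int, tp: int, pp: int, ep: int,
                       gpus_per_node: int = 8) -> int:
    """Return DP for the given layout, or raise if the GPUs cannot be split evenly."""
    total_gpus = nnodes * gpus_per_node
    model_parallel = tp * pp * ep
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"TP*PP*EP={model_parallel} does not evenly divide {total_gpus} GPUs"
        )
    return total_gpus // model_parallel

# Layouts from the MiniMax-M2 table (PP=1 throughout).
print(data_parallel_size(nnodes=2, tp=2, pp=1, ep=8))   # 16 GPUs
print(data_parallel_size(nnodes=8, tp=2, pp=1, ep=32))  # 64 GPUs
```

A layout whose `TP * PP * EP` exceeds the GPU count also fails the divisibility check, so one test covers both under-provisioned and uneven configurations.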

Full Template

bash

#!/bin/bash
# ==============================================================================
# <MODEL_NAME> <pretrain|sft> - <Framework: MLM | Megatron Bridge>
#
# Default: TP<X> PP<Y> EP<Z>, NNODES=<N> (<N*8> GPUs), MBS=<M>, GBS=<G>
#
# Usage:
#   sbatch <script_name>.sh
# ==============================================================================

#SBATCH --job-name=<job-name>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive

# ── Container ────────────────────────────────────────────────────────────
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"

# ── Paths ────────────────────────────────────────────────────────────────
WORKDIR="/opt/Megatron-Bridge"
LOGDIR="<SHARED_FS>/logs/<logdir_name>"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"

# ── Parallelism ──────────────────────────────────────────────────────────
TP=1; PP=1; EP=1

# ── Training ─────────────────────────────────────────────────────────────
MBS=1; GBS=256
SEQ=4096
SEED=1234
TRAIN_ITERS=20

# ── Tokens / Caches ──────────────────────────────────────────────────────
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"

# ── Build training command ───────────────────────────────────────────────
TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
export NEMO_HOME=$NEMO_HOME && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<TRAINING_COMMAND_HERE>
"

echo "======================================"
echo "<MODEL_NAME> <Framework> Pretrain"
echo "Job: $SLURM_JOB_ID | Nodes: $SLURM_JOB_NUM_NODES"
echo "TP=$TP PP=$PP EP=$EP MBS=$MBS GBS=$GBS"
echo "======================================"

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"

echo ""
echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25

Bridge-Specific TRAIN_CMD Body

bash

rm -rf nemo_experiments && \
uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  scripts/training/run_recipe.py \
  --recipe <recipe_name> \
  model.tensor_model_parallel_size=$TP \
  model.pipeline_model_parallel_size=$PP \
  ...overrides...
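
The body above mixes two kinds of variables: `$TP` is unescaped and expands when the double-quoted `TRAIN_CMD` string is built, while `\${SLURM_NODEID}` is escaped and expands only when the inner shell runs. The sketch below reproduces that timing with a tiny stand-in script driven from Python; it assumes `bash` is on the PATH, and the values `TP=2` and `SLURM_NODEID=3` are made up for the demonstration.

```python
import subprocess

# The outer shell plays the sbatch script; the inner `bash -c` plays the srun task.
OUTER_SCRIPT = r'''
TP=2
CMD="echo tp=$TP node_rank=\${SLURM_NODEID}"
SLURM_NODEID=3 bash -c "$CMD"
'''

def run_outer() -> str:
    """Run the stand-in script and return what the inner shell printed."""
    result = subprocess.run(
        ["bash", "-c", OUTER_SCRIPT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(run_outer())  # tp=2 node_rank=3
```

Without the backslash, `${SLURM_NODEID}` would be expanded by the outer shell, which in a real job runs only once on the batch host, so every node would receive the same value.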

MLM-Specific TRAIN_CMD Body

bash

PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \
uv run python -m torch.distributed.run \
  --nproc_per_node=8 \
  --nnodes=\${SLURM_JOB_NUM_NODES} \
  --node_rank=\${SLURM_NODEID} \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  ...args...
