multi-node-slurm
Multi-Node Slurm
Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.
Two Approaches: srun-native vs uv run torch.distributed
| Approach | `--ntasks-per-node` | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | `torch.distributed.run` spawns 8 processes/node | MLM pretrain_gpt.py |
Prefer srun-native: it is simpler and avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` from SLURM env vars (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during `initialize.py` distributed init, so you never need to set them manually.
Cluster Environment
Container
```bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"
```
Standard Paths
```bash
WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"
```
Tokens / Caches
```bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
```
Important: `NEMO_HOME` must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs. The default (`/root/.cache/nemo`) is container-local and not shared across nodes. Without this, packed-sequence data files prepared on node 0 are invisible to other nodes, causing `TypeError: 'NoneType' object is not an iterator`.
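A pre-flight check can catch a container-local `NEMO_HOME` before a multi-node job is queued. The sketch below is only an illustration: the function name `nemo_home_problem` is hypothetical, and it can only detect the unset or default case, not prove that a path is actually on a shared filesystem.

```python
import os

# Container-local default cache path named in the note above.
CONTAINER_LOCAL_DEFAULT = "/root/.cache/nemo"


def nemo_home_problem(env):
    """Return a warning string if NEMO_HOME is unset or the default, else None.

    Hypothetical helper: it does not verify that the path is on a shared
    filesystem, only that it is not the known container-local default.
    """
    nemo_home = env.get("NEMO_HOME")
    if not nemo_home:
        return (
            "NEMO_HOME is unset; the default "
            f"{CONTAINER_LOCAL_DEFAULT} is not shared across nodes"
        )
    if os.path.normpath(nemo_home) == CONTAINER_LOCAL_DEFAULT:
        return f"NEMO_HOME points at the container-local default {CONTAINER_LOCAL_DEFAULT}"
    return None


if __name__ == "__main__":
    problem = nemo_home_problem(os.environ)
    print(problem or "NEMO_HOME is set to " + os.environ["NEMO_HOME"])
```

Running this at the top of a submission wrapper gives an early warning instead of a `TypeError` deep inside data loading.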
Log Directory
```text
<SHARED_FS>/logs/<job_name>_<suffix>
```
srun-native Approach (Preferred)
Slurm spawns all processes directly. No `torch.distributed.run`, no TRAIN_CMD escaping.
SBATCH Headers
```bash
#SBATCH --job-name=<model>-<task>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=8          # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
```
Build and Launch
Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.
```bash
# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "cd $WORKDIR && uv sync && uv run --no-sync python <script.py> <args>"
```
srun-native Key Points
- Phase 1 `uv sync` runs once on a single node/process, building all wheels into the shared cache on Lustre
- Phase 2's `uv sync` is a fast no-op (everything is cached), so it is safe to run on all ranks without sleep guards
- `initialize.py` + `common_utils.py` auto-set `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars
- Env vars like `HF_TOKEN`, `HF_HOME`, `UV_CACHE_DIR` exported at sbatch level are inherited by srun tasks
- Reference: `examples/models/glm/glm_45v/slurm_sft.sh`, `examples/models/minimax/minimax_m2/slurm_conversion.sh`
uv run torch.distributed Approach (Legacy)
Use when the script requires `torch.distributed.run` (e.g., MLM pretrain_gpt.py) or when Bridge's `initialize.py` is not in the call path.
1. Add SBATCH Headers
```bash
#SBATCH --job-name=<model>-<framework>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1          # ALWAYS 1: torchrun handles per-node spawning
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
```
Critical: `--ntasks-per-node=1`, NOT 8. `uv run python -m torch.distributed.run --nproc_per_node=8` spawns 8 processes per node. Using `ntasks-per-node=8` causes EADDRINUSE port collisions (8 tasks x 8 procs = 64 per node).
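The arithmetic behind that warning can be written as a small sanity check. This is a sketch with hypothetical function names, and it assumes the usual convention of one worker process per GPU.

```python
def processes_per_node(ntasks_per_node, nproc_per_node):
    """Each Slurm task starts its own launcher, which spawns nproc_per_node workers."""
    return ntasks_per_node * nproc_per_node


def check_launch_config(ntasks_per_node, nproc_per_node, gpus_per_node=8):
    """Raise if the combined settings do not give one process per GPU (assumed convention)."""
    total = processes_per_node(ntasks_per_node, nproc_per_node)
    if total != gpus_per_node:
        raise ValueError(
            f"{ntasks_per_node} tasks x {nproc_per_node} procs = {total} processes "
            f"per node, but the node has {gpus_per_node} GPUs"
        )
    return total


if __name__ == "__main__":
    print(check_launch_config(1, 8))   # legacy approach: 1 task, launcher spawns 8
    print(processes_per_node(8, 8))    # the misconfiguration described above
```

With `ntasks_per_node=8` and `nproc_per_node=8` the check raises, which mirrors the 64-process collision case.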
2. Convert to Multi-Node
Replace single-node:
```bash
uv run python -m torch.distributed.run --nproc_per_node=8 \
    <script> <args>
```
With multi-node (inside the `TRAIN_CMD` string):
```bash
uv run python -m torch.distributed.run \
    --nproc_per_node=8 \
    --nnodes=\${SLURM_JOB_NUM_NODES} \
    --node_rank=\${SLURM_NODEID} \
    <script> <args>
```
`MASTER_ADDR` and `MASTER_PORT` are not set on this command line; in Bridge, `initialize.py` and `common_utils.py` derive them from SLURM env vars.
3. Wrap in TRAIN_CMD + two-phase srun
Use the same two-phase pattern: first a single-process srun to warm the uv cache, then the full run.
Environment exports go inside TRAIN_CMD (they must be set inside the container):
```bash
TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<training command here>
"
```
4. Launch (two-phase)
```bash
# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"
```
5. (Optional) Add Loss Extraction Footer
```bash
echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25
```
Interactive GPU Allocation (`salloc` + `srun`)
For ad-hoc testing (inference, conversion debugging), always follow these 3 steps:
Step 1: Allocate the node
```bash
salloc --account <YOUR_ACCOUNT> -N 1 \
    -J <YOUR_ACCOUNT>-debug \
    -p interactive --gpus-per-node=8 -t 240
```
Step 2: Launch container shell
```bash
srun --mpi=pmix --no-kill \
    --container-image $CONTAINER_IMAGE \
    --container-mounts $CONTAINER_MOUNTS \
    --account <YOUR_ACCOUNT> -N 1 \
    -J <YOUR_ACCOUNT>-debug \
    --no-container-mount-home --gpus-per-node=8 \
    -p interactive --pty bash
```
Step 3: Set up environment inside container
```bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
wandb login <YOUR_WANDB_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
uv sync
```
Then run commands with `uv run` (uses the synced virtualenv):
```bash
uv run python -m torch.distributed.run --nproc_per_node=8 \
    examples/conversion/hf_to_megatron_generate_text.py \
    --hf_model_path <org>/<model> --prompt "What is AI?" --max_new_tokens 50 --ep 8
```
Pitfalls with interactive allocation:
| Error | Cause | Fix |
|---|---|---|
| | Missing | Always include |
| | Wrong partition name | Use |
| | Partition not available for account | Check combos: |
| | | Remove |
| | Container doesn't have | Use a container with |
| | Container's | Redirect: |
| `ModuleNotFoundError: No module named 'megatron.core.tensor_parallel'` | Container's pre-installed megatron-core conflicts with local `3rdparty/Megatron-LM` | Install local: |
Debugging Multi-Node Failures
Quick Diagnosis
Check the log for these patterns (in order):
```bash
# 1. Find the actual error (filter noise)
grep -aE 'Error|OOM|CUDA out of memory|FAILED|Killed' job.log \
    | grep -vE 'UserWarning|AllocatorConfig|transformer_engine|frame|srun: error'

# 2. Check which rank crashed first
grep -a 'Failures:' -A 20 job.log | head -25

# 3. Check for NCCL timeout
grep -aE 'ncclUniqueId|timeout|crash on rank 0' job.log | head -5
```
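The first filter can also be run from Python, which is convenient when scanning many job logs at once. The sketch below reuses the same two patterns; the function names are hypothetical, and `errors="replace"` stands in for `grep -a` so stray binary bytes do not abort the read.

```python
import re

# Same alternations as the grep pipeline: lines to keep, then noise to drop.
ERROR_RE = re.compile(r"Error|OOM|CUDA out of memory|FAILED|Killed")
NOISE_RE = re.compile(r"UserWarning|AllocatorConfig|transformer_engine|frame|srun: error")


def real_errors(lines):
    """Return lines that match an error pattern and no noise pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if ERROR_RE.search(line) and not NOISE_RE.search(line)
    ]


def read_log(path):
    """Read a log as text, replacing undecodable bytes instead of failing."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()


if __name__ == "__main__":
    sample = [
        "UserWarning: SomeError ignored\n",
        "RuntimeError: CUDA out of memory\n",
        "srun: error: node2: task 9: Killed\n",
    ]
    for line in real_errors(sample):
        print(line)
```

Call `real_errors(read_log("job.log"))` to apply it to an actual log file.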
Debugging Checklist
When a multi-node job fails:
- Check exit code: 1 = Python error, 9 = OOM killed, 143 = SIGTERM (timeout or cascade)
- Find first failure: Which task/node crashed first? Others get SIGTERM (143) as cascade
- grep the actual error: Filter out UserWarnings, NCCL frame dumps
- Check rank 0 specifically: Most save/export errors happen on rank 0
- Verify EP sizing: For MoE models, ensure `num_experts / EP` fits in GPU memory with headroom
- Try interactive first: Use `salloc -N 2 -p interactive` to iterate faster than the sbatch queue
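The first two checklist items can be combined into a small triage helper. This is a sketch with hypothetical names; the code-to-meaning table is taken directly from the checklist, and the input is assumed to be a mapping from task id to exit code that you have collected yourself.

```python
# Meanings copied from the checklist above.
EXIT_CODE_HINTS = {
    1: "Python error",
    9: "OOM killed",
    143: "SIGTERM (timeout or cascade)",
}


def describe_exit_code(code):
    """Translate an exit code into the checklist's description."""
    return EXIT_CODE_HINTS.get(code, "unknown; inspect the log")


def likely_root_causes(exit_codes_by_task):
    """Drop clean exits and 143s, since 143 is usually a cascade victim."""
    return {
        task: code
        for task, code in exit_codes_by_task.items()
        if code not in (0, 143)
    }


if __name__ == "__main__":
    codes = {0: 9, 1: 143, 2: 143}
    for task, code in likely_root_causes(codes).items():
        print(f"task {task}: exit {code} ({describe_exit_code(code)})")
```

If the result is empty and every task exited with 143, the job most likely hit its time limit rather than crashing.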
NCCL Timeout at `dist.barrier()`: "crash on rank 0"
Symptom: All ranks on node 2+ show:
```text
[rank8] is setting up NCCL communicator and retrieving ncclUniqueId from [0]
... wait timeout after 600000ms
This may indicate a possible application crash on rank 0
```
Root causes (check in order):
| Cause | How to verify | Fix |
|---|---|---|
| | Error is in | Increase timeout: |
| `ImportError` in custom modeling code | | Catch `ImportError` in `save_artifacts` |
| Rank 0 OOM during export | | Increase EP or nodes |
| Network issue between nodes | Error only on cross-node ranks | Check |

The problem: When `trust_remote_code=True`, rank 0 runs `save_artifacts()` (downloads tokenizer, config, custom modeling code) while all other ranks skip directly to `dist.barrier()`. If `save_artifacts` is slow or crashes, the other ranks time out.
Fix for ImportError in save_artifacts (`hf_pretrained/base.py`):
```python
# Change:
except OSError:
    pass

# To:
except (OSError, ImportError):
    pass
```
OOM for MoE Models
Symptom: `torch.OutOfMemoryError: CUDA out of memory` during model loading or forward pass.
Key insight: TP does NOT reduce expert memory. Only EP splits experts across GPUs.
Sizing formula:
```text
experts_per_gpu  = num_experts / EP
expert_memory_gb ≈ experts_per_gpu * expert_params * 2 / 1e9   (bf16)
total_per_gpu    ≈ expert_memory_gb + attention_memory_gb + kv_cache_gb
```
MiniMax-M2 example (256 experts, ~230GB fp8 → ~460GB bf16):
| Config | Nodes | GPUs | Experts/GPU | Result |
|---|---|---|---|---|
| TP=2, EP=4 | 1 | 8 | 64 | OOM (too many experts) |
| TP=2, EP=8 | 2 | 16 | 32 | Works for roundtrip (weight-only), OOM for inference |
| TP=1, EP=16 | 2 | 16 | 16 | Works for inference |
| TP=2, EP=32 | 8 | 64 | 8 | Comfortable for training |

Rules of thumb:
- Roundtrip (weight-only): can use more experts per GPU (~60GB model params OK)
- Inference (forward pass + KV cache): needs headroom (~40GB model params max)
- Training (activations + optimizer): needs even more headroom (~30GB model params max)
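The sizing formula and rules of thumb can be turned into a quick estimate. The sketch below makes a simplifying assumption that is not in the original: it treats the whole ~460GB bf16 checkpoint as expert weights, so the numbers are an upper bound on expert memory and ignore attention and KV cache. Function names and the `BUDGET_GB` table (built from the rules of thumb) are illustrative.

```python
# Rough per-GPU model-parameter budgets, taken from the rules of thumb above.
BUDGET_GB = {"roundtrip": 60, "inference": 40, "training": 30}


def experts_per_gpu(num_experts, ep):
    """Number of experts each GPU holds under expert parallelism EP."""
    if num_experts % ep != 0:
        raise ValueError(f"num_experts={num_experts} is not divisible by EP={ep}")
    return num_experts // ep


def expert_memory_gb(num_experts, ep, total_expert_gb):
    """Per-GPU expert memory, given the total bf16 size of all experts (assumed input)."""
    return total_expert_gb * experts_per_gpu(num_experts, ep) / num_experts


def fits(workload, memory_gb):
    """Compare an estimate against the rule-of-thumb budget for a workload."""
    return memory_gb <= BUDGET_GB[workload]


if __name__ == "__main__":
    for ep in (4, 8, 16, 32):
        gb = expert_memory_gb(256, ep, 460)
        ok = [name for name in BUDGET_GB if fits(name, gb)]
        print(f"EP={ep}: {experts_per_gpu(256, ep)} experts/GPU, ~{gb:.1f} GB, fits: {ok}")
```

Under that assumption, EP=8 lands just under the roundtrip budget but over the inference budget, which lines up with the table row for TP=2, EP=8.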
ModuleNotFoundError: No module named 'megatron.core.tensor_parallel'
Cause: Container's pre-installed megatron-core conflicts with local `3rdparty/Megatron-LM`.
Fix: Add `uv sync` before running:
```bash
CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 10; fi && "
CMD="${CMD}uv run --no-sync python <script> <args>"
```
FP8 Weight Mismatch in Roundtrip
Symptom: Roundtrip completes but shows ❌ for all expert weights and raises `ValueError: Weight mismatch detected`.
Cause: Original HF weights are FP8, Megatron stores in BF16. Exported weights are BF16. Comparison against original FP8 exceeds `atol=1e-1`.
This is expected for FP8 models. The conversion is correct; the comparison tolerance is insufficient for the FP8→BF16 precision gap.
`WORLD_SIZE` Not Set with srun
Symptom: Script exits with "must be launched with torchrun".
Cause: Scripts check `os.environ.get("WORLD_SIZE")`, which torchrun sets but srun doesn't.
Fix: Also check `SLURM_NTASKS`:
```python
import os
import sys

# Exit only if neither torchrun (WORLD_SIZE) nor srun (SLURM_NTASKS) launched us.
if os.environ.get("WORLD_SIZE") is None and os.environ.get("SLURM_NTASKS") is None:
    sys.exit(1)
```
Bridge's `common_utils.py` helpers (called by `initialize.py`) populate env vars from SLURM:
```python
if "RANK" not in os.environ:
    os.environ["RANK"] = str(get_rank_safe())              # uses SLURM_PROCID
if "WORLD_SIZE" not in os.environ:
    os.environ["WORLD_SIZE"] = str(get_world_size_safe())  # uses SLURM_NTASKS
if "MASTER_ADDR" not in os.environ:
    os.environ["MASTER_ADDR"] = get_master_addr_safe()     # parses SLURM_NODELIST
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = str(get_master_port_safe())  # derives from SLURM_JOB_ID
```
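The `get_*_safe` helpers belong to Bridge and their implementations are not shown in this document. As an illustration of the idea only, here is a self-contained sketch with hypothetical function names, a simplified nodelist parser, and a made-up port formula; real code would more likely call `scontrol show hostnames` to expand the nodelist.

```python
import re


def first_host(nodelist):
    """First hostname in a compact nodelist such as 'node[003-006,010]' or 'a1,a2'.

    Simplified parser (assumption): handles a single bracket group only.
    """
    match = re.match(r"([^\[,]+)(?:\[([^\]]+)\])?", nodelist)
    if match is None:
        raise ValueError(f"cannot parse nodelist: {nodelist!r}")
    prefix, ranges = match.group(1), match.group(2)
    if ranges is None:
        return prefix
    return prefix + re.split(r"[,-]", ranges)[0]


def port_from_job_id(job_id):
    """Hypothetical formula: map the job id into the 20000-29999 port range."""
    return 20000 + int(job_id) % 10000


def derive_torch_env(env):
    """Build the five torch.distributed variables, keeping any already set."""
    derived = {
        "RANK": env["SLURM_PROCID"],
        "WORLD_SIZE": env["SLURM_NTASKS"],
        "LOCAL_RANK": env["SLURM_LOCALID"],
        "MASTER_ADDR": first_host(env["SLURM_NODELIST"]),
        "MASTER_PORT": str(port_from_job_id(env["SLURM_JOB_ID"])),
    }
    return {key: env.get(key, value) for key, value in derived.items()}


if __name__ == "__main__":
    slurm_env = {
        "SLURM_PROCID": "9",
        "SLURM_NTASKS": "16",
        "SLURM_LOCALID": "1",
        "SLURM_NODELIST": "node[003-004]",
        "SLURM_JOB_ID": "123456",
    }
    print(derive_torch_env(slurm_env))
```

Deriving the port from the job id means every task in a job computes the same value without communicating, while different jobs sharing a node are unlikely to collide.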
Key Gotchas
- Two-phase srun for `uv sync`: Run a single-process srun first to warm the cache, then the full multi-node srun. The second `uv sync` is a fast no-op since everything is already cached on the shared filesystem.
- `--no-container-mount-home` is an `srun` flag, NOT an `#SBATCH` directive.
- Escaping inside TRAIN_CMD: Since `TRAIN_CMD` is a double-quoted string, escape inner `$` for Slurm variables that must expand at runtime (not sbatch time):
  - `\${SLURM_PROCID}`, `\${SLURM_JOB_NUM_NODES}`, `\${SLURM_NODEID}`
  - Host-side variables like `$GH_TOKEN`, `$LOGDIR`, `$WORKDIR` expand at sbatch time, so no escaping is needed.
- Bridge `rm -rf nemo_experiments`: Add before training to avoid stale checkpoint auto-resume.
- MLM needs PYTHONPATH: For pretrain_gpt.py scripts, add inside TRAIN_CMD:
  ```bash
  PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \
  ```
- Node count heuristic: Total GPUs = `NNODES * 8`. Must satisfy: `TP * PP * EP * DP >= total_GPUs` where `DP = total_GPUs / (TP * PP * EP)`.
- `NEMO_HOME` on shared filesystem for multi-node SFT: The default nemo cache (`/root/.cache/nemo`) is container-local. Multi-node SFT with packed sequences prepares `.npy` files on one node that are invisible to others. Set `export NEMO_HOME=<SHARED_FS>/cache/nemo` so packed data is shared. Without this, ranks on other nodes fail with `TypeError: 'NoneType' object is not an iterator`.
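The node count heuristic can be checked mechanically before submitting. The sketch below follows the document's formula `DP = total_GPUs / (TP * PP * EP)` and adds one assumption of its own: that `DP` must come out as a whole number of at least 1. The function names are hypothetical.

```python
GPUS_PER_NODE = 8  # matches --gpus-per-node=8 used throughout


def data_parallel_size(nnodes, tp, pp, ep, gpus_per_node=GPUS_PER_NODE):
    """DP = total_GPUs / (TP * PP * EP); raise if it is not a whole number >= 1."""
    total_gpus = nnodes * gpus_per_node
    model_parallel = tp * pp * ep
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split evenly by TP*PP*EP={model_parallel}"
        )
    return total_gpus // model_parallel


def min_nodes(tp, pp, ep, gpus_per_node=GPUS_PER_NODE):
    """Smallest node count whose GPUs cover one TP*PP*EP group (ceiling division)."""
    return -(-(tp * pp * ep) // gpus_per_node)


if __name__ == "__main__":
    print(min_nodes(tp=2, pp=1, ep=8))                     # nodes needed for TP=2, EP=8
    print(data_parallel_size(nnodes=2, tp=1, pp=1, ep=16))  # DP on 2 nodes with EP=16
```

For example, TP=2 with EP=8 needs 16 GPUs, so a single 8-GPU node raises while two nodes give DP=1.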
Full Template
```bash
#!/bin/bash
# ==============================================================================
# <MODEL_NAME> <pretrain|sft> - <Framework: MLM | Megatron Bridge>
#
# Default: TP<X> PP<Y> EP<Z>, NNODES=<N> (<N*8> GPUs), MBS=<M>, GBS=<G>
#
# Usage:
#   sbatch <script_name>.sh
# ==============================================================================

#SBATCH --job-name=<job-name>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive

# ── Container ────────────────────────────────────────────────────────────
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"

# ── Paths ────────────────────────────────────────────────────────────────
WORKDIR="/opt/Megatron-Bridge"
LOGDIR="<SHARED_FS>/logs/<logdir_name>"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"

# ── Parallelism ──────────────────────────────────────────────────────────
TP=1; PP=1; EP=1

# ── Training ─────────────────────────────────────────────────────────────
MBS=1; GBS=256
SEQ=4096
SEED=1234
TRAIN_ITERS=20

# ── Tokens / Caches ──────────────────────────────────────────────────────
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"

# ── Build training command ───────────────────────────────────────────────
TRAIN_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1 && \
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 && \
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True && \
export NCCL_NVLS_ENABLE=0 && \
export GH_TOKEN=$GH_TOKEN && \
export HF_TOKEN=$HF_TOKEN && \
export HF_HOME=$HF_HOME && \
export UV_CACHE_DIR=$UV_CACHE_DIR && \
export NEMO_HOME=$NEMO_HOME && \
wandb login \$WANDB_API_KEY && \
mkdir -p $LOGDIR && \
cd $WORKDIR && \
uv sync && \
<TRAINING_COMMAND_HERE>
"

echo "======================================"
echo "<MODEL_NAME> <Framework> Pretrain"
echo "Job: $SLURM_JOB_ID | Nodes: $SLURM_JOB_NUM_NODES"
echo "TP=$TP PP=$PP EP=$EP MBS=$MBS GBS=$GBS"
echo "======================================"

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync in TRAIN_CMD is a fast no-op)
srun --mpi=pmix --no-kill \
    --container-image="$CONTAINER_IMAGE" \
    --container-mounts="$CONTAINER_MOUNTS" \
    --no-container-mount-home \
    bash -c "$TRAIN_CMD" 2>&1 | tee "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log"

echo ""
echo "======================================"
echo "Done. Losses:"
echo "======================================"
grep -E "iteration\s+" "$LOGDIR/<prefix>_${SLURM_JOB_ID}.log" | grep -iE "lm loss|reduced_train_loss" | head -25
```
Bridge-Specific TRAIN_CMD Body
```bash
rm -rf nemo_experiments && \
uv run python -m torch.distributed.run \
    --nproc_per_node=8 \
    --nnodes=\${SLURM_JOB_NUM_NODES} \
    --node_rank=\${SLURM_NODEID} \
    scripts/training/run_recipe.py \
    --recipe <recipe_name> \
    model.tensor_model_parallel_size=$TP \
    model.pipeline_model_parallel_size=$PP \
    ...overrides...
```
MLM-Specific TRAIN_CMD Body
```bash
PYTHONPATH=${WORKDIR}/3rdparty/Megatron-LM:\${PYTHONPATH:-} \
uv run python -m torch.distributed.run \
    --nproc_per_node=8 \
    --nnodes=\${SLURM_JOB_NUM_NODES} \
    --node_rank=\${SLURM_NODEID} \
    3rdparty/Megatron-LM/pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    ...args...
```