mcore-run-on-slurm
Run Megatron-LM on SLURM
Answer-First Constants
For text-only SLURM setup questions, answer with these constants before the
full script:
- Submit from a shared worktree path visible to every node; `cd` there in the script before launching training.
- Use one `srun` task per node and launch workers with `uv run python -m torch.distributed.run`, not bare `torchrun`.
- Set `MASTER_ADDR` from `scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1`, set `MASTER_PORT`, `NNODES=${SLURM_NNODES}`, `GPUS_PER_NODE=<GPUS_PER_NODE>`, and `WORLD_SIZE=$((NNODES * GPUS_PER_NODE))`.
- Pass `--nnodes`, `--nproc-per-node`, `--node-rank`, `--master-addr`, and `--master-port` to `torch.distributed.run`.
- `CUDA_DEVICE_MAX_CONNECTIONS`: pre-Blackwell Hopper/Ampere with TP>1 or CP>1 and non-FSDP uses `1`; Blackwell/GB200 does not need it; Torch-FSDP2 or Megatron-FSDP must not use `1`; `overlap_moe_expert_parallel_comm` uses `32`.
Prerequisites
- A SLURM cluster login with submission rights to a GPU partition.
- Megatron-LM checked out on a filesystem visible to all nodes in the allocation (NFS, Lustre, or similar). All nodes must reach the same paths for code, data, checkpoints, and output.
- `uv` installed; run `uv sync --extra training --extra dev` (or `--extra lts`) on the worktree once before submission so the `.venv` is materialized and visible to every node.
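A pre-submission check along these lines can catch a missing environment before any nodes are allocated. It is a minimal sketch: `check_worktree` is a hypothetical helper name, and it assumes only that `uv sync` materializes `.venv` directly under the worktree root.

```bash
# Hypothetical helper: verify that a worktree looks ready for submission.
# Assumes the uv-managed environment lives at <worktree>/.venv.
check_worktree() {
  local worktree="$1"
  if [ ! -d "$worktree" ]; then
    echo "worktree not found: $worktree" >&2
    return 1
  fi
  if [ ! -d "$worktree/.venv" ]; then
    echo "missing $worktree/.venv; run uv sync there first" >&2
    return 2
  fi
  echo "ok: $worktree"
}
```

Run from the login node, this confirms only that the path exists there; it cannot prove that compute nodes mount the same filesystem.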
Minimal sbatch script
Save as `run_megatron.slurm` in the worktree:
```bash
#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -euo pipefail

cd <MEGATRON_WORKTREE>

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP):
#   export CUDA_DEVICE_MAX_CONNECTIONS=1

srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
  # NODE_RANK comes from SLURM_NODEID with one task per node.
  NODE_RANK=${SLURM_NODEID}
  uv run python -m torch.distributed.run \
    --nnodes='"${NNODES}"' \
    --nproc-per-node='"${GPUS_PER_NODE}"' \
    --node-rank=${NODE_RANK} \
    --master-addr='"${MASTER_ADDR}"' \
    --master-port='"${MASTER_PORT}"' \
    pretrain_gpt.py \
    <MEGATRON_ARGS>
'
```
Submit:
```bash
mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"
```
Multi-node rules
- Submit from the worktree you intend to run, or `cd` to it in the script. All nodes must reach the same path on a shared filesystem (NFS, Lustre, or similar); node-local paths will not be visible to peer ranks.
- Use one `torchrun` worker group across all nodes; do not start independent single-node jobs.
- `--nproc-per-node` should equal the number of visible GPUs per node.
- Write checkpoints, tensorboard data, and structured logs to shared storage.
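The `--nproc-per-node` rule can be checked inside the job before launching workers. The sketch below assumes that the scheduler exports `CUDA_VISIBLE_DEVICES` as a comma-separated list for each task, which depends on site configuration; `count_devices` and `check_nproc` are hypothetical helper names.

```bash
# Hypothetical helper: count entries in a comma-separated device list,
# such as the value of CUDA_VISIBLE_DEVICES. Prints 0 for an empty list.
count_devices() {
  local list="$1"
  if [ -z "$list" ]; then
    echo 0
    return
  fi
  local IFS=','
  local -a devices
  read -r -a devices <<< "$list"
  echo "${#devices[@]}"
}

# Hypothetical helper: fail when the expected per-node GPU count differs
# from the number of devices listed in CUDA_VISIBLE_DEVICES.
check_nproc() {
  local expected="$1" visible
  visible=$(count_devices "${CUDA_VISIBLE_DEVICES:-}")
  if [ "$visible" -ne "$expected" ]; then
    echo "GPUS_PER_NODE=$expected but $visible GPUs are visible" >&2
    return 1
  fi
}
```

On clusters that do not set `CUDA_VISIBLE_DEVICES`, the count would have to come from another source, so treat a mismatch as a prompt to investigate rather than proof of a misconfiguration.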
CUDA_DEVICE_MAX_CONNECTIONS
The right value depends on your hardware and parallelism mode. Do not export it unconditionally:
- Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP: set to `1`. The relevant code path asserts on this: you will get an assertion error if it is not `1`, not a silent deadlock.
- Blackwell: not required; setting it has no effect.
- Torch-FSDP2 or Megatron-FSDP: must NOT be `1`. Leave the env var unset, or set it to a value greater than `1`.
- `overlap_moe_expert_parallel_comm` enabled: set to `32`.
Set it explicitly in the sbatch script when your configuration calls for it.
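The rules above can be folded into a small helper so the sbatch script exports the variable only when a rule applies. This is a sketch, not Megatron-LM code: `cuda_max_connections` and its argument order are hypothetical, and the precedence between rules (MoE overlap first, then FSDP, then the TP/CP rule) is an assumption, because the rules above do not say which wins when several apply.

```bash
# Hypothetical helper encoding the rules above. Arguments:
#   $1 arch: "blackwell" or "pre-blackwell"
#   $2 TP size, $3 CP size
#   $4 fsdp: 1 if Torch-FSDP2 or Megatron-FSDP is used, else 0
#   $5 moe_overlap: 1 if overlap_moe_expert_parallel_comm is enabled, else 0
# Prints the value to export, or nothing when the variable should stay unset.
# The precedence between rules is an assumption, not taken from Megatron-LM.
cuda_max_connections() {
  local arch="$1" tp="$2" cp="$3" fsdp="$4" moe_overlap="$5"
  if [ "$moe_overlap" -eq 1 ]; then
    echo 32
  elif [ "$fsdp" -eq 1 ]; then
    :  # leave unset: FSDP must not run with a value of 1
  elif [ "$arch" != "blackwell" ] && { [ "$tp" -gt 1 ] || [ "$cp" -gt 1 ]; }; then
    echo 1
  fi
}

# Usage sketch: export only when the helper prints a value.
value=$(cuda_max_connections pre-blackwell 2 1 0 0)
if [ -n "$value" ]; then
  export CUDA_DEVICE_MAX_CONNECTIONS="$value"
fi
```

Printing nothing for the "leave unset" cases keeps the caller from exporting an empty string by accident.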
Containers
Many sites run Megatron-LM inside a container (enroot/pyxis on some clusters, singularity on others). If you do, the uv-managed `.venv` must live on a path that is visible from inside the container, and the container image must provide the CUDA / NCCL / torch versions the repo expects (see `docker/.ngc_version.dev` and `.ngc_version.lts`). The skeleton above stays the same; wrap the `srun` invocation with your scheduler's container flags (`--container-image=…`, `--container-mounts=…`, etc.).
Monitor and collect
```bash
squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
scancel "$JOB_ID"
```
If your training script writes a result artifact (a JSON metrics file from rank 0, a final checkpoint, etc.), poll for the artifact rather than waiting only on `squeue` state. Useful output usually appears before SLURM marks the job complete, and polling on the artifact lets you cancel the job as soon as it lands instead of holding the allocation until the timeout.
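Artifact polling of this kind can be sketched as a small wait loop. `wait_for_artifact` is a hypothetical helper, and the path in the usage comment is a placeholder; substitute whatever file your training script writes.

```bash
# Hypothetical helper: wait until a result file exists, up to a timeout.
# Arguments: artifact path, timeout in seconds, poll interval in whole
# seconds (default 30).
# Returns 0 as soon as the file exists, 1 if the timeout expires first.
wait_for_artifact() {
  local artifact="$1" timeout="$2" interval="${3:-30}"
  local waited=0
  while [ ! -e "$artifact" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Usage sketch (placeholder path on shared storage):
#   if wait_for_artifact <RESULT_FILE> 3600 30; then
#     scancel "$JOB_ID"
#   fi
```

The loop only checks that the file exists. If the artifact is written incrementally, a safer variant would have the training script write to a temporary name and rename it when complete.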
Failure diagnosis
Scan stderr from every rank, not just rank 0. The earliest non-NCCL Python traceback is usually the root cause; later NCCL timeouts on other ranks are downstream symptoms of the first crash.
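One way to find candidate root causes is to print the first Python traceback in each log file. This is a sketch: `first_tracebacks` is a hypothetical helper, and it assumes that line order within a log file roughly follows time order, which interleaved multi-rank output does not guarantee.

```bash
# Hypothetical helper: print "file:line:text" for the first Python
# traceback header in each of the given log files.
first_tracebacks() {
  # -H prints the file name, -n the line number, and -m 1 stops after
  # the first match in each file. "|| true" keeps a no-match result
  # from aborting scripts that run under "set -e".
  grep -H -n -m 1 'Traceback (most recent call last)' "$@" || true
}

# Usage sketch:
#   first_tracebacks logs/*.err
```

Comparing the reported line against any nearby NCCL messages helps separate the first crash from the timeouts that follow it.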
Classify quickly:
- OOM: record rank, phase (forward / backward / optimizer), batch size, sequence length, parallelism (TP/DP/CP/PP), and peak memory before adjusting.
- Shape / divisibility error: check `WORLD_SIZE = TP × DP × CP × PP` and head-count divisibility (`num_attention_heads % TP == 0`).
- Import error: wrong worktree, missing `uv sync`, or stale `PYTHONPATH`. Confirm `cd <MEGATRON_WORKTREE>` before launch.
- NCCL failure with no Python traceback: verify allocation, port reachability, `MASTER_ADDR` resolution, and command consistency across ranks.
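The two divisibility checks can be run before submission with plain shell arithmetic. `check_parallelism` is a hypothetical helper and covers only the two rules named above, not every constraint Megatron-LM enforces.

```bash
# Hypothetical helper: check the two divisibility rules named above.
# Arguments: WORLD_SIZE, TP, DP, CP, PP, num_attention_heads.
check_parallelism() {
  local world="$1" tp="$2" dp="$3" cp="$4" pp="$5" heads="$6"
  local product=$((tp * dp * cp * pp))
  if [ "$product" -ne "$world" ]; then
    echo "TP*DP*CP*PP=$product does not match WORLD_SIZE=$world" >&2
    return 1
  fi
  if [ $((heads % tp)) -ne 0 ]; then
    echo "num_attention_heads=$heads is not divisible by TP=$tp" >&2
    return 1
  fi
}
```

For example, `check_parallelism 16 2 4 1 2 32` succeeds because 2 × 4 × 1 × 2 = 16 and 32 is divisible by 2, while dropping PP to 1 fails the first check.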
Common pitfalls
- Forgetting `uv sync` before the first submission. If the venv is missing, every job rebuilds it from inside `srun`, costing minutes per job.
- Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
- Setting `CUDA_DEVICE_MAX_CONNECTIONS=1` blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). Setting it to `1` with FSDP causes a different problem; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP) the code asserts, it does not deadlock.
- Running bare `torchrun` instead of `uv run python -m torch.distributed.run`. Bare `torchrun` may dispatch through a python interpreter that does not see venv packages, depending on how the venv is set up.