run-on-slurm

Run Megatron-LM on SLURM

Prerequisites

- A SLURM cluster login with submission rights to a GPU partition.
- Megatron-LM checked out on a filesystem visible to all nodes in the allocation (NFS, Lustre, or similar). All nodes must reach the same paths for code, data, checkpoints, and output.
- `uv` installed. Run `uv sync --extra training --extra dev` (or `--extra lts`) on the worktree once before submission so the `.venv` is materialized and visible to every node.
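
A quick check before submission can catch a missing environment before the job reaches the queue. The sketch below assumes uv's default layout on Linux, where the interpreter lands at `.venv/bin/python`; the helper name `check_venv` is hypothetical and not part of Megatron-LM or uv.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if the worktree has a materialized uv venv.
# Assumption: default uv layout on Linux (.venv/bin/python inside the worktree).
check_venv() {
  local worktree=$1
  if [ -x "${worktree}/.venv/bin/python" ]; then
    echo "venv found: ${worktree}/.venv"
  else
    echo "no venv under ${worktree}; run uv sync there first" >&2
    return 1
  fi
}
```

Calling `check_venv <MEGATRON_WORKTREE>` from the login node only proves the path exists there; it is still worth confirming that compute nodes mount the same filesystem.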

Minimal sbatch script

Save as `run_megatron.slurm` in the worktree:

```bash
#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -euo pipefail
cd <MEGATRON_WORKTREE>

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP):
# export CUDA_DEVICE_MAX_CONNECTIONS=1

srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
  # NODE_RANK comes from SLURM_NODEID with one task per node.
  # It is assigned on its own line so it is set before ${NODE_RANK} expands below.
  NODE_RANK=${SLURM_NODEID}
  uv run python -m torch.distributed.run \
    --nnodes='"${NNODES}"' \
    --nproc-per-node='"${GPUS_PER_NODE}"' \
    --node-rank=${NODE_RANK} \
    --master-addr='"${MASTER_ADDR}"' \
    --master-port='"${MASTER_PORT}"' \
    pretrain_gpt.py \
    <MEGATRON_ARGS>
'
```
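
The quoting in the `srun` command mixes two expansion times: variables spliced in as `'"${VAR}"'` are expanded once by the batch shell, while variables left inside the single quotes are expanded separately by each per-node `bash -c`. The toy function below illustrates the pattern in isolation; the names `quoting_demo`, `OUTER`, and `INNER` are made up for the illustration.

```bash
#!/bin/bash
# Illustration only: OUTER plays the role of NNODES or MASTER_ADDR (set once in the
# batch shell); INNER plays the role of NODE_RANK (set separately in each task).
quoting_demo() {
  local OUTER=from-batch-shell
  # Text inside single quotes reaches the inner bash untouched, so ${INNER} expands there.
  # The quote-splice around ${OUTER} briefly leaves the single quotes, so OUTER expands here.
  bash -c 'INNER=from-task-shell; echo "outer='"${OUTER}"' inner=${INNER}"'
}

quoting_demo   # prints: outer=from-batch-shell inner=from-task-shell
```

This is why `SLURM_NODEID` has to stay inside the single quotes: it differs per task, and expanding it in the batch shell would give every node the same rank.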


Submit:

```bash
mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"
```

Multi-node rules

- Submit from the worktree you intend to run, or `cd` to it in the script. All nodes must reach the same path on a shared filesystem (NFS, Lustre, or similar); node-local paths will not be visible to peer ranks.
- Use one `torchrun` worker group across all nodes; do not start independent single-node jobs.
- `--nproc-per-node` should equal the number of visible GPUs per node.
- Write checkpoints, tensorboard data, and structured logs to shared storage.
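
One way to confirm that `--nproc-per-node` matches what a node actually exposes is to count devices inside the allocation. This is a rough sketch: the helper name `count_visible_gpus` is hypothetical, it assumes `CUDA_VISIBLE_DEVICES` (when set) is a plain comma-separated list, and the `nvidia-smi -L` fallback may over-count on nodes with MIG enabled because MIG instances are listed on extra lines.

```bash
#!/bin/bash
# Hypothetical helper: print the number of GPUs visible to the current process.
count_visible_gpus() {
  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    # Assumption: a comma-separated device list such as "0,1,2,3".
    local IFS=,
    local -a devices
    read -r -a devices <<< "$CUDA_VISIBLE_DEVICES"
    echo "${#devices[@]}"
  else
    # Fallback: nvidia-smi prints one line per GPU (plus MIG lines, if any).
    nvidia-smi -L | wc -l
  fi
}
```

Inside the sbatch script, comparing the result against `GPUS_PER_NODE` before calling `srun` turns a silent mismatch into an early, readable failure.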

CUDA_DEVICE_MAX_CONNECTIONS

The right value depends on your hardware and parallelism mode. Do not export it unconditionally:

- Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP: set to `1`. The relevant code path asserts on this, so a value other than `1` produces an assertion error, not a silent deadlock.
- Blackwell: not required; setting it has no effect.
- Torch-FSDP2 or Megatron-FSDP: must NOT be `1`. Leave the env var unset, or set it to a value greater than `1`.
- `overlap_moe_expert_parallel_comm` enabled: set to `32`.

Set it explicitly in the sbatch script when your configuration calls for it.
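
The rules above can be encoded as a small decision function in the sbatch script. Treat this as a sketch under stated assumptions: the function name and its argument convention are hypothetical, and the precedence between rules (for example, MoE overlap combined with TP>1 on Hopper) is not specified in this document, so the ordering below is a guess that should be checked against the Megatron-LM code you are running.

```bash
#!/bin/bash
# Hypothetical helper. Prints the value to export, or nothing for "leave unset".
# Args: arch (blackwell|hopper|ampere), TP size, CP size, fsdp (0|1), moe_overlap (0|1)
pick_cuda_device_max_connections() {
  local arch=$1 tp=$2 cp=$3 fsdp=$4 moe_overlap=$5
  if [ "$moe_overlap" = 1 ]; then
    echo 32        # overlap_moe_expert_parallel_comm enabled (ASSUMED to take precedence)
  elif [ "$fsdp" = 1 ]; then
    :              # Torch-FSDP2 / Megatron-FSDP: must not be 1, so leave unset
  elif [ "$arch" = blackwell ]; then
    :              # not required on Blackwell
  elif [ "$tp" -gt 1 ] || [ "$cp" -gt 1 ]; then
    echo 1         # pre-Blackwell, non-FSDP, TP>1 or CP>1
  fi
}

# Example use: export only when a value was chosen.
# value=$(pick_cuda_device_max_connections hopper 2 1 0 0)
# if [ -n "$value" ]; then export CUDA_DEVICE_MAX_CONNECTIONS=$value; fi
```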

Containers

Many sites run Megatron-LM inside a container (enroot/pyxis on some clusters, singularity on others). If you do, the uv-managed `.venv` must live on a path that is visible from inside the container, and the container image must provide the CUDA / NCCL / torch versions the repo expects (see `docker/.ngc_version.dev` and `.ngc_version.lts`). The skeleton above stays the same; wrap the `srun` invocation with your scheduler's container flags (`--container-image=…`, `--container-mounts=…`, etc.).

Monitor and collect

```bash
squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
scancel "$JOB_ID"
```

If your training script writes a result artifact (a JSON metrics file from rank 0, a final checkpoint, etc.), poll for the artifact rather than waiting only on `squeue` state. Useful output usually appears before SLURM marks the job complete, and polling on the artifact lets you cancel the job as soon as it lands instead of holding the allocation until the timeout.
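
A minimal polling loop for that pattern might look like the following. The helper name `wait_for_artifact` is hypothetical, and the check only tests that the file exists and is non-empty, so it assumes the training script writes the artifact atomically (for example, to a temporary name followed by a rename).

```bash
#!/bin/bash
# Hypothetical helper: wait until a non-empty file appears, or give up after a timeout.
# Args: path, timeout in seconds, poll interval in seconds (default 30)
wait_for_artifact() {
  local path=$1 timeout_s=$2 interval_s=${3:-30}
  local waited=0
  while [ ! -s "$path" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1   # artifact never showed up within the timeout
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}

# Example use (the metrics path is a placeholder):
# wait_for_artifact <RESULT_ARTIFACT_PATH> 3600 30 && scancel "$JOB_ID"
```

A fuller version would also stop polling once `squeue` no longer lists the job, so that a crashed run does not keep the loop waiting for the full timeout.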

Failure diagnosis

Scan stderr from every rank, not just rank 0. The earliest non-NCCL Python traceback is usually the root cause; later NCCL timeouts on other ranks are downstream symptoms of the first crash.
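
To locate that first traceback quickly, a small helper can print it with some trailing context. This sketch assumes all ranks write to a single stderr file (as with the `--error=logs/%x-%j.err` setting) and uses position in the file as a proxy for time, which is only approximate when output from many ranks interleaves. The name `first_traceback` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print the first Python traceback in a log plus N following lines.
# Args: log file, number of context lines after the traceback header (default 20)
first_traceback() {
  local log=$1 context=${2:-20}
  local line
  # grep -n prefixes the match with its line number; -m1 stops at the first match.
  line=$(grep -n -m1 'Traceback (most recent call last)' "$log" | cut -d: -f1) || true
  [ -n "$line" ] || return 1
  sed -n "${line},$((line + context))p" "$log"
}

# Example use: first_traceback logs/megatron-<JOB_ID>.err 40
```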

Classify quickly:

- OOM: record rank, phase (forward / backward / optimizer), batch size, sequence length, parallelism (TP/DP/CP/PP), and peak memory before adjusting.
- Shape / divisibility error: check `WORLD_SIZE = TP × DP × CP × PP` and head-count divisibility (`num_attention_heads % TP == 0`).
- Import error: wrong worktree, missing `uv sync`, or stale `PYTHONPATH`. Confirm `cd <MEGATRON_WORKTREE>` before launch.
- NCCL failure with no Python traceback: verify allocation, port reachability, `MASTER_ADDR` resolution, and command consistency across ranks.
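
The two arithmetic checks from the shape / divisibility item can be run before submission. The function below is a sketch with a hypothetical name and argument order; it covers only the two conditions stated above, not every divisibility constraint Megatron-LM enforces.

```bash
#!/bin/bash
# Hypothetical helper: validate WORLD_SIZE = TP*DP*CP*PP and num_attention_heads % TP == 0.
# Args: world_size tp dp cp pp num_attention_heads
check_parallelism() {
  local world_size=$1 tp=$2 dp=$3 cp=$4 pp=$5 heads=$6
  local product=$((tp * dp * cp * pp))
  if [ "$product" -ne "$world_size" ]; then
    echo "TP*DP*CP*PP = ${product}, but WORLD_SIZE = ${world_size}" >&2
    return 1
  fi
  if [ $((heads % tp)) -ne 0 ]; then
    echo "num_attention_heads=${heads} is not divisible by TP=${tp}" >&2
    return 1
  fi
}

# Example: 2 nodes x 8 GPUs with TP=2, DP=4, CP=1, PP=2 and 32 heads passes.
# check_parallelism 16 2 4 1 2 32
```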

Common pitfalls

- Forgetting `uv sync` before the first submission. If the venv is missing, every job rebuilds it from inside `srun`, costing minutes per job.
- Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
- Setting `CUDA_DEVICE_MAX_CONNECTIONS=1` blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). Setting it to `1` with FSDP causes a different problem; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP), a wrong value triggers an assertion rather than a deadlock.
- Running bare `torchrun` instead of `uv run python -m torch.distributed.run`. Bare `torchrun` may dispatch through a Python interpreter that does not see the venv packages, depending on how the venv is set up.
