mcore-run-on-slurm

Run Megatron-LM on SLURM

Answer-First Constants

For text-only SLURM setup questions, answer with these constants before the full script:
  • Submit from a shared worktree path visible to every node; `cd` there in the script before launching training.
  • Use one `srun` task per node and launch workers with `uv run python -m torch.distributed.run`, not bare `torchrun`.
  • Set `MASTER_ADDR` from `scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1`, set `MASTER_PORT`, `NNODES=${SLURM_NNODES}`, `GPUS_PER_NODE=<GPUS_PER_NODE>`, and `WORLD_SIZE=$((NNODES * GPUS_PER_NODE))`.
  • Pass `--nnodes`, `--nproc-per-node`, `--node-rank`, `--master-addr`, and `--master-port` to `torch.distributed.run`.
  • `CUDA_DEVICE_MAX_CONNECTIONS`: pre-Blackwell Hopper/Ampere with TP>1 or CP>1 and non-FSDP uses `1`; Blackwell/GB200 does not need it; Torch-FSDP2 or Megatron-FSDP must not use `1`; `overlap_moe_expert_parallel_comm` uses `32`.

Prerequisites

  • A SLURM cluster login with submission rights to a GPU partition.
  • Megatron-LM checked out on a filesystem visible to all nodes in the allocation (NFS, Lustre, or similar). All nodes must reach the same paths for code, data, checkpoints, and output.
  • `uv` installed; run `uv sync --extra training --extra dev` (or `--extra lts`) on the worktree once before submission so the `.venv` is materialized and visible to every node.
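A minimal sketch of a pre-submission check for the last prerequisite. The function name `venv_ready` is hypothetical (it is not part of `uv` or Megatron-LM), and the check assumes the usual Linux venv layout where the interpreter lives at `.venv/bin/python`.

```bash
#!/bin/bash
# Hypothetical helper: succeeds when <worktree>/.venv/bin/python exists
# and is executable. Assumes the standard Linux venv layout.
venv_ready() {
  local worktree="$1"
  [ -x "${worktree}/.venv/bin/python" ]
}

# Example use before sbatch (commented out so sourcing has no side effects):
# venv_ready "$PWD" || uv sync --extra training --extra dev
```

Running this from the login node only proves the venv exists there; it still has to sit on the shared filesystem for the compute nodes to see it.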

Minimal sbatch script

Save as `run_megatron.slurm` in the worktree:

```bash
#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -euo pipefail
cd <MEGATRON_WORKTREE>

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP); remove or adjust this line for other configurations:
export CUDA_DEVICE_MAX_CONNECTIONS=1

srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
  # NODE_RANK comes from SLURM_NODEID with one task per node.
  # It is assigned as its own statement so the expansion below sees it.
  NODE_RANK=${SLURM_NODEID}
  uv run python -m torch.distributed.run \
    --nnodes='"${NNODES}"' \
    --nproc-per-node='"${GPUS_PER_NODE}"' \
    --node-rank=${NODE_RANK} \
    --master-addr='"${MASTER_ADDR}"' \
    --master-port='"${MASTER_PORT}"' \
    pretrain_gpt.py \
    <MEGATRON_ARGS>
'
```

Submit:

```bash
mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"
```

Multi-node rules

  • Submit from the worktree you intend to run, or `cd` to it in the script. All nodes must reach the same path on a shared filesystem (NFS, Lustre, or similar); node-local paths will not be visible to peer ranks.
  • Use one `torchrun` worker group across all nodes; do not start independent single-node jobs.
  • `--nproc-per-node` should equal the number of visible GPUs per node.
  • Write checkpoints, tensorboard data, and structured logs to shared storage.
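One way to sanity-check the `--nproc-per-node` rule is to count devices on each node before launching. This is a rough sketch: `count_gpus` and `check_nproc` are hypothetical names, and the count is parsed from `nvidia-smi -L`, which prints one `GPU N: ...` line per device. Note that `nvidia-smi` does not honor `CUDA_VISIBLE_DEVICES`, so the result can differ from what the training process sees if that variable is set.

```bash
#!/bin/bash
# Hypothetical helper: counts devices from `nvidia-smi -L` output on stdin.
count_gpus() {
  grep -c '^GPU [0-9]'
}

# Hypothetical helper: fails when the requested worker count does not
# match the number of devices found on the node.
check_nproc() {
  local expected="$1" visible="$2"
  if [ "$expected" -ne "$visible" ]; then
    echo "nproc-per-node=${expected} but ${visible} GPUs found" >&2
    return 1
  fi
}

# Example use inside the srun shell (commented out):
# check_nproc "${GPUS_PER_NODE}" "$(nvidia-smi -L | count_gpus)"
```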

CUDA_DEVICE_MAX_CONNECTIONS

The right value depends on your hardware and parallelism mode. Do not export it unconditionally:
  • Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP: set to `1`. The relevant code path asserts on this: you will get an assertion error if it is not `1`, not a silent deadlock.
  • Blackwell: not required; setting it has no effect.
  • Torch-FSDP2 or Megatron-FSDP: must NOT be `1`. Leave the env var unset, or set it to a value greater than `1`.
  • `overlap_moe_expert_parallel_comm` enabled: set to `32`.
Set it explicitly in the sbatch script when your configuration calls for it.
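The rules above can be sketched as a small shell function. Everything here is illustrative: `cuda_max_connections` and its argument names are hypothetical, and the order in which overlapping rules are resolved (MoE overlap first, then FSDP, then Blackwell, then TP/CP) is an assumption, since the text does not define a precedence for combined cases.

```bash
#!/bin/bash
# Hypothetical helper: prints the value to export, or prints nothing when
# the variable should be left unset.
#   $1 arch         "blackwell", or anything else for pre-Blackwell
#   $2 fsdp         1 if Torch-FSDP2 or Megatron-FSDP is used, else 0
#   $3 tp, $4 cp    tensor and context parallel sizes
#   $5 moe_overlap  1 if overlap_moe_expert_parallel_comm is enabled, else 0
# ASSUMPTION: the branch order below decides combined cases.
cuda_max_connections() {
  local arch="$1" fsdp="$2" tp="$3" cp="$4" moe_overlap="$5"
  if [ "$moe_overlap" -eq 1 ]; then
    echo 32
  elif [ "$fsdp" -eq 1 ]; then
    :  # must not be 1; leave unset
  elif [ "$arch" = "blackwell" ]; then
    :  # not required
  elif [ "$tp" -gt 1 ] || [ "$cp" -gt 1 ]; then
    echo 1
  fi
}

# Example use in the sbatch script (commented out):
# value=$(cuda_max_connections hopper 0 2 1 0)
# if [ -n "$value" ]; then export CUDA_DEVICE_MAX_CONNECTIONS="$value"; fi
```

If your configuration combines several of these modes, check the assertions in the code path you actually run rather than relying on this ordering.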

Containers

Many sites run Megatron-LM inside a container (enroot/pyxis on some clusters, singularity on others). If you do, the uv-managed `.venv` must live on a path that is visible from inside the container, and the container image must provide the CUDA / NCCL / torch versions the repo expects (see `docker/.ngc_version.dev` and `.ngc_version.lts`). The skeleton above stays the same; wrap the `srun` invocation with your scheduler's container flags (`--container-image=…`, `--container-mounts=…`, etc.).

Monitor and collect

```bash
squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
scancel "$JOB_ID"
```

If your training script writes a result artifact (a JSON metrics file from rank 0, a final checkpoint, etc.), poll for the artifact rather than waiting only on `squeue` state. Useful output usually appears before SLURM marks the job complete, and polling on the artifact lets you cancel the job as soon as it lands instead of holding the allocation until the timeout.
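A minimal polling sketch, assuming the artifact is a single file that the training script writes atomically (for example, written to a temporary name and then renamed). `wait_for_artifact` is a hypothetical helper, and the timeout and interval are whole seconds.

```bash
#!/bin/bash
# Hypothetical helper: polls until the file exists or the timeout expires.
# Returns 0 when the file is found, 1 on timeout.
#   $1 path      artifact to wait for
#   $2 timeout   maximum wait in whole seconds
#   $3 interval  seconds between checks (default 30)
wait_for_artifact() {
  local path="$1" timeout="$2" interval="${3:-30}"
  local waited=0
  while [ ! -e "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Example use after submission (commented out; <RESULT_FILE> is a placeholder):
# if wait_for_artifact "<RESULT_FILE>" 3600 30; then scancel "$JOB_ID"; fi
```

This sketch does not notice a job that has already failed; pairing it with an occasional `squeue` or `sacct` check avoids waiting out the full timeout in that case.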

Failure diagnosis

Scan stderr from every rank, not just rank 0. The earliest non-NCCL Python traceback is usually the root cause; later NCCL timeouts on other ranks are downstream symptoms of the first crash.
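As a starting point for that scan, the sketch below reports where the first Python traceback header appears in each log file. `first_tracebacks` is a hypothetical helper; it only locates the `Traceback (most recent call last)` marker, so deciding whether that traceback is NCCL-related still requires reading it.

```bash
#!/bin/bash
# Hypothetical helper: prints "file:line" for the first Python traceback
# header found in each given log file; files without one print nothing.
first_tracebacks() {
  local f line
  for f in "$@"; do
    # "|| true" keeps a no-match result from aborting under set -e / pipefail.
    line=$(grep -n -m1 -F 'Traceback (most recent call last)' "$f" | cut -d: -f1) || true
    if [ -n "$line" ]; then
      echo "${f}:${line}"
    fi
  done
}

# Example use (commented out):
# first_tracebacks logs/megatron-"$JOB_ID".err
```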
Classify quickly:
  • OOM: record rank, phase (forward / backward / optimizer), batch size, sequence length, parallelism (TP/DP/CP/PP), and peak memory before adjusting.
  • Shape / divisibility error: check `WORLD_SIZE = TP × DP × CP × PP` and head-count divisibility (`num_attention_heads % TP == 0`).
  • Import error: wrong worktree, missing `uv sync`, or stale `PYTHONPATH`. Confirm `cd <MEGATRON_WORKTREE>` before launch.
  • NCCL failure with no Python traceback: verify allocation, port reachability, `MASTER_ADDR` resolution, and command consistency across ranks.
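The two shape checks can be run as plain shell arithmetic before submitting. This is a sketch with a hypothetical function name, `check_parallelism`; it covers only the two conditions named above, not the other divisibility constraints a given model configuration may impose.

```bash
#!/bin/bash
# Hypothetical helper: verifies WORLD_SIZE = TP * DP * CP * PP and that
# num_attention_heads is divisible by TP. Returns 1 on the first violation.
#   $1 world_size  $2 tp  $3 dp  $4 cp  $5 pp  $6 num_attention_heads
check_parallelism() {
  local world_size="$1" tp="$2" dp="$3" cp="$4" pp="$5" heads="$6"
  local product=$((tp * dp * cp * pp))
  if [ "$product" -ne "$world_size" ]; then
    echo "TP*DP*CP*PP=${product} does not match WORLD_SIZE=${world_size}" >&2
    return 1
  fi
  if [ $((heads % tp)) -ne 0 ]; then
    echo "num_attention_heads=${heads} is not divisible by TP=${tp}" >&2
    return 1
  fi
}

# Example: 4 nodes x 8 GPUs = 32 ranks with TP=4, DP=2, CP=2, PP=2, 32 heads.
# check_parallelism 32 4 2 2 2 32
```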

Common pitfalls

  • Forgetting `uv sync` before the first submission. If the venv is missing, every job rebuilds it from inside `srun`, costing minutes per job.
  • Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
  • Setting `CUDA_DEVICE_MAX_CONNECTIONS=1` blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). Setting it to `1` with FSDP causes a different problem; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP) the code asserts, it does not deadlock.
  • Running bare `torchrun` instead of `uv run python -m torch.distributed.run`. Bare `torchrun` may dispatch through a python interpreter that does not see venv packages, depending on how the venv is set up.