nemo-automodel-distributed-training

Distributed Training in NeMo AutoModel

Purpose

NeMo AutoModel uses PyTorch-native distributed training. All parallelism is orchestrated through a single MeshContext object that holds device meshes, strategy configs, and axis names.

Instructions

For conceptual distributed-training questions, answer directly from the quick patterns in this skill without inspecting the repository. Start with the strategy choice, then list only the YAML fields and constraints relevant to the question.
Use direct action verbs in the final answer: recommend the strategy, show the minimal YAML, state the sizing constraint, and name the unsupported strategies. Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing unless the user asks.

Examples

TP plus PP for a large multi-node model

Recommend strategy: fsdp2. Mention tp_size, pp_size, cp_size, ep_size, and the pipeline sub-config. State that dp_size is inferred from world_size / (tp_size * pp_size * cp_size).
yaml
distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1

MoE expert parallelism

Recommend strategy: fsdp2 with ep_size > 1. Say this creates a separate moe_mesh; include the moe sub-config when relevant; state that ep_size must divide dp_size * cp_size. Do not recommend megatron_fsdp or ddp.
yaml
distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false

MegatronFSDP limitations

Say that megatron_fsdp does not support pipeline parallelism, expert parallelism, or sequence_parallel. Recommend fsdp2 for PP, EP, or sequence_parallel; mention that DDP is only simple data parallelism.

Strategy Selection

Three strategies are available, selected via the distributed.strategy YAML key:
| Strategy | YAML value | Best for |
| --- | --- | --- |
| FSDP2 | fsdp2 | General use, recommended default. Supports TP, PP, CP, EP, HSDP. |
| MegatronFSDP | megatron_fsdp | NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel. |
| DDP | ddp | Simple data parallelism only. No TP, PP, CP, or EP. |
Decision tree:
  • Single GPU: no distributed config needed (FSDP2Manager skips parallelization when world_size=1).
  • Multi-GPU single node: fsdp2 (default). Use ddp only if you need the simplest possible setup.
  • Multi-node: fsdp2 with appropriate TP/PP sizing.
  • MoE models with expert parallelism: fsdp2 with ep_size > 1 (creates a separate moe_mesh).
  • Large models (70B+): fsdp2 with PP + TP.
  • Long sequences (8K+): add CP (cp_size > 1).
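The support matrix and decision tree above can be expressed as a small lookup. The following is a minimal sketch, not NeMo AutoModel's own validation code: `UNSUPPORTED` and `unsupported_features` are hypothetical names introduced here, and the rules are transcribed from the strategy table.

```python
# Hypothetical helper: encodes the support matrix from the strategy table.
# It is an illustration, not NeMo AutoModel's validation logic.
UNSUPPORTED = {
    "fsdp2": set(),
    "megatron_fsdp": {"pp", "ep", "sequence_parallel"},
    "ddp": {"tp", "pp", "cp", "ep"},
}


def unsupported_features(strategy, tp_size=1, pp_size=1, cp_size=1, ep_size=1,
                         sequence_parallel=False):
    """Return the requested features that `strategy` cannot provide, sorted."""
    if strategy not in UNSUPPORTED:
        raise ValueError(f"unknown strategy: {strategy!r}")
    requested = set()
    if tp_size > 1:
        requested.add("tp")
    if pp_size > 1:
        requested.add("pp")
    if cp_size > 1:
        requested.add("cp")
    if ep_size > 1:
        requested.add("ep")
    if sequence_parallel:
        requested.add("sequence_parallel")
    return sorted(requested & UNSUPPORTED[strategy])


print(unsupported_features("megatron_fsdp", pp_size=2, ep_size=8))
print(unsupported_features("fsdp2", tp_size=8, pp_size=4))
```

An empty list only means the table raises no objection to the combination; sizing constraints such as divisibility are not checked here.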
When answering strategy-selection questions, state the chosen distributed.strategy first, then enumerate the YAML fields the user must set.
Quick TP + PP answer:
  • Use strategy: fsdp2; do not use megatron_fsdp when pipeline parallelism is required.
  • Set tp_size for tensor parallelism and pp_size for pipeline parallelism.
  • Add a pipeline: sub-config with pp_schedule and pp_microbatch_size.
  • Leave dp_size unset or none; it is inferred as world_size / (tp_size * pp_size * cp_size).
  • Keep TP inside a fast intra-node domain when possible, and use PP across model depth for 70B+ models.
Quick MoE expert-parallel answer:
  • Start with strategy: fsdp2 and ep_size > 1.
  • Include a moe: sub-config only when ep_size > 1; it maps to MoEParallelizerConfig.
  • Expect a separate moe_mesh for expert parallelism in addition to the main device_mesh.
  • Do not recommend megatron_fsdp or ddp for expert parallelism; megatron_fsdp has no EP support.
  • Before finishing an MoE EP answer, explicitly state that ep_size must divide dp_size * cp_size and that megatron_fsdp does not support EP, PP, or sequence_parallel.

YAML Config Structure

The distributed section in the recipe YAML maps directly to parse_distributed_section() in recipes/_dist_setup.py:
yaml
distributed:
  strategy: fsdp2           # fsdp2 | megatron_fsdp | ddp
  dp_size: none             # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none   # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1

  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only

  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields

  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields
dp_size is always inferred:
dp_size = world_size / (tp_size * pp_size * cp_size)
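The inference rule above is plain integer arithmetic. As a minimal sketch (the helper name `infer_dp_size` is hypothetical, and raising on a non-divisible world size is an assumption about reasonable behaviour, not a statement about what the library does):

```python
def infer_dp_size(world_size, tp_size=1, pp_size=1, cp_size=1):
    """Infer dp_size = world_size / (tp_size * pp_size * cp_size)."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        # Assumption: a non-integer dp_size is treated as a configuration error.
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# 64 GPUs with tp_size=8 and pp_size=4 leave 64 / 32 data-parallel replicas.
print(infer_dp_size(64, tp_size=8, pp_size=4))
# 8 GPUs with pp_size=2 leave 8 / 2 data-parallel replicas.
print(infer_dp_size(8, pp_size=2))
```

Note that ep_size does not appear in the denominator: expert parallelism is laid out over the dp_size * cp_size ranks rather than consuming additional GPUs.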

Infrastructure Flow

YAML distributed section
    -> parse_distributed_section()          [recipes/_dist_setup.py]
    -> setup_distributed()                  [recipes/_dist_setup.py]
        -> create_device_mesh()             [components/distributed/device_mesh.py]
        -> MeshContext(...)                  [components/distributed/mesh.py]
    -> instantiate_infrastructure()         [_transformers/infrastructure.py]
        -> _instantiate_distributed()       -> FSDP2Manager / MegatronFSDPManager / DDPManager
        -> _instantiate_pipeline()          -> AutoPipeline (if pp_size > 1)
        -> parallelize_fn                   -> MoE parallelizer (if ep_size > 1) or PP wrapper
    -> apply_model_infrastructure()         [_transformers/infrastructure.py]
        -> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

yaml
distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
This auto-calculates dp_size = world_size and applies fully_shard() per transformer block via DTensor-based sharding.

FSDP2 with Tensor Parallelism

Keep TP within a single NVLink domain (typically one node):
yaml
distributed:
  strategy: fsdp2
  tp_size: 4        # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true
The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:
python
config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

FSDP2 with Pipeline Parallelism

yaml
distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b   # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false
The model must have a _pp_plan attribute (set on the HF model class) for AutoPipeline to know how to split layers across stages. Models without _pp_plan are not compatible with PP.

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Intra-node full sharding + inter-node replication via a 2D DeviceMesh:
yaml
distributed:
  strategy: fsdp2
  dp_replicate_size: 2   # must divide dp_size
Constraint: dp_replicate_size < dp_size (pure replication with no sharding is not supported by FSDP2).
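To make the HSDP constraints concrete, the sketch below computes the two data-parallel mesh dimensions. It is illustrative only: `hsdp_mesh_shape` is a hypothetical helper, and the (replicate, shard) ordering follows the usual PyTorch HSDP convention, which is assumed rather than confirmed for NeMo AutoModel's mesh.

```python
def hsdp_mesh_shape(dp_size, dp_replicate_size):
    """Return (replicate, shard) sizes for a 2D HSDP data-parallel mesh."""
    if dp_size % dp_replicate_size != 0:
        raise ValueError("dp_replicate_size must divide dp_size")
    if dp_replicate_size >= dp_size:
        # Pure replication (no sharding) is not supported by FSDP2.
        raise ValueError("dp_replicate_size must be < dp_size")
    return dp_replicate_size, dp_size // dp_replicate_size


# Two 8-GPU nodes, data parallelism only: shard within a node,
# replicate across the two nodes.
print(hsdp_mesh_shape(dp_size=16, dp_replicate_size=2))
```

With dp_size=16 and dp_replicate_size=2, each replica group shards parameters over 8 ranks, which matches the intra-node sharding plus inter-node replication layout described above when a node has 8 GPUs.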

Activation Checkpointing

Trades compute for memory by recomputing activations during backward:
yaml
distributed:
  activation_checkpointing: true
This is forwarded to the strategy config for non-EP models, or read from MeshContext.activation_checkpointing for EP models.

Gradient Sync Deferral

FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:
yaml
distributed:
  defer_fsdp_grad_sync: true   # default

Mixed Precision

FSDP2Config defaults to bfloat16 for all three precision knobs via MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True). Override via the Python API:
python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

Pipeline Parallelism

Requirements

  1. Model class must define _pp_plan (a dict mapping module FQNs to stages).
  2. pp_size > 1 in the distributed section.
  3. A pipeline sub-config with schedule and microbatch size.

Supported schedules

Defined in PipelineConfig.pp_schedule:
  • 1f1b (one-forward-one-backward, default)
  • gpipe
  • interleaved_1f1b / interleaved1f1b
  • looped_bfs
  • dfs
  • v_schedule
  • zero_bubble

Example (8B model on 8 GPUs, PP=2 + DP=4)

yaml
distributed:
  strategy: fsdp2
  pp_size: 2

  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

How it works

AutoPipeline.build() calls pipeline_model(), which splits the model into stages using the model's _pp_plan, creates PipelineStage objects, and builds the schedule. During training, schedule.step() drives forward and backward through the pipeline.

Context Parallelism

Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.

Config

yaml
distributed:
  strategy: fsdp2
  cp_size: 2   # or 4, 8

Requirements

  • SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. SDPBackend.MATH is not compatible with DTensor.
  • Attention masks are automatically stripped; is_causal=True is set via forward pre-hooks registered by attach_context_parallel_hooks().

How it works

  1. After model sharding, apply_model_infrastructure() calls attach_context_parallel_hooks() on each model part (for non-TE models).
  2. At each training step, make_cp_batch_and_ctx() creates a CP context manager that shards the batch along the sequence dimension and sets up context_parallel() from torch.distributed.tensor.experimental.
  3. For TE attention models, make_cp_batch_for_te() uses THD format and TE's thd_get_partitioned_indices for sharding.

CP with Sequence Packing

CP works with packed sequences. The packed_sequence_size must be divisible by cp_size. When using TE, each chunk is sharded individually via _shard_thd_chunk_for_te().
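The divisibility requirement follows from splitting each packed sample evenly across CP ranks. The sketch below shows the idea with plain lists; `shard_sequence` is a hypothetical helper, and the contiguous split is an assumption chosen for clarity (real CP implementations may use a load-balanced, non-contiguous layout).

```python
def shard_sequence(tokens, cp_size):
    """Split one packed sample into cp_size equal contiguous chunks."""
    if len(tokens) % cp_size != 0:
        raise ValueError("packed_sequence_size must be divisible by cp_size")
    chunk = len(tokens) // cp_size
    # Rank r holds tokens [r * chunk, (r + 1) * chunk).
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]


packed_sample = list(range(8))  # stands in for packed_sequence_size = 8
for rank, shard in enumerate(shard_sequence(packed_sample, cp_size=2)):
    print(rank, shard)
```

With packed_sequence_size: 4096 and cp_size: 2, each rank would hold 2048 tokens of every packed sample under this layout.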

Sequence Packing

Sequence packing combines multiple sequences into a single training sample for efficiency.

Config

yaml
packed_sequence:
  packed_sequence_size: 4096   # 0 = disabled

step_scheduler:
  local_batch_size: 1          # must be 1 for packed sequences
When packed_sequence_size > 0, the dataset collator packs sequences up to that length. local_batch_size must be 1 because each "sample" is already a packed batch.
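As an illustration of what packing means, here is a minimal greedy packer over token lists. It is a sketch under stated assumptions, not the collator's actual algorithm: `pack_sequences` is a hypothetical name, and padding, sequence-boundary tracking, and attention handling are all omitted.

```python
def pack_sequences(sequences, packed_sequence_size):
    """Greedily pack token lists into samples of at most packed_sequence_size."""
    packs, current = [], []
    for seq in sequences:
        if len(seq) > packed_sequence_size:
            raise ValueError("sequence longer than packed_sequence_size")
        # Start a new pack when the next sequence would overflow the current one.
        if len(current) + len(seq) > packed_sequence_size:
            packs.append(current)
            current = []
        current = current + list(seq)
    if current:
        packs.append(current)
    return packs


samples = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], packed_sequence_size=5)
print(samples)
```

Each returned list plays the role of one training "sample", which is why local_batch_size stays at 1: the pack already contains several original sequences.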

MoE Distributed Training

Expert Parallelism

Set ep_size > 1 to distribute experts across GPUs. This creates a separate moe_mesh alongside the main device_mesh:
yaml
distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true
The moe_mesh shape is (pp_size, ep_shard_size, ep_size) with dimension names ("pp", "ep_shard", "ep").
Constraint: dp_cp_size (= dp_size * cp_size) must be divisible by ep_size.
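The shape and the constraint can be checked with a few lines of arithmetic. In the sketch below, `moe_mesh_shape` is a hypothetical helper, and `ep_shard_size = dp_cp_size // ep_size` is an assumption inferred from the divisibility constraint rather than taken from the source.

```python
def moe_mesh_shape(world_size, ep_size, tp_size=1, pp_size=1, cp_size=1):
    """Return the assumed (pp, ep_shard, ep) shape of the moe_mesh."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp*pp*cp")
    dp_size = world_size // model_parallel
    dp_cp_size = dp_size * cp_size
    if dp_cp_size % ep_size != 0:
        raise ValueError("dp_size * cp_size must be divisible by ep_size")
    # Assumption: the ranks left over after expert parallelism form "ep_shard".
    ep_shard_size = dp_cp_size // ep_size
    return (pp_size, ep_shard_size, ep_size)


print(moe_mesh_shape(world_size=8, ep_size=8))              # single node, EP only
print(moe_mesh_shape(world_size=64, ep_size=8, pp_size=4))  # PP for depth, EP for experts
```

For the 8-GPU case, dp_size * cp_size is 8, so ep_size: 8 satisfies the constraint with no ranks left for expert sharding.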

MoE sub-config

yaml
distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true
The moe sub-section maps to MoEParallelizerConfig and is only instantiated when ep_size > 1.

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

yaml
distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

MegatronFSDP limitations

Despite its name, megatron_fsdp does not support expert parallelism (ep_size > 1), pipeline parallelism (pp_size > 1), or sequence_parallel. Use fsdp2 for these features.

Parallelism Sizing Guidelines

Dense models

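| Model size | TP | PP | CP | Strategy |
| --- | --- | --- | --- | --- |
| < 3B | 1 | 1 | 1 | FSDP2 (DP only) |
| 3-13B | 2-4 | 1 | 1 | FSDP2 + TP |
| 13-70B | 4-8 | 2-4 | 1 | FSDP2 + TP + PP |
| 70B+ | 8 | 4-8 | 1 | FSDP2 + TP + PP |
| Any + long seq (8K+) | as above | as above | 2-8 | add CP |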
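The dense-model table can be read as a lookup from parameter count and sequence length to size ranges. The sketch below is a hypothetical convenience (`suggest_dense_sizes` is not part of NeMo AutoModel); the handling of the 13B and 70B boundaries and the reading of "8K+" as 8192 tokens or more are assumptions.

```python
def suggest_dense_sizes(params_billion, seq_len=4096):
    """Return inclusive (min, max) ranges for tp, pp, and cp from the table."""
    if params_billion < 3:
        sizes = {"tp": (1, 1), "pp": (1, 1)}
    elif params_billion < 13:
        sizes = {"tp": (2, 4), "pp": (1, 1)}
    elif params_billion < 70:
        sizes = {"tp": (4, 8), "pp": (2, 4)}
    else:
        sizes = {"tp": (8, 8), "pp": (4, 8)}
    # Assumption: "8K+" means 8192 tokens or more.
    sizes["cp"] = (2, 8) if seq_len >= 8192 else (1, 1)
    return sizes


print(suggest_dense_sizes(8))
print(suggest_dense_sizes(70, seq_len=32768))
```

The ranges are starting points only; the final values still have to satisfy the divisibility rules for world_size and keep TP inside one node.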

MoE models

MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:
| Model | TP | PP | EP | Notes |
| --- | --- | --- | --- | --- |
| Small MoE (<10B total) | 1 | 1 | 8 | EP only |
| Medium MoE (10-30B total) | 1-2 | 1 | 8 | small TP for shared layers |
| Large MoE (100B+ total) | 1-2 | 4+ | 8-64 | PP for depth, EP for experts |

Hardware topology rules

  • TP must stay within a single NVLink domain (one node, typically 8 GPUs).
  • Use PP or DP for cross-node scaling.
  • TP across InfiniBand degrades throughput severely.
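These rules reduce to a simple check on tp_size against the node size. The sketch below is hypothetical (`tp_groups_per_node` is not a library function) and assumes a homogeneous cluster where one node is one NVLink domain.

```python
def tp_groups_per_node(tp_size, gpus_per_node=8):
    """Validate that TP fits inside one node; return TP groups per node."""
    if tp_size > gpus_per_node:
        raise ValueError("tp_size exceeds GPUs per node: TP would cross nodes")
    if gpus_per_node % tp_size != 0:
        raise ValueError("tp_size must divide GPUs per node")
    return gpus_per_node // tp_size


print(tp_groups_per_node(4))   # two TP groups share each 8-GPU node
print(tp_groups_per_node(8))   # one TP group spans the whole node
```

Any parallelism beyond the node boundary is then left to PP or DP, as the rules above recommend.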

Code Anchors

  • components/distributed/config.py: FSDP2Config, MegatronFSDPConfig, DDPConfig.
  • components/distributed/mesh.py: MeshContext, strategy map, and mesh sizes.
  • components/distributed/device_mesh.py: device mesh and moe_mesh creation.
  • components/distributed/pipelining/config.py: PipelineConfig fields.
  • components/moe/config.py: MoEParallelizerConfig and MoEConfig.
  • recipes/_dist_setup.py: YAML parsing and distributed setup.

Pitfalls

  1. TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
  2. PP requires _pp_plan on the model class. Not all HF models have this. Check validate_hf_model_for_pipeline_support() before enabling PP.
  3. PP bubbles reduce GPU utilization. Use interleaved schedules (interleaved_1f1b) and smaller microbatches to reduce bubble time.
  4. FSDP2 requires DTensor-aware state dict saving. Use safetensors with save_consolidated: true for checkpoint compatibility.
  5. CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only. SDPBackend.MATH is not compatible with DTensor.
  6. MoE EP size must evenly divide dp_size * cp_size. The device mesh creation asserts dp_cp_size % ep_size == 0.
  7. MegatronFSDP is more limited than FSDP2. It does not support PP (pp_size > 1), EP (ep_size > 1), or sequence_parallel. The MeshContext validation raises on these combinations.
  8. DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
  9. Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
  10. Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom MixedPrecisionPolicy.
  11. packed_sequence_size must be divisible by cp_size when using CP with packed sequences.
  12. dp_replicate_size is FSDP2-only. Passing it with megatron_fsdp or ddp raises a ValueError.
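Pitfall 3 can be made quantitative with the standard idealized bubble estimate for pipeline schedules. The sketch below is a textbook approximation, not something measured on or taken from NeMo AutoModel: it assumes every stage takes equal time per microbatch and ignores communication, and `bubble_fraction` and `virtual_stages` are names introduced here.

```python
def bubble_fraction(pp_size, num_microbatches, virtual_stages=1):
    """Idealized fraction of step time a pipeline rank spends idle.

    Uses (p - 1) / (v * m + p - 1) with p pipeline stages, m microbatches
    per step, and v virtual stages per rank (v = 1 for non-interleaved
    schedules such as gpipe or 1f1b).
    """
    if pp_size < 1 or num_microbatches < 1 or virtual_stages < 1:
        raise ValueError("all arguments must be >= 1")
    idle = pp_size - 1
    return idle / (virtual_stages * num_microbatches + idle)


# Same batch, split into more (smaller) microbatches: the bubble shrinks.
print(round(bubble_fraction(4, 4), 3))
print(round(bubble_fraction(4, 16), 3))
# Interleaving with two virtual stages per rank shrinks it further.
print(round(bubble_fraction(4, 4, virtual_stages=2), 3))
```

For a fixed batch per step, lowering pp_microbatch_size raises the number of microbatches, which is why both smaller microbatches and interleaved schedules help.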

Verification

Run the smallest recipe that exercises the requested strategy. Success means exit code 0, finite loss, no NCCL timeout, and log output matching the expected TP/PP/CP/EP sizes.
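The success criteria above can be collected into one check. This is a hypothetical sketch: `verify_run` and its arguments are invented for illustration, and it assumes the exit code, loss values, timeout status, and logged parallel sizes have already been extracted from the run by some other means, since the log format is not specified here.

```python
import math


def verify_run(exit_code, losses, expected_sizes, logged_sizes, nccl_timeout=False):
    """Return a list of failed checks; an empty list means success."""
    failures = []
    if exit_code != 0:
        failures.append(f"exit code {exit_code}")
    if not losses or not all(math.isfinite(x) for x in losses):
        failures.append("missing or non-finite loss")
    if nccl_timeout:
        failures.append("NCCL timeout")
    # Compare the parallel sizes reported in the logs with the requested ones.
    for axis in ("tp", "pp", "cp", "ep"):
        if logged_sizes.get(axis) != expected_sizes.get(axis):
            failures.append(f"{axis} size mismatch")
    return failures


sizes = {"tp": 2, "pp": 1, "cp": 1, "ep": 1}
print(verify_run(0, [2.31, 2.07, 1.98], sizes, dict(sizes)))
```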