nemo-automodel-distributed-training


Distributed Training in NeMo AutoModel

Purpose

NeMo AutoModel uses PyTorch-native distributed training. All parallelism is orchestrated through a single `MeshContext` object that holds device meshes, strategy configs, and axis names.

Instructions

For conceptual distributed-training questions, answer directly from the quick patterns in this skill without inspecting the repository. Start with the strategy choice, then list only the YAML fields and constraints relevant to the question.

Use direct action verbs in the final answer: recommend the strategy, show the minimal YAML, state the sizing constraint, and name the unsupported strategies. Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing unless the user asks.

Examples

TP plus PP for a large multi-node model

Recommend `strategy: fsdp2`. Mention `tp_size`, `pp_size`, `cp_size`, `ep_size`, and the `pipeline` sub-config. State that `dp_size` is inferred from `world_size / (tp_size * pp_size * cp_size)`.

yaml

distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1

MoE expert parallelism

Recommend `strategy: fsdp2` with `ep_size > 1`. Say this creates a separate `moe_mesh`; include the `moe` sub-config when relevant; state that `ep_size` must divide `dp_size * cp_size`. Do not recommend `megatron_fsdp` or `ddp`.

yaml

distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false

MegatronFSDP limitations

Answer no for pipeline parallelism, expert parallelism, and `sequence_parallel`: `megatron_fsdp` supports none of them. Recommend `fsdp2` for PP, EP, or `sequence_parallel`; mention that DDP is only simple data parallelism.

Strategy Selection

Three strategies are available, selected via the `distributed.strategy` YAML key:

| Strategy | YAML value | Best for |
| --- | --- | --- |
| FSDP2 | `fsdp2` | General use, recommended default. Supports TP, PP, CP, EP, HSDP. |
| MegatronFSDP | `megatron_fsdp` | NVIDIA Megatron-style FSDP. No PP, no EP, no `sequence_parallel`. |
| DDP | `ddp` | Simple data parallelism only. No TP, PP, CP, or EP. |

Decision tree:

- Single GPU: no distributed config needed (`FSDP2Manager` skips parallelization when `world_size=1`).
- Multi-GPU single node: `fsdp2` (default). Use `ddp` only if you need the simplest possible setup.
- Multi-node: `fsdp2` with appropriate TP/PP sizing.
- MoE models with expert parallelism: `fsdp2` with `ep_size > 1` (creates a separate `moe_mesh`).
- Large models (70B+): `fsdp2` with PP + TP.
- Long sequences (8K+): add CP (`cp_size > 1`).

When answering strategy-selection questions, state the chosen `distributed.strategy` first, then enumerate the YAML fields the user must set.

Quick TP + PP answer:

- Use `strategy: fsdp2`; do not use `megatron_fsdp` when pipeline parallelism is required.
- Set `tp_size` for tensor parallelism and `pp_size` for pipeline parallelism.
- Add a `pipeline:` sub-config with `pp_schedule` and `pp_microbatch_size`.
- Leave `dp_size` unset or `none`; it is inferred as `world_size / (tp_size * pp_size * cp_size)`.
- Keep TP inside a fast intra-node domain when possible, and use PP across model depth for 70B+ models.

Quick MoE expert-parallel answer:

- Start with `strategy: fsdp2` and `ep_size > 1`.
- Include a `moe:` sub-config only when `ep_size > 1`; it maps to `MoEParallelizerConfig`.
- Expect a separate `moe_mesh` for expert parallelism in addition to the main `device_mesh`.
- Do not recommend `megatron_fsdp` or `ddp` for expert parallelism; `megatron_fsdp` has no EP support.
- Before finishing an MoE EP answer, explicitly state that `ep_size` must divide `dp_size * cp_size` and that `megatron_fsdp` does not support EP, PP, or `sequence_parallel`.

YAML Config Structure

The `distributed` section in the recipe YAML maps directly to `parse_distributed_section()` in `recipes/_dist_setup.py`:

yaml

distributed:
  strategy: fsdp2           # fsdp2 | megatron_fsdp | ddp
  dp_size: none             # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none   # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1

  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only

  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields

  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields

`dp_size` is always inferred: `dp_size = world_size / (tp_size * pp_size * cp_size)`.
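The inference rule can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: `infer_dp_size` is a hypothetical helper, and rejecting a world size that does not divide evenly is an assumption about how such a config would be handled.

```python
def infer_dp_size(world_size: int, tp_size: int = 1, pp_size: int = 1, cp_size: int = 1) -> int:
    """Return dp_size = world_size / (tp_size * pp_size * cp_size).

    Hypothetical helper for illustration only.
    """
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        # Assumption: a world size that does not divide evenly is a config error.
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# 64 GPUs with tp_size=8 and pp_size=4 leave 2 data-parallel replicas.
print(infer_dp_size(64, tp_size=8, pp_size=4))
```

With `tp_size`, `pp_size`, and `cp_size` all left at 1, the helper returns `world_size`, which matches the data-parallel-only case.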

Infrastructure Flow

YAML distributed section
    -> parse_distributed_section()          [recipes/_dist_setup.py]
    -> setup_distributed()                  [recipes/_dist_setup.py]
        -> create_device_mesh()             [components/distributed/device_mesh.py]
        -> MeshContext(...)                  [components/distributed/mesh.py]
    -> instantiate_infrastructure()         [_transformers/infrastructure.py]
        -> _instantiate_distributed()       -> FSDP2Manager / MegatronFSDPManager / DDPManager
        -> _instantiate_pipeline()          -> AutoPipeline (if pp_size > 1)
        -> parallelize_fn                   -> MoE parallelizer (if ep_size > 1) or PP wrapper
    -> apply_model_infrastructure()         [_transformers/infrastructure.py]
        -> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

yaml

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1

This auto-calculates `dp_size = world_size` and applies `fully_shard()` per transformer block via DTensor-based sharding.

FSDP2 with Tensor Parallelism

Keep TP within a single NVLink domain (typically one node):

yaml

distributed:
  strategy: fsdp2
  tp_size: 4        # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true

The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:

python

config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

FSDP2 with Pipeline Parallelism

yaml

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b   # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

The model must have a `_pp_plan` attribute (set on the HF model class) for `AutoPipeline` to know how to split layers across stages. Models without `_pp_plan` are not compatible with PP.

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Intra-node full sharding + inter-node replication via a 2D DeviceMesh:

yaml

distributed:
  strategy: fsdp2
  dp_replicate_size: 2   # must divide dp_size

Constraint: `dp_replicate_size < dp_size` (pure replication with no sharding is not supported by FSDP2).
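The two HSDP conditions (divisibility and strict inequality) can be checked before launching a job. The sketch below is illustrative: `split_hsdp` is a hypothetical helper, and the returned `(replicate, shard)` pair is an assumption about how the 2D mesh dimensions relate to `dp_size`.

```python
def split_hsdp(dp_size: int, dp_replicate_size: int) -> tuple:
    """Return (replicate, shard) sizes for a 2D HSDP layout.

    Hypothetical helper; it mirrors the constraints stated in the text.
    """
    if dp_size % dp_replicate_size != 0:
        raise ValueError("dp_replicate_size must divide dp_size")
    if dp_replicate_size >= dp_size:
        # A shard dimension of 1 would mean pure replication with no sharding.
        raise ValueError("dp_replicate_size must be smaller than dp_size")
    return dp_replicate_size, dp_size // dp_replicate_size


# Two nodes of 8 GPUs each: replicate across nodes, shard within a node.
print(split_hsdp(dp_size=16, dp_replicate_size=2))
```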

Activation Checkpointing

Trades compute for memory by recomputing activations during backward:

yaml

distributed:
  activation_checkpointing: true

This is forwarded to the strategy config for non-EP models, or read from `MeshContext.activation_checkpointing` for EP models.

Gradient Sync Deferral

FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:

yaml

distributed:
  defer_fsdp_grad_sync: true   # default

Mixed Precision

`FSDP2Config` defaults to bfloat16 for all three precision knobs via `MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True)`. Override via the Python API:

python

import torch
from torch.distributed.fsdp import MixedPrecisionPolicy

# FSDP2Config is defined in components/distributed/config.py.
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

Pipeline Parallelism

Requirements

- Model class must define `_pp_plan` (a dict mapping module FQNs to stages).
- `pp_size > 1` in the distributed section.
- A `pipeline` sub-config with schedule and microbatch size.

Supported schedules

Defined in `PipelineConfig.pp_schedule`:

- `1f1b` (one-forward-one-backward, default)
- `gpipe`
- `interleaved_1f1b` / `interleaved1f1b`
- `looped_bfs`
- `dfs`
- `v_schedule`
- `zero_bubble`

Example (8B model on 8 GPUs, PP=2 + DP=4)

yaml

distributed:
  strategy: fsdp2
  pp_size: 2

  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

How it works

`AutoPipeline.build()` calls `pipeline_model()`, which splits the model into stages using the model's `_pp_plan`, creates `PipelineStage` objects, and builds the schedule. During training, `schedule.step()` drives forward and backward through the pipeline.
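The cost of running a schedule with too few microbatches can be estimated with the standard analysis for non-interleaved schedules such as GPipe and 1F1B: with p stages and m microbatches per step, roughly (p - 1) / (m + p - 1) of each step is idle. The helper below is a back-of-the-envelope sketch, not part of NeMo AutoModel; its name is hypothetical and it assumes every stage takes the same time per microbatch.

```python
def bubble_fraction(pp_size: int, num_microbatches: int) -> float:
    """Idle fraction of a non-interleaved pipeline step with equal-cost stages."""
    if pp_size < 1 or num_microbatches < 1:
        raise ValueError("pp_size and num_microbatches must be positive")
    return (pp_size - 1) / (num_microbatches + pp_size - 1)


# More microbatches per step shrink the idle share of each step.
for m in (4, 16, 64):
    print(f"pp_size=4, microbatches={m}: {bubble_fraction(4, m):.1%} idle")
```

For a fixed batch per step, a smaller `pp_microbatch_size` yields more microbatches, which is why smaller microbatches reduce bubble time.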

Context Parallelism

Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.

Config


yaml

distributed:
  strategy: fsdp2
  cp_size: 2   # or 4, 8

Requirements

- SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. `SDPBackend.MATH` is not compatible with DTensor.
- Attention masks are automatically stripped; `is_causal=True` is set via forward pre-hooks registered by `attach_context_parallel_hooks()`.

How it works

- After model sharding, `apply_model_infrastructure()` calls `attach_context_parallel_hooks()` on each model part (for non-TE models).
- At each training step, `make_cp_batch_and_ctx()` creates a CP context manager that shards the batch along the sequence dimension and sets up `context_parallel()` from `torch.distributed.tensor.experimental`.
- For TE attention models, `make_cp_batch_for_te()` uses THD format and TE's `thd_get_partitioned_indices` for sharding.

CP with Sequence Packing

CP works with packed sequences. The `packed_sequence_size` must be divisible by `cp_size`. When using TE, chunks are sharded per-chunk via `_shard_thd_chunk_for_te()`.
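The divisibility rule is easy to check up front. The following sketch uses a hypothetical helper, `cp_tokens_per_rank`, and assumes each CP rank receives an equal share of the packed sequence.

```python
def cp_tokens_per_rank(packed_sequence_size: int, cp_size: int) -> int:
    """Return the tokens each CP rank holds, assuming an even split."""
    if packed_sequence_size % cp_size != 0:
        raise ValueError(
            f"packed_sequence_size={packed_sequence_size} is not divisible by cp_size={cp_size}"
        )
    return packed_sequence_size // cp_size


# A 4096-token pack split across 4 CP ranks.
print(cp_tokens_per_rank(4096, 4))
```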

Sequence Packing

Sequence packing combines multiple sequences into a single training sample for efficiency.

Config

yaml

packed_sequence:
  packed_sequence_size: 4096   # 0 = disabled

step_scheduler:
  local_batch_size: 1          # must be 1 for packed sequences

When `packed_sequence_size > 0`, the dataset collator packs sequences up to that length. `local_batch_size` must be 1 because each "sample" is already a packed batch.
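To make the idea of packing concrete, here is a minimal greedy sketch. It is not the collator NeMo AutoModel uses: `pack_sequences` is a hypothetical function, and the in-order greedy policy is an assumption chosen for simplicity.

```python
def pack_sequences(lengths: list, packed_sequence_size: int) -> list:
    """Group sequence indices in order so each pack holds at most packed_sequence_size tokens."""
    packs = []
    current = []
    used = 0
    for idx, length in enumerate(lengths):
        if length > packed_sequence_size:
            raise ValueError(f"sequence {idx} is longer than packed_sequence_size")
        if used + length > packed_sequence_size:
            # The current pack is full: close it and start a new one.
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs


print(pack_sequences([1000, 2000, 1500, 3000, 500], 4096))  # [[0, 1], [2], [3, 4]]
```

Each inner list becomes one training sample, which is why a local batch size of 1 already carries several sequences.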

MoE Distributed Training

Expert Parallelism

Set `ep_size > 1` to distribute experts across GPUs. This creates a separate `moe_mesh` alongside the main `device_mesh`:

yaml

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

The `moe_mesh` shape is `(pp_size, ep_shard_size, ep_size)` with dimension names `("pp", "ep_shard", "ep")`.

Constraint: `dp_cp_size` (`dp_size * cp_size`) must be divisible by `ep_size`.
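The shape and the constraint can be worked through in a few lines. This is a sketch under a stated assumption: the source gives the mesh shape and the divisibility rule, and the code assumes `ep_shard_size = dp_cp_size // ep_size`. The function name is hypothetical.

```python
def moe_mesh_shape(dp_size: int, cp_size: int, pp_size: int, ep_size: int) -> dict:
    """Return the moe_mesh dimensions keyed by the names ("pp", "ep_shard", "ep")."""
    dp_cp_size = dp_size * cp_size
    if dp_cp_size % ep_size != 0:
        raise ValueError(f"dp_cp_size={dp_cp_size} is not divisible by ep_size={ep_size}")
    # Assumption: the remaining factor of dp_cp_size forms the ep_shard dimension.
    ep_shard_size = dp_cp_size // ep_size
    return {"pp": pp_size, "ep_shard": ep_shard_size, "ep": ep_size}


# 8 GPUs, no TP/PP/CP, ep_size=8: every rank holds a distinct slice of experts.
print(moe_mesh_shape(dp_size=8, cp_size=1, pp_size=1, ep_size=8))
```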

MoE sub-config

yaml

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true

The `moe` sub-section maps to `MoEParallelizerConfig` and is only instantiated when `ep_size > 1`.

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

yaml

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

MegatronFSDP limitations

Despite its name, `megatron_fsdp` does not support expert parallelism (`ep_size > 1`), pipeline parallelism (`pp_size > 1`), or `sequence_parallel`. Use `fsdp2` for these features.
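A pre-flight check can catch unsupported combinations before a job is launched. The sketch below encodes only the restrictions stated in this document; it is not the `MeshContext` validation itself. The function names are hypothetical, and treating a missing `strategy` key as `fsdp2` is an assumption.

```python
# Features each strategy rejects, as stated in this document.
UNSUPPORTED = {
    "fsdp2": set(),
    "megatron_fsdp": {"pp", "ep", "sequence_parallel", "hsdp"},
    "ddp": {"tp", "pp", "cp", "ep", "hsdp"},
}


def requested_features(cfg: dict) -> set:
    """Collect the parallelism features that a distributed config turns on."""
    features = {name for name in ("tp", "pp", "cp", "ep") if cfg.get(f"{name}_size", 1) > 1}
    if cfg.get("sequence_parallel", False):
        features.add("sequence_parallel")
    if cfg.get("dp_replicate_size") is not None:
        features.add("hsdp")
    return features


def unsupported_features(cfg: dict) -> list:
    """Return the requested features that the chosen strategy does not support."""
    strategy = cfg.get("strategy", "fsdp2")  # assumption: fsdp2 when unset
    return sorted(requested_features(cfg) & UNSUPPORTED[strategy])


print(unsupported_features({"strategy": "megatron_fsdp", "pp_size": 2, "sequence_parallel": True}))
# ['pp', 'sequence_parallel']
```

An empty list means the combination is not ruled out by the restrictions listed here; it does not guarantee that the real validation passes.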

Parallelism Sizing Guidelines

Dense models

| Model size | TP | PP | CP | Strategy |
| --- | --- | --- | --- | --- |
| < 3B | 1 | 1 | 1 | FSDP2 (DP only) |
| 3-13B | 2-4 | 1 | 1 | FSDP2 + TP |
| 13-70B | 4-8 | 2-4 | 1 | FSDP2 + TP + PP |
| 70B+ | 8 | 4-8 | 1 | FSDP2 + TP + PP |
| Any + long seq (8K+) | as above | as above | 2-8 | add CP |

MoE models

MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:

| Model | TP | PP | EP | Notes |
| --- | --- | --- | --- | --- |
| Small MoE (<10B total) | 1 | 1 | 8 | EP only |
| Medium MoE (10-30B total) | 1-2 | 1 | 8 | small TP for shared layers |
| Large MoE (100B+ total) | 1-2 | 4+ | 8-64 | PP for depth, EP for experts |

Hardware topology rules

- TP must stay within a single NVLink domain (one node, typically 8 GPUs).
- Use PP or DP for cross-node scaling.
- TP across InfiniBand degrades throughput severely.
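These rules reduce to a simple check on `tp_size`. The sketch below uses a hypothetical helper and assumes one NVLink domain per node, with 8 GPUs per node as the default.

```python
def tp_fits_in_node(tp_size: int, gpus_per_node: int = 8) -> bool:
    """True when a TP group fits inside one node and tiles it evenly."""
    if tp_size < 1 or gpus_per_node < 1:
        raise ValueError("tp_size and gpus_per_node must be positive")
    return tp_size <= gpus_per_node and gpus_per_node % tp_size == 0


for tp in (2, 4, 8, 16):
    print(tp, tp_fits_in_node(tp))
```

A `tp_size` of 16 on 8-GPU nodes fails the check because the TP group would span two nodes.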

Code Anchors

- `components/distributed/config.py`: FSDP2Config, MegatronFSDPConfig, DDPConfig.
- `components/distributed/mesh.py`: MeshContext, strategy map, and mesh sizes.
- `components/distributed/device_mesh.py`: device mesh and `moe_mesh` creation.
- `components/distributed/pipelining/config.py`: PipelineConfig fields.
- `components/moe/config.py`: MoEParallelizerConfig and MoEConfig.
- `recipes/_dist_setup.py`: YAML parsing and distributed setup.

Pitfalls

- TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
- PP requires `_pp_plan` on the model class. Not all HF models have this. Check `validate_hf_model_for_pipeline_support()` before enabling PP.
- PP bubbles reduce GPU utilization. Use interleaved schedules (`interleaved_1f1b`) and smaller microbatches to reduce bubble time.
- FSDP2 requires DTensor-aware state dict saving. Use `safetensors` with `save_consolidated: true` for checkpoint compatibility.
- CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only. `SDPBackend.MATH` is not compatible with DTensor.
- MoE EP size must evenly divide `dp_size * cp_size`. The device mesh creation asserts `dp_cp_size % ep_size == 0`.
- MegatronFSDP is more limited than FSDP2. It does not support PP (`pp_size > 1`), EP (`ep_size > 1`), or `sequence_parallel`. The `MeshContext` validation raises on these combinations.
- DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
- Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
- Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom `MixedPrecisionPolicy`.
- `packed_sequence_size` must be divisible by `cp_size` when using CP with packed sequences.
- `dp_replicate_size` is FSDP2-only. Passing it with `megatron_fsdp` or `ddp` raises a `ValueError`.

Verification

Run the smallest recipe that exercises the requested strategy. Success means exit code 0, finite loss, no NCCL timeout, and log output matching the expected TP/PP/CP/EP sizes.