nemo-automodel-distributed-training
Distributed Training in NeMo AutoModel
Purpose
NeMo AutoModel uses PyTorch-native distributed training.
All parallelism is orchestrated through a single `MeshContext` object that
holds device meshes, strategy configs, and axis names.
Instructions
For conceptual distributed-training questions, answer directly from the quick
patterns in this skill without inspecting the repository. Start with the
strategy choice, then list only the YAML fields and constraints relevant to the
question.
Use direct action verbs in the final answer: recommend the strategy, show the
minimal YAML, state the sizing constraint, and name the unsupported strategies.
Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing
unless the user asks.
Examples
TP plus PP for a large multi-node model
Recommend `strategy: fsdp2`. Mention `tp_size`, `pp_size`, `cp_size`,
`ep_size`, and the `pipeline` sub-config. State that `dp_size` is inferred from
`world_size / (tp_size * pp_size * cp_size)`.
```yaml
distributed:
  strategy: fsdp2
  tp_size: 8
  pp_size: 4
  cp_size: 1
  ep_size: 1
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 1
```
MoE expert parallelism
Recommend `strategy: fsdp2` with `ep_size > 1`. Say this creates a separate
`moe_mesh`; include the `moe` sub-config when relevant; state that `ep_size`
must divide `dp_size * cp_size`. Do not recommend `megatron_fsdp` or `ddp`.
```yaml
distributed:
  strategy: fsdp2
  ep_size: 8
  moe:
    reshard_after_forward: false
```
MegatronFSDP limitations
Say that MegatronFSDP does not support pipeline parallelism, expert parallelism, or `sequence_parallel`.
Recommend `fsdp2` for PP, EP, or `sequence_parallel`; mention that DDP is only
simple data parallelism.
Strategy Selection
Three strategies are available, selected via the `distributed.strategy` YAML key:
| Strategy | YAML value | Best for |
|---|---|---|
| FSDP2 | `fsdp2` | General use, recommended default. Supports TP, PP, CP, EP, HSDP. |
| MegatronFSDP | `megatron_fsdp` | NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel. |
| DDP | `ddp` | Simple data parallelism only. No TP, PP, CP, or EP. |
Decision tree:
- Single GPU: no distributed config needed (FSDP2Manager skips parallelization when world_size=1).
- Multi-GPU single node: `fsdp2` (default). Use `ddp` only if you need the simplest possible setup.
- Multi-node: `fsdp2` with appropriate TP/PP sizing.
- MoE models with expert parallelism: `fsdp2` with `ep_size > 1` (creates a separate `moe_mesh`).
- Large models (70B+): `fsdp2` with PP + TP.
- Long sequences (8K+): add CP (`cp_size > 1`).
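The decision tree can be sketched as a small helper. This is an illustration only: `recommend_distributed` and its parameters are hypothetical names, not part of NeMo AutoModel, and reading "8K" as 8192 tokens is an assumption.

```python
def recommend_distributed(world_size, params_b=0.0, seq_len=0, expert_parallel=False):
    """Hypothetical helper mirroring the decision tree; not a NeMo AutoModel API."""
    if world_size == 1:
        # Single GPU: no distributed section is needed.
        return {"strategy": None, "notes": ["no distributed config needed"]}
    notes = []
    if expert_parallel:
        notes.append("set ep_size > 1 (creates a separate moe_mesh)")
    if params_b >= 70:
        notes.append("combine PP + TP")
    if seq_len >= 8192:  # assumption: "8K" is read as 8192 tokens
        notes.append("add CP (cp_size > 1)")
    # Every multi-GPU branch of the tree lands on fsdp2.
    return {"strategy": "fsdp2", "notes": notes}


print(recommend_distributed(world_size=64, params_b=70, seq_len=16384))
```

The helper always returns `fsdp2` for multi-GPU runs because every branch of the tree does; `ddp` stays a manual choice for the simplest setups.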
When answering strategy-selection questions, state the chosen `distributed.strategy`
first, then enumerate the YAML fields the user must set.
Quick TP + PP answer:
- Use `strategy: fsdp2`; do not use `megatron_fsdp` when pipeline parallelism is required.
- Set `tp_size` for tensor parallelism and `pp_size` for pipeline parallelism.
- Add a `pipeline:` sub-config with `pp_schedule` and `pp_microbatch_size`.
- Leave `dp_size` unset or `none`; it is inferred as `world_size / (tp_size * pp_size * cp_size)`.
- Keep TP inside a fast intra-node domain when possible, and use PP across model depth for 70B+ models.
Quick MoE expert-parallel answer:
- Start with `strategy: fsdp2` and `ep_size > 1`.
- Include a `moe:` sub-config only when `ep_size > 1`; it maps to `MoEParallelizerConfig`.
- Expect a separate `moe_mesh` for expert parallelism in addition to the main `device_mesh`.
- Do not recommend `megatron_fsdp` or `ddp` for expert parallelism; `megatron_fsdp` has no EP support.
- Before finishing an MoE EP answer, explicitly state that `ep_size` must divide `dp_size * cp_size` and that `megatron_fsdp` does not support EP, PP, or `sequence_parallel`.
YAML Config Structure
The `distributed` section in the recipe YAML maps directly to
`parse_distributed_section()` in `recipes/_dist_setup.py`:
```yaml
distributed:
  strategy: fsdp2              # fsdp2 | megatron_fsdp | ddp
  dp_size: none                # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none      # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1
  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only
  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields
  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields
```
The `dp_size` is always inferred:
`dp_size = world_size / (tp_size * pp_size * cp_size)`
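The inference rule is plain integer arithmetic. A minimal sketch follows; `infer_dp_size` is a hypothetical name, and rejecting a world size that does not divide evenly is an assumption about how an invalid layout would be handled.

```python
def infer_dp_size(world_size, tp_size=1, pp_size=1, cp_size=1):
    """Compute dp_size = world_size / (tp_size * pp_size * cp_size)."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        # Assumption: a non-divisible layout is treated as a configuration error.
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# 64 GPUs with tp_size=8 and pp_size=4 leave 2 data-parallel replicas.
print(infer_dp_size(64, tp_size=8, pp_size=4))  # 2
```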
Infrastructure Flow
```
YAML distributed section
-> parse_distributed_section()  [recipes/_dist_setup.py]
-> setup_distributed()  [recipes/_dist_setup.py]
-> create_device_mesh()  [components/distributed/device_mesh.py]
-> MeshContext(...)  [components/distributed/mesh.py]
-> instantiate_infrastructure()  [_transformers/infrastructure.py]
-> _instantiate_distributed() -> FSDP2Manager / MegatronFSDPManager / DDPManager
-> _instantiate_pipeline() -> AutoPipeline (if pp_size > 1)
-> parallelize_fn -> MoE parallelizer (if ep_size > 1) or PP wrapper
-> apply_model_infrastructure()  [_transformers/infrastructure.py]
-> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)
```
FSDP2 Configuration
Basic FSDP2 (data parallelism only)
```yaml
distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
```
This auto-calculates `dp_size = world_size` and applies `fully_shard()` per
transformer block via DTensor-based sharding.
FSDP2 with Tensor Parallelism
Keep TP within a single NVLink domain (typically one node):
```yaml
distributed:
  strategy: fsdp2
  tp_size: 4  # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true
```
The TP plan is auto-selected based on the model type. Pass a custom plan via
the Python API if needed:
```python
# FSDP2Config is defined in components/distributed/config.py;
# my_custom_plan stands for a user-supplied TP plan.
config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)
```
FSDP2 with Pipeline Parallelism
```yaml
distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b  # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false
```
The model must have a `_pp_plan` attribute (set on the HF model class) for
`AutoPipeline` to know how to split layers across stages. Models without
`_pp_plan` are not compatible with PP.
FSDP2 with HSDP (Hybrid Sharded Data Parallel)
Intra-node full sharding + inter-node replication via a 2D DeviceMesh:
```yaml
distributed:
  strategy: fsdp2
  dp_replicate_size: 2  # must divide dp_size
```
Constraint: `dp_replicate_size < dp_size` (pure replication with no sharding
is not supported by FSDP2).
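The two HSDP constraints (divisibility and `dp_replicate_size < dp_size`) can be checked up front. The sketch below is illustrative: `hsdp_mesh_shape` is a hypothetical helper, and the `(replicate, shard)` ordering of the returned 2D shape is an assumption rather than something the source states.

```python
def hsdp_mesh_shape(dp_size, dp_replicate_size):
    """Return an assumed (replicate, shard) shape for the 2D data-parallel mesh."""
    if dp_size % dp_replicate_size != 0:
        raise ValueError("dp_replicate_size must divide dp_size")
    if dp_replicate_size >= dp_size:
        # Pure replication (no sharding) is not supported by FSDP2.
        raise ValueError("dp_replicate_size must be smaller than dp_size")
    dp_shard_size = dp_size // dp_replicate_size
    return (dp_replicate_size, dp_shard_size)


# Two 8-GPU nodes: shard within each node, replicate across the two nodes.
print(hsdp_mesh_shape(dp_size=16, dp_replicate_size=2))  # (2, 8)
```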
Activation Checkpointing
Trades compute for memory by recomputing activations during backward:
```yaml
distributed:
  activation_checkpointing: true
```
This is forwarded to the strategy config for non-EP models, or read from
`MeshContext.activation_checkpointing` for EP models.
Gradient Sync Deferral
FSDP2 defers gradient sync to the final micro-batch by default for
communication overlap:
```yaml
distributed:
  defer_fsdp_grad_sync: true  # default
```
Mixed Precision
FSDP2Config defaults to bfloat16 for all three precision knobs via
`MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True)`.
Override via the Python API:
```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy

# FSDP2Config is defined in components/distributed/config.py.
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)
```
Pipeline Parallelism
Requirements
- Model class must define `_pp_plan` (a dict mapping module FQNs to stages).
- `pp_size > 1` in the distributed section.
- A `pipeline` sub-config with schedule and microbatch size.
Supported schedules
Defined in `PipelineConfig.pp_schedule`:
- `1f1b` (one-forward-one-backward, default)
- `gpipe`
- `interleaved_1f1b` / `interleaved1f1b`
- `looped_bfs`
- `dfs`
- `v_schedule`
- `zero_bubble`
Example (8B model on 8 GPUs, PP=2 + DP=4)
```yaml
distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false
checkpoint:
  model_save_format: safetensors
  save_consolidated: true
```
How it works
Key identifiers: `AutoPipeline.build()`, `pipeline_model()`, `_pp_plan`, `PipelineStage`, `schedule.step()`.
Context Parallelism
Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension
as DTensors.
Config
```yaml
distributed:
  strategy: fsdp2
  cp_size: 2  # or 4, 8
```
Requirements
- SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. `SDPBackend.MATH` is not compatible with DTensor.
- Attention masks are automatically stripped; `is_causal=True` is set via forward pre-hooks registered by `attach_context_parallel_hooks()`.
How it works
- After model sharding, `apply_model_infrastructure()` calls `attach_context_parallel_hooks()` on each model part (for non-TE models).
- At each training step, `make_cp_batch_and_ctx()` creates a CP context manager that shards the batch along the sequence dimension and sets up `context_parallel()` from `torch.distributed.tensor.experimental`.
- For TE attention models, `make_cp_batch_for_te()` uses THD format and TE's `thd_get_partitioned_indices` for sharding.
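Sharding along the sequence dimension can be pictured with a plain list. The sketch below is a deliberately simplified illustration, not the logic used by `make_cp_batch_and_ctx()` or `context_parallel()`: it assumes contiguous, equal-sized chunks per rank, whereas a real implementation may reorder chunks to balance causal-attention work. `shard_sequence` is a hypothetical name.

```python
def shard_sequence(tokens, cp_size, cp_rank):
    """Return the contiguous slice of `tokens` owned by `cp_rank` (simplified)."""
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    chunk = len(tokens) // cp_size
    start = cp_rank * chunk
    return tokens[start:start + chunk]


tokens = list(range(8))
for rank in range(2):
    # With cp_size=2 each rank holds half of the sequence.
    print(rank, shard_sequence(tokens, cp_size=2, cp_rank=rank))
```

Each rank stores activations for only `1 / cp_size` of the sequence, which is why CP helps with long-sequence memory pressure.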
CP with Sequence Packing
CP works with packed sequences. The `packed_sequence_size` must be divisible
by `cp_size`. When using TE, chunks are sharded per-chunk via
`_shard_thd_chunk_for_te()`.
Sequence Packing
Packing multiple sequences into a single training sample for efficiency.
Config
```yaml
packed_sequence:
  packed_sequence_size: 4096  # 0 = disabled
step_scheduler:
  local_batch_size: 1  # must be 1 for packed sequences
```
When `packed_sequence_size > 0`, the dataset collator packs sequences up to
that length. `local_batch_size` must be 1 because each "sample" is already a
packed batch.
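To make the packing idea concrete, here is a minimal sketch of one possible strategy: a sequential greedy fill. The source does not describe the collator's actual algorithm, so `pack_sequences` and its behavior are assumptions for illustration only.

```python
def pack_sequences(lengths, packed_sequence_size):
    """Group sequence indices into packs whose total length fits the budget."""
    packs, current, used = [], [], 0
    for idx, length in enumerate(lengths):
        if length > packed_sequence_size:
            raise ValueError(f"sequence {idx} exceeds packed_sequence_size")
        if used + length > packed_sequence_size:
            # The current pack is full: start a new one.
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs


# Five sequences packed into samples of at most 4096 tokens.
print(pack_sequences([1000, 2000, 1500, 3000, 500], 4096))  # [[0, 1], [2], [3, 4]]
```

Each inner list becomes one training sample, which is why `local_batch_size` stays at 1.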
MoE Distributed Training
Expert Parallelism
Set `ep_size > 1` to distribute experts across GPUs. This creates a separate
`moe_mesh` alongside the main `device_mesh`:
```yaml
distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true
```
The `moe_mesh` shape is `(pp_size, ep_shard_size, ep_size)` with dimension
names `("pp", "ep_shard", "ep")`.
Constraint: `dp_cp_size` (= `dp_size * cp_size`) must be divisible by
`ep_size`.
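The shape and the divisibility constraint can be worked through numerically. In this sketch `moe_mesh_shape` is a hypothetical helper, and deriving `ep_shard_size` as `dp_cp_size // ep_size` is an assumption inferred from the constraint, not something the source spells out.

```python
def moe_mesh_shape(world_size, ep_size, tp_size=1, pp_size=1, cp_size=1):
    """Return (pp_size, ep_shard_size, ep_size) under the stated assumptions."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp*pp*cp")
    dp_size = world_size // model_parallel
    dp_cp_size = dp_size * cp_size
    if dp_cp_size % ep_size != 0:
        # Mirrors the documented check: dp_cp_size % ep_size == 0.
        raise ValueError("ep_size must divide dp_size * cp_size")
    ep_shard_size = dp_cp_size // ep_size  # assumption, see lead-in
    return (pp_size, ep_shard_size, ep_size)


print(moe_mesh_shape(world_size=8, ep_size=8))   # (1, 1, 8)
print(moe_mesh_shape(world_size=16, ep_size=8))  # (1, 2, 8)
```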
MoE sub-config
```yaml
distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true
  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true
```
The `moe` sub-section maps to `MoEParallelizerConfig` and is only
instantiated when `ep_size > 1`.
Full MoE example (Qwen3-30B-A3B on 8 GPUs)
```yaml
distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true
```
MegatronFSDP limitations
Despite its name, `megatron_fsdp` does not support expert parallelism
(`ep_size > 1`), pipeline parallelism (`pp_size > 1`), or
`sequence_parallel`. Use `fsdp2` for these features.
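The per-strategy limitations stated in this document can be collected into a small pre-flight check. This sketch does not reproduce the `MeshContext` validation: `check_strategy` and the `UNSUPPORTED` table are hypothetical, and raising `ValueError` for every combination is an assumption (the source names that exception type only for `dp_replicate_size`).

```python
# Features each strategy rejects, as stated in this document.
UNSUPPORTED = {
    "fsdp2": set(),
    "megatron_fsdp": {"pp", "ep", "sequence_parallel", "dp_replicate_size"},
    "ddp": {"tp", "pp", "cp", "ep", "dp_replicate_size"},
}


def check_strategy(strategy, tp_size=1, pp_size=1, cp_size=1, ep_size=1,
                   sequence_parallel=False, dp_replicate_size=None):
    """Raise ValueError if the requested features are unsupported by `strategy`."""
    requested = set()
    for name, size in (("tp", tp_size), ("pp", pp_size),
                       ("cp", cp_size), ("ep", ep_size)):
        if size > 1:
            requested.add(name)
    if sequence_parallel:
        requested.add("sequence_parallel")
    if dp_replicate_size is not None:
        requested.add("dp_replicate_size")
    blocked = sorted(requested & UNSUPPORTED[strategy])
    if blocked:
        raise ValueError(f"{strategy} does not support: {', '.join(blocked)}")


check_strategy("fsdp2", tp_size=8, pp_size=4)  # passes silently
try:
    check_strategy("megatron_fsdp", pp_size=2)
except ValueError as err:
    print(err)
```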
Parallelism Sizing Guidelines
Dense models
| Model size | TP | PP | CP | Strategy |
|---|---|---|---|---|
| < 3B | 1 | 1 | 1 | FSDP2 (DP only) |
| 3-13B | 2-4 | 1 | 1 | FSDP2 + TP |
| 13-70B | 4-8 | 2-4 | 1 | FSDP2 + TP + PP |
| 70B+ | 8 | 4-8 | 1 | FSDP2 + TP + PP |
| Any + long seq (8K+) | as above | as above | 2-8 | add CP |
MoE models
MoE models need less TP than dense models of similar total parameter count
because only a fraction of parameters are active per token. EP is the primary
scaling dimension:
| Model | TP | PP | EP | Notes |
|---|---|---|---|---|
| Small MoE (<10B total) | 1 | 1 | 8 | EP only |
| Medium MoE (10-30B total) | 1-2 | 1 | 8 | small TP for shared layers |
| Large MoE (100B+ total) | 1-2 | 4+ | 8-64 | PP for depth, EP for experts |
Hardware topology rules
- TP must stay within a single NVLink domain (one node, typically 8 GPUs).
- Use PP or DP for cross-node scaling.
- TP across InfiniBand degrades throughput severely.
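A quick check for the first rule, as a sketch: `tp_fits_in_node` is a hypothetical helper, and it assumes ranks are laid out contiguously so that a TP group whose size divides the GPUs per node never straddles two nodes.

```python
def tp_fits_in_node(tp_size, gpus_per_node=8):
    """Return True when a TP group can sit inside one node (hypothetical helper)."""
    return 1 <= tp_size <= gpus_per_node and gpus_per_node % tp_size == 0


for tp in (2, 4, 8, 16):
    print(tp, tp_fits_in_node(tp))
```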
Code Anchors
- `components/distributed/config.py`: FSDP2Config, MegatronFSDPConfig, DDPConfig.
- `components/distributed/mesh.py`: MeshContext, strategy map, and mesh sizes.
- `components/distributed/device_mesh.py`: device mesh and `moe_mesh` creation.
- `components/distributed/pipelining/config.py`: PipelineConfig fields.
- `components/moe/config.py`: MoEParallelizerConfig and MoEConfig.
- `recipes/_dist_setup.py`: YAML parsing and distributed setup.
Pitfalls
- TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
- PP requires `_pp_plan` on the model class. Not all HF models have this. Check `validate_hf_model_for_pipeline_support()` before enabling PP.
- PP bubbles reduce GPU utilization. Use interleaved schedules (`interleaved_1f1b`) and smaller microbatches to reduce bubble time.
- FSDP2 requires DTensor-aware state dict saving. Use `safetensors` with `save_consolidated: true` for checkpoint compatibility.
- CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only. `SDPBackend.MATH` is not compatible with DTensor.
- MoE EP size must evenly divide `dp_size * cp_size`. The device mesh creation asserts `dp_cp_size % ep_size == 0`.
- MegatronFSDP is more limited than FSDP2. It does not support PP (`pp_size > 1`), EP (`ep_size > 1`), or `sequence_parallel`. The `MeshContext` validation raises on these combinations.
- DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
- Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
- Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom `MixedPrecisionPolicy`.
- `packed_sequence_size` must be divisible by `cp_size` when using CP with packed sequences.
- `dp_replicate_size` is FSDP2-only. Passing it with `megatron_fsdp` or `ddp` raises a `ValueError`.
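The pipeline-bubble pitfall can be quantified with the usual idealized estimate: with `p` stages and `m` microbatches, roughly `(p - 1) / (m + p - 1)` of each step is idle, and interleaving with `v` virtual stages per device scales the useful work to `v * m`. The sketch below assumes equal per-microbatch compute and zero communication cost; `bubble_fraction` and its parameters are hypothetical, and the numbers are not measurements from NeMo AutoModel.

```python
def bubble_fraction(pp_size, batch_size, microbatch_size, virtual_stages=1):
    """Idealized idle fraction of a pipeline step (textbook estimate)."""
    num_microbatches = batch_size // microbatch_size
    bubble = pp_size - 1
    return bubble / (virtual_stages * num_microbatches + bubble)


# Same batch of 32 samples on 4 stages, varying microbatch size and interleaving.
print(round(bubble_fraction(4, 32, microbatch_size=4), 3))                    # 0.273
print(round(bubble_fraction(4, 32, microbatch_size=1), 3))                    # 0.086
print(round(bubble_fraction(4, 32, microbatch_size=4, virtual_stages=2), 3))  # 0.158
```

Smaller microbatches raise `m` and interleaving raises `v`; both shrink the idle share, which matches the advice in the list.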
Verification
Run the smallest recipe that exercises the requested strategy. Success means
exit code 0, finite loss, no NCCL timeout, and log output matching the expected
TP/PP/CP/EP sizes.