perf-megatron-fsdp
Megatron FSDP Skill
For stable background and recommendation level, see:
- @docs/training/megatron-fsdp.md
- @skills/perf-megatron-fsdp/card.yaml
Enablement
Minimal Megatron FSDP override in Bridge:

```python
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.ddp.average_in_collective = False
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
```

Example recipe fixup:

```python
cfg = llama3_8b_pretrain_config()
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.ddp.average_in_collective = False
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
cfg.checkpoint.save = "/tmp/fsdp_ckpts"
cfg.checkpoint.load = None
```

Performance harness note:

```bash
python scripts/performance/launch.py --use_megatron_fsdp true
```
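
The overrides above are the same five assignments every time, so they can be collected in one helper. The following is a minimal sketch, not part of Bridge: `enable_megatron_fsdp` is a hypothetical name, and the `SimpleNamespace` object only stands in for a real recipe config so the snippet runs without Megatron installed.

```python
from types import SimpleNamespace


def enable_megatron_fsdp(cfg, save_dir=None):
    """Apply the Megatron FSDP overrides to a config-like object (hypothetical helper)."""
    cfg.dist.use_megatron_fsdp = True
    cfg.ddp.use_megatron_fsdp = True
    cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
    cfg.ddp.average_in_collective = False
    cfg.checkpoint.ckpt_format = "fsdp_dtensor"
    if save_dir is not None:
        # Only set a save path when the caller asks for checkpointing.
        cfg.checkpoint.save = save_dir
    return cfg


# Stand-in for a recipe config; a real run would start from a recipe function instead.
cfg = SimpleNamespace(
    dist=SimpleNamespace(),
    ddp=SimpleNamespace(),
    checkpoint=SimpleNamespace(ckpt_format="torch_dist", save=None, load=None),
)
cfg = enable_megatron_fsdp(cfg, save_dir="/tmp/fsdp_ckpts")
```

Keeping the assignments in one place makes it harder to set the `dist` flag and forget the matching `ddp` flag or checkpoint format.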
Code Anchors
Bridge config definition (excerpt starting at source line 148):

```python
use_megatron_fsdp: bool = False
"""Use Megatron's Fully Sharded Data Parallel. Cannot be used together with use_torch_fsdp2."""
use_torch_fsdp2: bool = False
"""Use the torch FSDP2 implementation. FSDP2 is not currently working with Pipeline Parallel.
It is still not in a stable release stage, and may therefore contain bugs or other
potential issues."""
```

Bridge validation (excerpt starting at source line 1533):

```python
if self.dist.use_megatron_fsdp and self.dist.use_torch_fsdp2:
    raise ValueError(...)
...
assert not self.dist.use_tp_pp_dp_mapping, "use_tp_pp_dp_mapping is not supported with Megatron FSDP"
...
assert self.checkpoint.ckpt_format == "fsdp_dtensor", (
    "Megatron FSDP only supports fsdp_dtensor checkpoint format"
)
```

Runtime wrapper selection (excerpt starting at source line 217):

```python
if use_megatron_fsdp:
    DP = FullyShardedDataParallel
elif use_torch_fsdp2:
    DP = TorchFullyShardedDataParallel
else:
    DP = DistributedDataParallel
...
DP(
    config=get_model_config(model_chunk),
    ddp_config=ddp_config,
    module=model_chunk,
    ...
    pg_collection=pg_collection,
)
```

Perf harness overrides (excerpt starting at source line 74):

```python
recipe.ddp.use_megatron_fsdp = True
recipe.ddp.data_parallel_sharding_strategy = "optim_grads_params"
recipe.ddp.keep_fp8_transpose_cache = False
recipe.ddp.average_in_collective = False
...
recipe.checkpoint.load = None
```
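
To see how the validation excerpt behaves without loading Bridge, the checks can be mirrored on small stand-in objects. This is a sketch under assumptions: `DistStub`, `CheckpointStub`, and `validate_fsdp` are hypothetical names, and the sketch assumes the two assertions only apply when `use_megatron_fsdp` is set, which the elided lines in the excerpt do not confirm.

```python
from dataclasses import dataclass


@dataclass
class DistStub:
    """Stand-in for the distributed config fields named in the excerpt."""
    use_megatron_fsdp: bool = False
    use_torch_fsdp2: bool = False
    use_tp_pp_dp_mapping: bool = False


@dataclass
class CheckpointStub:
    """Stand-in for the checkpoint config."""
    ckpt_format: str = "torch_dist"


def validate_fsdp(dist: DistStub, checkpoint: CheckpointStub) -> None:
    """Mirror the three checks from the validation excerpt (hypothetical helper)."""
    if dist.use_megatron_fsdp and dist.use_torch_fsdp2:
        raise ValueError("use_megatron_fsdp and use_torch_fsdp2 cannot both be set")
    if dist.use_megatron_fsdp:  # assumed guard; elided in the excerpt
        assert not dist.use_tp_pp_dp_mapping, (
            "use_tp_pp_dp_mapping is not supported with Megatron FSDP"
        )
        assert checkpoint.ckpt_format == "fsdp_dtensor", (
            "Megatron FSDP only supports fsdp_dtensor checkpoint format"
        )


# A valid combination passes silently.
validate_fsdp(DistStub(use_megatron_fsdp=True), CheckpointStub(ckpt_format="fsdp_dtensor"))

# Leaving the recipe default checkpoint format in place trips the assertion.
try:
    validate_fsdp(DistStub(use_megatron_fsdp=True), CheckpointStub())
    format_error = None
except AssertionError as exc:
    format_error = str(exc)
```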
Pitfalls
- Public recipes often expose `use_megatron_fsdp` but still default to `ckpt_format="torch_dist"`. If save/load is enabled, switch to `fsdp_dtensor`.
- `use_torch_fsdp2` exists, but on the validated branch Bridge still fails before training because `_ddp_wrap` passes `pg_collection`.
- CPU offloading is only valid when `pipeline_model_parallel_size == 1` and activation recomputation is disabled.
- Upstream warns that FSDP and TP/CP can want different `CUDA_DEVICE_MAX_CONNECTIONS` settings on Hopper and earlier.
- Megatron FSDP and FSDP2 are mutually exclusive.
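
Two of the pitfalls above are plain conditions on settings, so they can be checked before launching a job. The sketch below is illustrative only: `pitfall_warnings` and all of its parameter names are hypothetical and do not correspond to confirmed Bridge config fields.

```python
def pitfall_warnings(
    *,
    use_megatron_fsdp: bool,
    ckpt_format: str,
    save_or_load_enabled: bool,
    cpu_offload: bool,
    pipeline_model_parallel_size: int,
    activation_recompute: bool,
) -> list[str]:
    """Return human-readable warnings for the checkpoint-format and CPU-offload pitfalls."""
    warnings = []
    # Recipes may still carry the torch_dist default after FSDP is switched on.
    if use_megatron_fsdp and save_or_load_enabled and ckpt_format != "fsdp_dtensor":
        warnings.append(f'ckpt_format is "{ckpt_format}"; switch to "fsdp_dtensor"')
    # CPU offloading needs a single pipeline stage and no activation recomputation.
    if cpu_offload and (pipeline_model_parallel_size != 1 or activation_recompute):
        warnings.append(
            "CPU offloading requires pipeline_model_parallel_size == 1 "
            "and activation recomputation disabled"
        )
    return warnings


issues = pitfall_warnings(
    use_megatron_fsdp=True,
    ckpt_format="torch_dist",
    save_or_load_enabled=True,
    cpu_offload=True,
    pipeline_model_parallel_size=2,
    activation_recompute=False,
)
for message in issues:
    print(message)
```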
Verification
Use the existing 2-GPU functional smoke test:

```bash
CUDA_VISIBLE_DEVICES=0,1 uv run python -m torch.distributed.run --nproc_per_node=2 \
  -m pytest tests/functional_tests/training/test_megatron_fsdp.py::TestMegatronFSDP::test_fsdp_pretrain_basic -v -s
```

Success criteria:
- Pytest reports `1 passed`
- The log shows finite loss at the last iteration
- The run finishes without a checkpoint format assertion
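
The three success criteria can be checked mechanically once the run output has been captured. This is a rough sketch with assumptions stated up front: `smoke_test_passed` is a hypothetical helper, the caller is expected to extract the per-iteration loss values itself (no log format is assumed here), and the assertion text is the message quoted in the Bridge validation excerpt.

```python
import math
import re

# Message taken from the Bridge validation assertion.
CKPT_ASSERT_TEXT = "Megatron FSDP only supports fsdp_dtensor checkpoint format"


def smoke_test_passed(output: str, losses: list[float]) -> bool:
    """Check the three success criteria against captured output and loss values."""
    # "1 passed", but not "11 passed" or similar.
    reports_one_passed = re.search(r"(?<!\d)1 passed", output) is not None
    # The last recorded loss must exist and be a finite number.
    last_loss_finite = bool(losses) and math.isfinite(losses[-1])
    # The checkpoint format assertion must not appear anywhere in the output.
    no_ckpt_assertion = CKPT_ASSERT_TEXT not in output
    return reports_one_passed and last_loss_finite and no_ckpt_assertion


ok = smoke_test_passed("==== 1 passed in 95.10s ====", [10.8, 10.5, 10.1])
bad = smoke_test_passed("==== 1 passed in 95.10s ====", [10.8, float("nan")])
```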