perf-megatron-fsdp

Megatron FSDP Skill

For stable background and recommendation level, see:

@docs/training/megatron-fsdp.md
@skills/perf-megatron-fsdp/card.yaml

Enablement

Minimal Megatron FSDP override in Bridge:

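```python
# The flag is set on both the dist and ddp config groups.
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
# Shard optimizer state, gradients, and parameters.
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.ddp.average_in_collective = False
# Megatron FSDP only supports the fsdp_dtensor checkpoint format.
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
```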

Example recipe fixup:

```python
cfg = llama3_8b_pretrain_config()
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.ddp.average_in_collective = False
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
cfg.checkpoint.save = "/tmp/fsdp_ckpts"
cfg.checkpoint.load = None
```
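
Because the same overrides recur in every recipe, they could be collected in one helper. The following is a minimal sketch, not part of Bridge: the function name `apply_megatron_fsdp_overrides` and its `save_dir` parameter are hypothetical, and a `SimpleNamespace` stands in for a real recipe config so the snippet runs on its own.

```python
from types import SimpleNamespace


def apply_megatron_fsdp_overrides(cfg, save_dir=None):
    """Apply the Megatron FSDP overrides shown above to a config object.

    Hypothetical helper: only the attribute names come from this document.
    """
    cfg.dist.use_megatron_fsdp = True
    cfg.ddp.use_megatron_fsdp = True
    cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
    cfg.ddp.average_in_collective = False
    cfg.checkpoint.ckpt_format = "fsdp_dtensor"
    if save_dir is not None:
        # Mirror the recipe fixup: save to a fresh directory, load nothing.
        cfg.checkpoint.save = save_dir
        cfg.checkpoint.load = None
    return cfg


# Stand-in for a real recipe config; only the attribute layout matters here.
demo_cfg = SimpleNamespace(
    dist=SimpleNamespace(use_megatron_fsdp=False),
    ddp=SimpleNamespace(
        use_megatron_fsdp=False,
        data_parallel_sharding_strategy=None,
        average_in_collective=True,
    ),
    checkpoint=SimpleNamespace(ckpt_format="torch_dist", save=None, load=None),
)
apply_megatron_fsdp_overrides(demo_cfg, save_dir="/tmp/fsdp_ckpts")
```

With a real recipe, the stand-in would be replaced by the object returned from a recipe function such as `llama3_8b_pretrain_config()`.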

Performance harness note:

```bash
python scripts/performance/launch.py --use_megatron_fsdp true
```

Code Anchors

Bridge config definition:

```python
use_megatron_fsdp: bool = False
"""Use Megatron's Fully Sharded Data Parallel. Cannot be used together with use_torch_fsdp2."""

use_torch_fsdp2: bool = False
"""Use the torch FSDP2 implementation. FSDP2 is not currently working with Pipeline Parallel.
It is still not in a stable release stage, and may therefore contain bugs or other
potential issues."""
```

Bridge validation:

```python
if self.dist.use_megatron_fsdp and self.dist.use_torch_fsdp2:
    raise ValueError(...)
...
assert not self.dist.use_tp_pp_dp_mapping, "use_tp_pp_dp_mapping is not supported with Megatron FSDP"
...
assert self.checkpoint.ckpt_format == "fsdp_dtensor", (
    "Megatron FSDP only supports fsdp_dtensor checkpoint format"
)
```
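
The same three checks can be run as a pre-flight step that reports every conflict at once instead of stopping at the first failure. This is a rough sketch under assumptions: `find_megatron_fsdp_conflicts` is a hypothetical name, the config is a `SimpleNamespace` stand-in, and the last two checks are assumed to apply only when `use_megatron_fsdp` is set (the elided lines in the excerpt may add further conditions).

```python
from types import SimpleNamespace


def find_megatron_fsdp_conflicts(cfg):
    """Return a list of conflict messages; an empty list means no conflict found."""
    problems = []
    if cfg.dist.use_megatron_fsdp and cfg.dist.use_torch_fsdp2:
        problems.append("use_megatron_fsdp and use_torch_fsdp2 are mutually exclusive")
    if cfg.dist.use_megatron_fsdp:
        # Assumption: these checks are only relevant when Megatron FSDP is on.
        if cfg.dist.use_tp_pp_dp_mapping:
            problems.append("use_tp_pp_dp_mapping is not supported with Megatron FSDP")
        if cfg.checkpoint.ckpt_format != "fsdp_dtensor":
            problems.append("Megatron FSDP only supports fsdp_dtensor checkpoint format")
    return problems


# Stand-in config that still carries the torch_dist default.
demo_cfg = SimpleNamespace(
    dist=SimpleNamespace(
        use_megatron_fsdp=True,
        use_torch_fsdp2=False,
        use_tp_pp_dp_mapping=False,
    ),
    checkpoint=SimpleNamespace(ckpt_format="torch_dist"),
)
conflicts = find_megatron_fsdp_conflicts(demo_cfg)
```

Here `conflicts` holds a single message about the checkpoint format, which matches the first pitfall listed in this document.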

Runtime wrapper selection:

```python
if use_megatron_fsdp:
    DP = FullyShardedDataParallel
elif use_torch_fsdp2:
    DP = TorchFullyShardedDataParallel
else:
    DP = DistributedDataParallel
...
DP(
    config=get_model_config(model_chunk),
    ddp_config=ddp_config,
    module=model_chunk,
    ...
    pg_collection=pg_collection,
)
```

Perf harness overrides:

```python
recipe.ddp.use_megatron_fsdp = True
recipe.ddp.data_parallel_sharding_strategy = "optim_grads_params"
recipe.ddp.keep_fp8_transpose_cache = False
recipe.ddp.average_in_collective = False
...
recipe.checkpoint.load = None
```

Pitfalls

- Public recipes often expose `use_megatron_fsdp` but still default to `ckpt_format="torch_dist"`. If save/load is enabled, switch to `fsdp_dtensor`.
- `use_torch_fsdp2` exists, but on the validated branch Bridge still fails before training because `_ddp_wrap` passes `pg_collection`.
- CPU offloading is only valid when `pipeline_model_parallel_size == 1` and activation recomputation is disabled.
- Upstream warns that FSDP and TP/CP may need different `CUDA_DEVICE_MAX_CONNECTIONS` settings on Hopper and earlier.
- Megatron FSDP and FSDP2 are mutually exclusive.
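
Two of these pitfalls reduce to simple predicates. The sketch below is illustrative only: both function names and all parameters are hypothetical and take plain values, since this document does not name the Bridge fields that control CPU offloading or activation recomputation.

```python
def cpu_offloading_is_valid(pipeline_model_parallel_size, activation_recompute_enabled):
    """CPU offloading needs a single pipeline stage and no activation recomputation."""
    return pipeline_model_parallel_size == 1 and not activation_recompute_enabled


def ckpt_format_needs_switch(ckpt_format, save=None, load=None):
    """True when save or load is enabled but the format is not fsdp_dtensor."""
    checkpointing_enabled = save is not None or load is not None
    return checkpointing_enabled and ckpt_format != "fsdp_dtensor"


# A recipe that kept the torch_dist default while enabling saving.
needs_switch = ckpt_format_needs_switch("torch_dist", save="/tmp/fsdp_ckpts")
offload_ok = cpu_offloading_is_valid(
    pipeline_model_parallel_size=2,
    activation_recompute_enabled=False,
)
```

In this example `needs_switch` is `True` and `offload_ok` is `False`, because the pipeline-parallel size is greater than 1.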

Verification

Use the existing 2-GPU functional smoke test:

```bash
CUDA_VISIBLE_DEVICES=0,1 uv run python -m torch.distributed.run --nproc_per_node=2 \
  -m pytest tests/functional_tests/training/test_megatron_fsdp.py::TestMegatronFSDP::test_fsdp_pretrain_basic -v -s
```

Success criteria:

- Pytest reports `1 passed`.
- The log shows a finite loss at the last iteration.
- The run finishes without a checkpoint format assertion.
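
The three criteria can be checked mechanically once the run output has been captured. This is a minimal sketch under assumptions: `smoke_test_succeeded` is a hypothetical helper, the last-iteration loss is passed in as a number because the training log format is not specified here, and the sample output string is made up for illustration.

```python
import math
import re

# Assertion message quoted from the Bridge validation excerpt in this document.
CKPT_ASSERTION = "Megatron FSDP only supports fsdp_dtensor checkpoint format"


def smoke_test_succeeded(pytest_output, last_iteration_loss):
    """Check the three success criteria against captured output and the final loss."""
    # Word boundaries keep "11 passed" from counting as "1 passed".
    reported_pass = re.search(r"\b1 passed\b", pytest_output) is not None
    finite_loss = math.isfinite(last_iteration_loss)
    no_ckpt_assertion = CKPT_ASSERTION not in pytest_output
    return reported_pass and finite_loss and no_ckpt_assertion


# Illustrative values only; a real run would supply its own output and loss.
ok = smoke_test_succeeded("===== 1 passed in 42.00s =====", 2.31)
diverged = smoke_test_succeeded("===== 1 passed in 42.00s =====", float("nan"))
```

Here `ok` is `True`, while `diverged` is `False` because a NaN loss is not finite.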
