Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
npx skill4agent add kiterlin/intelligent-detection-system pytorch-fsdp2

FSDP2 in PyTorch is exposed primarily via torch.distributed.fsdp.fully_shard and the FSDPModule methods it adds in-place to modules. See: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.
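Because fully_shard changes modules in place, one common arrangement is to shard each repeated block first and the root module last. The following is a minimal sketch of that traversal, not a definitive implementation. The helper name apply_fsdp2 and the injectable shard_fn parameter are hypothetical (shard_fn exists only so the call order can be checked without GPUs), and the import path is assumed to be available in the installed PyTorch version.

```python
from typing import Any, Callable, Optional


def apply_fsdp2(
    model: Any,
    block_cls: type,
    shard_fn: Optional[Callable[..., Any]] = None,
    **fsdp_kwargs: Any,
) -> Any:
    """Shard every `block_cls` submodule, then the root module, in place."""
    if shard_fn is None:
        # Lazy import so this file loads without PyTorch installed.
        # Assumption: this public path exists in your PyTorch version.
        from torch.distributed.fsdp import fully_shard
        shard_fn = fully_shard

    # modules() yields parents before children; reversing the list visits
    # children first, so nested blocks are sharded before their parents.
    for module in reversed(list(model.modules())):
        if module is not model and isinstance(module, block_cls):
            shard_fn(module, **fsdp_kwargs)

    # The root call comes last and picks up any remaining parameters.
    shard_fn(model, **fsdp_kwargs)
    return model
```

In a real run this would be called after the process group is initialized, for example apply_fsdp2(model, TransformerBlock, reshard_after_forward=True), where TransformerBlock stands for whatever block class the model actually uses.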
references/pytorch_ddp_notes.md; references/pytorch_fsdp1_api.md
torchrun; LOCAL_RANK; fully_shard(); model(input); model.forward(input); unshard(); fully_shard; torch.save(model.state_dict())
torchrun --nproc_per_node <gpus_per_node> ...; RANK; WORLD_SIZE; LOCAL_RANK; references/pytorch_fsdp2_tutorial.md; references/pytorch_fully_shard_api.md
dist.init_process_group(backend="nccl"); torch.cuda.set_device(int(os.environ["LOCAL_RANK"])); DeviceMesh; references/pytorch_device_mesh_tutorial.md
meta; with torch.device("meta"): model = ...; fully_shard(...); fully_shard(model); model.to_empty(device="cuda"); model.reset_parameters(); references/pytorch_fsdp2_tutorial.md
fully_shard(); fully_shard; if isinstance(m, TransformerBlock): fully_shard(m, ...); fully_shard(model, ...); fully_shard; references/pytorch_fully_shard_api.md
reshard_after_forward; None; True; False; True; False; int; references/pytorch_fully_shard_api.md
mp_policy=MixedPrecisionPolicy(param_dtype=..., reduce_dtype=..., output_dtype=..., cast_forward_inputs=...); offload_policy=CPUOffloadPolicy(); reduce_dtype; references/pytorch_fully_shard_api.md
set_requires_gradient_sync; no_sync(); references/pytorch_fsdp2_tutorial.md
get_model_state_dict; set_model_state_dict; StateDictOptions(full_state_dict=True, cpu_offload=True, broadcast_from_rank0=True, ...); get_optimizer_state_dict; set_optimizer_state_dict; torch.save; DTensor.full_tensor(); references/pytorch_dcp_overview.md; references/pytorch_dcp_recipe.md; references/pytorch_dcp_async_recipe.md
references/pytorch_fsdp2_tutorial.md; references/pytorch_examples_fsdp2.md
torchrun; LOCAL_RANK; DeviceMesh; meta; fully_shard; fully_shard(model); model(inputs); set_requires_gradient_sync; torch.distributed.checkpoint; references/pytorch_fsdp2_tutorial.md; references/pytorch_fully_shard_api.md; references/pytorch_device_mesh_tutorial.md; references/pytorch_dcp_recipe.md
Stateful; get_state_dict; dcp.save(...); dcp.load(...); set_state_dict; references/pytorch_dcp_recipe.md
torch.cuda.set_device(LOCAL_RANK); torchrun; forward(); model(input); unshard(); fully_shard(); torch.save; model(inputs); unshard(); model.forward(...); fully_shard; fully_shard; reshard_after_forward=True; set_requires_gradient_sync; no_sync(); references/pytorch_fully_shard_api.md; references/pytorch_fsdp2_tutorial.md
init_distributed(); build_model_meta(); fully_shard; build_optimizer(); train_step(); model(inputs); checkpoint_save/load(); references/pytorch_examples_fsdp2.md
references/pytorch_fsdp2_tutorial.md
references/pytorch_fully_shard_api.md
references/pytorch_ddp_notes.md
references/pytorch_fsdp1_api.md
references/pytorch_device_mesh_tutorial.md
references/pytorch_tp_tutorial.md
references/pytorch_dcp_overview.md
references/pytorch_dcp_recipe.md
references/pytorch_dcp_async_recipe.md
references/pytorch_examples_fsdp2.md
references/torchtitan_fsdp_notes.md
references/ray_train_fsdp2_example.md
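The launch identifiers named above (torchrun, RANK, WORLD_SIZE, LOCAL_RANK, dist.init_process_group(backend="nccl"), torch.cuda.set_device) fit together roughly as in the sketch below. It assumes the script is started with torchrun --nproc_per_node <gpus_per_node> ... and that one GPU is assigned per process. TorchrunEnv and read_torchrun_env are hypothetical names introduced for illustration. init_distributed() is named in the text but its body is not shown there, so the body here is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TorchrunEnv:
    rank: int
    world_size: int
    local_rank: int


def read_torchrun_env(env: Optional[Mapping[str, str]] = None) -> TorchrunEnv:
    """Parse the variables torchrun exports for each worker process."""
    env = os.environ if env is None else env
    missing = [k for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if k not in env]
    if missing:
        raise RuntimeError(f"not launched via torchrun? missing: {missing}")
    return TorchrunEnv(
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        local_rank=int(env["LOCAL_RANK"]),
    )


def init_distributed() -> TorchrunEnv:
    """Bind this process to its GPU, then join the NCCL process group."""
    # Lazy imports: PyTorch is only needed on the actual training nodes.
    import torch
    import torch.distributed as dist

    info = read_torchrun_env()
    torch.cuda.set_device(info.local_rank)
    dist.init_process_group(backend="nccl")
    return info
```

Failing early on missing variables is a design choice of this sketch: it turns an accidental plain python launch into a clear error rather than a hang inside collective setup.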
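For gradient accumulation, the text pairs set_requires_gradient_sync with no_sync(). A hedged sketch of how the FSDP2 method might be used in a training step follows. It assumes model is the root module already wrapped by fully_shard, so it carries set_requires_gradient_sync. The function name accumulate_and_step and the (inputs, targets) batch layout are illustrative assumptions, not part of any library.

```python
from typing import Any, Callable, Iterable, Tuple


def accumulate_and_step(
    model: Any,
    optimizer: Any,
    loss_fn: Callable[[Any, Any], Any],
    micro_batches: Iterable[Tuple[Any, Any]],
) -> None:
    """Run several micro-batches, reducing gradients only on the last one."""
    batches = list(micro_batches)
    if not batches:
        return

    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches):
        is_last = i == len(batches) - 1
        # Skip gradient reduction on all but the final micro-batch.
        model.set_requires_gradient_sync(is_last)
        # Call the module itself, not model.forward(...), so the hooks
        # installed by fully_shard run around the forward pass.
        loss = loss_fn(model(inputs), targets) / len(batches)
        loss.backward()
    optimizer.step()
```

The flag is left at True after the final micro-batch, so a later ordinary step on the same model still synchronizes gradients.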