huggingface-accelerate
HuggingFace Accelerate - Unified Distributed Training
Quick start
Accelerate reduces distributed training to a 4-line code change.
Installation:
```bash
pip install accelerate
```

Convert a PyTorch script (4 lines added):
```python
import torch
+ from accelerate import Accelerator
+ accelerator = Accelerator()

model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
-   loss.backward()
+   accelerator.backward(loss)
    optimizer.step()
```

Run (single command):
```bash
accelerate launch train.py
```
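Once launched, the same Accelerator object also tells each process where it is running, which is useful for logging and debugging. A minimal sketch using existing Accelerator attributes (the print statements are illustrative):
```python
from accelerate import Accelerator

accelerator = Accelerator()

# Per-process context after `accelerate launch`
print(accelerator.device)          # e.g. cuda:0, cuda:1, ... (one device per process)
print(accelerator.num_processes)   # how many processes were launched
print(accelerator.process_index)   # rank of this process

# Log once instead of once per process
if accelerator.is_main_process:
    print("training started")

# accelerator.print() is shorthand for the same check
accelerator.print("printed only by the main process")
```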
Common workflows
Workflow 1: From single GPU to multi-GPU
Original script (train.py):
```python
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
```
**With Accelerate** (4 lines added, train.py):
```python
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()
```
**Configure** (interactive):
```bash
accelerate config
```

Questions:
- Which machine? (single/multi GPU/TPU/CPU)
- How many machines? (1)
- Mixed precision? (no/fp16/bf16/fp8)
- DeepSpeed? (no/yes)

Launch (works on any setup):
Single GPU:
```bash
accelerate launch train.py
```

Multi-GPU (8 GPUs):
```bash
accelerate launch --multi_gpu --num_processes 8 train.py
```

Multi-node:
```bash
accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 0 \
    --main_process_ip $MASTER_ADDR \
    train.py
```
Workflow 2: Mixed precision training
Enable FP16/BF16:
```python
from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional, done automatically
        loss = model(batch)
    accelerator.backward(loss)
```
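One related detail: with mixed_precision='fp16', gradients are scaled, so clipping them with torch.nn.utils directly would act on scaled values. Accelerator's clip_grad_norm_ unscales first; a short sketch continuing the loop above (the max_norm value is illustrative):
```python
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    # Unscales fp16 gradients before clipping, so the norm is measured in real units
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```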
Workflow 3: DeepSpeed ZeRO integration
Enable DeepSpeed ZeRO-2:
```python
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                      # ZeRO-2
    offload_optimizer_device="none",   # no optimizer offload
    gradient_accumulation_steps=4,
)
accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)

# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```
**Or via config**:
```bash
accelerate config
```

Select: DeepSpeed → ZeRO-2
**deepspeed_config.json**:
```json
{
    "fp16": {"enabled": false},
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8
    }
}
```

Launch:
```bash
accelerate launch --config_file deepspeed_config.json train.py
```
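With ZeRO the optimizer state (and, at stage 3, the parameters) is sharded across ranks, so a plain model.state_dict() no longer returns the full weights. Accelerate's get_state_dict helper gathers a consolidated copy for export; a minimal sketch, with the output filename chosen purely for illustration:
```python
# After training: collect a full, unsharded state dict
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)   # gathers ZeRO-sharded weights

if accelerator.is_main_process:
    accelerator.save(state_dict, "final_model.pt")  # illustrative path
```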
Workflow 4: FSDP (Fully Sharded Data Parallel)
Enable FSDP:
```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",             # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",  # auto-wrap transformer blocks
    cpu_offload=False,
)
accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

Or via config:
```bash
accelerate config
```
Select: FSDP → Full Shard → No CPU Offload
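Saving an FSDP-wrapped model follows the same idea as the ZeRO case above: synchronize, then let Accelerate gather the shards. A minimal sketch, assuming a recent accelerate release that provides Accelerator.save_model; the directory name is illustrative:
```python
# Export full (unsharded) weights at the end of training
accelerator.wait_for_everyone()
accelerator.save_model(model, "fsdp_checkpoint/")  # gathers shards and writes them to the directory
```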
Workflow 5: Gradient accumulation
Accumulate gradients:
```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
```

Effective batch size: batch_size * num_gpus * gradient_accumulation_steps
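For instance, with illustrative numbers (a per-device batch of 32, 8 GPUs, and the accumulation factor of 4 configured above):
```python
batch_size = 32                   # per-device batch size (illustrative)
num_gpus = 8                      # number of processes (illustrative)
gradient_accumulation_steps = 4   # as configured above

effective_batch_size = batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)       # 1024 samples per optimizer step
```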
When to use vs alternatives
Use Accelerate when:
- Want simplest distributed training
- Need single script for any hardware
- Use HuggingFace ecosystem
- Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
- Need quick prototyping
Key advantages:
- 4 lines: Minimal code changes
- Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
- Automatic: Device placement, mixed precision, sharding
- Interactive config: No manual launcher setup
- Single launch: Works everywhere
Use an alternative instead when:
- PyTorch Lightning: Need callbacks, high-level abstractions
- Ray Train: Multi-node orchestration, hyperparameter tuning
- DeepSpeed: Direct API control, advanced features
- Raw DDP: Maximum control, minimal abstraction
Common issues
**Issue: Wrong device placement**

Don't manually move to device:
```python
# WRONG
batch = batch.to('cuda')
```
CORRECT: Accelerate handles it automatically after prepare().
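For tensors created outside the dataloader (masks, targets built on the fly, etc.), use accelerator.device instead of hard-coding 'cuda'; a small sketch (the tensor shape is illustrative):
```python
import torch

# Everything returned by prepare() is already on the right device;
# ad-hoc tensors should target accelerator.device
mask = torch.ones(32, 10, device=accelerator.device)
```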
**Issue: Gradient accumulation not working**

Use the context manager:
```python
# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    loss = model(batch)
    accelerator.backward(loss)
    optimizer.step()
```
**Issue: Checkpointing in distributed training**

Use the accelerator methods:
```python
# Save only on the main process
if accelerator.is_main_process:
    accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')
```
**Issue: Different results with FSDP**

Ensure the same random seed:
```python
from accelerate.utils import set_seed
set_seed(42)
```

Advanced topics
Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.
Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.
Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.
Hardware requirements
- CPU: Works (slow)
- Single GPU: Works
- Multi-GPU: DDP (default), DeepSpeed, or FSDP
- Multi-node: DDP, DeepSpeed, FSDP, Megatron
- TPU: Supported
- Apple MPS: Supported
Launcher requirements:
- DDP: torch.distributed.run (built-in)
- DeepSpeed: deepspeed (pip install deepspeed)
- FSDP: PyTorch 1.12+ (built-in)
- Megatron: Custom setup
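To confirm what a given node actually exposes before launching, a quick check with plain PyTorch plus accelerate (illustrative only) is enough:
```python
import torch
import accelerate

print(accelerate.__version__)             # installed accelerate version
print(torch.cuda.is_available())          # are CUDA GPUs visible?
print(torch.cuda.device_count())          # how many GPUs on this node
print(torch.backends.mps.is_available())  # Apple MPS backend available?
```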
Resources
- Docs: https://huggingface.co/docs/accelerate
- GitHub: https://github.com/huggingface/accelerate
- Version: 1.11.0+
- Tutorial: "Accelerate your scripts"
- Examples: https://github.com/huggingface/accelerate/tree/main/examples
- Used by: HuggingFace Transformers, TRL, PEFT, all HF libraries