torch-pipeline-parallelism
Torch Pipeline Parallelism
Overview
This skill provides guidance for implementing pipeline parallelism in PyTorch for distributed model training. Pipeline parallelism partitions a model across multiple devices/ranks, where each rank processes a subset of layers and communicates activations/gradients with neighboring ranks.
Key Concepts
Pipeline Parallelism Patterns
- AFAB (All-Forward-All-Backward): Process all microbatch forwards first, cache activations, then process all backwards. This is the most common pattern for pipeline parallelism.
- 1F1B (One-Forward-One-Backward): Interleave forward and backward passes for better memory efficiency but more complex scheduling.
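The AFAB ordering can be sketched as a pure scheduling function, with no communication involved; the function name and the op labels here are made up for illustration:

```python
# Sketch: an AFAB schedule is simply all forwards, then all backwards,
# over the microbatches (illustrative labels, not a real runtime).
def afab_schedule(num_microbatches):
    forwards = [("fwd", mb) for mb in range(num_microbatches)]
    # Backwards consume cached activations LIFO, so they run in
    # reverse microbatch order.
    backwards = [("bwd", mb) for mb in reversed(range(num_microbatches))]
    return forwards + backwards

print(afab_schedule(3))
# [('fwd', 0), ('fwd', 1), ('fwd', 2), ('bwd', 2), ('bwd', 1), ('bwd', 0)]
```

A 1F1B schedule would instead interleave the two lists after a short warm-up phase, which is why its bookkeeping is more complex.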
Critical Components
- Model Partitioning: Divide model layers across ranks
- Activation Communication: Send/receive hidden states between ranks
- Gradient Communication: Send/receive gradients during backward pass
- Activation Caching: Store activations for backward pass computation
Implementation Approach
Step 1: Understand Model Architecture First
Before implementing, thoroughly understand the model being parallelized:
- Identify the layer structure (e.g., `model.model.layers` for LLaMA)
- Understand embedding layers (input embeddings, position embeddings)
- Identify the output head (e.g., `lm_head` for language models)
- Note any shared parameters or tied weights
Step 2: Plan Tensor Shape Handling
Critical distinction between ranks:
- Rank 0 (first stage): Receives integer token IDs with shape `[batch, seq_len]`
- Intermediate ranks: Receive hidden states with shape `[batch, seq_len, hidden_size]`
- Final rank: Must apply the output head and compute the loss
Create explicit shape handling logic for each case rather than assuming uniform input types.
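A minimal sketch of such per-rank shape logic (the helper name and arguments are illustrative):

```python
# Sketch: pick the input shape this rank should expect.
# Rank 0 reads integer token IDs from the dataloader (dtype long);
# all later ranks receive float hidden states from the previous rank.
def stage_input_shape(rank, batch, seq_len, hidden_size):
    if rank == 0:
        return (batch, seq_len)               # token IDs
    return (batch, seq_len, hidden_size)      # hidden states

assert stage_input_shape(0, 4, 128, 512) == (4, 128)
assert stage_input_shape(2, 4, 128, 512) == (4, 128, 512)
```

Note that the dtype differs as well (integer IDs vs. floating-point activations), so recv buffers must be allocated per case, not shared.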
Step 3: Design Communication Strategy
Use `torch.distributed.P2POp` for batched send/receive operations:

```python
# Preferred: batched P2P operations
ops = []
if rank > 0:
    ops.append(dist.P2POp(dist.irecv, recv_tensor, rank - 1))
if rank < world_size - 1:
    ops.append(dist.P2POp(dist.isend, send_tensor, rank + 1))
reqs = dist.batch_isend_irecv(ops)
for req in reqs:
    req.wait()
```

Avoid bare `dist.send`/`dist.recv`, as they are blocking and less efficient.

Step 4: Handle Gradient Flow Correctly
Critical pitfall: Using `tensor.detach().requires_grad_(True)` severs the computational graph.

Correct approach for maintaining gradient flow:

```python
# For caching inputs that need gradients
input_cache = []

# During forward: cache the input tensor directly (not detached)
stage_input = received_tensor.requires_grad_(True)
input_cache.append(stage_input)

# During backward: use the cached tensor to compute gradients.
# The gradient flows through the original tensor.
```
Verify gradient connectivity with small test cases before full implementation.
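A minimal connectivity check of that kind, runnable on CPU with no distributed setup, might look like this (the tensor here merely stands in for a received activation):

```python
import torch

# A freshly received tensor is a leaf, so requires_grad_ works on it
# and backward() should populate its .grad.
received = torch.randn(2, 3)            # stand-in for a recv'd activation
stage_input = received.requires_grad_(True)

out = (stage_input * 2).sum()
out.backward()

assert stage_input.grad is not None
assert torch.all(stage_input.grad == 2)   # d(2x)/dx == 2 everywhere
```

If the same check is run with `received.detach()` inserted before a later graph node, the gradient stops there, which is exactly the pitfall described above.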
Step 5: Implement Shape Communication
Communicate tensor shapes before data when shapes vary:
```python
# Efficient: a single tensor for the shape
shape_tensor = torch.tensor(list(tensor.shape), dtype=torch.long, device=device)
dist.send(shape_tensor, dst_rank)

# Then send the actual data
dist.send(tensor.contiguous(), dst_rank)
```

Avoid sending each dimension as a separate tensor.

Verification Strategies
1. Gradient Flow Verification
Create a minimal test to verify gradients flow correctly:
```python
def test_gradient_flow():
    # Create a simple model partition
    # Run forward/backward
    # Check that model.parameters() have non-None gradients
    for name, param in model.named_parameters():
        assert param.grad is not None, f"No gradient for {name}"
        assert not torch.all(param.grad == 0), f"Zero gradient for {name}"
```

2. Activation Shape Verification
Log shapes at each stage boundary:
```python
# Before send
print(f"Rank {rank} sending shape: {tensor.shape}")

# After receive
print(f"Rank {rank} received shape: {tensor.shape}")
```

3. Single-Rank Testing
Test with `world_size=1` to ensure the implementation handles the degenerate case:
- No communication should occur
- The model should function as standard single-device training
- All gradients should flow correctly
4. End-to-End Loss Comparison
Compare loss values between:
- Pipeline parallel implementation
- Standard single-device training (ground truth)
Values should match within numerical precision.
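One way to express "match within numerical precision", assuming both loss values are available on one process, is a tolerance check via `torch.allclose`; the values and tolerances below are placeholders:

```python
import torch

# Placeholder losses: baseline from single-device training,
# pipeline loss gathered from the final rank.
baseline_loss = torch.tensor(2.3071)
pipeline_loss = torch.tensor(2.3071005)

# Loose relative tolerance: communication and reduction order
# introduce small floating-point differences.
assert torch.allclose(pipeline_loss, baseline_loss, rtol=1e-4, atol=1e-5)
```

Pick tolerances to match the dtype in use; fp16/bf16 training warrants looser bounds than fp32.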
Common Pitfalls
1. Truncated Code Edits
When making large code changes:
- Prefer smaller, targeted edits over large rewrites
- Verify edit completeness by reading the file after each edit
- Use full file writes for major structural changes
2. Detach Breaking Gradient Flow
```python
# WRONG: severs the computational graph
cached = tensor.detach().requires_grad_(True)

# RIGHT: maintains graph connection for backward
cached = tensor.clone().requires_grad_(True)

# Or: simply keep a reference to the original tensor
```

3. Missing lm_head in Partitions
The output head (`lm_head`) is often separate from the layer list. Ensure:
- It's included in the final rank's computation
- Its parameters receive gradients
- It's not duplicated across ranks
4. Position Embeddings Handling
Position embeddings (especially rotary embeddings) require care:
- They may need explicit computation before the first layer
- The API varies between model implementations
- Test with the specific model architecture being used
5. Empty Partitions
When `world_size > num_layers`, some ranks may have no layers:
- Add explicit handling for empty partitions
- These ranks still need to forward activations
- Avoid division by zero in layer assignment
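A contiguous layer assignment that naturally yields empty partitions for surplus ranks could be sketched like this (the helper name is illustrative):

```python
# Sketch: contiguous layer assignment that tolerates world_size > num_layers.
# Ranks beyond the layer count get an empty slice and act as relay-only stages.
def assign_layers(num_layers, world_size, rank):
    base, extra = divmod(num_layers, world_size)  # world_size >= 1, no div-by-zero
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return list(range(start, end))

assert assign_layers(2, 4, 0) == [0]
assert assign_layers(2, 4, 1) == [1]
assert assign_layers(2, 4, 2) == []   # empty partition: relay-only rank
```

Because `divmod` spreads the remainder over the first `extra` ranks, the split stays balanced (e.g., 10 layers over 4 ranks gives 3/3/2/2) while the empty case falls out for free.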
6. Variable Sequence Lengths
If microbatches have different sequence lengths:
- Communicate shapes before data
- Don't cache and reuse shapes across microbatches
- Consider padding strategies for efficiency
Code Organization
Structure the implementation clearly:
```
pipeline_parallel.py
├── partition_model()      # Divide layers across ranks
├── get_stage_layers()     # Get this rank's layer subset
├── forward_stage()        # Single stage forward pass
├── backward_stage()       # Single stage backward pass
├── send_activation()      # Send tensor to next rank
├── recv_activation()      # Receive tensor from prev rank
├── send_gradient()        # Send gradient to prev rank
├── recv_gradient()        # Receive gradient from next rank
└── train_step_pipeline()  # Main training step orchestrator
```

Testing Checklist
Before considering implementation complete:
- Gradient flow verified for all model parameters
- Shapes correct at each stage boundary
- world_size=1 case works correctly
- Loss matches non-parallel baseline
- No communication deadlocks
- Memory usage scales appropriately with world_size
- Position embeddings handled correctly for the specific model
- Output head (lm_head) included and receives gradients