torch-pipeline-parallelism
Torch Pipeline Parallelism
Overview
This skill provides guidance for implementing pipeline parallelism in PyTorch for distributed model training. Pipeline parallelism partitions a model across multiple devices/ranks, where each rank processes a subset of layers and communicates activations/gradients with neighboring ranks.
Key Concepts
Pipeline Parallelism Patterns
- AFAB (All-Forward-All-Backward): Process all microbatch forwards first, cache activations, then process all backwards. This is the most common pattern for pipeline parallelism.
- 1F1B (One-Forward-One-Backward): Interleave forward and backward passes for better memory efficiency but more complex scheduling.
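The AFAB ordering can be sketched as a pure scheduling function, with no communication involved; the function name and the op labels here are made up for illustration:

```python
# Sketch: an AFAB schedule is simply all forwards, then all backwards,
# over the microbatches (illustrative labels, not a real runtime).
def afab_schedule(num_microbatches):
    forwards = [("fwd", mb) for mb in range(num_microbatches)]
    # Backwards consume cached activations LIFO, so they run in
    # reverse microbatch order.
    backwards = [("bwd", mb) for mb in reversed(range(num_microbatches))]
    return forwards + backwards

print(afab_schedule(3))
# [('fwd', 0), ('fwd', 1), ('fwd', 2), ('bwd', 2), ('bwd', 1), ('bwd', 0)]
```

A 1F1B schedule would instead interleave the two lists after a short warm-up phase, which is why its bookkeeping is more complex.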
Critical Components
- Model Partitioning: Divide model layers across ranks
- Activation Communication: Send/receive hidden states between ranks
- Gradient Communication: Send/receive gradients during backward pass
- Activation Caching: Store activations for backward pass computation
Implementation Approach
Step 1: Understand Model Architecture First
Before implementing, thoroughly understand the model being parallelized:
- Identify the layer structure (e.g., `model.model.layers` for LLaMA)
- Understand embedding layers (input embeddings, position embeddings)
- Identify the output head (e.g., `lm_head` for language models)
- Note any shared parameters or tied weights
Step 2: Plan Tensor Shape Handling
Critical distinction between ranks:
- Rank 0 (first stage): Receives integer token IDs with shape `[batch, seq_len]`
- Intermediate ranks: Receive hidden states with shape `[batch, seq_len, hidden_size]`
- Final rank: Must apply the output head and compute the loss
Create explicit shape handling logic for each case rather than assuming uniform input types.
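A minimal sketch of such per-rank shape logic (the helper name and arguments are illustrative):

```python
# Sketch: pick the input shape this rank should expect.
# Rank 0 reads integer token IDs from the dataloader (dtype long);
# all later ranks receive float hidden states from the previous rank.
def stage_input_shape(rank, batch, seq_len, hidden_size):
    if rank == 0:
        return (batch, seq_len)               # token IDs
    return (batch, seq_len, hidden_size)      # hidden states

assert stage_input_shape(0, 4, 128, 512) == (4, 128)
assert stage_input_shape(2, 4, 128, 512) == (4, 128, 512)
```

Note that the dtype differs as well (integer IDs vs. floating-point activations), so recv buffers must be allocated per case, not shared.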
Step 3: Design Communication Strategy
Use `torch.distributed.P2POp` for batched send/receive operations:

```python
# Preferred: batched P2P operations
ops = []
if rank > 0:
    ops.append(dist.P2POp(dist.irecv, recv_tensor, rank - 1))
if rank < world_size - 1:
    ops.append(dist.P2POp(dist.isend, send_tensor, rank + 1))
reqs = dist.batch_isend_irecv(ops)
for req in reqs:
    req.wait()
```

Avoid bare `dist.send`/`dist.recv`, as they are blocking and less efficient.

Step 4: Handle Gradient Flow Correctly
Critical pitfall: Using `tensor.detach().requires_grad_(True)` severs the computational graph.

Correct approach for maintaining gradient flow:

```python
# For caching inputs that need gradients
input_cache = []

# During forward: cache the input tensor directly (not detached)
stage_input = received_tensor.requires_grad_(True)
input_cache.append(stage_input)

# During backward: use the cached tensor to compute gradients.
# The gradient flows through the original tensor.
```
Verify gradient connectivity with small test cases before full implementation.
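A minimal connectivity check of that kind, runnable on CPU with no distributed setup, might look like this (the tensor here merely stands in for a received activation):

```python
import torch

# A freshly received tensor is a leaf, so requires_grad_ works on it
# and backward() should populate its .grad.
received = torch.randn(2, 3)            # stand-in for a recv'd activation
stage_input = received.requires_grad_(True)

out = (stage_input * 2).sum()
out.backward()

assert stage_input.grad is not None
assert torch.all(stage_input.grad == 2)   # d(2x)/dx == 2 everywhere
```

If the same check is run with `received.detach()` inserted before a later graph node, the gradient stops there, which is exactly the pitfall described above.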
Step 5: Implement Shape Communication
Communicate tensor shapes before data when shapes vary:
```python
# Efficient: a single tensor for the shape
shape_tensor = torch.tensor(list(tensor.shape), dtype=torch.long, device=device)
dist.send(shape_tensor, dst_rank)

# Then send the actual data
dist.send(tensor.contiguous(), dst_rank)
```

Avoid sending each dimension as a separate tensor.

Verification Strategies
1. Gradient Flow Verification
Create a minimal test to verify gradients flow correctly:
```python
def test_gradient_flow():
    # Create a simple model partition
    # Run forward/backward
    # Check that model.parameters() have non-None gradients
    for name, param in model.named_parameters():
        assert param.grad is not None, f"No gradient for {name}"
        assert not torch.all(param.grad == 0), f"Zero gradient for {name}"
```

2. Activation Shape Verification
Log shapes at each stage boundary:
```python
# Before send
print(f"Rank {rank} sending shape: {tensor.shape}")

# After receive
print(f"Rank {rank} received shape: {tensor.shape}")
```

3. Single-Rank Testing
Test with `world_size=1` to ensure the implementation handles the degenerate case:
- No communication should occur
- The model should function as standard single-device training
- All gradients should flow correctly
4. End-to-End Loss Comparison
Compare loss values between:
- Pipeline parallel implementation
- Standard single-device training (ground truth)
Values should match within numerical precision.
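One way to express "match within numerical precision", assuming both loss values are available on one process, is a tolerance check via `torch.allclose`; the values and tolerances below are placeholders:

```python
import torch

# Placeholder losses: baseline from single-device training,
# pipeline loss gathered from the final rank.
baseline_loss = torch.tensor(2.3071)
pipeline_loss = torch.tensor(2.3071005)

# Loose relative tolerance: communication and reduction order
# introduce small floating-point differences.
assert torch.allclose(pipeline_loss, baseline_loss, rtol=1e-4, atol=1e-5)
```

Pick tolerances to match the dtype in use; fp16/bf16 training warrants looser bounds than fp32.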
Common Pitfalls
1. Truncated Code Edits
When making large code changes:
- Prefer smaller, targeted edits over large rewrites
- Verify edit completeness by reading the file after each edit
- Use full file writes for major structural changes
2. Detach Breaking Gradient Flow
```python
# WRONG: severs the computational graph
cached = tensor.detach().requires_grad_(True)

# RIGHT: maintains graph connection for backward
cached = tensor.clone().requires_grad_(True)

# Or: simply keep a reference to the original tensor
```

3. Missing lm_head in Partitions
The output head (`lm_head`) is often separate from the layer list. Ensure:
- It's included in the final rank's computation
- Its parameters receive gradients
- It's not duplicated across ranks
4. Position Embeddings Handling
Position embeddings (especially rotary embeddings) require care:
- They may need explicit computation before the first layer
- The API varies between model implementations
- Test with the specific model architecture being used
5. Empty Partitions
When `world_size > num_layers`, some ranks may have no layers:
- Add explicit handling for empty partitions
- These ranks still need to forward activations
- Avoid division by zero in layer assignment
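A contiguous layer assignment that naturally yields empty partitions for surplus ranks could be sketched like this (the helper name is illustrative):

```python
# Sketch: contiguous layer assignment that tolerates world_size > num_layers.
# Ranks beyond the layer count get an empty slice and act as relay-only stages.
def assign_layers(num_layers, world_size, rank):
    base, extra = divmod(num_layers, world_size)  # world_size >= 1, no div-by-zero
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return list(range(start, end))

assert assign_layers(2, 4, 0) == [0]
assert assign_layers(2, 4, 1) == [1]
assert assign_layers(2, 4, 2) == []   # empty partition: relay-only rank
```

Because `divmod` spreads the remainder over the first `extra` ranks, the split stays balanced (e.g., 10 layers over 4 ranks gives 3/3/2/2) while the empty case falls out for free.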
6. Variable Sequence Lengths
If microbatches have different sequence lengths:
- Communicate shapes before data
- Don't cache and reuse shapes across microbatches
- Consider padding strategies for efficiency
Code Organization
Structure the implementation clearly:
```
pipeline_parallel.py
├── partition_model()      # Divide layers across ranks
├── get_stage_layers()     # Get this rank's layer subset
├── forward_stage()        # Single stage forward pass
├── backward_stage()       # Single stage backward pass
├── send_activation()      # Send tensor to next rank
├── recv_activation()      # Receive tensor from prev rank
├── send_gradient()        # Send gradient to prev rank
├── recv_gradient()        # Receive gradient from next rank
└── train_step_pipeline()  # Main training step orchestrator
```

Testing Checklist
Before considering implementation complete:
- Gradient flow verified for all model parameters
- Shapes correct at each stage boundary
- world_size=1 case works correctly
- Loss matches non-parallel baseline
- No communication deadlocks
- Memory usage scales appropriately with world_size
- Position embeddings handled correctly for the specific model
- Output head (lm_head) included and receives gradients