perf-parallelism-strategies

Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

@docs/parallelisms.md
@skills/perf-parallelism-strategies/card.yaml

Decision by Model Size

Dense models

| Model size | GPUs | Recommended starting point |
| --- | --- | --- |
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
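
The dense-model table can be encoded as a small lookup, which is handy when a launcher script needs a first guess. This is a minimal sketch: the function name `dense_starting_point` is hypothetical, and sending a size that sits exactly on a boundary (for example 10B) to the larger bucket is an assumption, since the table does not say.

```python
# Rows copied from the dense-model table:
# (exclusive upper bound in billions of parameters, GPU range, starting layout).
DENSE_STARTING_POINTS = [
    (1, "1-8", "DP only"),
    (10, "8-16", "TP=2-4 + DP"),
    (70, "16-64", "TP=4-8 + PP=2-4 + DP"),
    (175, "64-256", "TP=8 + PP=4-8 + DP"),
    (500, "256-1024", "TP=8 + PP=8-16 + CP=2 + DP"),
]


def dense_starting_point(params_billion: float) -> tuple:
    """Return (GPU range, recommended starting layout) for a dense model."""
    if params_billion <= 0:
        raise ValueError("model size must be positive")
    for upper, gpus, layout in DENSE_STARTING_POINTS:
        # Assumption: a boundary value (e.g. exactly 10B) goes to the larger bucket.
        if params_billion < upper:
            return gpus, layout
    # The last row is treated as inclusive of its upper bound (500B).
    if params_billion == DENSE_STARTING_POINTS[-1][0]:
        return DENSE_STARTING_POINTS[-1][1:]
    raise ValueError("the table does not cover dense models above 500B")


if __name__ == "__main__":
    print(dense_starting_point(7))
    print(dense_starting_point(70))
```

The result is only a starting point; the first profiled iteration decides whether it stays.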

MoE models

MoE parallelism differs from dense models. Because only a fraction of parameters are active per token, TP can often stay at 1 or 2 — the active parameter shard already fits on a single GPU. EP is the primary scaling dimension, with PP handling cross-node layer distribution.

| Model (total / active) | TP | PP | EP | Notes |
| --- | --- | --- | --- | --- |
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |

Key patterns:

TP is sized by active params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
EP scales with expert count. Common: EP = num_experts or num_experts / experts_per_gpu.
PP handles depth. Large MoE models use PP=8-16 across nodes.
ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).

These are starting points, not hard rules. Always profile the first iteration to verify memory and communication.
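
The "EP scales with expert count" pattern reduces to one division. The sketch below uses a hypothetical helper name, and the 64-expert figure in the usage line is illustrative rather than taken from any model in the table.

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """EP = num_experts / experts_per_gpu, as in the key patterns above."""
    if num_experts <= 0 or experts_per_gpu <= 0:
        raise ValueError("num_experts and experts_per_gpu must be positive")
    if num_experts % experts_per_gpu != 0:
        # Experts must be spread evenly over the EP ranks.
        raise ValueError("num_experts must be divisible by experts_per_gpu")
    return num_experts // experts_per_gpu


if __name__ == "__main__":
    # Illustrative numbers: 64 experts with 8 experts hosted per GPU.
    print(expert_parallel_size(64, 8))
```

With the default `experts_per_gpu=1` the helper returns `EP = num_experts`, the other common setting named above.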

Decision by Hardware Topology

Single node with NVLink:

python

cfg.model.tensor_model_parallel_size = 8

Multiple nodes with InfiniBand:

python

cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N

Limited network (Ethernet):

python

cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M

The stable rule is: keep TP within a single NVLink domain. Use PP or DP for cross-node scaling. TP across nodes is almost always a performance loss.

Decision by Sequence Length

| Sequence length | Recommendation |
| --- | --- |
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP |
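
The sequence-length table maps onto a simple threshold function. This sketch assumes that "K" means 1024 tokens and that a length exactly on a boundary falls into the longer bucket; neither is stated in the table, and the function name is hypothetical.

```python
def sequence_length_recommendation(seq_length: int) -> str:
    """Return the table's recommendation for a given sequence length in tokens."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    k = 1024  # Assumption: "2K" in the table means 2 * 1024 tokens.
    if seq_length < 2 * k:
        return "standard TP + PP + DP"
    if seq_length < 8 * k:
        return "add SP (sequence_parallel=True)"
    if seq_length < 32 * k:
        return "add CP=2"
    return "add CP=4-8, consider a2a+p2p for large CP"


if __name__ == "__main__":
    for length in (1024, 4096, 16384, 65536):
        print(length, "->", sequence_length_recommendation(length))
```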

Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

python

cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True

4D parallelism (TP + PP + CP + DP):

python

cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

python

cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 1024 GPUs, the smallest world size that fits PP=16 with EP=64):

python

cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True

DP size is always implicit:

data_parallel_size = world_size / (TP * PP * CP)        # dense path
expert_data_parallel_size = world_size / (PP * EP * ETP) # MoE path
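
Both implicit sizes can be derived and sanity-checked before launch. The following is a minimal sketch with a hypothetical helper name; it only applies the two equations above and rejects world sizes that do not divide evenly.

```python
def data_parallel_sizes(world_size, tp=1, pp=1, cp=1, ep=1, etp=1):
    """Return (data_parallel_size, expert_data_parallel_size) for a world size."""
    dense_mesh = tp * pp * cp   # GPUs consumed by one dense model replica
    moe_mesh = pp * ep * etp    # GPUs consumed by one expert replica
    if world_size % dense_mesh != 0:
        raise ValueError(f"world_size {world_size} is not divisible by TP*PP*CP = {dense_mesh}")
    if world_size % moe_mesh != 0:
        raise ValueError(f"world_size {world_size} is not divisible by PP*EP*ETP = {moe_mesh}")
    return world_size // dense_mesh, world_size // moe_mesh


if __name__ == "__main__":
    # DeepSeek-V2 example from this section: TP=1, PP=4, EP=32 on 128 GPUs.
    dp, edp = data_parallel_sizes(128, tp=1, pp=4, ep=32)
    print(f"DP={dp}, EDP={edp}")
```

For a dense model the second value has no meaning and can be ignored.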

Minimum GPU Count

The minimum number of GPUs needed to run a config (i.e. with `DP=1` and `EDP=1`) is not the product of all parallelism dimensions. The dense path uses a `TP*CP` mesh and the MoE path uses an `EP*ETP` mesh, and within each PP stage these two meshes share the same set of GPUs: they overlap, they don't multiply. Only PP stages multiply (they're disjoint slices of the model). So:

min_gpus = PP * max(TP * CP, EP * ETP)

Common simplification (WRONG): `PP * TP * CP * EP * ETP`. This over-allocates GPUs and shows up in many READMEs and slurm sizing tables. Don't propagate it.

The decoupling of attention and MoE parallelism (different mesh shapes for the dense and expert paths sharing the same PP-stage GPUs) is detailed in Pangu Ultra MoE (arXiv:2504.14960).

Examples

| Config | Wrong (PP·TP·CP·EP·ETP) | Correct (PP·max(TP·CP, EP·ETP)) |
| --- | --- | --- |
| PP=1, TP=2, CP=1, EP=8, ETP=1 | 16 | 8 (1 node) |
| PP=1, TP=4, CP=1, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=2, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=4, EP=8, ETP=1 | 64 | 8 (max(8, 8)) |
| PP=2, TP=2, CP=1, EP=8, ETP=1 | 32 | 16 (2 · max(2, 8)) |
| PP=1, TP=2, CP=1, EP=4, ETP=2 | 16 | 8 (max(2, 8)) |
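
The rows above can be reproduced with two one-line functions, which makes it easy to check a new config against both formulas. The function names are hypothetical; the arithmetic is exactly the two expressions in the table header.

```python
def min_gpus(pp, tp, cp, ep, etp):
    """Correct minimum: the dense and MoE meshes overlap within each PP stage."""
    return pp * max(tp * cp, ep * etp)


def naive_product(pp, tp, cp, ep, etp):
    """The over-allocating simplification, kept only for comparison."""
    return pp * tp * cp * ep * etp


if __name__ == "__main__":
    configs = [
        dict(pp=1, tp=2, cp=1, ep=8, etp=1),
        dict(pp=1, tp=4, cp=1, ep=8, etp=1),
        dict(pp=1, tp=2, cp=2, ep=8, etp=1),
        dict(pp=1, tp=2, cp=4, ep=8, etp=1),
        dict(pp=2, tp=2, cp=1, ep=8, etp=1),
        dict(pp=1, tp=2, cp=1, ep=4, etp=2),
    ]
    for cfg in configs:
        print(cfg, "wrong:", naive_product(**cfg), "correct:", min_gpus(**cfg))
```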

Scaling above the minimum

Adding GPUs scales `DP` and/or `EDP` (the `world_size` must satisfy both equations simultaneously). At `min_gpus` the larger-mesh side has DP (or EDP) = 1 and the smaller side absorbs the slack.

Example with TP=2, CP=1, EP=8, ETP=1, PP=1:

8 GPUs (`min_gpus`): dense `DP = 8/2 = 4`, MoE `EDP = 8/8 = 1`
16 GPUs: dense `DP = 8`, MoE `EDP = 2` → 2× global batch
32 GPUs: dense `DP = 16`, MoE `EDP = 4` → 4× global batch

When sizing slurm scripts, compute `--nodes` from `min_gpus` (or a multiple of it for higher throughput via DP/EDP).
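
The node count for a slurm script can be worked out from `min_gpus` and its multiples as in the sketch below. The helper name and the default of 8 GPUs per node are assumptions; adjust `gpus_per_node` to the actual cluster.

```python
import math


def scaling_options(pp, tp, cp, ep, etp, gpus_per_node=8, multiples=(1, 2, 4)):
    """List world sizes at multiples of min_gpus with the resulting DP, EDP and node count."""
    dense_mesh = pp * tp * cp
    moe_mesh = pp * ep * etp
    base = pp * max(tp * cp, ep * etp)  # min_gpus
    options = []
    for multiple in multiples:
        world_size = base * multiple
        # Keep only sizes that satisfy both DP equations with integer results.
        if world_size % dense_mesh or world_size % moe_mesh:
            continue
        options.append({
            "world_size": world_size,
            "nodes": math.ceil(world_size / gpus_per_node),  # value for --nodes
            "dp": world_size // dense_mesh,
            "edp": world_size // moe_mesh,
        })
    return options


if __name__ == "__main__":
    for option in scaling_options(pp=1, tp=2, cp=1, ep=8, etp=1):
        print(option)
```

For the example config this lists 8, 16 and 32 GPUs with the same DP and EDP values as above.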

Memory Estimation

Without parallelism (70B model, FP16):

parameters:       140 GB
gradients:        140 GB
optimizer states: 280 GB (Adam)
activations:       48 GB (batch=1, seq=4K)
total:            608 GB

With TP=4, PP=4, DP=4 (64 GPUs):

parameters:        8.75 GB per GPU
gradients:         8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations:       3.00 GB per GPU
total:           ~38    GB per GPU
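
The per-GPU figures above follow from dividing each component by TP × PP = 16. The sketch below reproduces them under the assumptions the numbers imply: 2 bytes per parameter for FP16 weights and gradients, 4 bytes per parameter of Adam state (which gives the 280 GB figure), and activations that split evenly across the TP × PP shards. The function name is hypothetical, and DP is left out because the figures above do not divide by it.

```python
def per_gpu_memory_gb(params_billion, activations_gb, tp=1, pp=1,
                      bytes_per_param=2, optimizer_bytes_per_param=4):
    """Rough per-GPU memory estimate in GB; a back-of-envelope model, not a profiler."""
    shards = tp * pp  # Assumption: every component is split evenly over TP * PP GPUs.
    # 1e9 parameters * N bytes = N GB, so billions * bytes gives GB directly.
    parameters = params_billion * bytes_per_param / shards
    gradients = params_billion * bytes_per_param / shards
    optimizer = params_billion * optimizer_bytes_per_param / shards
    activations = activations_gb / shards
    return {
        "parameters": parameters,
        "gradients": gradients,
        "optimizer_states": optimizer,
        "activations": activations,
        "total": parameters + gradients + optimizer + activations,
    }


if __name__ == "__main__":
    print(per_gpu_memory_gb(70, 48))              # no parallelism
    print(per_gpu_memory_gb(70, 48, tp=4, pp=4))  # TP=4, PP=4
```

Real usage will differ with the optimizer precision, activation recomputation and framework overhead, so treat the result as a first filter before profiling.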

Code Anchors

Parallelism dimensions set in model provider:

model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
    # ... other model parameters
)

DP size calculation:

data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size)

Bridge initialization wires parallelism into process groups:

parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    ...
)

Pitfalls

TP across nodes destroys throughput. Always keep TP within a single NVLink domain.
PP without interleaving has large pipeline bubbles. Use `virtual_pipeline_model_parallel_size` when possible.
SP requires `tensor_model_parallel_size > 1`. Enabling SP alone without TP is a config error.
CP requires `seq_length % (2 * context_parallel_size) == 0`.
EP is only for MoE models. Setting `expert_model_parallel_size` on a dense model is a no-op or error.
The model-size-to-parallelism table above is a starting heuristic. Always profile the first iteration to check memory and communication.
`CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with overlap settings. See @skills/perf-tp-dp-comm-overlap/SKILL.md.
The minimum GPU count for an MoE config is `PP * max(TP*CP, EP*ETP)`, not the product of all dimensions. The dense `TP*CP` mesh and MoE `EP*ETP` mesh share the same GPUs in each PP stage. See "Minimum GPU Count" section above.
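
Several of these pitfalls can be caught before a job is submitted. The sketch below is a hypothetical pre-flight check, not part of any library: the function name and the default of 8 GPUs per NVLink domain are assumptions, and the CP divisibility rule is applied only when CP is greater than 1.

```python
def check_parallel_config(tp, cp, seq_length, sequence_parallel,
                          gpus_per_nvlink_domain=8):
    """Return a list of human-readable problems; an empty list means no pitfall was hit."""
    problems = []
    if tp > gpus_per_nvlink_domain:
        problems.append(
            f"TP={tp} exceeds the NVLink domain of {gpus_per_nvlink_domain} GPUs"
        )
    if sequence_parallel and tp <= 1:
        problems.append("SP requires tensor_model_parallel_size > 1")
    # Assumption: the divisibility rule only matters when CP is actually enabled.
    if cp > 1 and seq_length % (2 * cp) != 0:
        problems.append(
            f"seq_length={seq_length} is not divisible by 2 * context_parallel_size = {2 * cp}"
        )
    return problems


if __name__ == "__main__":
    print(check_parallel_config(tp=8, cp=2, seq_length=8192, sequence_parallel=True))
    print(check_parallel_config(tp=1, cp=1, seq_length=4096, sequence_parallel=True))
```

Returning a list rather than raising on the first failure lets a launcher report every problem in one pass.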

Verification

Run a quick sanity check that combined parallelism initializes correctly, using the smallest available recipe with overridden parallelism settings:

bash

CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1

Success criteria:

exit code 0
finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`)
log shows TP=2 PP=2 DP=1 layout with 4 ranks
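
The finite-loss criterion can be checked automatically on captured output. This is a minimal sketch: only the `lm loss: 1.003808E+01` fragment comes from the example above, so the rest of the log line format is an assumption, and the function name is hypothetical.

```python
import math
import re

# Matches the "lm loss: <number>" fragment; also accepts nan/inf so they can be rejected.
LOSS_PATTERN = re.compile(
    r"lm loss:\s*([-+]?(?:\d+(?:\.\d+)?(?:[eE][-+]?\d+)?|nan|inf))",
    re.IGNORECASE,
)


def last_loss_is_finite(log_text: str) -> bool:
    """Return True if the last reported lm loss in the log is a finite number."""
    matches = LOSS_PATTERN.findall(log_text)
    if not matches:
        return False  # No loss was logged at all, so the check fails.
    return math.isfinite(float(matches[-1]))


if __name__ == "__main__":
    # Assumed log line shape; only the "lm loss" fragment is from the source.
    sample = "iteration 3 | lm loss: 1.003808E+01 |"
    print(last_loss_is_finite(sample))
```

Combine this with the process exit code, since a run can log a finite loss and still crash afterwards.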
