perf-parallelism-strategies
Parallelism Strategy Selection Skill
For stable background on each parallelism type, see:
- @docs/parallelisms.md
- @skills/perf-parallelism-strategies/card.yaml
Decision by Model Size
Dense models
| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
MoE models
MoE parallelism differs from dense models. Because only a fraction of
parameters are active per token, TP can often stay at 1 or 2 — the active
parameter shard already fits on a single GPU. EP is the primary scaling
dimension, with PP handling cross-node layer distribution.
| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |
Key patterns:
- TP is sized by active params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common: EP = num_experts or num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).
These are starting points, not hard rules. Always profile the first
iteration to verify memory and communication.
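
The EP rule of thumb in the key patterns can be sketched as a small helper. The function name and the `experts_per_gpu` argument are illustrative and not part of any library API; the sketch only encodes `EP = num_experts / experts_per_gpu` and rejects uneven splits.

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """Return EP such that each EP rank holds `experts_per_gpu` experts."""
    if num_experts <= 0 or experts_per_gpu <= 0:
        raise ValueError("num_experts and experts_per_gpu must be positive")
    if num_experts % experts_per_gpu != 0:
        raise ValueError("experts_per_gpu must divide num_experts evenly")
    return num_experts // experts_per_gpu


# Hypothetical 64-expert model, used only for illustration.
print(expert_parallel_size(64))     # one expert per EP rank
print(expert_parallel_size(64, 8))  # eight experts per EP rank
```

The resulting EP is still only a starting point; it has to be checked against the GPU count and memory of the actual run.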
Decision by Hardware Topology
Single node with NVLink:

```python
cfg.model.tensor_model_parallel_size = 8
```

Multiple nodes with InfiniBand:

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N
```

Limited network (Ethernet):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M
```

The stable rule is: keep TP within a single NVLink domain. Use PP or DP
for cross-node scaling. TP across nodes is almost always a performance
loss.
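
The NVLink-domain rule can be turned into a pre-flight check. This is a minimal sketch, not an existing API: `check_tp_placement` and `gpus_per_nvlink_domain` are hypothetical names, the default of 8 GPUs per domain is an assumption about the node, and the divisibility test assumes TP ranks are placed on consecutive GPUs.

```python
def check_tp_placement(tp: int, gpus_per_nvlink_domain: int = 8) -> None:
    """Raise ValueError if a TP group cannot stay inside one NVLink domain.

    Assumes TP ranks occupy consecutive GPUs, so a TP group stays inside a
    domain only when TP divides the domain size.
    """
    if tp < 1:
        raise ValueError("tp must be >= 1")
    if tp > gpus_per_nvlink_domain:
        raise ValueError(
            f"TP={tp} exceeds the {gpus_per_nvlink_domain}-GPU NVLink domain; "
            "use PP or DP for cross-node scaling"
        )
    if gpus_per_nvlink_domain % tp != 0:
        raise ValueError(
            f"TP={tp} does not divide {gpus_per_nvlink_domain}; "
            "some TP groups would straddle two domains"
        )


check_tp_placement(8)  # fits a single 8-GPU node
try:
    check_tp_placement(16)
except ValueError as err:
    print(f"rejected: {err}")
```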
Decision by Sequence Length
| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (…) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider … for large CP |
Combined Parallelism Enablement
3D parallelism (TP + PP + DP):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```

4D parallelism (TP + PP + CP + DP):

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
```

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 1024 GPUs, the minimum for PP=16 and EP=64):

```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```

DP size is always implicit:

```
data_parallel_size        = world_size / (TP * PP * CP)    # dense path
expert_data_parallel_size = world_size / (PP * EP * ETP)   # MoE path
```
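
The two implicit sizes can be computed directly from the formulas above. The helper below is an illustrative sketch (the function name is made up here, not a library call); it also rejects world sizes that either mesh does not divide evenly.

```python
def implied_dp_sizes(world_size, tp=1, pp=1, cp=1, ep=1, etp=1):
    """Return (data_parallel_size, expert_data_parallel_size)."""
    dense_ranks = tp * pp * cp    # GPUs consumed by one dense replica
    expert_ranks = pp * ep * etp  # GPUs consumed by one expert replica
    for label, ranks in (("TP*PP*CP", dense_ranks), ("PP*EP*ETP", expert_ranks)):
        if world_size % ranks != 0:
            raise ValueError(f"world_size={world_size} is not divisible by {label}={ranks}")
    return world_size // dense_ranks, world_size // expert_ranks


# DeepSeek-V2 layout from this section: TP=1, PP=4, EP=32 on 128 GPUs.
dp, edp = implied_dp_sizes(128, tp=1, pp=4, ep=32)
print(f"DP={dp}, EDP={edp}")  # DP=32, EDP=1
```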
Minimum GPU Count
The minimum GPUs needed to run a config (i.e. with `DP=1`, `EDP=1`)
is not the product of all parallelism dimensions. The dense path uses
a `TP*CP`-mesh and the MoE path uses an `EP*ETP`-mesh, and within each PP
stage these two meshes share the same set of GPUs: they overlap, they
don't multiply. Only PP stages multiply (they're disjoint slices of the
model). So:

```
min_gpus = PP * max(TP * CP, EP * ETP)
```

Common simplification (WRONG): `PP * TP * CP * EP * ETP`. This
over-allocates GPUs and shows up in many READMEs and slurm sizing tables.
Don't propagate it.

The decoupling of attention and MoE parallelism (different mesh shapes
for the dense and expert paths sharing the same PP-stage GPUs) is
detailed in Pangu Ultra MoE (arXiv:2504.14960).
Examples
| Config | Wrong (PP·TP·CP·EP·ETP) | Correct (PP·max(TP·CP, EP·ETP)) |
|---|---|---|
| PP=1, TP=2, CP=1, EP=8, ETP=1 | 16 | 8 (1 node) |
| PP=1, TP=4, CP=1, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=2, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=4, EP=8, ETP=1 | 64 | 8 (max(8, 8)) |
| PP=2, TP=2, CP=1, EP=8, ETP=1 | 32 | 16 (2 · max(2, 8)) |
| PP=1, TP=2, CP=1, EP=4, ETP=2 | 16 | 8 (max(2, 8)) |
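
The rows of this table can be reproduced with a few lines of Python. This is only a sketch of the two formulas; the function names are illustrative.

```python
def min_gpus(pp, tp, cp, ep, etp):
    """Correct minimum: the dense and expert meshes overlap inside each PP stage."""
    return pp * max(tp * cp, ep * etp)


def naive_product(pp, tp, cp, ep, etp):
    """The common but wrong simplification, kept only for comparison."""
    return pp * tp * cp * ep * etp


# Same configs as the table, as (PP, TP, CP, EP, ETP).
configs = [
    (1, 2, 1, 8, 1),
    (1, 4, 1, 8, 1),
    (1, 2, 2, 8, 1),
    (1, 2, 4, 8, 1),
    (2, 2, 1, 8, 1),
    (1, 2, 1, 4, 2),
]
for pp, tp, cp, ep, etp in configs:
    wrong = naive_product(pp, tp, cp, ep, etp)
    right = min_gpus(pp, tp, cp, ep, etp)
    print(f"PP={pp} TP={tp} CP={cp} EP={ep} ETP={etp}: wrong={wrong} correct={right}")
```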
Scaling above the minimum
Adding GPUs scales `DP` and/or `EDP` (the `world_size` must satisfy
both equations simultaneously). At `min_gpus` the larger-mesh side has
DP (or EDP) = 1 and the smaller side absorbs the slack.

Example (TP=2, CP=1, EP=8, ETP=1, PP=1):
- 8 GPUs (`min_gpus`): dense `DP = 8/2 = 4`, MoE `EDP = 8/8 = 1`
- 16 GPUs: dense `DP = 8`, MoE `EDP = 2` → 2× global batch
- 32 GPUs: dense `DP = 16`, MoE `EDP = 4` → 4× global batch

When sizing slurm scripts, compute `--nodes` from `min_gpus` (or a
multiple of it for higher throughput via DP/EDP).
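
A sizing helper for slurm scripts might look like the sketch below. It is illustrative only: `slurm_sizing`, `replicas`, and the default of 8 GPUs per node are assumptions rather than part of any tool, and the world size is always a whole multiple of `min_gpus`.

```python
import math


def slurm_sizing(pp, tp, cp, ep, etp, replicas=1, gpus_per_node=8):
    """Return world size, node count, DP and EDP for `replicas` x min_gpus."""
    dense_mesh = tp * cp
    expert_mesh = ep * etp
    world_size = pp * max(dense_mesh, expert_mesh) * replicas
    # Both equations must hold at once, so both meshes must divide the world size.
    if world_size % (pp * dense_mesh) or world_size % (pp * expert_mesh):
        raise ValueError("world_size is not divisible by both the dense and expert meshes")
    return {
        "world_size": world_size,
        "nodes": math.ceil(world_size / gpus_per_node),
        "dp": world_size // (pp * dense_mesh),
        "edp": world_size // (pp * expert_mesh),
    }


# Example config from this section: TP=2, CP=1, EP=8, ETP=1, PP=1.
for replicas in (1, 2, 4):
    print(slurm_sizing(pp=1, tp=2, cp=1, ep=8, etp=1, replicas=replicas))
```

The `nodes` value is what would go into `--nodes`, provided the cluster really has `gpus_per_node` GPUs on each node.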
Memory Estimation
Without parallelism (70B model, FP16):

```
parameters:       140 GB
gradients:        140 GB
optimizer states: 280 GB (Adam)
activations:       48 GB (batch=1, seq=4K)
total:            608 GB
```

With TP=4, PP=4, DP=4 (64 GPUs):

```
parameters:        8.75 GB per GPU
gradients:         8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations:       3.00 GB per GPU
total:            ~38 GB per GPU
```
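
The arithmetic behind these two estimates can be written out as a short script. It is a rough sketch that mirrors the simplifications of the numbers above: optimizer state is counted as twice the parameter bytes, every component is divided evenly by `TP * PP`, DP shards nothing, and the 48 GB activation figure is taken as given rather than derived. The function name is illustrative.

```python
def memory_breakdown_gb(params_billion, bytes_per_param=2, activations_gb=48.0, tp=1, pp=1):
    """Per-GPU memory estimate in GB under the simplifications stated above."""
    params = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    breakdown = {
        "parameters": params,
        "gradients": params,
        "optimizer states": 2 * params,  # Adam, counted as 2x parameter bytes
        "activations": activations_gb,
    }
    shard = tp * pp  # DP replicates the model state, so it does not appear here
    per_gpu = {name: size / shard for name, size in breakdown.items()}
    per_gpu["total"] = sum(per_gpu.values())
    return per_gpu


print(memory_breakdown_gb(70))              # single device, no parallelism
print(memory_breakdown_gb(70, tp=4, pp=4))  # TP=4, PP=4
```

A real run should still be profiled, since activation memory in particular depends on batch size, sequence length, and recomputation settings.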
Code Anchors
Parallelism dimensions set in model provider:

```python
model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
    # ... other model parameters
)
```

DP size calculation:

```python
data_parallel_size = world_size / (
    tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size
)
```

Bridge initialization wires parallelism into process groups:

```python
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    # ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    # ...
)
```
Pitfalls
- TP across nodes destroys throughput. Always keep TP within a single NVLink domain.
- PP without interleaving has large pipeline bubbles. Use `virtual_pipeline_model_parallel_size` when possible.
- SP requires `tensor_model_parallel_size > 1`. Enabling SP alone without TP is a config error.
- CP requires `seq_length % (2 * context_parallel_size) == 0`.
- EP is only for MoE models. Setting `expert_model_parallel_size` on a dense model is a no-op or error.
- The model-size-to-parallelism table above is a starting heuristic. Always profile the first iteration to check memory and communication.
- `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with overlap settings. See @skills/perf-tp-dp-comm-overlap/SKILL.md.
- The minimum GPU count for an MoE config is `PP * max(TP*CP, EP*ETP)`, not the product of all dimensions. The dense `TP*CP`-mesh and MoE `EP*ETP`-mesh share the same GPUs in each PP stage. See "Minimum GPU Count" section above.
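
Several of these pitfalls are mechanical enough to check before launching a job. The function below is a hypothetical pre-flight check, not part of any framework; it covers only the SP, CP, and EP rules from the list and returns human-readable problems instead of raising.

```python
from typing import List


def validate_parallel_config(
    tp: int,
    cp: int,
    ep: int,
    sequence_parallel: bool,
    seq_length: int,
    is_moe: bool,
) -> List[str]:
    """Return a list of config problems; an empty list means no rule was violated."""
    problems = []
    if sequence_parallel and tp <= 1:
        problems.append("SP requires tensor_model_parallel_size > 1")
    if cp > 1 and seq_length % (2 * cp) != 0:
        problems.append(f"seq_length={seq_length} is not divisible by 2 * CP = {2 * cp}")
    if ep > 1 and not is_moe:
        problems.append("expert_model_parallel_size > 1 is set on a dense model")
    return problems


# SP without TP on a dense model, with a valid CP setting.
print(validate_parallel_config(tp=1, cp=2, ep=1, sequence_parallel=True,
                               seq_length=4096, is_moe=False))
```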
Verification
Quick sanity check that combined parallelism initializes correctly using
the smallest available recipe with overridden parallelism:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1
```

Success criteria:
- exit code 0
- finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`)
- log shows TP=2 PP=2 DP=1 layout with 4 ranks
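
The "finite loss" criterion can be checked automatically on a captured log. The sketch below assumes the loss appears as `lm loss: <number>` followed by whitespace, as in the example line above; the real log format may differ, and the function name is illustrative.

```python
import math
import re

# Assumed format: "lm loss: 1.003808E+01", with the value ending at whitespace.
LOSS_PATTERN = re.compile(r"lm loss:\s*(\S+)")


def last_loss_is_finite(log_text: str) -> bool:
    """True if the last reported lm loss parses as a finite float."""
    matches = LOSS_PATTERN.findall(log_text)
    if not matches:
        return False
    try:
        return math.isfinite(float(matches[-1]))
    except ValueError:
        return False


print(last_loss_is_finite("iteration 3 | lm loss: 1.003808E+01 |"))  # True
print(last_loss_is_finite("iteration 3 | lm loss: nan |"))           # False
```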