perf-parallelism-strategies

Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:
  • @docs/parallelisms.md
  • @skills/perf-parallelism-strategies/card.yaml

Decision by Model Size

Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
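The dense-model table can be read as a simple threshold lookup. The sketch below transcribes it; the function and constant names are hypothetical, and sizes that sit exactly on a boundary (10B, 70B, 175B) are arbitrarily assigned to the larger bracket, which the table itself leaves open.

```python
# Rows transcribed from the dense-model table above.
# Each entry: (exclusive upper bound in billions of parameters, GPU range, layout).
DENSE_STARTING_POINTS = [
    (1, "1-8", "DP only"),
    (10, "8-16", "TP=2-4 + DP"),
    (70, "16-64", "TP=4-8 + PP=2-4 + DP"),
    (175, "64-256", "TP=8 + PP=4-8 + DP"),
    (500, "256-1024", "TP=8 + PP=8-16 + CP=2 + DP"),
]


def dense_starting_point(params_billions):
    """Return (gpu_range, layout) for a dense model of the given size.

    Hypothetical helper: boundary sizes go to the larger bracket.
    """
    if params_billions <= 0 or params_billions > 500:
        raise ValueError("the table covers dense models up to 500B parameters")
    for upper, gpus, layout in DENSE_STARTING_POINTS:
        if params_billions < upper:
            return gpus, layout
    # Exactly 500B: last row of the table.
    return DENSE_STARTING_POINTS[-1][1:]


if __name__ == "__main__":
    for size in (0.5, 7, 70, 400):
        print(size, dense_starting_point(size))
```

The result is only a starting point for profiling, in the same spirit as the table.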

MoE models

MoE parallelism differs from dense models. Because only a fraction of parameters are active per token, TP can often stay at 1 or 2 — the active parameter shard already fits on a single GPU. EP is the primary scaling dimension, with PP handling cross-node layer distribution.
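| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |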
Key patterns:
  • TP is sized by active params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
  • EP scales with expert count. Common: EP = num_experts or num_experts / experts_per_gpu.
  • PP handles depth. Large MoE models use PP=8-16 across nodes.
  • ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).
These are starting points, not hard rules. Always profile the first iteration to verify memory and communication.
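The EP pattern above (`EP = num_experts` or `num_experts / experts_per_gpu`) is plain integer arithmetic. A minimal sketch follows, with a hypothetical helper name and illustrative expert counts that are not taken from any model in the table:

```python
def expert_parallel_size(num_experts, experts_per_gpu=1):
    """Return EP so that each EP rank holds `experts_per_gpu` experts.

    Hypothetical helper; raises if the experts cannot be split evenly.
    """
    if experts_per_gpu < 1 or num_experts < 1:
        raise ValueError("num_experts and experts_per_gpu must be positive")
    if num_experts % experts_per_gpu != 0:
        raise ValueError(
            f"{num_experts} experts cannot be split evenly into groups of {experts_per_gpu}"
        )
    return num_experts // experts_per_gpu


if __name__ == "__main__":
    # Illustrative numbers only.
    print(expert_parallel_size(64))                      # one expert per GPU
    print(expert_parallel_size(64, experts_per_gpu=8))   # eight experts per GPU
```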

Decision by Hardware Topology

Single node with NVLink:
```python
cfg.model.tensor_model_parallel_size = 8
```
Multiple nodes with InfiniBand:
```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N  # N is a placeholder: pipeline stages spread across nodes
```
Limited network (Ethernet):
```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M  # M is a placeholder: pipeline stages spread across nodes
```
The stable rule is: keep TP within a single NVLink domain. Use PP or DP for cross-node scaling. TP across nodes is almost always a performance loss.
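The NVLink rule can be checked before launching a job. The sketch below is a hypothetical pre-flight helper, not part of any library. It assumes ranks are assigned to GPUs contiguously, so a TP group stays inside one domain only when the TP size divides the domain size, and it assumes 8 GPUs per NVLink domain as in the single-node example above.

```python
def check_tp_fits_nvlink_domain(tp_size, gpus_per_nvlink_domain=8):
    """Raise ValueError if a TP group would span more than one NVLink domain.

    Hypothetical helper. Assumes contiguous rank placement, so TP groups are
    consecutive blocks of `tp_size` GPUs.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be positive")
    if tp_size > gpus_per_nvlink_domain:
        raise ValueError(
            f"TP={tp_size} exceeds the {gpus_per_nvlink_domain}-GPU NVLink domain; "
            "use PP or DP for cross-node scaling"
        )
    if gpus_per_nvlink_domain % tp_size != 0:
        raise ValueError(
            f"TP={tp_size} does not divide {gpus_per_nvlink_domain}; "
            "some TP groups would straddle two domains"
        )


if __name__ == "__main__":
    check_tp_fits_nvlink_domain(8)       # fits a single 8-GPU node
    try:
        check_tp_fits_nvlink_domain(16)  # would cross nodes
    except ValueError as err:
        print(err)
```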

Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP |

Combined Parallelism Enablement

3D parallelism (TP + PP + DP):
```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```
4D parallelism (TP + PP + CP + DP):
```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```
MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):
```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
```
MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B; PP=16 with EP=64 needs at least 16 * 64 = 1024 GPUs, since `world_size` must be a multiple of `PP * EP * ETP`):
```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```
DP size is always implicit:
```
data_parallel_size = world_size / (TP * PP * CP)        # dense path
expert_data_parallel_size = world_size / (PP * EP * ETP) # MoE path
```
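The two formulas above can be evaluated together to see what a given `world_size` implies. This is a minimal sketch with a hypothetical function name; it rejects world sizes that do not divide evenly, on the assumption that both sizes must be whole numbers.

```python
def implicit_dp_sizes(world_size, tp=1, pp=1, cp=1, ep=1, etp=1):
    """Return (data_parallel_size, expert_data_parallel_size).

    Hypothetical helper applying the two formulas above with integer checks.
    """
    dense_group = tp * pp * cp   # GPUs consumed by one dense model replica
    moe_group = pp * ep * etp    # GPUs consumed by one expert replica
    if world_size % dense_group != 0:
        raise ValueError(f"world_size={world_size} is not a multiple of TP*PP*CP={dense_group}")
    if world_size % moe_group != 0:
        raise ValueError(f"world_size={world_size} is not a multiple of PP*EP*ETP={moe_group}")
    return world_size // dense_group, world_size // moe_group


if __name__ == "__main__":
    # The EP + PP example above: TP=1, PP=4, EP=32 on 128 GPUs.
    dp, edp = implicit_dp_sizes(128, tp=1, pp=4, ep=32)
    print(f"DP={dp}, EDP={edp}")
```

For a dense model, `ep` and `etp` stay at 1 and only the first value is meaningful.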

Minimum GPU Count

The minimum number of GPUs needed to run a config (i.e. with `DP=1`, `EDP=1`) is not the product of all parallelism dimensions. The dense path uses a `TP*CP`-mesh and the MoE path uses an `EP*ETP`-mesh, and within each PP stage these two meshes share the same set of GPUs: they overlap, they don't multiply. Only PP stages multiply (they're disjoint slices of the model). So:
```
min_gpus = PP * max(TP * CP, EP * ETP)
```
Common simplification (WRONG): `PP * TP * CP * EP * ETP`. This over-allocates GPUs and shows up in many READMEs and slurm sizing tables. Don't propagate it.
The decoupling of attention and MoE parallelism (different mesh shapes for the dense and expert paths sharing the same PP-stage GPUs) is detailed in Pangu Ultra MoE (arXiv:2504.14960).

Examples

| Config | Wrong (PP·TP·CP·EP·ETP) | Correct (PP·max(TP·CP, EP·ETP)) |
|---|---|---|
| PP=1, TP=2, CP=1, EP=8, ETP=1 | 16 | 8 (1 node) |
| PP=1, TP=4, CP=1, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=2, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=4, EP=8, ETP=1 | 64 | 8 (max(8, 8)) |
| PP=2, TP=2, CP=1, EP=8, ETP=1 | 32 | 16 (2 · max(2, 8)) |
| PP=1, TP=2, CP=1, EP=4, ETP=2 | 16 | 8 (max(2, 8)) |
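The rows above can be reproduced with a few lines of Python. The function names are hypothetical, and the sketch assumes, as in every row of the table, that the larger of the two meshes is a multiple of the smaller one.

```python
def min_gpus(pp=1, tp=1, cp=1, ep=1, etp=1):
    """Minimum GPU count: the dense and MoE meshes overlap inside a PP stage."""
    return pp * max(tp * cp, ep * etp)


def naive_product(pp=1, tp=1, cp=1, ep=1, etp=1):
    """The over-counting simplification, shown only for comparison."""
    return pp * tp * cp * ep * etp


# The six configurations from the examples table.
EXAMPLES = [
    dict(pp=1, tp=2, cp=1, ep=8, etp=1),
    dict(pp=1, tp=4, cp=1, ep=8, etp=1),
    dict(pp=1, tp=2, cp=2, ep=8, etp=1),
    dict(pp=1, tp=2, cp=4, ep=8, etp=1),
    dict(pp=2, tp=2, cp=1, ep=8, etp=1),
    dict(pp=1, tp=2, cp=1, ep=4, etp=2),
]

if __name__ == "__main__":
    for cfg in EXAMPLES:
        print(cfg, "wrong:", naive_product(**cfg), "correct:", min_gpus(**cfg))
```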

Scaling above the minimum

Adding GPUs scales `DP` and/or `EDP` (the `world_size` must satisfy both equations simultaneously). At `min_gpus` the larger-mesh side has DP (or EDP) = 1 and the smaller side absorbs the slack.
Example (TP=2, CP=1, EP=8, ETP=1, PP=1):
  • 8 GPUs (`min_gpus`): dense `DP = 8/2 = 4`, MoE `EDP = 8/8 = 1`
  • 16 GPUs: dense `DP = 8`, MoE `EDP = 2` → 2× global batch
  • 32 GPUs: dense `DP = 16`, MoE `EDP = 4` → 4× global batch
When sizing slurm scripts, compute `--nodes` from `min_gpus` (or a multiple of it for higher throughput via DP/EDP).
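As a rough sketch of that sizing step, the helper below turns a parallelism config and a scale factor into a node count plus the resulting DP and EDP. The function name, the `scale` parameter, and the default of 8 GPUs per node are assumptions for illustration.

```python
import math


def slurm_sizing(pp, tp, cp, ep, etp, scale=1, gpus_per_node=8):
    """Return (nodes, world_size, dp, edp) for `scale` times the minimum GPU count.

    Hypothetical helper. `scale` is the integer multiple of min_gpus to run on.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    min_gpus = pp * max(tp * cp, ep * etp)
    world_size = min_gpus * scale
    # Round up: a partially used node still has to be requested.
    nodes = math.ceil(world_size / gpus_per_node)
    dp = world_size // (tp * pp * cp)     # dense path
    edp = world_size // (pp * ep * etp)   # MoE path
    return nodes, world_size, dp, edp


if __name__ == "__main__":
    # The example above: TP=2, CP=1, EP=8, ETP=1, PP=1 at 1x, 2x and 4x.
    for scale in (1, 2, 4):
        nodes, world, dp, edp = slurm_sizing(pp=1, tp=2, cp=1, ep=8, etp=1, scale=scale)
        print(f"--nodes={nodes}  world_size={world}  DP={dp}  EDP={edp}")
```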

Memory Estimation

Without parallelism (70B model, FP16):
```
parameters:       140 GB
gradients:        140 GB
optimizer states: 280 GB (Adam)
activations:       48 GB (batch=1, seq=4K)
total:            608 GB
```
With TP=4, PP=4, DP=4 (64 GPUs):
```
parameters:        8.75 GB per GPU
gradients:         8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations:       3.00 GB per GPU
total:           ~38    GB per GPU
```
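The per-GPU figures follow from dividing each term by `TP * PP = 16`; DP does not appear in the arithmetic. The sketch below reproduces that estimate. The function name and parameter defaults are assumptions chosen to match the numbers above (2 bytes per parameter and gradient, 4 bytes per parameter of Adam state, 48 GB of unsharded activations), and real activation memory depends on more than this simple division.

```python
def per_gpu_memory_gb(params_billions, tp, pp,
                      bytes_per_param=2,
                      optimizer_bytes_per_param=4,
                      activations_gb=48.0):
    """Back-of-envelope per-GPU memory in GB, mirroring the example above.

    Hypothetical helper: every term is divided by TP * PP, nothing by DP.
    """
    shards = tp * pp
    parameters = params_billions * bytes_per_param / shards
    gradients = params_billions * bytes_per_param / shards
    optimizer = params_billions * optimizer_bytes_per_param / shards
    activations = activations_gb / shards
    return {
        "parameters": parameters,
        "gradients": gradients,
        "optimizer states": optimizer,
        "activations": activations,
        "total": parameters + gradients + optimizer + activations,
    }


if __name__ == "__main__":
    for name, gb in per_gpu_memory_gb(70, tp=4, pp=4).items():
        print(f"{name:>17}: {gb:6.2f} GB per GPU")
```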

Code Anchors

Parallelism dimensions set in model provider:
```python
model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
    # ... other model parameters
)
```
DP size calculation:
```
data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size)
```
Bridge initialization wires parallelism into process groups:
```python
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    ...
)
```

Pitfalls

  1. TP across nodes destroys throughput. Always keep TP within a single NVLink domain.
  2. PP without interleaving has large pipeline bubbles. Use `virtual_pipeline_model_parallel_size` when possible.
  3. SP requires `tensor_model_parallel_size > 1`. Enabling SP alone without TP is a config error.
  4. CP requires `seq_length % (2 * context_parallel_size) == 0`.
  5. EP is only for MoE models. Setting `expert_model_parallel_size` on a dense model is a no-op or error.
  6. The model-size-to-parallelism table above is a starting heuristic. Always profile the first iteration to check memory and communication.
  7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with overlap settings. See @skills/perf-tp-dp-comm-overlap/SKILL.md.
  8. The minimum GPU count for an MoE config is `PP * max(TP*CP, EP*ETP)`, not the product of all dimensions. The dense `TP*CP`-mesh and MoE `EP*ETP`-mesh share the same GPUs in each PP stage. See "Minimum GPU Count" section above.
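Pitfalls 3, 4, and 5 are mechanical enough to check before a job is submitted. The sketch below is a hypothetical pre-flight check, not the framework's own validation. `ParallelConfig` is a stand-in dataclass whose field names mirror the `cfg.model` attributes used above, and `num_moe_experts` (with `None` meaning a dense model) is an assumed way of marking a model as MoE.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParallelConfig:
    """Hypothetical stand-in for the relevant cfg.model fields."""
    tensor_model_parallel_size: int = 1
    context_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    sequence_parallel: bool = False
    seq_length: int = 4096
    num_moe_experts: Optional[int] = None  # assumption: None means a dense model


def find_config_problems(cfg: ParallelConfig) -> List[str]:
    """Return human-readable problems for pitfalls 3, 4 and 5 (empty if none)."""
    problems = []
    # Pitfall 3: SP only makes sense together with TP.
    if cfg.sequence_parallel and cfg.tensor_model_parallel_size <= 1:
        problems.append("sequence_parallel=True requires tensor_model_parallel_size > 1")
    # Pitfall 4: CP needs the sequence to split into 2 * CP equal chunks.
    if cfg.context_parallel_size > 1 and cfg.seq_length % (2 * cfg.context_parallel_size) != 0:
        problems.append(
            f"seq_length={cfg.seq_length} is not divisible by "
            f"2 * context_parallel_size={2 * cfg.context_parallel_size}"
        )
    # Pitfall 5: EP is meaningless without experts.
    if cfg.expert_model_parallel_size > 1 and cfg.num_moe_experts is None:
        problems.append("expert_model_parallel_size > 1 set on a dense model")
    return problems


if __name__ == "__main__":
    cfg = ParallelConfig(sequence_parallel=True, context_parallel_size=2, seq_length=4098)
    for problem in find_config_problems(cfg):
        print("config problem:", problem)
```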

Verification

Quick sanity check that combined parallelism initializes correctly using the smallest available recipe with overridden parallelism:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1
```
Success criteria:
  • exit code 0
  • finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`)
  • log shows TP=2 PP=2 DP=1 layout with 4 ranks
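The "finite loss" criterion can be checked automatically on a captured log. The sketch below is a hypothetical helper that relies only on the `lm loss: <value>` fragment shown above; the rest of the log line format is not assumed, and the sample strings are made up for illustration.

```python
import math
import re

# Matches "lm loss: <number>", where the value may also be nan or inf.
LOSS_RE = re.compile(
    r"lm loss:\s*([-+]?(?:\d+(?:\.\d+)?(?:[eE][-+]?\d+)?|nan|inf))",
    re.IGNORECASE,
)


def last_loss_is_finite(log_text):
    """Return True if the last reported lm loss in the log is a finite number."""
    values = LOSS_RE.findall(log_text)
    if not values:
        return False  # no loss line found at all
    return math.isfinite(float(values[-1]))


if __name__ == "__main__":
    sample = "iteration 3 ... lm loss: 1.003808E+01 ..."  # made-up surrounding text
    print(last_loss_is_finite(sample))
```

This complements the exit code and the layout line in the log rather than replacing them.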