perf-moe-optimization-workflow

MoE Training Optimization Workflow

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-optimization-workflow/card.yaml
Source: Scalable Training of MoE Models with Megatron Core

Quick Reference

Think in terms of the paper's Three Walls:

memory wall
communication wall
compute and host-overhead wall

MoE tuning is iterative. Fixing one wall usually exposes the next one, so the best workflow is: fit first, scale second, profile third, then retune.

Phase 1: Make The Run Memory-Feasible

Start with a configuration that fits reliably before chasing throughput.

Recommended order:

Use the smallest amount of model parallelism that still fits.
Turn on selective recompute before falling back to full recompute.
Add offloading only when recompute and parallelism are still insufficient.
Use `--fake-init-process-group` to sanity-check large parallel layouts on a single GPU before burning cluster time.
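The recommended order can be read as an escalation ladder: try the cheapest memory measure first and stop at the first configuration that fits. The sketch below is a minimal illustration of that idea. The `fits` callback is hypothetical (for example, a dry run that reports whether the job avoided OOM), and the dictionary keys are illustrative labels, not Megatron Core options.

```python
from typing import Callable, Optional

# Candidate configurations in the recommended escalation order.
# Keys and values are illustrative labels, not real launcher options.
CANDIDATES = [
    {"recompute": None, "offload": False},         # smallest parallelism that fits
    {"recompute": "selective", "offload": False},  # selective recompute first
    {"recompute": "full", "offload": False},       # full recompute as a fallback
    {"recompute": "full", "offload": True},        # offloading only as a last resort
]


def first_fitting(fits: Callable[[dict], bool]) -> Optional[dict]:
    """Return the first candidate for which the hypothetical `fits` check passes."""
    for candidate in CANDIDATES:
        if fits(candidate):
            return candidate
    return None  # nothing fits: revisit the parallel layout itself
```

Returning `None` signals that the memory measures alone are not enough and the model-parallel layout needs to grow.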

Recompute guidance

Prefer selective recompute for MoE runs:

good first choices: `layernorm`, `core_attn`, `moe_act`, `mlp`, or model-specific modules (`shared_experts`, `mla_up_proj`)

use full recompute only when the run still does not fit
revisit recompute after enabling CUDA graphs, because some graph scopes and full recompute paths do not mix well

As a rule of thumb, fine-grained recompute often recovers most of the needed memory while keeping throughput much closer to the non-recompute baseline than full-layer recompute does.
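To make the selective-recompute choice concrete, here is a minimal sketch of a helper that turns a module list into launcher arguments. The flag names `--recompute-granularity` and `--recompute-modules` are assumptions about the training launcher and do not appear in this document; check the launcher's `--help` output before relying on them. The allowed module names are the ones listed above.

```python
# Module names taken from the guidance above.
KNOWN_MODULES = {
    "layernorm", "core_attn", "moe_act", "mlp",
    "shared_experts", "mla_up_proj",
}


def selective_recompute_args(modules: list[str]) -> list[str]:
    """Build CLI arguments for selective recompute of the given modules.

    The flag names are assumed, not confirmed by this document.
    """
    unknown = [m for m in modules if m not in KNOWN_MODULES]
    if unknown:
        raise ValueError(f"unknown recompute modules: {unknown}")
    return [
        "--recompute-granularity", "selective",
        "--recompute-modules", *modules,
    ]


# Example: start with the cheap activations, add more only if the run still OOMs.
args = selective_recompute_args(["layernorm", "moe_act"])
```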

Phase 2: Choose Parallelism For Scale

Priority order:

Maximize DP once the model fits.
Keep the hot communication path inside the fast interconnect when possible.
Use PP, plus VPP if needed, for multi-node scaling.
Prefer EP over extra TP for expert layers.
Add CP for long context once sequence length makes attention memory dominant.

Parallel Folding

Parallel Folding decouples attention and MoE parallelism so you do not have to pick a single compromise layout:

Attention: TP × CP × DP × PP
MoE:       ETP × EP × EDP × PP

Key knobs:

`--expert-model-parallel-size` (EP degree)
`--expert-tensor-parallel-size` (ETP degree)

Use it when attention prefers some TP or CP, but expert layers benefit from a larger EP degree than the dense layers can tolerate.
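As a worked example of the two layouts, the sketch below derives the data-parallel degree implied on each side once the other degrees are fixed. It assumes both layouts span the same set of GPUs and share the same PP degree, as the two products suggest. The function and argument names are illustrative, not part of Megatron Core.

```python
def folded_layout(world_size: int, tp: int, cp: int, pp: int,
                  etp: int, ep: int) -> dict:
    """Derive DP (attention side) and EDP (MoE side) from the remaining degrees.

    Attention: TP x CP x DP x PP = world_size
    MoE:       ETP x EP x EDP x PP = world_size
    """
    attn_model_parallel = tp * cp * pp
    moe_model_parallel = etp * ep * pp
    if world_size % attn_model_parallel or world_size % moe_model_parallel:
        raise ValueError("world_size must be divisible by both layouts")
    return {
        "dp": world_size // attn_model_parallel,
        "edp": world_size // moe_model_parallel,
    }


# 256 GPUs: attention uses TP=2, CP=1; experts use ETP=1, EP=32; PP=4 is shared.
layout = folded_layout(256, tp=2, cp=1, pp=4, etp=1, ep=32)
# layout == {"dp": 32, "edp": 2}
```

In this example the attention layers keep a modest TP of 2 with 32-way data parallelism, while the expert layers spread across 32-way EP with only 2-way expert data parallelism.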

Phase 3: Profile The Dominant Bottleneck

| Bottleneck | What it looks like | Primary fixes |
| --- | --- | --- |
| Memory | Run fits only with aggressive full recompute or OOMs during warmup | selective recompute, FP8, offloading, better PP layout |
| Communication | Nsight shows large all-to-all or collective blocks | DeepEP or HybridEP, EP overlap, DP/TP overlap, better PP layout |
| Host overhead | GPU gaps, launch-bound traces, Python overhead | CUDA graphs, `--manual-gc`, higher MBS, CPU affinity tuning |
| Compute | Low SM utilization after comm and host issues are addressed | grouped GEMM, fusion work, FP8, dispatcher-specific kernel tuning |
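The table can be applied as a simple triage order: memory first, then communication, then host overhead, with compute as whatever remains. The sketch below encodes that order. The inputs (`oom`, `comm_frac`, `idle_frac`) and the 0.2 threshold are hypothetical and would come from your own profile summaries; only the bottleneck names and fixes come from the table.

```python
PRIMARY_FIXES = {
    "memory": ["selective recompute", "FP8", "offloading", "better PP layout"],
    "communication": ["DeepEP or HybridEP", "EP overlap", "DP/TP overlap",
                      "better PP layout"],
    "host overhead": ["CUDA graphs", "--manual-gc", "higher MBS",
                      "CPU affinity tuning"],
    "compute": ["grouped GEMM", "fusion work", "FP8",
                "dispatcher-specific kernel tuning"],
}


def dominant_bottleneck(oom: bool, comm_frac: float, idle_frac: float,
                        threshold: float = 0.2) -> str:
    """Pick the wall to work on next.

    comm_frac: share of step time in exposed all-to-all or collectives.
    idle_frac: share of step time where the GPU is idle (launch gaps).
    The threshold is an arbitrary illustrative cut-off.
    """
    if oom:
        return "memory"
    if comm_frac >= threshold:
        return "communication"
    if idle_frac >= threshold:
        return "host overhead"
    return "compute"


wall = dominant_bottleneck(oom=False, comm_frac=0.35, idle_frac=0.05)
fixes = PRIMARY_FIXES[wall]  # fixes for the communication row
```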

Dispatcher And Overlap Guidance

Use dispatcher choice as a bottleneck fix, not as the first tuning knob.

`moe_token_dispatcher_type="alltoall"`: safest bring-up path, fine for smaller EP sizes
`moe_token_dispatcher_type="flex"` with `moe_flex_dispatcher_backend="deepep"`: strong default for H100 and B200 style deployments
`moe_token_dispatcher_type="flex"` with `moe_flex_dispatcher_backend="hybridep"`: strongest starting point on GB200 or GB300 NVL72 systems

If the all-to-all path is visible in profiles, combine dispatcher tuning with:

`--overlap-moe-expert-parallel-comm`
`--overlap-grad-reduce`
`--tp-comm-overlap`
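The dispatcher guidance can be summarized as a small lookup. The sketch below maps a platform label to the two config fields named above and falls back to `alltoall` for anything unrecognized. The platform strings and the helper itself are illustrative assumptions; only the field names and values come from this section.

```python
def dispatcher_config(platform: str) -> dict:
    """Suggest a starting dispatcher config for a platform label (hypothetical helper)."""
    name = platform.strip().upper()
    if name in {"GB200", "GB300"}:
        return {
            "moe_token_dispatcher_type": "flex",
            "moe_flex_dispatcher_backend": "hybridep",
        }
    if name in {"H100", "B200"}:
        return {
            "moe_token_dispatcher_type": "flex",
            "moe_flex_dispatcher_backend": "deepep",
        }
    # Safest bring-up path for anything else, including small EP sizes.
    return {"moe_token_dispatcher_type": "alltoall"}


cfg = dispatcher_config("h100")
# cfg["moe_flex_dispatcher_backend"] == "deepep"
```

Keeping `alltoall` as the fallback mirrors the advice to treat the dispatcher as a bottleneck fix rather than the first knob to turn.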

FP8 Recipe Quick Decision

| Platform | Recommended starting recipe |
| --- | --- |
| Hopper | FP8 blockwise |
| Blackwell | MXFP8 |
| Blackwell, speed-first exploration | NVFP4 after the BF16 or FP8 path is stable |

Keep the router in FP32. The largest wins usually come from expert GEMMs and other heavy matrix math, not from trying to quantize every small MoE component.

CUDA Graphs For MoE

For dropless MoE, start with partial TE-scoped graphs:

`attn`
`moe_router`
`moe_preprocess`

That path usually gives a meaningful step-time win while keeping the dynamic expert work outside the graph. Expect a moderate speedup when launch overhead is visible, but budget several extra GB of memory and verify that shapes remain static.

Use full-iteration graphs only for graph-friendly workloads such as drop-and-pad or tightly controlled static-shape experiments.
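A minimal sketch of how the partial scopes might be passed to a launcher follows. The flag names `--cuda-graph-impl` and `--cuda-graph-scope`, and the value `transformer_engine`, are assumptions that this document does not confirm; verify them against your launcher's `--help` output. The scope names are the three listed above.

```python
# Partial scopes recommended above for dropless MoE.
DROPLESS_SCOPES = ("attn", "moe_router", "moe_preprocess")


def partial_graph_args(scopes=DROPLESS_SCOPES) -> list[str]:
    """Build CLI arguments for TE-scoped partial CUDA graphs.

    Flag names are assumed, not taken from this document.
    """
    unknown = [s for s in scopes if s not in DROPLESS_SCOPES]
    if unknown:
        raise ValueError(f"scope not in the dropless starting set: {unknown}")
    return [
        "--cuda-graph-impl", "transformer_engine",
        "--cuda-graph-scope", *scopes,
    ]


args = partial_graph_args()
```

The helper deliberately rejects anything outside the three starting scopes, so the dynamic expert work stays outside the graph.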

Related references:

@skills/perf-cuda-graphs/SKILL.md
@docs/training/cuda-graphs.md
@docs/training/activation-recomputation.md

Pitfalls

Do not optimize in the wrong order: fitting the model and selecting sane parallelism matter more than micro-optimizations.
Platform changes the limiting wall: H100-class runs often feel more communication-bound, while GB200 or GB300 runs often expose CPU or launch overhead earlier.
FP8 MFU can look misleadingly low: compare absolute throughput as well as MFU when switching precision modes.
CUDA graphs and recompute interact: TE-scoped graphs are usually paired with selective recompute, not blanket full recompute.
Parallel Folding is not optional at large scale: once attention and expert layers want clearly different layouts, a single shared TP or EP plan becomes a tax on both.
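The FP8 MFU pitfall is easiest to see with numbers. MFU divides achieved FLOP/s by the hardware peak for the precision in use, and the FP8 peak is higher than the BF16 peak, so a faster FP8 run can still report a lower MFU. The figures below are made-up round numbers for illustration, with the FP8 peak assumed to be twice the BF16 peak; they are not measurements.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved throughput over the precision's peak."""
    return achieved_tflops / peak_tflops


# Illustrative per-GPU numbers only (not measured).
BF16_PEAK = 1000.0   # assumed BF16 peak, TFLOP/s
FP8_PEAK = 2000.0    # assumed FP8 peak, twice the BF16 peak

bf16_achieved = 400.0
fp8_achieved = 520.0  # the FP8 run finishes each step faster

speedup = fp8_achieved / bf16_achieved
bf16_mfu = mfu(bf16_achieved, BF16_PEAK)
fp8_mfu = mfu(fp8_achieved, FP8_PEAK)

print(f"speedup {speedup:.2f}x, BF16 MFU {bf16_mfu:.0%}, FP8 MFU {fp8_mfu:.0%}")
```

With these numbers the FP8 run is 1.3x faster yet reports 26% MFU against 40% for BF16, which is why absolute throughput belongs next to MFU in any precision comparison.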
