perf-expert-parallel-overlap

MoE Expert-Parallel Overlap Skill

Stable docs: @docs/training/communication-overlap.md Card: @skills/perf-expert-parallel-overlap/card.yaml

References

Stable docs: @docs/training/communication-overlap.md
Structured metadata: @skills/perf-expert-parallel-overlap/card.yaml

What It Is

Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delayed expert weight-gradient computation (`delay_wgrad_compute`) provides additional overlap by deferring wgrad to overlap with the next layer's forward.

Bridge supports two dispatcher paths:

Dispatcher	Backend	When to use
`alltoall`	Standard MoE all-to-all	Default, broadest compatibility
`flex`	DeepEP or HybridEP	Higher overlap on Ampere/Hopper/Blackwell
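
As a rough intuition for why overlap helps, the sketch below models the time of one MoE layer with and without overlap. It is a toy model, not Bridge code: the function names, the `efficiency` parameter, and the millisecond figures are all illustrative assumptions, and real overlap is rarely perfect.

```python
def serial_time(comm_ms: float, compute_ms: float) -> float:
    """Communication and compute run back to back."""
    return comm_ms + compute_ms


def overlapped_time(comm_ms: float, compute_ms: float, efficiency: float = 1.0) -> float:
    """Hide a fraction `efficiency` (0..1) of the shorter phase behind the longer one."""
    hidden = efficiency * min(comm_ms, compute_ms)
    return comm_ms + compute_ms - hidden


# Illustrative numbers only (not measurements).
comm_ms, compute_ms = 4.0, 10.0
print(serial_time(comm_ms, compute_ms))           # 14.0
print(overlapped_time(comm_ms, compute_ms))       # 10.0
print(overlapped_time(comm_ms, compute_ms, 0.5))  # 12.0
```

In this model the best case is bounded by the longer of the two phases, which is why a communication-light run sees little benefit.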

Quick Decision

Use EP overlap when:

the model is MoE with `EP > 1`
expert dispatch/combine communication is a meaningful part of step time
you have memory headroom and are tuning for throughput

Prefer:

`alltoall` dispatcher for the first rollout (broader compatibility)
`flex` + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

full activation recompute is enabled
`moe_shared_expert_overlap` is enabled
the run is still being brought up for correctness
PyTorch < 2.6.0

Expected outcome:

if all-to-all dispatch is a clear profile bottleneck, overlap can produce a modest to meaningful speedup
if the run is tiny, communication-light, or dominated by another wall, the gain may be negligible
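
The checklist above can be turned into a small pre-flight helper. This is a minimal sketch, not part of Bridge: `RunProfile`, its field names, and the 5% threshold for "meaningful" all-to-all time are assumptions made for illustration, and memory headroom is left out because it depends on the workload.

```python
from dataclasses import dataclass


@dataclass
class RunProfile:
    """Hypothetical summary of a run; every field name here is illustrative."""
    expert_parallel_size: int
    a2a_fraction_of_step: float      # share of step time in dispatch/combine, from a profile
    full_recompute: bool
    shared_expert_overlap: bool
    bringup_for_correctness: bool
    torch_version: tuple             # e.g. (2, 6, 0)


def ep_overlap_blockers(p: RunProfile, min_a2a_fraction: float = 0.05) -> list:
    """Return reasons to skip EP overlap; an empty list means it is worth trying."""
    reasons = []
    if p.expert_parallel_size <= 1:
        reasons.append("EP must be > 1")
    if p.full_recompute:
        reasons.append("full activation recompute is enabled")
    if p.shared_expert_overlap:
        reasons.append("moe_shared_expert_overlap is enabled")
    if p.bringup_for_correctness:
        reasons.append("run is still being brought up for correctness")
    if p.torch_version < (2, 6, 0):
        reasons.append("PyTorch is older than 2.6.0")
    if p.a2a_fraction_of_step < min_a2a_fraction:
        reasons.append("all-to-all is not a meaningful part of step time")
    return reasons


profile = RunProfile(
    expert_parallel_size=8,
    a2a_fraction_of_step=0.20,
    full_recompute=False,
    shared_expert_overlap=False,
    bringup_for_correctness=False,
    torch_version=(2, 6, 0),
)
print(ep_overlap_blockers(profile) or "no blockers: start with the alltoall dispatcher")
```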

Enablement

alltoall dispatcher

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```

flex dispatcher (DeepEP or HybridEP)

```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```

Compatibility And Constraints

`expert_model_parallel_size > 1`
`num_moe_experts > 1`
`moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
`moe_shared_expert_overlap = False`
Base precision is BF16 or FP16
PyTorch `>= 2.6.0`
If `PP > 1`, `virtual_pipeline_model_parallel_size` must be set
`recompute_granularity != "full"`, `recompute_method = None`, `recompute_num_layers = None`
`mtp_num_layers` must be `None` or `1`
`delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a prerequisite
`delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
`delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
CUDA graph `attn` scope + `delay_wgrad_compute` requires TE >= 2.12.0, `gradient_accumulation_fusion = True`, and no attention bias
DeepEP: Ampere, Hopper, B200, B300 GPUs only
HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
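
The delayed-wgrad rows are the easiest to get wrong because they depend on the Transformer Engine (TE) version. The sketch below restates only those rows as a standalone check. It is not the Bridge validation code: `delayed_wgrad_issues` and its keyword names are hypothetical, and `parse_version` assumes a plain dotted version such as `2.7.0` with no suffix.

```python
def parse_version(text: str) -> tuple:
    """Turn '2.7.0' into (2, 7, 0); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def delayed_wgrad_issues(
    *,
    ep_overlap: bool,
    te_version: str,
    overlap_grad_reduce: bool = False,
    gradient_accumulation_fusion: bool = False,
    cuda_graph_attn_scope: bool = False,
    attention_bias: bool = False,
) -> list:
    """List the constraints from this section that delay_wgrad_compute would violate."""
    te = parse_version(te_version)
    issues = []
    if not ep_overlap:
        issues.append("requires overlap_moe_expert_parallel_comm")
    if overlap_grad_reduce and te < (2, 7, 0):
        issues.append("overlap_grad_reduce needs TE >= 2.7.0")
    if gradient_accumulation_fusion and te < (2, 7, 0):
        issues.append("gradient_accumulation_fusion needs TE >= 2.7.0")
    if cuda_graph_attn_scope:
        if te < (2, 12, 0):
            issues.append("CUDA graph attn scope needs TE >= 2.12.0")
        if not gradient_accumulation_fusion:
            issues.append("CUDA graph attn scope needs gradient_accumulation_fusion = True")
        if attention_bias:
            issues.append("CUDA graph attn scope needs no attention bias")
    return issues


print(delayed_wgrad_issues(ep_overlap=True, te_version="2.6.0", overlap_grad_reduce=True))
```

Comparing version tuples rather than strings avoids the trap where `"2.12.0" < "2.7.0"` is true lexicographically.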

Minimal Working Config

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
```

Use this as the correctness-first starting point. Add delayed wgrad, flex dispatch, and CUDA-graph interactions only after the plain overlap path is known to work.

Measured Short-Run Evidence

A 2026-05-18 current-main H100 x16 Qwen3 30B-A3B mock pretraining run used `EP=16`, `alltoall`, BF16, global batch size 1024, CUDA graphs disabled, and `moe_permute_fusion=false`. With iterations 3-8 as the steady window:

Case	Steady mean	Relative
no EP overlap	41.25s	1.000x
EP overlap	31.31s	1.317x
EP overlap plus `delay_wgrad_compute`	31.20s	1.322x

This is evidence for enabling plain EP overlap on this inter-node all-to-all shape. It does not show a meaningful independent win from delayed wgrad, and it does not validate fused MoE permutation because that path was disabled for the runtime stack.
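
The relative figures in the table can be rechecked with a few lines of arithmetic. The snippet uses only the three steady means reported above; the variable names are illustrative.

```python
steady_mean_s = {
    "no EP overlap": 41.25,
    "EP overlap": 31.31,
    "EP overlap plus delay_wgrad_compute": 31.20,
}

baseline = steady_mean_s["no EP overlap"]
for case, seconds in steady_mean_s.items():
    speedup = baseline / seconds
    saved_pct = 100.0 * (baseline - seconds) / baseline
    print(f"{case}: {speedup:.3f}x, {saved_pct:.1f}% less steady-state time")

# Incremental effect of delayed wgrad on top of plain EP overlap.
incremental = steady_mean_s["EP overlap"] / steady_mean_s["EP overlap plus delay_wgrad_compute"]
print(f"delayed wgrad on top of EP overlap: {incremental:.4f}x")
```

The last ratio works out to roughly 1.0035x, a difference of about 0.35%, which is why the table is not read as an independent win for delayed wgrad.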

Minimal Runnable Command

Performance harness example inside a Slurm allocation. Keep the model, parallelism, dispatcher, and runtime fixed, and vary only the two overlap overrides:

```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
```

Do not use `--moe_a2a_overlap true` when separating plain EP overlap from delayed wgrad: the performance harness helper enables both `overlap_moe_expert_parallel_comm` and `delay_wgrad_compute`.
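
To keep a three-arm comparison honest, it can help to generate the trailing overrides rather than edit them by hand. The helper below is a hypothetical convenience, not part of the performance harness; it only emits the three `key=value` overrides already shown in the command above.

```python
def overlap_overrides(ep_overlap: bool, delay_wgrad: bool) -> list:
    """Build the trailing key=value overrides for one experiment arm."""
    if delay_wgrad and not ep_overlap:
        raise ValueError("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")

    def flag(value: bool) -> str:
        return "true" if value else "false"

    return [
        f"comm_overlap.overlap_moe_expert_parallel_comm={flag(ep_overlap)}",
        f"comm_overlap.delay_wgrad_compute={flag(delay_wgrad)}",
        "model.moe_shared_expert_overlap=false",
    ]


arms = {
    "no EP overlap": overlap_overrides(False, False),
    "EP overlap": overlap_overrides(True, False),
    "EP overlap plus delay_wgrad_compute": overlap_overrides(True, True),
}
for name, overrides in arms.items():
    print(f"{name}: {' '.join(overrides)}")
```

Every arm keeps `--moe_a2a_overlap false` on the command line, so the only difference between runs is the pair of overlap overrides.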

Unit test verification:

```bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q
```

Verification

Unit tests

```bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q
```

Log checks

After a successful run with EP overlap:

Confirm no assertion errors during `CommOverlapConfig` finalization
Confirm `overlap_moe_expert_parallel_comm` appears as `True` in the logged config
If using flex dispatcher, confirm `moe_token_dispatcher_type = "flex"` and the correct backend in logs

Success criteria

Config validation passes for the selected dispatcher and overlap settings
Training runs complete without hangs or assertion failures
Throughput improves or at least does not regress for the target workload
Loss trajectory matches baseline (overlap should not affect convergence)
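
The last two criteria can be checked mechanically once per-step losses and steady step times have been collected from both runs. This is a minimal sketch under stated assumptions: the function names and the 1% relative tolerance are illustrative choices, the loss values are made up, and only the step times come from the Measured Short-Run Evidence table.

```python
import math


def losses_match(baseline, candidate, rel_tol: float = 1e-2) -> bool:
    """True when both runs logged the same number of steps and each loss agrees within rel_tol."""
    if len(baseline) != len(candidate):
        return False
    return all(math.isclose(b, c, rel_tol=rel_tol) for b, c in zip(baseline, candidate))


def throughput_ok(baseline_step_s: float, candidate_step_s: float, max_regression: float = 0.0) -> bool:
    """True when the candidate step time is no slower than baseline, within an allowed regression."""
    return candidate_step_s <= baseline_step_s * (1.0 + max_regression)


# Illustrative loss values, not measurements.
baseline_losses = [10.50, 9.80, 9.10]
overlap_losses = [10.50, 9.81, 9.09]

print(losses_match(baseline_losses, overlap_losses))  # True
print(throughput_ok(41.25, 31.31))                    # True
```

The tolerance should be tuned to the run length and precision in use; a short mock run supports a much weaker convergence claim than a long one.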

Code Anchors

Bridge overlap validation

```python
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...
```

Delayed wgrad validation

```python
if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
```

Flex-dispatcher activation

```python
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False
```

Perf harness override

```python
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
```

Tests

File	Coverage
`tests/unit_tests/training/test_comm_overlap.py`	EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction
`tests/unit_tests/training/test_deepep.py`	DeepEP/HybridEP helper activation and GPU gating

Failure Diagnosis

Symptom	Likely Cause	How To Confirm	Fix
assert `expert_model_parallel_size > 1`	EP not configured	Check `expert_model_parallel_size`	Set EP > 1
assert `moe_token_dispatcher_type`	Wrong dispatcher	Check dispatcher type	Use `"alltoall"` or `"flex"`
assert on BF16/FP16	Wrong precision	Check `bf16` and `fp16`	Set `bf16 = True`
hang during training	PyTorch < 2.6	Check PyTorch version	Upgrade to >= 2.6.0
assert `virtual_pipeline_model_parallel_size`	PP > 1 without VPP	Check PP and VPP config	Set VPP when PP > 1
assert `recompute_granularity`	Full recompute enabled	Check recompute settings	Disable full recompute
assert `overlap_moe_expert_parallel_comm required`	delayed wgrad without EP overlap	Check `delay_wgrad_compute` without overlap	Enable EP overlap first
assert `gradient_accumulation_fusion`	CUDA graph + delayed wgrad	Check graph scope + wgrad settings	Enable `gradient_accumulation_fusion`
assert on attention bias	CUDA graph attn + delayed wgrad + bias	Check `add_bias_linear` / `add_qkv_bias`	Disable attention bias
no throughput gain from flex dispatcher	`apply_flex_dispatcher_backend` not called	Check `moe_token_dispatcher_type` in logs	Call `apply_flex_dispatcher_backend(...)`
DeepEP/HybridEP silently skipped	Unsupported GPU	Check warning logs	Run on Ampere/Hopper/Blackwell

Known Limitations

Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch; you must call `apply_flex_dispatcher_backend(...)`.
Public recipes are often conservative and leave MoE overlap disabled by default.
End-to-end throughput gains have not yet been measured in a controlled Bridge experiment for every model family. Code validation is stronger than a single universal performance claim.
MoE overlap and shared-expert overlap are mutually exclusive.
CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.
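
The first limitation is easy to reproduce on a stub config. The sketch below does not import Bridge: `apply_flex_dispatcher_backend_standin` is a hypothetical stand-in that copies only the three assignments shown in the Flex-dispatcher activation excerpt under Code Anchors (the real helper also performs a GPU architecture check), and `SimpleNamespace` stands in for the model config.

```python
from types import SimpleNamespace


def apply_flex_dispatcher_backend_standin(model_config, moe_flex_dispatcher_backend):
    """Stand-in mirroring the three assignments in the excerpt; not the real helper."""
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False


cfg_model = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend=None,
    moe_shared_expert_overlap=True,
)

# Setting the backend field alone leaves the dispatcher type untouched.
cfg_model.moe_flex_dispatcher_backend = "deepep"
print(cfg_model.moe_token_dispatcher_type)  # alltoall

# Going through the helper switches the dispatcher and disables shared-expert overlap.
apply_flex_dispatcher_backend_standin(cfg_model, "deepep")
print(cfg_model.moe_token_dispatcher_type)  # flex
```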
