perf-expert-parallel-overlap


MoE Expert-Parallel Overlap Skill


Stable docs: @docs/training/communication-overlap.md
Card: @skills/perf-expert-parallel-overlap/card.yaml

References


  • Stable docs: @docs/training/communication-overlap.md
  • Structured metadata: @skills/perf-expert-parallel-overlap/card.yaml

What It Is


Expert-parallel (EP) overlap hides the cost of the token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delayed expert weight-gradient computation (delay_wgrad_compute) provides additional overlap by deferring the weight-gradient (wgrad) computation so that it overlaps with the next layer's forward pass.

Bridge supports two dispatcher paths:

| Dispatcher | Backend | When to use |
|---|---|---|
| alltoall | Standard MoE all-to-all | Default, broadest compatibility |
| flex | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |
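
As a rough mental model of why overlap helps, the sketch below compares an idealized layer with and without overlap. It is illustrative only and not Bridge code: the function name `layer_time` and the timings are hypothetical, and it assumes perfect overlap, which real runs will not reach.

```python
def layer_time(comm_ms: float, compute_ms: float, overlap: bool) -> float:
    """Idealized per-layer time for the all-to-all plus expert FFN compute.

    Assumes perfect overlap when enabled, so the result is a best-case bound.
    """
    if overlap:
        # Communication and compute run concurrently; the longer one dominates.
        return max(comm_ms, compute_ms)
    # Without overlap the two phases run back to back.
    return comm_ms + compute_ms


# Hypothetical timings chosen only to illustrate the model.
serial = layer_time(comm_ms=4.0, compute_ms=10.0, overlap=False)
overlapped = layer_time(comm_ms=4.0, compute_ms=10.0, overlap=True)
print(f"serial={serial} ms, overlapped={overlapped} ms, "
      f"best-case speedup={serial / overlapped:.2f}x")
```

The model also shows why communication-light runs gain little: when comm_ms is small relative to compute_ms, the sum and the maximum are nearly equal.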

Quick Decision


Use EP overlap when:
  • the model is MoE with EP > 1
  • expert dispatch/combine communication is a meaningful part of step time
  • you have memory headroom and are tuning for throughput
Prefer:
  • the alltoall dispatcher for the first rollout (broader compatibility)
  • flex + DeepEP/HybridEP when running on supported GPUs and seeking additional gains
Avoid EP overlap when:
  • full activation recompute is enabled
  • moe_shared_expert_overlap is enabled
  • the run is still being brought up for correctness
  • PyTorch < 2.6.0
Expected outcome:
  • if all-to-all dispatch is a clear bottleneck in the profile, overlap can produce a modest to meaningful speedup
  • if the run is tiny, communication-light, or dominated by another bottleneck, the gain may be negligible
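
The decision rules above can be written down as a small helper. This is a minimal sketch, not part of Bridge: `RunProfile`, its field names, `recommend_ep_overlap`, and the 5% communication threshold are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunProfile:
    """Hypothetical run summary; field names are illustrative, not Bridge config keys."""

    ep_size: int                 # expert-parallel size (EP)
    a2a_fraction: float          # share of step time spent in dispatch/combine all-to-all
    has_memory_headroom: bool
    full_recompute: bool
    shared_expert_overlap: bool
    still_in_bringup: bool
    torch_version: tuple[int, int, int]


def recommend_ep_overlap(p: RunProfile, min_a2a_fraction: float = 0.05) -> tuple[bool, str]:
    """Return (recommend, reason). The 5% default threshold is an arbitrary placeholder."""
    if p.ep_size <= 1:
        return False, "EP must be > 1"
    if p.torch_version < (2, 6, 0):
        return False, "PyTorch < 2.6.0"
    if p.full_recompute:
        return False, "full activation recompute is enabled"
    if p.shared_expert_overlap:
        return False, "moe_shared_expert_overlap is enabled"
    if p.still_in_bringup:
        return False, "run is still being brought up for correctness"
    if not p.has_memory_headroom:
        return False, "no memory headroom"
    if p.a2a_fraction < min_a2a_fraction:
        return False, "all-to-all is not a meaningful part of step time"
    return True, "start with the alltoall dispatcher"


profile = RunProfile(ep_size=8, a2a_fraction=0.2, has_memory_headroom=True,
                     full_recompute=False, shared_expert_overlap=False,
                     still_in_bringup=False, torch_version=(2, 6, 0))
print(recommend_ep_overlap(profile))
```

Returning the first blocking reason keeps the output actionable; the share of step time spent in all-to-all would come from a profile of the actual run.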

Enablement


alltoall dispatcher


python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False

flex dispatcher (DeepEP or HybridEP)


python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")

To select HybridEP instead, call apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep").

Compatibility And Constraints


  • expert_model_parallel_size > 1
  • num_moe_experts > 1
  • moe_token_dispatcher_type must be "alltoall" or "flex"
  • moe_shared_expert_overlap = False
  • Base precision is BF16 or FP16
  • PyTorch >= 2.6.0
  • If PP > 1, virtual_pipeline_model_parallel_size must be set
  • recompute_granularity != "full", recompute_method = None, recompute_num_layers = None
  • mtp_num_layers must be None or 1
  • delay_wgrad_compute requires overlap_moe_expert_parallel_comm as a prerequisite
  • delay_wgrad_compute with overlap_grad_reduce requires TE >= 2.7.0
  • delay_wgrad_compute with gradient_accumulation_fusion requires TE >= 2.7.0
  • CUDA graph attn scope + delay_wgrad_compute requires TE >= 2.12.0, gradient_accumulation_fusion = True, and no attention bias
  • DeepEP: Ampere, Hopper, B200, B300 GPUs only
  • HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
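
Most of the constraints above can be checked before launch with a small preflight function. The sketch below is an assumption-laden illustration, not the Bridge validation code: `check_ep_overlap` is a hypothetical helper, `pipeline_model_parallel_size` is an assumed attribute name standing in for PP, versions are passed as integer tuples, and the CUDA-graph and GPU-architecture constraints are left out.

```python
from types import SimpleNamespace


def check_ep_overlap(model, comm, torch_version, te_version, overlap_grad_reduce=False):
    """Return the list of violated constraints; an empty list means these checks pass.

    `model` and `comm` are any objects exposing the attribute names used in the
    list above. CUDA-graph and GPU-architecture constraints are not covered.
    """
    problems = []
    if comm.delay_wgrad_compute and not comm.overlap_moe_expert_parallel_comm:
        problems.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    if not comm.overlap_moe_expert_parallel_comm:
        return problems  # the remaining constraints only apply when EP overlap is on
    if model.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if model.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if model.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if model.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (model.bf16 or model.fp16):
        problems.append("base precision must be BF16 or FP16")
    if torch_version < (2, 6, 0):
        problems.append("PyTorch must be >= 2.6.0")
    if model.pipeline_model_parallel_size > 1 and model.virtual_pipeline_model_parallel_size is None:
        problems.append("virtual_pipeline_model_parallel_size must be set when PP > 1")
    if (model.recompute_granularity == "full" or model.recompute_method is not None
            or model.recompute_num_layers is not None):
        problems.append("recompute must not be full, with method and num_layers unset")
    if model.mtp_num_layers not in (None, 1):
        problems.append("mtp_num_layers must be None or 1")
    needs_te_27 = overlap_grad_reduce or model.gradient_accumulation_fusion
    if comm.delay_wgrad_compute and needs_te_27 and te_version < (2, 7, 0):
        problems.append("delay_wgrad_compute with this setup requires TE >= 2.7.0")
    return problems


# Stand-in config objects; a real run would pass the actual config sections.
model = SimpleNamespace(
    expert_model_parallel_size=4, num_moe_experts=64, moe_token_dispatcher_type="alltoall",
    moe_shared_expert_overlap=False, bf16=True, fp16=False,
    pipeline_model_parallel_size=1, virtual_pipeline_model_parallel_size=None,
    recompute_granularity=None, recompute_method=None, recompute_num_layers=None,
    mtp_num_layers=None, gradient_accumulation_fusion=False,
)
comm = SimpleNamespace(overlap_moe_expert_parallel_comm=True, delay_wgrad_compute=False)
print(check_ep_overlap(model, comm, torch_version=(2, 6, 0), te_version=(2, 7, 0)))
```

Collecting every violation rather than stopping at the first one makes it easier to fix a config in a single pass; the real validation uses assertions and will stop at the first failure.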

Minimal Working Config


python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
Use this as the correctness-first starting point. Add delayed wgrad, flex dispatch, and CUDA-graph interactions only after the plain overlap path is known to work.

Measured Short-Run Evidence


A 2026-05-18 current-main H100 x16 Qwen3 30B-A3B mock pretraining run used EP=16, alltoall, BF16, global batch size 1024, CUDA graphs disabled, and moe_permute_fusion=false. With iterations 3-8 as the steady window:

| Case | Steady mean (s) | Relative speedup |
|---|---|---|
| no EP overlap | 41.25 | 1.000x |
| EP overlap | 31.31 | 1.317x |
| EP overlap plus delay_wgrad_compute | 31.20 | 1.322x |

This is evidence for enabling plain EP overlap on this inter-node all-to-all shape. It does not show a meaningful independent win from delayed wgrad, and it does not validate fused MoE permutation because that path was disabled for the runtime stack.
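
The relative column can be reproduced from the steady means, reading "relative" as baseline time divided by case time (an interpretation that matches the reported figures). The snippet uses only the three numbers reported above; the variable names are illustrative.

```python
# Steady-window means in seconds, as reported above.
steady_mean_s = {
    "no EP overlap": 41.25,
    "EP overlap": 31.31,
    "EP overlap + delay_wgrad_compute": 31.20,
}

baseline = steady_mean_s["no EP overlap"]

# Relative speedup = baseline time / case time.
speedup = {case: round(baseline / t, 3) for case, t in steady_mean_s.items()}

# Incremental gain of delayed wgrad on top of plain EP overlap, in percent.
wgrad_gain_pct = (
    steady_mean_s["EP overlap"] / steady_mean_s["EP overlap + delay_wgrad_compute"] - 1
) * 100

for case, s in speedup.items():
    print(f"{case}: {s:.3f}x")
print(f"delayed wgrad on top of EP overlap: {wgrad_gain_pct:.2f}%")
```

The incremental figure for delayed wgrad comes out well under one percent, which is why the run is not treated as evidence of an independent win from it.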

Minimal Runnable Command


Performance harness example inside a Slurm allocation. Keep the model, parallelism, dispatcher, and runtime fixed, and vary only the two overlap overrides:
bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
Do not use --moe_a2a_overlap true when separating plain EP overlap from delayed wgrad: the performance harness helper enables both overlap_moe_expert_parallel_comm and delay_wgrad_compute.
Unit test verification:
bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q
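
For the three-way comparison this section describes, it can help to generate the commands from one fixed base so that only the two overlap overrides differ. The sketch below reuses the harness flags exactly as written in the example; the variant labels and the `build_command` helper are hypothetical.

```python
from __future__ import annotations

import shlex

# Fixed part of the command, copied from the harness example in this section.
BASE = [
    "uv", "run", "python", "scripts/performance/run_script.py",
    "-m", "qwen", "-mr", "qwen3_30b_a3b", "--task", "pretrain",
    "-g", "h100", "-c", "bf16", "-ng", "16", "-gn", "8",
    "--max_steps", "8", "--config_variant", "v1",
    "--cuda_graph_impl", "none",
    "--moe_flex_dispatcher_backend", "None",
    "--moe_a2a_overlap", "false",
    "--tokenizer_type", "NullTokenizer",
]

# Variant labels are hypothetical; values are (EP overlap, delayed wgrad).
VARIANTS = {
    "baseline": (False, False),
    "ep_overlap": (True, False),
    "ep_overlap_delay_wgrad": (True, True),
}


def build_command(ep_overlap: bool, delay_wgrad: bool) -> list[str]:
    """Append the two overlap overrides (and the fixed shared-expert setting) to BASE."""
    overrides = [
        f"comm_overlap.overlap_moe_expert_parallel_comm={str(ep_overlap).lower()}",
        f"comm_overlap.delay_wgrad_compute={str(delay_wgrad).lower()}",
        "model.moe_shared_expert_overlap=false",
    ]
    return BASE + overrides


for name, flags in VARIANTS.items():
    print(f"# {name}")
    print(shlex.join(build_command(*flags)))
```

Keeping --moe_a2a_overlap false in the base and toggling the overrides directly avoids the helper behaviour noted above, where the flag turns on both settings at once.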

Verification


Unit tests


bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q

Log checks


After a successful run with EP overlap:
  1. Confirm no assertion errors during CommOverlapConfig finalization
  2. Confirm overlap_moe_expert_parallel_comm appears as True in the logged config
  3. If using flex dispatcher, confirm moe_token_dispatcher_type = "flex" and the correct backend in logs
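
These checks can be scripted against a captured log. The sketch below is a rough illustration under a strong assumption: that the logged config prints settings as `key: value` or `key = value` pairs. The actual log layout is not specified here, so the regular expression, the helper names, and the sample text (a made-up placeholder, not real Bridge output) would need adapting.

```python
from __future__ import annotations

import re


def logged_values(log_text: str, key: str) -> list[str]:
    """Find values printed as `key: value` or `key = value` (assumed log format)."""
    pattern = rf"(?<!\w){re.escape(key)}\s*[:=]\s*['\"]?(\w+)"
    return re.findall(pattern, log_text)


def check_run_log(log_text: str, expect_flex: bool = False) -> dict[str, bool]:
    """Run the log checks listed above and report each one as pass/fail."""
    checks = {
        "no_assertion_error": "AssertionError" not in log_text,
        "ep_overlap_true": "True" in logged_values(log_text, "overlap_moe_expert_parallel_comm"),
    }
    if expect_flex:
        checks["dispatcher_is_flex"] = "flex" in logged_values(
            log_text, "moe_token_dispatcher_type"
        )
    return checks


# Placeholder text in a hypothetical format, used only to exercise the helpers.
sample_log = """\
comm_overlap:
  overlap_moe_expert_parallel_comm: True
  delay_wgrad_compute: False
model:
  moe_token_dispatcher_type: "flex"
  moe_flex_dispatcher_backend: "deepep"
"""

print(check_run_log(sample_log, expect_flex=True))
```

The backend check is left to the reader because the expected value (deepep or hybridep) depends on the run.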

Success criteria


  • Config validation passes for the selected dispatcher and overlap settings
  • Training runs complete without hangs or assertion failures
  • Throughput improves or at least does not regress for the target workload
  • Loss trajectory matches baseline (overlap should not affect convergence)
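
The last two criteria lend themselves to a simple numeric comparison between a baseline run and an overlap run. This is a minimal sketch: the helper names, the 1% relative tolerance, and the loss values are hypothetical placeholders, and an appropriate tolerance depends on precision and workload. The step times in the example are the steady means reported in the measured evidence above.

```python
import math


def losses_match(baseline, candidate, rel_tol=1e-2):
    """True if both runs logged the same number of losses and each pair is close.

    The 1% relative tolerance is an assumed placeholder, not a Bridge default.
    """
    if len(baseline) != len(candidate):
        return False
    return all(math.isclose(b, c, rel_tol=rel_tol) for b, c in zip(baseline, candidate))


def throughput_ok(baseline_step_s, candidate_step_s, max_regression=0.0):
    """True if the candidate step time does not regress beyond the allowed fraction."""
    return candidate_step_s <= baseline_step_s * (1.0 + max_regression)


# Loss values are made up for illustration; step times are the reported steady means.
baseline_loss = [10.80, 10.41, 9.97, 9.62]
overlap_loss = [10.80, 10.42, 9.96, 9.62]

print("loss matches baseline:", losses_match(baseline_loss, overlap_loss))
print("throughput ok:", throughput_ok(baseline_step_s=41.25, candidate_step_s=31.31))
```

A tolerance is used instead of exact equality because overlap can change execution order, so bitwise-identical losses may not be a realistic expectation.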

Code Anchors


Bridge overlap validation


Line 470:
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

Delayed wgrad validation


Line 507:
if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...

Flex-dispatcher activation


Line 27:
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False

Perf harness override


Line 149:
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False

Tests


| File | Coverage |
|---|---|
| tests/unit_tests/training/test_comm_overlap.py | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| tests/unit_tests/training/test_deepep.py | DeepEP/HybridEP helper activation and GPU gating |

Failure Diagnosis


| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| assert expert_model_parallel_size > 1 | EP not configured | Check expert_model_parallel_size | Set EP > 1 |
| assert moe_token_dispatcher_type | Wrong dispatcher | Check dispatcher type | Use "alltoall" or "flex" |
| assert on BF16/FP16 | Wrong precision | Check bf16 and fp16 | Set bf16 = True |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| assert virtual_pipeline_model_parallel_size | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert recompute_granularity | Full recompute enabled | Check recompute settings | Disable full recompute |
| assert overlap_moe_expert_parallel_comm required | Delayed wgrad without EP overlap | Check whether delay_wgrad_compute is set without overlap_moe_expert_parallel_comm | Enable EP overlap first |
| assert gradient_accumulation_fusion | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable gradient_accumulation_fusion |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check add_bias_linear / add_qkv_bias | Disable attention bias |
| no throughput gain from flex dispatcher | apply_flex_dispatcher_backend not called | Check moe_token_dispatcher_type in logs | Call apply_flex_dispatcher_backend(...) |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |

Known Limitations


  • Setting moe_flex_dispatcher_backend alone does not activate flex dispatch; you must call apply_flex_dispatcher_backend(...).
  • Public recipes are often conservative and leave MoE overlap disabled by default.
  • End-to-end throughput gains have not yet been measured in a controlled Bridge experiment for every model family. Code validation is stronger than a single universal performance claim.
  • MoE overlap and shared-expert overlap are mutually exclusive.
  • CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.