MoE Expert-Parallel Overlap Skill
References
- Stable docs: @docs/training/communication-overlap.md
- Structured metadata: @skills/perf-expert-parallel-overlap/card.yaml
What It Is
Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all
communication by running it concurrently with expert FFN compute. Optionally,
delayed expert weight-gradient computation (`delay_wgrad_compute`) provides
additional overlap by deferring wgrad to overlap with the next layer's forward.
Bridge supports two dispatcher paths:
| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |
Quick Decision
Use EP overlap when:
- the model is MoE with `expert_model_parallel_size > 1`
- expert dispatch/combine communication is a meaningful part of step time
- you have memory headroom and are tuning for throughput
Prefer:
- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking
additional gains
Avoid EP overlap when:
- full activation recompute is enabled
- `moe_shared_expert_overlap` is enabled
- the run is still being brought up for correctness
- PyTorch < 2.6.0
Expected outcome:
- if all-to-all dispatch is a clear profile bottleneck, overlap can produce a
modest to meaningful speedup
- if the run is tiny, communication-light, or dominated by another wall, the
gain may be negligible
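To make the expected outcome concrete, the sketch below models a training step as compute time plus exposed all-to-all time and assumes overlap can hide communication only up to the available compute time. The function name, the `hidden_fraction` parameter, and the model itself are illustrative assumptions, not part of Bridge.

```python
def estimated_speedup(compute_s: float, a2a_s: float, hidden_fraction: float = 1.0) -> float:
    """Idealized step-time speedup from hiding all-to-all behind compute.

    Assumptions (not measured behavior): the baseline step is fully serialized,
    and at most `compute_s` seconds of communication can be hidden.
    """
    if compute_s <= 0 or a2a_s < 0:
        raise ValueError("compute_s must be positive and a2a_s non-negative")
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must be in [0, 1]")
    baseline = compute_s + a2a_s
    hidden = min(a2a_s * hidden_fraction, compute_s)
    return baseline / (baseline - hidden)


# Communication-heavy step versus a communication-light step.
print(f"{estimated_speedup(30.0, 10.0):.3f}x")
print(f"{estimated_speedup(30.0, 0.5):.3f}x")
```

Under this model a communication-light run gains almost nothing, which matches the guidance above. Real gains depend on how much of the all-to-all the scheduler actually hides.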
Enablement
alltoall dispatcher
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```
flex dispatcher (DeepEP or HybridEP)
```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```
Compatibility And Constraints
- `expert_model_parallel_size > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- Base precision is BF16 or FP16
- PyTorch >= 2.6.0
- If pipeline parallelism is used (PP > 1), `virtual_pipeline_model_parallel_size`
must be set
- `recompute_granularity != "full"` and `recompute_num_layers = None`
- must be or
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a
prerequisite
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- CUDA graph scope + `delay_wgrad_compute` requires TE >= 2.12.0,
`gradient_accumulation_fusion = True`, and no attention bias
- DeepEP: Ampere, Hopper, B200, B300 GPUs only
- HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
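A subset of the constraints above can be checked before launch with a small pre-flight helper. The sketch below is self-contained and does not import Bridge: `OverlapSettings`, `find_violations`, `pipeline_parallel_size`, and `torch_version` are hypothetical names introduced for illustration, and the TE-version, CUDA-graph, and GPU-architecture constraints are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OverlapSettings:
    """Hypothetical stand-in for the config fields named in the list above."""

    overlap_moe_expert_parallel_comm: bool = True
    delay_wgrad_compute: bool = False
    expert_model_parallel_size: int = 1
    moe_token_dispatcher_type: str = "alltoall"
    moe_shared_expert_overlap: bool = False
    bf16: bool = True
    fp16: bool = False
    pipeline_parallel_size: int = 1  # assumed field name for PP
    virtual_pipeline_model_parallel_size: Optional[int] = None
    recompute_granularity: Optional[str] = None
    torch_version: Tuple[int, int, int] = (2, 6, 0)  # assumed to be supplied by the caller


def find_violations(s: OverlapSettings) -> List[str]:
    """Return human-readable violations; an empty list means none were found."""
    problems: List[str] = []
    if s.delay_wgrad_compute and not s.overlap_moe_expert_parallel_comm:
        problems.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    if not s.overlap_moe_expert_parallel_comm:
        return problems  # the remaining checks only apply when EP overlap is on
    if s.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if s.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append("moe_token_dispatcher_type must be 'alltoall' or 'flex'")
    if s.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (s.bf16 or s.fp16):
        problems.append("base precision must be BF16 or FP16")
    if s.torch_version < (2, 6, 0):
        problems.append("PyTorch must be >= 2.6.0")
    if s.pipeline_parallel_size > 1 and s.virtual_pipeline_model_parallel_size is None:
        problems.append("virtual_pipeline_model_parallel_size must be set when PP > 1")
    if s.recompute_granularity == "full":
        problems.append("recompute_granularity must not be 'full'")
    return problems


for issue in find_violations(OverlapSettings(expert_model_parallel_size=1)):
    print("violation:", issue)
```

Collecting every violation in one pass, rather than stopping at the first failed assertion, is a design choice for this sketch; the actual validation in Bridge uses assertions.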
Minimal Working Config
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
```
Use this as the correctness-first starting point. Add delayed wgrad, flex
dispatch, and CUDA-graph interactions only after the plain overlap path is
known to work.
Measured Short-Run Evidence
A 2026-05-18 current-main H100 x16 Qwen3 30B-A3B mock pretraining run used
,
, BF16, global batch size 1024, CUDA graphs disabled, and
. With iterations 3-8 as the steady window:
| Case | Steady mean | Relative |
|---|---|---|
| no EP overlap | 41.25s | 1.000x |
| EP overlap | 31.31s | 1.317x |
| EP overlap plus `delay_wgrad_compute` | 31.20s | 1.322x |
This is evidence for enabling plain EP overlap on this inter-node all-to-all
shape. It does not show a meaningful independent win from delayed wgrad, and it
does not validate fused MoE permutation because that path was disabled for the
runtime stack.
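The relative column and the size of the delayed-wgrad increment can be recomputed from the steady means. The snippet below uses only the three measured values reported above; the case labels and variable names are illustrative.

```python
# Steady-window mean step times in seconds, copied from the table above.
steady_mean_s = {
    "no EP overlap": 41.25,
    "EP overlap": 31.31,
    "EP overlap plus delayed wgrad": 31.20,
}

baseline = steady_mean_s["no EP overlap"]
relative = {case: baseline / t for case, t in steady_mean_s.items()}

# Incremental gain of delayed wgrad on top of plain EP overlap.
incremental = steady_mean_s["EP overlap"] / steady_mean_s["EP overlap plus delayed wgrad"]

for case, rel in relative.items():
    print(f"{case}: {rel:.3f}x")
print(f"delayed wgrad over plain overlap: {(incremental - 1) * 100:.2f}%")
```

The increment from delayed wgrad comes out at roughly 0.35%, which is why this run is not treated as evidence of an independent win.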
Minimal Runnable Command
Performance harness example inside a Slurm allocation. Keep the model,
parallelism, dispatcher, and runtime fixed, and vary only the two overlap
overrides:
```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
```
Do not enable `--moe_a2a_overlap` when separating plain EP overlap from
delayed wgrad: the performance harness helper enables both
`overlap_moe_expert_parallel_comm` and `delay_wgrad_compute`.
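One way to keep everything fixed except the two overlap overrides is to generate the command for each case from a shared base. The sketch below only assembles argument lists from the flags shown in the command above; `CASES` and `build_command` are hypothetical helpers, and nothing is launched.

```python
import shlex
from typing import Dict, List, Tuple

# Arguments shared by every case, copied from the harness command above.
BASE_ARGS: List[str] = [
    "uv", "run", "python", "scripts/performance/run_script.py",
    "-m", "qwen", "-mr", "qwen3_30b_a3b", "--task", "pretrain",
    "-g", "h100", "-c", "bf16", "-ng", "16", "-gn", "8",
    "--max_steps", "8", "--config_variant", "v1",
    "--cuda_graph_impl", "none",
    "--moe_flex_dispatcher_backend", "None",
    "--moe_a2a_overlap", "false",  # kept off so the helper does not force both overrides
    "--tokenizer_type", "NullTokenizer",
]

# case name -> (overlap_moe_expert_parallel_comm, delay_wgrad_compute)
CASES: Dict[str, Tuple[bool, bool]] = {
    "no_overlap": (False, False),
    "ep_overlap": (True, False),
    "ep_overlap_delayed_wgrad": (True, True),
}


def build_command(overlap: bool, delay_wgrad: bool) -> List[str]:
    """Return the argv for one case; only the two overlap overrides vary."""
    if delay_wgrad and not overlap:
        raise ValueError("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    overrides = [
        f"comm_overlap.overlap_moe_expert_parallel_comm={str(overlap).lower()}",
        f"comm_overlap.delay_wgrad_compute={str(delay_wgrad).lower()}",
        "model.moe_shared_expert_overlap=false",
    ]
    return BASE_ARGS + overrides


for name, (overlap, delay_wgrad) in CASES.items():
    print(f"# {name}")
    print(shlex.join(build_command(overlap, delay_wgrad)))
```

Printing the commands rather than executing them keeps the sketch safe to run outside a Slurm allocation.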
Unit test verification:
```bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q
```
Verification
Unit tests
```bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q
```
Log checks
After a successful run with EP overlap:
- Confirm no assertion errors during finalization
- Confirm `overlap_moe_expert_parallel_comm` appears as `True` in the logged
config
- If using flex dispatcher, confirm `moe_token_dispatcher_type = "flex"` and
the correct backend in logs
Success criteria
- Config validation passes for the selected dispatcher and overlap settings
- Training runs complete without hangs or assertion failures
- Throughput improves or at least does not regress for the target workload
- Loss trajectory matches baseline (overlap should not affect convergence)
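The last two criteria can be turned into a simple pass/fail check once per-step losses and steady-state step times have been collected for both runs. The sketch below is one possible formulation: the function names and the tolerance defaults (`loss_rtol`, `slowdown_tol`) are assumptions to be tuned per workload, and it assumes small numerical differences between runs are acceptable.

```python
from typing import Sequence


def max_relative_loss_gap(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest per-step relative difference between two loss trajectories."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("trajectories must be non-empty and of equal length")
    return max(abs(c - b) / max(abs(b), 1e-12) for b, c in zip(baseline, candidate))


def meets_success_criteria(
    baseline_loss: Sequence[float],
    overlap_loss: Sequence[float],
    baseline_step_s: float,
    overlap_step_s: float,
    loss_rtol: float = 0.01,    # assumed tolerance, not a Bridge default
    slowdown_tol: float = 0.0,  # allowed fractional slowdown
) -> bool:
    """True when loss tracks the baseline and step time does not regress."""
    loss_ok = max_relative_loss_gap(baseline_loss, overlap_loss) <= loss_rtol
    throughput_ok = overlap_step_s <= baseline_step_s * (1.0 + slowdown_tol)
    return loss_ok and throughput_ok


# Made-up loss values for illustration; step times are placeholders.
print(meets_success_criteria([10.0, 9.0, 8.5], [10.0, 9.02, 8.49], 40.0, 31.0))
```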
Code Anchors
Bridge overlap validation
470
```python
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...
```
Delayed wgrad validation
507
```python
if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
```
Flex-dispatcher activation
27
```python
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False
```
Perf harness override
149
```python
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
```
Tests
| File | Coverage |
|---|---|
| `tests/unit_tests/training/test_comm_overlap.py` | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| `tests/unit_tests/training/test_deepep.py` | DeepEP/HybridEP helper activation and GPU gating |
Failure Diagnosis
| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `assert expert_model_parallel_size > 1` | EP not configured | Check `expert_model_parallel_size` | Set EP > 1 |
| `assert moe_token_dispatcher_type` | Wrong dispatcher | Check dispatcher type | Use `alltoall` or `flex` |
| assert on BF16/FP16 | Wrong precision | Check `bf16` and `fp16` | Set `bf16 = True` |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| `assert virtual_pipeline_model_parallel_size` | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert on `recompute_granularity` | Full recompute enabled | Check recompute settings | Disable full recompute |
| `assert overlap_moe_expert_parallel_comm required` | delayed wgrad without EP overlap | Check `delay_wgrad_compute` without overlap | Enable EP overlap first |
| `assert gradient_accumulation_fusion` | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable `gradient_accumulation_fusion` |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check attention bias settings | Disable attention bias |
| no throughput gain from flex dispatcher | `apply_flex_dispatcher_backend` not called | Check `moe_token_dispatcher_type` in logs | Call `apply_flex_dispatcher_backend(...)` |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |
Known Limitations
- Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch;
you must call `apply_flex_dispatcher_backend(...)`.
- Public recipes are often conservative and leave MoE overlap disabled by
default.
- End-to-end throughput gains have not yet been measured in a controlled Bridge
experiment for every model family. Code validation is stronger than a single
universal performance claim.
- MoE overlap and shared-expert overlap are mutually exclusive.
- CUDA graph plus delayed wgrad is a multi-constraint path that requires
careful TE version and scope validation.