perf-expert-parallel-overlap
MoE Expert-Parallel Overlap Skill
Stable docs: @docs/training/communication-overlap.md
Card: @skills/perf-expert-parallel-overlap/card.yaml
References
- Stable docs: @docs/training/communication-overlap.md
- Structured metadata: @skills/perf-expert-parallel-overlap/card.yaml
What It Is
Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all
communication by running it concurrently with expert FFN compute. Optionally,
delayed expert weight-gradient computation (`delay_wgrad_compute`) provides
additional overlap by deferring wgrad to overlap with the next layer's forward.
Bridge supports two dispatcher paths:
| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |
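To make the idea of hiding communication behind compute concrete, here is a toy cost model. It is only a sketch: it assumes perfect overlap (the longer of the two phases fully hides the shorter), which real runs will not reach, and the function names and millisecond values are illustrative, not taken from Bridge.

```python
def layer_time_ms(comm_ms: float, compute_ms: float, overlap: bool) -> float:
    """Idealized time for one MoE layer's all-to-all plus expert FFN compute."""
    if overlap:
        # Perfect overlap: the longer phase fully hides the shorter one.
        return max(comm_ms, compute_ms)
    # No overlap: communication and compute run back to back.
    return comm_ms + compute_ms


def ideal_speedup(comm_ms: float, compute_ms: float) -> float:
    """Upper bound on the speedup from overlapping, under the assumption above."""
    serial = layer_time_ms(comm_ms, compute_ms, overlap=False)
    overlapped = layer_time_ms(comm_ms, compute_ms, overlap=True)
    return serial / overlapped


if __name__ == "__main__":
    # Illustrative numbers only: communication-light, moderate, and balanced cases.
    for comm_ms, compute_ms in [(1.0, 12.0), (6.0, 12.0), (12.0, 12.0)]:
        gain = ideal_speedup(comm_ms, compute_ms)
        print(f"comm={comm_ms}ms compute={compute_ms}ms -> at most {gain:.2f}x")
```

The model shows why the gain depends on how large the all-to-all is relative to expert compute: when communication is a small share of the layer, even perfect overlap changes little.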
Quick Decision
Use EP overlap when:
- the model is MoE with `EP > 1`
- expert dispatch/combine communication is a meaningful part of step time
- you have memory headroom and are tuning for throughput
Prefer:
- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking additional gains
Avoid EP overlap when:
- full activation recompute is enabled
- `moe_shared_expert_overlap` is enabled
- the run is still being brought up for correctness
- PyTorch < 2.6.0
Expected outcome:
- if all-to-all dispatch is a clear profile bottleneck, overlap can produce a modest to meaningful speedup
- if the run is tiny, communication-light, or dominated by another wall, the gain may be negligible
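The checklist above can be sketched as a small decision helper. Everything here is hypothetical: `RunFacts`, its field names, and `recommend` are illustration only, and the 10% threshold for "a meaningful part of step time" is an assumption you should replace with your own judgment from a profile.

```python
from dataclasses import dataclass


@dataclass
class RunFacts:
    """Facts about a run. All field names are hypothetical."""
    ep_size: int
    full_activation_recompute: bool
    shared_expert_overlap: bool
    in_correctness_bringup: bool
    torch_version: tuple        # e.g. (2, 6, 0)
    a2a_share_of_step: float    # fraction of step time, taken from a profile
    has_memory_headroom: bool
    flex_supported_gpu: bool
    first_rollout: bool


def recommend(facts: RunFacts, min_a2a_share: float = 0.10):
    """Return (choice, reasons); choice is 'skip', 'alltoall', or 'flex'."""
    reasons = []
    if facts.ep_size <= 1:
        reasons.append("EP <= 1")
    if facts.full_activation_recompute:
        reasons.append("full activation recompute is enabled")
    if facts.shared_expert_overlap:
        reasons.append("moe_shared_expert_overlap is enabled")
    if facts.in_correctness_bringup:
        reasons.append("run is still in correctness bring-up")
    if facts.torch_version < (2, 6, 0):
        reasons.append("PyTorch < 2.6.0")
    if not facts.has_memory_headroom:
        reasons.append("no memory headroom")
    if facts.a2a_share_of_step < min_a2a_share:  # assumed threshold
        reasons.append("all-to-all is a small share of step time")
    if reasons:
        return "skip", reasons
    if facts.flex_supported_gpu and not facts.first_rollout:
        return "flex", []
    return "alltoall", []


facts = RunFacts(
    ep_size=8, full_activation_recompute=False, shared_expert_overlap=False,
    in_correctness_bringup=False, torch_version=(2, 6, 0),
    a2a_share_of_step=0.25, has_memory_headroom=True,
    flex_supported_gpu=True, first_rollout=True,
)
print(recommend(facts))
```

Returning the list of reasons rather than a bare boolean keeps the "avoid" conditions visible when the helper says to skip overlap.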
Enablement
alltoall dispatcher
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```
flex dispatcher (DeepEP or HybridEP)
```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```
Compatibility And Constraints
- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- Base precision is BF16 or FP16
- PyTorch `>= 2.6.0`
- If `PP > 1`, `virtual_pipeline_model_parallel_size` must be set
- `recompute_granularity != "full"`, `recompute_method = None`, `recompute_num_layers = None`
- `mtp_num_layers` must be `None` or `1`
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a prerequisite
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- CUDA graph scope `attn` + `delay_wgrad_compute` requires TE >= 2.12.0, `gradient_accumulation_fusion = True`, and no attention bias
- DeepEP: Ampere, Hopper, B200, B300 GPUs only
- HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
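A preflight check over the configuration-only constraints above might look like the sketch below. It is not Bridge's validation code: `ep_overlap_violations` is a hypothetical helper, it runs against plain namespace objects rather than real config classes, `pipeline_model_parallel_size` is assumed as the attribute name behind `PP`, and it deliberately skips the PyTorch/TE version checks, CUDA graph rules, and GPU gating.

```python
from types import SimpleNamespace


def ep_overlap_violations(model, comm) -> list:
    """Collect every violated constraint instead of stopping at the first."""
    problems = []
    if not comm.overlap_moe_expert_parallel_comm:
        if comm.delay_wgrad_compute:
            problems.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
        return problems
    if model.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if model.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if model.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if model.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (model.bf16 or model.fp16):
        problems.append("base precision must be BF16 or FP16")
    # Assumed attribute name for PP; the list above only says "PP > 1".
    if (model.pipeline_model_parallel_size > 1
            and model.virtual_pipeline_model_parallel_size is None):
        problems.append("virtual_pipeline_model_parallel_size must be set when PP > 1")
    if (model.recompute_granularity == "full"
            or model.recompute_method is not None
            or model.recompute_num_layers is not None):
        problems.append("full recompute settings must be disabled")
    if model.mtp_num_layers not in (None, 1):
        problems.append("mtp_num_layers must be None or 1")
    return problems


# Stand-in objects holding only the fields this sketch reads.
model = SimpleNamespace(
    expert_model_parallel_size=4, num_moe_experts=64,
    moe_token_dispatcher_type="alltoall", moe_shared_expert_overlap=False,
    bf16=True, fp16=False,
    pipeline_model_parallel_size=1, virtual_pipeline_model_parallel_size=None,
    recompute_granularity=None, recompute_method=None, recompute_num_layers=None,
    mtp_num_layers=None,
)
comm = SimpleNamespace(overlap_moe_expert_parallel_comm=True, delay_wgrad_compute=False)
print(ep_overlap_violations(model, comm))  # prints []
```

Collecting all violations in one pass is a design choice for this sketch: an assert-based validator reports one failure per run, whereas a list lets you fix several settings before relaunching.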
Minimal Working Config
```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True
```
Use this as the correctness-first starting point. Add delayed wgrad, flex
dispatch, and CUDA-graph interactions only after the plain overlap path is
known to work.
Measured Short-Run Evidence
A 2026-05-18 current-main H100 x16 Qwen3 30B-A3B mock pretraining run used
`EP=16`, `alltoall`, BF16, global batch size 1024, CUDA graphs disabled, and
`moe_permute_fusion=false`. With iterations 3-8 as the steady window:
| Case | Steady mean iteration time | Relative |
|---|---|---|
| no EP overlap | 41.25s | 1.000x |
| EP overlap | 31.31s | 1.317x |
| EP overlap plus `delay_wgrad_compute` | 31.20s | 1.322x |
This is evidence for enabling plain EP overlap on this inter-node all-to-all
shape. It does not show a meaningful independent win from delayed wgrad, and it
does not validate fused MoE permutation because that path was disabled for the
runtime stack.
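The relative figures follow directly from the steady means. The short calculation below uses only the three times reported in the table and also isolates the increment attributable to delayed wgrad; the helper name is illustrative.

```python
def relative_speed(baseline_s: float, measured_s: float) -> float:
    """Speed relative to baseline: baseline time divided by measured time."""
    return baseline_s / measured_s


baseline_s = 41.25           # no EP overlap
ep_overlap_s = 31.31         # EP overlap
ep_overlap_wgrad_s = 31.20   # EP overlap plus delay_wgrad_compute

print(f"EP overlap:              {relative_speed(baseline_s, ep_overlap_s):.3f}x")
print(f"EP overlap + delay wgrad: {relative_speed(baseline_s, ep_overlap_wgrad_s):.3f}x")

# Increment of delayed wgrad over plain EP overlap, as a percentage.
increment = relative_speed(ep_overlap_s, ep_overlap_wgrad_s) - 1.0
print(f"delayed wgrad on top of EP overlap: {increment * 100:.2f}%")
```

The last line comes out at roughly 0.35%, which is why this run supports plain EP overlap but not an independent claim for delayed wgrad.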
Minimal Runnable Command
Performance harness example inside a Slurm allocation. Keep the model,
parallelism, dispatcher, and runtime fixed, and vary only the two overlap
overrides:
```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false
```

Do not use `--moe_a2a_overlap true` when separating plain EP overlap from
delayed wgrad: the performance harness helper enables both
`overlap_moe_expert_parallel_comm` and `delay_wgrad_compute`.

Unit test verification:

```bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q
```
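Returning to the harness command: a three-way comparison (no overlap, plain EP overlap, EP overlap plus delayed wgrad) changes only the two `comm_overlap` overrides. The sketch below generates those override strings so the rest of the command line stays identical between runs. The case names and the `overrides_for` helper are hypothetical; the override keys are the ones shown in the command.

```python
# Case name -> (overlap_moe_expert_parallel_comm, delay_wgrad_compute).
# Case names are illustrative labels, not harness options.
CASES = {
    "no_ep_overlap": (False, False),
    "ep_overlap": (True, False),
    "ep_overlap_delay_wgrad": (True, True),
}


def overrides_for(case: str) -> list[str]:
    """Build the trailing key=value overrides for one comparison case."""
    overlap, delay_wgrad = CASES[case]
    return [
        f"comm_overlap.overlap_moe_expert_parallel_comm={str(overlap).lower()}",
        f"comm_overlap.delay_wgrad_compute={str(delay_wgrad).lower()}",
        # Held fixed in every case.
        "model.moe_shared_expert_overlap=false",
    ]


for case in CASES:
    print(case, " ".join(overrides_for(case)))
```

Append the printed overrides to the fixed portion of the command, keeping `--moe_a2a_overlap false` so the harness helper does not toggle both settings at once.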
Verification
Unit tests
```bash
uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q
```
Log checks
After a successful run with EP overlap:
- Confirm no assertion errors during `CommOverlapConfig` finalization
- Confirm `overlap_moe_expert_parallel_comm` appears as `True` in the logged config
- If using flex dispatcher, confirm `moe_token_dispatcher_type = "flex"` and the correct backend in logs
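These checks can be scripted against a saved log. The sketch below is assumption-heavy: the exact log format is not specified here, so it only assumes each config key is followed by some punctuation or whitespace and then its value, and the sample log text is made up for illustration. Both helper names are hypothetical, and the backend check is left out.

```python
import re


def config_value_in_log(log_text: str, key: str):
    """Return the first word-like value logged after `key`, or None."""
    # (?<!\w) avoids matching `key` as the tail of a longer identifier.
    pattern = re.compile(rf"(?<!\w){re.escape(key)}\W+(\w+)")
    match = pattern.search(log_text)
    return match.group(1) if match else None


def ep_overlap_log_problems(log_text: str, expect_flex: bool = False) -> list:
    """Apply the log checks listed above and return any failures."""
    problems = []
    if "AssertionError" in log_text:
        problems.append("assertion error found in log")
    if config_value_in_log(log_text, "overlap_moe_expert_parallel_comm") != "True":
        problems.append("overlap_moe_expert_parallel_comm not logged as True")
    if expect_flex and config_value_in_log(log_text, "moe_token_dispatcher_type") != "flex":
        problems.append('moe_token_dispatcher_type not logged as "flex"')
    return problems


# Made-up log excerpt; real logs may be formatted differently.
sample_log = """\
overlap_moe_expert_parallel_comm: True
moe_token_dispatcher_type: alltoall
"""
print(ep_overlap_log_problems(sample_log))
print(ep_overlap_log_problems(sample_log, expect_flex=True))
```

If your logs print the config in another shape (for example nested YAML or a dataclass repr), adjust the pattern rather than trusting a silent pass.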
Success criteria
- Config validation passes for the selected dispatcher and overlap settings
- Training runs complete without hangs or assertion failures
- Throughput improves or at least does not regress for the target workload
- Loss trajectory matches baseline (overlap should not affect convergence)
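The throughput and loss criteria can be checked numerically once both runs have finished. This is a minimal sketch: the helper names are hypothetical, the 1% relative loss tolerance is an assumption (pick one that suits your run length and precision), and the loss values in the example are invented; only the two step times come from the measured run above.

```python
def max_relative_gap(baseline, candidate) -> float:
    """Largest per-iteration relative difference between two loss series."""
    if len(baseline) != len(candidate):
        raise ValueError("loss series must cover the same iterations")
    return max(abs(b - c) / max(abs(b), 1e-12) for b, c in zip(baseline, candidate))


def meets_success_criteria(baseline_losses, overlap_losses,
                           baseline_step_s, overlap_step_s,
                           loss_rtol: float = 0.01) -> bool:
    """True if loss tracks the baseline and step time did not regress."""
    loss_ok = max_relative_gap(baseline_losses, overlap_losses) <= loss_rtol
    throughput_ok = overlap_step_s <= baseline_step_s
    return loss_ok and throughput_ok


# Invented loss values for illustration; step times from the measured run.
baseline_losses = [10.82, 9.95, 9.41]
overlap_losses = [10.82, 9.96, 9.40]
print(meets_success_criteria(baseline_losses, overlap_losses, 41.25, 31.31))
```

Comparing per-iteration losses rather than only the final value catches a run that diverges early and happens to recover.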
Code Anchors
Bridge overlap validation
Line 470:
```python
if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...
```
Delayed wgrad validation
Line 507:
```python
if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...
```
Flex-dispatcher activation
Line 27:
```python
def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False
```
Perf harness override
Line 149:
```python
def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False
```
Tests
| File | Coverage |
|---|---|
| `tests/unit_tests/training/test_comm_overlap.py` | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| `tests/unit_tests/training/test_deepep.py` | DeepEP/HybridEP helper activation and GPU gating |
Failure Diagnosis
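| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| assert on `expert_model_parallel_size > 1` | EP not configured | Check `expert_model_parallel_size` | Set EP > 1 |
| assert on `moe_token_dispatcher_type` | Wrong dispatcher | Check dispatcher type | Use `"alltoall"` or `"flex"` |
| assert on BF16/FP16 | Wrong precision | Check `bf16` / `fp16` | Set `bf16 = True` or `fp16 = True` |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| assert on `virtual_pipeline_model_parallel_size` | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert on `recompute_granularity` | Full recompute enabled | Check recompute settings | Disable full recompute |
| assert on `overlap_moe_expert_parallel_comm` | delayed wgrad without EP overlap | Check `overlap_moe_expert_parallel_comm` | Enable EP overlap first |
| assert on `gradient_accumulation_fusion` | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable `gradient_accumulation_fusion` |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check attention bias settings | Disable attention bias |
| no throughput gain from flex dispatcher | `apply_flex_dispatcher_backend(...)` not called | Check `moe_token_dispatcher_type` in logs | Call `apply_flex_dispatcher_backend(...)` |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |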
Known Limitations
- Setting `moe_flex_dispatcher_backend` alone does not activate flex dispatch; you must call `apply_flex_dispatcher_backend(...)`.
- Public recipes are often conservative and leave MoE overlap disabled by default.
- End-to-end throughput gains have not yet been measured in a controlled Bridge experiment for every model family. Code validation is stronger than a single universal performance claim.
- MoE overlap and shared-expert overlap are mutually exclusive.
- CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.