nemo-mbridge-perf-moe-dispatcher-selection
MoE Dispatcher Selection Guide
Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-dispatcher-selection/card.yaml
Quick Decision
By hardware
| Hardware | First choice | Why |
|---|---|---|
| H100 | DeepEP, if the runtime package is installed | Strong default for cross-node EP on Hopper |
| B200 | DeepEP, if the runtime package is installed | Good first choice unless a platform-specific HybridEP path is available |
| GB200 / GB300 NVL72 | HybridEP, if the runtime package is installed | Best fit for NVLink-domain-aware dispatch and lower memory pressure |
| Unknown or first bring-up | `alltoall` | Easiest path for correctness and debugging |
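
The hardware table can be read as a lookup with an availability fallback. The following is a minimal Python sketch of that reading, not part of any library: the names `FIRST_CHOICE` and `pick_dispatcher` and the plain-string hardware labels are hypothetical, and the fallback to `alltoall` follows the bring-up guidance in this guide.

```python
# Hypothetical helper mirroring the hardware table; not a library API.
FIRST_CHOICE = {
    "H100": "deepep",
    "B200": "deepep",
    "GB200": "hybridep",
    "GB300": "hybridep",
}


def pick_dispatcher(hardware: str, installed_backends: set) -> str:
    """Return the table's first choice, or alltoall when it is unavailable."""
    # Unknown hardware or first bring-up takes the conservative path.
    choice = FIRST_CHOICE.get(hardware, "alltoall")
    if choice != "alltoall" and choice not in installed_backends:
        # Runtime package missing: fall back rather than fail at model construction.
        return "alltoall"
    return choice
```

Treat the result as a starting point for measurement only; the EP degree and workload tables can still change the answer.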
By EP degree
| EP size | Guidance |
|---|---|
| Small EP | Dispatcher choice is usually second-order; start with `alltoall` |
| Medium EP | DeepEP often becomes worthwhile |
| Large EP | HybridEP is usually the best target on NVL72 systems |
Model-Family Patterns
| Workload | Common best path | Notes |
|---|---|---|
| DSV3 at large scale | HybridEP on GB200 or GB300, DeepEP on H100 | Dispatcher choice matters more as EP and PP both grow |
| Qwen3 235B | DeepEP on H100, HybridEP on GB200 | HybridEP usually wins on GB200 and often uses less memory |
| Qwen3 30B | DeepEP | Smaller models still benefit, but the absolute gap is smaller |
| Qwen3-Next | Close race in BF16, HybridEP stronger in FP8 or memory-tight runs | Good reminder to test, not assume |
| MoE VLMs | Start simple, then test HybridEP on GB200-class systems | Vision workloads are sensitive to both memory and host overhead |
Rounded Evidence Summary
Backend availability gate
Do not interpret a dispatcher timing until the container has proven that the
selected backend package is available.
`--moe_flex_dispatcher_backend None` selects the standard `alltoall` dispatcher, while
`deepep` and `hybridep` select `moe_token_dispatcher_type="flex"` and then require their
corresponding runtime packages at model construction time. If DeepEP or HybridEP is missing,
record the import failure as an environment limitation and treat `alltoall` as
the only measured correctness fallback for that run.
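
One way to apply this gate is to attempt the import inside the target container before launching any timed job. The sketch below uses only the Python standard library. The helper name `try_import` is hypothetical, and the module name to probe for each backend is an assumption the caller must supply (for example, the DeepEP project ships a Python package named `deep_ep`; this guide does not state the HybridEP module name).

```python
import importlib
from typing import Optional


def try_import(module_name: str) -> Optional[str]:
    """Return None if the module imports, else a short error string to record."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a loader error from a native extension
        return f"{type(exc).__name__}: {exc}"
    return None


# Usage: a stdlib module stands in for a backend package here.
error = try_import("json")
status = "available" if error is None else f"environment limitation ({error})"
```

A non-`None` result belongs in the run notes as an environment limitation, not in the timing table.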
Qwen3 30B A3B on H100
A short 2026-05-17 H100 smoke run used Qwen3 30B A3B BF16, 16 GPUs, EP=16,
the recipe's Transformer Engine CUDA graph scopes (`moe_router`,
`moe_preprocess`), and `model.moe_permute_fusion=false` due to a Triton JIT
compatibility issue in the run container. The `alltoall` fallback completed five
steps with 45.65 s mean step time after warmup, 132.9 mean TFLOP/s/GPU after
warmup, final loss 11.44050, and 61.351 GB peak max allocated memory. DeepEP
and HybridEP selected the requested flex backend in the dumped configs but
failed before the first iteration because the packages were not installed. This
confirms the availability gate; it is not a throughput ranking for flex
dispatchers on H100.
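
For scale, the reported TFLOP/s figure can be converted to an approximate MFU. The sketch below assumes a dense BF16 peak of 989 TFLOP/s per GPU, a datasheet value for H100 SXM that this guide does not state; substitute the peak for the actual SKU. The function name `mfu` is hypothetical.

```python
def mfu(achieved_tflops_per_gpu: float, peak_tflops_per_gpu: float) -> float:
    """Model FLOPs utilization as a fraction of the hardware peak."""
    return achieved_tflops_per_gpu / peak_tflops_per_gpu


# 132.9 TFLOP/s/GPU comes from the smoke run above.
# 989.0 is an ASSUMED H100 SXM dense BF16 peak, not a number from this guide.
h100_fallback_mfu = mfu(132.9, 989.0)
print(f"{h100_fallback_mfu:.1%}")
```

Under that assumption the fallback run lands at roughly 13% MFU, which describes this one configuration only and says nothing about the flex dispatchers.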
DSV3 on GB200 or GB300
The broad trend is more important than any single row in the tracker:
- plain `alltoall` is usually the conservative baseline
- DeepEP improves that baseline once EP communication becomes visible
- HybridEP adds another step up on NVL72 systems, especially after CUDA graphs, routing improvements, and CPU-side cleanup are already in place
In practice, the stack often moves from roughly "low-teens MFU" territory with
an untuned baseline into "high-teens to low-20s MFU" territory after the full
dispatcher and kernel stack is tuned.
Qwen3 235B on GB200
For Qwen3 235B, the practical ordering is usually:
- `alltoall` for initial bring-up
- DeepEP if you want a familiar tuned path
- HybridEP for the strongest steady-state result on GB200

HybridEP is usually modestly faster than `alltoall` on this workload and often
has noticeably better memory headroom.
Qwen3-Next on GB200
This family is a good reminder that dispatcher wins are workload-dependent:
- in BF16, `alltoall` and HybridEP can be close
- in FP8 or memory-constrained settings, HybridEP tends to look better
- pipeline layout and grouped-GEMM changes can matter almost as much as the dispatcher itself
Tuning Parameters
DeepEP
DeepEP is selected by setting `moe_token_dispatcher_type="flex"` and
`moe_flex_dispatcher_backend="deepep"`.

```bash
--moe-deepep-num-sms 20
```

Tune the SM count allocated to DeepEP communication kernels (default 20).
The optimal value depends on the workload and EP degree.

First confirm that the DeepEP package imports in the target container; a missing
package fails during model construction, before any dispatcher timing is
available.
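
The settings named in this section can be gathered into one override mapping so that a run records exactly which backend and SM count it used. This is a minimal sketch: the helper `flex_overrides` is hypothetical, the keys are the config names that appear in this guide, and how the mapping is applied (CLI flags, a recipe object, or `model.`-prefixed overrides) depends on the launcher and is not shown.

```python
# Hypothetical helper; only the key names come from this guide.
DEFAULT_NUM_SMS = {"deepep": 20, "hybridep": 16}


def flex_overrides(backend: str, num_sms=None) -> dict:
    """Build the dispatcher settings for one flex backend."""
    if backend not in DEFAULT_NUM_SMS:
        raise ValueError(f"unknown flex backend: {backend!r}")
    sms = DEFAULT_NUM_SMS[backend] if num_sms is None else num_sms
    return {
        "moe_token_dispatcher_type": "flex",
        "moe_flex_dispatcher_backend": backend,
        f"moe_{backend}_num_sms": sms,
    }
```

For example, `flex_overrides("deepep")` yields the flex dispatcher type, the `deepep` backend, and `moe_deepep_num_sms` set to the default of 20.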
HybridEP
HybridEP is selected by setting `moe_token_dispatcher_type="flex"` and
`moe_flex_dispatcher_backend="hybridep"`.

```bash
--moe-hybridep-num-sms 16
```

Tune the SM count allocated to HybridEP communication (default 16). The
performance harness uses 32 for HybridEP workloads. Sweep between 16 and 32
for the target hardware. Set `NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN`
to match the NVLink domain size of
the deployment. If it does not match the actual topology, performance and
sometimes correctness will suffer.

First confirm that the HybridEP package imports in the target container; a missing
package fails during model construction, before any dispatcher timing is
available.
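
The sweep described here can be enumerated up front so that every trial pins the same NVLink domain size. The sketch below is illustrative only: `hybridep_sweep` is a hypothetical helper, the step of 4 is an arbitrary choice, and `NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN` is treated as a plain key because this guide does not say whether it is an environment variable or a config field.

```python
def hybridep_sweep(nvlink_domain_size: int, low: int = 16, high: int = 32, step: int = 4) -> list:
    """Return one trial per SM count, all sharing the same NVLink domain size."""
    if nvlink_domain_size <= 0:
        raise ValueError("nvlink_domain_size must be positive")
    return [
        {
            "moe_hybridep_num_sms": sms,
            # Must match the real topology of the deployment; the caller supplies it.
            "NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN": str(nvlink_domain_size),
        }
        for sms in range(low, high + 1, step)
    ]
```

Keeping the domain size fixed across trials means any timing difference comes from the SM count alone.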
Routing mode
```bash
--moe-router-force-load-balancing
```

For performance benchmarking, force-balance routing is the safer default. It
usually outperforms dropless routing in large-scale benchmarks and makes results
more comparable across dispatcher backends.
Key Interactions
| Feature | Interaction |
|---|---|
| CUDA graphs | Best paired with |
| EP overlap | Helps when dispatcher time is still visible after backend tuning |
| FP8 | Often increases the relative importance of communication and host overhead |
| CPU affinity | Can matter as much as dispatcher choice on GB200 or GB300 |
| Pipeline layout | Poor PP or VPP layout can erase dispatcher gains |
When To Use Each
alltoall
- first correctness bring-up
- small EP configurations
- debugging communication regressions
DeepEP
- Hopper or B200 deployments
- cross-node EP is clearly visible in profiles
- you want a mature intermediate step before testing HybridEP
HybridEP
- GB200 or GB300 NVL72 systems
- large EP degrees
- memory headroom matters in addition to throughput
Pitfalls
- Do not compare dispatchers on different stacks: container, routing mode, PP layout, and CUDA-graph scope can move the result as much as the dispatcher.
- HybridEP is topology-sensitive: it is not a universal win outside the hardware it was designed for.
- Both dispatchers need SM tuning: the defaults for `moe_deepep_num_sms` (20) and `moe_hybridep_num_sms` (16) are reasonable starting points but rarely optimal.
- Force-balance and dropless are not interchangeable baselines: keep the routing mode fixed when comparing dispatcher backends.
- Memory and throughput can trade off differently by model: Qwen3-style runs may show a smaller speed delta than DSV3, but still justify HybridEP for memory headroom.
- Backend import failures are not performance data: if DeepEP or HybridEP is missing from the container, do not compare its failed job against a completed `alltoall` job. Fix the environment first, then rerun the same stack.
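
Several of these pitfalls reduce to one rule: compare two runs only if both completed and both used the same stack. A minimal sketch of that check follows. The `Run` record and its field names are hypothetical and cover only the stack dimensions listed in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    backend: str
    container: str
    routing_mode: str
    pp_layout: str
    cuda_graph_scope: str
    mean_step_time_s: Optional[float]  # None when the job failed before timing


def comparable(a: Run, b: Run) -> bool:
    """True only if both runs completed and share the same stack."""
    if a.mean_step_time_s is None or b.mean_step_time_s is None:
        # An import failure is an environment issue, not a data point.
        return False
    stack_a = (a.container, a.routing_mode, a.pp_layout, a.cuda_graph_scope)
    stack_b = (b.container, b.routing_mode, b.pp_layout, b.cuda_graph_scope)
    return stack_a == stack_b
```

A failed DeepEP or HybridEP job therefore never appears next to a completed `alltoall` job in a ranking; it stays in the environment notes until the package is installed and the same stack is rerun.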