| Hardware | First choice | Why |
|---|---|---|
| H100 | DeepEP, if the runtime package is installed | Strong default for cross-node EP on Hopper |
| B200 | DeepEP, if the runtime package is installed | Good first choice unless a platform-specific HybridEP path is available |
| GB200 / GB300 NVL72 | HybridEP, if the runtime package is installed | Best fit for NVLink-domain-aware dispatch and lower memory pressure |
| Unknown or first bring-up | | Easiest path for correctness and debugging |
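The hardware table can be read as a small lookup. The sketch below is illustrative only: `pick_dispatcher` and its parameters are hypothetical names, not part of any library. The `baseline` default of `alltoall` is an assumption, chosen because the first-choice cell for unknown hardware is blank in the table and `alltoall` is one of the backend names listed elsewhere on this page.

```python
def pick_dispatcher(
    hardware: str,
    deepep_installed: bool = False,
    hybridep_installed: bool = False,
    baseline: str = "alltoall",  # assumption: the table leaves this cell blank
) -> str:
    """Return a first-choice dispatcher backend following the hardware table."""
    hw = hardware.strip().upper()

    # GB200 / GB300 NVL72: HybridEP, if the runtime package is installed.
    if hw.startswith(("GB200", "GB300")):
        return "hybridep" if hybridep_installed else baseline

    # H100 and B200: DeepEP, if the runtime package is installed.
    if hw.startswith(("H100", "B200")):
        return "deepep" if deepep_installed else baseline

    # Unknown hardware or first bring-up: stay on the simplest path.
    return baseline


if __name__ == "__main__":
    print(pick_dispatcher("H100", deepep_installed=True))
    print(pick_dispatcher("GB200 NVL72", hybridep_installed=True))
    print(pick_dispatcher("some-new-board"))
```

The table does not say what to do when the preferred package is missing, so this sketch simply returns the baseline in that case rather than trying the other backend. Treat the result as a starting point to benchmark, not a final answer.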

| EP size | Guidance |
|---|---|
| Small EP | Dispatcher choice is usually second-order; start with |
| Medium EP | DeepEP often becomes worthwhile |
| Large EP | HybridEP is usually the best target on NVL72 systems |

| Workload | Common best path | Notes |
|---|---|---|
| DSV3 at large scale | HybridEP on GB200 or GB300, DeepEP on H100 | Dispatcher choice matters more as EP and PP both grow |
| Qwen3 235B | DeepEP on H100, HybridEP on GB200 | HybridEP usually wins on GB200 and often uses less memory |
| Qwen3 30B | DeepEP | Smaller models still benefit, but the absolute gap is smaller |
| Qwen3-Next | Close race in BF16, HybridEP stronger in FP8 or memory-tight runs | Good reminder to test, not assume |
| MoE VLMs | Start simple, then test HybridEP on GB200-class systems | Vision workloads are sensitive to both memory and host overhead |

`--moe_flex_dispatcher_backend`, `None`, `alltoall`, `deepep`, `hybridep`, `moe_token_dispatcher_type="flex"`, `moe_router`, `moe_preprocess`, `model.moe_permute_fusion=false`, `moe_flex_dispatcher_backend="deepep"`, `--moe-deepep-num-sms 20`, `moe_flex_dispatcher_backend="hybridep"`, `--moe-hybridep-num-sms 16`, `NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN`, `--moe-router-force-load-balancing`

| Feature | Interaction |
|---|---|
| CUDA graphs | Best paired with |
| EP overlap | Helps when dispatcher time is still visible after backend tuning |
| FP8 | Often increases the relative importance of communication and host overhead |
| CPU affinity | Can matter as much as dispatcher choice on GB200 or GB300 |
| Pipeline layout | Poor PP or VPP layout can erase dispatcher gains |

`alltoall`, `moe_deepep_num_sms`, `moe_hybridep_num_sms`
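
The backend settings named on this page can be collected into one override set per backend. This is a minimal sketch, not a definitive implementation: `dispatcher_overrides` and `QUOTED_NUM_SMS` are hypothetical names, the SM counts are the values quoted in the flags above (20 for DeepEP, 16 for HybridEP), and selecting the baseline with `moe_token_dispatcher_type="alltoall"` is an assumption about how the non-flex path is configured.

```python
from typing import Optional

# SM counts as quoted on this page:
# --moe-deepep-num-sms 20 and --moe-hybridep-num-sms 16.
QUOTED_NUM_SMS = {"deepep": 20, "hybridep": 16}


def dispatcher_overrides(backend: str, num_sms: Optional[int] = None) -> dict:
    """Build config overrides for one dispatcher backend (hypothetical helper)."""
    if backend == "alltoall":
        # Assumption: the baseline is chosen via the dispatcher type itself,
        # with no flex backend and no SM tuning knob.
        return {"moe_token_dispatcher_type": "alltoall"}

    if backend not in QUOTED_NUM_SMS:
        raise ValueError(f"unknown dispatcher backend: {backend!r}")

    sms = QUOTED_NUM_SMS[backend] if num_sms is None else num_sms
    return {
        "moe_token_dispatcher_type": "flex",
        "moe_flex_dispatcher_backend": backend,
        # Resolves to moe_deepep_num_sms or moe_hybridep_num_sms.
        f"moe_{backend}_num_sms": sms,
    }


if __name__ == "__main__":
    for name in ("alltoall", "deepep", "hybridep"):
        print(name, dispatcher_overrides(name))
```

Keeping the overrides in one place makes it easier to run the same workload under each backend and compare, which matches the page's advice to test rather than assume.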