perf-moe-hardware-configs

MoE Hardware Configuration Reference

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-hardware-configs/card.yaml

Quick Platform Playbook

| Platform | Typical MoE strategy | What usually matters most |
| --- | --- | --- |
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |

Rounded Performance Bands

These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.

| Workload family | Hardware | Typical band | Representative shape |
| --- | --- | --- | --- |
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |
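
A representative shape only fits on GPU counts that its parallel sizes divide cleanly. The following is a minimal sketch of that arithmetic. It assumes a Megatron-style layout in which data parallelism is whatever remains after TP and PP, and in which the expert-parallel group is carved out of the GPUs that hold one pipeline stage. The function name, the divisibility rule, and the 1024-GPU job size are illustrative assumptions, not values from the tracker.

```python
def check_shape(world_size: int, tp: int, pp: int, ep: int) -> dict:
    """Derive DP and expert-DP sizes for a TP/PP/EP shape, or raise if it cannot fit."""
    if world_size % (tp * pp) != 0:
        raise ValueError(f"world_size={world_size} is not divisible by TP*PP={tp * pp}")
    dp = world_size // (tp * pp)
    gpus_per_stage = world_size // pp  # GPUs that together hold one pipeline stage
    if gpus_per_stage % ep != 0:
        raise ValueError(f"EP={ep} does not divide the {gpus_per_stage} GPUs in a stage")
    return {
        "dp": dp,
        "gpus_per_stage": gpus_per_stage,
        "expert_dp": gpus_per_stage // ep,  # replicas of each expert shard per stage
    }


# DSV3-on-H100 shape from the table, on a hypothetical 1024-GPU job.
print(check_shape(world_size=1024, tp=2, pp=8, ep=64))
```

Running a check like this before launch catches shapes that were copied from a different cluster size.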

Representative Config Families

DSV3 on H100

```text
Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
```
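
To see why VPP matters for "keep PP efficient", the textbook estimate for an interleaved 1F1B schedule puts bubble time at roughly `(PP - 1) / (VPP * m)` of ideal compute time, where `m` is the number of microbatches per step. The sketch below applies that estimate to the PP=8 shape. The microbatch count of 64 is an assumption chosen for illustration, and the formula ignores communication cost and stage imbalance.

```python
def bubble_fraction(pp: int, vpp: int, num_microbatches: int) -> float:
    """Pipeline bubble time as a fraction of ideal compute time: (pp - 1) / (vpp * m)."""
    if min(pp, vpp, num_microbatches) < 1:
        raise ValueError("pp, vpp and num_microbatches must all be >= 1")
    return (pp - 1) / (vpp * num_microbatches)


# Hypothetical step with 64 microbatches on a PP=8 pipeline.
for vpp in (1, 2, 4):
    print(f"PP=8 VPP={vpp}: bubble fraction = {bubble_fraction(8, vpp, 64):.3f}")
```

The estimate only holds when every virtual stage does similar work, so the per-stage layout still has to be checked separately.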

DSV3 on B200

```text
Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning
```

DSV3 on GB200 or GB300

```text
Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
```

Qwen3 235B on H100

```text
Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
```

Qwen3 235B on GB200

```text
Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
```

Qwen3-Next 80B on GB200

```text
Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
```

Cross-Cutting Patterns

PP layout

  • E = embedding
  • t = transformer
  • m = MTP
  • L = loss
  • | = stage boundary

The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
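
One way to check stage balance is to count layers per stage from a layout string written with the legend above. The sketch below assumes a simplified format with one character per layer and `|` between stages. That format, the example string, and the function names are hypothetical. Real tooling may accept a richer syntax.

```python
from collections import Counter

LEGEND = {"E": "embedding", "t": "transformer", "m": "MTP", "L": "loss"}


def parse_layout(layout: str) -> list:
    """Split a layout string on '|' and count the layer kinds in each stage."""
    stages = []
    for index, chunk in enumerate(layout.split("|")):
        if not chunk:
            raise ValueError(f"stage {index} is empty")
        unknown = set(chunk) - set(LEGEND)
        if unknown:
            raise ValueError(f"stage {index} has unknown symbols: {sorted(unknown)}")
        stages.append(Counter(LEGEND[symbol] for symbol in chunk))
    return stages


def transformer_spread(stages: list) -> int:
    """Difference between the most and least transformer-heavy stages."""
    counts = [stage["transformer"] for stage in stages]
    return max(counts) - min(counts)


# Hypothetical 4-stage layout: the first and last stages carry fewer
# transformer layers to offset the embedding, MTP and loss work they also hold.
stages = parse_layout("Ett|ttt|ttt|ttmL")
for index, stage in enumerate(stages):
    print(index, dict(stage))
print("transformer-layer spread:", transformer_spread(stages))
```

A small spread is not proof of balance, because embedding, MTP, and loss layers have their own cost, but a large one is a quick warning sign.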

Recompute strategy

| Memory pressure | Starting point |
| --- | --- |
| low | none or a very narrow selective set |
| moderate | `moe_act`, `mlp`, `norm`, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
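
The table can be encoded as a small lookup so that a launch script starts from the same escalation order each time. This is a sketch only. The module names `moe_act`, `mlp`, and `norm` come from the table, while `up_proj`, `full`, the pressure labels, and the function name are hypothetical placeholders rather than real flag values.

```python
# "up_proj" stands in for the model-specific up-projection module and "full"
# for full recompute; both are hypothetical labels used for illustration.
SELECTIVE_SETS = {
    "low": [],
    "moderate": ["moe_act", "mlp", "norm"],
    "high": ["up_proj", "moe_act", "mlp"],
}


def recompute_starting_point(pressure: str, selective_fits: bool = True) -> list:
    """Return the recompute modules to try first for a given memory-pressure level."""
    if pressure == "extreme":
        # Fall back to full recompute only when the selective path does not fit.
        return list(SELECTIVE_SETS["high"]) if selective_fits else ["full"]
    if pressure not in SELECTIVE_SETS:
        raise ValueError(f"unknown memory pressure level: {pressure!r}")
    return list(SELECTIVE_SETS[pressure])


print(recompute_starting_point("moderate"))
print(recompute_starting_point("extreme", selective_fits=False))
```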

Environment variables

```bash
# The two CUDA_DEVICE_MAX_CONNECTIONS lines are alternatives; set only one per run.
CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
```
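
Because the two `CUDA_DEVICE_MAX_CONNECTIONS` values are alternatives, a launcher can choose one explicitly instead of relying on which export ran last. The sketch below builds the environment for a child process. The function name is hypothetical, and using `1` whenever EP overlap and CUDA graphs are not combined is an assumption, since the list above only says when `32` is common.

```python
import os


def build_env(ep_overlap_with_cuda_graphs: bool, base=None) -> dict:
    """Return a copy of the environment with the MoE-related variables set."""
    env = dict(os.environ if base is None else base)
    # Assumption: fall back to 1 when EP overlap and CUDA graphs are not combined.
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = "32" if ep_overlap_with_cuda_graphs else "1"
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    env["NCCL_GRAPH_REGISTER"] = "0"
    return env


env = build_env(ep_overlap_with_cuda_graphs=True, base={})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training command.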

CPU-side tuning

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
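
As one concrete form of CPU affinity work, each training process can be pinned to its own slice of cores. The sketch below is Linux-only and uses `os.sched_setaffinity`. The contiguous, equal-sized split per local rank and both function names are simplifying assumptions. A real mapping should follow the node's actual NUMA and GPU topology.

```python
import os


def cores_for_local_rank(local_rank: int, ranks_per_node: int, total_cores: int) -> set:
    """Give each local rank a contiguous, equal-sized slice of the node's cores."""
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank must be in [0, ranks_per_node)")
    per_rank = total_cores // ranks_per_node
    if per_rank == 0:
        raise ValueError("fewer cores than ranks on this node")
    start = local_rank * per_rank
    return set(range(start, start + per_rank))


def pin_current_process(local_rank: int, ranks_per_node: int) -> set:
    """Pin the calling process to its slice of cores (Linux only)."""
    cores = cores_for_local_rank(local_rank, ranks_per_node, os.cpu_count() or 1)
    os.sched_setaffinity(0, cores)  # pid 0 means the calling process
    return cores


# Hypothetical node with 4 ranks and 16 cores: rank 1 gets cores 4 to 7.
print(sorted(cores_for_local_rank(local_rank=1, ranks_per_node=4, total_cores=16)))
```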

Pitfalls

  1. Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
  2. Container quality matters: large regressions can come from the software stack rather than the model recipe.
  3. VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
  4. Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
  5. Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
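
Pitfall 4 follows directly from how MFU is computed: achieved throughput divided by the peak for the precision in use. The sketch below compares two hypothetical runs on the same GPU. The peak values are assumed dense figures for an H100 SXM part and should be checked against the datasheet for the actual SKU. The throughput numbers are made up for illustration.

```python
def mfu_percent(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a percentage of the assumed hardware peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return 100.0 * achieved_tflops / peak_tflops


# Assumed dense peaks per GPU, in TFLOPS; verify against the vendor datasheet.
PEAK_BF16 = 989.0
PEAK_FP8 = 1979.0

# Hypothetical runs: the FP8 run is faster in absolute terms.
bf16_tflops = 300.0
fp8_tflops = 400.0

print(f"BF16: {bf16_tflops:.0f} TFLOPS/GPU, MFU {mfu_percent(bf16_tflops, PEAK_BF16):.1f}%")
print(f"FP8:  {fp8_tflops:.0f} TFLOPS/GPU, MFU {mfu_percent(fp8_tflops, PEAK_FP8):.1f}%")
```

In this example the FP8 run delivers more throughput yet reports a lower MFU, because its denominator roughly doubles.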