perf-moe-hardware-configs
MoE Hardware Configuration Reference
Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-hardware-configs/card.yaml
Quick Platform Playbook
| Platform | Typical MoE strategy | What usually matters most |
|---|---|---|
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |
Rounded Performance Bands
These are intentionally rounded so the document stays durable as the tracker
moves. Treat them as planning ranges, not exact promises.
| Workload family | Hardware | Typical band | Representative shape |
|---|---|---|---|
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |
Representative Config Families
DSV3 on H100
```text
Dispatcher: DeepEP
TP=2 EP=64 PP=8 VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
```
DSV3 on B200
```text
Dispatcher: DeepEP
TP=1 EP=32 PP=8 VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning
```
DSV3 on GB200 or GB300
```text
Dispatcher: HybridEP
TP=1 EP=64 PP=4 VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
```
Qwen3 235B on H100
```text
Dispatcher: DeepEP
TP=2 EP=32 PP=8 VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
```
Qwen3 235B on GB200
```text
Dispatcher: HybridEP
TP=1 or 2 EP=32 to 64 PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
```
Qwen3-Next 80B on GB200
```text
Dispatcher: HybridEP
TP=1 EP=32 PP=2 VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
```
Cross-Cutting Patterns
PP layout
- `E` = embedding
- `t` = transformer
- `m` = MTP
- `L` = loss
- `|` = stage boundary
The biggest platform difference is usually not just the dispatcher. It is the
combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
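To make the balance point concrete, here is a minimal sketch that counts layout symbols per pipeline stage using the legend above. The layout string, `stage_loads`, and `transformer_imbalance` are hypothetical names invented for illustration. The sketch handles only single-character symbols and `|`, and does not handle any repetition shorthand a real launcher may accept.

```python
from collections import Counter

# Legend from this section: symbol -> layer type.
LEGEND = {"E": "embedding", "t": "transformer", "m": "MTP", "L": "loss"}


def stage_loads(layout: str) -> list[Counter]:
    """Split a layout string on '|' and count the symbols in each stage."""
    stages = []
    for chunk in layout.split("|"):
        unknown = set(chunk) - set(LEGEND)
        if unknown:
            raise ValueError(f"unknown layout symbols: {sorted(unknown)}")
        stages.append(Counter(chunk))
    return stages


def transformer_imbalance(layout: str) -> int:
    """Difference between the most and least loaded stage, in transformer layers."""
    counts = [stage["t"] for stage in stage_loads(layout)]
    return max(counts) - min(counts)


if __name__ == "__main__":
    # Hypothetical 4-stage layout, not taken from any tracked config.
    layout = "Ett|ttt|ttt|ttmL"
    for index, stage in enumerate(stage_loads(layout)):
        print(index, dict(stage))
    print("transformer imbalance:", transformer_imbalance(layout))
```

In this made-up layout the first and last stages carry one fewer transformer layer, which is a common way to offset the extra embedding, MTP, and loss work. Whether that trade is right for a given model is an empirical question.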
Recompute strategy
| Memory pressure | Starting point |
|---|---|
| low | none or a very narrow selective set |
| moderate | |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
Environment variables
```bash
# The two CUDA_DEVICE_MAX_CONNECTIONS lines are alternatives; set only one.
CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32  # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
```
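One way to keep the two `CUDA_DEVICE_MAX_CONNECTIONS` values from being set together is to derive the environment in the launcher. The sketch below is an illustration under assumptions: `moe_env` and its two flags are hypothetical, and the rule "32 only when EP overlap and CUDA graphs are both on, otherwise 1" is one reading of the comment in the block above, not a documented requirement.

```python
import os


def moe_env(ep_overlap: bool, cuda_graphs: bool) -> dict[str, str]:
    """Return the environment overrides listed in this section.

    Assumption: 32 connections are used only when EP overlap and CUDA
    graphs are both enabled; every other combination falls back to 1.
    """
    max_connections = "32" if (ep_overlap and cuda_graphs) else "1"
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": max_connections,
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_GRAPH_REGISTER": "0",
    }


if __name__ == "__main__":
    # Merge the overrides over the current environment; the result can be
    # passed as env= to subprocess.run when starting the training job.
    env = {**os.environ, **moe_env(ep_overlap=True, cuda_graphs=True)}
    print(env["CUDA_DEVICE_MAX_CONNECTIONS"])
```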
CPU-side tuning
On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the
needle almost as much as a dispatcher swap. Treat them as first-class tuning
work, not as afterthoughts.
Pitfalls
- Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
- Container quality matters: large regressions can come from the software stack rather than the model recipe.
- VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
- Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
- Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
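The MFU pitfall can be shown with simple arithmetic. The sketch below treats MFU as achieved throughput divided by the peak throughput of the precision in use. The peak values and the two runs are round placeholders chosen for illustration, not vendor specifications or tracker results, and `mfu` is a hypothetical helper.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the chosen peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


# Placeholder peaks (assumption): the FP8 peak is twice the BF16 peak.
ASSUMED_PEAK_TFLOPS = {"bf16": 1000.0, "fp8": 2000.0}

# Two hypothetical runs on the same hardware: (precision, achieved TFLOPS/GPU).
runs = [("bf16", 300.0), ("fp8", 400.0)]

if __name__ == "__main__":
    for precision, achieved in runs:
        ratio = mfu(achieved, ASSUMED_PEAK_TFLOPS[precision])
        print(f"{precision}: {achieved:.0f} TFLOPS/GPU, MFU {ratio:.0%}")
```

Under these placeholder numbers the FP8 run is faster in absolute terms yet reports a lower MFU, because its denominator doubled. This is why throughput should sit next to MFU in any cross-precision comparison.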