sglang-minimax-m2-series-optimization
SGLang MiniMax M2 Series Optimization
Overview
The skill covers the full MiniMax optimization ladder: mainline history, the remaining still-open upstream PR track, and current-main validation lanes. Use it to recover, extend, or audit MiniMax-specific optimizations, or to reuse the patterns on a structurally similar MoE model.
As of 2026-04-21, refreshed against SGLang `origin/main` commit `c122d343a`, the MiniMax story is split across three sources of truth:
- mainline history already present in `main`
- still-open upstream PRs that are important for MiniMax-M2.5, but not fully landed in `main` yet
- current registered docs/tests/workflows, especially the MiniMax-M2.7 AMD accuracy and performance lanes
This skill tracks all three, but it labels them clearly. Do not assume an optimization from a PR page is already in your local tree, and do not assume MiniMax-M2.7 or M2.7-highspeed is covered by MiniMax-M2.5 validation just because the same model file is used.
The historical evidence for every stage lives in:
- references/pr-history.md: mainline and still-open PR evidence, benchmark notes, key code patterns
- references/playbook.md: symptom mapping, commands, validation order
Before You Change Anything
Record the exact serving shape first (a minimal recording sketch follows the list):
- M2, M2.1, M2.5, or M2.7
- instruct or reasoning-style launch
- native, AWQ, FP8, FP4, ModelSlim, or other quant format
- TP / DP / EP / PP topology
- DP attention enabled or not
- DeepEP, FlashInfer, Triton, or other MoE / attention backend
- piecewise CUDA graph enabled or not
- speculative decoding or Eagle3 enabled or not
- NVIDIA, AMD, NPU, or other backend
- launch parser pair: `--tool-call-parser minimax-m2` and `--reasoning-parser minimax-append-think` when tool/thinking behavior matters
- exact registered suite, workflow job, or hardware lane used for validation
- QK normalization depends on how heads are partitioned or replicated
- M2.5 scale-out performance depends on communication strategy, not only kernels
- quantized checkpoints depend on exact loader conventions
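A minimal sketch of what recording the shape can look like in practice. The field names and values below are illustrative bookkeeping, not an SGLang API:

```python
# Hypothetical shape record for a single deployment; each field mirrors one
# bullet above. None of these keys are an SGLang API -- plain bookkeeping.
serving_shape = {
    "model": "MiniMaxAI/MiniMax-M2.5",        # M2, M2.1, M2.5, or M2.7
    "launch_style": "reasoning",              # instruct or reasoning-style
    "quant": "awq",                           # native, AWQ, FP8, FP4, ModelSlim, ...
    "topology": {"tp": 8, "dp": 1, "ep": 8, "pp": 1},
    "dp_attention": False,
    "moe_attention_backend": "deepep",        # DeepEP, FlashInfer, Triton, ...
    "piecewise_cuda_graph": True,
    "speculative": None,                      # None or "eagle3"
    "hardware": "nvidia",                     # NVIDIA, AMD, NPU, ...
    "tool_call_parser": "minimax-m2",         # --tool-call-parser
    "reasoning_parser": "minimax-append-think",  # --reasoning-parser
    "validation_lane": "test/registered/...",    # exact suite or workflow job
}
```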
Core Principle
Do not treat MiniMax as a generic DeepSeek-like MoE.
- MiniMax-M2 is a QK-normalized attention plus sparse-MoE story.
- MiniMax-M2.5 adds a much heavier distributed and quantized runtime story.
- MiniMax-M2.7 currently follows the MiniMax-M2-family model path but has its own AMD accuracy/performance lanes; treat it as a separate validation target.
- MiniMax-M2.7-highspeed is currently visible through upstream docs work, not a separate current-main runtime path.
- TP QK norm/all-reduce is already a two-generation mainline optimization: #16483 keeps the older RMSNormTP all-reduce aligned, and #20673 adds the fused JIT TP QK norm path behind `SGLANG_USE_FUSED_PARALLEL_QKNORM`.
- The most important distinctions are often not "model size" but:
  - whether attention TP equals model TP
  - whether KV heads are partitioned or replicated
  - whether MoE output should all-reduce, reduce-scatter, or stay fused for the next layer
The optimization order matters:
- confirm the loader and topology contract
- fix correctness in the MiniMax-specific path
- remove generic overhead in the hot path
- only then add deeper kernel or communication specialization
- validate on the exact topology that triggered the issue
What Transfers To Similar Models
Reuse this skill on a non-MiniMax model when it shares one or more of these traits:
- QK normalization whose cost is dominated by TP communication
- KV-head replication when `num_key_value_heads < tp_size`
- sparse MoE without DeepSeek-style shared experts
- topology-dependent attention groups where attention TP is not the same as model TP
- quantized MoE checkpoints that rely on packed or fused module mappings
Reuse the order of investigation and validation discipline, not the MiniMax-specific constants.
M2 Core Evolution Path
Use this path when the target is `MiniMaxAI/MiniMax-M2*` and the problem is mostly inside the core model path already on `main`.
Stage M2-0: Basic support exists, but the path is still naive
The model can launch, but the earliest support path is not yet optimized and may still miss MiniMax-specific surfaces.
- basic model registration and weight loading
- MiniMax-specific MoE, QK norm, and tool-call integration exist
- do not confuse "supported" with "optimized"
- `python/sglang/srt/models/minimax_m2.py` exists and is the active runtime path
- later performance or correctness work has a stable MiniMax-specific home
Stage M2-1: Fix RMSNorm precision before chasing speed
MiniMax QK normalization is numerically sensitive. Before deeper optimization, the norm path must accumulate safely; a minimal fp32-accumulation sketch follows the list below.
- prefer fp32 accumulation in the norm path
- treat QK norm correctness as a prerequisite for later TP work
- the norm path no longer relies on lower-precision accumulation where MiniMax accuracy is sensitive
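A minimal sketch of the fp32-accumulation discipline, assuming a standard RMSNorm shape; this is not the in-tree implementation:

```python
import torch

def rms_norm_fp32_acc(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Upcast before accumulating: the mean-of-squares is where bf16/fp16
    # accumulation loses the precision MiniMax QK norm is sensitive to.
    x_f32 = x.float()
    variance = x_f32.pow(2).mean(dim=-1, keepdim=True)
    normed = x_f32 * torch.rsqrt(variance + eps)
    # Cast back to the input dtype only after all fp32 math is done.
    return (normed * weight.float()).to(x.dtype)
```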
Stage M2-2: Expose Eagle3 capture and embedding surfaces
MiniMax needs to expose the same capture surfaces as other spec-decoding-capable models. Without them, speculative or auxiliary-hidden-state features fail even if base generation works; a sketch of both surfaces follows the list below.
- capture intermediate hidden states for selected layers
- expose `get_embed_and_head`
- keep the speculative-decoding surface area on the MiniMax model, not on ad hoc wrappers
- `set_eagle3_layers_to_capture(...)` works
- `get_embed_and_head()` exists and downstream speculative code can call it
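A minimal sketch of the two surfaces on a generic decoder module. The method names match the bullets above; everything else is illustrative, not the MiniMax model class:

```python
import torch
from torch import nn

class SpecDecodingSurfaces(nn.Module):
    # Illustrative container -- not the MiniMax model itself.
    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.num_layers = num_layers
        self.layers_to_capture: list[int] = []

    def set_eagle3_layers_to_capture(self, layer_ids: list[int]) -> None:
        # Record which layers should stash hidden states during forward so
        # Eagle3 can read them back; the forward-time capture is omitted here.
        self.layers_to_capture = [i for i in layer_ids if 0 <= i < self.num_layers]

    def get_embed_and_head(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Downstream speculative code reads both weights from one place.
        return self.embed_tokens.weight, self.lm_head.weight
```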
Stage M2-3: Make the MoE path correct before making it faster
Before tuning kernels, MiniMax needs the right MoE contract. This includes correct DeepEP forward usage and removing unnecessary router-side work; a router sketch follows the list below.
- keep the DeepEP forward path aligned with MiniMax's expert layout
- do not add shared-expert logic that MiniMax does not use
- remove unnecessary router work by specializing the top-k sigmoid path
- the DeepEP MiniMax MoE path is functionally correct
- the router no longer spends time on generic work MiniMax does not need
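A sketch of the specialized sigmoid top-k routing with no shared-expert bookkeeping. Whether the selected weights are renormalized is an assumption to confirm against the model config:

```python
import torch

def topk_sigmoid_router(router_logits: torch.Tensor, top_k: int):
    # Score experts independently with a sigmoid instead of a softmax over
    # all experts, then keep only the top-k per token.
    scores = torch.sigmoid(router_logits.float())
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    # Assumed renormalization so per-token expert weights sum to 1.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```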
Stage M2-4: Specialize the QK norm hot path
For MiniMax, QK normalization is a real decode hotspot. Once correctness is solid, the next gains come from fusing the TP-aware norm path instead of doing separate generic operations; see the gating sketch after this list.
- compute Q and K norm together
- keep TP-aware reduction in the same specialized path
- preserve the custom all-reduce fast path by keeping reduction buffers aligned
- prefer the fused JIT TP QK norm path on supported CUDA launches instead of stopping at the older in-model RMSNormTP path
- check `SGLANG_USE_FUSED_PARALLEL_QKNORM`, `fused_parallel_qknorm(...)`, and `CustomAllReduceV2` before claiming a missing all-reduce optimization
- `MiniMaxM2RMSNormTP` is the active per-layer QK norm implementation
- the reduction path consistently selects the fast aligned all-reduce path
- `MiniMaxM2QKRMSNorm` can use the fused TP QK norm custom op when the JIT path is enabled and supported
- focused validation exists in `python/sglang/jit_kernel/tests/test_tp_qknorm.py` and `python/sglang/jit_kernel/benchmark/bench_tp_qknorm.py`
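A gating sketch for the dispatch decision. Only the env-flag name comes from the source; `fused_op` stands in for the real `fused_parallel_qknorm(...)` entry point:

```python
import os
import torch

def _rms_norm(x: torch.Tensor, w: torch.Tensor, eps: float) -> torch.Tensor:
    xf = x.float()
    return (xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + eps) * w.float()).to(x.dtype)

def qk_norm(q, k, q_weight, k_weight, eps=1e-6, fused_op=None):
    # Prefer the fused JIT TP QK norm when the flag is set and the kernel is
    # available; otherwise fall back to two separate fp32-accumulating norms.
    if os.environ.get("SGLANG_USE_FUSED_PARALLEL_QKNORM") == "1" and fused_op is not None:
        return fused_op(q, k, q_weight, k_weight, eps)
    return _rms_norm(q, q_weight, eps), _rms_norm(k, k_weight, eps)
```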
Stage M2-5: Harden for piecewise CUDA graph and pipeline parallelism
Once the core hot paths are in place, MiniMax needs to remain usable under graph capture and PP partitioning.
- keep piecewise CUDA graph contexts correct around MoE expert-distribution recording
- propagate `pp_proxy_tensors`
- make weight loading layer-range aware under PP (a layer-range sketch follows this list)
Family-adjacent caveat:
- #18310 is for MiniMax-M2.1 and focuses on a `torch.compile` plus CUDA-graph crash. It is not the core M2 mainline optimization ladder, but it is worth borrowing if graph tracing regresses on a MiniMax-family branch.
- MiniMax can run under PP without wrapper gaps
- piecewise CUDA graph support does not regress the MiniMax-specific path
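A minimal sketch of layer-range-aware loading under PP; the weight-name pattern is an assumption about the checkpoint layout:

```python
import re

def layer_in_pp_range(weight_name: str, start_layer: int, end_layer: int) -> bool:
    # Each PP rank should only load the transformer layers it owns. Weights
    # without a layer index (embeddings, lm_head) are handled elsewhere.
    m = re.search(r"layers\.(\d+)\.", weight_name)
    if m is None:
        return True
    return start_layer <= int(m.group(1)) < end_layer
```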
M2.5 Extension Path
Use this path when the target is `MiniMaxAI/MiniMax-M2.5` or another later MiniMax-family checkpoint. Start from the M2 core stages above, then continue here.
Stage M25-0: Audit the loader contract already on `main`
M2.5 stresses loading and quantized checkpoint conventions much harder than the early M2 path.
- preserve `packed_modules_mapping` (a mapping sketch follows this list)
- preserve KV-cache scale remapping
- keep ModelSlim-specific layer assumptions consistent with MiniMax layout
- packed qkv and gate-up modules load correctly
- KV cache scales are not silently dropped
- ModelSlim quant layers do not assume a different MoE layout
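A sketch of the packed-module convention the loader has to preserve. The entries follow the common qkv/gate-up fusion and are assumptions to verify against the model file:

```python
# Checkpoint shards are fused into single runtime modules, so weight names
# must be remapped on load instead of matched one-to-one.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def remap_packed(name: str):
    # Return (runtime_name, shard_name) for a checkpoint weight, if packed.
    for packed, shards in packed_modules_mapping.items():
        for shard in shards:
            if shard in name:
                return name.replace(shard, packed), shard
    return None  # not a packed module; load as-is
```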
Stage M25-1: Fill the remaining quantized-loader gaps not yet in `main`
Status: tracked upstream PR work; not fully present in `origin/main` commit `c122d343a` as of 2026-04-21.
Some M2.5 quantized checkpoints use fused expert naming that the current mainline loader still does not fully cover.
- support fused expert mappings such as `w13`
- prefer explicit fused mapping before falling back to older `w1/w2/w3` logic (ordering sketched below)
- add a focused weight-loading test when you port this work
- AWQ or similar M2.5 checkpoints with fused expert weights load without local remapping hacks
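A sketch of the lookup order, with the explicit fused `w13` entry checked before the older `w1/w2/w3` fallback. The exact name fragments are assumptions about the checkpoint, not confirmed loader constants:

```python
# Checked in order: an already-fused w13 tensor loads whole, while unfused
# w1/w3 tensors pack into halves of the fused module.
FUSED_EXPERT_MAPPING = [
    ("experts.w13_weight", "experts.w13_weight", None),  # fused gate+up
    ("experts.w1_weight", "experts.w13_weight", 0),      # gate half
    ("experts.w3_weight", "experts.w13_weight", 1),      # up half
    ("experts.w2_weight", "experts.w2_weight", None),    # down projection
]

def resolve_expert_weight(name: str):
    for fragment, target, shard_id in FUSED_EXPERT_MAPPING:
        if fragment in name:
            return name.replace(fragment, target), shard_id
    return None
```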
Stage M25-2: Make the scale-out runtime contract explicit
Status: partly on `main`, partly still tracked from upstream PR work as of `origin/main` commit `c122d343a` on 2026-04-21.
For M2.5, the next bottleneck is often not a single kernel. It is the distributed contract across PP, EP, DP, and DeepEP.
- keep PP support from the mainline path
- make DeepEP runtime requirements explicit, especially hidden-size and dtype expectations (a contract-check sketch follows this list)
- treat DP support and DP-attention support as separate stages
- PP launches correctly
- DeepEP no longer fails due to unsupported hidden size or dtype mismatch
- the runtime contract is written down for the exact TP / DP / EP / PP shape you care about
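A sketch of making the DeepEP contract explicit up front. The concrete dtype and divisibility constraints here are placeholders to read from the deployed DeepEP build, not documented limits:

```python
def check_deepep_contract(hidden_size: int, dtype: str) -> None:
    # Fail at launch with a readable message instead of deep inside a kernel.
    supported_dtypes = {"bfloat16"}            # placeholder constraint
    if dtype not in supported_dtypes:
        raise ValueError(f"DeepEP path expects {supported_dtypes}, got {dtype}")
    if hidden_size % 128 != 0:                 # placeholder constraint
        raise ValueError(f"DeepEP path assumes hidden_size % 128 == 0, got {hidden_size}")
```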
Stage M25-3: Add the DP-attention and DEP communication optimizations
Status: mixed. #20067 is part of `main`; #20489 and #20975 remain tracked as upstream PR work not fully present in `origin/main` commit `c122d343a` as of 2026-04-21.
This is the biggest M2.5 scale-out gap. Performance and correctness both depend on using the attention-TP group rather than blindly reusing the model-TP group.
- use the attention-TP group and rank instead of the global TP group in MiniMax attention (a head-sharding sketch follows this list)
- allow reduce-scatter after MoE when padding or DEP makes it profitable
- support FP4 all-gather when the communication path can quantize before transport
- allow all-reduce fusion between the MoE output and the next attention preparation
- guard zero-token and empty-batch paths
- DP-attention MiniMax uses attention-TP metadata consistently
- DEP no longer performs an unnecessary all-reduce plus slice
- empty-batch or high-rank edge cases no longer crash
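A sketch of the group contract: head sharding must follow the attention-TP group, whose size can differ from the model-TP size under DP attention. Group plumbing is elided; the arguments stand in for the real distributed helpers:

```python
def attn_head_range(num_heads: int, attn_tp_rank: int, attn_tp_size: int) -> range:
    # Partition heads across the attention-TP group only. Using the global
    # TP rank/size here is exactly the bug this stage removes.
    assert num_heads % attn_tp_size == 0
    per_rank = num_heads // attn_tp_size
    return range(attn_tp_rank * per_rank, (attn_tp_rank + 1) * per_rank)
```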
Stage M25-4: Replace the older TP QK norm path with the fused JIT version
The older QK norm path was already specialized; the newer mainline path pushes it further by moving to a fused JIT kernel that reuses custom all-reduce v2 more efficiently.
- fuse TP Q and K norm into one custom op
- keep a fallback path for unsupported environments
- add a dedicated benchmark and unit test with the PR
- the MiniMax path can use the fused TP QK norm custom op
- the fallback path is still available when the JIT kernel cannot run
Stage M25-5: Fix TP16 and replicated-KV-head correctness
When `num_key_value_heads < tp_size`, multiple TP ranks can share the same KV head. That means the K norm weights and reductions must follow the replica layout, not a naive full-TP assumption.
- shard norm weights by logical head replica
- reduce only across ranks that share the same head (a replica-group sketch follows this list)
- do not assume the full TP group is the correct reduction group
- high-TP MiniMax-M2.5 runs do not produce repeated or garbled output caused by incorrect K norm sharding
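A replica-group sketch, assuming the consecutive-rank replication layout; verify the actual head-partition code before reusing the grouping:

```python
def kv_replica_groups(num_kv_heads: int, tp_size: int) -> list[list[int]]:
    # With num_kv_heads < tp_size, each KV head is replicated across
    # tp_size // num_kv_heads ranks; K-norm reductions must stay inside one
    # replica group and never span the full TP group.
    assert tp_size % num_kv_heads == 0
    replicas = tp_size // num_kv_heads
    return [list(range(h * replicas, (h + 1) * replicas)) for h in range(num_kv_heads)]

# Example: kv_replica_groups(8, 16) -> [[0, 1], [2, 3], ..., [14, 15]]
```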
Stage M25-6: Optional NVFP4 fallback for non-Blackwell GPUs
Status: mainline as #19652 by `origin/main` commit `c122d343a` on 2026-04-21; not MiniMax-specific, but directly relevant to some MiniMax-M2.5 deployments.
If the target checkpoint is an NVFP4 MiniMax variant on A100, H100, A40, or another non-Blackwell GPU, the real blocker may be the generic FP4 Marlin fallback rather than MiniMax model code.
- keep weights compressed in FP4
- route unsupported native FP4 cases to the Marlin fallback (a dispatch sketch follows this list)
- preserve both linear and MoE fallback paths
- NVFP4 MiniMax-family checkpoints can run coherently on non-Blackwell GPUs without decompression hacks
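A dispatch sketch for the fallback decision, assuming Blackwell means compute capability 10.x; the backend names are illustrative:

```python
import torch

def fp4_backend() -> str:
    # Native FP4 tensor cores arrive with Blackwell (sm_100+); older GPUs
    # keep weights compressed in FP4 and route GEMMs through Marlin.
    major, _ = torch.cuda.get_device_capability()  # requires a CUDA device
    return "native-fp4" if major >= 10 else "marlin-fp4"
```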
M2.7 Current-Main Validation Path
Use this path when the target is `MiniMaxAI/MiniMax-M2.7`, `MiniMaxAI/MiniMax-M2.7-highspeed`, or when an AMD MiniMax change might affect the currently registered M2.7 lanes. Current main has explicit AMD accuracy and performance coverage for M2.7, while first-class M2.7 and M2.7-highspeed docs are tracked by upstream PR #20873.
Stage M27-0: Treat M2.7 as a separate later-family validation target
M2.7 currently reuses the MiniMax-M2-family runtime code, but the active registered tests are not just copies of M2.5. They launch `MiniMaxAI/MiniMax-M2.7` on AMD with TP8+EP8 and the aiter attention backend.
- keep the model-file assumptions from the M2/M2.5 ladder unless current code proves M2.7 has a new runtime path
- validate M2.7 separately from M2.5 on AMD when changing attention, MoE communication, loader, or aiter-related behavior
- use the registered M2.7 model path override `MINIMAX_M27_MODEL_PATH` for local mirrors
- preserve launch details from the current tests: `--tp 8`, `--ep-size 8`, `--attention-backend aiter`, `--mem-fraction-static 0.85`, `SGLANG_USE_AITER=1`, multithread loading, and a long watchdog timeout (a launch sketch follows this list)
- inspect both MI30x/MI325 and MI35x lanes because they use distinct registered suites
- test/registered/amd/accuracy/mi30x/test_minimax_m27_eval_amd.py
- test/registered/amd/perf/mi30x/test_minimax_m27_perf_amd.py
- test/registered/amd/accuracy/mi35x/test_minimax_m27_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_minimax_m27_perf_mi35x.py
- .github/workflows/nightly-test-amd.yml
- .github/workflows/nightly-test-amd-rocm720.yml
- M2.7 accuracy and performance suites pass on the target AMD lane
- M2.5 and M2.7 failures are triaged independently
- docs do not become the only source of truth for M2.7 until a first-class M2.7 usage doc exists
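A minimal launch sketch assembled from the registered test details above; the registered suites additionally configure multithread loading and a long watchdog timeout, which are omitted here:

```python
import os
import subprocess

# Honor the registered model-path override for local mirrors.
model = os.environ.get("MINIMAX_M27_MODEL_PATH", "MiniMaxAI/MiniMax-M2.7")
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", model,
    "--tp", "8",
    "--ep-size", "8",
    "--attention-backend", "aiter",
    "--mem-fraction-static", "0.85",
]
subprocess.run(cmd, env={**os.environ, "SGLANG_USE_AITER": "1"}, check=True)
```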
Open PR Radar
Check these active upstream tracks before designing a new MiniMax skill or declaring a gap:
- #22934: M2.5 EPLB bugfix for missing `routed_experts_weights_of_layer` on `MiniMaxM2ForCausalLM`.
- #22744: NVIDIA `--enable-tf32-matmul` support aimed at reducing FP32 gate GEMM cost for MiniMax-M2.5 decode.
- #22300: FP8 GEMM/deepgemm scale-format fix for fp16 MiniMax-M2.5 models.
- #23301: token-by-token streaming of MiniMax-M2 string tool-call parameters.
- #20873: docs-only M2.7 and M2.7-highspeed support.
- #22432 and #23190: NPU split-QKV, TP RMSNorm, RoPE, Eagle3, and DP-attention fixes for MiniMax2.
- #17826, #19468, #20031, #20489, and #20975: remaining distributed, DeepEP, quant-loader, and DP-attention cleanup tracks carried from the earlier M2.5 radar.
Investigation Order
When debugging a MiniMax issue, prefer this order:
- classify the exact runtime shape
- check whether the relevant optimization already exists on `main` (a history-check sketch follows this list)
- if not, check whether it lives in a still-open upstream PR
- only then decide whether to port, reimplement, or defer it
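A small history-check sketch for the second step, using plain git against the pinned remote; the grep pattern is whatever identifies the optimization (a PR number, an env flag, a kernel name):

```python
import subprocess

def on_main(pattern: str) -> bool:
    # True if any origin/main commit message mentions the pattern -- a cheap
    # first check before assuming the work only lives in an open PR.
    log = subprocess.run(
        ["git", "log", "--oneline", "origin/main", "--grep", pattern],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(log.strip())

# Example: on_main("SGLANG_USE_FUSED_PARALLEL_QKNORM") or on_main("#20673")
```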
For the supporting evidence and commands, use:
- references/playbook.md
- references/pr-history.md
Anti-Patterns
- Do not treat attention TP and model TP as automatically identical for MiniMax-M2.5.
- Do not optimize a generic MoE kernel first if the real problem is MiniMax loader or topology plumbing.
- Do not assume a TP-only text launch proves PP, DP, DP-attention, or DeepEP correctness.
- Do not bypass `packed_modules_mapping` or KV-scale remapping just to make one checkpoint load.
- Do not copy still-open upstream PR behavior into production without noting that it is not on `main` yet.
- Do not assume MiniMax-M2.7 is validated by passing only MiniMax-M2.5 tests; use the M2.7 AMD lanes when the change can affect that checkpoint.
- Do not omit `--tool-call-parser minimax-m2` or `--reasoning-parser minimax-append-think` when validating tool or reasoning behavior from the serving docs.
- Do not miss `SGLANG_USE_FUSED_PARALLEL_QKNORM` when benchmarking or diagnosing the current TP QK norm path.