# SGLang Kimi K2/K2.5 Optimization
## Overview
The skill is an optimization ladder. Identify which stage the current code is at, apply the next missing optimization, and only move deeper after the earlier stage is satisfied.
Current-main snapshot:
This skill was refreshed against SGLang `origin/main` commit `c122d343a` on 2026-04-21. Since the older PR ladder was written, current main has added a Kimi-K2.5 usage doc, parser and OpenAI-serving tests for `kimi_k2`, Kimi-K2.5 LoRA regression coverage, AMD/GB300 validation lanes, and a Kimi-K2-Thinking stress test. Treat those as part of the active validation surface, not as optional CI trivia.
Active open PRs now also define several likely next skill updates: W4AFP8 loading, W4A16 DeepEP low-latency, Kimi-K2.5 multimodal processor fixes, ROCm fused QK RMSNorm, and JIT migration of the older K2 fused gate path.
One important non-open gap is Kimi-K2-Thinking DeepEP plus int4/Marlin: #13789 tried to support it but was closed unmerged after hitting an illegal memory access in the `fused_marlin_moe` path. Do not mark that combination as mainline-supported just because the generic Marlin JIT work in #19181 landed.
The historical evidence for every stage lives in:
- references/pr-history.md: merged PR evidence, benchmark tables, key code
- references/playbook.md: symptom mapping, commands, validation order
## Before You Change Anything
Record the exact serving shape first:
- K2 or K2.5
- thinking or instruct
- text-only or multimodal
- native or quantized weights
- TP / DP / EP / PP / PD / EPLB topology
- speculative decoding enabled or not
- NVIDIA / AMD / other backend
- launch parser pair: `--tool-call-parser kimi_k2` and `--reasoning-parser kimi_k2`
- LoRA enabled or not
- current SGLang commit and the matching registered-test lane
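A minimal way to capture that snapshot before any edit is a plain record like the one below; the field names are ad hoc illustration, not an SGLang API.

```python
from dataclasses import dataclass

# Ad hoc record of the serving shape; capture every axis before changing code.
@dataclass
class ServingShape:
    model: str          # "K2" or "K2.5"
    mode: str           # "thinking" or "instruct"
    modality: str       # "text-only" or "multimodal"
    weights: str        # "native" or the quant scheme, e.g. "int4_w4a16"
    topology: str       # e.g. "TP8", "TP4+EP", "PP2+PD"
    speculative: bool   # speculative decoding on or off
    backend: str        # "NVIDIA", "AMD", ...
    parsers: str        # "--tool-call-parser kimi_k2 --reasoning-parser kimi_k2"
    lora: bool          # LoRA enabled or not
    commit: str         # SGLang commit plus the matching registered-test lane
```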
## Core Principle
Do not treat K2 and K2.5 as one optimization problem.
- K2 is mainly a 384-expert router and MoE hot-path story.
- K2 thinking adds a separate quantized Marlin MoE story.
- K2.5 is much more wrapper-heavy: `text_config`, quant mapping, PP/PD/EPLB, multimodal DP encoder, Eagle3, and PCG compatibility all matter.
- Current K2/K2.5 serving also has a parser contract: tool calls and thinking output are expected to go through `kimi_k2`.
- Current open PRs split into three useful future tracks: quantized K2.5 loading/execution, multimodal wrapper correctness, and backend-specific fused norm or MoE kernel work.
For non-Kimi models, first decide which family they resemble more:
- a K2-like router and MoE hot-path problem
- a K2 thinking-like quantized Marlin MoE problem
- or a K2.5-like wrapper, multimodal, and runtime-plumbing problem
The optimization order also matters:
- enable the right shape or wrapper contract
- remove generic overhead
- add or select the Kimi-specific fast path
- micro-optimize the hot kernel
- harden PCG or distributed correctness
- tune per-device fused MoE configs
## What Transfers To Similar Models
Reuse this skill on a non-Kimi model when it shares one or more of these traits:
- a sparse MoE router whose real expert count or top-k shape breaks generic DeepSeek-era assumptions
- a wrapper that hides text-model metadata inside `text_config` or wrapper-local quant fields
- a multimodal tower whose throughput depends on DP encoder or DP attention execution details
- speculative decoding, PP, PD, EPLB, or PCG features that depend on wrapper surface area rather than only kernel speed
Reuse the optimization order and validation discipline, not the literal Kimi constants or filenames.
## Open PR Radar
Check these active upstream tracks before designing a new Kimi skill or declaring a gap:
- #22806: `KimiW4AFp8Config` and W4AFP8 model-loading support for Kimi-K2.5.
- #22496: Kimi-K2.5 W4A16 DeepEP low-latency path with JIT Marlin and direct DeepEP MoE work.
- #22964: `KimiGPUProcessorWrapper._cpu_call` output fix after grid metadata changed from `image_grid_thw` toward `grid_thws`.
- #23186: AMD/ROCm BF16 fused QK RMSNorm path for Kimi-K2.5-MXFP4.
- #19703: migrate `kimi_k2_moe_fused_gate` from AOT `sgl-kernel` into the JIT kernel module.
- #22488: generalize the Kimi2 fused MoE gate JIT path to GLM-5-style 256-expert shapes.
- #22208: AMD gfx950 small-M fused MoE config tuning for Kimi-K2.5 `int4_w4a16`.
- #21741: generic compressed-tensors W4AFP8 MoE support that underpins some Kimi-K2.5 quantized loading work.
Known closed gap to remember:
- #13789: attempted Kimi-K2-Thinking DeepEP support with int4/Marlin, but remained unmerged after an illegal memory access in `fused_marlin_moe`.
## K2 Evolution Path
Use this path when the target is `moonshotai/Kimi-K2*`.

### Stage K2-0: Basic support but not yet K2-shaped
The code may "support K2" in a broad sense, but still inherit DeepSeek assumptions that silently cap performance or break specialized kernels.
- hidden assumptions that `num_experts == 256`
- generic grouped-topk path still used for K2
- no dedicated K2 router or gate dispatch
- reference PR: #8013
- router-side code can legally run with 384 experts
- tests and benchmarks no longer hardcode 256 for the K2 path
### Stage K2-1: Remove the 256-experts router assumption
Before K2 can benefit from any later hot-path work, the router GEMM path has to accept 384 experts.
- make `dsv3_router_gemm` dispatch by runtime expert count
- keep the specialized unrolled kernel structure instead of falling back to a generic slow path
- preserve the existing hidden-size and output-dtype contract
- files: `sgl-kernel/csrc/gemm/dsv3_router_gemm_entry.cu`
- reference PR: #8013
- the K2 path uses the specialized router kernel with 384 experts
- there is no hidden fallback caused by a hardcoded 256
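A hedged sketch of the dispatch rule: derive the expert count at runtime and only take the specialized kernel for shapes it was built for. `specialized_router_gemm` is a stand-in for the `dsv3_router_gemm` binding, not the real sgl-kernel API.

```python
import torch

SPECIALIZED_EXPERT_COUNTS = {256, 384}  # assumption: kernel unrolled for both

def specialized_router_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return x @ w.t()  # placeholder; numerically equivalent to the fast kernel

def router_logits(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    num_experts = w.shape[0]  # runtime shape, never a hardcoded 256
    if num_experts in SPECIALIZED_EXPERT_COUNTS:
        return specialized_router_gemm(x, w)
    return x @ w.t()  # generic path for shapes outside the specialization
```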
### Stage K2-2: Remove generic grouped-topk overhead
K2 thinking has 384 experts and `num_expert_group == 1`. The generic grouped-topk path wastes time on masking and grouping logic that K2 does not need.
- add a narrow K2-specific topk implementation
- keep semantics identical: renormalization, routed scaling, expert remap, padded-token masking
- optimize the dispatch condition before touching CUDA
- files: `python/sglang/srt/layers/moe/topk.py`
- reference PR: #13150
- K2 uses a dedicated biased-topk path instead of the generic grouped implementation
- the router hotspot is smaller even before adding a fused CUDA op
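A minimal sketch of a K2-shaped biased top-k for the `num_expert_group == 1` case, assuming sigmoid scoring, a per-expert bias used only for selection, and renormalized weights with routed scaling. The semantics follow the checklist above; the real implementation lives in `topk.py`.

```python
import torch

def kimi_k2_biased_topk(logits, bias, top_k, routed_scaling_factor=1.0):
    scores = torch.sigmoid(logits)                          # [tokens, experts]
    _, topk_ids = torch.topk(scores + bias, top_k, dim=-1)  # bias steers selection only
    topk_weights = scores.gather(-1, topk_ids)              # weights from unbiased scores
    topk_weights = topk_weights / topk_weights.sum(-1, keepdim=True)  # renormalize
    return topk_weights * routed_scaling_factor, topk_ids
```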
### Stage K2-3: Fuse sigmoid, bias, top-k, and renorm into one CUDA op
When the Python or `torch.compile` topk path is still hot, K2 needs a model-specific fused gate kernel to cut launch overhead and reduce temporary traffic.
- build a dedicated `kimi_k2_moe_fused_gate` op for 384 experts
- separate small-token and large-token execution strategies
- fuse `sigmoid` + bias, top-k selection, writeback, and optional renormalize/scaling
- add a dedicated benchmark and unit test at the same time
- files: `sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu`, `sgl-kernel/tests/test_kimi_k2_moe_fused_gate.py`, `sgl-kernel/benchmark/bench_kimi_k2_moe_fused_gate.py`
- reference PR: #13287
- the K2 router hotspot moves from generic topk to the dedicated fused gate
- benchmark coverage exists for the exact K2 shape
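A sketch of the "ship the unit test with the kernel" rule: compare the fused gate against an unfused reference on the exact K2 shape. The fused op is passed in as a callable because the real binding lives in sgl-kernel.

```python
import torch

def reference_gate(logits, bias, top_k):
    scores = torch.sigmoid(logits)
    _, ids = torch.topk(scores + bias, top_k, dim=-1)
    weights = scores.gather(-1, ids)
    return weights / weights.sum(-1, keepdim=True), ids

def check_fused_gate(fused_gate, num_tokens=64, num_experts=384, top_k=8):
    logits = torch.randn(num_tokens, num_experts, device="cuda")
    bias = torch.randn(num_experts, device="cuda")
    ref_w, ref_ids = reference_gate(logits, bias, top_k)
    out_w, out_ids = fused_gate(logits, bias, top_k)
    torch.testing.assert_close(out_w, ref_w, rtol=1e-3, atol=1e-3)
    # compare expert sets order-insensitively
    assert torch.equal(out_ids.sort(-1).values, ref_ids.sort(-1).values)
```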
### Stage K2-4: Wire runtime dispatch to the best available K2 fast path
Adding a fast kernel is not enough; the runtime must actually choose the best maintained path for the K2 shape.
- dispatch by K2 shape in `topk.py`
- use `kimi_k2_moe_fused_gate` when it is the active best path for the current backend
- prefer `fused_topk_deepseek` when backend constraints are satisfied and it supersedes the older gate path
- keep the generic implementation alive for non-K2 shapes and non-CUDA backends
- treat dispatch order as part of the optimization contract, not as an incidental detail
- files: `python/sglang/srt/layers/moe/topk.py`
- CUDA K2 requests no longer route through the old compiled topk path
- kernel selection is deterministic for the active backend and shape
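A hedged sketch of the dispatch contract: the order of these checks is part of the optimization, not incidental. Predicate and op names mirror the checklist above; the real conditions in `topk.py` are more detailed.

```python
def select_topk_impl(num_experts, num_expert_group, is_cuda, available_ops):
    if is_cuda and num_experts == 384 and num_expert_group == 1:
        # prefer the newer maintained fast path when the backend allows it
        if "fused_topk_deepseek" in available_ops:
            return available_ops["fused_topk_deepseek"]
        if "kimi_k2_moe_fused_gate" in available_ops:
            return available_ops["kimi_k2_moe_fused_gate"]
    # non-K2 shapes and non-CUDA backends keep the generic implementation
    return available_ops["generic_grouped_topk"]
```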
### Stage K2-5: Micro-optimize the K2 fused gate kernel when it is still the hot path
After the fused gate exists, and while it is still the active router fast path, the next gains come from simplifying the kernel and improving memory behavior rather than inventing another algorithm.
- simplify dtype support to the dtype the hot path actually uses
- vectorize large-token loads with `float4`
- reduce `__syncthreads()` in the small-token path
- shrink shared-memory footprint by storing only selected top-k state
- preserve deterministic tie-breaking in reductions
- files: `sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu`
- reference PR: #13374
- the profiler shows the fused gate kernel materially smaller than the first fused version
- kernel correctness is unchanged
### Stage K2-6: Harden piecewise CUDA graph correctness
Fast kernels that are not PCG-safe become unusable in exactly the high-performance serving regime K2 cares about.
- register fake ops for shape and dtype propagation
- ensure invalid expert selections write deterministic zero outputs and zero indices
- treat PCG correctness bugs as first-class regressions, not edge cases
- files: `python/sglang/srt/layers/moe/topk.py`, `sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu`
- PCG capture and replay work without launch crashes or illegal instructions
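A minimal PCG-safety sketch: give a custom gate op a fake (meta) implementation so shape and dtype propagation work during graph capture. The namespace and op name are illustrative; this pattern requires PyTorch 2.4+ for `torch.library.custom_op`.

```python
import torch

@torch.library.custom_op("sglang_sketch::kimi_gate", mutates_args=())
def kimi_gate(logits: torch.Tensor, top_k: int) -> torch.Tensor:
    weights, _ = torch.topk(torch.sigmoid(logits), top_k, dim=-1)
    return weights

@kimi_gate.register_fake
def _(logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # fake impl: shapes and dtypes only, no real computation
    return logits.new_empty(logits.shape[0], top_k)
```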
### Stage K2-7: Clean up and harden the quantized K2 thinking Marlin MoE path
For K2 thinking, a lot of remaining latency and correctness risk lives in memory traffic, scratch-buffer handling, EP plumbing, PCG compatibility, and the active Marlin implementation boundary rather than in the router itself.
- move `fused_marlin_moe` into SGLang-side code so it can evolve without `sgl-kernel` release friction
- choose config via `try_get_optimal_moe_config(..., is_marlin=True)`
- reuse shared temporary buffers
- do not zero scratch buffers unless EP actually requires it
- only pass a real `expert_map` when dispatcher metadata exists
- keep the Marlin path PCG-safe
- optimize the active JIT-backed Marlin kernel implementation, not only the wrapper around it
- files: `python/sglang/srt/layers/moe/fused_moe_triton/fused_marlin_moe.py`, `python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py`, `python/sglang/jit_kernel/moe_wna16_marlin.py`
- TP-only K2 thinking no longer pays the fake-EP zeroing penalty
- EP-aware paths pass true expert metadata instead of placeholders
- PCG capture works for the quantized Marlin MoE path
- future kernel changes land in the active JIT-backed implementation instead of only in stale wrapper code
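A hedged sketch of the scratch-buffer rules: reuse one shared workspace and zero it only when an EP `expert_map` actually exists. Function and argument names are illustrative, not the `fused_marlin_moe` signature.

```python
import torch

_WORKSPACE_CACHE: dict = {}

def get_marlin_workspace(num_elems: int, device: torch.device, expert_map=None):
    buf = _WORKSPACE_CACHE.get(device)
    if buf is None or buf.numel() < num_elems:
        buf = torch.empty(num_elems, dtype=torch.int32, device=device)
        _WORKSPACE_CACHE[device] = buf
    if expert_map is not None:
        # EP path: the kernel reads slots it may not write, so they must be zero
        buf.zero_()
    # TP-only path skips the zeroing and its memory-traffic penalty
    return buf[:num_elems]
```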
### Stage K2-8: Add hardware-specific fused MoE tuning files
Even with the correct kernel path, K2 MoE throughput is device-sensitive. H20, H200, and B200 needed separate config tables.
- treat the tuning filename as part of the optimization contract
- match the exact Triton version, `E`, `N`, `dtype`, `block_shape`, and `device_name`
- do not reuse a config file just because the model name is also K2
- files: `python/sglang/srt/layers/moe/fused_moe_triton/configs/`
- the serving shape resolves to the exact per-device tuning file instead of a generic default
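A sketch of the per-device filename contract, assuming the `E=...,N=...,device_name=...` naming convention used by fused-MoE Triton configs; check the config loader for the authoritative format before saving files.

```python
def tuning_config_filename(E, N, device_name, dtype=None, block_shape=None):
    name = f"E={E},N={N},device_name={device_name}"
    if dtype is not None:
        name += f",dtype={dtype}"
    if block_shape is not None:
        name += f",block_shape={block_shape}"
    return name + ".json"

# Every field must match the serving shape, or the loader falls back silently:
print(tuning_config_filename(384, 2048, "NVIDIA_H200", dtype="fp8_w8a8"))
```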
## K2.5 Evolution Path
Use this path when the target is `moonshotai/Kimi-K2.5*`.

### Stage K25-0: Wrapper bring-up
K2.5 is not "just another DeepSeek model". It is a wrapped multimodal model, so optimization begins with the wrapper contract.
- expose the language model through a K2.5 wrapper
- keep multimodal tower and projector plumbing explicit
- do not assume later runtime features can bypass the wrapper
- reference PR: #17789
- K2.5 can launch with the wrapper and basic multimodal plumbing intact
### Stage K25-1: Make MoE config initialization wrapper-aware
The first K2.5 bottleneck was often not kernel speed, but the wrong config being read because MoE metadata lived in `text_config`.
- inspect `hf_config.text_config` when present
- do not assume top-level `hf_config` contains the text-model MoE fields
- files: `python/sglang/srt/managers/scheduler.py`
- reference PR: #18064
- fused MoE config init sees the real text-model expert metadata
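A minimal sketch of the lookup: read MoE metadata through `text_config` when the wrapper nests it there. The attribute names are the usual HF/DeepSeek-style fields and should be treated as assumptions for any specific checkpoint.

```python
def get_text_model_config(hf_config):
    # fall back to the top-level config for non-wrapped models
    return getattr(hf_config, "text_config", None) or hf_config

def get_moe_metadata(hf_config):
    cfg = get_text_model_config(hf_config)
    return {
        "num_experts": getattr(cfg, "n_routed_experts", None),
        "top_k": getattr(cfg, "num_experts_per_tok", None),
        "hidden_size": getattr(cfg, "hidden_size", None),
    }
```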
### Stage K25-2: Make quantized checkpoint loading robust
Before K2.5 can be optimized, quantized checkpoints must load with the right name mapping and quant metadata propagation.
- remap HF weight names into the SGLang wrapper layout
- remap excluded-module patterns too, not only main weights
- preserve `quant_config` on the wrapper itself
- files: `python/sglang/srt/models/kimi_k25.py`, `python/sglang/srt/layers/quantization/modelopt_quant.py`
- NVFP4 or related quantized K2.5 checkpoints load without wrapper-name mismatch
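A hedged sketch of the remapping rule: main weights and quant-exclusion patterns must go through the same rename, or quantization silently applies to the wrong modules. The `language_model.` prefix is an illustrative wrapper layout, not the exact `kimi_k25.py` mapping.

```python
def remap_weight_name(hf_name: str) -> str:
    if hf_name.startswith("model."):
        return "language_model." + hf_name
    return hf_name

def remap_excluded_modules(patterns: list[str]) -> list[str]:
    # exclusion patterns must survive the same rename as the main weights
    return [remap_weight_name(p) for p in patterns]
```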
### Stage K25-3: Add parallel runtime plumbing through the wrapper
K2.5 performance features often fail because the wrapper does not expose the metadata that PP, PD, EPLB, or PCG expect.
- forward `pp_proxy_tensors`
- expose `start_layer` and `end_layer`
- expose `self.model = self.language_model.model`
- expose `routed_experts_weights_of_layer`
- files: `python/sglang/srt/models/kimi_k25.py`, `python/sglang/srt/models/deepseek_v2.py`
- wrapper-level runtime features no longer need to special-case K2.5 manually
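A sketch of that wrapper surface, assuming a DeepSeek-style inner language model. The attribute names come from the checklist above; the real wrapper in `python/sglang/srt/models/kimi_k25.py` is the source of truth.

```python
class KimiK25WrapperSketch:
    def __init__(self, language_model, vision_tower, projector):
        self.language_model = language_model
        self.vision_tower = vision_tower
        self.projector = projector
        # expose the inner model so PP/PD/EPLB helpers can find it
        self.model = self.language_model.model
        self.start_layer = self.model.start_layer
        self.end_layer = self.model.end_layer

    @property
    def routed_experts_weights_of_layer(self):
        # EPLB reads routed-expert weights through the wrapper
        return self.model.routed_experts_weights_of_layer

    def forward(self, *args, pp_proxy_tensors=None, **kwargs):
        # PP proxy tensors must be forwarded, not swallowed, by the wrapper
        return self.language_model(*args, pp_proxy_tensors=pp_proxy_tensors, **kwargs)
```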
### Stage K25-4: Scale and stabilize the multimodal DP path
For multimodal K2.5, the vision path becomes part of the performance story. Local auto-batching alone was not enough, and DP-attention correctness issues could erase the gains.
- gate the vision tower by `mm_enable_dp_encoder`
- pass `use_data_parallel` through the wrapper and encoder stack
- use the DP-sharded vision execution path when enabled
- avoid extra reduction or mismatched behavior in the VLM DP-attention path
- files: `python/sglang/srt/models/kimi_k25.py`, `python/sglang/srt/layers/attention/vision.py`
- multimodal K2.5 can use the DP encoder path instead of only local vision execution
- enabling DP attention does not introduce extra reduction or launch mismatch
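A minimal sketch of the gating, assuming a flag named `mm_enable_dp_encoder` and a vision tower that accepts `use_data_parallel`. Both names come from the checklist; the actual sharding and gather logic in `vision.py` differs.

```python
def run_vision_tower(pixel_values, vision_tower, server_args, dp_rank, dp_size):
    if getattr(server_args, "mm_enable_dp_encoder", False) and dp_size > 1:
        # each DP rank encodes only its shard; results are gathered afterwards
        shard = pixel_values.chunk(dp_size, dim=0)[dp_rank]
        return vision_tower(shard, use_data_parallel=True)
    return vision_tower(pixel_values)  # default: local execution on every rank
```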
### Stage K25-5: Tune fused MoE for the real K2.5 quant and hardware shape
K2.5 fused MoE can look supported while still using a terrible default config. On AMD, the big gain came from teaching the tuning tools the real K2.5 wrapper and int4 shape.
- look through `text_config` in tuning utilities
- derive `block_shape` from `quantization_config["config_groups"]` when needed
- add full `int4_w4a16` support to benchmark and tuning scripts
- pass `ServerArgs` into worker processes so tuning matches real serving
- save the final per-shape configs under the exact filename contract
- files: `benchmark/kernels/fused_moe_triton/common_utils.py`, `benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py`, `python/sglang/srt/layers/moe/fused_moe_triton/configs/`
- reference PR: #19228
- K2.5 no longer falls back to a default fused-MoE config for the active quant and device
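A hedged sketch of deriving `block_shape` from a compressed-tensors style `quantization_config["config_groups"]`, looking through `text_config` first. The `"group"` strategy and `"group_size"` keys follow compressed-tensors conventions and are assumptions for any specific checkpoint.

```python
def derive_block_shape(hf_config: dict):
    cfg = hf_config.get("text_config", hf_config)
    qcfg = cfg.get("quantization_config", {})
    for group in qcfg.get("config_groups", {}).values():
        weights = group.get("weights", {})
        if weights.get("strategy") == "group" and weights.get("group_size"):
            return [1, weights["group_size"]]  # int4 group quant along N
    return None  # no block quantization declared
```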
### Stage K25-6: Make speculative decoding wrapper-compatible
For K2.5, speculative decoding support was mostly blocked by missing wrapper hooks, not by the draft model core itself.
- forward Eagle3 helper methods through the K2.5 wrapper
- keep embed/head exposure consistent with other optimized model wrappers
- files: `python/sglang/srt/models/kimi_k25.py`
- reference PR: #19689
- Eagle3 runtime code can call the wrapper the same way it calls direct language-model wrappers
### Stage K25-7: Harden multimodal + DP attention + speculative decode correctness
Once K2.5 mixes multimodal embeddings, DP attention, and speculative decoding, correctness bugs show up in extend mode and launch paths.
- respect `forward_batch.mm_input_embeds`
- append only the new tail token embedding in extend mode
- validate on the exact combined launch shape, not just TP-only text mode
- files: `python/sglang/srt/models/llama_eagle3.py`, `test/registered/8-gpu-models/test_kimi_k25.py`
- reference PR: #21391
- the combined multimodal + DP attention + MTP or Eagle3 path runs without launch crashes
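A sketch of the extend-mode rule: when multimodal input embeddings are present, append only the embedding of the new tail token instead of re-embedding the whole sequence. Shapes and attribute names are illustrative.

```python
import torch

def build_extend_embeds(forward_batch, embed_tokens, token_ids):
    mm_embeds = getattr(forward_batch, "mm_input_embeds", None)
    if mm_embeds is not None:
        tail = embed_tokens(token_ids[-1:])        # only the new tail token
        return torch.cat([mm_embeds, tail], dim=0)
    return embed_tokens(token_ids)                 # text-only: embed everything
```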
### Stage K25-8: Keep current-main serving and validation surfaces aligned
Current main has enough Kimi-K2.5 coverage that optimization work should preserve the launch, parser, multimodal-grid, LoRA, and backend validation contracts together.
- keep the documented Kimi-K2.5 launch contract in sync with tests: `--tool-call-parser kimi_k2` and `--reasoning-parser kimi_k2`
- preserve Kimi grid metadata flow: `grid_thws`, `KimiGridMMDataMixin`, and the GPU image preprocessing path with CPU-compatible inputs
- if LoRA, MoE LoRA sharing, attention backend selection, or logprob paths are touched, include the Kimi-K2.5 LoRA regression
- choose AMD validation by backend and quant shape:
  - native Kimi-K2.5 with `aiter` MLA uses TP4 on MI35x because 64 heads at TP8 gives only 8 heads per GPU
  - Kimi-K2.5-MXFP4 MI35x coverage uses TP8 and validates default plus FP8 KV-cache variants
- use GB300/NVFP4 lanes when changing Blackwell-specific quant, cache, or kernel behavior
- use the Kimi-K2-Thinking stress test when parser, long-run stability, or K2 thinking serving paths are involved
- files: `docs_new/docs/basic_usage/kimi_k2_5.mdx`, `python/sglang/srt/function_call/kimik2_detector.py`, `python/sglang/srt/parser/reasoning_parser.py`, `python/sglang/srt/multimodal/processors/kimi_common.py`, `python/sglang/srt/multimodal/processors/kimi_k25.py`, `test/registered/function_call/test_kimik2_detector.py`, `test/registered/lora/test_lora_kimi_k25_logprob_diff.py`, `test/registered/amd/accuracy/mi35x/test_kimi_k25_aiter_mla_eval_mi35x.py`, `test/registered/amd/accuracy/mi35x/test_kimi_k25_mxfp4_eval_mi35x.py`, `test/registered/gb300/test_kimi_k25.py`, `test/registered/gb300/test_kimi_k25_nvfp4.py`, `test/registered/stress/test_stress_kimi_k2.py`
- docs, registered launches, and stress tests agree on the parser flags
- image token expansion still derives from grid metadata instead of placeholder counts alone
- backend-specific tests run on the topology they were written for
- LoRA logprob coverage is not skipped when changing K2.5 adapter or shared-expert LoRA behavior
## Usage
When asked to optimize Kimi, or a structurally similar model, do not start from whatever file looks hottest. First place the current code on the ladder above.
Use this decision process:
- Determine whether the current problem is K2, K2.5, or which family it most closely resembles.
- Find the highest stage that is already satisfied.
- Work on the next missing stage, not a later one.
- Validate narrowly on the exact serving shape.
- Only after that, widen to more topology combinations.
Examples:
- K2 decode is still spending time in generic grouped-topk: you are missing K2-2, so do not jump straight to fused-MoE tuning files.
- K2 fused gate exists but PCG crashes: you are at K2-6, so focus on fake op registration or invalid-selection guards.
- K2.5 int4 on AMD uses a default config: you are at K25-5, so tune the wrapper-aware fused-MoE path before editing the model wrapper again.
- K2.5 PP works but Eagle3 multimodal crashes: you are at K25-6 or K25-7, not at the early support stages.
- K2.5 tool calls, thinking output, image requests, or LoRA logprobs regress after a runtime change: you are at K25-8; check launch parser flags, grid metadata, OpenAI-serving parser tests, and the targeted backend lane before blaming the MoE kernel.
For non-Kimi models, map them by structure:
- a new sparse MoE model with non-generic expert count or router shape: treat it like the K2 path first
- a quantized sparse MoE model whose hot path runs through Marlin: treat it like the K2 thinking Marlin path
- a wrapped multimodal model whose PP, PD, DP encoder, or speculative decode support keeps breaking: treat it like the K2.5 path first
## Guardrails
- Do not hardcode 256 experts anywhere on the K2 fast path.
- Do not collapse K2 and K2.5 into one generic DeepSeek optimization.
- Do not optimize generic kernels first if K2 already has a dedicated specialization.
- Do not bypass the K2.5 wrapper to make one feature work; later PP, PD, EPLB, or PCG features often depend on that wrapper surface.
- Do not trust a tuning file unless the Triton version, `E`, `N`, `dtype`, `block_shape`, and `device` all match the current shape.
- Do not validate only TP-only text mode if the real bug involves PP, DP encoder, multimodal inputs, EP, or speculative decoding.
- Do not omit `--tool-call-parser kimi_k2` or `--reasoning-parser kimi_k2` when validating current K2/K2.5 tool or thinking behavior.
- Do not run native Kimi-K2.5 aiter MLA on MI35x at TP8 and treat the failure as a model regression; the registered native aiter MLA lane documents TP4 as the supported shape.
- Do not copy Kimi-specific constants, dispatch predicates, or tuning filenames into a different model unless its serving shape actually matches.
## Validation
For K2 kernel work:
```bash
pytest -q sgl-kernel/tests/test_dsv3_router_gemm.py
pytest -q sgl-kernel/tests/test_kimi_k2_moe_fused_gate.py
pytest -q test/registered/kernels/test_fused_topk_deepseek.py
python benchmark/bench_kimi_k2_moe_fused_gate.py
```
For K2 quantized Marlin MoE work:
```bash
pytest -q test/registered/quant/test_marlin_moe.py
pytest -q python/sglang/jit_kernel/tests/test_moe_wna16_marlin.py
```
For K2.5 wrapper or combined runtime work:
```bash
pytest -q test/registered/8-gpu-models/test_kimi_k25.py
```
For current parser and OpenAI-serving behavior:
```bash
pytest -q test/registered/function_call/test_kimik2_detector.py
pytest -q test/registered/unit/parser/test_reasoning_parser.py -k KimiK2
pytest -q test/registered/unit/function_call/test_function_call_parser.py -k KimiK2
pytest -q test/registered/unit/entrypoints/openai/test_serving_chat.py -k kimi_k2
```
For backend-specific or adapter-sensitive K2.5 work, run only on matching hardware:
```bash
pytest -q test/registered/lora/test_lora_kimi_k25_logprob_diff.py
pytest -q test/registered/amd/accuracy/mi35x/test_kimi_k25_aiter_mla_eval_mi35x.py
pytest -q test/registered/amd/accuracy/mi35x/test_kimi_k25_mxfp4_eval_mi35x.py
pytest -q test/registered/gb300/test_kimi_k25.py
pytest -q test/registered/gb300/test_kimi_k25_nvfp4.py
pytest -q test/registered/stress/test_stress_kimi_k2.py
```
For tuning work:
- rerun only the relevant tuning script
- keep the real model, quant, TP, EP, and backend
- save output under the exact config filename contract
## References
- Historical evidence and benchmark tables: references/pr-history.md
- Symptom mapping and validation order: references/playbook.md