sglang-kimi-k2-k25-optimization


SGLang Kimi K2/K2.5 Optimization

Overview

The skill is an optimization ladder. Identify which stage the current code is at, apply the next missing optimization, and only move deeper after the earlier stage is satisfied.
Current-main snapshot: This skill was refreshed against SGLang origin/main commit c122d343a on 2026-04-21. Since the older PR ladder was written, current main has added a Kimi-K2.5 usage doc, parser and OpenAI-serving tests for kimi_k2, Kimi-K2.5 LoRA regression coverage, AMD/GB300 validation lanes, and a Kimi-K2-Thinking stress test. Treat those as part of the active validation surface, not as optional CI trivia. Active open PRs now also define several next likely skill updates: W4AFP8 loading, W4A16 DeepEP low-latency, Kimi-K2.5 multimodal processor fixes, ROCm fused QK RMSNorm, and JIT migration of the older K2 fused gate path. One important non-open gap is Kimi-K2-Thinking DeepEP plus int4/Marlin: #13789 tried to support it but was closed unmerged after hitting an illegal memory access in the fused_marlin_moe path. Do not mark that combination as mainline-supported just because the generic Marlin JIT work in #19181 landed.
The historical evidence for every stage lives in:
  • references/pr-history.md: merged PR evidence, benchmark tables, key code
  • references/playbook.md: symptom mapping, commands, validation order

Before You Change Anything

Record the exact serving shape first:
  • K2 or K2.5
  • thinking or instruct
  • text-only or multimodal
  • native or quantized weights
  • TP / DP / EP / PP / PD / EPLB topology
  • speculative decoding enabled or not
  • NVIDIA / AMD / other backend
  • launch parser pair:
    --tool-call-parser kimi_k2
    and
    --reasoning-parser kimi_k2
  • LoRA enabled or not
  • current SGLang commit and the matching registered-test lane
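The checklist above can be captured as a small record so every optimization or bug report pins the exact serving shape. This is a minimal sketch: the `ServingShape` class and its field names are illustrative, not SGLang API.

```python
from dataclasses import dataclass, field

@dataclass
class ServingShape:
    """Illustrative record of the serving shape to capture before any change."""
    model_family: str                 # "K2" or "K2.5"
    mode: str                         # "thinking" or "instruct"
    modality: str                     # "text" or "multimodal"
    weights: str                      # "native" or "quantized"
    topology: dict = field(default_factory=dict)  # e.g. {"tp": 8, "ep": 1}
    speculative_decoding: bool = False
    backend: str = "nvidia"           # "nvidia", "amd", ...
    parsers: tuple = ("kimi_k2", "kimi_k2")  # (--tool-call-parser, --reasoning-parser)
    lora: bool = False
    sglang_commit: str = ""

# Example: the shape a quantized multimodal K2.5 deployment would record.
shape = ServingShape(
    model_family="K2.5", mode="instruct", modality="multimodal",
    weights="quantized", topology={"tp": 8, "dp": 1},
    sglang_commit="c122d343a",
)
```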

Core Principle

Do not treat K2 and K2.5 as one optimization problem.
  • K2 is mainly a
    384 experts
    router and MoE hot-path story.
  • K2 thinking adds a separate quantized Marlin MoE story.
  • K2.5 is much more wrapper-heavy:
    text_config
    , quant mapping, PP/PD/EPLB, multimodal DP encoder, Eagle3, and PCG compatibility all matter.
  • Current K2/K2.5 serving also has a parser contract: tool calls and thinking output are expected to go through
    kimi_k2
    .
  • Current open PRs split into three useful future tracks: quantized K2.5 loading/execution, multimodal wrapper correctness, and backend-specific fused norm or MoE kernel work.
For non-Kimi models, first decide which family they resemble more:
  • a K2-like router and MoE hot-path problem
  • a K2 thinking-like quantized Marlin MoE problem
  • or a K2.5-like wrapper, multimodal, and runtime-plumbing problem
The optimization order also matters:
  1. enable the right shape or wrapper contract
  2. remove generic overhead
  3. add or select the Kimi-specific fast path
  4. micro-optimize the hot kernel
  5. harden PCG or distributed correctness
  6. tune per-device fused MoE configs

What Transfers To Similar Models

Reuse this skill on a non-Kimi model when it shares one or more of these traits:
  • a sparse MoE router whose real expert count or top-k shape breaks generic DeepSeek-era assumptions
  • a wrapper that hides text-model metadata inside
    text_config
    or wrapper-local quant fields
  • a multimodal tower whose throughput depends on DP encoder or DP attention execution details
  • speculative decoding, PP, PD, EPLB, or PCG features that depend on wrapper surface area rather than only kernel speed
Reuse the optimization order and validation discipline, not the literal Kimi constants or filenames.

Open PR Radar

Check these active upstream tracks before designing a new Kimi skill or declaring a gap:
  • #22806:
    KimiW4AFp8Config
    and W4AFP8 model-loading support for Kimi-K2.5.
  • #22496: Kimi-K2.5 W4A16 DeepEP low-latency path with JIT Marlin and direct DeepEP MoE work.
  • #22964:
    KimiGPUProcessorWrapper._cpu_call
    output fix after grid metadata changed from
    grid_thws
    toward
    image_grid_thw
    .
  • #23186: AMD/ROCm BF16 fused QK RMSNorm path for
    Kimi-K2.5-MXFP4
    .
  • #19703: migrate
    kimi_k2_moe_fused_gate
    from AOT
    sgl-kernel
    into the JIT kernel module.
  • #22488: generalize the Kimi2 fused MoE gate JIT path to GLM-5-style
    256
    -expert shapes.
  • #22208: AMD gfx950 small-M fused MoE config tuning for Kimi-K2.5
    int4_w4a16
    .
  • #21741: generic compressed-tensors W4AFP8 MoE support that underpins some Kimi-K2.5 quantized loading work.
Known closed gap to remember:
  • #13789: attempted Kimi-K2-Thinking DeepEP support with int4/Marlin, but remained unmerged after an illegal memory access in
    fused_marlin_moe
    .

K2 Evolution Path

Use this path when the target is moonshotai/Kimi-K2*.

Stage K2-0: Basic support but not yet K2-shaped

The code may "support K2" in a broad sense, but still inherit DeepSeek assumptions that silently cap performance or break specialized kernels.
  • hidden assumptions that
    num_experts == 256
  • generic grouped-topk path still used for K2
  • no dedicated K2 router or gate dispatch
  • router-side code can legally run with
    384
    experts
  • tests and benchmarks no longer hardcode
    256
    for the K2 path
  • reference PR: #8013

Stage K2-1: Remove the 256 experts router assumption

Before K2 can benefit from any later hot-path work, the router GEMM path has to accept 384 experts.
  • make
    dsv3_router_gemm
    dispatch by runtime expert count
  • keep the specialized unrolled kernel structure instead of falling back to a generic slow path
  • preserve the existing hidden-size and output-dtype contract
  • sgl-kernel/csrc/gemm/dsv3_router_gemm_entry.cu
  • the K2 path uses the specialized router kernel with
    384
    experts
  • there is no hidden fallback caused by a hardcoded
    256
  • reference PR: #8013
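The dispatch idea can be sketched in a few lines: pick the specialized kernel instantiation from the runtime expert count instead of a hardcoded 256. The function and kernel-name strings below are hypothetical stand-ins for the dispatch inside dsv3_router_gemm_entry.cu, not its actual symbols.

```python
# Hypothetical dispatch sketch: select a specialized router-GEMM instantiation
# by runtime expert count instead of assuming num_experts == 256.
SPECIALIZED_EXPERT_COUNTS = {256, 384}  # shapes with unrolled kernels

def select_router_kernel(num_experts: int) -> str:
    if num_experts in SPECIALIZED_EXPERT_COUNTS:
        return f"dsv3_router_gemm_e{num_experts}"  # keep the unrolled structure
    return "generic_router_gemm"  # slow fallback; the K2 path must never land here
```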

Stage K2-2: Remove generic grouped-topk overhead

K2 thinking has 384 experts and num_expert_group == 1. The generic grouped-topk path wastes time on masking and grouping logic that K2 does not need.
  • add a narrow K2-specific topk implementation
  • keep semantics identical: renormalization, routed scaling, expert remap, padded-token masking
  • optimize the dispatch condition before touching CUDA
  • python/sglang/srt/layers/moe/topk.py
  • K2 uses a dedicated biased-topk path instead of the generic grouped implementation
  • the router hotspot is smaller even before adding a fused CUDA op
  • reference PR: #13150
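The gate semantics that must stay identical can be written as a tiny per-token reference in pure Python: sigmoid scores, an additive bias used only for selection, top-k, then renormalization of the unbiased scores and routed scaling. The bias-only-for-selection detail follows DeepSeek-style gates and is an assumption here; verify it against the actual topk.py semantics before relying on it.

```python
import math

def k2_gate_reference(logits, bias, top_k, routed_scaling=1.0, renormalize=True):
    """Per-token reference for a biased-sigmoid top-k gate (illustrative)."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid activation
    biased = [s + b for s, b in zip(scores, bias)]          # selection score only
    idx = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:top_k]
    weights = [scores[i] for i in idx]                      # unbiased routing weights
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return idx, [w * routed_scaling for w in weights]
```

A narrow fused implementation must match this reference bit-for-bit on renormalization, scaling, and deterministic tie-breaking (Python's stable sort keeps the lower expert index on ties).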

Stage K2-3: Fuse sigmoid, bias, top-k, and renorm into one CUDA op

When the Python or torch.compile topk path is still hot, K2 needs a model-specific fused gate kernel to cut launch overhead and reduce temporary traffic.
  • build a dedicated
    kimi_k2_moe_fused_gate
    op for
    384 experts
  • separate small-token and large-token execution strategies
  • fuse
    sigmoid
    ,
    + bias
    , top-k selection, writeback, and optional renormalize/scaling
  • add a dedicated benchmark and unit test at the same time
  • sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu
  • sgl-kernel/tests/test_kimi_k2_moe_fused_gate.py
  • sgl-kernel/benchmark/bench_kimi_k2_moe_fused_gate.py
  • the K2 router hotspot moves from generic topk to the dedicated fused gate
  • benchmark coverage exists for the exact K2 shape
  • reference PR: #13287

Stage K2-4: Wire runtime dispatch to the best available K2 fast path

Adding a fast kernel is not enough; the runtime must actually choose the best maintained path for the K2 shape.
  • dispatch by K2 shape in
    topk.py
  • use
    kimi_k2_moe_fused_gate
    when it is the active best path for the current backend
  • prefer
    fused_topk_deepseek
    when backend constraints are satisfied and it supersedes the older gate path
  • keep the generic implementation alive for non-K2 shapes and non-CUDA backends
  • treat dispatch order as part of the optimization contract, not as an incidental detail
  • python/sglang/srt/layers/moe/topk.py
  • CUDA K2 requests no longer route through the old compiled topk path
  • kernel selection is deterministic for the active backend and shape
  • reference PRs: #13332, #15347, #17325
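The dispatch contract above can be sketched as an ordered predicate chain: the order is explicit, deterministic, and part of the optimization contract. The kernel names mirror the paths discussed above; the exact conditions are assumptions for illustration, not the real predicates in topk.py.

```python
def select_topk_impl(num_experts, num_expert_group, backend, deepseek_topk_ok):
    """Deterministic, ordered kernel selection for the K2 router shape (sketch)."""
    is_k2_shape = num_experts == 384 and num_expert_group == 1
    if backend == "cuda" and is_k2_shape:
        if deepseek_topk_ok:                 # newer path supersedes the gate op
            return "fused_topk_deepseek"
        return "kimi_k2_moe_fused_gate"      # dedicated fused gate fast path
    return "grouped_topk_generic"            # non-K2 shapes and non-CUDA backends
```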

Stage K2-5: Micro-optimize the K2 fused gate kernel when it is still the hot path

After the fused gate exists, and when it is still the active router fast path, the next gains come from simplifying the kernel and improving memory behavior rather than inventing another algorithm.
  • simplify dtype support to the dtype the hot path actually uses
  • vectorize large-token loads with
    float4
  • reduce
    __syncthreads()
    in the small-token path
  • shrink shared-memory footprint by storing only selected top-k state
  • preserve deterministic tie-breaking in reductions
  • sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu
  • profiler shows the fused gate kernel materially smaller than the first fused version
  • kernel correctness is unchanged
  • reference PR: #13374

Stage K2-6: Harden piecewise CUDA graph correctness

Fast kernels that are not PCG-safe become unusable in exactly the high-performance serving regime K2 cares about.
  • register fake ops for shape and dtype propagation
  • ensure invalid expert selections write deterministic zero outputs and zero indices
  • treat PCG correctness bugs as first-class regressions, not edge cases
  • python/sglang/srt/layers/moe/topk.py
  • sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu
  • PCG capture and replay work without launch crashes or illegal instructions
  • reference PRs: #13466, #15306
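A fake op only has to reproduce the output shapes and dtypes of the real kernel so graph capture can trace through it; the real registration would go through torch.library against the actual op schema. The shape contract assumed here, [tokens, experts] scores in and [tokens, top_k] values plus indices out, is illustrative, and the logic can be checked in plain Python:

```python
def fused_gate_fake(scores_shape, top_k):
    """Shape propagation a fake op must implement for PCG tracing (sketch)."""
    tokens, num_experts = scores_shape
    assert top_k <= num_experts
    values_shape = (tokens, top_k)    # router weights per token
    indices_shape = (tokens, top_k)   # expert ids; invalid slots must be zeroed
    return values_shape, indices_shape
```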

Stage K2-7: Clean up and harden the quantized K2 thinking Marlin MoE path

For K2 thinking, a lot of remaining latency and correctness risk lives in memory traffic, scratch-buffer handling, EP plumbing, PCG compatibility, and the active Marlin implementation boundary rather than in the router itself.
  • move
    fused_marlin_moe
    into SGLang-side code so it can evolve without
    sgl-kernel
    release friction
  • choose config via
    try_get_optimal_moe_config(..., is_marlin=True)
  • reuse shared temporary buffers
  • do not zero scratch buffers unless EP actually requires it
  • only pass a real
    expert_map
    when dispatcher metadata exists
  • keep the Marlin path PCG-safe
  • optimize the active JIT-backed Marlin kernel implementation, not only the wrapper around it
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_marlin_moe.py
  • python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
  • python/sglang/jit_kernel/moe_wna16_marlin.py
  • tp-only K2 thinking no longer pays the fake-EP zeroing penalty
  • EP-aware paths pass true expert metadata instead of placeholders
  • PCG capture works for the quantized Marlin MoE path
  • future kernel changes land in the active JIT-backed implementation instead of only in stale wrapper code

Stage K2-8: Add hardware-specific fused MoE tuning files

Even with the correct kernel path, K2 MoE throughput is device-sensitive. H20, H200, and B200 needed separate config tables.
  • treat the tuning filename as part of the optimization contract
  • match exact
    Triton version
    ,
    E
    ,
    N
    ,
    dtype
    ,
    block_shape
    , and
    device_name
  • do not reuse a config file just because the model name is also K2
  • python/sglang/srt/layers/moe/fused_moe_triton/configs/
  • the serving shape resolves to the exact per-device tuning file instead of a generic default
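The filename contract can be made executable so a mismatch is caught before serving. The builder below follows the E=...,N=... convention of the fused-MoE Triton config files; treat the exact field order and optional fields as assumptions to verify against the files already in the configs directory.

```python
def fused_moe_config_filename(E, N, device_name, dtype=None, block_shape=None):
    """Build the per-device fused-MoE tuning filename (sketch of the contract)."""
    name = f"E={E},N={N},device_name={device_name}"
    if dtype is not None:
        name += f",dtype={dtype}"
    if block_shape is not None:
        name += f",block_shape={block_shape}"
    return name + ".json"
```

A lookup that resolves to this exact name, rather than a generic default, is the exit criterion for this stage.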

K2.5 Evolution Path

Use this path when the target is moonshotai/Kimi-K2.5*.

Stage K25-0: Wrapper bring-up

K2.5 is not "just another DeepSeek model". It is a wrapped multimodal model, so optimization begins with the wrapper contract.
  • expose the language model through a K2.5 wrapper
  • keep multimodal tower and projector plumbing explicit
  • do not assume later runtime features can bypass the wrapper
  • K2.5 can launch with the wrapper and basic multimodal plumbing intact
  • reference PR: #17789

Stage K25-1: Make MoE config initialization wrapper-aware

阶段K25-1:让MoE配置初始化感知包装器

The first K2.5 bottleneck was often not kernel speed, but the wrong config being read because MoE metadata lived in text_config.
  • inspect
    hf_config.text_config
    when present
  • do not assume top-level
    hf_config
    contains the text-model MoE fields
  • python/sglang/srt/managers/scheduler.py
  • fused MoE config init sees the real text-model expert metadata
  • reference PR: #18064
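The wrapper-aware lookup is one line once stated explicitly: prefer text_config when present, fall back to the top-level config otherwise. A minimal sketch with dummy config objects; the n_routed_experts field name is illustrative.

```python
from types import SimpleNamespace

def get_text_model_config(hf_config):
    """Return the config object that actually holds the text-model MoE fields."""
    return getattr(hf_config, "text_config", None) or hf_config

# K2.5-style wrapped config: MoE metadata lives one level down.
wrapped = SimpleNamespace(text_config=SimpleNamespace(n_routed_experts=384))
# Flat config: metadata is top-level, so the fallback applies.
flat = SimpleNamespace(n_routed_experts=256)
```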

Stage K25-2: Make quantized checkpoint loading robust

Before K2.5 can be optimized, quantized checkpoints must load with the right name mapping and quant metadata propagation.
  • remap HF weight names into the SGLang wrapper layout
  • remap excluded-module patterns too, not only main weights
  • preserve
    quant_config
    on the wrapper itself
  • python/sglang/srt/models/kimi_k25.py
  • python/sglang/srt/layers/quantization/modelopt_quant.py
  • NVFP4 or related quantized K2.5 checkpoints load without wrapper-name mismatch
  • reference PRs: #18370, #18440
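The remapping is plain string rewriting from HF checkpoint names to the wrapper layout, and the same rewriting has to cover excluded-module patterns from the quant config, not only the main weights. The prefixes below are illustrative, not the exact K2.5 mapping.

```python
# Illustrative prefix remap from HF checkpoint names to a wrapper layout.
PREFIX_MAP = {
    "model.": "language_model.model.",
    "lm_head.": "language_model.lm_head.",
}

def remap_weight_name(name: str) -> str:
    for src, dst in PREFIX_MAP.items():
        if name.startswith(src):
            return dst + name[len(src):]
    return name  # vision tower / projector names pass through unchanged

def remap_exclusion_patterns(patterns):
    # Quant exclusion patterns must be remapped too, or quantized layers
    # silently lose their exclusions under the wrapper layout.
    return [remap_weight_name(p) for p in patterns]
```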

Stage K25-3: Add parallel runtime plumbing through the wrapper

K2.5 performance features often fail because the wrapper does not expose the metadata that PP, PD, EPLB, or PCG expect.
  • forward
    pp_proxy_tensors
  • expose
    start_layer
    and
    end_layer
  • expose
    self.model = self.language_model.model
  • expose
    routed_experts_weights_of_layer
  • python/sglang/srt/models/kimi_k25.py
  • python/sglang/srt/models/deepseek_v2.py
  • wrapper-level runtime features no longer need to special-case K2.5 manually
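The plumbing amounts to forwarding attributes and hooks from the inner language model so runtime features see the wrapper as a normal model. A minimal sketch; the exposed attribute names follow the bullets above, everything else is illustrative stand-in code.

```python
class InnerLanguageModel:
    """Stand-in for the wrapped text model."""
    def __init__(self):
        self.model = object()                 # the actual decoder stack
        self.start_layer, self.end_layer = 0, 61

class KimiK25WrapperSketch:
    def __init__(self):
        self.language_model = InnerLanguageModel()
        # Expose what PP / PD / EPLB / PCG expect on the top-level model.
        self.model = self.language_model.model
        self.start_layer = self.language_model.start_layer
        self.end_layer = self.language_model.end_layer

    def forward(self, batch, pp_proxy_tensors=None):
        # pp_proxy_tensors must be forwarded, not dropped, for PP to work.
        return ("ran", pp_proxy_tensors)
```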

Stage K25-4: Scale and stabilize the multimodal DP path

For multimodal K2.5, the vision path becomes part of the performance story. Local auto-batching alone was not enough, and DP-attention correctness issues could erase the gains.
  • gate the vision tower by
    mm_enable_dp_encoder
  • pass
    use_data_parallel
    through the wrapper and encoder stack
  • use the DP-sharded vision execution path when enabled
  • avoid extra reduction or mismatched behavior in the VLM DP-attention path
  • python/sglang/srt/models/kimi_k25.py
  • python/sglang/srt/layers/attention/vision.py
  • multimodal K2.5 can use the DP encoder path instead of only local vision execution
  • enabling DP attention does not introduce extra reduction or launch mismatch
  • reference PRs: #17991, #18689

Stage K25-5: Tune fused MoE for the real K2.5 quant and hardware shape

K2.5 fused MoE can look supported while still using a terrible default config. On AMD, the big gain came from teaching the tuning tools the real K2.5 wrapper and int4 shape.
  • look through
    text_config
    in tuning utilities
  • derive
    block_shape
    from
    quantization_config["config_groups"]
    when needed
  • add full
    int4_w4a16
    support to benchmark and tuning scripts
  • pass
    ServerArgs
    into worker processes so tuning matches real serving
  • save the final per-shape configs under the exact filename contract
  • benchmark/kernels/fused_moe_triton/common_utils.py
  • benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py
  • python/sglang/srt/layers/moe/fused_moe_triton/configs/
  • K2.5 no longer falls back to a default fused-MoE config for the active quant and device
  • reference PR: #19228
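Deriving block_shape from the checkpoint's quantization_config can be sketched directly. The config_groups layout and the [0, group_size] result below follow a compressed-tensors-style int4_w4a16 checkpoint and are assumptions; verify the real keys against the checkpoint being tuned.

```python
def derive_block_shape(quantization_config):
    """Pull a block shape out of config_groups (illustrative layout)."""
    groups = quantization_config.get("config_groups", {})
    for group in groups.values():
        weights = group.get("weights", {})
        group_size = weights.get("group_size")
        if group_size:                  # e.g. int4_w4a16 with group_size=128
            return [0, group_size]      # per-group quant along the input dim
    return None                         # no group quant: tune untiled
```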

Stage K25-6: Make speculative decoding wrapper-compatible

For K2.5, speculative decoding support was mostly blocked by missing wrapper hooks, not by the draft model core itself.
  • forward Eagle3 helper methods through the K2.5 wrapper
  • keep embed/head exposure consistent with other optimized model wrappers
  • python/sglang/srt/models/kimi_k25.py
  • Eagle3 runtime code can call the wrapper the same way it calls direct language-model wrappers
  • reference PR: #19689

Stage K25-7: Harden multimodal + DP attention + speculative decode correctness

Once K2.5 mixes multimodal embeddings, DP attention, and speculative decoding, correctness bugs show up in extend mode and launch paths.
  • respect
    forward_batch.mm_input_embeds
  • append only the new tail token embedding in extend mode
  • validate on the exact combined launch shape, not just TP-only text mode
  • python/sglang/srt/models/llama_eagle3.py
  • test/registered/8-gpu-models/test_kimi_k25.py
  • the combined multimodal + DP attention + MTP or Eagle3 path runs without launch crashes
  • reference PR: #21391

Stage K25-8: Keep current-main serving and validation surfaces aligned

Current main has enough Kimi-K2.5 coverage that optimization work should preserve the launch, parser, multimodal-grid, LoRA, and backend validation contracts together.
  • keep the documented Kimi-K2.5 launch contract in sync with tests:
    --tool-call-parser kimi_k2
    and
    --reasoning-parser kimi_k2
  • preserve Kimi grid metadata flow:
    grid_thws
    ,
    KimiGridMMDataMixin
    , and the GPU image preprocessing path with CPU-compatible inputs
  • if LoRA, MoE LoRA sharing, attention backend selection, or logprob paths are touched, include the Kimi-K2.5 LoRA regression
  • choose AMD validation by backend and quant shape:
    • native Kimi-K2.5 with
      aiter
      MLA uses TP4 on MI35x because 64 heads at TP8 gives only 8 heads per GPU
    • Kimi-K2.5-MXFP4 MI35x coverage uses TP8 and validates default plus FP8 KV-cache variants
  • use GB300/NVFP4 lanes when changing Blackwell-specific quant, cache, or kernel behavior
  • use the Kimi-K2-Thinking stress test when parser, long-run stability, or K2 thinking serving paths are involved
  • docs_new/docs/basic_usage/kimi_k2_5.mdx
  • python/sglang/srt/function_call/kimik2_detector.py
  • python/sglang/srt/parser/reasoning_parser.py
  • python/sglang/srt/multimodal/processors/kimi_common.py
  • python/sglang/srt/multimodal/processors/kimi_k25.py
  • test/registered/function_call/test_kimik2_detector.py
  • test/registered/lora/test_lora_kimi_k25_logprob_diff.py
  • test/registered/amd/accuracy/mi35x/test_kimi_k25_aiter_mla_eval_mi35x.py
  • test/registered/amd/accuracy/mi35x/test_kimi_k25_mxfp4_eval_mi35x.py
  • test/registered/gb300/test_kimi_k25.py
  • test/registered/gb300/test_kimi_k25_nvfp4.py
  • test/registered/stress/test_stress_kimi_k2.py
  • docs, registered launches, and stress tests agree on the parser flags
  • image token expansion still derives from grid metadata instead of placeholder counts alone
  • backend-specific tests run on the topology they were written for
  • LoRA logprob coverage is not skipped when changing K2.5 adapter or shared-expert LoRA behavior

Usage

When asked to optimize Kimi, or a structurally similar model, do not start from whatever file looks hottest. First place the current code on the ladder above.
Use this decision process:
  1. Determine whether the current problem is K2, K2.5, or which family it most closely resembles.
  2. Find the highest stage that is already satisfied.
  3. Work on the next missing stage, not a later one.
  4. Validate narrowly on the exact serving shape.
  5. Only after that, widen to more topology combinations.
Examples:
  • K2 decode is still spending time in generic grouped-topk: you are missing
    K2-2
    , so do not jump straight to fused-MoE tuning files.
  • K2 fused gate exists but PCG crashes: you are at
    K2-6
    , so focus on fake op registration or invalid-selection guards.
  • K2.5 int4 on AMD uses a default config: you are at
    K25-5
    , so tune the wrapper-aware fused-MoE path before editing the model wrapper again.
  • K2.5 PP works but Eagle3 multimodal crashes: you are at
    K25-6
    or
    K25-7
    , not at the early support stages.
  • K2.5 tool calls, thinking output, image requests, or LoRA logprobs regress after a runtime change: you are at
    K25-8
    ; check launch parser flags, grid metadata, OpenAI-serving parser tests, and the targeted backend lane before blaming the MoE kernel.
For non-Kimi models, map them by structure:
  • a new sparse MoE model with non-generic expert count or router shape: treat it like the K2 path first
  • a quantized sparse MoE model whose hot path runs through Marlin: treat it like the K2 thinking Marlin path
  • a wrapped multimodal model whose PP, PD, DP encoder, or speculative decode support keeps breaking: treat it like the K2.5 path first
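The decision process above can be mirrored as a tiny ladder walk: order the stages, mark which are satisfied, and the next work item is the first unsatisfied one. Stage names follow this document; the satisfaction checks themselves are whatever evidence you gathered earlier.

```python
K2_LADDER = ["K2-0", "K2-1", "K2-2", "K2-3", "K2-4", "K2-5", "K2-6", "K2-7", "K2-8"]

def next_stage(ladder, satisfied):
    """Return the first unsatisfied stage; never skip ahead on the ladder."""
    for stage in ladder:
        if stage not in satisfied:
            return stage
    return None  # ladder complete: widen to more topology combinations
```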

Guardrails

  • Do not hardcode
    256
    experts anywhere on the K2 fast path.
  • Do not collapse K2 and K2.5 into one generic DeepSeek optimization.
  • Do not optimize generic kernels first if K2 already has a dedicated specialization.
  • Do not bypass the K2.5 wrapper to make one feature work; later PP, PD, EPLB, or PCG features often depend on that wrapper surface.
  • Do not trust a tuning file unless
    Triton version
    ,
    E
    ,
    N
    ,
    dtype
    ,
    block_shape
    , and
    device
    all match the current shape.
  • Do not validate only TP-only text mode if the real bug involves PP, DP encoder, multimodal inputs, EP, or speculative decoding.
  • Do not omit
    --tool-call-parser kimi_k2
    or
    --reasoning-parser kimi_k2
    when validating current K2/K2.5 tool or thinking behavior.
  • Do not run native Kimi-K2.5 aiter MLA on MI35x at TP8 and treat the failure as a model regression; the registered native aiter MLA lane documents TP4 as the supported shape.
  • Do not copy Kimi-specific constants, dispatch predicates, or tuning filenames into a different model unless its serving shape actually matches.

Validation

For K2 kernel work:

```bash
pytest -q sgl-kernel/tests/test_dsv3_router_gemm.py
pytest -q sgl-kernel/tests/test_kimi_k2_moe_fused_gate.py
pytest -q test/registered/kernels/test_fused_topk_deepseek.py
python sgl-kernel/benchmark/bench_kimi_k2_moe_fused_gate.py
```

For K2 quantized Marlin MoE work:

```bash
pytest -q test/registered/quant/test_marlin_moe.py
pytest -q python/sglang/jit_kernel/tests/test_moe_wna16_marlin.py
```

For K2.5 wrapper or combined runtime work:

```bash
pytest -q test/registered/8-gpu-models/test_kimi_k25.py
```

For current parser and OpenAI-serving behavior:

```bash
pytest -q test/registered/function_call/test_kimik2_detector.py
pytest -q test/registered/unit/parser/test_reasoning_parser.py -k KimiK2
pytest -q test/registered/unit/function_call/test_function_call_parser.py -k KimiK2
pytest -q test/registered/unit/entrypoints/openai/test_serving_chat.py -k kimi_k2
```

For backend-specific or adapter-sensitive K2.5 work, run only on matching hardware:

```bash
pytest -q test/registered/lora/test_lora_kimi_k25_logprob_diff.py
pytest -q test/registered/amd/accuracy/mi35x/test_kimi_k25_aiter_mla_eval_mi35x.py
pytest -q test/registered/amd/accuracy/mi35x/test_kimi_k25_mxfp4_eval_mi35x.py
pytest -q test/registered/gb300/test_kimi_k25.py
pytest -q test/registered/gb300/test_kimi_k25_nvfp4.py
pytest -q test/registered/stress/test_stress_kimi_k2.py
```

For tuning work:
  • rerun only the relevant tuning script
  • keep the real model, quant, TP, EP, and backend
  • save output under the exact config filename contract

References

  • Historical evidence and benchmark tables: references/pr-history.md
  • Symptom mapping and validation order: references/playbook.md