# sglang-skill: SGLang Development
## Source Code Locations

The SGLang source lives under `repos/sglang/` inside this skill's install directory. The actual path depends on the tool:

- Cursor: `~/.cursor/skills/sglang-skill/repos/sglang/`
- Claude Code: `~/.claude/skills/sglang-skill/repos/sglang/`
- Codex: `~/.agents/skills/sglang-skill/repos/sglang/`

The examples below use `SGLANG_REPO` as a placeholder; replace it with the actual path. If the path does not exist, run `bash update-repos.sh sglang` in the project directory.

## Core Runtime (SRT)
```
SGLANG_REPO/python/sglang/srt/
├── layers/
│   ├── attention/                     # Attention backends
│   │   ├── flashinfer_backend.py      # FlashInfer (default)
│   │   ├── flashinfer_mla_backend.py  # FlashInfer MLA (DeepSeek)
│   │   ├── cutlass_mla_backend.py     # CUTLASS MLA
│   │   ├── flashattention_backend.py  # FlashAttention
│   │   ├── triton_backend.py          # Triton attention
│   │   ├── flashmla_backend.py        # FlashMLA
│   │   ├── nsa_backend.py             # Native Sparse Attention
│   │   ├── tbo_backend.py             # TBO
│   │   ├── fla/                       # Flash Linear Attention
│   │   ├── triton_ops/                # Triton attention ops
│   │   └── wave_ops/                  # Wave attention ops
│   ├── moe/                           # MoE routing and dispatch
│   ├── quantization/                  # FP8, GPTQ, AWQ, Marlin, etc.
│   ├── deep_gemm_wrapper/             # DeepGEMM integration
│   └── utils/
├── models/                            # Model implementations (LLaMA, DeepSeek, Qwen, etc.)
│   └── deepseek_common/               # Shared DeepSeek V2/V3 components
├── managers/                          # Scheduler, TokenizerManager, Detokenizer
├── mem_cache/                         # KV cache, radix cache
├── model_executor/                    # Model executor, forward batch
├── model_loader/                      # Model loading, weight mapping
├── entrypoints/                       # Launch entry points: Engine, OpenAI API server
├── speculative/                       # Speculative decoding
├── disaggregation/                    # Disaggregated prefill/decode
├── distributed/                       # TP/PP/EP distributed execution
├── compilation/                       # CUDA Graph, torch.compile
├── configs/                           # Model configs
├── lora/                              # LoRA inference
├── eplb/                              # Expert-level load balancing
├── hardware_backend/                  # Hardware support (CUDA, ROCm, XPU)
└── utils/                             # Utility functions
```

## JIT Kernels (Python CUDA/Triton Kernels)
```
SGLANG_REPO/python/sglang/jit_kernel/
├── flash_attention/         # Custom Flash Attention implementations
├── flash_attention_v4.py    # Flash Attention v4
├── cutedsl_gdn.py           # CuTeDSL GDN kernel
├── concat_mla.py            # MLA concat kernel
├── norm.py                  # Normalization kernels
├── rope.py                  # RoPE position encoding
├── pos_enc.py               # Position encoding
├── per_tensor_quant_fp8.py  # FP8 quantization
├── kvcache.py               # KV cache kernels
├── hicache.py               # HiCache kernels
├── gptq_marlin.py           # GPTQ Marlin kernel
├── cuda_wait_value.py       # CUDA sync primitives
└── diffusion/               # Diffusion model kernels
```

## sgl-kernel (C++/CUDA Custom Kernels)
```
SGLANG_REPO/sgl-kernel/
├── csrc/
│   ├── attention/           # Custom attention CUDA kernels
│   ├── cutlass_extensions/  # CUTLASS GEMM extensions
│   ├── gemm/                # GEMM kernels
│   ├── moe/                 # MoE dispatch/combine kernels
│   ├── quantization/        # Quantization CUDA kernels
│   ├── allreduce/           # AllReduce CUDA kernels
│   ├── speculative/         # Speculative decoding kernels
│   ├── kvcacheio/           # KV cache I/O
│   ├── mamba/               # Mamba SSM kernels
│   ├── memory/              # Memory management
│   └── grammar/             # Grammar-guided generation
├── include/                 # C++ headers
├── python/                  # Python bindings
├── tests/                   # Kernel tests
└── benchmark/               # Kernel benchmarks
```

## Frontend Language
```
SGLANG_REPO/python/sglang/lang/   # SGLang frontend DSL
SGLANG_REPO/examples/             # Usage examples
SGLANG_REPO/benchmark/            # Performance benchmarks
SGLANG_REPO/test/                 # Test suites
SGLANG_REPO/docs/                 # Documentation
```

## Search Strategy

Search with the Grep tool; do not load whole files.
### Attention and MLA

```bash
SGLANG_REPO="$HOME/.cursor/skills/sglang-skill/repos/sglang"

# Find attention backend registration
rg "register|Backend" $SGLANG_REPO/python/sglang/srt/layers/attention/attention_registry.py

# Find the FlashInfer MLA implementation
rg "forward|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/flashinfer_mla_backend.py

# Find CUTLASS MLA
rg "cutlass|mla" $SGLANG_REPO/python/sglang/srt/layers/attention/cutlass_mla_backend.py

# Find the common attention interface
rg "class.*Backend|def forward" $SGLANG_REPO/python/sglang/srt/layers/attention/base_attn_backend.py
```

### Scheduler and Batching
```bash
# Scheduler core logic
rg "class Scheduler|def get_next_batch" $SGLANG_REPO/python/sglang/srt/managers/

# Continuous batching and chunked prefill
rg "chunk|prefill|extend" $SGLANG_REPO/python/sglang/srt/managers/

# CUDA Graph usage
rg "cuda_graph|CudaGraph" $SGLANG_REPO/python/sglang/srt/compilation/
```
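As background for the chunked-prefill search above: the idea is simply to split a long prompt into fixed-size chunks so prefill work can be interleaved with decode steps. A minimal sketch of the split, with illustrative names only:

```python
# Toy chunked-prefill split: break a long prompt into fixed-size
# chunks so a scheduler can interleave prefill work with decode
# steps. Illustrative only; the real logic lives in srt/managers/.
def chunk_prefill(prompt_tokens, chunk_size):
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

print(chunk_prefill(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```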
### KV Cache and Memory
```bash
# Radix cache implementation
rg "RadixCache|radix" $SGLANG_REPO/python/sglang/srt/mem_cache/

# KV cache management
rg "class.*Pool|allocate|free" $SGLANG_REPO/python/sglang/srt/mem_cache/

# HiCache (hierarchical cache)
rg "HiCache|hicache" $SGLANG_REPO/python/sglang/srt/mem_cache/
```
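The core idea behind the radix cache is prefix reuse: a new request's token prefix is matched against cached sequences so their KV entries can be reused. A minimal sketch of the matching idea, where a set of tuples stands in for the radix tree (not SGLang's actual `RadixCache` API):

```python
# Minimal prefix-reuse sketch: find the longest cached token prefix so
# its KV entries could be reused. SGLang's RadixCache stores prefixes
# in a radix tree; a set of tuples is enough to show the idea.
class ToyPrefixCache:
    def __init__(self):
        self.cached = set()  # cached token-id prefixes, as tuples

    def insert(self, tokens):
        # Cache every prefix of the sequence.
        for i in range(1, len(tokens) + 1):
            self.cached.add(tuple(tokens[:i]))

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of tokens."""
        best = 0
        for i in range(1, len(tokens) + 1):
            if tuple(tokens[:i]) in self.cached:
                best = i
        return best

cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4])               # e.g. a shared system prompt
print(cache.match_prefix([1, 2, 3, 9]))  # 3 tokens reusable
```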
### Models
```bash
# Find a specific model implementation
rg "class.*ForCausalLM" $SGLANG_REPO/python/sglang/srt/models/

# DeepSeek V2/V3 implementation
rg "DeepSeek|MLA|MoE" $SGLANG_REPO/python/sglang/srt/models/deepseek_v2.py

# Model loading and weight mapping
rg "load_weight|weight_map" $SGLANG_REPO/python/sglang/srt/model_loader/
```

### MoE
```bash
# MoE routing
rg "TopK|router|expert" $SGLANG_REPO/python/sglang/srt/layers/moe/

# MoE CUDA kernels
rg "moe" $SGLANG_REPO/sgl-kernel/csrc/moe/
```
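The routing these files implement boils down to picking the top-k experts per token and softmax-normalizing their scores. A pure-Python sketch of that step (illustrative only, not SGLang's routing code):

```python
import math

# Toy top-k expert routing: select the k highest-scoring experts for
# one token and softmax-normalize their scores into mixing weights.
def topk_route(logits, k=2):
    """logits: per-expert router scores for a single token."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    total = sum(exps)
    return idx, [e / total for e in exps]

experts, weights = topk_route([0.1, 2.0, 0.5, 1.5], k=2)
print(experts)  # [1, 3]
```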
### Quantization
```bash
# FP8 quantization
rg "fp8|float8" $SGLANG_REPO/python/sglang/srt/layers/quantization/

# GPTQ/AWQ/Marlin
rg "gptq|awq|marlin" $SGLANG_REPO/python/sglang/srt/layers/quantization/
```

### Speculative Decoding
```bash
rg "speculative|draft|verify" $SGLANG_REPO/python/sglang/srt/speculative/
```
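For orientation: in speculative decoding a draft model proposes several tokens, which the target model then verifies in one forward pass. A toy greedy-matching sketch of the verify step (real verification is probabilistic; names are illustrative):

```python
# Toy sketch of the speculative-decoding verify step: keep the longest
# prefix where the draft agrees with the target, and replace the first
# mismatch with the target's own token. Greedy matching only.
def verify(draft_tokens, target_tokens):
    """Both lists cover the same positions; return the accepted tokens."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # first mismatch: take the target's token
            break
        accepted.append(d)
    return accepted

print(verify([5, 7, 9, 2], [5, 7, 8, 1]))  # [5, 7, 8]
```

When every draft token is accepted, one pass of the target model yields several output tokens, which is the source of the speedup.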
### Distributed
```bash
# TP/PP/EP
rg "tensor_parallel|pipeline_parallel|expert_parallel" $SGLANG_REPO/python/sglang/srt/distributed/

# Disaggregated serving
rg "disagg|prefill_worker|decode_worker" $SGLANG_REPO/python/sglang/srt/disaggregation/
```

## When to Use Each Source
| Need | Source | Path |
|---|---|---|
| Attention backend interface | SRT layers | `python/sglang/srt/layers/attention/base_attn_backend.py` |
| FlashInfer attention | SRT layers | `python/sglang/srt/layers/attention/flashinfer_backend.py` |
| MLA (DeepSeek) | SRT layers | `python/sglang/srt/layers/attention/flashinfer_mla_backend.py` |
| MoE routing/dispatch | SRT layers | `python/sglang/srt/layers/moe/` |
| Quantization (FP8/GPTQ/AWQ) | SRT layers | `python/sglang/srt/layers/quantization/` |
| Scheduler | SRT managers | `python/sglang/srt/managers/` |
| KV cache / radix cache | SRT mem_cache | `python/sglang/srt/mem_cache/` |
| Model implementations | SRT models | `python/sglang/srt/models/` |
| DeepSeek V2/V3 | SRT models | `python/sglang/srt/models/deepseek_v2.py` |
| Speculative decoding | SRT speculative | `python/sglang/srt/speculative/` |
| Disaggregated serving | SRT disagg | `python/sglang/srt/disaggregation/` |
| TP/PP/EP distributed | SRT distributed | `python/sglang/srt/distributed/` |
| CUDA Graph | SRT compilation | `python/sglang/srt/compilation/` |
| Model loading | SRT model_loader | `python/sglang/srt/model_loader/` |
| Launch entry points | SRT entrypoints | `python/sglang/srt/entrypoints/` |
| JIT Triton kernels | jit_kernel | `python/sglang/jit_kernel/` |
| Custom CUDA kernels | sgl-kernel | `sgl-kernel/csrc/` |
| CUTLASS extensions | sgl-kernel | `sgl-kernel/csrc/cutlass_extensions/` |
| Frontend DSL | lang | `python/sglang/lang/` |
| Usage examples | examples | `examples/` |
## Common Development Scenarios
### Adding a New Attention Backend

- Subclass `AttnBackend` in `base_attn_backend.py`
- Implement the `forward()` method
- Register it in `attention_registry.py`
- Use `flashinfer_backend.py` as a template
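The registration flow above follows a registry pattern that can be sketched standalone. All names below are illustrative placeholders, not SGLang's actual classes:

```python
# Minimal sketch of the register-and-dispatch pattern behind
# srt/layers/attention/. Placeholder names only, not SGLang's real API.
from abc import ABC, abstractmethod

ATTENTION_BACKENDS = {}  # name -> backend class

def register_backend(name):
    """Decorator that records a backend class under a string key."""
    def deco(cls):
        ATTENTION_BACKENDS[name] = cls
        return cls
    return deco

class BaseAttnBackend(ABC):
    @abstractmethod
    def forward(self, q, k, v):
        """Compute the attention output for one batch."""

@register_backend("naive")
class NaiveAttnBackend(BaseAttnBackend):
    def forward(self, q, k, v):
        # Stand-in arithmetic for a real attention computation.
        return [qi * ki + vi for qi, ki, vi in zip(q, k, v)]

# Dispatch by name, as a server flag selecting a backend would.
backend = ATTENTION_BACKENDS["naive"]()
print(backend.forward([1.0], [2.0], [0.5]))  # [2.5]
```

Registering at import time means the scheduler can select a backend purely by its string name.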
### Adding a New Model

- Create a model file in `srt/models/`
- Implement the `ForCausalLM` class
- Implement the `load_weights()` method
- Use `srt/models/llama.py` as a template
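The `load_weights()` contract boils down to mapping checkpoint tensor names onto module parameters. A toy sketch with hypothetical names (see `srt/models/llama.py` for the real pattern):

```python
# Toy sketch of the load_weights() idea: map checkpoint tensor names
# onto module parameters. All names are hypothetical; plain floats
# stand in for tensors.
class ToyForCausalLM:
    def __init__(self):
        # local parameter name -> value
        self.params = {"embed.weight": 0.0, "lm_head.weight": 0.0}
        # checkpoint name -> local parameter name
        self.name_map = {
            "model.embed_tokens.weight": "embed.weight",
            "lm_head.weight": "lm_head.weight",
        }

    def load_weights(self, weights):
        """weights: iterable of (checkpoint_name, tensor) pairs.
        Real implementations also handle fused and sharded weights."""
        for ckpt_name, tensor in weights:
            target = self.name_map.get(ckpt_name)
            if target is not None:
                self.params[target] = tensor

model = ToyForCausalLM()
model.load_weights([("model.embed_tokens.weight", 1.5)])
print(model.params["embed.weight"])  # 1.5
```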
### Adding a New Quantization Method

- Add a quantization module in `srt/layers/quantization/`
- Register it with the quantization factory
- Use `fp8_kernel.py` or `gptq.py` as references
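As a reminder of what these modules compute, here is a pure-Python sketch of per-tensor symmetric quantization; the real modules implement this with CUDA kernels and FP8/INT formats:

```python
# Pure-Python sketch of per-tensor symmetric quantization: one scale
# per tensor, values rounded to a signed integer grid. Illustrative
# only; srt/layers/quantization/ does this with CUDA kernels.
def quantize_per_tensor(values, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1              # 127 for 8-bit signed
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0  # one scale per tensor
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

q, scale = quantize_per_tensor([0.5, -1.0, 0.25])
roundtrip = dequantize(q, scale)  # approximates inputs to within scale/2
```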
## Launch and Debug

```bash
# Launch an OpenAI-compatible API server
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 1
```

Use the Engine API from Python:

```python
from sglang import Engine

engine = Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")
output = engine.generate("Hello, my name is", {"max_new_tokens": 32})
```
### Profiling

```bash
# Enable torch.compile
python -m sglang.launch_server --model-path ... --enable-torch-compile

# Profile with Nsight Systems
nsys profile -o report python -m sglang.launch_server ...
```

## Updating the SGLang Source
```bash
# From the cursor-gpu-skills project directory
bash update-repos.sh sglang
```

## Additional References
- SGLang documentation: https://docs.sglang.ai/
- GitHub: https://github.com/sgl-project/sglang