# SGLang
High-performance serving framework for LLMs and VLMs with RadixAttention for automatic prefix caching.
## When to use SGLang
Use SGLang when:
- Need structured outputs (JSON, regex, grammar)
- Building agents with repeated prefixes (system prompts, tools)
- Agentic workflows with function calling
- Multi-turn conversations with shared context
- Need faster JSON decoding (3× vs standard)
Use vLLM instead when:
- Simple text generation without structure
- Don't need prefix caching
- Want mature, widely-tested production system
Use TensorRT-LLM instead when:
- Lowest single-request latency (no batching needed)
- NVIDIA-only deployment
- Need FP8/INT4 quantization on H100
## Quick start
### Installation
```bash
# pip install (recommended)
pip install "sglang[all]"

# With FlashInfer (faster, CUDA 11.8/12.1)
pip install "sglang[all]" flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
### Launch server
```bash
# Basic server (Llama 3-8B)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# With RadixAttention (automatic prefix caching)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --enable-radix-cache  # Default: enabled

# Multi-GPU (tensor parallelism)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --port 30000
```
### Basic inference
```python
import sglang as sgl

# Set backend
sgl.set_default_backend(sgl.OpenAI("http://localhost:30000/v1"))

# Simple generation
@sgl.function
def simple_gen(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", max_tokens=100)

# Run
state = simple_gen.run(question="What is the capital of France?")
print(state["answer"])
# Output: "The capital of France is Paris."
```
### Structured JSON output
```python
import sglang as sgl

@sgl.function
def extract_person(s, text):
    s += f"Extract person information from: {text}\n"
    s += "Output JSON:\n"
    # Constrained JSON generation
    s += sgl.gen(
        "json_output",
        max_tokens=200,
        regex=r'\{"name": "[^"]+", "age": \d+, "occupation": "[^"]+"\}'
    )

# Run
state = extract_person.run(
    text="John Smith is a 35-year-old software engineer."
)
print(state["json_output"])
# Output: {"name": "John Smith", "age": 35, "occupation": "software engineer"}
```
## RadixAttention (key innovation)
What it does: Automatically caches and reuses common prefixes across requests.
Performance:
- 5× faster for agentic workloads with shared system prompts
- 10× faster for few-shot prompting with repeated examples
- Zero configuration - works automatically
How it works:
- Builds radix tree of all processed tokens
- Automatically detects shared prefixes
- Reuses KV cache for matching prefixes
- Only computes new tokens
Example (agent with a shared system prompt):

```
Request 1: [SYSTEM_PROMPT] + "What's the weather?"
  → Computes the full prompt (1000 tokens)

Request 2: [SAME_SYSTEM_PROMPT] + "Book a flight"
  → Reuses the system prompt KV cache (998 tokens)
  → Computes only the 2 new tokens
  → 5× faster
```
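The prefix-matching idea above can be sketched as a toy trie of processed token sequences. This is an illustration only — SGLang's actual radix tree manages KV-cache memory and eviction, and all names below are hypothetical:

```python
def insert_tokens(cache: dict, tokens: list) -> None:
    """Record a processed token sequence in a trie so later
    requests can discover shared prefixes."""
    node = cache
    for tok in tokens:
        node = node.setdefault(tok, {})

def longest_cached_prefix(cache: dict, tokens: list) -> int:
    """Return how many leading tokens of a new request are already
    in the trie, i.e. how much KV cache could be reused."""
    node, matched = cache, 0
    for tok in tokens:
        if tok not in node:
            break
        node = node[tok]
        matched += 1
    return matched

cache = {}
req1 = ["sys"] * 998 + ["weather", "?"]    # 1000-token first request
insert_tokens(cache, req1)

req2 = ["sys"] * 998 + ["book", "flight"]  # same system prompt, new query
reused = longest_cached_prefix(cache, req2)
print(reused, len(req2) - reused)  # 998 reused, 2 left to compute
```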
## Structured generation patterns
### JSON with schema
```python
@sgl.function
def structured_extraction(s, article):
    s += f"Article: {article}\n\n"
    s += "Extract key information as JSON:\n"
    # JSON schema constraint
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "summary": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
        },
        "required": ["title", "author", "summary", "sentiment"]
    }
    s += sgl.gen("info", max_tokens=300, json_schema=schema)

state = structured_extraction.run(article="...")
print(state["info"])
# Output: valid JSON matching the schema
```
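Constrained decoding guarantees the output parses against the schema, but a cheap client-side check of field types and the enum is still worthwhile before trusting the result downstream. A stdlib-only sketch (this validator is illustrative, not part of SGLang):

```python
import json

REQUIRED = {"title": str, "author": str, "summary": str, "sentiment": str}
ALLOWED_SENTIMENT = {"positive", "negative", "neutral"}

def validate_info(raw: str) -> dict:
    """Parse the model's JSON output and enforce the schema's
    required fields, types, and the sentiment enum."""
    data = json.loads(raw)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    if data["sentiment"] not in ALLOWED_SENTIMENT:
        raise ValueError("sentiment outside enum")
    return data

sample = '{"title": "T", "author": "A", "summary": "S", "sentiment": "neutral"}'
print(validate_info(sample)["sentiment"])  # neutral
```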
### Regex-constrained generation
```python
@sgl.function
def extract_email(s, text):
    s += f"Extract email from: {text}\n"
    s += "Email: "
    # Email regex pattern
    s += sgl.gen(
        "email",
        max_tokens=50,
        regex=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    )

state = extract_email.run(text="Contact john.doe@example.com for details")
print(state["email"])
# Output: "john.doe@example.com"
```
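On the client side, the same pattern can confirm that the generated string is a full match of the constraint (`is_valid_email` is an illustrative helper, not an SGLang API):

```python
import re

EMAIL_RE = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

def is_valid_email(candidate: str) -> bool:
    # fullmatch mirrors the server-side constraint: the whole
    # generated string must match, not just a substring
    return re.fullmatch(EMAIL_RE, candidate) is not None

print(is_valid_email("john.doe@example.com"))  # True
print(is_valid_email("not-an-email"))          # False
```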
### Grammar-based generation
````python
@sgl.function
def generate_code(s, description):
    s += f"Generate Python code for: {description}\n"
    s += "```python\n"
    # EBNF grammar for Python (simplified)
    python_grammar = """
    ?start: function_def
    function_def: "def" NAME "(" [parameters] "):" suite
    parameters: parameter ("," parameter)*
    parameter: NAME
    suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
    """
    s += sgl.gen("code", max_tokens=200, grammar=python_grammar)
    s += "\n```"
````

## Agent workflows with function calling
```python
import sglang as sgl

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    },
    {
        "name": "book_flight",
        "description": "Book a flight",
        "parameters": {
            "type": "object",
            "properties": {
                "from": {"type": "string"},
                "to": {"type": "string"},
                "date": {"type": "string"}
            }
        }
    }
]

@sgl.function
def agent_workflow(s, user_query, tools):
    # System prompt (cached with RadixAttention)
    s += "You are a helpful assistant with access to tools.\n"
    s += f"Available tools: {tools}\n\n"
    # User query
    s += f"User: {user_query}\n"
    s += "Assistant: "
    # Generate with function calling
    s += sgl.gen(
        "response",
        max_tokens=200,
        tools=tools,  # SGLang handles the tool call format
        stop=["User:", "\n\n"]
    )

# Multiple queries reuse the system prompt
state1 = agent_workflow.run(
    user_query="What's the weather in NYC?",
    tools=tools
)
# First call: computes the full system prompt

state2 = agent_workflow.run(
    user_query="Book a flight to LA",
    tools=tools
)
# Second call: reuses the system prompt KV cache (5× faster)
```
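Once the model emits a tool call, the application still has to execute it. A minimal dispatcher sketch — the `{"name": ..., "arguments": {...}}` wire format below is an assumption for illustration; check the format your model actually emits:

```python
import json

def dispatch_tool_call(raw: str, registry: dict):
    """Parse a {"name": ..., "arguments": {...}} tool call emitted by
    the model and invoke the matching Python function."""
    call = json.loads(raw)
    fn = registry[call["name"]]          # KeyError = unknown tool
    return fn(**call.get("arguments", {}))

# Hypothetical tool implementation keyed by the tool's declared name
registry = {"get_weather": lambda location: f"Sunny in {location}"}

raw = '{"name": "get_weather", "arguments": {"location": "NYC"}}'
print(dispatch_tool_call(raw, registry))  # Sunny in NYC
```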
## Performance benchmarks
### RadixAttention speedup
Few-shot prompting (10 examples in prompt):
- vLLM: 2.5 sec/request
- SGLang: 0.25 sec/request (10× faster)
- Throughput: 4× higher
Agent workflows (1000-token system prompt):
- vLLM: 1.8 sec/request
- SGLang: 0.35 sec/request (5× faster)
JSON decoding:
- Standard: 45 tok/s
- SGLang: 135 tok/s (3× faster)
### Throughput (Llama 3-8B, A100)
| Workload | vLLM | SGLang | Speedup |
|---|---|---|---|
| Simple generation | 2500 tok/s | 2800 tok/s | 1.12× |
| Few-shot (10 examples) | 500 tok/s | 5000 tok/s | 10× |
| Agent (tool calls) | 800 tok/s | 4000 tok/s | 5× |
| JSON output | 600 tok/s | 2400 tok/s | 4× |
## Multi-turn conversations
```python
@sgl.function
def multi_turn_chat(s, history, new_message):
    # System prompt (always cached)
    s += "You are a helpful AI assistant.\n\n"
    # Conversation history (cached as it grows)
    for msg in history:
        s += f"{msg['role']}: {msg['content']}\n"
    # New user message (only the new part is computed)
    s += f"User: {new_message}\n"
    s += "Assistant: "
    s += sgl.gen("response", max_tokens=200)

# Turn 1
history = []
state = multi_turn_chat.run(history=history, new_message="Hi there!")
history.append({"role": "User", "content": "Hi there!"})
history.append({"role": "Assistant", "content": state["response"]})

# Turn 2 (reuses Turn 1 KV cache)
state = multi_turn_chat.run(history=history, new_message="What's 2+2?")
# Only computes the new message (much faster)
# Append turn 2 to history the same way before continuing

# Turn 3 (reuses Turn 1 + Turn 2 KV cache)
state = multi_turn_chat.run(history=history, new_message="Tell me a joke")
# Progressively faster as history grows
```
## Advanced features
### Speculative decoding
```bash
# Launch with a draft model (2-3× faster)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-num-steps 5
```
### Multi-modal (vision models)
```python
@sgl.function
def describe_image(s, image_path):
    s += sgl.image(image_path)
    s += "Describe this image in detail: "
    s += sgl.gen("description", max_tokens=200)

state = describe_image.run(image_path="photo.jpg")
print(state["description"])
```

### Batching and parallel requests
```python
# Automatic batching (continuous batching)
states = sgl.run_batch(
    [
        simple_gen.bind(question="What is AI?"),
        simple_gen.bind(question="What is ML?"),
        simple_gen.bind(question="What is DL?"),
    ]
)
# All 3 processed in a single batch (efficient)
```
## OpenAI-compatible API
```bash
# Start server with OpenAI-compatible API
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# Use with curl
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are helpful"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

Works with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello"}]
)
```
## Supported models
Text models:
- Llama 2, Llama 3, Llama 3.1, Llama 3.2
- Mistral, Mixtral
- Qwen, Qwen2, QwQ
- DeepSeek-V2, DeepSeek-V3
- Gemma, Phi-3
Vision models:
- LLaVA, LLaVA-OneVision
- Phi-3-Vision
- Qwen2-VL
100+ models from HuggingFace
## Hardware support
NVIDIA: A100, H100, L4, T4 (CUDA 11.8+)
AMD: MI300, MI250 (ROCm 6.0+)
Intel: Xeon with GPU (coming soon)
Apple: M1/M2/M3 via MPS (experimental)
## References
- Structured Generation Guide - JSON schemas, regex, grammars, validation
- RadixAttention Deep Dive - How it works, optimization, benchmarks
- Production Deployment - Multi-GPU, monitoring, autoscaling
## Resources
- GitHub: https://github.com/sgl-project/sglang
- Docs: https://sgl-project.github.io/
- Paper: RadixAttention (arXiv:2312.07104)
- Discord: https://discord.gg/sglang