
vLLM - High-Performance LLM Serving


Quick start


vLLM achieves up to 24x higher throughput than standard Hugging Face transformers through PagedAttention (block-based KV cache management) and continuous batching (mixing prefill and decode requests).

Installation:

```bash
pip install vllm
```

Basic offline inference:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

OpenAI-compatible server:

```bash
vllm serve meta-llama/Llama-3-8B-Instruct
```
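The PagedAttention block-table idea mentioned above can be illustrated with a toy allocator (purely illustrative: the block size, pool size, and sequence names here are made up, and this is not vLLM's actual implementation):

```python
# Toy sketch of a PagedAttention-style block table: each sequence's KV cache
# is a list of fixed-size blocks drawn from a shared pool, so memory is not
# reserved contiguously up front for the maximum sequence length.
BLOCK_SIZE = 4                 # tokens per block (illustrative)
free_blocks = list(range(8))   # shared pool of physical block ids
block_tables = {}              # sequence id -> list of physical block ids
token_counts = {}              # sequence id -> tokens written so far

def append_token(seq_id):
    """Record one more KV entry, allocating a new block only when needed."""
    n = token_counts.get(seq_id, 0)
    if n % BLOCK_SIZE == 0:    # current block is full (or this is the first token)
        block_tables.setdefault(seq_id, []).append(free_blocks.pop(0))
    token_counts[seq_id] = n + 1

for _ in range(6):
    append_token("seq-A")      # 6 tokens -> 2 blocks
for _ in range(3):
    append_token("seq-B")      # 3 tokens -> 1 block

print(block_tables)            # → {'seq-A': [0, 1], 'seq-B': [2]}
print(len(free_blocks), "blocks still free")  # → 5 blocks still free
```

Because blocks are allocated on demand, memory freed by finished sequences is immediately reusable by new ones, which is what enables the continuous-batching throughput gains.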

Query with OpenAI SDK

```bash
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```

Common workflows


Workflow 1: Production API deployment


Copy this checklist and track progress:

Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics

**Step 1: Configure server settings**

Choose a configuration based on your model size:

```bash
# For 7B-13B models on a single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0
```

**Step 2: Test with limited traffic**

Run a load test before going to production:

```bash
# Install the load-testing tool
pip install locust

# Create test_load.py with sample requests, then run:
locust -f test_load.py --host http://localhost:8000
```


Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
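To evaluate the TTFT target against measured samples, a nearest-rank percentile helper is enough (a minimal sketch; the `ttft_samples` values below are hypothetical and would come from your load test):

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 0-100) of a list of numbers."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical TTFT measurements (seconds) collected during the load test
ttft_samples = [0.12, 0.18, 0.21, 0.25, 0.31, 0.42, 0.47, 0.55]
print(f"p50 TTFT: {percentile(ttft_samples, 50):.2f}s")
print(f"p95 TTFT: {percentile(ttft_samples, 95):.2f}s")
```

Judging the target on a tail percentile (p95) rather than the mean avoids declaring success while a meaningful fraction of requests still miss the 500ms budget.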

**Step 3: Enable monitoring**

vLLM exposes Prometheus metrics on port 9090:

```bash
curl http://localhost:9090/metrics | grep vllm
```

Key metrics to monitor:
  • vllm:time_to_first_token_seconds - Latency
  • vllm:num_requests_running - Active requests
  • vllm:gpu_cache_usage_perc - KV cache utilization
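A scrape can be sanity-checked in Python with a tiny parser for the Prometheus text format (a minimal sketch; the `sample` text is hypothetical, and labeled metrics and histograms would need more handling):

```python
def parse_prometheus_metrics(text):
    """Parse simple 'name value' lines from a Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines whose value is not numeric
    return metrics

# Hypothetical scrape output
sample = """\
# HELP vllm:num_requests_running Number of requests currently running
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.72
"""
parsed = parse_prometheus_metrics(sample)
print(parsed["vllm:gpu_cache_usage_perc"])  # → 0.72
```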
**Step 4: Deploy to production**

Use Docker for consistent deployment:


```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
```

**Step 5: Verify performance metrics**

Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs

Workflow 2: Offline batch inference


For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
**Step 1: Prepare input data**

```python
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
```

**Step 2: Configure LLM engine**

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
```

**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

```python
# Process all prompts in one call.
# vLLM handles batching internally; no need to manually chunk prompts.
outputs = llm.generate(prompts, sampling)
```


**Step 4: Process results**

```python
import json

# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
```
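Given result records in that shape, overall throughput can be summarized with a small helper (a sketch; `elapsed_seconds` is assumed to come from timing the `generate` call yourself, e.g. with `time.perf_counter`):

```python
def summarize_throughput(results, elapsed_seconds):
    """Compute total generated tokens and tokens/sec from result records."""
    total_tokens = sum(r["tokens"] for r in results)
    return {
        "prompts": len(results),
        "total_tokens": total_tokens,
        "tokens_per_sec": total_tokens / elapsed_seconds,
    }

# Hypothetical records in the shape produced above
demo = [
    {"prompt": "a", "generated": "...", "tokens": 120},
    {"prompt": "b", "generated": "...", "tokens": 80},
]
print(summarize_throughput(demo, elapsed_seconds=4.0))
# → {'prompts': 2, 'total_tokens': 200, 'tokens_per_sec': 50.0}
```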

Workflow 3: Quantized model serving


Fit large models in limited GPU memory.

Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

**Step 1: Choose quantization method**
  • AWQ: Best for 70B models, minimal accuracy loss
  • GPTQ: Wide model support, good compression
  • FP8: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

```bash
# Search for AWQ models on the HuggingFace Hub
# Example: TheBloke/Llama-2-70B-AWQ
```


**Step 3: Launch with quantization flag**

```bash
# Using a pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

Result: the 70B model fits in ~40GB VRAM.


**Step 4: Verify accuracy**

Test that outputs match expected quality:
- Compare quantized vs non-quantized responses
- Verify task-specific performance is unchanged
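One lightweight first-pass comparison of the two variants' outputs is word-overlap similarity (a hedged sketch with hypothetical responses; it is no substitute for a task-specific eval, since quantization can change wording while preserving quality):

```python
def jaccard_similarity(a, b):
    """Word-level Jaccard similarity between two responses (0.0-1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

# Hypothetical responses from the full-precision and AWQ models
full = "Paris is the capital of France"
awq = "The capital of France is Paris"
score = jaccard_similarity(full, awq)
print(f"similarity: {score:.2f}")  # identical word sets → 1.00
```

Very low scores on prompts where the full-precision model was correct are a signal to re-check the quantized checkpoint before serving it.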

When to use vs alternatives


Use vLLM when:
  • Deploying production LLM APIs (100+ req/sec)
  • Serving OpenAI-compatible endpoints
  • Limited GPU memory but need large models
  • Multi-user applications (chatbots, assistants)
  • Need low latency with high throughput
Use alternatives instead:
  • llama.cpp: CPU/edge inference, single-user
  • HuggingFace transformers: Research, prototyping, one-off generation
  • TensorRT-LLM: NVIDIA-only, need absolute maximum performance
  • Text-Generation-Inference: Already in HuggingFace ecosystem

Common issues


Issue: Out of memory during model loading

Reduce memory usage:

```bash
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```

Or use quantization:

```bash
vllm serve MODEL --quantization awq
```

Issue: Slow first token (TTFT > 1 second)

Enable prefix caching for repeated prompts:

```bash
vllm serve MODEL --enable-prefix-caching
```

For long prompts, enable chunked prefill:

```bash
vllm serve MODEL --enable-chunked-prefill
```

Issue: Model not found error

Use `--trust-remote-code` for custom models:

```bash
vllm serve MODEL --trust-remote-code
```

Issue: Low throughput (<50 req/sec)

Increase concurrent sequences:

```bash
vllm serve MODEL --max-num-seqs 512
```

Check GPU utilization with `nvidia-smi` - it should be >80%.

Issue: Inference slower than expected

Verify tensor parallelism uses a power-of-2 GPU count:

```bash
vllm serve MODEL --tensor-parallel-size 4  # Not 3
```

Enable speculative decoding for faster generation:

```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```

Advanced topics


Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.

Hardware requirements


  • Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
  • Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
  • Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
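These sizes can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter (a rough sketch only; real usage adds KV cache, activations, and CUDA graph overhead on top):

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate model weight footprint in GB (1 GB = 10^9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# 70B at FP16 vs 4-bit AWQ (weights only)
print(f"70B fp16: {weight_memory_gb(70, 16):.0f} GB")  # → 140 GB
print(f"70B awq4: {weight_memory_gb(70, 4):.0f} GB")   # → 35 GB
```

This is why a 70B model needs 4x A100 (40GB) at FP16 but fits comfortably after 4-bit quantization, consistent with the ~40GB VRAM figure quoted in Workflow 3.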

Resources
