
vLLM - High-Performance LLM Serving


Quick start


vLLM achieves up to 24x higher throughput than standard Hugging Face transformers through PagedAttention (block-based KV cache management) and continuous batching (mixing prefill and decode requests).

Installation:

```bash
pip install vllm
```

Basic offline inference:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

OpenAI-compatible server:

```bash
vllm serve meta-llama/Llama-3-8B-Instruct
```
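The PagedAttention block-table idea mentioned above can be illustrated with a toy allocator (purely illustrative: the block size, pool size, and sequence names here are made up, and this is not vLLM's actual implementation):

```python
# Toy sketch of a PagedAttention-style block table: each sequence's KV cache
# is a list of fixed-size blocks drawn from a shared pool, so memory is not
# reserved contiguously up front for the maximum sequence length.
BLOCK_SIZE = 4                 # tokens per block (illustrative)
free_blocks = list(range(8))   # shared pool of physical block ids
block_tables = {}              # sequence id -> list of physical block ids
token_counts = {}              # sequence id -> tokens written so far

def append_token(seq_id):
    """Record one more KV entry, allocating a new block only when needed."""
    n = token_counts.get(seq_id, 0)
    if n % BLOCK_SIZE == 0:    # current block is full (or this is the first token)
        block_tables.setdefault(seq_id, []).append(free_blocks.pop(0))
    token_counts[seq_id] = n + 1

for _ in range(6):
    append_token("seq-A")      # 6 tokens -> 2 blocks
for _ in range(3):
    append_token("seq-B")      # 3 tokens -> 1 block

print(block_tables)            # → {'seq-A': [0, 1], 'seq-B': [2]}
print(len(free_blocks), "blocks still free")  # → 5 blocks still free
```

Because blocks are allocated on demand, memory freed by finished sequences is immediately reusable by new ones, which is what enables the continuous-batching throughput gains.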

Query with OpenAI SDK

```bash
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```

Common workflows


Workflow 1: Production API deployment


Copy this checklist and track progress:

Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics

**Step 1: Configure server settings**

Choose a configuration based on your model size:

```bash
# For 7B-13B models on a single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0
```

**Step 2: Test with limited traffic**

Run a load test before going to production:

```bash
# Install the load-testing tool
pip install locust

# Create test_load.py with sample requests, then run:
locust -f test_load.py --host http://localhost:8000
```


Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
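To evaluate the TTFT target against measured samples, a nearest-rank percentile helper is enough (a minimal sketch; the `ttft_samples` values below are hypothetical and would come from your load test):

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 0-100) of a list of numbers."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical TTFT measurements (seconds) collected during the load test
ttft_samples = [0.12, 0.18, 0.21, 0.25, 0.31, 0.42, 0.47, 0.55]
print(f"p50 TTFT: {percentile(ttft_samples, 50):.2f}s")
print(f"p95 TTFT: {percentile(ttft_samples, 95):.2f}s")
```

Judging the target on a tail percentile (p95) rather than the mean avoids declaring success while a meaningful fraction of requests still miss the 500ms budget.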

**Step 3: Enable monitoring**

vLLM exposes Prometheus metrics on port 9090:

```bash
curl http://localhost:9090/metrics | grep vllm
```

Key metrics to monitor:
  • vllm:time_to_first_token_seconds - Latency
  • vllm:num_requests_running - Active requests
  • vllm:gpu_cache_usage_perc - KV cache utilization
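A scrape can be sanity-checked in Python with a tiny parser for the Prometheus text format (a minimal sketch; the `sample` text is hypothetical, and labeled metrics and histograms would need more handling):

```python
def parse_prometheus_metrics(text):
    """Parse simple 'name value' lines from a Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines whose value is not numeric
    return metrics

# Hypothetical scrape output
sample = """\
# HELP vllm:num_requests_running Number of requests currently running
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.72
"""
parsed = parse_prometheus_metrics(sample)
print(parsed["vllm:gpu_cache_usage_perc"])  # → 0.72
```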
**Step 4: Deploy to production**

Use Docker for consistent deployment:


```bash
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
```

**Step 5: Verify performance metrics**

Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs

Workflow 2: Offline batch inference


For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
**Step 1: Prepare input data**

```python
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
```

**Step 2: Configure LLM engine**

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)
```

**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

```python
# Process all prompts in one call.
# vLLM handles batching internally; no need to manually chunk prompts.
outputs = llm.generate(prompts, sampling)
```


**Step 4: Process results**

```python
import json

# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
```
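Given result records in that shape, overall throughput can be summarized with a small helper (a sketch; `elapsed_seconds` is assumed to come from timing the `generate` call yourself, e.g. with `time.perf_counter`):

```python
def summarize_throughput(results, elapsed_seconds):
    """Compute total generated tokens and tokens/sec from result records."""
    total_tokens = sum(r["tokens"] for r in results)
    return {
        "prompts": len(results),
        "total_tokens": total_tokens,
        "tokens_per_sec": total_tokens / elapsed_seconds,
    }

# Hypothetical records in the shape produced above
demo = [
    {"prompt": "a", "generated": "...", "tokens": 120},
    {"prompt": "b", "generated": "...", "tokens": 80},
]
print(summarize_throughput(demo, elapsed_seconds=4.0))
# → {'prompts': 2, 'total_tokens': 200, 'tokens_per_sec': 50.0}
```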

Workflow 3: Quantized model serving


Fit large models in limited GPU memory.

Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

**Step 1: Choose quantization method**
  • AWQ: Best for 70B models, minimal accuracy loss
  • GPTQ: Wide model support, good compression
  • FP8: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

```bash
# Search for AWQ models on the HuggingFace Hub
# Example: TheBloke/Llama-2-70B-AWQ
```


**Step 3: Launch with quantization flag**

```bash
# Using a pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95
```

Result: the 70B model fits in ~40GB VRAM.


**Step 4: Verify accuracy**

Test that outputs match expected quality:
- Compare quantized vs non-quantized responses
- Verify task-specific performance is unchanged
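One lightweight first-pass comparison of the two variants' outputs is word-overlap similarity (a hedged sketch with hypothetical responses; it is no substitute for a task-specific eval, since quantization can change wording while preserving quality):

```python
def jaccard_similarity(a, b):
    """Word-level Jaccard similarity between two responses (0.0-1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

# Hypothetical responses from the full-precision and AWQ models
full = "Paris is the capital of France"
awq = "The capital of France is Paris"
score = jaccard_similarity(full, awq)
print(f"similarity: {score:.2f}")  # identical word sets → 1.00
```

Very low scores on prompts where the full-precision model was correct are a signal to re-check the quantized checkpoint before serving it.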

When to use vs alternatives


Use vLLM when:
  • Deploying production LLM APIs (100+ req/sec)
  • Serving OpenAI-compatible endpoints
  • Limited GPU memory but need large models
  • Multi-user applications (chatbots, assistants)
  • Need low latency with high throughput
Use alternatives instead:
  • llama.cpp: CPU/edge inference, single-user
  • HuggingFace transformers: Research, prototyping, one-off generation
  • TensorRT-LLM: NVIDIA-only, need absolute maximum performance
  • Text-Generation-Inference: Already in HuggingFace ecosystem

Common issues


Issue: Out of memory during model loading

Reduce memory usage:

```bash
vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```

Or use quantization:

```bash
vllm serve MODEL --quantization awq
```

Issue: Slow first token (TTFT > 1 second)

Enable prefix caching for repeated prompts:

```bash
vllm serve MODEL --enable-prefix-caching
```

For long prompts, enable chunked prefill:

```bash
vllm serve MODEL --enable-chunked-prefill
```

Issue: Model not found error

Use `--trust-remote-code` for custom models:

```bash
vllm serve MODEL --trust-remote-code
```

Issue: Low throughput (<50 req/sec)

Increase concurrent sequences:

```bash
vllm serve MODEL --max-num-seqs 512
```

Check GPU utilization with `nvidia-smi` - it should be >80%.

Issue: Inference slower than expected

Verify tensor parallelism uses a power-of-2 GPU count:

```bash
vllm serve MODEL --tensor-parallel-size 4  # Not 3
```

Enable speculative decoding for faster generation:

```bash
vllm serve MODEL --speculative-model DRAFT_MODEL
```

Advanced topics


Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.

Hardware requirements


  • Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
  • Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
  • Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
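These sizes can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter (a rough sketch only; real usage adds KV cache, activations, and CUDA graph overhead on top):

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate model weight footprint in GB (1 GB = 10^9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# 70B at FP16 vs 4-bit AWQ (weights only)
print(f"70B fp16: {weight_memory_gb(70, 16):.0f} GB")  # → 140 GB
print(f"70B awq4: {weight_memory_gb(70, 4):.0f} GB")   # → 35 GB
```

This is why a 70B model needs 4x A100 (40GB) at FP16 but fits comfortably after 4-bit quantization, consistent with the ~40GB VRAM figure quoted in Workflow 3.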

Resources
