vLLM Bench Serve

Benchmark vLLM or any OpenAI-compatible serving endpoint using the `vllm bench serve` CLI. Measures throughput, latency (TTFT, TPOT), and goodput against configurable request load.

Reference: vLLM Bench Serve Documentation

Prerequisites

vLLM installed (or any OpenAI-compatible server running)
A vLLM server or API endpoint already serving a model
Python environment with vLLM for the benchmark client

Quick Start

Basic benchmark against a local vLLM server (default random dataset, 1000 prompts):

bash

vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions

Save results to JSON:

bash

vllm bench serve \
  --backend openai-chat \
  --host 127.0.0.1 \
  --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --save-result \
  --result-dir ./bench-results \
  --metadata "version=0.6.0" "tp=1"

Note: When using `--backend openai-chat`, you must specify `--endpoint /v1/chat/completions` (default is `/v1/completions`).

Core Arguments

Argument	Default	Description
`--backend`	`openai`	Backend type: `openai`, `openai-chat`, `openai-embeddings`, `vllm`, `vllm-pooling`, `vllm-rerank`, etc.
`--host`	`127.0.0.1`	Server host
`--port`	`8000`	Server port
`--base-url`	-	Alternative: full base URL instead of host:port
`--endpoint`	`/v1/completions`	API endpoint; use `/v1/chat/completions` for openai-chat
`--model`	(from /v1/models)	Model name
`--num-prompts`	`1000`	Number of prompts to process
`--request-rate`	`inf`	Requests per second; `inf` = burst all at once
`--max-concurrency`	-	Max concurrent requests (caps parallelism)
`--num-warmups`	`0`	Warmup requests before measuring

Datasets

`--dataset-name`	Use Case
`random`	Synthetic random prompts (default)
`sharegpt`	ShareGPT conversation format; requires `--dataset-path`
`sonnet`	Sonnet-style prompts
`hf`	HuggingFace dataset; requires `--dataset-path` (dataset ID)
`custom` / `custom_mm`	Custom dataset; requires `--dataset-path`
`prefix_repetition`	Prefix repetition benchmark
`random-mm`	Random multimodal (images/videos)
`spec_bench`	Spec bench dataset

Dataset-specific options (examples):

bash

# Random: control input/output length
--dataset-name random --random-input-len 1024 --random-output-len 128

# Sonnet defaults: input 550, output 150, prefix 200
--dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150

# HuggingFace dataset
--dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split test

# General overrides (map to dataset-specific args)
--input-len 512 --output-len 256
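The dataset table marks `sharegpt`, `hf`, `custom`, and `custom_mm` as requiring `--dataset-path`. A minimal sketch of a wrapper that enforces this before a run is launched. The function name `dataset_args` is hypothetical and not part of vLLM, and it only prints flags rather than invoking the benchmark.

```bash
# Hypothetical helper (not part of vLLM): print dataset flags for `vllm bench serve`.
# Datasets that the table lists as requiring --dataset-path are rejected when no path is given.
dataset_args() {
  local name="$1" path="${2:-}"
  case "$name" in
    sharegpt|hf|custom|custom_mm)
      if [ -z "$path" ]; then
        echo "error: --dataset-name $name requires --dataset-path" >&2
        return 1
      fi
      printf '%s\n' "--dataset-name $name --dataset-path $path"
      ;;
    *)
      # Datasets such as random or sonnet need no path in the table.
      printf '%s\n' "--dataset-name $name"
      ;;
  esac
}
```

The printed flags can be spliced into a command, for example `vllm bench serve ... $(dataset_args hf lmarena-ai/VisionArena-Chat) --hf-split test`. The unquoted substitution relies on word splitting, so this only works for paths without spaces.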

Load Control

bash

# Fixed request rate (Poisson process)
--request-rate 10

# More bursty arrivals (gamma distribution, burstiness < 1)
--request-rate 10 --burstiness 0.5

# Ramp-up from low to high RPS
--ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 50

# Limit concurrency (useful for rate-limited APIs)
--max-concurrency 32
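To see how latency changes with load, the fixed-rate flag can be swept over several values. A minimal sketch, assuming the local server and model from Quick Start. `run`, `sweep_rates`, and the `DRY_RUN` variable are hypothetical names, and with `DRY_RUN=1` (the default here) each command is only printed, not executed.

```bash
# Hypothetical helpers (not part of vLLM).
# run: print the command when DRY_RUN=1 (default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# sweep_rates: one `vllm bench serve` run per request rate given as an argument.
sweep_rates() {
  local rate
  for rate in "$@"; do
    run vllm bench serve \
      --backend openai-chat \
      --host 127.0.0.1 --port 8000 \
      --model Qwen/Qwen2.5-1.5B-Instruct \
      --endpoint /v1/chat/completions \
      --request-rate "$rate" --num-prompts 200 \
      --save-result --result-dir ./bench-results
  done
}

# Prints three commands in dry-run mode; use DRY_RUN=0 to actually run them.
sweep_rates 1 5 10
```

Each run writes its own JSON file under `./bench-results`, so the per-rate results can be compared afterwards.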

Results and Metrics

Argument	Description
`--save-result`	Save benchmark results to JSON
`--save-detailed`	Include per-request TTFT, TPOT, errors in JSON
`--append-result`	Append to existing result file
`--result-dir`	Directory for result files
`--result-filename`	Custom filename (default: `{label}-{request_rate}qps-{model}-{timestamp}.json`)
`--percentile-metrics`	Metrics for percentiles: `ttft`, `tpot`, `itl`, `e2el` (default: `ttft,tpot,itl`)
`--metric-percentiles`	Percentile values, e.g. `25,50,99` (default: `99`)
`--goodput`	SLO for goodput: `ttft:500 tpot:50` (ms)
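
With `--goodput ttft:500 tpot:50`, a request counts as good only if it meets every listed SLO, and goodput is the number of good requests per second. A sketch of that arithmetic, offered as an illustration rather than the tool's implementation. The input format (one `ttft_ms tpot_ms` pair per line) and the function name `goodput` are made up for this example and are not the format of the saved JSON.

```bash
# Illustrative only: recompute goodput from per-request latencies.
# stdin: one request per line as "<ttft_ms> <tpot_ms>" (a made-up format).
# args:  TTFT SLO (ms), TPOT SLO (ms), benchmark duration (s).
goodput() {
  awk -v ttft_slo="$1" -v tpot_slo="$2" -v dur="$3" '
    $1 <= ttft_slo && $2 <= tpot_slo { good++ }   # request meets every SLO
    END { printf "%.2f\n", good / dur }           # good requests per second
  '
}

# Four requests over 2 s: the second misses the TTFT SLO, the fourth misses the TPOT SLO.
printf '%s\n' "320 41" "710 38" "450 50" "200 65" | goodput 500 50 2
```

Two of the four requests meet both thresholds, so over a 2-second run this prints `1.00`.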

Sampling Parameters (OpenAI-compatible backends)

bash

--temperature 0.7 --top-p 0.95 --top-k 50 \
  --frequency-penalty 0 --presence-penalty 0 --repetition-penalty 1.0

Common Workflows

1. Throughput test with random dataset (burst):

bash

vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 500 --random-input-len 512 --random-output-len 128

2. Latency test with fixed QPS:

bash

vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --request-rate 5 --num-prompts 200 \
  --save-result --percentile-metrics ttft,tpot --metric-percentiles 50,99

3. Benchmark against remote API (base-url):

bash

vllm bench serve --backend openai-chat \
  --base-url "https://api.example.com" \
  --endpoint /v1/chat/completions \
  --model my-model \
  --header "Authorization=Bearer $API_KEY"

4. Run inside Docker (when vLLM client not on host):

bash

docker exec <container-name> vllm bench serve \
  --backend openai-chat --host 127.0.0.1 --port 8000 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random --num-prompts 100

Troubleshooting

Connection refused: Ensure the server is running and `--host`/`--port` or `--base-url` are correct.
Model not found: Pass `--model` explicitly or ensure `/v1/models` returns the model.
URL must end with chat/completions: Use `--endpoint /v1/chat/completions` when using `--backend openai-chat`.
Rate limit / 429: Reduce `--request-rate` or `--max-concurrency`.
Ready check: Use `--ready-check-timeout-sec 60` to wait for the endpoint before benchmarking.
SSL: Use `--insecure` for self-signed certificates.
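
The connection and model errors above can often be caught before the benchmark starts by querying `/v1/models`. A minimal sketch, assuming `curl` is installed and the server exposes the OpenAI-style `/v1/models` route over plain HTTP. `models_url` and `preflight` are hypothetical helper names, and the model check is a plain substring match, not real JSON parsing.

```bash
# Hypothetical helpers (not part of vLLM).
# models_url: build the /v1/models URL from host and port (defaults mirror the CLI defaults).
models_url() {
  printf 'http://%s:%s/v1/models\n' "${1:-127.0.0.1}" "${2:-8000}"
}

# preflight MODEL [HOST] [PORT]: return 0 if the endpoint answers and lists MODEL.
preflight() {
  local model="$1" url body
  shift
  url="$(models_url "$@")"
  body="$(curl -sf --max-time 5 "$url")" || {
    echo "cannot reach $url (server down, or wrong --host/--port?)" >&2
    return 1
  }
  case "$body" in
    *"\"$model\""*) return 0 ;;
    *) echo "$model is not listed by $url; check --model" >&2; return 1 ;;
  esac
}
```

A run can then be gated on the check, for example `preflight Qwen/Qwen2.5-1.5B-Instruct && vllm bench serve ...`.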

Notes

For embeddings/rerank benchmarks, use `--backend openai-embeddings`, `--backend vllm-pooling`, or `--backend vllm-rerank`.
`--profile` requires `--profiler-config` on the server for vLLM profiling.
Goodput SLOs are useful for SLA-style analysis; see the DistServe paper for details.
