vllm-bench-random-synthetic


vLLM Benchmark with Random Synthetic Data


Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
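The reported metrics follow directly from per-token arrival times. As an illustration of what TTFT, TPOT, and inter-token latency (ITL) mean (this is a sketch of the definitions, not vLLM's internal implementation):

```python
# Illustrative computation of TTFT, TPOT, and ITL from token timestamps.
# Not vLLM's internal code -- just the definitions of the reported metrics.

def serving_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute per-request latency metrics (all in seconds)."""
    ttft = token_times[0] - request_start  # Time to First Token
    # Inter-token latency: gap between consecutive tokens
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # Time per Output Token, excluding the first token
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot,
            "mean_itl": sum(itl) / len(itl) if itl else 0.0}

# Example: request sent at t=0, first token at 70 ms, then one token every 8 ms
m = serving_metrics(0.0, [0.070, 0.078, 0.086, 0.094])
```

With these numbers, TTFT is 70 ms and both TPOT and mean ITL come out to 8 ms, matching the shape of the benchmark report shown later in this document.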

When to use


  • User wants to quickly benchmark vLLM serving performance
  • User wants to measure throughput and latency metrics without downloading datasets
  • User wants to test a vLLM deployment with synthetic workload
  • User wants baseline performance numbers for a specific model

Prerequisites


  • vLLM must be installed (`pip install vllm`)
  • A vLLM server must be running (or can be started as part of the benchmark)
  • For GPU models, an NVIDIA GPU with appropriate drivers must be available

Quick Start


The simplest way to run the benchmark:

```bash
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Run benchmark with random synthetic data
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 10
```

**Note**:
- Use `--backend openai-chat` with endpoint `/v1/chat/completions` for online benchmarks.

Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--backend` | Backend type: `vllm`, `openai`, `openai-chat` | `vllm` |
| `--model` | Model name (must match the server) | Required |
| `--endpoint` | API endpoint path | `/v1/completions` or `/v1/chat/completions` |
| `--dataset-name` | Dataset to use | `random` (synthetic) |
| `--num-prompts` | Number of requests to send | `10` |
| `--port` | Server port | `8000` |
| `--max-concurrency` | Maximum concurrent requests | Auto |
| `--save-result` | Save results to file | Off |
| `--result-dir` | Directory to save results | `./` |

Expected Output


When successful, you will see output like:

```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
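If you capture this console report (for example with `tee`), the numbers can be scraped with a simple regex. A sketch that assumes the exact `Label:   value` layout shown above:

```python
import re

def parse_report(text: str) -> dict[str, float]:
    """Parse `Metric name:   value` lines from the serving benchmark report."""
    metrics = {}
    for line in text.splitlines():
        # Label may contain letters, digits, spaces, parens, '.', '/';
        # separator/banner lines have no colon and are skipped.
        m = re.match(r"^([A-Za-z0-9 ()./]+?):\s+([\d.]+)\s*$", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics

# A few lines from the sample report above:
report = """\
Successful requests:                     10
Benchmark duration (s):                  5.78
Mean TTFT (ms):                          71.54
"""
metrics = parse_report(report)
```

This is a convenience for ad-hoc runs; for anything systematic, prefer `--save-result`, which writes machine-readable output.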

Advanced Usage


With more prompts for better statistics

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100
```

Save results to file

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 50 \
  --save-result \
  --result-dir ./benchmark-results/
```
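`--save-result` writes a JSON file into `--result-dir`. The exact schema depends on your vLLM version, so treat the key names below (`model_id`, `request_throughput`, `mean_ttft_ms`) as assumptions and check one saved file first. A sketch for summarizing a directory of results:

```python
import json
from pathlib import Path

# NOTE: the key names below are assumptions about the saved JSON schema --
# verify them against a real result file, as the schema can vary by version.

def summarize(result: dict) -> str:
    """One-line summary of a single saved benchmark result."""
    return (f"{result.get('model_id', '?')}: "
            f"{result.get('request_throughput', float('nan')):.2f} req/s, "
            f"mean TTFT {result.get('mean_ttft_ms', float('nan')):.1f} ms")

def summarize_dir(result_dir: str) -> list[str]:
    """Summarize every saved benchmark JSON in a results directory."""
    return [summarize(json.loads(p.read_text()))
            for p in sorted(Path(result_dir).glob("*.json"))]

# Synthetic example with the assumed shape:
example = {"model_id": "Qwen/Qwen2.5-1.5B-Instruct",
           "request_throughput": 1.73, "mean_ttft_ms": 71.54}
line = summarize(example)
```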

Custom port and concurrency

```bash
vllm bench serve \
  --backend openai-chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100 \
  --port 8001 \
  --max-concurrency 4
```

Model Recommendations


For quick testing (small models, fast):
  • Qwen/Qwen2.5-1.5B-Instruct (recommended for quick tests)
  • facebook/opt-125m
  • facebook/opt-350m

For realistic benchmarks (medium models):
  • Qwen/Qwen2.5-7B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.3

Workflow


  1. Check if vLLM is installed: run `vllm --version` to verify
  2. Check if a server is already running: run `curl http://localhost:8000/health`
  3. Start the vLLM server if needed: run `vllm serve <model-name>` and wait for "Application startup complete"
  4. Run the benchmark: execute `vllm bench serve` with appropriate parameters
  5. Review results: check throughput and latency metrics
  6. Clean up: if the agent skill started the vLLM server (not a pre-existing one), stop it after benchmark completion using `kill <PID>`
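The workflow above can be wrapped in a few small helpers. Everything here is a sketch: it only shells out to the `vllm` command shown in the steps and probes the same `/health` endpoint, and nothing runs until you call the functions in order.

```python
import subprocess
import time
import urllib.error
import urllib.request

def server_healthy(port: int = 8000) -> bool:
    """Step 2: probe /health (equivalent to the curl check above)."""
    try:
        url = f"http://localhost:{port}/health"
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def start_server(model: str) -> subprocess.Popen:
    """Step 3: launch `vllm serve` in the background; caller must stop it later."""
    proc = subprocess.Popen(["vllm", "serve", model])
    while not server_healthy():  # becomes healthy after "Application startup complete"
        time.sleep(5)
    return proc

def run_benchmark(model: str, num_prompts: int = 10) -> None:
    """Step 4: run the benchmark against the running server."""
    subprocess.run(["vllm", "bench", "serve",
                    "--backend", "openai-chat", "--model", model,
                    "--endpoint", "/v1/chat/completions",
                    "--dataset-name", "random",
                    "--num-prompts", str(num_prompts)], check=True)

# Step 6 -- only stop the server if *you* started it:
#     proc = start_server("Qwen/Qwen2.5-1.5B-Instruct")
#     run_benchmark("Qwen/Qwen2.5-1.5B-Instruct")
#     proc.terminate()
```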

Troubleshooting


Server not responding:
  • Check if the server is running: `curl http://localhost:8000/health`
  • Verify the port matches: use the `--port` flag if the server is on a different port

Model not found:
  • Ensure the model name matches exactly between server and benchmark
  • Check HuggingFace access: `export HF_TOKEN=<your_token>` if needed

Out of memory:
  • Use a smaller model (e.g., Qwen2.5-1.5B-Instruct)
  • Reduce `--num-prompts` or `--max-concurrency`

Connection refused:
  • The server may still be starting (wait for "Application startup complete")
  • Check firewall or network settings

Notes


  • The `random` dataset generates synthetic prompts automatically
  • Benchmark duration scales with `--num-prompts`
  • For production benchmarking, use at least 100 prompts for stable statistics
  • Results may vary based on hardware, model size, and system load
  • First run may be slower due to model loading and compilation
  • Important: If the agent skill starts a vLLM server for benchmarking, it must stop the server after the benchmark completes to free up resources. Do not stop pre-existing servers that were already running before the benchmark.