vllm-bench-random-synthetic


vLLM Benchmark with Random Synthetic Data


Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
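The reported metrics follow directly from per-token arrival times. As an illustration of what TTFT, TPOT, and inter-token latency (ITL) mean (this is a sketch of the definitions, not vLLM's internal implementation):

```python
# Illustrative computation of TTFT, TPOT, and ITL from token timestamps.
# Not vLLM's internal code -- just the definitions of the reported metrics.

def serving_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute per-request latency metrics (all in seconds)."""
    ttft = token_times[0] - request_start  # Time to First Token
    # Inter-token latency: gap between consecutive tokens
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # Time per Output Token, excluding the first token
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot,
            "mean_itl": sum(itl) / len(itl) if itl else 0.0}

# Example: request sent at t=0, first token at 70 ms, then one token every 8 ms
m = serving_metrics(0.0, [0.070, 0.078, 0.086, 0.094])
```

With these numbers, TTFT is 70 ms and both TPOT and mean ITL come out to 8 ms, matching the shape of the benchmark report shown later in this document.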

When to use


  • User wants to quickly benchmark vLLM serving performance
  • User wants to measure throughput and latency metrics without downloading datasets
  • User wants to test a vLLM deployment with synthetic workload
  • User wants baseline performance numbers for a specific model

Prerequisites


  • vLLM must be installed (`pip install vllm`)
  • A vLLM server must be running (or can be started as part of the benchmark)
  • For GPU models, an NVIDIA GPU with appropriate drivers must be available

Quick Start


The simplest way to run the benchmark:

```bash
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Run benchmark with random synthetic data
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 10
```

**Note**:
- Use `--backend openai-chat` with endpoint `/v1/chat/completions` for online benchmarks.

Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--backend` | Backend type: `vllm`, `openai`, `openai-chat` | `vllm` |
| `--model` | Model name (must match the server) | Required |
| `--endpoint` | API endpoint path | `/v1/completions` or `/v1/chat/completions` |
| `--dataset-name` | Dataset to use | `random` (synthetic) |
| `--num-prompts` | Number of requests to send | `10` |
| `--port` | Server port | `8000` |
| `--max-concurrency` | Maximum concurrent requests | Auto |
| `--save-result` | Save results to file | Off |
| `--result-dir` | Directory to save results | `./` |

Expected Output


When successful, you will see output like:

```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
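If you capture this console report (for example with `tee`), the numbers can be scraped with a simple regex. A sketch that assumes the exact `Label:   value` layout shown above:

```python
import re

def parse_report(text: str) -> dict[str, float]:
    """Parse `Metric name:   value` lines from the serving benchmark report."""
    metrics = {}
    for line in text.splitlines():
        # Label may contain letters, digits, spaces, parens, '.', '/';
        # separator/banner lines have no colon and are skipped.
        m = re.match(r"^([A-Za-z0-9 ()./]+?):\s+([\d.]+)\s*$", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics

# A few lines from the sample report above:
report = """\
Successful requests:                     10
Benchmark duration (s):                  5.78
Mean TTFT (ms):                          71.54
"""
metrics = parse_report(report)
```

This is a convenience for ad-hoc runs; for anything systematic, prefer `--save-result`, which writes machine-readable output.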

Advanced Usage


With more prompts for better statistics

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100
```

Save results to file

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 50 \
  --save-result \
  --result-dir ./benchmark-results/
```
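`--save-result` writes a JSON file into `--result-dir`. The exact schema depends on your vLLM version, so treat the key names below (`model_id`, `request_throughput`, `mean_ttft_ms`) as assumptions and check one saved file first. A sketch for summarizing a directory of results:

```python
import json
from pathlib import Path

# NOTE: the key names below are assumptions about the saved JSON schema --
# verify them against a real result file, as the schema can vary by version.

def summarize(result: dict) -> str:
    """One-line summary of a single saved benchmark result."""
    return (f"{result.get('model_id', '?')}: "
            f"{result.get('request_throughput', float('nan')):.2f} req/s, "
            f"mean TTFT {result.get('mean_ttft_ms', float('nan')):.1f} ms")

def summarize_dir(result_dir: str) -> list[str]:
    """Summarize every saved benchmark JSON in a results directory."""
    return [summarize(json.loads(p.read_text()))
            for p in sorted(Path(result_dir).glob("*.json"))]

# Synthetic example with the assumed shape:
example = {"model_id": "Qwen/Qwen2.5-1.5B-Instruct",
           "request_throughput": 1.73, "mean_ttft_ms": 71.54}
line = summarize(example)
```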

Custom port and concurrency

```bash
vllm bench serve \
  --backend openai-chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100 \
  --port 8001 \
  --max-concurrency 4
```

Model Recommendations


For quick testing (small models, fast):
  • Qwen/Qwen2.5-1.5B-Instruct (recommended for quick tests)
  • facebook/opt-125m
  • facebook/opt-350m

For realistic benchmarks (medium models):
  • Qwen/Qwen2.5-7B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.3

Workflow


  1. Check if vLLM is installed: run `vllm --version` to verify
  2. Check if a server is already running: run `curl http://localhost:8000/health`
  3. Start the vLLM server if needed: run `vllm serve <model-name>` and wait for "Application startup complete"
  4. Run the benchmark: execute `vllm bench serve` with appropriate parameters
  5. Review results: check throughput and latency metrics
  6. Clean up: if the agent skill started the vLLM server (not a pre-existing one), stop it after benchmark completion using `kill <PID>`
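The workflow above can be wrapped in a few small helpers. Everything here is a sketch: it only shells out to the `vllm` command shown in the steps and probes the same `/health` endpoint, and nothing runs until you call the functions in order.

```python
import subprocess
import time
import urllib.error
import urllib.request

def server_healthy(port: int = 8000) -> bool:
    """Step 2: probe /health (equivalent to the curl check above)."""
    try:
        url = f"http://localhost:{port}/health"
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def start_server(model: str) -> subprocess.Popen:
    """Step 3: launch `vllm serve` in the background; caller must stop it later."""
    proc = subprocess.Popen(["vllm", "serve", model])
    while not server_healthy():  # becomes healthy after "Application startup complete"
        time.sleep(5)
    return proc

def run_benchmark(model: str, num_prompts: int = 10) -> None:
    """Step 4: run the benchmark against the running server."""
    subprocess.run(["vllm", "bench", "serve",
                    "--backend", "openai-chat", "--model", model,
                    "--endpoint", "/v1/chat/completions",
                    "--dataset-name", "random",
                    "--num-prompts", str(num_prompts)], check=True)

# Step 6 -- only stop the server if *you* started it:
#     proc = start_server("Qwen/Qwen2.5-1.5B-Instruct")
#     run_benchmark("Qwen/Qwen2.5-1.5B-Instruct")
#     proc.terminate()
```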

Troubleshooting


Server not responding:
  • Check if the server is running: `curl http://localhost:8000/health`
  • Verify the port matches: use the `--port` flag if the server is on a different port

Model not found:
  • Ensure the model name matches exactly between server and benchmark
  • Check HuggingFace access: `export HF_TOKEN=<your_token>` if needed

Out of memory:
  • Use a smaller model (e.g., Qwen2.5-1.5B-Instruct)
  • Reduce `--num-prompts` or `--max-concurrency`

Connection refused:
  • The server may still be starting (wait for "Application startup complete")
  • Check firewall or network settings

Notes


  • The `random` dataset generates synthetic prompts automatically
  • Benchmark duration scales with `--num-prompts`
  • For production benchmarking, use at least 100 prompts for stable statistics
  • Results may vary based on hardware, model size, and system load
  • First run may be slower due to model loading and compilation
  • Important: If the agent skill starts a vLLM server for benchmarking, it must stop the server after the benchmark completes to free up resources. Do not stop pre-existing servers that were already running before the benchmark.