# vllm-bench-random-synthetic

vLLM Benchmark with Random Synthetic Data
Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics, including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
## When to use
- User wants to quickly benchmark vLLM serving performance
- User wants to measure throughput and latency metrics without downloading datasets
- User wants to test a vLLM deployment with synthetic workload
- User wants baseline performance numbers for a specific model
## Prerequisites

- vLLM must be installed (`pip install vllm`)
- A vLLM server must be running (or can be started as part of the benchmark)
- For GPU models, an NVIDIA GPU with appropriate drivers must be available
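A minimal preflight sketch for these prerequisites; the `check_cmd` helper and the default port are illustrative choices, not part of the skill itself:

```shell
#!/usr/bin/env sh
# Preflight checks before benchmarking. PORT defaults to 8000, the port
# used by the health-check examples later in this document.
PORT="${PORT:-8000}"

# check_cmd <name>: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: missing"
  fi
}

check_cmd vllm          # install with: pip install vllm
check_cmd nvidia-smi    # required for GPU models

# The server answers /health once "Application startup complete" is reached.
if curl -sf "http://localhost:${PORT}/health" >/dev/null 2>&1; then
  echo "server: healthy on port ${PORT}"
else
  echo "server: not responding on port ${PORT}"
fi
```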
## Quick Start

The simplest way to run the benchmark:

```bash
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Run benchmark with random synthetic data
vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 10
```
**Note**:
- Use `--backend openai-chat` with endpoint `/v1/chat/completions` for online benchmarks.

## Parameters
参数
| Parameter | Description | Default |
|---|---|---|
| Backend type: | |
| Model name (must match the server) | Required |
| API endpoint path | |
| Dataset to use | |
| Number of requests to send | |
| Server port | |
| Maximum concurrent requests | Auto |
| Save results to file | Off |
| Directory to save results | |
| 参数 | 描述 | 默认值 |
|---|---|---|
| 后端类型: | |
| 模型名称(必须与服务器一致) | 必填 |
| API端点路径 | |
| 使用的数据集 | |
| 发送的请求数量 | |
| 服务器端口 | |
| 最大并发请求数 | 自动 |
| 将结果保存到文件 | 关闭 |
| 结果保存目录 | |
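Several of these flags combine naturally. As a sketch (assumptions: the `BENCH` override, the result-directory layout, and the sweep values are choices made here), a loop that sweeps `--max-concurrency` to see where throughput saturates; it only fires when `vllm` is actually installed and a server is already listening:

```shell
#!/usr/bin/env sh
# Sweep concurrency levels and save one result directory per level.
# BENCH can be overridden (e.g. BENCH=echo for a dry run).
BENCH="${BENCH:-vllm}"
MODEL="Qwen/Qwen2.5-1.5B-Instruct"

# run_bench <concurrency>: one benchmark run at the given concurrency.
run_bench() {
  "$BENCH" bench serve \
    --backend openai-chat \
    --model "$MODEL" \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 100 \
    --max-concurrency "$1" \
    --save-result \
    --result-dir "./benchmark-results/conc-$1/"
}

# Only execute the sweep when the benchmark tool is present.
if command -v "$BENCH" >/dev/null 2>&1; then
  for conc in 1 2 4 8; do
    echo "=== max-concurrency=${conc} ==="
    run_bench "$conc"
  done
fi
```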
## Expected Output

When successful, you will see output like:

```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
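The throughput lines are plain ratios over the wall-clock duration. Recomputing them from the sample report above (the request throughput reproduces exactly; the token figures differ in the second decimal because the printed duration is rounded):

```shell
# request throughput = successful requests / duration
# token throughput   = tokens / duration
awk 'BEGIN {
  duration = 5.78   # Benchmark duration (s)
  requests = 10     # Successful requests
  out_toks = 2212   # Total generated tokens
  in_toks  = 1369   # Total input tokens
  printf "request throughput (req/s): %.2f\n", requests / duration
  printf "output throughput (tok/s):  %.2f\n", out_toks / duration
  printf "total throughput (tok/s):   %.2f\n", (in_toks + out_toks) / duration
}'
```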
## Advanced Usage

### With more prompts for better statistics

```bash
vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 100
```

### Save results to file

```bash
vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 50 \
    --save-result \
    --result-dir ./benchmark-results/
```

### Custom port and concurrency

```bash
vllm bench serve \
    --backend openai-chat \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 100 \
    --port 8001 \
    --max-concurrency 4
```
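`--save-result` writes one JSON file per run. A hypothetical helper to pull headline numbers back out of such a file; the key names (`request_throughput`, `mean_ttft_ms`, `mean_tpot_ms`) are assumptions about the saved schema, so inspect a real result file (e.g. with `python3 -m json.tool`) before relying on them:

```shell
#!/usr/bin/env sh
# summarize <file.json>: print selected metrics from a saved result file.
# Key names are assumptions -- confirm them against your own output files.
summarize() {
  python3 -c '
import json, sys

with open(sys.argv[1]) as f:
    data = json.load(f)
for key in ("request_throughput", "mean_ttft_ms", "mean_tpot_ms"):
    print(key, data.get(key))
' "$1"
}
```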
## Model Recommendations

For quick testing (small models, fast):
- `Qwen/Qwen2.5-1.5B-Instruct` (recommended for quick tests)
- `facebook/opt-125m`
- `facebook/opt-350m`

For realistic benchmarks (medium models):
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`
## Workflow

1. Check if vLLM is installed: run `vllm --version` to verify
2. Check if a server is already running: run `curl http://localhost:8000/health`
3. Start the vLLM server if needed: run `vllm serve <model-name>` (wait for "Application startup complete")
4. Run the benchmark: execute `vllm bench serve` with appropriate parameters
5. Review results: check throughput and latency metrics
6. Clean up: if the agent skill started the vLLM server (not a pre-existing one), stop it after benchmark completion using `kill <PID>`
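The steps above can be sketched end to end. `wait_for_health`, the 60 × 5 s polling budget, and the model/port values are choices made for this sketch; the `kill` at the end stops only the server this script itself started, matching the clean-up rule above:

```shell
#!/usr/bin/env sh
# End-to-end workflow sketch: start server, wait, benchmark, clean up.
MODEL="Qwen/Qwen2.5-1.5B-Instruct"
PORT=8000

# wait_for_health <attempts> <sleep_secs>: poll /health until the server
# is up; returns 0 on success, 1 if all attempts are exhausted.
wait_for_health() {
  attempts=$1
  while [ "$attempts" -gt 0 ]; do
    if curl -sf "http://localhost:${PORT}/health" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$2"
  done
  return 1
}

if command -v vllm >/dev/null 2>&1; then
  # This script owns the server, so record the PID for later cleanup.
  vllm serve "$MODEL" --port "$PORT" &
  SERVER_PID=$!

  if wait_for_health 60 5; then
    vllm bench serve \
      --backend openai-chat \
      --model "$MODEL" \
      --endpoint /v1/chat/completions \
      --dataset-name random \
      --num-prompts 10
  fi

  # Stop only the server this script started.
  kill "$SERVER_PID"
fi
```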
## Troubleshooting

**Server not responding**:
- Check if the server is running: `curl http://localhost:8000/health`
- Verify the port matches: use the `--port` flag if the server is on a different port

**Model not found**:
- Ensure the model name matches exactly between server and benchmark
- Check HuggingFace access: `export HF_TOKEN=<your_token>` if needed

**Out of memory**:
- Use a smaller model (e.g., Qwen2.5-1.5B-Instruct)
- Reduce `--num-prompts` or `--max-concurrency`

**Connection refused**:
- Server may still be starting (wait for "Application startup complete")
- Check firewall or network settings
## Notes

- The `random` dataset generates synthetic prompts automatically
- Benchmark duration scales with `--num-prompts`
- For production benchmarking, use at least 100 prompts for stable statistics
- Results may vary based on hardware, model size, and system load
- First run may be slower due to model loading and compilation
- **Important**: If the agent skill starts a vLLM server for benchmarking, it must stop the server after the benchmark completes to free up resources. Do not stop pre-existing servers that were already running before the benchmark.