Benchmark vLLM or OpenAI-compatible serving endpoints using `vllm bench serve`. Supports multiple datasets (random, ShareGPT, sonnet, HF), backends (`openai`, `openai-chat`, `vllm-pooling`, embeddings), throughput/latency testing with request-rate control, and result saving. Use it when benchmarking LLM serving performance, measuring TTFT/TPOT, or load-testing inference APIs.
Install the skill:

```bash
npx skill4agent add vllm-project/vllm-skills vllm-bench-serve
```

Basic usage of `vllm bench serve`:

```bash
vllm bench serve \
    --backend openai-chat \
    --host 127.0.0.1 \
    --port 8000 \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions
```

Save results:

```bash
vllm bench serve \
    --backend openai-chat \
    --host 127.0.0.1 \
    --port 8000 \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --save-result \
    --result-dir ./bench-results \
    --metadata "version=0.6.0" "tp=1"
```

Note: When using `--endpoint /v1/chat/completions`, you must specify `--backend openai-chat` (the default endpoint is `/v1/completions`).
| Argument | Default | Description |
|---|---|---|
| `--backend` | | Backend type: `openai`, `openai-chat`, `vllm-pooling`, etc. |
| `--host` | `127.0.0.1` | Server host |
| `--port` | `8000` | Server port |
| `--base-url` | - | Alternative: full base URL instead of host:port |
| `--endpoint` | `/v1/completions` | API endpoint; use `/v1/chat/completions` with `--backend openai-chat` |
| `--model` | (from `/v1/models`) | Model name |
| `--num-prompts` | `1000` | Number of prompts to process |
| `--request-rate` | `inf` | Requests per second; `inf` sends all requests at once |
| `--max-concurrency` | - | Max concurrent requests (caps parallelism) |
| | | Warmup requests before measuring |
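`--request-rate` drives an open-loop arrival process: gaps between requests are drawn from an exponential distribution (Poisson arrivals), or from a gamma distribution when `--burstiness` is below 1. The sketch below only illustrates that model; the helper name is hypothetical and vLLM's actual scheduling code may differ in details:

```python
import random

def inter_arrival_times(rate, n, burstiness=1.0, seed=0):
    """Sample n inter-arrival gaps (seconds) for a target request rate.

    burstiness == 1.0 -> Poisson process (exponential gaps);
    burstiness < 1.0  -> burstier arrivals (gamma gaps with the same mean).
    Illustrative model of --request-rate / --burstiness, not vLLM's code.
    """
    rng = random.Random(seed)
    if burstiness == 1.0:
        return [rng.expovariate(rate) for _ in range(n)]
    # Gamma with shape=burstiness; scale chosen so the mean gap stays 1/rate.
    theta = 1.0 / (rate * burstiness)
    return [rng.gammavariate(burstiness, theta) for _ in range(n)]
```

For `--request-rate 10` the mean gap stays 0.1 s regardless of burstiness; lowering burstiness only makes the gaps more variable, i.e. requests cluster into bursts.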
| Dataset | Use Case |
|---|---|
| `random` | Synthetic random prompts (default) |
| `sharegpt` | ShareGPT conversation format; requires `--dataset-path` |
| `sonnet` | Sonnet-style prompts |
| `hf` | HuggingFace dataset; requires `--dataset-path` |
| `custom` | Custom dataset; requires `--dataset-path` |
| `prefix_repetition` | Prefix repetition benchmark |
| `random-mm` | Random multimodal (images/videos) |
| `spec_bench` | Spec bench dataset |
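For the random dataset, `--random-input-len` sets a target per-request token count rather than a literal prompt. A minimal sketch of one plausible sampling scheme, assuming a uniform window around the target (the helper is hypothetical; the real CLI exposes a similar knob via a range-ratio style argument, and vLLM's exact sampling may differ):

```python
import random

def sample_random_lengths(target_len, n, range_ratio=0.0, seed=0):
    """Pick per-request token counts around a target length.

    Assumption: uniform sampling in
    [target*(1-range_ratio), target*(1+range_ratio)].
    Illustrates the knob only; not vLLM's actual scheme.
    """
    rng = random.Random(seed)
    lo = int(target_len * (1 - range_ratio))
    hi = int(target_len * (1 + range_ratio))
    return [rng.randint(lo, hi) for _ in range(n)]
```

With `range_ratio=0` every request uses exactly the target length, which is the simplest setting for reproducible comparisons.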
Dataset length controls:

```bash
# Random: control input/output length
--dataset-name random --random-input-len 1024 --random-output-len 128

# Sonnet defaults: input 550, output 150, prefix 200
--dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150

# HuggingFace dataset
--dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split test

# General overrides (map to dataset-specific args)
--input-len 512 --output-len 256
```

Request-rate control:

```bash
# Fixed request rate (Poisson process)
--request-rate 10

# More bursty arrivals (gamma distribution, burstiness < 1)
--request-rate 10 --burstiness 0.5

# Ramp-up from low to high RPS
--ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 50

# Limit concurrency (useful for rate-limited APIs)
--max-concurrency 32
```

Result and metric options:

| Argument | Description |
|---|---|
| `--save-result` | Save benchmark results to JSON |
| `--save-detailed` | Include per-request TTFT, TPOT, and errors in the JSON |
| `--append-result` | Append to an existing result file |
| `--result-dir` | Directory for result files |
| `--result-filename` | Custom filename (otherwise auto-generated) |
| `--percentile-metrics` | Metrics to report percentiles for: `ttft`, `tpot`, `itl`, `e2el` |
| `--metric-percentiles` | Percentile values, e.g. `50,99` |
| `--goodput` | SLO for goodput, e.g. `ttft:1000` (milliseconds) |
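The percentile and goodput options above reduce per-request latencies to summary numbers. A small sketch of those reductions, using a nearest-rank percentile and an SLO-based goodput count (function names are illustrative; vLLM's own aggregation may use different percentile interpolation):

```python
def percentile(values, p):
    """Nearest-rank style percentile over a list of latencies."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def goodput(ttfts_ms, slo_ttft_ms, duration_s):
    """Requests per second that met a TTFT SLO, cf. --goodput ttft:1000 (ms)."""
    met = sum(1 for t in ttfts_ms if t <= slo_ttft_ms)
    return met / duration_s

# Example: p50/p99 TTFT and goodput under a 1000 ms SLO
ttfts = [120.0, 180.0, 240.0, 310.0, 1500.0]
p50 = percentile(ttfts, 50)                   # median TTFT: 240.0
p99 = percentile(ttfts, 99)                   # tail TTFT: 1500.0
gp = goodput(ttfts, 1000.0, duration_s=2.0)   # 4 of 5 met the SLO in 2 s -> 2.0
```

Goodput penalizes runs that achieve high raw throughput by letting tail latency blow past the SLO, which is why it pairs naturally with `--percentile-metrics ttft,tpot`.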
Sampling parameters:

```bash
--temperature 0.7 --top-p 0.95 --top-k 50
--frequency-penalty 0 --presence-penalty 0 --repetition-penalty 1.0
```

Example: random-dataset benchmark:

```bash
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random \
    --num-prompts 500 --random-input-len 512 --random-output-len 128
```

Example: fixed-rate run with saved percentiles:

```bash
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --request-rate 5 --num-prompts 200 \
    --save-result --percentile-metrics ttft,tpot --metric-percentiles 50,99
```

Example: remote authenticated endpoint:

```bash
vllm bench serve --backend openai-chat \
    --base-url "https://api.example.com/v1" \
    --model my-model \
    --header "Authorization=Bearer $API_KEY"
```

Example: inside a Docker container:

```bash
docker exec <container-name> vllm bench serve \
    --backend openai-chat --host 127.0.0.1 --port 8000 \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name random --num-prompts 100
```

Other flags: `--ready-check-timeout-sec 60`, `--insecure`, `--profile`, `--profiler-config`. Embedding/pooling backends: `openai-embeddings`, `vllm-pooling`, `vllm-rerank`.
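Results written with `--save-result` land as JSON under `--result-dir`, which makes comparing runs scriptable. A sketch of writing and reading one back; the field names in the sample payload are assumptions (they vary across vLLM versions), so inspect a real result file before depending on specific keys:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical payload; real key names vary by vLLM version.
sample = {
    "model_id": "Qwen/Qwen2.5-1.5B-Instruct",
    "request_throughput": 42.0,
    "mean_ttft_ms": 85.3,
}

result_dir = Path(tempfile.mkdtemp())  # stand-in for ./bench-results
path = result_dir / "bench.json"
path.write_text(json.dumps(sample))

loaded = json.loads(path.read_text())
print(f"{loaded['model_id']}: {loaded['request_throughput']} req/s")
```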