vllm-prefix-cache-bench
# vLLM Prefix Caching Benchmark
Benchmark the efficiency of vLLM's automatic prefix caching (APC) feature. The offline script `benchmarks/benchmark_prefix_caching.py` runs directly against the vLLM engine (no server required). For online/serving tests, use `vllm bench serve` with the `prefix_repetition` dataset.

## When to use
- User wants to measure the performance impact of prefix caching for repeated or partially shared prompts.
- User wants to compare throughput/latency with and without `--enable-prefix-caching`.
- User wants to test prefix caching using a fixed synthetic prompt, a real dataset (e.g. ShareGPT), or a synthetic prefix/suffix repetition pattern.
## Option 1 (default). Fixed Prompt with Prefix Caching
Runs a synthetic benchmark with a fixed prompt repeated multiple times to directly measure cache hit efficiency. No dataset download required.
```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model Qwen/Qwen3-8B \
    --enable-prefix-caching \
    --num-prompts 1 \
    --repeat-count 100 \
    --input-length-range 128:256
```

To compare against the baseline without caching:
```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model Qwen/Qwen3-8B \
    --no-enable-prefix-caching \
    --num-prompts 1 \
    --repeat-count 100 \
    --input-length-range 128:256
```
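The two runs differ only in prefill work, so an idealized back-of-envelope estimate shows what to expect (an illustrative sketch, not vLLM's actual accounting, which rounds cache reuse to KV-cache block boundaries):

```python
# Idealized estimate of prefill tokens for one fixed prompt of
# `prompt_len` tokens replayed `repeat_count` times.
# Assumption: with APC, only the first request pays full prefill;
# every replay hits the cache.

def prefill_tokens(prompt_len: int, repeat_count: int, caching: bool) -> int:
    if caching:
        return prompt_len  # first request fills the cache, the rest reuse it
    return prompt_len * repeat_count

baseline = prefill_tokens(256, 100, caching=False)
cached = prefill_tokens(256, 100, caching=True)
print(f"prefill tokens: {baseline} vs {cached} "
      f"({1 - cached / baseline:.0%} saved)")  # 99% saved
```

In practice the measured speedup is smaller, since decode time and the first (cold) prefill are unaffected by caching.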
## Option 2. ShareGPT Dataset with Prefix Caching
Uses real-world conversational data from ShareGPT to evaluate prefix caching with naturally occurring prompt sharing.
First, download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

Then run the benchmark:
```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model Qwen/Qwen3-8B \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --enable-prefix-caching \
    --num-prompts 20 \
    --repeat-count 5 \
    --input-length-range 128:256
```
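For intuition about what this run measures, sampling prompts from a ShareGPT-style JSON within a length range can be sketched as below (an illustrative simplification, not the script's actual logic; whitespace word count stands in for the tokenizer, and `sample_sharegpt_prompts` is a hypothetical helper):

```python
import json
import random

def sample_sharegpt_prompts(path: str, num_prompts: int,
                            min_len: int, max_len: int, seed: int = 0) -> list[str]:
    """Pick first-turn human messages whose approximate length is in [min_len, max_len]."""
    with open(path) as f:
        data = json.load(f)
    candidates = []
    for conv in data:
        turns = conv.get("conversations", [])
        if turns and turns[0].get("from") == "human":
            text = turns[0]["value"]
            if min_len <= len(text.split()) <= max_len:
                candidates.append(text)
    # Deterministic sample so repeated benchmark runs are comparable
    return random.Random(seed).sample(candidates, num_prompts)
```

Each sampled prompt is then replayed `--repeat-count` times, so the repeats share their full prompt as a cacheable prefix.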
## Option 3. Prefix Repetition Dataset (Online)
Uses `vllm bench serve` with the synthetic `prefix_repetition` dataset to test caching via the serving API. This requires a running vLLM server.

First, start the server:
```bash
vllm serve Qwen/Qwen3-8B
```

Then run the benchmark:
```bash
vllm bench serve \
    --backend openai \
    --model Qwen/Qwen3-8B \
    --dataset-name prefix_repetition \
    --num-prompts 100 \
    --prefix-repetition-prefix-len 512 \
    --prefix-repetition-suffix-len 128 \
    --prefix-repetition-num-prefixes 5 \
    --prefix-repetition-output-len 128
```

Key parameters for `prefix_repetition`:

| Parameter | Description |
|---|---|
| `--prefix-repetition-prefix-len` | Number of tokens in the shared prefix portion |
| `--prefix-repetition-suffix-len` | Number of tokens in the unique suffix portion |
| `--prefix-repetition-num-prefixes` | Number of distinct prefixes to cycle through |
| `--prefix-repetition-output-len` | Number of output tokens to generate per request |
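For intuition, the shape of the workload these parameters describe can be sketched as follows (a simplified stand-in for the actual `prefix_repetition` generator; one word stands in for one token):

```python
def build_prefix_repetition_workload(num_prompts: int, num_prefixes: int,
                                     prefix_len: int, suffix_len: int) -> list[str]:
    """Build prompts that cycle through `num_prefixes` shared prefixes,
    each followed by a suffix unique to the request."""
    prefixes = [" ".join(f"p{i}w{j}" for j in range(prefix_len))
                for i in range(num_prefixes)]
    prompts = []
    for k in range(num_prompts):
        prefix = prefixes[k % num_prefixes]                        # shared, cacheable
        suffix = " ".join(f"s{k}w{j}" for j in range(suffix_len))  # unique, uncached
        prompts.append(prefix + " " + suffix)
    return prompts

prompts = build_prefix_repetition_workload(100, 5, 512, 128)
# 100 prompts, but only 5 distinct prefixes (strip the 128 suffix words)
print(len(prompts), len({p.rsplit(" ", 128)[0] for p in prompts}))  # 100 5
```

With 5 prefixes of 512 tokens and 128-token suffixes, roughly 512 / (512 + 128) = 80% of each warm request's prompt can be served from the cache.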
## Notes
- Run all commands from the root of the vLLM repository (`cd vllm`).
- Keep the default model (`Qwen/Qwen3-8B`) unless the user specifies a different one or the model is unavailable; change only `--model`.
- `--repeat-count` in Options 1 and 2 controls how many times each sampled prompt is replayed; higher values increase the cache hit rate.
- `--input-length-range` accepts a `min:max` token range, e.g. `128:256`.
- For multi-GPU setups, add `--tensor-parallel-size <N>`.
- To test different hash algorithms for prefix caching internals, use `--prefix-caching-hash-algo xxhash` (requires `pip install xxhash`).
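The `min:max` convention can be parsed as in this sketch (illustrative; `parse_length_range` is a hypothetical helper, not the script's own parser):

```python
def parse_length_range(value: str) -> tuple[int, int]:
    """Parse a 'min:max' string such as '128:256' into an inclusive token range."""
    try:
        lo_str, hi_str = value.split(":")
        lo, hi = int(lo_str), int(hi_str)
    except ValueError:
        # Raised both on a wrong number of ':' parts and on non-integer parts
        raise ValueError(f"expected 'min:max', got {value!r}") from None
    if lo < 0 or lo > hi:
        raise ValueError(f"invalid range {lo}:{hi}")
    return lo, hi

print(parse_length_range("128:256"))  # (128, 256)
```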
## Arguments for `benchmark_prefix_caching.py`

| Argument | Required | Description |
|---|---|---|
| `--model` | Yes | Model name or path (HuggingFace ID or local path) |
| `--num-prompts` | Yes | Number of prompts to process |
| `--input-length-range` | Yes | Token length range for inputs, e.g. `128:256` |
| `--repeat-count` | No | Number of times each prompt is repeated (default: 1) |
| `--dataset-path` | No | Path to a dataset file (e.g. ShareGPT JSON). Omit for synthetic fixed-prompt mode |
| | No | Fixed prefix token length to prepend to every prompt |
| | No | Number of output tokens to generate per request |
| | No | Sort prompts by length before benchmarking |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | No | Toggle APC (recommended: enable to test caching) |
| `--prefix-caching-hash-algo` | No | Hash algorithm: |
| `--tensor-parallel-size` | No | Number of GPUs for tensor parallelism |
| | No | Skip detokenization to reduce overhead |
## Troubleshooting
- If `python3 benchmarks/*.py` reports file not found, locate your local vLLM repository first and run the command from that repo root.
- If you do not have the repository yet, clone it and continue:

```bash
git clone https://github.com/vllm-project/vllm
cd vllm
```

- If HuggingFace model download fails due to access restrictions, set your token with `export HF_TOKEN=<your_token>` or pass `--hf-token <your_token>`.
- If `xxhash` or `cbor2` is not installed and you use those hash algorithms, install them first: `pip install xxhash cbor2`.