vllm-prefix-cache-bench

# vLLM Prefix Caching Benchmark


Benchmark the efficiency of vLLM's automatic prefix caching (APC) feature. The offline script `benchmarks/benchmark_prefix_caching.py` runs directly against the vLLM engine (no server required). For online/serving tests, use `vllm bench serve` with the `prefix_repetition` dataset.
## When to use


- User wants to measure the performance impact of prefix caching for repeated or partially shared prompts.
- User wants to compare throughput/latency with and without `--enable-prefix-caching`.
- User wants to test prefix caching using a fixed synthetic prompt, a real dataset (e.g. ShareGPT), or a synthetic prefix/suffix repetition pattern.

## Option 1 (default): Fixed Prompt with Prefix Caching


Runs a synthetic benchmark with a fixed prompt repeated multiple times to directly measure cache-hit efficiency. No dataset download is required.

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model Qwen/Qwen3-8B \
  --enable-prefix-caching \
  --num-prompts 1 \
  --repeat-count 100 \
  --input-length-range 128:256
```

To compare against the baseline without caching:

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model Qwen/Qwen3-8B \
  --no-enable-prefix-caching \
  --num-prompts 1 \
  --repeat-count 100 \
  --input-length-range 128:256
```
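As a quick sanity check on what the comparison can show at best: with `--repeat-count 100` on a single fixed prompt, only the first occurrence must be prefilled from scratch, so at most 99 of every 100 input tokens can be served from the cache. A small shell sketch of that arithmetic (illustrative only; the benchmark script reports the actual measured numbers):

```bash
# Illustrative arithmetic, not output of benchmark_prefix_caching.py:
# with one fixed prompt repeated REPEAT times, only the first pass must
# prefill from scratch, so at most (REPEAT-1)/REPEAT of input tokens
# can hit the prefix cache.
REPEAT=100
PCT=$(( 100 * (REPEAT - 1) / REPEAT ))
echo "ideal cache-hit ceiling: ${PCT}% of input tokens"
```

The measured speedup will be lower than this ceiling, since decode time and scheduling overhead are unaffected by caching.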

## Option 2: ShareGPT Dataset with Prefix Caching


Uses real-world conversational data from ShareGPT to evaluate prefix caching with naturally occurring prompt sharing.

First, download the dataset:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

Then run the benchmark:

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model Qwen/Qwen3-8B \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --enable-prefix-caching \
  --num-prompts 20 \
  --repeat-count 5 \
  --input-length-range 128:256
```
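With these settings the run issues 20 sampled ShareGPT prompts, each replayed 5 times, for 100 requests in total; the replays after each prompt's first occurrence are where cache hits come from. A shell sketch of that count (illustrative arithmetic, not tool output):

```bash
# Illustrative arithmetic for the command above: total requests equals
# sampled prompts times replays per prompt.
NUM_PROMPTS=20
REPEAT_COUNT=5
TOTAL=$(( NUM_PROMPTS * REPEAT_COUNT ))
echo "total requests: ${TOTAL}"
```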

## Option 3: Prefix Repetition Dataset (Online)


Uses `vllm bench serve` with the synthetic `prefix_repetition` dataset to test caching via the serving API. This requires a running vLLM server.

First, start the server:

```bash
vllm serve Qwen/Qwen3-8B
```

Then run the benchmark:

```bash
vllm bench serve \
  --backend openai \
  --model Qwen/Qwen3-8B \
  --dataset-name prefix_repetition \
  --num-prompts 100 \
  --prefix-repetition-prefix-len 512 \
  --prefix-repetition-suffix-len 128 \
  --prefix-repetition-num-prefixes 5 \
  --prefix-repetition-output-len 128
```

Key parameters for `prefix_repetition`:

| Parameter | Description |
| --- | --- |
| `--prefix-repetition-prefix-len` | Number of tokens in the shared prefix portion |
| `--prefix-repetition-suffix-len` | Number of tokens in the unique suffix portion |
| `--prefix-repetition-num-prefixes` | Number of distinct prefixes to cycle through |
| `--prefix-repetition-output-len` | Number of output tokens to generate per request |
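For the example command above, these parameters also pin down how much of the input is cacheable in the ideal case: 5 distinct 512-token prefixes cycled over 100 requests means each prefix recurs 20 times, and every recurrence after the first can hit the cache. A back-of-envelope shell sketch (illustrative arithmetic only, not something `vllm bench serve` reports; it assumes each repeated prefix is still resident in the cache when it recurs):

```bash
# Values mirror the example command above.
NUM_PROMPTS=100
PREFIX_LEN=512
SUFFIX_LEN=128
NUM_PREFIXES=5

REUSES=$(( NUM_PROMPTS / NUM_PREFIXES ))                    # each prefix seen 20 times
TOTAL_INPUT=$(( NUM_PROMPTS * (PREFIX_LEN + SUFFIX_LEN) ))  # all input tokens across requests
CACHEABLE=$(( (REUSES - 1) * NUM_PREFIXES * PREFIX_LEN ))   # prefix tokens after each first pass
PCT=$(( 100 * CACHEABLE / TOTAL_INPUT ))
echo "ideal cached input tokens: ${CACHEABLE}/${TOTAL_INPUT} (~${PCT}%)"
```

Increasing `--prefix-repetition-prefix-len` relative to the suffix length, or lowering `--prefix-repetition-num-prefixes`, raises this ceiling and should make the caching benefit more visible.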

## Notes


- Run all commands from the root of the vLLM repository (`cd vllm`).
- Keep the default model (`Qwen/Qwen3-8B`) unless the user specifies a different one or the model is unavailable; change only `--model`.
- `--repeat-count` in Options 1 and 2 controls how many times each sampled prompt is replayed; higher values increase the cache hit rate.
- `--input-length-range` accepts a `min:max` token range, e.g. `128:256`.
- For multi-GPU setups, add `--tensor-parallel-size <N>`.
- To test different hash algorithms for prefix caching internals, use `--prefix-caching-hash-algo xxhash` (requires `pip install xxhash`).
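The `min:max` format mentioned in the notes can be sanity-checked before launching a long run. A small shell sketch (a hypothetical helper, not part of the benchmark script):

```bash
# Parse a min:max token range the way --input-length-range expects it,
# and reject ranges whose minimum exceeds the maximum.
RANGE="128:256"
MIN=${RANGE%%:*}
MAX=${RANGE##*:}
if [ "$MIN" -le "$MAX" ]; then
  echo "ok: sampling input lengths in [${MIN}, ${MAX}] tokens"
else
  echo "error: min ${MIN} exceeds max ${MAX} in '${RANGE}'" >&2
  exit 1
fi
```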

## Arguments for `benchmark_prefix_caching.py`


| Argument | Required | Description |
| --- | --- | --- |
| `--model` | Yes | Model name or path (HuggingFace ID or local path) |
| `--num-prompts` | Yes | Number of prompts to process |
| `--input-length-range` | Yes | Token length range for inputs, e.g. `128:256` |
| `--repeat-count` | No | Number of times each prompt is repeated (default: 1) |
| `--dataset-path` | No | Path to a dataset file (e.g. ShareGPT JSON); omit for synthetic fixed-prompt mode |
| `--prefix-len` | No | Fixed prefix token length to prepend to every prompt |
| `--output-len` | No | Number of output tokens to generate per request |
| `--sort` | No | Sort prompts by length before benchmarking |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | No | Toggle APC (recommended: enable to test caching) |
| `--prefix-caching-hash-algo` | No | Hash algorithm: `sha256`, `sha256_cbor`, `xxhash`, `xxhash_cbor` |
| `--tensor-parallel-size` | No | Number of GPUs for tensor parallelism |
| `--disable-detokenize` | No | Skip detokenization to reduce overhead |

## Troubleshooting


- If `python3 benchmarks/*.py` reports "file not found", locate your local vLLM repository first and run the command from that repo root.
- If you do not have the repository yet, clone it and continue:

  ```bash
  git clone https://github.com/vllm-project/vllm
  cd vllm
  ```

- If the HuggingFace model download fails due to access restrictions, set your token with `export HF_TOKEN=<your_token>` or pass `--hf-token <your_token>`.
- If `xxhash` or `cbor2` is not installed and you use those hash algorithms, install them first: `pip install xxhash cbor2`.
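The last point can be checked up front. A shell sketch that probes both optional modules and prints a status line for each (assumes `python3` is on the `PATH`; nothing here is part of vLLM itself):

```bash
# Probe the optional hash backends behind --prefix-caching-hash-algo.
# Prints one line per module whether or not it is installed.
report=$(for mod in xxhash cbor2; do
  if python3 -c "import $mod" >/dev/null 2>&1; then
    echo "$mod: available"
  else
    echo "$mod: missing (run: pip install $mod)"
  fi
done)
echo "$report"
```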