vllm-prefix-cache-bench

# vLLM Prefix Caching Benchmark


Benchmark the efficiency of vLLM's automatic prefix caching (APC) feature. The offline script `benchmarks/benchmark_prefix_caching.py` runs directly against the vLLM engine (no server required). For online/serving tests, use `vllm bench serve` with the `prefix_repetition` dataset.
## When to use


- User wants to measure the performance impact of prefix caching for repeated or partially shared prompts.
- User wants to compare throughput/latency with and without `--enable-prefix-caching`.
- User wants to test prefix caching using a fixed synthetic prompt, a real dataset (e.g. ShareGPT), or a synthetic prefix/suffix repetition pattern.

## Option 1 (default): Fixed Prompt with Prefix Caching


Runs a synthetic benchmark with a fixed prompt repeated multiple times to directly measure cache-hit efficiency. No dataset download is required.

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model Qwen/Qwen3-8B \
  --enable-prefix-caching \
  --num-prompts 1 \
  --repeat-count 100 \
  --input-length-range 128:256
```

To compare against the baseline without caching:

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model Qwen/Qwen3-8B \
  --no-enable-prefix-caching \
  --num-prompts 1 \
  --repeat-count 100 \
  --input-length-range 128:256
```
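As a quick sanity check on what the comparison can show at best: with `--repeat-count 100` on a single fixed prompt, only the first occurrence must be prefilled from scratch, so at most 99 of every 100 input tokens can be served from the cache. A small shell sketch of that arithmetic (illustrative only; the benchmark script reports the actual measured numbers):

```bash
# Illustrative arithmetic, not output of benchmark_prefix_caching.py:
# with one fixed prompt repeated REPEAT times, only the first pass must
# prefill from scratch, so at most (REPEAT-1)/REPEAT of input tokens
# can hit the prefix cache.
REPEAT=100
PCT=$(( 100 * (REPEAT - 1) / REPEAT ))
echo "ideal cache-hit ceiling: ${PCT}% of input tokens"
```

The measured speedup will be lower than this ceiling, since decode time and scheduling overhead are unaffected by caching.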

## Option 2: ShareGPT Dataset with Prefix Caching


Uses real-world conversational data from ShareGPT to evaluate prefix caching with naturally occurring prompt sharing.

First, download the dataset:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

Then run the benchmark:

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model Qwen/Qwen3-8B \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --enable-prefix-caching \
  --num-prompts 20 \
  --repeat-count 5 \
  --input-length-range 128:256
```
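With these settings the run issues 20 sampled ShareGPT prompts, each replayed 5 times, for 100 requests in total; the replays after each prompt's first occurrence are where cache hits come from. A shell sketch of that count (illustrative arithmetic, not tool output):

```bash
# Illustrative arithmetic for the command above: total requests equals
# sampled prompts times replays per prompt.
NUM_PROMPTS=20
REPEAT_COUNT=5
TOTAL=$(( NUM_PROMPTS * REPEAT_COUNT ))
echo "total requests: ${TOTAL}"
```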

## Option 3: Prefix Repetition Dataset (Online)


Uses `vllm bench serve` with the synthetic `prefix_repetition` dataset to test caching via the serving API. This requires a running vLLM server.

First, start the server:

```bash
vllm serve Qwen/Qwen3-8B
```

Then run the benchmark:

```bash
vllm bench serve \
  --backend openai \
  --model Qwen/Qwen3-8B \
  --dataset-name prefix_repetition \
  --num-prompts 100 \
  --prefix-repetition-prefix-len 512 \
  --prefix-repetition-suffix-len 128 \
  --prefix-repetition-num-prefixes 5 \
  --prefix-repetition-output-len 128
```

Key parameters for `prefix_repetition`:

| Parameter | Description |
| --- | --- |
| `--prefix-repetition-prefix-len` | Number of tokens in the shared prefix portion |
| `--prefix-repetition-suffix-len` | Number of tokens in the unique suffix portion |
| `--prefix-repetition-num-prefixes` | Number of distinct prefixes to cycle through |
| `--prefix-repetition-output-len` | Number of output tokens to generate per request |
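For the example command above, these parameters also pin down how much of the input is cacheable in the ideal case: 5 distinct 512-token prefixes cycled over 100 requests means each prefix recurs 20 times, and every recurrence after the first can hit the cache. A back-of-envelope shell sketch (illustrative arithmetic only, not something `vllm bench serve` reports; it assumes each repeated prefix is still resident in the cache when it recurs):

```bash
# Values mirror the example command above.
NUM_PROMPTS=100
PREFIX_LEN=512
SUFFIX_LEN=128
NUM_PREFIXES=5

REUSES=$(( NUM_PROMPTS / NUM_PREFIXES ))                    # each prefix seen 20 times
TOTAL_INPUT=$(( NUM_PROMPTS * (PREFIX_LEN + SUFFIX_LEN) ))  # all input tokens across requests
CACHEABLE=$(( (REUSES - 1) * NUM_PREFIXES * PREFIX_LEN ))   # prefix tokens after each first pass
PCT=$(( 100 * CACHEABLE / TOTAL_INPUT ))
echo "ideal cached input tokens: ${CACHEABLE}/${TOTAL_INPUT} (~${PCT}%)"
```

Increasing `--prefix-repetition-prefix-len` relative to the suffix length, or lowering `--prefix-repetition-num-prefixes`, raises this ceiling and should make the caching benefit more visible.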

## Notes


- Run all commands from the root of the vLLM repository (`cd vllm`).
- Keep the default model (`Qwen/Qwen3-8B`) unless the user specifies a different one or the model is unavailable; change only `--model`.
- `--repeat-count` in Options 1 and 2 controls how many times each sampled prompt is replayed; higher values increase the cache hit rate.
- `--input-length-range` accepts a `min:max` token range, e.g. `128:256`.
- For multi-GPU setups, add `--tensor-parallel-size <N>`.
- To test different hash algorithms for prefix caching internals, use `--prefix-caching-hash-algo xxhash` (requires `pip install xxhash`).
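The `min:max` format mentioned in the notes can be sanity-checked before launching a long run. A small shell sketch (a hypothetical helper, not part of the benchmark script):

```bash
# Parse a min:max token range the way --input-length-range expects it,
# and reject ranges whose minimum exceeds the maximum.
RANGE="128:256"
MIN=${RANGE%%:*}
MAX=${RANGE##*:}
if [ "$MIN" -le "$MAX" ]; then
  echo "ok: sampling input lengths in [${MIN}, ${MAX}] tokens"
else
  echo "error: min ${MIN} exceeds max ${MAX} in '${RANGE}'" >&2
  exit 1
fi
```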

## Arguments for `benchmark_prefix_caching.py`


| Argument | Required | Description |
| --- | --- | --- |
| `--model` | Yes | Model name or path (HuggingFace ID or local path) |
| `--num-prompts` | Yes | Number of prompts to process |
| `--input-length-range` | Yes | Token length range for inputs, e.g. `128:256` |
| `--repeat-count` | No | Number of times each prompt is repeated (default: 1) |
| `--dataset-path` | No | Path to a dataset file (e.g. ShareGPT JSON); omit for synthetic fixed-prompt mode |
| `--prefix-len` | No | Fixed prefix token length to prepend to every prompt |
| `--output-len` | No | Number of output tokens to generate per request |
| `--sort` | No | Sort prompts by length before benchmarking |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | No | Toggle APC (recommended: enable to test caching) |
| `--prefix-caching-hash-algo` | No | Hash algorithm: `sha256`, `sha256_cbor`, `xxhash`, `xxhash_cbor` |
| `--tensor-parallel-size` | No | Number of GPUs for tensor parallelism |
| `--disable-detokenize` | No | Skip detokenization to reduce overhead |

## Troubleshooting


- If `python3 benchmarks/*.py` reports "file not found", locate your local vLLM repository first and run the command from that repo root.
- If you do not have the repository yet, clone it and continue:

  ```bash
  git clone https://github.com/vllm-project/vllm
  cd vllm
  ```

- If the HuggingFace model download fails due to access restrictions, set your token with `export HF_TOKEN=<your_token>` or pass `--hf-token <your_token>`.
- If `xxhash` or `cbor2` is not installed and you use those hash algorithms, install them first: `pip install xxhash cbor2`.
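The last point can be checked up front. A shell sketch that probes both optional modules and prints a status line for each (assumes `python3` is on the `PATH`; nothing here is part of vLLM itself):

```bash
# Probe the optional hash backends behind --prefix-caching-hash-algo.
# Prints one line per module whether or not it is installed.
report=$(for mod in xxhash cbor2; do
  if python3 -c "import $mod" >/dev/null 2>&1; then
    echo "$mod: available"
  else
    echo "$mod: missing (run: pip install $mod)"
  fi
done)
echo "$report"
```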