This is a skill for benchmarking the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.
## Installation

```bash
npx skill4agent add vllm-project/vllm-skills vllm-prefix-cache-bench
```

## Usage

Two tools are covered: the offline script `benchmarks/benchmark_prefix_caching.py`, and the online `vllm bench serve` benchmark with its synthetic `prefix_repetition` dataset. Automatic prefix caching is toggled with `--enable-prefix-caching`.

### Fixed prompt, caching enabled

```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model Qwen/Qwen3-8B \
    --enable-prefix-caching \
    --num-prompts 1 \
    --repeat-count 100 \
    --input-length-range 128:256
```

### Fixed prompt, caching disabled (baseline)

```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model Qwen/Qwen3-8B \
    --no-enable-prefix-caching \
    --num-prompts 1 \
    --repeat-count 100 \
    --input-length-range 128:256
```

### Real-world dataset (ShareGPT)

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

python3 benchmarks/benchmark_prefix_caching.py \
    --model Qwen/Qwen3-8B \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --enable-prefix-caching \
    --num-prompts 20 \
    --repeat-count 5 \
    --input-length-range 128:256
```

### Online benchmark: `vllm bench serve` with `prefix_repetition`

Start a server, then run the benchmark against it:

```bash
vllm serve Qwen/Qwen3-8B
```

```bash
vllm bench serve \
    --backend openai \
    --model Qwen/Qwen3-8B \
    --dataset-name prefix_repetition \
    --num-prompts 100 \
    --prefix-repetition-prefix-len 512 \
    --prefix-repetition-suffix-len 128 \
    --prefix-repetition-num-prefixes 5 \
    --prefix-repetition-output-len 128
```

The `prefix_repetition` dataset parameters:

| Parameter | Description |
|---|---|
| `--prefix-repetition-prefix-len` | Number of tokens in the shared prefix portion |
| `--prefix-repetition-suffix-len` | Number of tokens in the unique suffix portion |
| `--prefix-repetition-num-prefixes` | Number of distinct prefixes to cycle through |
| `--prefix-repetition-output-len` | Number of output tokens to generate per request |
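To make the parameters above concrete, here is a small, hypothetical sketch (not vLLM's implementation) of how a `prefix_repetition`-style workload is shaped: each request reuses one of `num_prefixes` shared prefixes and appends a fresh, unique suffix.

```python
import random

def make_prefix_repetition_prompts(num_prompts, prefix_len, suffix_len,
                                   num_prefixes, vocab=1000):
    """Build synthetic token-ID prompts that mimic the prefix_repetition pattern."""
    rng = random.Random(0)
    # One fixed token sequence per shared prefix.
    prefixes = [[rng.randrange(vocab) for _ in range(prefix_len)]
                for _ in range(num_prefixes)]
    prompts = []
    for i in range(num_prompts):
        prefix = prefixes[i % num_prefixes]  # cycle through the shared prefixes
        suffix = [rng.randrange(vocab) for _ in range(suffix_len)]  # unique tail
        prompts.append(prefix + suffix)
    return prompts

prompts = make_prefix_repetition_prompts(
    num_prompts=100, prefix_len=512, suffix_len=128, num_prefixes=5)
```

With the example settings every request is 640 tokens long and each of the 5 prefixes is shared by 20 requests, so up to 512 of each prompt's tokens are reusable from cache after the first request per prefix.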
## Notes

Run all commands from the vLLM repository root (`cd vllm`). The examples use `Qwen/Qwen3-8B`; substitute any model via `--model`. A high `--repeat-count` maximizes cache reuse. `--input-length-range` takes the form `min:max`, e.g. `128:256`. For multi-GPU runs, add `--tensor-parallel-size <N>`. An alternative hash algorithm for cache keys can be selected with `--prefix-caching-hash-algo xxhash` (requires `pip install xxhash`).

### `benchmark_prefix_caching.py` arguments

| Argument | Required | Description |
|---|---|---|
| `--model` | Yes | Model name or path (HuggingFace ID or local path) |
| `--num-prompts` | Yes | Number of prompts to process |
| `--input-length-range` | Yes | Token length range for inputs, e.g. `128:256` |
| `--repeat-count` | No | Number of times each prompt is repeated (default: 1) |
| `--dataset-path` | No | Path to a dataset file (e.g. ShareGPT JSON). Omit for synthetic fixed-prompt mode |
| `--prefix-len` | No | Fixed prefix token length to prepend to every prompt |
| `--output-len` | No | Number of output tokens to generate per request |
| `--sort` | No | Sort prompts by length before benchmarking |
| `--enable-prefix-caching` | No | Toggle APC (recommended: enable to test caching; disable with `--no-enable-prefix-caching`) |
| `--prefix-caching-hash-algo` | No | Hash algorithm for prefix-cache keys, e.g. `xxhash` |
| `--tensor-parallel-size` | No | Number of GPUs for tensor parallelism |
| `--disable-detokenize` | No | Skip detokenization to reduce overhead |
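As a rough sanity check on reported numbers, an idealized upper bound on the fraction of prompt tokens that could be served from cache follows directly from `--repeat-count`. This is a simplification for interpreting results, not the metric the script itself computes:

```python
def ideal_cached_token_fraction(prompt_len, repeat_count):
    """Upper-bound fraction of prompt tokens servable from cache when the same
    prompt of prompt_len tokens is sent repeat_count times: the first pass must
    be computed, every later pass can in principle reuse the whole prompt."""
    total = prompt_len * repeat_count
    cached = prompt_len * (repeat_count - 1)
    return cached / total

# e.g. --repeat-count 100 on a single prompt: ~99% of prompt tokens are reusable
print(ideal_cached_token_fraction(256, 100))
```

Measured speedups will fall short of this bound because output generation, scheduling, and cache-block granularity are not free.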
## Troubleshooting

The scripts are invoked as `python3 benchmarks/*.py` from a checkout of the repository:

```bash
git clone https://github.com/vllm-project/vllm
cd vllm
```

For gated models, authenticate with `export HF_TOKEN=<your_token>` (or pass `--hf-token <your_token>`). The optional hashing dependencies `xxhash` and `cbor2` can be installed with:

```bash
pip install xxhash cbor2
```
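One caveat when interpreting results: vLLM caches KV entries at block granularity (16 tokens per block by default), so only the portion of a shared prefix that fills whole blocks is reusable. A quick sketch of that rounding, assuming the default block size:

```python
def reusable_prefix_tokens(shared_prefix_len, block_size=16):
    """Tokens of a shared prefix that land in full KV-cache blocks and can be reused."""
    return (shared_prefix_len // block_size) * block_size

print(reusable_prefix_tokens(512))  # 512 is block-aligned: all 512 tokens reusable
print(reusable_prefix_tokens(130))  # only 128: the trailing 2 tokens miss the cache
```

This is why block-aligned prefix lengths (such as the 512 used above) give the cleanest benchmark signal.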