# vllm-ascend

vLLM-Ascend - LLM Inference Serving


vLLM-Ascend is a plugin for vLLM that enables efficient LLM inference on Huawei Ascend AI processors. It provides Ascend-optimized kernels, quantization support, and distributed inference capabilities.


## Quick Start

### Offline Batch Inference

```python
import os

# Required for vLLM-Ascend: set the multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Load the model on an Ascend NPU (device auto-detected when vllm-ascend is installed)
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
)

# Prepare prompts and sampling parameters
prompts = [
    "Hello, how are you?",
    "Explain quantum computing in simple terms.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
```

### OpenAI-Compatible API Server

```bash
# Start the API server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --served-model-name "qwen2.5-7b"

# Or using Python
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096
```

### API Client Example

```python
import requests

# Completions API
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen2.5-7b",
        "prompt": "Once upon a time",
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())

# Chat Completions API
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
)
print(response.json())
```
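Both endpoints return OpenAI-style JSON. Pulling the assistant text out of a chat response takes a small helper; `extract_reply` below is illustrative, not part of vLLM or its client API.

```python
def extract_reply(response_json: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion payload."""
    return response_json["choices"][0]["message"]["content"]

# A minimal payload of the same shape as the server's response:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # → Hello! How can I help?
```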

---

## Installation

### Prerequisites

- CANN: 8.0.RC1 or higher
- Python: 3.9 or higher
- PyTorch Ascend: compatible with your CANN version

### Method 1: Docker (Recommended)

```bash
# Pull the pre-built image
docker pull ascendai/vllm-ascend:latest

# Run with NPU access
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons:/usr/local/Ascend/add-ons \
    -e ASCEND_RT_VISIBLE_DEVICES=0 \
    ascendai/vllm-ascend:latest
```

### Method 2: pip Installation

```bash
# Install vLLM with the Ascend plugin
pip install vllm-ascend

# Or install from source
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```

### Verify Installation

```bash
# Check the vLLM-Ascend installation
python -c "import vllm_ascend; print(vllm_ascend.__version__)"

# Check NPU availability
python -c "import torch; import torch_npu; print(torch_npu.npu.device_count())"
```
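If the one-liners above fail with an `ImportError`, it can help to first check whether the relevant modules are importable at all. The probe below is a sketch: it only checks importability and says nothing about driver or CANN health.

```python
import importlib.util

# Probe each module without importing it (and without hard-failing)
for mod in ("vllm_ascend", "torch_npu"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'found' if found else 'NOT installed'}")
```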

---

## Deployment

### Server Mode

```bash
# Basic server deployment
vllm serve <model_path> \
    --served-model-name <name> \
    --host 0.0.0.0 \
    --port 8000

# Production deployment with optimizations
vllm serve /path/to/model \
    --served-model-name "qwen2.5-72b" \
    --max-model-len 8192 \
    --max-num-seqs 256 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16 \
    --api-key <your-api-key>
```

### Python API

```python
import os

# Required: set the spawn multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Single NPU
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    dtype="bfloat16",
)

# Distributed inference (multi-NPU)
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192,
)

# Generate
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello world"], params)
```

### LLM Engine (Advanced)

```python
from vllm import LLMEngine, EngineArgs, SamplingParams

engine_args = EngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
)
engine = LLMEngine.from_engine_args(engine_args)

# Add requests and step through generation
request_id = "req-001"
prompt = "Hello, world!"
params = SamplingParams(max_tokens=50)
engine.add_request(request_id, prompt, params)

while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(f"{output.request_id}: {output.outputs[0].text}")
```

---

## Quantization

vLLM-Ascend supports models quantized with msModelSlim. For quantization details, see msmodelslim.

### Using Quantized Models

```bash
# W8A8 quantized model
vllm serve /path/to/quantized-model-w8a8 \
    --quantization ascend \
    --max-model-len 4096

# W4A8 quantized model
vllm serve /path/to/quantized-model-w4a8 \
    --quantization ascend \
    --max-model-len 4096
```

### Python API with Quantization

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/quantized-model",
    quantization="ascend",
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello"], params)
```

## Distributed Inference

### Tensor Parallelism

Tensor parallelism distributes model layers across multiple NPUs, so large models can be served.

```bash
# 4-way tensor parallelism
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192
```

```python
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192,
)
```

### Pipeline Parallelism

```python
from vllm import LLM

llm = LLM(
    model="DeepSeek-V3",
    pipeline_parallel_size=2,
    tensor_parallel_size=4,
)
```

### Multi-Node Deployment

```bash
# Node 0 (Rank 0)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 0

# Node 1 (Rank 1)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 1
```

---

## Performance Optimization

### Key Parameters

| Parameter | Default | Description | Tuning Advice |
|-----------|---------|-------------|---------------|
| `--max-model-len` | Model max | Maximum sequence length | Reduce if OOM |
| `--max-num-seqs` | 256 | Max concurrent sequences | Increase for throughput |
| `--gpu-memory-utilization` | 0.9 | GPU memory fraction | Lower if OOM during warmup |
| `--dtype` | auto | Data type | `bfloat16` for speed, `float16` for compatibility |
| `--tensor-parallel-size` | 1 | Tensor parallelism degree | Use for large models |
| `--pipeline-parallel-size` | 1 | Pipeline parallelism degree | Use for very large models |
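To see why `--max-model-len` and `--max-num-seqs` dominate memory use, a rough KV-cache estimate helps: every sequence reserves K and V entries per token, per layer, per KV head. The sketch below is a back-of-the-envelope calculation; the layer/head/dim numbers are assumptions in the style of Qwen2.5-7B (check your model's `config.json` for the real values), and it ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes_per_seq(max_model_len: int, num_layers: int,
                           num_kv_heads: int, head_dim: int,
                           bytes_per_elem: int = 2) -> int:
    # 2x for K and V; one entry per token, per layer, per KV head; bf16 = 2 bytes
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * max_model_len

# Assumed Qwen2.5-7B-style dimensions (illustrative only)
per_seq = kv_cache_bytes_per_seq(max_model_len=4096, num_layers=28,
                                 num_kv_heads=4, head_dim=128)
print(f"{per_seq / 2**20:.0f} MiB of KV cache per full-length sequence")  # → 224 MiB
```

Multiplying by `--max-num-seqs` gives the worst-case KV-cache budget, which must fit inside the fraction set by `--gpu-memory-utilization`.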

### Example Configurations

```bash
# Small model (7B), single NPU
vllm serve <model> --max-model-len 4096 --max-num-seqs 256

# Medium model (32B), single NPU
vllm serve <model> --max-model-len 8192 --max-num-seqs 128

# Large model (72B), multi-NPU
vllm serve <model> --tensor-parallel-size 4 --max-model-len 8192

# Maximum throughput
vllm serve <model> --max-num-seqs 512 --gpu-memory-utilization 0.95
```

---

## Troubleshooting

### Common Issues

**Q: `AclNN_Parameter_Error` or dtype errors?**

```bash
# Check CANN version compatibility; ensure CANN >= 8.0.RC1
npu-smi info

# Try a different dtype
vllm serve <model> --dtype float16
```

**Q: Out of memory (OOM)?**

```bash
# Reduce the maximum model length
vllm serve <model> --max-model-len 2048

# Lower memory utilization
vllm serve <model> --gpu-memory-utilization 0.8

# Reduce concurrent sequences
vllm serve <model> --max-num-seqs 128
```

**Q: Model loading fails?**

```bash
# Check the model path
ls /path/to/model

# Verify the tokenizer
python -c "from transformers import AutoTokenizer; tok = AutoTokenizer.from_pretrained('/path/to/model'); print('OK')"

# Use trust-remote-code for custom models
vllm serve <model> --trust-remote-code
```

**Q: Slow inference?**

```bash
# Enable bfloat16 for faster compute
vllm serve <model> --dtype bfloat16

# Adjust the block size
vllm serve <model> --block-size 256

# Enable prefix caching
vllm serve <model> --enable-prefix-caching
```

**Q: API server connection refused?**

```bash
# Check that the server is running

# Verify the port is not already in use
lsof -i :8000

# Use an explicit host/port
vllm serve <model> --host 0.0.0.0 --port 8000
```

### Environment Variables

```bash
# Required: set the multiprocessing method for vLLM-Ascend
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# Set Ascend device IDs
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# Debug logging
export VLLM_LOGGING_LEVEL=DEBUG

# Disable lazy initialization (for debugging)
export VLLM_ASCEND_LAZY_INIT=0
```
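`ASCEND_RT_VISIBLE_DEVICES` takes a comma-separated list of device IDs, analogous to `CUDA_VISIBLE_DEVICES`. The snippet below sketches how such a value can be parsed on the consuming side; it is illustrative only, not vLLM-Ascend's actual parsing code.

```python
import os

# Comma-separated device IDs, e.g. "0,1,2,3"; fall back to device 0 when unset
raw = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "0")
device_ids = [int(part) for part in raw.split(",") if part.strip()]
print(device_ids)
```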

---

## Scripts

- `scripts/benchmark_throughput.py` - throughput benchmark
- `scripts/benchmark_latency.py` - latency benchmark
- `scripts/start_server.sh` - server startup template

## References

- `references/deployment.md` - deployment patterns and best practices
- `references/supported-models.md` - complete model support matrix
- `references/api-reference.md` - API endpoint documentation

## Related Skills

- `msmodelslim` - model quantization for vLLM-Ascend
- `ascend-docker` - Docker container setup for Ascend
- `npu-smi` - NPU device management
- `hccl-test` - HCCL performance testing for multi-NPU

## Official References