# vllm-ascend

vLLM-Ascend - LLM Inference Serving


vLLM-Ascend is a plugin for vLLM that enables efficient LLM inference on Huawei Ascend AI processors. It provides Ascend-optimized kernels, quantization support, and distributed inference capabilities.


## Quick Start

### Offline Batch Inference

```python
import os

# Required for vLLM-Ascend: set the multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Load the model on an Ascend NPU (device auto-detected when vllm-ascend is installed)
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
)

# Prepare prompts and sampling parameters
prompts = [
    "Hello, how are you?",
    "Explain quantum computing in simple terms.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
```

### OpenAI-Compatible API Server

```bash
# Start the API server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --served-model-name "qwen2.5-7b"

# Or using Python
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096
```

### API Client Example

```python
import requests

# Completions API
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen2.5-7b",
        "prompt": "Once upon a time",
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())

# Chat Completions API
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
)
print(response.json())
```
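Both endpoints return OpenAI-style JSON. Pulling the assistant text out of a chat response takes a small helper; `extract_reply` below is illustrative, not part of vLLM or its client API.

```python
def extract_reply(response_json: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion payload."""
    return response_json["choices"][0]["message"]["content"]

# A minimal payload of the same shape as the server's response:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # → Hello! How can I help?
```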

---

## Installation

### Prerequisites

- CANN: 8.0.RC1 or higher
- Python: 3.9 or higher
- PyTorch Ascend: compatible with your CANN version

### Method 1: Docker (Recommended)

```bash
# Pull the pre-built image
docker pull ascendai/vllm-ascend:latest

# Run with NPU access
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons:/usr/local/Ascend/add-ons \
    -e ASCEND_RT_VISIBLE_DEVICES=0 \
    ascendai/vllm-ascend:latest
```

### Method 2: pip Installation

```bash
# Install vLLM with the Ascend plugin
pip install vllm-ascend

# Or install from source
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```

### Verify Installation

```bash
# Check the vLLM-Ascend installation
python -c "import vllm_ascend; print(vllm_ascend.__version__)"

# Check NPU availability
python -c "import torch; import torch_npu; print(torch_npu.npu.device_count())"
```
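If the one-liners above fail with an `ImportError`, it can help to first check whether the relevant modules are importable at all. The probe below is a sketch: it only checks importability and says nothing about driver or CANN health.

```python
import importlib.util

# Probe each module without importing it (and without hard-failing)
for mod in ("vllm_ascend", "torch_npu"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'found' if found else 'NOT installed'}")
```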

---

## Deployment

### Server Mode

```bash
# Basic server deployment
vllm serve <model_path> \
    --served-model-name <name> \
    --host 0.0.0.0 \
    --port 8000

# Production deployment with optimizations
vllm serve /path/to/model \
    --served-model-name "qwen2.5-72b" \
    --max-model-len 8192 \
    --max-num-seqs 256 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16 \
    --api-key <your-api-key>
```

### Python API

```python
import os

# Required: set the spawn multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Single NPU
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    dtype="bfloat16",
)

# Distributed inference (multi-NPU)
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192,
)

# Generate
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello world"], params)
```

### LLM Engine (Advanced)

```python
from vllm import LLMEngine, EngineArgs, SamplingParams

engine_args = EngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
)
engine = LLMEngine.from_engine_args(engine_args)

# Add requests and step through generation
request_id = "req-001"
prompt = "Hello, world!"
params = SamplingParams(max_tokens=50)
engine.add_request(request_id, prompt, params)

while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(f"{output.request_id}: {output.outputs[0].text}")
```

---

## Quantization

vLLM-Ascend supports models quantized with msModelSlim. For quantization details, see msmodelslim.

### Using Quantized Models

```bash
# W8A8 quantized model
vllm serve /path/to/quantized-model-w8a8 \
    --quantization ascend \
    --max-model-len 4096

# W4A8 quantized model
vllm serve /path/to/quantized-model-w4a8 \
    --quantization ascend \
    --max-model-len 4096
```

### Python API with Quantization

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/quantized-model",
    quantization="ascend",
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello"], params)
```

## Distributed Inference

### Tensor Parallelism

Tensor parallelism distributes model layers across multiple NPUs, so large models can be served.

```bash
# 4-way tensor parallelism
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192
```

```python
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192,
)
```

### Pipeline Parallelism

```python
from vllm import LLM

llm = LLM(
    model="DeepSeek-V3",
    pipeline_parallel_size=2,
    tensor_parallel_size=4,
)
```

### Multi-Node Deployment

```bash
# Node 0 (Rank 0)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 0

# Node 1 (Rank 1)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 1
```

---

## Performance Optimization

### Key Parameters

| Parameter | Default | Description | Tuning Advice |
|-----------|---------|-------------|---------------|
| `--max-model-len` | Model max | Maximum sequence length | Reduce if OOM |
| `--max-num-seqs` | 256 | Max concurrent sequences | Increase for throughput |
| `--gpu-memory-utilization` | 0.9 | GPU memory fraction | Lower if OOM during warmup |
| `--dtype` | auto | Data type | `bfloat16` for speed, `float16` for compatibility |
| `--tensor-parallel-size` | 1 | Tensor parallelism degree | Use for large models |
| `--pipeline-parallel-size` | 1 | Pipeline parallelism degree | Use for very large models |
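To see why `--max-model-len` and `--max-num-seqs` dominate memory use, a rough KV-cache estimate helps: every sequence reserves K and V entries per token, per layer, per KV head. The sketch below is a back-of-the-envelope calculation; the layer/head/dim numbers are assumptions in the style of Qwen2.5-7B (check your model's `config.json` for the real values), and it ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes_per_seq(max_model_len: int, num_layers: int,
                           num_kv_heads: int, head_dim: int,
                           bytes_per_elem: int = 2) -> int:
    # 2x for K and V; one entry per token, per layer, per KV head; bf16 = 2 bytes
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * max_model_len

# Assumed Qwen2.5-7B-style dimensions (illustrative only)
per_seq = kv_cache_bytes_per_seq(max_model_len=4096, num_layers=28,
                                 num_kv_heads=4, head_dim=128)
print(f"{per_seq / 2**20:.0f} MiB of KV cache per full-length sequence")  # → 224 MiB
```

Multiplying by `--max-num-seqs` gives the worst-case KV-cache budget, which must fit inside the fraction set by `--gpu-memory-utilization`.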

### Example Configurations

```bash
# Small model (7B), single NPU
vllm serve <model> --max-model-len 4096 --max-num-seqs 256

# Medium model (32B), single NPU
vllm serve <model> --max-model-len 8192 --max-num-seqs 128

# Large model (72B), multi-NPU
vllm serve <model> --tensor-parallel-size 4 --max-model-len 8192

# Maximum throughput
vllm serve <model> --max-num-seqs 512 --gpu-memory-utilization 0.95
```

---

## Troubleshooting

### Common Issues

**Q: `AclNN_Parameter_Error` or dtype errors?**

```bash
# Check CANN version compatibility; ensure CANN >= 8.0.RC1
npu-smi info

# Try a different dtype
vllm serve <model> --dtype float16
```

**Q: Out of memory (OOM)?**

```bash
# Reduce the maximum model length
vllm serve <model> --max-model-len 2048

# Lower memory utilization
vllm serve <model> --gpu-memory-utilization 0.8

# Reduce concurrent sequences
vllm serve <model> --max-num-seqs 128
```

**Q: Model loading fails?**

```bash
# Check the model path
ls /path/to/model

# Verify the tokenizer
python -c "from transformers import AutoTokenizer; tok = AutoTokenizer.from_pretrained('/path/to/model'); print('OK')"

# Use trust-remote-code for custom models
vllm serve <model> --trust-remote-code
```

**Q: Slow inference?**

```bash
# Enable bfloat16 for faster compute
vllm serve <model> --dtype bfloat16

# Adjust the block size
vllm serve <model> --block-size 256

# Enable prefix caching
vllm serve <model> --enable-prefix-caching
```

**Q: API server connection refused?**

```bash
# Check that the server is running

# Verify the port is not already in use
lsof -i :8000

# Use an explicit host/port
vllm serve <model> --host 0.0.0.0 --port 8000
```

### Environment Variables

```bash
# Required: set the multiprocessing method for vLLM-Ascend
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# Set Ascend device IDs
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# Debug logging
export VLLM_LOGGING_LEVEL=DEBUG

# Disable lazy initialization (for debugging)
export VLLM_ASCEND_LAZY_INIT=0
```
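`ASCEND_RT_VISIBLE_DEVICES` takes a comma-separated list of device IDs, analogous to `CUDA_VISIBLE_DEVICES`. The snippet below sketches how such a value can be parsed on the consuming side; it is illustrative only, not vLLM-Ascend's actual parsing code.

```python
import os

# Comma-separated device IDs, e.g. "0,1,2,3"; fall back to device 0 when unset
raw = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "0")
device_ids = [int(part) for part in raw.split(",") if part.strip()]
print(device_ids)
```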

---

## Scripts

- `scripts/benchmark_throughput.py` - throughput benchmark
- `scripts/benchmark_latency.py` - latency benchmark
- `scripts/start_server.sh` - server startup template

## References

- `references/deployment.md` - deployment patterns and best practices
- `references/supported-models.md` - complete model support matrix
- `references/api-reference.md` - API endpoint documentation

## Related Skills

- `msmodelslim` - model quantization for vLLM-Ascend
- `ascend-docker` - Docker container setup for Ascend
- `npu-smi` - NPU device management
- `hccl-test` - HCCL performance testing for multi-NPU

## Official References