vllm-ascend
# vLLM-Ascend - LLM Inference Serving
vLLM-Ascend is a plugin for vLLM that enables efficient LLM inference on Huawei Ascend AI processors. It provides Ascend-optimized kernels, quantization support, and distributed inference capabilities.
## Quick Start

### Offline Batch Inference
```python
import os

# Required for vLLM-Ascend: set the multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Load the model on an Ascend NPU (device auto-detected when vllm-ascend is installed)
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096
)

# Prepare prompts and sampling parameters
prompts = [
    "Hello, how are you?",
    "Explain quantum computing in simple terms.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
```
### OpenAI-Compatible API Server
```bash
# Start the API server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096 \
    --max-num-seqs 256 \
    --served-model-name "qwen2.5-7b"

# Or using Python
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096
```
### API Client Example
```python
import requests

# Completions API
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen2.5-7b",
        "prompt": "Once upon a time",
        "max_tokens": 100,
        "temperature": 0.7
    }
)
print(response.json())

# Chat Completions API
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-7b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)
print(response.json())
```
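The two endpoints nest the generated text differently: completions return it in `choices[i].text`, while chat completions return it in `choices[i].message.content`. A small helper can normalize both shapes; the sample dicts below are illustrative stand-ins, not actual server output:

```python
def extract_text(response_json: dict) -> str:
    """Return the generated text from an OpenAI-style response dict.

    Handles both /v1/completions (choices[i].text) and
    /v1/chat/completions (choices[i].message.content) shapes.
    """
    choice = response_json["choices"][0]
    if "message" in choice:          # chat completions shape
        return choice["message"]["content"]
    return choice["text"]            # plain completions shape

# Illustrative response shapes (not real server output)
completion = {"choices": [{"index": 0, "text": " there was a dragon."}]}
chat = {"choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello!"}}]}

print(extract_text(completion))  # → " there was a dragon."
print(extract_text(chat))        # → "Hello!"
```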
---

## Installation

### Prerequisites

- CANN: 8.0.RC1 or higher
- Python: 3.9 or higher
- PyTorch Ascend: compatible with your CANN version
### Method 1: Docker (Recommended)
```bash
# Pull the pre-built image
docker pull ascendai/vllm-ascend:latest

# Run with NPU access
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons:/usr/local/Ascend/add-ons \
    -e ASCEND_RT_VISIBLE_DEVICES=0 \
    ascendai/vllm-ascend:latest
```
### Method 2: pip Installation
```bash
# Install vLLM with the Ascend plugin
pip install vllm-ascend

# Or install from source
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .
```
### Verify Installation
```bash
# Check the vLLM-Ascend installation
python -c "import vllm_ascend; print(vllm_ascend.__version__)"

# Check NPU availability
python -c "import torch; import torch_npu; print(torch_npu.npu.device_count())"
```
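For a quick programmatic check (for example in a launcher script), module availability can be probed without importing the heavy packages themselves. A minimal sketch using only the standard library; the module names are the same ones the verification commands use:

```python
from importlib.util import find_spec

def ascend_stack_status() -> dict:
    """Report which pieces of the vLLM-Ascend stack are importable.

    Uses find_spec so nothing is actually imported or initialized.
    """
    modules = ["vllm", "vllm_ascend", "torch", "torch_npu"]
    return {name: find_spec(name) is not None for name in modules}

status = ascend_stack_status()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```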
---

## Deployment

### Server Mode
```bash
# Basic server deployment
vllm serve <model_path> \
    --served-model-name <name> \
    --host 0.0.0.0 \
    --port 8000

# Production deployment with optimizations
vllm serve /path/to/model \
    --served-model-name "qwen2.5-72b" \
    --max-model-len 8192 \
    --max-num-seqs 256 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16 \
    --api-key <your-api-key>
```
### Python API
```python
import os

# Required: set the spawn multiprocessing method before importing vLLM
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Single NPU
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    dtype="bfloat16"
)

# Distributed inference (multi-NPU)
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)

# Generate
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello world"], params)
```
### LLM Engine (Advanced)
```python
from vllm import LLMEngine, EngineArgs, SamplingParams

engine_args = EngineArgs(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096
)
engine = LLMEngine.from_engine_args(engine_args)

# Add requests and step through generation
request_id = "req-001"
prompt = "Hello, world!"
params = SamplingParams(max_tokens=50)
engine.add_request(request_id, prompt, params)

while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(f"{output.request_id}: {output.outputs[0].text}")
```
---

## Quantization

vLLM-Ascend supports models quantized with msModelSlim. For quantization details, see msmodelslim.
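To see why weight quantization matters for NPU memory, here is a rough back-of-the-envelope for weight-only memory at different precisions. This is generic arithmetic for a dense model, not an msModelSlim measurement, and it excludes activations and KV cache:

```python
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory in GiB for a dense model."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# Weight memory for a hypothetical 7B-parameter model at each precision
for name, bits in [("bf16", 16), ("W8A8 (int8)", 8), ("W4A8 (int4)", 4)]:
    print(f"7B model, {name}: ~{weight_gib(7, bits):.1f} GiB")
```

W8A8 roughly halves weight memory relative to bf16, and W4A8 halves it again, which is what makes large models fit on fewer NPUs.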
### Using Quantized Models
```bash
# W8A8 quantized model
vllm serve /path/to/quantized-model-w8a8 \
    --quantization ascend \
    --max-model-len 4096

# W4A8 quantized model
vllm serve /path/to/quantized-model-w4a8 \
    --quantization ascend \
    --max-model-len 4096
```
### Python API with Quantization
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/quantized-model",
    quantization="ascend",
    max_model_len=4096
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Hello"], params)
```

---

## Distributed Inference

### Tensor Parallelism
Shards each layer's weights across multiple NPUs, enabling models too large for a single device.

```bash
# 4-way tensor parallelism
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192
```

```python
import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    max_model_len=8192
)
```

### Pipeline Parallelism
```python
from vllm import LLM

# Combine pipeline and tensor parallelism (2 x 4 = 8 NPUs total)
llm = LLM(
    model="DeepSeek-V3",
    pipeline_parallel_size=2,
    tensor_parallel_size=4
)
```

### Multi-Node Deployment
```bash
# Node 0 (Rank 0)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 0

# Node 1 (Rank 1)
vllm serve <model> \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-init-method "tcp://192.168.1.10:29500" \
    --distributed-rank 1
```
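With tensor-parallel degree 8 and pipeline-parallel degree 2, the total device count and node count fall out as simple arithmetic. A generic sketch of that layout calculation (not vLLM's internal code; the 8-NPUs-per-node figure is an assumption):

```python
def parallel_layout(tp: int, pp: int, npus_per_node: int) -> dict:
    """Compute the device layout implied by a TP x PP configuration."""
    world_size = tp * pp  # total NPUs across all nodes
    if world_size % npus_per_node:
        raise ValueError("world size must be a multiple of NPUs per node")
    return {
        "world_size": world_size,
        "nodes": world_size // npus_per_node,
    }

layout = parallel_layout(tp=8, pp=2, npus_per_node=8)
print(layout)  # {'world_size': 16, 'nodes': 2}
```

This is why the example above needs exactly two nodes: 8 x 2 = 16 NPUs at 8 per node.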
---

## Performance Optimization

### Key Parameters
| Parameter | Default | Description | Tuning Advice |
|---|---|---|---|
| `--max-model-len` | Model max | Maximum sequence length | Reduce if OOM |
| `--max-num-seqs` | 256 | Max concurrent sequences | Increase for throughput |
| `--gpu-memory-utilization` | 0.9 | GPU memory fraction | Lower if OOM during warmup |
| `--dtype` | auto | Data type | bfloat16 for speed, float16 for compatibility |
| `--tensor-parallel-size` | 1 | Tensor parallelism degree | Use for large models |
| `--pipeline-parallel-size` | 1 | Pipeline parallelism degree | Use for very large models |
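When picking `--tensor-parallel-size`, a useful first-order estimate is that per-NPU weight memory scales inversely with the TP degree (weights only; KV cache and activations come on top). A rough sketch with an illustrative 72B bfloat16 model:

```python
def per_npu_weight_gib(params_billion: float, bytes_per_param: int, tp: int) -> float:
    """Approximate per-NPU weight memory under tensor parallelism.

    Weights only; KV cache and activation memory are not included.
    """
    total_gib = params_billion * 1e9 * bytes_per_param / 2**30
    return total_gib / tp

# A 72B model in bfloat16 (2 bytes/param) sharded across 4 NPUs:
print(f"~{per_npu_weight_gib(72, 2, 4):.1f} GiB of weights per NPU")
```

If that figure (plus KV cache) exceeds what `--gpu-memory-utilization` allows, raise the TP degree or lower `--max-model-len`.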
### Example Configurations
```bash
# Small model (7B), single NPU
vllm serve <model> --max-model-len 4096 --max-num-seqs 256

# Medium model (32B), single NPU
vllm serve <model> --max-model-len 8192 --max-num-seqs 128

# Large model (72B), multi-NPU
vllm serve <model> --tensor-parallel-size 4 --max-model-len 8192

# Maximum throughput
vllm serve <model> --max-num-seqs 512 --gpu-memory-utilization 0.95
```
---

## Troubleshooting

### Common Issues
**Q: `AclNN_Parameter_Error` or dtype errors?**

```bash
# Check CANN version compatibility
npu-smi info

# Ensure CANN >= 8.0.RC1

# Try a different dtype
vllm serve <model> --dtype float16
```

**Q: Out of Memory (OOM)?**

```bash
# Reduce the max model length
vllm serve <model> --max-model-len 2048

# Lower memory utilization
vllm serve <model> --gpu-memory-utilization 0.8

# Reduce concurrent sequences
vllm serve <model> --max-num-seqs 128
```

**Q: Model loading fails?**

```bash
# Check the model path
ls /path/to/model

# Verify the tokenizer
python -c "from transformers import AutoTokenizer; tok = AutoTokenizer.from_pretrained('/path/to/model'); print('OK')"

# Use trust_remote_code for custom models
vllm serve <model> --trust-remote-code
```

**Q: Slow inference?**

```bash
# Enable bfloat16 for faster compute
vllm serve <model> --dtype bfloat16

# Adjust the block size
vllm serve <model> --block-size 256

# Enable prefix caching
vllm serve <model> --enable-prefix-caching
```

**Q: API server connection refused?**

```bash
# Check that the server is running

# Verify the port is not in use
lsof -i :8000

# Use an explicit host/port
vllm serve <model> --host 0.0.0.0 --port 8000
```
### Environment Variables
```bash
# Required: set the multiprocessing method for vLLM-Ascend
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# Set Ascend device IDs
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# Debug logging
export VLLM_LOGGING_LEVEL=DEBUG

# Disable lazy initialization (for debugging)
export VLLM_ASCEND_LAZY_INIT=0
```
---

## Scripts

- scripts/benchmark_throughput.py - Throughput benchmark
- scripts/benchmark_latency.py - Latency benchmark
- scripts/start_server.sh - Server startup template
## References

- references/deployment.md - Deployment patterns and best practices
- references/supported-models.md - Complete model support matrix
- references/api-reference.md - API endpoint documentation
## Related Skills

- msmodelslim - Model quantization for vLLM-Ascend
- ascend-docker - Docker container setup for Ascend
- npu-smi - NPU device management
- hccl-test - HCCL performance testing for multi-NPU
## Official References

- vLLM-Ascend Documentation: https://docs.vllm.ai/projects/ascend/en/latest/
- vLLM Documentation: https://docs.vllm.ai/
- Huawei Ascend: https://www.hiascend.com/document
- GitHub Repository: https://github.com/vllm-project/vllm-ascend