vllm-deploy-simple
Quickly install and deploy vLLM, start serving a simple LLM, and test the OpenAI-compatible API.
Source: vllm-project/vllm-skills
NPX Install
npx skill4agent add vllm-project/vllm-skills vllm-deploy-simple
vLLM Simple Deployment
A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.
What this skill does
This skill provides a streamlined workflow to:
- Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
- Install vLLM with appropriate backend support
- Start the vLLM server with configurable model and port
- Test the OpenAI-compatible API endpoint
- Validate the deployment is working correctly
- Support virtual environment isolation
Prerequisites
- Python 3.10+
- GPU (NVIDIA CUDA or AMD ROCm; recommended), TPU, or CPU
- pip or uv package manager
- curl (for API testing)
- Virtual environment (optional but recommended)
Usage
Create a venv
If the user did not specify a venv path, or asked to deploy in the current environment, create a venv in the current folder using uv with Python 3.12. If uv is not found, create a folder in this path and use python to create the virtual environment there.
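The uv-first, python-fallback rule above can be sketched as a small shell function. This is a minimal sketch, not the skill's actual implementation: the function name create_venv is hypothetical, and the fallback assumes python3 is on the PATH.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/quickstart.sh): create a virtual
# environment at the given path, preferring uv and falling back to python3.
create_venv() {
  local venv_dir="$1"
  if command -v uv >/dev/null 2>&1; then
    # uv can select (and, if needed, download) the requested Python version.
    uv venv --python 3.12 "$venv_dir"
  else
    # Fallback: make the folder, then use the standard-library venv module.
    mkdir -p "$venv_dir"
    python3 -m venv "$venv_dir"
  fi
}

# Example (not run here): create_venv ./venv && source ./venv/bin/activate
```

Note that the fallback branch uses whatever version python3 resolves to, so it does not guarantee Python 3.12.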
Run the complete workflow (suggested)
If the user did not specify the venv path, model, or port, use the default options:
bash
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
Or with custom options:
bash
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv
# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000
# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6
# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
This will:
- Activate the virtual environment (if specified)
- Detect hardware backend (CUDA/ROCm/TPU/CPU)
- Install vLLM with appropriate backend support
- Start the vLLM server in the background
- Wait for the server to be ready
- Test the API with a sample request
- Display the server status
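The "wait for the server to be ready" step is the one most likely to need tuning, since model loading time varies. A minimal sketch of such a readiness loop follows. How the actual script waits is not shown in this document; the function name wait_for_server and the 300-second default timeout are hypothetical, and the sketch assumes the vLLM OpenAI-compatible server answers GET /health once it is up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll the server once per second until it answers
# or the timeout (in seconds) expires.
wait_for_server() {
  local port="${1:-8000}"
  local timeout_s="${2:-300}"   # assumed default, not taken from the script
  local waited=0
  # -f makes curl fail on HTTP errors, -s silences progress output.
  until curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "Server on port ${port} not ready after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Server on port ${port} is ready"
}

# Example (not run here): wait_for_server 8000 600
```

A polling loop like this avoids sending the test request while the model is still loading, which would otherwise look like an API failure.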
Run individual commands (for step-by-step usage or troubleshooting)
Install vLLM:
bash
scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
Start the server:
bash
scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
Test the API:
bash
scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000
Stop the server:
bash
scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
Check server status:
bash
scripts/quickstart.sh status
Restart the server:
bash
scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
Configuration
The script supports the following command-line options:
bash
scripts/quickstart.sh [command] [OPTIONS]
Commands:
install - Install vLLM and dependencies
start - Start the vLLM server
stop - Stop the vLLM server
test - Test the OpenAI-compatible API
status - Show server status
restart - Restart the server
all - Run complete workflow (default)
Options:
--model MODEL Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
--port PORT Port to run server on (default: 8000)
--venv VENV_PATH Virtual environment path (default: .)
--gpu_memory_utilization VRAM GPU memory utilization (default: 0.8)
Hardware Backend Detection
The script automatically detects your hardware and installs the appropriate vLLM version:
- NVIDIA CUDA: Detected via the nvidia-smi command
- AMD ROCm: Detected via the /dev/kfd and /dev/dri devices
- Google TPU: Detected via the TPU_NAME environment variable or the gcloud command
- CPU: Fallback if no GPU/TPU detected
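The checks listed above can be sketched as a pair of shell functions. This is an illustration under assumptions, not the script's code: the function names detect_backend and package_for_backend are hypothetical, and the order in which the checks run is assumed to match the order of the list.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one of cuda, rocm, tpu, cpu, using the
# signals named in the list above (check order is an assumption).
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"
  elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
    echo "rocm"
  elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
    echo "tpu"
  else
    echo "cpu"
  fi
}

# Hypothetical helper: map a backend name to the package to install.
# Only the TPU case uses a different package name.
package_for_backend() {
  case "$1" in
    tpu) echo "vllm-tpu" ;;
    *)   echo "vllm" ;;
  esac
}

# Example (not run here): package_for_backend "$(detect_backend)"
```

The sketch only picks a package name; a real install step may also need backend-specific wheels or index URLs, which are not covered here.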
For Google TPU, the script installs vllm-tpu instead of the standard vllm package.
API Testing
The test script sends a simple chat completion request:
bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Say hello!"}],
"max_tokens": 50
}'
Troubleshooting
Virtual environment not found:
- Ensure the path provided with --venv exists and is a valid virtual environment
- Check that the activation script exists (bin/activate on Linux/macOS or Scripts/activate on Windows)
- Check for uv and install it if missing, then create a new virtual environment with uv: uv venv /path/to/venv (suggested); or with Python's built-in venv module: python3 -m venv /path/to/venv
Server won't start:
- Check if the port is already in use: lsof -i :8000
- Verify GPU availability: nvidia-smi (for NVIDIA) or rocm-smi (for AMD)
- Check vLLM installation: python -c "import vllm; print(vllm.__version__)"
- Review server logs at $VENV_PATH/tmp/vllm-server.log
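Two of these checks are easy to wrap for repeated use. The following is a minimal sketch with hypothetical function names (port_in_use, show_server_log); it assumes lsof is installed and that the log lives at the path given above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if lsof reports any socket on the port.
port_in_use() {
  local port="${1:-8000}"
  lsof -i ":${port}" >/dev/null 2>&1
}

# Hypothetical helper: print the last N lines (default 50) of the server
# log under the given venv path, or fail if the log does not exist.
show_server_log() {
  local venv_path="${1:-.}"
  local log_file="${venv_path}/tmp/vllm-server.log"
  if [ -f "$log_file" ]; then
    tail -n "${2:-50}" "$log_file"
  else
    echo "No log file at ${log_file}" >&2
    return 1
  fi
}

# Example (not run here):
#   port_in_use 8000 && echo "Port 8000 is taken; try --port 8001"
#   show_server_log /path/to/venv 20
```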
API returns errors:
- Wait a few seconds for the model to load
- Check server logs: cat $VENV_PATH/tmp/vllm-server.log
- Verify the server is running: scripts/quickstart.sh status
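If the status command itself is in doubt, the PID file can be checked by hand. This sketch assumes the PID file location $VENV_PATH/tmp/vllm-server.pid used by the script; the function name server_running and its output strings are hypothetical and are not the output of scripts/quickstart.sh status.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether the PID recorded in the PID file
# still refers to a live process.
server_running() {
  local venv_path="${1:-.}"
  local pid_file="${venv_path}/tmp/vllm-server.pid"
  local pid
  if [ ! -f "$pid_file" ]; then
    echo "stopped (no PID file)"
    return 1
  fi
  pid="$(cat "$pid_file")"
  # kill -0 sends no signal; it only tests that the process exists
  # and that we are allowed to signal it.
  if kill -0 "$pid" 2>/dev/null; then
    echo "running (PID ${pid})"
  else
    echo "stopped (stale PID ${pid})"
    return 1
  fi
}

# Example (not run here): server_running /path/to/venv
```

A "stale PID" result usually means the server crashed after starting, in which case the log file is the next place to look.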
Out of memory:
- Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
- Reduce the --gpu_memory_utilization parameter (e.g., 0.6)
- Close other GPU-intensive applications
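To pick a utilization value on NVIDIA hardware, it can help to see how much GPU memory is actually free first. The sketch below uses nvidia-smi's CSV query output; the function name gpu_free_fraction is hypothetical, and it applies to NVIDIA GPUs only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: for each NVIDIA GPU, print the fraction of memory
# that is currently free, as a value between 0.00 and 1.00.
gpu_free_fraction() {
  # Each output line of the query looks like "used, total" (in MiB).
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits |
    awk -F', *' '{ printf "%.2f\n", ($2 - $1) / $2 }'
}

# Example (not run here): gpu_free_fraction
```

As a rule of thumb, and assuming the setting is interpreted as a fraction of total GPU memory, a --gpu_memory_utilization value above the free fraction printed here is likely to fail until other GPU processes are closed.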
Wrong backend detected:
- For NVIDIA: Ensure nvidia-smi is in your PATH
- For AMD: Check that ROCm drivers are properly installed
- For TPU: Set the TPU_NAME environment variable or install gcloud
Notes
- The server runs in the background and logs to $VENV_PATH/tmp/vllm-server.log
- The PID is stored in $VENV_PATH/tmp/vllm-server.pid for easy management
- First run will download the model (~3GB for Qwen2.5-1.5B-Instruct)
- Subsequent runs will use the cached model
- The script automatically detects and uses uv if available, otherwise falls back to pip
- Virtual environment support allows isolation from system Python packages
- Arguments can be specified in any order (e.g., scripts/quickstart.sh --port 8080 start --venv /path/to/venv)
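Order-independent arguments of this kind are typically handled with a single loop that treats commands and options alike. The following is a minimal sketch of that technique, not the script's actual parser: the function name parse_args and the variable names other than VENV_PATH are hypothetical, while the commands, options, and defaults are the ones documented in the Configuration section.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: parse a command and options given in any order.
parse_args() {
  # Documented defaults.
  COMMAND="all"
  MODEL="Qwen/Qwen2.5-1.5B-Instruct"
  PORT="8000"
  VENV_PATH="."
  GPU_MEMORY_UTILIZATION="0.8"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model|--port|--venv|--gpu_memory_utilization)
        # Every option takes exactly one value; refuse a dangling option.
        if [ "$#" -lt 2 ]; then
          echo "Missing value for $1" >&2
          return 1
        fi
        case "$1" in
          --model) MODEL="$2" ;;
          --port) PORT="$2" ;;
          --venv) VENV_PATH="$2" ;;
          --gpu_memory_utilization) GPU_MEMORY_UTILIZATION="$2" ;;
        esac
        shift 2
        ;;
      install|start|stop|test|status|restart|all)
        COMMAND="$1"
        shift
        ;;
      *)
        echo "Unknown argument: $1" >&2
        return 1
        ;;
    esac
  done
}

# Example (not run here):
#   parse_args --port 8080 start --venv /path/to/venv
#   echo "$COMMAND on port $PORT using $VENV_PATH"
```

Because the command is matched by name rather than by position, it can appear before, between, or after the options.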