vLLM Simple Deployment

A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.

What this skill does

This skill provides a streamlined workflow to:

Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
Install vLLM with appropriate backend support
Start the vLLM server with configurable model and port
Test the OpenAI-compatible API endpoint
Validate the deployment is working correctly
Support virtual environment isolation

Prerequisites

Python 3.10+
GPU (NVIDIA CUDA or AMD ROCm; recommended), Google TPU, or CPU
pip or uv package manager
curl (for API testing)
Virtual environment (optional but recommended)

Usage

Create a venv

If the user did not specify a venv path, or asked to deploy in the current environment, create a venv in the current folder using uv with Python 3.12. If uv is not found, create a folder in this path and use Python to create the virtual environment there.
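
The uv-first, Python-fallback logic described above might look like the following minimal sketch. The function name `create_venv` is hypothetical and is not part of `scripts/quickstart.sh`; it only assumes the standard `uv venv --python` and `python3 -m venv` commands.

```shell
# Hypothetical helper: create a virtual environment at the given path.
# Prefers uv (pinned to Python 3.12); otherwise falls back to python3 -m venv.
create_venv() {
  local venv_path="${1:?usage: create_venv VENV_PATH}"
  if command -v uv >/dev/null 2>&1; then
    # uv is available: let it provision Python 3.12 for the venv
    uv venv --python 3.12 "$venv_path"
  else
    # uv not found: make the folder and use the system Python instead
    mkdir -p "$venv_path"
    python3 -m venv "$venv_path"
  fi
}
```

For example, `create_venv ./venv` followed by `. ./venv/bin/activate` would create and activate an environment in the current folder.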

Run the complete workflow (suggested)

If the user did not specify a venv path, model, or port, use the default options:

bash

# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh

Or with custom options:

bash

# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8

This will:

Activate the virtual environment (if specified)
Detect hardware backend (CUDA/ROCm/TPU/CPU)
Install vLLM with appropriate backend support
Start the vLLM server in the background
Wait for the server to be ready
Test the API with a sample request
Display the server status
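
The "wait for the server to be ready" step above could be implemented as a simple polling loop. This is a sketch under assumptions: the function name `wait_for_server` and its retry parameters are hypothetical, and it assumes the script polls the `/health` endpoint that vLLM's OpenAI-compatible server exposes. The actual script may check readiness differently.

```shell
# Hypothetical helper: poll the server until it answers or we give up.
# Arguments: PORT (default 8000), MAX_ATTEMPTS (default 60), DELAY seconds (default 2).
wait_for_server() {
  local port="${1:-8000}"
  local max_attempts="${2:-60}"
  local delay="${3:-2}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failures
    if curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "server not ready after ${max_attempts} attempts" >&2
  return 1
}
```

Model loading can take a while on first run, so a generous attempt count is a reasonable default here.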

Run individual commands (for step-by-step usage or troubleshooting)

Install vLLM:

bash

scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv

Start the server:

bash

scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8

Test the API:

bash

scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000

Stop the server:

bash

scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv

Check server status:

bash

scripts/quickstart.sh status

Restart the server:

bash

scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8

Configuration

The script supports the following command-line options:

bash

scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run complete workflow (default)

Options:
  --model MODEL                 Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                   Port to run server on (default: 8000)
  --venv VENV_PATH              Virtual environment path (default: .)
  --gpu_memory_utilization VRAM GPU memory utilization (default: 0.8)
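
As an illustration of how a command and options like these can be accepted in any order, here is a minimal parsing sketch. The function name `parse_args` and the variable names are assumptions for illustration; the real `scripts/quickstart.sh` may be structured differently. The defaults mirror the ones listed above.

```shell
# Hypothetical parser: sets COMMAND, MODEL, PORT, VENV_PATH, GPU_MEMORY_UTILIZATION.
parse_args() {
  COMMAND="all"
  MODEL="Qwen/Qwen2.5-1.5B-Instruct"
  PORT="8000"
  VENV_PATH="."
  GPU_MEMORY_UTILIZATION="0.8"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model|--port|--venv|--gpu_memory_utilization)
        # Every option takes exactly one value
        if [ "$#" -lt 2 ]; then
          echo "Missing value for $1" >&2
          return 1
        fi
        case "$1" in
          --model) MODEL="$2" ;;
          --port) PORT="$2" ;;
          --venv) VENV_PATH="$2" ;;
          --gpu_memory_utilization) GPU_MEMORY_UTILIZATION="$2" ;;
        esac
        shift 2
        ;;
      install|start|stop|test|status|restart|all)
        # A bare word is the command, wherever it appears
        COMMAND="$1"
        shift
        ;;
      *)
        echo "Unknown argument: $1" >&2
        return 1
        ;;
    esac
  done
}
```

Because the loop inspects one token at a time, `parse_args --port 8080 start --venv /path/to/venv` and `parse_args start --venv /path/to/venv --port 8080` produce the same settings.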

Hardware Backend Detection

The script automatically detects your hardware and installs the appropriate vLLM version:

NVIDIA CUDA: Detected via the `nvidia-smi` command
AMD ROCm: Detected via the `/dev/kfd` and `/dev/dri` devices
Google TPU: Detected via the `TPU_NAME` environment variable or the `gcloud` command
CPU: Fallback if no GPU/TPU is detected

For Google TPU, the script installs `vllm-tpu` instead of the standard `vllm` package.
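
The detection rules above can be sketched as a short function. This is an illustrative reconstruction, not the script's actual code: the function names `detect_backend` and `package_for_backend` are hypothetical, and the check order (CUDA, then ROCm, then TPU, then CPU) is assumed from the order in which the backends are listed.

```shell
# Hypothetical helper: print one of cuda, rocm, tpu, cpu.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"
  elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
    echo "rocm"
  elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
    echo "tpu"
  else
    echo "cpu"
  fi
}

# Hypothetical helper: map a backend name to the package named in the text.
package_for_backend() {
  case "$1" in
    tpu) echo "vllm-tpu" ;;
    *)   echo "vllm" ;;
  esac
}
```

Note that a machine with `gcloud` installed but no TPU would be classified as `tpu` under these rules, which is one way the "wrong backend detected" situation can arise.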

API Testing

The test script sends a simple chat completion request:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
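
To turn the request above into a pass/fail check, one option is to inspect the HTTP status code. The sketch below is an assumption about how such a check could be written; the function name `test_api` is hypothetical and the real test command may validate the response body instead.

```shell
# Hypothetical helper: send the sample chat request and check for HTTP 200.
# Arguments: PORT (default 8000), MODEL (default Qwen/Qwen2.5-1.5B-Instruct).
test_api() {
  local port="${1:-8000}"
  local model="${2:-Qwen/Qwen2.5-1.5B-Instruct}"
  local payload status
  payload="$(printf '{"model": "%s", "messages": [{"role": "user", "content": "Say hello!"}], "max_tokens": 50}' "$model")"
  # -o /dev/null discards the body; -w prints only the status code
  status="$(curl -s -o /dev/null -w '%{http_code}' \
    "http://localhost:${port}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$payload")"
  if [ "$status" = "200" ]; then
    echo "API OK (HTTP 200)"
  else
    echo "API check failed (HTTP ${status})" >&2
    return 1
  fi
}
```

If the server is not reachable at all, curl reports status `000`, so the function fails in that case too.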

Troubleshooting

Virtual environment not found:

Ensure the path provided with `--venv` exists and is a valid virtual environment
Check that the activation script exists (`bin/activate` on Linux/macOS or `Scripts/activate` on Windows)
If needed, create a new virtual environment with uv (suggested; install uv first if it is missing): `uv venv /path/to/venv`; or with Python's built-in venv module: `python3 -m venv /path/to/venv`

Server won't start:

Check if the port is already in use: `lsof -i :8000`
Verify GPU availability: `nvidia-smi` (for NVIDIA) or `rocm-smi` (for AMD)
Check the vLLM installation: `python -c "import vllm; print(vllm.__version__)"`
Review server logs at `$VENV_PATH/tmp/vllm-server.log`

API returns errors:

Wait a few seconds for the model to load
Check server logs: `cat $VENV_PATH/tmp/vllm-server.log`
Verify the server is running: `scripts/quickstart.sh status`

Out of memory:

Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
Reduce the `--gpu_memory_utilization` value (e.g., from 0.8 to 0.6)
Close other GPU-intensive applications

Wrong backend detected:

For NVIDIA: Ensure `nvidia-smi` is in your PATH
For AMD: Check that ROCm drivers are properly installed
For TPU: Set the `TPU_NAME` environment variable or install `gcloud`

Notes

The server runs in the background and logs to `$VENV_PATH/tmp/vllm-server.log`
The PID is stored in `$VENV_PATH/tmp/vllm-server.pid` for easy management
The first run downloads the model (~3GB for Qwen2.5-1.5B-Instruct)
Subsequent runs use the cached model
The script automatically detects and uses `uv` if available; otherwise it falls back to `pip`
Virtual environment support allows isolation from system Python packages
Arguments can be specified in any order (e.g., `scripts/quickstart.sh --port 8080 start --venv /path/to/venv`)
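
The PID file mentioned in these notes makes a status check straightforward. The following is a minimal sketch, assuming the file holds a single process ID; the function name `server_status` is hypothetical and the real `status` command may report more detail.

```shell
# Hypothetical helper: report whether the process recorded in a PID file is alive.
server_status() {
  local pid_file="${1:?usage: server_status PID_FILE}"
  local pid
  if [ ! -f "$pid_file" ]; then
    echo "stopped (no PID file)"
    return 1
  fi
  pid="$(cat "$pid_file")"
  # kill -0 sends no signal; it only tests that the process exists
  if kill -0 "$pid" 2>/dev/null; then
    echo "running (PID ${pid})"
    return 0
  fi
  echo "stopped (stale PID ${pid})"
  return 1
}
```

For example, `server_status "$VENV_PATH/tmp/vllm-server.pid"` would distinguish a running server from a stale PID file left behind by a crash.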
