vllm-deploy-simple

Original🇺🇸 English
Translated
1 scripts

Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API.

4installs
Added on

NPX Install

npx skill4agent add vllm-project/vllm-skills vllm-deploy-simple

vLLM Simple Deployment

A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.

What this skill does

This skill provides a streamlined workflow to:
  • Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
  • Install vLLM with appropriate backend support
  • Start the vLLM server with configurable model and port
  • Test the OpenAI-compatible API endpoint
  • Validate the deployment is working correctly
  • Support virtual environment isolation

Prerequisites

  • Python 3.10+
  • GPU (NVIDIA CUDA, AMD ROCm) (recommended) or TPU or CPU
  • pip or uv package manager
  • curl (for API testing)
  • Virtual environment (optional but recommended)

Usage

Create a venv

If user did not specify the venv path or asked to deploy in the current environment, create a venv using uv with python 3.12 in the current folder. If uv not found, make a folder in this path and use python to create a virtual environment.

Run the complete workflow (suggested)

If user did not specify the venv path, model, or port, use default options:
bash
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
Or with custom options:
bash
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
This will:
  1. Activate the virtual environment (if specified)
  2. Detect hardware backend (CUDA/ROCm/TPU/CPU)
  3. Install vLLM with appropriate backend support
  4. Start the vLLM server in the background
  5. Wait for the server to be ready
  6. Test the API with a sample request
  7. Display the server status

Run individual commands (for step-by-step usage or troubleshooting)

Install vLLM:
bash
scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
Start the server:
bash
scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
Test the API:
bash
scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000
Stop the server:
bash
scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
Check server status:
bash
scripts/quickstart.sh status
Restart the server:
bash
scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8

Configuration

The script supports the following command-line options:
bash
scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run complete workflow (default)

Options:
  --model MODEL                 Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                   Port to run server on (default: 8000)
  --venv VENV_PATH              Virtual environment path (default: .)
  --gpu_memory_utilization VRAM GPU memory utilization (default: 0.8)

Hardware Backend Detection

The script automatically detects your hardware and installs the appropriate vLLM version:
  • NVIDIA CUDA: Detected via
    nvidia-smi
    command
  • AMD ROCm: Detected via
    /dev/kfd
    and
    /dev/dri
    devices
  • Google TPU: Detected via
    TPU_NAME
    environment variable or
    gcloud
    command
  • CPU: Fallback if no GPU/TPU detected
For Google TPU, the script installs
vllm-tpu
instead of the standard
vllm
package.

API Testing

The test script sends a simple chat completion request:
bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'

Troubleshooting

Virtual environment not found:
  • Ensure the path provided with
    --venv
    exists and is a valid virtual environment
  • Check that the activation script exists (
    bin/activate
    on Linux/macOS or
    Scripts/activate
    on Windows)
  • Check and install uv, and create a new virtual environment with uv:
    uv venv /path/to/venv
    (suggested); or with pip:
    python3 -m venv /path/to/venv
Server won't start:
  • Check if the port is already in use:
    lsof -i :8000
  • Verify GPU availability:
    nvidia-smi
    (for NVIDIA) or
    rocm-smi
    (for AMD)
  • Check vLLM installation:
    python -c "import vllm; print(vllm.__version__)"
  • Review server logs at
    $VENV_PATH/tmp/vllm-server.log
API returns errors:
  • Wait a few seconds for the model to load
  • Check server logs:
    cat $VENV_PATH/tmp/vllm-server.log
  • Verify the server is running:
    scripts/quickstart.sh status
Out of memory:
  • Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
  • Reduce
    --gpu-memory-utilization
    parameter
  • Close other GPU-intensive applications
Wrong backend detected:
  • For NVIDIA: Ensure
    nvidia-smi
    is in your PATH
  • For AMD: Check that ROCm drivers are properly installed
  • For TPU: Set
    TPU_NAME
    environment variable or install
    gcloud

Notes

  • The server runs in the background and logs to
    $VENV_PATH/tmp/vllm-server.log
  • The PID is stored in
    $VENV_PATH/tmp/vllm-server.pid
    for easy management
  • First run will download the model (~3GB for Qwen2.5-1.5B-Instruct)
  • Subsequent runs will use the cached model
  • The script automatically detects and uses
    uv
    if available, otherwise falls back to
    pip
  • Virtual environment support allows isolation from system Python packages
  • Arguments can be specified in any order (e.g.,
    scripts/quickstart.sh --port 8080 start --venv /path/to/venv
    )