Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
npx skill4agent add nvidia/skills deployment
scripts/deploy.sh
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8
# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4
# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8
# Test the API
scripts/deploy.sh test
# Check status
scripts/deploy.sh status
# Stop
scripts/deploy.sh stop
MODELOPT_WORKSPACE_ROOT
skills/common/workspace-management.md
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
cd
hf_quant_config.json
output/
outputs/
exported_model/
--export_path
nvidia/Llama-3.1-8B-Instruct-FP8
Note: This skill expects HF-format checkpoints (from PTQ with --export_fmt hf). TRT-LLM format checkpoints should be deployed directly with TRT-LLM; see references/trtllm.md.
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
config.json
quantization_config
quant_method: "modelopt"

| Situation | Recommended | Why |
|---|---|---|
| General use | vLLM | Widest ecosystem, easy setup, OpenAI-compatible |
| Best SGLang model support | SGLang | Strong DeepSeek/Llama 4 support |
| Maximum optimization | TRT-LLM | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | TRT-LLM AutoDeploy | Only option for AutoQuant checkpoints |
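
The checkpoint check that precedes the framework table (look for hf_quant_config.json, otherwise for quant_method: "modelopt" in config.json) can be sketched in Python. This is an illustrative helper, not part of the skill: the function name is hypothetical, and it assumes quantization_config is a top-level object inside config.json.

```python
import json
from pathlib import Path


def detect_modelopt_checkpoint(checkpoint_path):
    """Return the file that marks the checkpoint as ModelOpt-quantized, or None."""
    root = Path(checkpoint_path)

    # Case 1: a standalone hf_quant_config.json next to the weights.
    if (root / "hf_quant_config.json").is_file():
        return "hf_quant_config.json"

    # Case 2: config.json carrying quantization_config.quant_method == "modelopt".
    # Assumption: quantization_config is a top-level object in config.json.
    config_file = root / "config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text())
        quant = config.get("quantization_config") or {}
        if quant.get("quant_method") == "modelopt":
            return "config.json"

    # Neither marker found: treat the checkpoint as unquantized.
    return None
```

A None result suggests an unquantized checkpoint, in which case the --quantization flag would presumably be omitted when launching the server.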
references/support-matrix.md
skills/common/environment-setup.md
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
references/setup.md
params × 2 bytes
params × 1 byte
params × 0.5 bytes
-tp <num_gpus>

| Framework | Reference file |
|---|---|
| vLLM | |
| SGLang | |
| TRT-LLM | |
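
The per-parameter sizes listed earlier (params × 2 bytes, params × 1 byte, params × 0.5 bytes) give a rough way to estimate weight memory and pick a tensor-parallel degree. The sketch below is one way to do that arithmetic. The function names, the 0.8 headroom factor, and the power-of-two search are assumptions made for illustration; they are not values taken from the skill.

```python
# Bytes per parameter for 16-bit, 8-bit, and 4-bit weights
# (params × 2 bytes, params × 1 byte, params × 0.5 bytes).
BYTES_PER_PARAM = {16: 2.0, 8: 1.0, 4: 0.5}


def weight_memory_gb(num_params, bits):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[bits] / 1e9


def min_tensor_parallel(num_params, bits, gpu_memory_gb, usable_fraction=0.8):
    """Smallest power-of-two GPU count whose usable memory holds the weights.

    usable_fraction is a hypothetical headroom factor that leaves room for
    the KV cache and activations; tune it for the actual workload.
    """
    needed = weight_memory_gb(num_params, bits)
    per_gpu = gpu_memory_gb * usable_fraction
    tp = 1
    while tp * per_gpu < needed:
        tp *= 2
    return tp
```

For example, a 70B-parameter model at 4 bits needs about 35 GB for weights and fits on one 80 GB GPU under these assumptions, while the same model at 16 bits needs about 140 GB and the function returns 4.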
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
--model <checkpoint_path> \
--quantization modelopt \
--tensor-parallel-size <num_gpus> \
--host 0.0.0.0 --port 8000
--quantization modelopt_fp4
python -m sglang.launch_server \
--model-path <checkpoint_path> \
--quantization modelopt \
--tp <num_gpus> \
--host 0.0.0.0 --port 8000
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
references/trtllm.md
# Health check
curl -s http://localhost:8000/health
# List models
curl -s http://localhost:8000/v1/models | python -m json.tool
# Test generation
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model_name>",
"prompt": "The capital of France is",
"max_tokens": 32
}' | python -m json.tool
~/.config/modelopt/clusters.yaml
.claude/clusters.yaml
skills/common/slurm-setup.md
source .claude/skills/common/remote_exec.sh
remote_load_cluster
remote_check_ssh
remote_detect_env
remote_run "ls <checkpoint_path>/config.json"
remote_sync_to <local_checkpoint_path> checkpoints/
skills/common/slurm-setup.md
python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt
squeue -j $JOBID -o %N
remote_run
remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
remote_run "curl -s http://localhost:8000/health"
remote_run "curl -s http://localhost:8000/v1/models"
http://<node_hostname>:8000

| Error | Cause | Fix |
|---|---|---|
| | Model too large for GPU(s) | Increase |
| | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 |
| | Not a ModelOpt-exported checkpoint | Re-export with |
| | Server still starting | Wait 30-60s for large models; check logs for errors |
| | Framework doesn't support FP4 for this model | Check support matrix in |
references/support-matrix.md
references/unsupported-models.md
/health
/v1/models
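
Since a large model can take 30-60s to start, a small polling loop against /health can stand in for re-running the curl check by hand. The following is a minimal sketch using only the Python standard library. The helper names, the default wait and interval, and the injectable probe and sleep parameters are illustrative assumptions, and it treats HTTP 200 from /health as "ready".

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 503 while loading).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: server not up yet.
        return None


def wait_until_healthy(base_url, max_wait_s=60.0, interval_s=2.0,
                       probe=http_status, sleep=time.sleep):
    """Poll <base_url>/health until it returns 200 or max_wait_s has elapsed."""
    waited = 0.0
    while True:
        if probe(base_url + "/health") == 200:
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A call such as wait_until_healthy("http://localhost:8000") returns True once the endpoint responds with 200; on False, the server log is the next place to look.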