deployment

Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

Quick Start

Prefer `scripts/deploy.sh` for standard local deployments: it handles quantization detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment.
bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop

The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

Decision Flow

0. Check workspace (multi-user / Slack bot)

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones, especially if deploying a checkpoint from a prior PTQ run:
bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
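As a rough sketch of the workspace check above, the following Python lists existing workspace directories newest-first, so a workspace from a recent PTQ run tends to appear at the top. The helper name `list_workspaces` is hypothetical; only the `MODELOPT_WORKSPACE_ROOT` variable comes from the text.

```python
import os
from pathlib import Path


def list_workspaces(root=None):
    """Return workspace directories under root, most recently modified first.

    `root` defaults to $MODELOPT_WORKSPACE_ROOT. An unset or missing root
    yields an empty list, mirroring `ls ... 2>/dev/null`.
    """
    root = root or os.environ.get("MODELOPT_WORKSPACE_ROOT")
    if not root or not Path(root).is_dir():
        return []
    dirs = [p for p in Path(root).iterdir() if p.is_dir()]
    # Newest modification time first.
    return sorted(dirs, key=lambda p: p.stat().st_mtime, reverse=True)
```

Modification time is only a heuristic for "the model I just quantized"; matching on the workspace name is likely more reliable when the user names the model.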

1. Identify the checkpoint

Determine what the user wants to deploy:
  • Local quantized checkpoint (from ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
  • HuggingFace model hub (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly
  • Unquantized model: deploy as-is (BF16) or suggest quantizing first with the ptq skill
Note: This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM format checkpoints should be deployed directly with TRT-LLM; see `references/trtllm.md`.
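The search over common output locations can be sketched in Python as below. The function name `find_checkpoint_dirs`, the `extra` parameter (for a custom `--export_path`), and the assumption that a checkpoint sits either directly in the output directory or one level below it are all illustrative, not taken from the skill.

```python
from pathlib import Path

# Output locations named in the text above.
COMMON_OUTPUT_DIRS = ("output", "outputs", "exported_model")


def find_checkpoint_dirs(workspace, extra=()):
    """Return directories under the workspace that look like HF checkpoints.

    A directory qualifies if it contains config.json. Directories that also
    hold hf_quant_config.json are listed first.
    """
    found = []
    for name in (*COMMON_OUTPUT_DIRS, *extra):
        base = Path(workspace) / name
        if not base.is_dir():
            continue
        # Assumption: the checkpoint is in the output dir itself or one level down.
        subdirs = sorted(p for p in base.iterdir() if p.is_dir())
        for candidate in (base, *subdirs):
            if (candidate / "config.json").is_file():
                found.append(candidate)
    # Stable sort: quantized candidates (key False) come before the rest.
    return sorted(found, key=lambda p: not (p / "hf_quant_config.json").is_file())
```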
Check the quantization format if applicable:
bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.
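The two-stage check above could be scripted roughly as follows. This is a sketch under assumptions: the `{"quantization": {"quant_algo": ...}}` layout of `hf_quant_config.json` and the `quant_algo` key inside `quantization_config` are assumed here and should be confirmed against an actual exported checkpoint. `detect_quantization` is a hypothetical helper name.

```python
import json
from pathlib import Path


def detect_quantization(checkpoint_dir):
    """Classify a checkpoint using the two checks described above.

    Returns the quant algorithm string when one is recorded, "modelopt" when
    the checkpoint is marked as ModelOpt-quantized without a named algorithm,
    and None for an unquantized checkpoint.
    """
    ckpt = Path(checkpoint_dir)

    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        data = json.loads(hf_quant.read_text())
        # Assumed layout: {"quantization": {"quant_algo": "FP8", ...}}
        return (data.get("quantization") or {}).get("quant_algo") or "modelopt"

    config = ckpt / "config.json"
    if config.is_file():
        quant_cfg = json.loads(config.read_text()).get("quantization_config") or {}
        if quant_cfg.get("quant_method") == "modelopt":
            return quant_cfg.get("quant_algo") or "modelopt"

    return None
```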

2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
| --- | --- | --- |
| General use | vLLM | Widest ecosystem, easy setup, OpenAI-compatible |
| Best SGLang model support | SGLang | Strong DeepSeek/Llama 4 support |
| Maximum optimization | TRT-LLM | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | TRT-LLM AutoDeploy | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.
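One way to read the table as a decision procedure is sketched below. The three boolean flags are hypothetical stand-ins for the "Situation" column, and the order in which overlapping situations are resolved (AutoQuant first, because it has only one option, then SGLang-favored models, then maximum throughput) is an assumption rather than something the table states.

```python
def recommend_framework(autoquant=False, sglang_favored_model=False, max_throughput=False):
    """Pick a framework name following the recommendation table above."""
    if autoquant:
        # AutoQuant / mixed-precision checkpoints have a single supported path.
        return "TRT-LLM AutoDeploy"
    if sglang_favored_model:
        return "SGLang"
    if max_throughput:
        return "TRT-LLM"
    # Default for general use.
    return "vLLM"
```

The result is only a recommendation; the support matrix still needs to confirm the specific model and format.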

3. Check the environment

Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.
Then check that the deployment framework is installed:
bash
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
If not installed, consult `references/setup.md`.
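The same installation check can be done in a single Python pass without importing the (often slow to load) frameworks. This is a sketch using only the standard library; `installed_frameworks` is a hypothetical helper, and it assumes each distribution is registered under the same name as its import module.

```python
from importlib import metadata, util


def installed_frameworks(modules=("vllm", "sglang", "tensorrt_llm")):
    """Map each module name to its installed version, or None if absent."""
    result = {}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        if util.find_spec(name) is None:
            result[name] = None
            continue
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            result[name] = "unknown"
    return result


if __name__ == "__main__":
    for module, version in installed_frameworks().items():
        print(f"{module}: {version or 'not installed'}")
```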
GPU memory estimate (to determine tensor parallelism):
  • BF16: `params × 2 bytes` (8B ≈ 16 GB)
  • FP8: `params × 1 byte` (8B ≈ 8 GB)
  • FP4: `params × 0.5 bytes` (8B ≈ 4 GB)
  • Add ~2-4 GB for KV cache and framework overhead
If the model exceeds single GPU memory, use tensor parallelism (`-tp <num_gpus>`).
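The arithmetic above can be turned into a small estimator. This is a minimal sketch: the function names are hypothetical, the 4 GB default overhead is the upper end of the range in the text, and restricting tensor parallelism to powers of two with the full overhead paid on every GPU is a simplifying assumption, not a framework rule.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def estimate_memory_gb(params_billions, fmt, overhead_gb=4.0):
    """Weights (params x bytes per param) plus a flat overhead allowance."""
    return params_billions * BYTES_PER_PARAM[fmt] + overhead_gb


def min_tensor_parallel(params_billions, fmt, gpu_memory_gb, overhead_gb=4.0):
    """Smallest power-of-two GPU count whose per-GPU share fits in memory."""
    if gpu_memory_gb <= overhead_gb:
        raise ValueError("GPU memory does not even cover the overhead allowance")
    weights_gb = params_billions * BYTES_PER_PARAM[fmt]
    tp = 1
    # Assumes weights split evenly across GPUs.
    while weights_gb / tp + overhead_gb > gpu_memory_gb:
        tp *= 2
    return tp
```

For example, a 70B BF16 model needs about 140 GB for weights, so on 80 GB GPUs this estimate gives a tensor-parallel size of 2 (70 GB of weights plus overhead per GPU). Real deployments usually need more headroom for KV cache at long context lengths.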

4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
| --- | --- |
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

Quick-start commands (for common cases):

vLLM

bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <num_gpus> \
    --host 0.0.0.0 --port 8000

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
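The choice between `modelopt` and `modelopt_fp4` can be folded into a small command builder, sketched below. `vllm_serve_command` and its `quant_algo` parameter (a label such as "FP8" or "NVFP4", or None for an unquantized checkpoint) are hypothetical; the flags themselves are the ones shown above.

```python
import shlex


def vllm_serve_command(checkpoint_path, quant_algo=None, num_gpus=1, port=8000):
    """Build the vLLM server argv for a checkpoint.

    No --quantization flag is added when quant_algo is None.
    """
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", str(checkpoint_path),
        "--tensor-parallel-size", str(num_gpus),
        "--host", "0.0.0.0", "--port", str(port),
    ]
    if quant_algo is not None:
        # NVFP4 checkpoints need the FP4-specific quantization method.
        method = "modelopt_fp4" if "FP4" in quant_algo.upper() else "modelopt"
        cmd += ["--quantization", method]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(shlex.join(vllm_serve_command("./llama-70b-nvfp4", "NVFP4", num_gpus=4)))
```

Returning an argv list rather than a string keeps the command safe to pass to `subprocess` without shell quoting issues.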

SGLang

bash
python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <num_gpus> \
    --host 0.0.0.0 --port 8000

TRT-LLM (direct)

python
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))

TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.

5. Verify the deployment

After the server starts, verify it's healthy:
bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "<model_name>", "prompt": "The capital of France is", "max_tokens": 32 }' | python -m json.tool

All checks must pass before reporting success to the user.
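Because large models can take a while to load, the health check is usually polled rather than run once. A minimal standard-library sketch follows; the function names, the 300-second default timeout, and the 5-second interval are assumptions, and the injectable `probe` argument exists only to make the loop easy to test.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_healthy(base_url, timeout_s=300.0, interval_s=5.0, probe=http_status):
    """Poll <base_url>/health until it returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health") == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_healthy("http://localhost:8000")` returns `False` on timeout, at which point the server log is the next place to look.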

6. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:
  1. Check container registry auth: before submitting any SLURM job with a container image, verify credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). Do not submit until auth is confirmed.
  2. Source remote utilities:
    bash
    source .claude/skills/common/remote_exec.sh
    remote_load_cluster
    remote_check_ssh
    remote_detect_env
  3. Sync the checkpoint (only if it was produced locally):
    If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync, since it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:
    bash
    remote_sync_to <local_checkpoint_path> checkpoints/
  4. Deploy based on remote environment:
    • SLURM: see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). After submitting, register the job and set up monitoring per the monitor skill. Get the node hostname from `squeue -j $JOBID -o %N`.
    • Bare metal / Docker: use `remote_run` to start the server directly:
      bash
      remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
  5. Verify remotely:
    bash
    remote_run "curl -s http://localhost:8000/health"
    remote_run "curl -s http://localhost:8000/v1/models"
  6. Report the endpoint: include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.
For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead; NEL handles SLURM container deployment, health checks, and teardown automatically.
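For the SLURM case, turning the `squeue -j $JOBID -o %N` output into the endpoint reported in step 6 might look like the sketch below. `endpoint_from_squeue` is a hypothetical helper. It assumes a single-node job and relies on `squeue` printing a `NODELIST` header line unless `-h` is passed.

```python
def endpoint_from_squeue(squeue_output, port=8000):
    """Turn `squeue -j $JOBID -o %N` output into an endpoint URL.

    Returns None while the job is pending (no node assigned yet).
    Compressed multi-node lists such as node[01-04] are rejected.
    """
    lines = [ln.strip() for ln in squeue_output.splitlines() if ln.strip()]
    # Drop the header row that squeue prints by default.
    nodes = [ln for ln in lines if ln != "NODELIST"]
    if not nodes:
        return None
    node = nodes[0]
    if "[" in node or "," in node:
        raise ValueError(f"multi-node allocation not handled: {node}")
    return f"http://{node}:{port}"
```

The resulting URL is reachable only from inside the cluster network, as noted in step 6.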

Error Handling

| Error | Cause | Fix |
| --- | --- | --- |
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
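The table can also be applied mechanically to a server log, as in the sketch below. The substrings are copied from the Error column and are illustrative only: the exact wording frameworks emit varies by version, so treat a miss as "no known match" rather than "no error". `suggest_fixes` is a hypothetical helper.

```python
# (log substring, suggested fix) pairs taken from the table above.
KNOWN_ERRORS = [
    ("CUDA out of memory", "Increase --tensor-parallel-size or use a smaller model"),
    ("not recognized", "Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10"),
    ("hf_quant_config.json not found", "Re-export with export_hf_checkpoint(), or remove --quantization"),
    ("Connection refused", "Wait 30-60s for large models; check logs for errors"),
    ("modelopt_fp4 not supported", "Check support matrix in references/support-matrix.md"),
]


def suggest_fixes(log_text):
    """Return the fixes whose error substring appears in the log (case-insensitive)."""
    lowered = log_text.lower()
    return [fix for needle, fix in KNOWN_ERRORS if needle.lower() in lowered]
```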

Unsupported Models

If the model is not in the validated support matrix (`references/support-matrix.md`), deployment may fail due to weight key mismatches, missing architecture mappings, or quantized/unquantized layer confusion. Read `references/unsupported-models.md` for the iterative debug loop: run → read error → diagnose → patch framework source → re-run. For kernel-level issues, escalate to the framework team rather than attempting fixes.

Success Criteria

  1. Server process is running and healthy (`/health` returns 200)
  2. Model is listed at `/v1/models`
  3. Test generation produces coherent output
  4. Server URL and port are reported to the user
  5. If benchmarking was requested, throughput/latency numbers are reported
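Criterion 2 can be checked programmatically against the `/v1/models` response body. The sketch below assumes the standard OpenAI-style list shape (`{"object": "list", "data": [{"id": ...}, ...]}`); `model_is_listed` is a hypothetical helper name.

```python
import json


def model_is_listed(models_response, model_name):
    """Check an OpenAI-style /v1/models JSON body for a given model id."""
    payload = json.loads(models_response)
    return any(entry.get("id") == model_name for entry in payload.get("data", []))
```

The served model id often matches the `--model` value passed at launch, so passing that same string as `model_name` is a reasonable first guess.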