ptq

ModelOpt Post-Training Quantization

Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first: it has the support matrix, CLI flags, and accuracy guidance.

Step 1 — Environment

Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:
  • ModelOpt source is available
  • Local or remote (+ cluster config if remote)
  • SLURM / Docker+GPU / bare GPU
  • Launcher available?
  • Which workspace to use

Step 2 — Is the model supported?

Check the support table in examples/llm_ptq/README.md for verified HF models.
  • Listed → supported, use hf_ptq.py (step 4A/4B)
  • Not listed → read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C)

Step 2.5 — Check for model-specific dependencies

If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:
bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
Known dependency patterns:

| Import found | Packages to install |
|---|---|
| from mamba_ssm / from causal_conv1d | mamba-ssm causal-conv1d (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:
  • Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section; ptq.sh installs them automatically
  • Manual (4A): run unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py
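
The same check can be scripted. The following is a minimal Python sketch, not part of ModelOpt: the function names are illustrative, and the KNOWN_DEPS mapping simply restates the dependency table above. It reports whether config.json declares auto_map and which known extra packages the modeling_*.py files import at top level.

```python
import json
import re
from pathlib import Path

# Import name -> pip package, restating the dependency table above.
KNOWN_DEPS = {"mamba_ssm": "mamba-ssm", "causal_conv1d": "causal-conv1d"}

# Matches top-level "import x" / "from x import y" and captures the root module.
# Relative imports ("from .utils import y") are intentionally not matched.
_IMPORT_RE = re.compile(r"^(?:from|import)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)


def uses_remote_code(model_dir):
    """Return True if config.json declares auto_map (custom model code)."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return "auto_map" in config


def extra_packages(model_dir):
    """Return the sorted pip packages for known imports found in modeling_*.py."""
    found = set()
    for source in Path(model_dir).glob("modeling_*.py"):
        for module in _IMPORT_RE.findall(source.read_text()):
            if module in KNOWN_DEPS:
                found.add(KNOWN_DEPS[module])
    return sorted(found)
```

Imports that are not in KNOWN_DEPS are ignored here, so the grep output above remains the authoritative list when a model pulls in something unfamiliar.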

Step 3 — Choose quantization format

First, check for a model-specific recipe:
bash
ls modelopt_recipes/models/ 2>/dev/null
If a model-specific recipe exists, use --recipe <path>; it may contain tuned settings.
If no model-specific recipe exists, choose a format based on the GPU (details in examples/llm_ptq/README.md):
  • Blackwell (B100/B200/GB200): nvfp4 variants
  • Hopper (H100/H200) or older: fp8 or int4_awq
Use --qformat <name> (e.g., --qformat nvfp4). Format definitions are in modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats; --qformat is the simpler way to use them.
NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
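
As a rough illustration of the GPU-based choice, the sketch below maps a CUDA compute capability major version to candidate --qformat values. The function name is hypothetical, and the mapping assumes that Blackwell data-center GPUs report major version 10 and Hopper reports 9; the README remains the reference for which formats a given model supports.

```python
def suggest_qformats(compute_capability_major):
    """Map a CUDA compute capability major version to candidate --qformat values."""
    # Assumption: major >= 10 means Blackwell; everything below is treated
    # as "Hopper or older", matching the two bullets above.
    if compute_capability_major >= 10:
        return ["nvfp4"]
    return ["fp8", "int4_awq"]


if __name__ == "__main__":
    # On a machine with PyTorch and a GPU, the major version could come from
    # torch.cuda.get_device_capability()[0]; a literal is used here instead.
    print(suggest_qformats(9))
```

This reflects the inference target, not the calibration machine: as noted above, NVFP4 can be calibrated on Hopper even though it is deployed on Blackwell.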

Step 4 — Run PTQ

Goal: checkpoint on disk (.safetensors + config.json).
For listed models (4A/4B): run full calibration directly (--calib_size 512). For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then run full calibration.

Which path?

text
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                  │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                  │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                  │          Bare GPU (local or remote)? → MANUAL (4A)
                  └→ NOT LISTED ──→ UNLISTED MODEL (4C)
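
The decision tree above can be written as a small function. This is only a sketch: the function name and the environment labels ("slurm", "local_docker", "remote_docker", "bare_gpu") are made up for illustration and do not correspond to any ModelOpt API.

```python
def choose_path(listed_in_readme, environment):
    """Return "4A", "4B", or "4C" following the decision tree above.

    environment is one of the illustrative labels:
    "slurm", "local_docker", "remote_docker", "bare_gpu".
    """
    if not listed_in_readme:
        return "4C"  # unlisted model, regardless of environment
    if environment in ("slurm", "local_docker"):
        return "4B"  # launcher
    if environment in ("remote_docker", "bare_gpu"):
        return "4A"  # manual execution
    raise ValueError(f"unknown environment: {environment!r}")
```

For example, choose_path(True, "slurm") yields "4B", while any unlisted model yields "4C" whatever the environment.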

4A — Direct: supported model, manual execution

bash
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>
Run with --help for all options.
For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).

4B — Launcher: supported model on SLURM or local Docker

Write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.
bash
cd tools/launcher

SLURM (remote or local):

SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes

Local Docker:

uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes

The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).

4C — Unlisted model

Follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.
For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.
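
For an unlisted model, the smoke-test-then-full-calibration sequence from Step 4 could be driven by a short script. The sketch below uses only the hf_ptq.py flags shown in 4A; the helper names and the injectable runner parameter are illustrative, and it assumes both runs may write to the same export path.

```python
import subprocess
import sys


def build_ptq_command(model, qformat, export_path, calib_size):
    """Build the hf_ptq.py command line using the flags shown in 4A."""
    return [
        sys.executable, "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


def run_smoke_then_full(model, qformat, export_path, runner=subprocess.run):
    """Run a 4-sample smoke test, then full calibration with 512 samples.

    With subprocess.run and check=True, a failing smoke test raises
    CalledProcessError, so the full run is never started.
    Assumption: both runs write to the same export path; pass a different
    path per run if the smoke-test output should be kept.
    """
    for calib_size in (4, 512):
        runner(build_ptq_command(model, qformat, export_path, calib_size), check=True)
```

Passing a custom runner makes it possible to log or dry-run the commands before anything is executed on a GPU.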

Monitoring

After job submission, register the job and set up monitoring per the monitor skill.

Step 5 — Verify output

bash
ls -lh <output_path>/

Expect: config.json, tokenizer files, model-*.safetensors


Report the path and size to the user.
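
If a programmatic check is preferred over reading the ls output, something like the following sketch could be used. The function name and the returned dictionary keys are illustrative; the check only covers the files named above (config.json and the .safetensors shards) plus the total size to report.

```python
from pathlib import Path


def summarize_checkpoint(output_path):
    """Summarize an exported checkpoint directory for reporting."""
    root = Path(output_path)
    files = [p for p in root.iterdir() if p.is_file()]
    return {
        "has_config": (root / "config.json").is_file(),
        "safetensors": sorted(p.name for p in files if p.suffix == ".safetensors"),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

An empty "safetensors" list or a missing config.json indicates the export did not finish, even if the PTQ process exited without an error.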


Post-quantization validation

Validate that the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 experts.* missed by *mlp* patterns); this only surfaces later as deployment failures. Read references/checkpoint-validation.md for the validation script, expected patterns per recipe, and common pattern gaps.
Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.
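
To illustrate how a pattern gap of this kind arises, the sketch below lists layer names that no pattern covers. It assumes glob-style matching as implemented by Python's fnmatch, and the layer names are hypothetical; it is not the validation script from references/checkpoint-validation.md.

```python
from fnmatch import fnmatchcase


def unmatched_layers(layer_names, patterns):
    """Return the layer names that none of the glob patterns match."""
    return [
        name for name in layer_names
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical layer names: the second one has no "mlp" component.
LAYERS = [
    "model.layers.0.mlp.down_proj",
    "model.layers.0.experts.0.down_proj",
]

if __name__ == "__main__":
    print(unmatched_layers(LAYERS, ["*mlp*"]))
```

Here the experts layer is left uncovered by *mlp*, which is the kind of silent miss the validation step is meant to catch before deployment.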

Key API Rules

  • mtq.register() classes must define _setup() and call it from __init__
  • Call mto.enable_huggingface_checkpointing() before quantization
  • Wildcard *gate* matches too broadly; use *mlp.gate* or *router*
  • VLMs: hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl(), so no manual VLM handling is needed in most cases
  • FP8 checkpoints: prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True), which wastes ~2x memory. See references/unsupported-models.md for details
  • Custom quantizer names must end with _input_quantizer or _weight_quantizer
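
Two of these rules are easy to check mechanically. The sketch below assumes glob-style matching as in Python's fnmatch and uses hypothetical MoE module names; the helper names are illustrative and are not ModelOpt functions.

```python
from fnmatch import fnmatchcase

# Hypothetical module names for one MoE layer; real names vary by model.
MODULES = [
    "model.layers.0.mlp.gate",                 # router
    "model.layers.0.mlp.experts.0.gate_proj",  # expert projection
    "model.layers.0.self_attn.q_proj",
]


def matched(pattern, names=MODULES):
    """Return the module names that a glob pattern matches."""
    return [name for name in names if fnmatchcase(name, pattern)]


def valid_quantizer_name(name):
    """Check the suffix rule for custom quantizer attribute names."""
    return name.endswith(("_input_quantizer", "_weight_quantizer"))
```

With these names, *gate* also picks up the expert gate_proj layer, whereas *mlp.gate* selects only the router, which is why the narrower pattern is recommended.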

Common Pitfalls

  • Model-specific dependencies: Models with trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use the EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py
  • Transformers version: New models may need a newer version of transformers than what's installed. Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades; see references/slurm-setup-ptq.md for workarounds
  • Gated datasets: Some calibration datasets require HF authentication. Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative
  • NFS root_squash + Docker: See skills/common/slurm-setup.md section 5
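
The transformers version check can be sketched as follows. The helper names are illustrative, and the comparison treats transformers_version in config.json as a hint (it records the version the checkpoint was saved with) rather than a strict minimum.

```python
import json
import re
from pathlib import Path


def version_tuple(version):
    """Parse leading numeric components, e.g. "4.57.0.dev0" -> (4, 57, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric component such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def needs_upgrade(model_dir, installed_version):
    """Return True if config.json names a newer transformers than installed."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    required = config.get("transformers_version")
    if required is None:
        return False  # nothing recorded, so there is nothing to compare
    return version_tuple(required) > version_tuple(installed_version)
```

The installed version could be obtained with importlib.metadata.version("transformers"); it is passed in as an argument here so the comparison stays easy to test.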

References

| Reference | When to read |
|---|---|
| skills/common/environment-setup.md | Step 1: always |
| skills/common/workspace-management.md | Step 1: always |
| references/launcher-guide.md | Step 4B only (launcher path) |
| tools/launcher/CLAUDE.md | Step 4B only, if you need more launcher detail |
| references/unsupported-models.md | Step 4C only (unlisted model) |
| references/checkpoint-validation.md | Step 5: validate quantization pattern matches recipe |
| skills/common/remote-execution.md | Step 4A/4C only, if target is remote |
| skills/common/slurm-setup.md | Step 4A/4C only, if using SLURM manually (not launcher) |
| references/slurm-setup-ptq.md | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| examples/llm_ptq/README.md | Step 3: support matrix, CLI flags, accuracy |
| modelopt/torch/quantization/config.py | Step 3: format definitions |
| modelopt/torch/export/model_utils.py | Step 4C: TRT-LLM export type mapping |
| modelopt_recipes/ | Step 3: pre-built recipes |