ptq

ModelOpt Post-Training Quantization

Produce a quantized checkpoint from a pretrained model. Read `examples/llm_ptq/README.md` first: it has the support matrix, CLI flags, and accuracy guidance.

Step 1 — Environment

Read `skills/common/environment-setup.md` and `skills/common/workspace-management.md`. After completing them you should know:

- ModelOpt source is available
- Local or remote (plus cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher available?
- Which workspace to use

Step 2 — Is the model supported?

Check the support table in `examples/llm_ptq/README.md` for verified HF models.

- Listed → supported, use `hf_ptq.py` (step 4A/4B)
- Not listed → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)

Step 2.5 — Check for model-specific dependencies

If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:

bash

grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u

Known dependency patterns:

| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:

- Launcher (4B): set `EXTRA_PIP_DEPS` in the task's `environment` section; `ptq.sh` installs them automatically
- Manual (4A): run `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`
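
The same check can be scripted when several models need screening. The sketch below is only an illustration: the helper names (`uses_remote_code`, `top_level_imports`, `KNOWN_DEPS`) are hypothetical, the demo directory stands in for `<model_path>`, and the regex only catches simple top-level `import` / `from` lines, like the grep command above.

```python
import json
import re
import tempfile
from pathlib import Path

# Captures the root package of top-level "import x" / "from x import y" lines.
# Relative imports ("from .utils import ...") are intentionally skipped.
IMPORT_RE = re.compile(r"^(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)

# Import name -> pip package, taken from the dependency table above.
KNOWN_DEPS = {"mamba_ssm": "mamba-ssm", "causal_conv1d": "causal-conv1d"}


def uses_remote_code(model_dir: Path) -> bool:
    """True if config.json declares an auto_map (custom modeling code)."""
    config = json.loads((model_dir / "config.json").read_text())
    return "auto_map" in config


def top_level_imports(model_dir: Path) -> set[str]:
    """Collect root package names imported by modeling_*.py files."""
    found: set[str] = set()
    for path in sorted(model_dir.glob("modeling_*.py")):
        found.update(IMPORT_RE.findall(path.read_text()))
    return found


# Demo on a throwaway directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "config.json").write_text(
        json.dumps({"auto_map": {"AutoModel": "modeling_demo.DemoModel"}})
    )
    (demo / "modeling_demo.py").write_text(
        "import torch\nfrom mamba_ssm import Mamba\nfrom .utils import helper\n"
    )
    remote = uses_remote_code(demo)
    imports = top_level_imports(demo)
    extra = sorted(KNOWN_DEPS[name] for name in imports if name in KNOWN_DEPS)

print(remote, sorted(imports), extra)
```

Whether an import is actually missing still has to be checked inside the target container; this only narrows down what to look for.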

Step 3 — Choose quantization format

First, check for a model-specific recipe:

bash

ls modelopt_recipes/models/ 2>/dev/null

If a model-specific recipe exists, use `--recipe <path>`: it may contain tuned settings.

If no model-specific recipe, choose a format based on GPU (details in `examples/llm_ptq/README.md`):

- Blackwell (B100/B200/GB200): `nvfp4` variants
- Hopper (H100/H200) or older: `fp8` or `int4_awq`

Use `--qformat <name>` (e.g., `--qformat nvfp4`). Format definitions: `modelopt/torch/quantization/config.py`. General PTQ recipes in `modelopt_recipes/general/ptq/` correspond to the same formats; `--qformat` is the simpler way to use them.

NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
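
The GPU guidance above can be written down as a small lookup. This is a sketch under stated assumptions: the architecture labels and both helper names are hypothetical, and treating any unrecognized architecture as "Hopper or older" is a simplification, not something ModelOpt does for you.

```python
# Candidate --qformat values per architecture, copied from the guidance above.
FORMATS_BY_ARCH = {
    "blackwell": ["nvfp4"],
    "hopper": ["fp8", "int4_awq"],
}


def candidate_qformats(arch: str) -> list[str]:
    """Return candidate --qformat values for an architecture label."""
    # Assumption: anything not listed is treated like "Hopper or older".
    return FORMATS_BY_ARCH.get(arch.lower(), FORMATS_BY_ARCH["hopper"])


def can_serve(qformat: str, arch: str) -> bool:
    """NVFP4 can be calibrated on Hopper but needs Blackwell for inference."""
    return not qformat.startswith("nvfp4") or arch.lower() == "blackwell"


print(candidate_qformats("Blackwell"), can_serve("nvfp4", "hopper"))
```

Keeping the calibration and inference questions separate matters when the checkpoint is produced on one machine and deployed on another.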

Step 4 — Run PTQ

Goal: checkpoint on disk (`.safetensors` and `config.json`).

For listed models (4A/4B): run full calibration directly (`--calib_size 512`). For unlisted models (4C): run a smoke test first (`--calib_size 4`), wait for success, then full calibration.
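
As a rough illustration of the smoke-test-then-full sequence, the sketch below only builds the command lines; it does not run anything. The flags are the ones used in this document's `hf_ptq.py` invocation, while `ptq_command` and `planned_runs` are hypothetical helper names, and reusing one export path for both runs is an assumption.

```python
import shlex


def ptq_command(model: str, qformat: str, export_path: str, calib_size: int) -> list[str]:
    """Build an hf_ptq.py argv with the given calibration size."""
    return [
        "python", "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


def planned_runs(model: str, qformat: str, export_path: str, listed: bool) -> list[list[str]]:
    """Listed models: one full run. Unlisted models: smoke test, then full run."""
    sizes = [512] if listed else [4, 512]
    return [ptq_command(model, qformat, export_path, size) for size in sizes]


# Placeholder arguments; substitute real paths before use.
for argv in planned_runs("<model>", "fp8", "<output>", listed=False):
    print(shlex.join(argv))
```

The second command should only be launched once the first has finished successfully.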

Which path?

text

In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                  │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                  │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                  │          Bare GPU (local or remote)? → MANUAL (4A)
                  │
                  └→ NOT LISTED ──→ UNLISTED MODEL (4C)
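
The diagram above reduces to a two-input decision. A minimal sketch, assuming hypothetical environment labels (`"slurm"`, `"local_docker"`, `"remote_docker"`, `"bare_gpu"`) that are not defined anywhere in ModelOpt:

```python
def choose_path(listed: bool, environment: str) -> str:
    """Map (README-listed?, environment) to step 4A, 4B, or 4C."""
    if not listed:
        return "4C"  # unlisted model, regardless of environment
    if environment in {"slurm", "local_docker"}:
        return "4B"  # launcher
    if environment in {"remote_docker", "bare_gpu"}:
        return "4A"  # manual execution
    raise ValueError(f"unknown environment: {environment!r}")


print(choose_path(True, "slurm"), choose_path(True, "bare_gpu"), choose_path(False, "slurm"))
```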

4A — Direct: supported model, manual execution

bash

pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>

Run with `--help` for all options.

For remote: use `remote_run` from `remote_exec.sh` (see `skills/common/remote-execution.md`).

4B — Launcher: supported model on SLURM or local Docker

Write a YAML config using `common/hf_ptq/hf_ptq.sh`. See `references/launcher-guide.md` for the full template.

bash

cd tools/launcher

SLURM (remote or local):

SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes

Local Docker:

uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes


The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).

4C — Unlisted model

Follow `references/unsupported-models.md`. It walks through investigating the model, patching ModelOpt if needed, and running `hf_ptq.py`. Run manually (like 4A) for easier monitoring and debugging.

For SLURM, see `skills/common/slurm-setup.md` and `references/slurm-setup-ptq.md`.

Monitoring

After job submission, register the job and set up monitoring per the monitor skill.

Step 5 — Verify output

bash

ls -lh <output_path>/

Expect: config.json, tokenizer files, model-*.safetensors

Report the path and size to the user.
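
If a scripted check is preferred over reading the `ls` output, something like the sketch below could produce the report. `summarize_checkpoint` is a hypothetical helper, the demo files are fabricated placeholders, and tokenizer files are not checked because their names vary by model.

```python
import tempfile
from pathlib import Path


def summarize_checkpoint(output_dir: Path) -> dict:
    """Report whether config.json exists, which shards exist, and total size."""
    files = [p for p in output_dir.iterdir() if p.is_file()]
    names = {p.name for p in files}
    return {
        "has_config": "config.json" in names,
        "shards": sorted(n for n in names if n.endswith(".safetensors")),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


# Demo on a throwaway directory standing in for <output_path>.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "config.json").write_text("{}")
    (out / "model-00001-of-00001.safetensors").write_bytes(b"\0" * 16)
    summary = summarize_checkpoint(out)

print(summary)
```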


Post-quantization validation

Validate that the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns); this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps.
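
To make the pattern gap concrete, the sketch below lists module names that no wildcard covers. It assumes glob-style matching as provided by Python's `fnmatch`, which may differ from how ModelOpt resolves its config patterns; the module names are made up for illustration, and `uncovered` is a hypothetical helper, not the validation script from `references/checkpoint-validation.md`.

```python
from fnmatch import fnmatchcase


def uncovered(module_names: list[str], patterns: list[str]) -> list[str]:
    """Return the module names that none of the wildcard patterns match."""
    return [
        name for name in module_names
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical names; real ones would come from the model's named modules.
modules = [
    "model.layers.0.mlp.down_proj",
    "model.layers.0.experts.3.down_proj",
]

missed = uncovered(modules, ["*mlp*"])
fixed = uncovered(modules, ["*mlp*", "*experts.*"])
print(missed, fixed)
```

An empty result for the full pattern list is a necessary condition, not proof that the exported checkpoint is correct.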

Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.

Key API Rules

- `mtq.register()` classes must define `_setup()` and call it from `__init__`
- Call `mto.enable_huggingface_checkpointing()` before quantization
- Wildcard `*gate*` matches too broadly; use `*mlp.gate*` or `*router*`
- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()`, so no manual VLM handling is needed in most cases
- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which wastes ~2x memory. See `references/unsupported-models.md` for details
- Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
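
Two of these rules are easy to sanity-check before a long run. The sketch below again assumes `fnmatch`-style glob semantics and uses invented quantizer names for a mixture-of-experts block; `matches` and `valid_custom_quantizer_name` are hypothetical helpers.

```python
from fnmatch import fnmatchcase

# Hypothetical quantizer names, for illustration only.
names = [
    "layers.0.mlp.gate.weight_quantizer",                 # router
    "layers.0.mlp.experts.3.gate_proj.weight_quantizer",  # expert projection
    "layers.0.self_attn.o_proj.weight_quantizer",
]


def matches(pattern: str) -> list[str]:
    """Names selected by a wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


broad = matches("*gate*")       # also catches the expert gate_proj
narrow = matches("*mlp.gate*")  # only the router in this name set

VALID_SUFFIXES = ("_input_quantizer", "_weight_quantizer")


def valid_custom_quantizer_name(attr_name: str) -> bool:
    """Check the suffix rule for a custom quantizer attribute name."""
    return attr_name.endswith(VALID_SUFFIXES)


print(len(broad), len(narrow), valid_custom_quantizer_name("gate_up_proj_weight_quantizer"))
```

Printing the matched names for each pattern against the real model is a cheap way to confirm that a pattern selects exactly the intended modules.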

Common Pitfalls

- Model-specific dependencies: Models with `trust_remote_code` may import packages not in the container (e.g., `mamba-ssm` for hybrid Mamba models). See Step 2.5. Use the `EXTRA_PIP_DEPS` env var with the launcher, or install manually before running `hf_ptq.py`
- Transformers version: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades; see `references/slurm-setup-ptq.md` for workarounds
- Gated datasets: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- NFS root_squash + Docker: See `skills/common/slurm-setup.md` section 5
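
For the transformers version pitfall, a quick comparison can flag a likely mismatch early. This is a sketch, not a definitive check: `transformers_version` records the version that saved the checkpoint rather than a strict minimum, the helper names are hypothetical, and the version strings in the demo are invented. In practice the installed version could be read with `importlib.metadata.version("transformers")`.

```python
import json
import tempfile
from pathlib import Path


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse leading numeric components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def needs_upgrade(config_path: Path, installed: str) -> bool:
    """True if config.json records a newer transformers than the installed one."""
    recorded = json.loads(config_path.read_text()).get("transformers_version")
    if recorded is None:
        return False
    return version_tuple(recorded) > version_tuple(installed)


# Demo with made-up version numbers.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"transformers_version": "4.57.0.dev0"}))
    upgrade_needed = needs_upgrade(cfg, installed="4.48.3")
    already_new_enough = needs_upgrade(cfg, installed="4.57.1")

print(upgrade_needed, already_new_enough)
```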

References

| Reference | When to read |
| --- | --- |
| `skills/common/environment-setup.md` | Step 1: always |
| `skills/common/workspace-management.md` | Step 1: always |
| `references/launcher-guide.md` | Step 4B only (launcher path) |
| `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail |
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `references/checkpoint-validation.md` | Step 5: validate quantization pattern matches recipe |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
| `modelopt/torch/quantization/config.py` | Step 3: format definitions |
| `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
| `modelopt_recipes/` | Step 3: pre-built recipes |
