ptq
ModelOpt Post-Training Quantization
Produce a quantized checkpoint from a pretrained model. Read `examples/llm_ptq/README.md` first: it has the support matrix, CLI flags, and accuracy guidance.
Step 1 — Environment
Read `skills/common/environment-setup.md` and `skills/common/workspace-management.md`. After completing them you should know:
- Whether ModelOpt source is available
- Local or remote (plus cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Whether a launcher is available
- Which workspace to use
Step 2 — Is the model supported?
Check the support table in `examples/llm_ptq/README.md` for verified HF models.
- Listed → supported, use `hf_ptq.py` (step 4A/4B)
- Not listed → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)
Step 2.5 — Check for model-specific dependencies
If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:
```bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
```
Known dependency patterns:
| Import found | Packages to install |
|---|---|
| |
If extra deps are needed:
- Launcher (4B): set `EXTRA_PIP_DEPS` in the task's `environment` section; `ptq.sh` installs them automatically
- Manual (4A): run `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`
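The dependency check above can also be sketched in Python. This is a minimal illustration, not part of ModelOpt: the function names are hypothetical, and it assumes the model directory holds `config.json` and `modeling_*.py` files as described. Unlike the `grep` one-liner, it also finds imports nested inside functions or `try` blocks.

```python
import ast
import importlib.util
import json
from pathlib import Path


def uses_remote_code(model_dir: Path) -> bool:
    """Return True if config.json declares an auto_map (custom modeling code)."""
    config_path = model_dir / "config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return "auto_map" in config


def top_level_imports(model_dir: Path) -> set[str]:
    """Collect root package names imported by modeling_*.py (relative imports skipped)."""
    roots: set[str] = set()
    for source_file in sorted(model_dir.glob("modeling_*.py")):
        tree = ast.parse(source_file.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                roots.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                roots.add(node.module.split(".")[0])
    return roots


def missing_packages(roots: set[str]) -> list[str]:
    """Return imported roots that cannot be located in the current environment."""
    missing = []
    for root in sorted(roots):
        try:
            found = importlib.util.find_spec(root) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(root)
    return missing
```

The result lists import names, which do not always equal the pip package name, so map each one to an installable package before setting `EXTRA_PIP_DEPS` or running `pip install`.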
Step 3 — Choose quantization format
First, check for a model-specific recipe:
```bash
ls modelopt_recipes/models/ 2>/dev/null
```
If a model-specific recipe exists, use `--recipe <path>`; it may contain tuned settings.
If no model-specific recipe exists, choose a format based on GPU (details in `examples/llm_ptq/README.md`):
- Blackwell (B100/B200/GB200): `nvfp4` variants
- Hopper (H100/H200) or older: `fp8` or `int4_awq`
Use `--qformat <name>` (e.g., `--qformat nvfp4`). Format definitions: `modelopt/torch/quantization/config.py`. General PTQ recipes in `modelopt_recipes/general/ptq/` correspond to the same formats; `--qformat` is the simpler way to use them.
NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
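As a rough illustration of the GPU-based rule above, the sketch below maps a GPU name string to candidate `--qformat` values. The function name and the name-matching approach are assumptions made for this example. It encodes only the guidance in this step, so confirm the choice against `examples/llm_ptq/README.md`.

```python
import re

# Blackwell product names listed in this step; any other GPU falls through
# to the Hopper-or-older recommendation.
_BLACKWELL_PATTERN = re.compile(r"\b(?:GB200|B200|B100)\b")


def suggest_qformats(gpu_name: str) -> list[str]:
    """Return candidate --qformat values for a GPU name (hypothetical helper)."""
    if _BLACKWELL_PATTERN.search(gpu_name.upper()):
        return ["nvfp4"]
    return ["fp8", "int4_awq"]
```

The GPU name could come from `nvidia-smi --query-gpu=name --format=csv,noheader`. The helper reflects where inference will run; it does not capture the case of calibrating NVFP4 on Hopper for later Blackwell deployment.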
Step 4 — Run PTQ
Goal: checkpoint on disk (`.safetensors` + `config.json`).
For listed models (4A/4B): run full calibration directly (`--calib_size 512`).
For unlisted models (4C): run a smoke test first (`--calib_size 4`), wait for success, then run full calibration.
Which path?
```text
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                 │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                 │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                 │          Bare GPU (local or remote)? → MANUAL (4A)
                 │
                 └→ NOT LISTED ──→ UNLISTED MODEL (4C)
```
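The path decision can be written as a small function. This is only a sketch of the decision tree in this step; the `Env` enum and `choose_path` are hypothetical names, not part of ModelOpt or its launcher.

```python
from enum import Enum


class Env(Enum):
    """Execution environments named in the decision tree (illustrative)."""
    SLURM = "slurm"                  # local or remote
    LOCAL_DOCKER = "local_docker"    # local Docker + GPU
    REMOTE_DOCKER = "remote_docker"  # remote Docker, no SLURM
    BARE_GPU = "bare_gpu"            # local or remote


def choose_path(listed_in_readme: bool, env: Env) -> str:
    """Return "4A", "4B", or "4C" following the decision tree."""
    if not listed_in_readme:
        return "4C"  # unlisted model, regardless of environment
    if env in (Env.SLURM, Env.LOCAL_DOCKER):
        return "4B"  # launcher
    return "4A"      # manual execution
```

For example, `choose_path(True, Env.REMOTE_DOCKER)` returns `"4A"`, because the launcher path covers only SLURM and local Docker.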
4A — Direct: supported model, manual execution
```bash
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>
```
Run `--help` for all options.
For remote: use `remote_run` from `remote_exec.sh` (see `skills/common/remote-execution.md`).
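When scripting the manual path, the invocation and the smoke-test-then-full sequence from Step 4 can be assembled as below. This is a sketch that uses only the flags shown above. The helper names and the `-smoke` suffix on the smoke-test export path are illustrative assumptions, not ModelOpt conventions.

```python
import subprocess
import sys


def build_ptq_command(model: str, qformat: str, export_path: str,
                      calib_size: int = 512) -> list[str]:
    """Build the hf_ptq.py command line with the flags shown in 4A."""
    return [
        sys.executable, "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


def plan_ptq_commands(model: str, qformat: str, export_path: str,
                      listed: bool) -> list[list[str]]:
    """Listed models: full calibration only. Unlisted: smoke test first."""
    commands = []
    if not listed:
        # Hypothetical choice: keep the smoke-test export in a separate directory.
        commands.append(
            build_ptq_command(model, qformat, export_path + "-smoke", calib_size=4)
        )
    commands.append(build_ptq_command(model, qformat, export_path, calib_size=512))
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run commands in order; check=True stops at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Because `run_all` raises on a non-zero exit code, a failed smoke test prevents the full calibration from starting.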
4B — Launcher: supported model on SLURM or local Docker
Write a YAML config using `common/hf_ptq/hf_ptq.sh`. See `references/launcher-guide.md` for the full template.
```bash
cd tools/launcher
```
SLURM (remote or local):
```bash
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
```
Local Docker:
```bash
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes
```
The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).
4C — Unlisted model
Follow `references/unsupported-models.md`. It walks through investigating the model, patching ModelOpt if needed, and running `hf_ptq.py`. Run manually (like 4A) for easier monitoring and debugging.
For SLURM, see `skills/common/slurm-setup.md` and `references/slurm-setup-ptq.md`.
Monitoring
After job submission, register the job and set up monitoring per the monitor skill.
Step 5 — Verify output
```bash
ls -lh <output_path>/
```
Expect: `config.json`, tokenizer files, `model-*.safetensors`
Report the path and size to the user.
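The same check can be done programmatically before reporting. The sketch below is an assumption-laden illustration: `summarize_checkpoint` is a hypothetical name, tokenizer files are detected by a `tokenizer` filename prefix, and an unsharded `model.safetensors` is also accepted, which goes slightly beyond the expected list above.

```python
from pathlib import Path


def summarize_checkpoint(output_path: str) -> dict:
    """Summarize the files Step 5 expects in an exported checkpoint directory."""
    out = Path(output_path)
    files = [p for p in out.iterdir() if p.is_file()]
    names = {p.name for p in files}
    return {
        "has_config": "config.json" in names,
        # Assumption: tokenizer artifacts start with "tokenizer".
        "tokenizer_files": sorted(n for n in names if n.startswith("tokenizer")),
        # Matches model-*.safetensors shards and a single model.safetensors.
        "weight_files": sorted(p.name for p in out.glob("model*.safetensors")),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

A report to the user could then include `output_path` together with `total_bytes`, and flag the export as incomplete when `has_config` is false or `weight_files` is empty.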
Post-quantization validation
Validate that the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns); this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps.
Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.
Key API Rules
- `mtq.register()` classes must define `_setup()` and call it from `__init__`
- Call `mto.enable_huggingface_checkpointing()` before quantization
- Wildcard `*gate*` matches too broadly; use `*mlp.gate*` or `*router*`
- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()`; no manual VLM handling needed in most cases
- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which wastes ~2x memory. See `references/unsupported-models.md` for details
- Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
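Two of these rules can be illustrated without ModelOpt installed. The sketch below uses Python's `fnmatchcase` as a stand-in for glob-style pattern matching; that it mirrors ModelOpt's matching exactly is an assumption, and the module names are made-up examples. It shows why a bare `*gate*` also catches dense projection layers, and how to check the quantizer-name suffix rule.

```python
from fnmatch import fnmatchcase

QUANTIZER_SUFFIXES = ("_input_quantizer", "_weight_quantizer")

# Hypothetical module names, for illustration only.
MODULE_NAMES = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.moe.router",
]


def matching_names(pattern: str, names: list[str]) -> list[str]:
    """Return the names matched by a glob-style pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


def has_valid_quantizer_suffix(name: str) -> bool:
    """Check the naming rule for custom quantizers."""
    return name.endswith(QUANTIZER_SUFFIXES)


# "*gate*" is meant for a routing gate but also hits the dense gate_proj layer.
too_broad = matching_names("*gate*", MODULE_NAMES)
# "*router*" selects only the routing module.
targeted = matching_names("*router*", MODULE_NAMES)
```

Running a candidate pattern through `matching_names` against the model's real module names, for example from `model.named_modules()`, is a cheap way to see what a pattern would touch before quantizing.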
Common Pitfalls
- **Model-specific dependencies**: Models with `trust_remote_code` may import packages not in the container (e.g., `mamba-ssm` for hybrid Mamba models). See Step 2.5. Use the `EXTRA_PIP_DEPS` env var with the launcher, or install manually before running `hf_ptq.py`
- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades; see `references/slurm-setup-ptq.md` for workarounds
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
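The transformers-version pitfall can be checked up front. The sketch below compares the `transformers_version` recorded in `config.json` with an installed version string. The helper names are hypothetical, and treating the recorded version as a minimum is an assumption: it records the version that saved the checkpoint, so read the result as a hint rather than a hard requirement.

```python
import json
import re
from pathlib import Path


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse the leading major.minor[.patch] of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def may_need_upgrade(model_dir: str, installed_version: str) -> bool:
    """True if config.json records a newer transformers than the installed one."""
    config = json.loads((Path(model_dir) / "config.json").read_text(encoding="utf-8"))
    recorded = config.get("transformers_version")
    if recorded is None:
        return False  # nothing recorded, so nothing to compare
    return parse_version(installed_version) < parse_version(recorded)
```

The installed version can be obtained with `importlib.metadata.version("transformers")`. If the check returns true inside a container, remember that `PIP_CONSTRAINT` may block the upgrade.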
References
| Reference | When to read |
|---|---|
| | Step 1: always |
| | Step 1: always |
| | Step 4B only (launcher path) |
| | Step 4B only, if you need more launcher detail |
| | Step 4C only (unlisted model) |
| | Step 5: validate quantization pattern matches recipe |
| | Step 4A/4C only, if target is remote |
| | Step 4A/4C only, if using SLURM manually (not launcher) |
| | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| | Step 3: support matrix, CLI flags, accuracy |
| | Step 3: format definitions |
| | Step 4C: TRT-LLM export type mapping |
| | Step 3: pre-built recipes |