This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
npx skill4agent add nvidia/skills ptq

`examples/llm_ptq/README.md`, `skills/common/environment-setup.md`, `skills/common/workspace-management.md`, `examples/llm_ptq/README.md`, `hf_ptq.py`, `references/unsupported-models.md`, `hf_ptq.py`, `trust_remote_code`, `config.json`, `auto_map`

grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u

| Import found | Packages to install |
|---|---|
| | |
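The `grep` scan of the custom `modeling_*.py` files can be narrowed to top-level module names, which are easier to compare against what is installed. The following is a minimal sketch, not part of the skill itself: the function name `list_top_level_imports` is hypothetical, and a module name is not always the same as its pip package name (for example `mamba_ssm` versus `mamba-ssm`).

```shell
# Hypothetical helper: print the unique top-level modules imported by the
# custom modeling files in a model directory. Relative imports (from .x)
# are skipped because they refer to files shipped with the model.
list_top_level_imports() {
  ltli_model_path="$1"
  grep -h "^from \|^import " "$ltli_model_path"/modeling_*.py \
    | awk '{print $2}' \
    | grep -v '^\.' \
    | sed 's/[.,].*//' \
    | sort -u
}
```

One possible use is to loop over the result and try `python -c "import <module>"` for each name, then install only the ones that fail.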
`EXTRA_PIP_DEPS`, `environment`, `ptq.sh`

unset PIP_CONSTRAINT && pip install <deps>

`hf_ptq.py`

ls modelopt_recipes/models/ 2>/dev/null

`--recipe <path>`, `examples/llm_ptq/README.md`, `nvfp4`, `fp8`, `int4_awq`, `--qformat <name>`, `--qformat nvfp4`, `modelopt/torch/quantization/config.py`, `modelopt_recipes/general/ptq/`, `--qformat`

NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
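The NVFP4 hardware note can be turned into a quick pre-flight check. This is a sketch under stated assumptions: the function name `nvfp4_support` is hypothetical, Hopper is taken to be compute capability 9.x, and any major version of 10 or above is treated as Blackwell. Older GPUs are reported as not covered because the note above says nothing about them.

```shell
# Hypothetical helper: classify a CUDA compute capability string (e.g. "9.0")
# against the NVFP4 note. Assumes 9.x = Hopper and major >= 10 = Blackwell.
nvfp4_support() {
  nvfp4_major="${1%%.*}"   # keep only the part before the first dot
  if [ "$nvfp4_major" -ge 10 ]; then
    echo "calibration+inference"
  elif [ "$nvfp4_major" -eq 9 ]; then
    echo "calibration-only"
  else
    echo "not-covered"
  fi
}
```

On drivers whose `nvidia-smi` supports the `compute_cap` query field, the input could come from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1`.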
`.safetensors`, `config.json`, `--calib_size 512`, `--calib_size 4`

In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                 │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                 │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                 │          Bare GPU (local or remote)? → MANUAL (4A)
                 │
                 └→ NOT LISTED ──→ UNLISTED MODEL (4C)

pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <model> \
--qformat <format> \
--calib_size 512 \
--export_path <output>

`--help`, `remote_run`, `remote_exec.sh`, `skills/common/remote-execution.md`, `common/hf_ptq/hf_ptq.sh`, `references/launcher-guide.md`

cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes

`references/unsupported-models.md`, `hf_ptq.py`, `skills/common/slurm-setup.md`, `references/slurm-setup-ptq.md`

ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors

`experts.*`, `*mlp*`, `references/checkpoint-validation.md`, `mtq.register()`, `_setup()`, `__init__`, `mto.enable_huggingface_checkpointing()`, `*gate*`, `*mlp.gate*`, `*router*`, `hf_ptq.py`, `extract_and_prepare_language_model_from_vl()`, `_QuantFP8Linear`, `FineGrainedFP8Config(dequantize=True)`, `references/unsupported-models.md`, `_input_quantizer`, `_weight_quantizer`, `trust_remote_code`, `mamba-ssm`, `EXTRA_PIP_DEPS`, `hf_ptq.py`, `config.json`, `transformers_version`, `PIP_CONSTRAINT`, `references/slurm-setup-ptq.md`, `HF_TOKEN`, `--dataset cnn_dailymail`, `skills/common/slurm-setup.md`

| Reference | When to read |
|---|---|
| | Step 1: always |
| | Step 1: always |
| | Step 4B only (launcher path) |
| | Step 4B only, if you need more launcher detail |
| | Step 4C only (unlisted model) |
| | Step 5: validate quantization pattern matches recipe |
| | Step 4A/4C only, if target is remote |
| | Step 4A/4C only, if using SLURM manually (not launcher) |
| | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| | Step 3: support matrix, CLI flags, accuracy |
| | Step 3: format definitions |
| | Step 4C: TRT-LLM export type mapping |
| | Step 3: pre-built recipes |
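The Step 5 listing of the export directory (`ls -lh <output_path>/`, expecting `config.json`, tokenizer files, and safetensors shards) could be scripted as a pass/fail check. This is a minimal sketch: the function name `check_export_dir` is hypothetical, and the globs `tokenizer*` and `*.safetensors` are assumptions about how the expected files are named. It only confirms the files exist; it does not validate the quantization pattern described in `references/checkpoint-validation.md`.

```shell
# Hypothetical helper: verify that an exported checkpoint directory contains
# the files the Step 5 listing expects. Returns 0 if all are present.
check_export_dir() {
  ced_out="$1"
  ced_status=0
  [ -f "$ced_out/config.json" ] \
    || { echo "missing: config.json"; ced_status=1; }
  ls "$ced_out"/tokenizer* >/dev/null 2>&1 \
    || { echo "missing: tokenizer files"; ced_status=1; }
  ls "$ced_out"/*.safetensors >/dev/null 2>&1 \
    || { echo "missing: safetensors weights"; ced_status=1; }
  return "$ced_status"
}
```

Calling `check_export_dir <output_path>` prints one line per missing item and returns a non-zero status if anything is absent, so it can gate a follow-up deployment step.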