GPTQ (Generative Pre-trained Transformer Quantization)

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

When to use GPTQ

Use GPTQ when:
  • Need to fit large models (70B+) on limited GPU memory
  • Want 4× memory reduction with <2% accuracy loss
  • Deploying on consumer GPUs (RTX 4090, 3090)
  • Need faster inference (3-4× speedup vs FP16)
Use AWQ instead when:
  • Need slightly better accuracy (<1% loss)
  • Have newer GPUs (Ampere, Ada)
  • Want Marlin kernel support (2× faster on some GPUs)
Use bitsandbytes instead when:
  • Need simple integration with transformers
  • Want 8-bit quantization (less compression, better quality)
  • Don't need pre-quantized model files

Quick start

Installation

bash
# Install AutoGPTQ
pip install auto-gptq

# With Triton kernels (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate

Load pre-quantized model

python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Quantize your own model

python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Group size (recommended: 128)
    desc_act=False,      # Activation order (False for CUDA kernel)
    damp_percent=0.01    # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data (128 samples, truncated to 512 tokens each)
dataset = load_dataset("c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.take(128)
]

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")

Group-wise quantization

How GPTQ works:
  1. Group weights: Divide each weight matrix into groups (typically 128 elements)
  2. Quantize per-group: Each group has its own scale/zero-point
  3. Minimize error: Uses Hessian information to minimize quantization error
  4. Result: 4-bit weights with near-FP16 accuracy
Group size trade-off:
| Group Size | Model Size | Accuracy | Speed | Recommendation |
|---|---|---|---|---|
| 32 | Largest (most scales/zero-points) | Best | Slower | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Smaller still | Lowest | Faster still | Not recommended |
| -1 (per-column) | Smallest | Lowest | Fastest | Rarely used |

Smaller groups store more scales and zero-points, so they cost a little extra model size in exchange for accuracy; larger groups (or no grouping at all with -1) shrink that overhead but quantize more coarsely.
Example:
Weight matrix: [1024, 4096] = 4.2M elements

Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group: its own scale + zero-point
- Result: finer granularity → better accuracy
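
To make the per-group mechanics concrete, here is a toy NumPy sketch of asymmetric 4-bit group-wise quantization. It is an illustration only: real GPTQ chooses the rounded values using second-order (Hessian) information rather than plain rounding, but the storage layout (one scale and zero-point per group of 128 weights) is the same idea.

python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Toy per-group asymmetric quantization of a flat weight array."""
    levels = 2 ** bits - 1
    w = w.reshape(-1, group_size)                        # one row per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels                     # one scale per group
    zero = np.round(-w_min / scale)                      # one zero-point per group
    q = np.clip(np.round(w / scale + zero), 0, levels)   # 4-bit integer codes
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale, zero = quantize_groupwise(w)
err = np.abs(dequantize(q, scale, zero).ravel() - w).max()
print(f"max reconstruction error: {err:.4f}")  # bounded by roughly scale/2 in each group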

Quantization configurations

Standard 4-bit (recommended)

python
from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,              # 4-bit quantization
    group_size=128,      # Standard group size
    desc_act=False,      # Faster CUDA kernel
    damp_percent=0.01    # Dampening factor
)
Performance:
  • Memory: 4× reduction (70B model: 140GB → 35GB)
  • Accuracy: ~1.5% perplexity increase
  • Speed: 3-4× faster than FP16

Maximum compression (3-bit)

python
config = BaseQuantizeConfig(
    bits=3,              # 3-bit (more compression)
    group_size=128,      # Keep standard group size
    desc_act=True,       # Better accuracy (slower)
    damp_percent=0.01
)
Trade-off:
  • Memory: 5× reduction
  • Accuracy: ~3% perplexity increase
  • Speed: 5× faster (but less accurate)

Maximum accuracy (4-bit with small groups)

python
config = BaseQuantizeConfig(
    bits=4,
    group_size=32,       # Smaller groups (better accuracy)
    desc_act=True,       # Activation reordering
    damp_percent=0.005   # Lower dampening
)
Trade-off:
  • Memory: 3.5× reduction (slightly larger)
  • Accuracy: ~0.8% perplexity increase (best)
  • Speed: 2-3× faster (kernel overhead)

Kernel backends

ExLlamaV2 (default, fastest)

python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,      # Use ExLlamaV2
    exllama_config={"version": 2}
)
Performance: 1.5-2× faster than Triton

Marlin (Ampere+ GPUs)

python
# Quantize with Marlin format
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False   # Required for Marlin
)
model.quantize(calibration_data, use_marlin=True)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True  # 2× faster on A100/H100
)

**Requirements**:
- NVIDIA Ampere or newer (A100, H100, RTX 40xx)
- Compute capability ≥ 8.0

Triton (Linux only)

python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True  # Linux only
)
Performance: 1.2-1.5× faster than CUDA backend

Integration with transformers

Direct transformers usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)

QLoRA fine-tuning (GPTQ + LoRA)

python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# A 70B model is trainable on a single A100 80GB
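
The fine-tuning step itself is the usual transformers/PEFT training loop. A minimal sketch, assuming you already have a pre-tokenized causal-LM dataset called `train_dataset`; the output path and hyperparameters below are illustrative, not prescribed by this guide:

python
from transformers import AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

training_args = TrainingArguments(
    output_dir="llama-2-7b-gptq-lora",   # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,                          # the GPTQ + LoRA model from above
    args=training_args,
    train_dataset=train_dataset,          # assumed: pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Only the LoRA adapter weights are saved; the 4-bit base model stays frozen
model.save_pretrained("llama-2-7b-gptq-lora")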

Performance benchmarks

Memory reduction

| Model | FP16 | GPTQ 4-bit | Reduction |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | 203 GB | 4× |
Enables:
  • 70B on single A100 80GB (vs 2× A100 needed for FP16)
  • 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
  • 13B on RTX 4090 24GB (vs OOM with FP16)
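
The table follows from simple arithmetic: roughly 2 bytes per weight at FP16 versus 0.5 bytes per weight at 4-bit, plus a small per-group overhead for scales and zero-points. A rough estimator (the ~2.5 bytes of per-group metadata is an approximation, and actual serving needs extra room for activations and KV cache):

python
def gptq_weight_memory_gb(params_billion, bits=4, group_size=128):
    """Back-of-the-envelope weight memory for a GPTQ model, in GB."""
    n = params_billion * 1e9
    weight_bytes = n * bits / 8
    # Approximate per-group metadata: FP16 scale (2 bytes) + packed zero-point (~0.5 bytes)
    metadata_bytes = (n / group_size) * 2.5
    return (weight_bytes + metadata_bytes) / 1e9

print(gptq_weight_memory_gb(70))    # ≈ 36 GB, in line with the 35 GB above
print(gptq_weight_memory_gb(405))   # ≈ 210 GB
print(gptq_weight_memory_gb(70, bits=16, group_size=10**9))  # ≈ 140 GB FP16 baseline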

Inference speed (Llama 2-7B, A100)

| Precision | Tokens/sec | vs FP16 |
|---|---|---|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |
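
Throughput on your own hardware is easy to spot-check. A minimal sketch using the model and tokenizer from the quick start (single prompt, no batching; numbers will vary with GPU, kernel backend, and sequence length):

python
import time

inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda:0")
model.generate(**inputs, max_new_tokens=16)          # warm-up pass

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")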

Accuracy (perplexity on WikiText-2)

| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|---|---|---|---|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |
Excellent quality preservation - less than 2% degradation!
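
Numbers like these come from a standard sliding-window perplexity evaluation on the WikiText-2 test split. A minimal sketch, assuming the model was loaded through transformers as in the integration section (the 2048-token window and stride are illustrative):

python
import torch
from datasets import load_dataset

def wikitext2_perplexity(model, tokenizer, seq_len=2048, stride=2048):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(model.device)

    nlls = []
    for begin in range(0, ids.size(1) - seq_len, stride):
        chunk = ids[:, begin:begin + seq_len]
        with torch.no_grad():
            # labels == inputs -> the model returns mean token cross-entropy
            loss = model(chunk, labels=chunk).loss
        nlls.append(loss * seq_len)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))

print(wikitext2_perplexity(model, tokenizer))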

Common patterns

Multi-GPU deployment

python
# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",                  # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"}   # Limit per GPU
)

# Manual device mapping (one entry per module)
device_map = {"model.embed_tokens": 0, "model.norm": 1, "lm_head": 1}
device_map.update({f"model.layers.{i}": 0 for i in range(0, 40)})   # First 40 layers on GPU 0
device_map.update({f"model.layers.{i}": 1 for i in range(40, 80)})  # Last 40 layers on GPU 1

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)

CPU offloading

python
# Offload some layers to CPU (for very large models)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-405B-GPTQ",
    device_map="auto",
    max_memory={
        0: "80GB",       # GPU 0
        1: "80GB",       # GPU 1
        2: "80GB",       # GPU 2
        "cpu": "200GB"   # Offload overflow to CPU
    }
)

Batch inference

python
# Process multiple prompts efficiently
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL"
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output)}")

Finding pre-quantized models

TheBloke on HuggingFace hosts a large collection of ready-made GPTQ checkpoints.

Search: look for repositories whose names end in "-GPTQ" (e.g. "Llama-2-70B-Chat-GPTQ"), either via the Hub search box or programmatically as in the sketch below.
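
One programmatic way to search is through the huggingface_hub client; a small sketch (the search string and limit are illustrative):

python
from huggingface_hub import HfApi

api = HfApi()
# GPTQ repos usually carry "GPTQ" in their name; sort by downloads
for m in api.list_models(search="GPTQ", sort="downloads", direction=-1, limit=10):
    print(m.id)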


**Download**:

python
from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0"
)

Supported models

  • LLaMA family: Llama 2, Llama 3, Code Llama
  • Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: V2, V3
  • Phi: Phi-2, Phi-3
  • Yi, Falcon, BLOOM, OPT
  • 100+ models on HuggingFace

References

  • Calibration Guide - Dataset selection, quantization process, quality optimization
  • Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM
  • Troubleshooting - Common issues, performance optimization

Resources
