# GPTQ (Generative Pre-trained Transformer Quantization)
Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.
## When to use GPTQ
Use GPTQ when:
- Need to fit large models (70B+) on limited GPU memory
- Want 4× memory reduction with <2% accuracy loss
- Deploying on consumer GPUs (RTX 4090, 3090)
- Need faster inference (3-4× speedup vs FP16)
Use AWQ instead when:
- Need slightly better accuracy (<1% loss)
- Have newer GPUs (Ampere, Ada)
- Want Marlin kernel support (2× faster on some GPUs)
Use bitsandbytes instead when:
- Need simple integration with transformers
- Want 8-bit quantization (less compression, better quality)
- Don't need pre-quantized model files
## Quick start
### Installation
```bash
# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate
```
### Load pre-quantized model
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```
### Quantize your own model
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Group size (recommended: 128)
    desc_act=False,     # Activation order (False for the fast CUDA kernel)
    damp_percent=0.01   # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data (128 samples, up to 512 tokens each)
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512, return_tensors="pt")
    for example in dataset.take(128)
]

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")
```
## Group-wise quantization
How GPTQ works:
- Group weights: Divide each weight matrix into groups (typically 128 elements)
- Quantize per-group: Each group has its own scale/zero-point
- Minimize error: Uses Hessian information to minimize quantization error
- Result: 4-bit weights with near-FP16 accuracy
Group size trade-off:
| Group Size | Model Size | Accuracy | Speed | Recommendation |
|---|---|---|---|---|
| -1 (per-channel, no groups) | Smallest | Lowest | Fastest | Not recommended |
| 32 | Larger | Best | Slower | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Smallest | Lowest | Fastest | Rarely worth it |
Example:
Weight matrix: [1024, 4096] = 4.2M elements
Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group: its own scale and zero-point
- Result: finer granularity → better accuracy
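To make the per-group scale and zero-point idea concrete, here is a minimal NumPy sketch of asymmetric 4-bit group quantization (illustration only; real GPTQ additionally uses Hessian information to choose the rounding, which this omits):

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128, bits: int = 4):
    """Asymmetric per-group quantization of a flat weight tensor."""
    qmax = 2**bits - 1
    w = w.reshape(-1, group_size)                       # one row per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax                      # per-group scale
    zero = np.round(-w_min / scale)                     # per-group zero-point
    q = np.clip(np.round(w / scale + zero), 0, qmax)    # 4-bit integer codes
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(1024 * 4096).astype(np.float32)     # the [1024, 4096] matrix above
q, scale, zero = quantize_groupwise(w)
err = np.abs(dequantize(q, scale, zero) - w.reshape(-1, 128)).mean()
print(f"groups: {q.shape[0]}, mean abs error: {err:.4f}")
```

Smaller groups mean more scale/zero-point pairs, which is exactly the size vs. accuracy trade-off shown in the table above.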
## Quantization configurations
### Standard 4-bit (recommended)
```python
from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Standard group size
    desc_act=False,     # Faster CUDA kernel
    damp_percent=0.01   # Dampening factor
)
```
Performance:
- Memory: 4× reduction (70B model: 140 GB → 35 GB)
- Accuracy: ~1.5% perplexity increase
- Speed: 3-4× faster than FP16
### Higher compression (3-bit)
```python
config = BaseQuantizeConfig(
    bits=3,             # 3-bit (more compression)
    group_size=128,     # Keep standard group size
    desc_act=True,      # Better accuracy (slower)
    damp_percent=0.01
)
```
Trade-off:
- Memory: ~5× reduction
- Accuracy: ~3% perplexity increase
- Speed: up to 5× faster than FP16 (but less accurate)
### Maximum accuracy (4-bit with small groups)
```python
config = BaseQuantizeConfig(
    bits=4,
    group_size=32,       # Smaller groups (better accuracy)
    desc_act=True,       # Activation reordering
    damp_percent=0.005   # Lower dampening
)
```
Trade-off:
- Memory: ~3.5× reduction (slightly larger than group_size=128)
- Accuracy: ~0.8% perplexity increase (best)
- Speed: 2-3× faster than FP16 (extra kernel overhead)
## Kernel backends
### ExLlamaV2 (default, fastest)
```python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,   # Use the ExLlamaV2 kernel
    exllama_config={"version": 2}
)
```
Performance: 1.5-2× faster than Triton.
### Marlin (Ampere+ GPUs)
```python
# Quantize with Marlin format
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False   # Required for Marlin
)
model.quantize(calibration_data, use_marlin=True)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True   # 2× faster on A100/H100
)
```
**Requirements**:
- NVIDIA Ampere or newer (A100, H100, RTX 40xx)
- Compute capability ≥ 8.0
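A quick way to check the compute-capability requirement before enabling Marlin, using PyTorch's device query (the 8.0 threshold matches the requirement above):

```python
import torch

# Query the compute capability of GPU 0
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if (major, minor) >= (8, 0):
    print("Marlin kernels supported (Ampere or newer)")
else:
    print("Marlin not supported on this GPU; use ExLlamaV2 or the CUDA kernel instead")
```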
### Triton (Linux only)
```python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True   # Linux only
)
```
Performance: 1.2-1.5× faster than the CUDA backend.
## Integration with transformers
### Direct transformers usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
```
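transformers can also produce GPTQ checkpoints directly through its GPTQConfig integration (backed by optimum), without calling AutoGPTQ yourself. A hedged sketch; the model ID here is a small one chosen purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"   # small model, illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize on load using the built-in "c4" calibration dataset
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)
model.save_pretrained("opt-125m-gptq")   # saved in GPTQ format, reloadable as shown above
```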
### QLoRA fine-tuning (GPTQ + LoRA)
```python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# A 70B model is trainable on a single A100 80GB
```
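The fine-tuning loop itself is ordinary PEFT + Trainer code. A minimal sketch; the dataset and hyperparameters are illustrative, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize a small text dataset (illustrative choice)
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda x: tokenizer(x["quote"], truncation=True, max_length=256))

trainer = Trainer(
    model=model,  # the PEFT-wrapped GPTQ model from above
    args=TrainingArguments(
        output_dir="llama2-7b-gptq-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-7b-gptq-lora")  # saves only the LoRA adapter weights
```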
## Performance benchmarks
### Memory reduction
| Model | FP16 | GPTQ 4-bit | Reduction |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | 203 GB | 4× |
Enables:
- 70B on single A100 80GB (vs 2× A100 needed for FP16)
- 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
- 13B on RTX 4090 24GB (vs OOM with FP16)
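The 4× figures follow directly from storing weights in 4 bits instead of 16; a rough estimator of weight memory (per-group FP16 scales approximated as the only overhead, numbers illustrative):

```python
def gptq_weight_memory_gb(params_billion: float, bits: int = 4, group_size: int = 128) -> float:
    """Approximate GPTQ weight memory in GB (ignores activations and KV cache)."""
    weights = params_billion * 1e9 * bits / 8           # packed integer weights
    overhead = params_billion * 1e9 / group_size * 2    # ~one FP16 scale per group
    return (weights + overhead) / 1e9

for size in (7, 13, 70, 405):
    print(f"{size}B params: FP16 ≈ {size * 2:.0f} GB, GPTQ 4-bit ≈ {gptq_weight_memory_gb(size):.1f} GB")
```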
### Inference speed (Llama 2-7B, A100)
| Precision | Tokens/sec | vs FP16 |
|---|---|---|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |
### Accuracy (perplexity on WikiText-2)
| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|---|---|---|---|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |
Excellent quality preservation - less than 2% degradation!
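Perplexity figures like these are typically computed with a sliding window over the raw WikiText-2 test split. A minimal sketch reusing the `model` and `tokenizer` loaded earlier; the window and stride sizes are illustrative:

```python
import torch
from datasets import load_dataset

# One long token stream from the WikiText-2 test split
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_len, stride = 2048, 512
nlls, n_tokens = [], 0
for begin in range(0, enc.input_ids.size(1) - 1, stride):
    end = min(begin + max_len, enc.input_ids.size(1))
    input_ids = enc.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-stride] = -100              # score only the last `stride` tokens
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    n_new = (target_ids != -100).sum().item()
    nlls.append(loss.float() * n_new)
    n_tokens += n_new
    if end == enc.input_ids.size(1):
        break

print(f"Perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens):.2f}")
```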
## Common patterns
### Multi-GPU deployment
```python
# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",                    # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"}     # Limit per GPU
)

# Manual device mapping (keys must match the model's module names)
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(40)},      # First 40 layers on GPU 0
    **{f"model.layers.{i}": 1 for i in range(40, 80)},  # Last 40 layers on GPU 1
    "model.norm": 1,
    "lm_head": 1
}
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)
```
### CPU offloading
```python
# Offload some layers to CPU (for very large models)
# The repo ID below is a placeholder for a very large GPTQ checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-405B-GPTQ",
    device_map="auto",
    max_memory={
        0: "80GB",        # GPU 0
        1: "80GB",        # GPU 1
        2: "80GB",        # GPU 2
        "cpu": "200GB"    # Offload overflow to CPU
    }
)
```
### Batch inference
```python
# Process multiple prompts efficiently
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL"
]

# Llama tokenizers have no pad token by default; reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```
## Finding pre-quantized models
TheBloke on HuggingFace:
- https://huggingface.co/TheBloke
- 1000+ models in GPTQ format
- Multiple group sizes (32, 128)
- Both CUDA and Marlin formats
Search: filter the Hub by library tag (https://huggingface.co/models?library=gptq) or search for "GPTQ" in the model name.
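Programmatic search is also possible with huggingface_hub's `list_models`; a small sketch (the sort and limit values are illustrative):

```python
from huggingface_hub import HfApi

api = HfApi()
# Most-downloaded repos with "GPTQ" in the name
for m in api.list_models(search="GPTQ", sort="downloads", direction=-1, limit=10):
    print(m.id)
```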
**Download**:
```python
from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0"
)
```
## Supported models
- LLaMA family: Llama 2, Llama 3, Code Llama
- Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: V2, V3
- Phi: Phi-2, Phi-3
- Yi, Falcon, BLOOM, OPT
- 100+ models on HuggingFace
## References
- Calibration Guide - Dataset selection, quantization process, quality optimization
- Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM
- Troubleshooting - Common issues, performance optimization
## Resources
- GitHub: https://github.com/AutoGPTQ/AutoGPTQ
- Paper: GPTQ: Accurate Post-Training Quantization (arXiv:2210.17323)
- Models: https://huggingface.co/models?library=gptq
- Discord: https://discord.gg/autogptq