# GPTQ (Generative Pre-trained Transformer Quantization)
Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.
## When to use GPTQ
Use GPTQ when:
- Need to fit large models (70B+) on limited GPU memory
- Want 4× memory reduction with <2% accuracy loss
- Deploying on consumer GPUs (RTX 4090, 3090)
- Need faster inference (3-4× speedup vs FP16)
Use AWQ instead when:
- Need slightly better accuracy (<1% loss)
- Have newer GPUs (Ampere, Ada)
- Want Marlin kernel support (2× faster on some GPUs)
Use bitsandbytes instead when:
- Need simple integration with transformers
- Want 8-bit quantization (less compression, better quality)
- Don't need pre-quantized model files
## Quick start
### Installation
```bash
# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate
```
### Load pre-quantized model
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```
### Quantize your own model
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Group size (recommended: 128)
    desc_act=False,     # Activation order (False for the fast CUDA kernel)
    damp_percent=0.01   # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data (128 samples, up to 512 tokens each)
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512, return_tensors="pt")
    for example in dataset.take(128)
]

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")
```
## Group-wise quantization
How GPTQ works:
- Group weights: Divide each weight matrix into groups (typically 128 elements)
- Quantize per-group: Each group has its own scale/zero-point
- Minimize error: Uses Hessian information to minimize quantization error
- Result: 4-bit weights with near-FP16 accuracy
Group size trade-off:
| Group Size | Model Size | Accuracy | Speed | Recommendation |
|---|---|---|---|---|
| -1 (per-channel, no groups) | Smallest | Lowest | Fastest | Not recommended |
| 32 | Larger | Best | Slower | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Smallest | Lowest | Fastest | Rarely worth it |
Example:
Weight matrix: [1024, 4096] = 4.2M elements
Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group: its own scale and zero-point
- Result: finer granularity → better accuracy
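To make the per-group scale and zero-point idea concrete, here is a minimal NumPy sketch of asymmetric 4-bit group quantization (illustration only; real GPTQ additionally uses Hessian information to choose the rounding, which this omits):

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128, bits: int = 4):
    """Asymmetric per-group quantization of a flat weight tensor."""
    qmax = 2**bits - 1
    w = w.reshape(-1, group_size)                       # one row per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax                      # per-group scale
    zero = np.round(-w_min / scale)                     # per-group zero-point
    q = np.clip(np.round(w / scale + zero), 0, qmax)    # 4-bit integer codes
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(1024 * 4096).astype(np.float32)     # the [1024, 4096] matrix above
q, scale, zero = quantize_groupwise(w)
err = np.abs(dequantize(q, scale, zero) - w.reshape(-1, 128)).mean()
print(f"groups: {q.shape[0]}, mean abs error: {err:.4f}")
```

Smaller groups mean more scale/zero-point pairs, which is exactly the size vs. accuracy trade-off shown in the table above.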
## Quantization configurations
### Standard 4-bit (recommended)
```python
from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Standard group size
    desc_act=False,     # Faster CUDA kernel
    damp_percent=0.01   # Dampening factor
)
```
Performance:
- Memory: 4× reduction (70B model: 140 GB → 35 GB)
- Accuracy: ~1.5% perplexity increase
- Speed: 3-4× faster than FP16
### Higher compression (3-bit)
```python
config = BaseQuantizeConfig(
    bits=3,             # 3-bit (more compression)
    group_size=128,     # Keep standard group size
    desc_act=True,      # Better accuracy (slower)
    damp_percent=0.01
)
```
Trade-off:
- Memory: ~5× reduction
- Accuracy: ~3% perplexity increase
- Speed: up to 5× faster than FP16 (but less accurate)
### Maximum accuracy (4-bit with small groups)
```python
config = BaseQuantizeConfig(
    bits=4,
    group_size=32,       # Smaller groups (better accuracy)
    desc_act=True,       # Activation reordering
    damp_percent=0.005   # Lower dampening
)
```
Trade-off:
- Memory: ~3.5× reduction (slightly larger than group_size=128)
- Accuracy: ~0.8% perplexity increase (best)
- Speed: 2-3× faster than FP16 (extra kernel overhead)
## Kernel backends
### ExLlamaV2 (default, fastest)
```python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,   # Use the ExLlamaV2 kernel
    exllama_config={"version": 2}
)
```
Performance: 1.5-2× faster than Triton.
### Marlin (Ampere+ GPUs)
```python
# Quantize with Marlin format
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False   # Required for Marlin
)
model.quantize(calibration_data, use_marlin=True)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True   # 2× faster on A100/H100
)
```
**Requirements**:
- NVIDIA Ampere or newer (A100, H100, RTX 40xx)
- Compute capability ≥ 8.0
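A quick way to check the compute-capability requirement before enabling Marlin, using PyTorch's device query (the 8.0 threshold matches the requirement above):

```python
import torch

# Query the compute capability of GPU 0
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if (major, minor) >= (8, 0):
    print("Marlin kernels supported (Ampere or newer)")
else:
    print("Marlin not supported on this GPU; use ExLlamaV2 or the CUDA kernel instead")
```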
### Triton (Linux only)
```python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True   # Linux only
)
```
Performance: 1.2-1.5× faster than the CUDA backend.
## Integration with transformers
### Direct transformers usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
```
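transformers can also produce GPTQ checkpoints directly through its GPTQConfig integration (backed by optimum), without calling AutoGPTQ yourself. A hedged sketch; the model ID here is a small one chosen purely for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"   # small model, illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize on load using the built-in "c4" calibration dataset
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)
model.save_pretrained("opt-125m-gptq")   # saved in GPTQ format, reloadable as shown above
```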
### QLoRA fine-tuning (GPTQ + LoRA)
```python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# A 70B model is trainable on a single A100 80GB
```
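The fine-tuning loop itself is ordinary PEFT + Trainer code. A minimal sketch; the dataset and hyperparameters are illustrative, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize a small text dataset (illustrative choice)
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda x: tokenizer(x["quote"], truncation=True, max_length=256))

trainer = Trainer(
    model=model,  # the PEFT-wrapped GPTQ model from above
    args=TrainingArguments(
        output_dir="llama2-7b-gptq-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-7b-gptq-lora")  # saves only the LoRA adapter weights
```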
## Performance benchmarks
### Memory reduction
| Model | FP16 | GPTQ 4-bit | Reduction |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | 203 GB | 4× |
Enables:
- 70B on single A100 80GB (vs 2× A100 needed for FP16)
- 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
- 13B on RTX 4090 24GB (vs OOM with FP16)
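The 4× figures follow directly from storing weights in 4 bits instead of 16; a rough estimator of weight memory (per-group FP16 scales approximated as the only overhead, numbers illustrative):

```python
def gptq_weight_memory_gb(params_billion: float, bits: int = 4, group_size: int = 128) -> float:
    """Approximate GPTQ weight memory in GB (ignores activations and KV cache)."""
    weights = params_billion * 1e9 * bits / 8           # packed integer weights
    overhead = params_billion * 1e9 / group_size * 2    # ~one FP16 scale per group
    return (weights + overhead) / 1e9

for size in (7, 13, 70, 405):
    print(f"{size}B params: FP16 ≈ {size * 2:.0f} GB, GPTQ 4-bit ≈ {gptq_weight_memory_gb(size):.1f} GB")
```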
### Inference speed (Llama 2-7B, A100)
| Precision | Tokens/sec | vs FP16 |
|---|---|---|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |
### Accuracy (perplexity on WikiText-2)
| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|---|---|---|---|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |
Excellent quality preservation - less than 2% degradation!
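Perplexity figures like these are typically computed with a sliding window over the raw WikiText-2 test split. A minimal sketch reusing the `model` and `tokenizer` loaded earlier; the window and stride sizes are illustrative:

```python
import torch
from datasets import load_dataset

# One long token stream from the WikiText-2 test split
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_len, stride = 2048, 512
nlls, n_tokens = [], 0
for begin in range(0, enc.input_ids.size(1) - 1, stride):
    end = min(begin + max_len, enc.input_ids.size(1))
    input_ids = enc.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-stride] = -100              # score only the last `stride` tokens
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    n_new = (target_ids != -100).sum().item()
    nlls.append(loss.float() * n_new)
    n_tokens += n_new
    if end == enc.input_ids.size(1):
        break

print(f"Perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens):.2f}")
```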
## Common patterns
### Multi-GPU deployment
```python
# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",                    # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"}     # Limit per GPU
)

# Manual device mapping (keys must match the model's module names)
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(40)},      # First 40 layers on GPU 0
    **{f"model.layers.{i}": 1 for i in range(40, 80)},  # Last 40 layers on GPU 1
    "model.norm": 1,
    "lm_head": 1
}
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)
```
### CPU offloading
```python
# Offload some layers to CPU (for very large models)
# The repo ID below is a placeholder for a very large GPTQ checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-405B-GPTQ",
    device_map="auto",
    max_memory={
        0: "80GB",        # GPU 0
        1: "80GB",        # GPU 1
        2: "80GB",        # GPU 2
        "cpu": "200GB"    # Offload overflow to CPU
    }
)
```
### Batch inference
```python
# Process multiple prompts efficiently
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL"
]

# Llama tokenizers have no pad token by default; reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```
## Finding pre-quantized models
TheBloke on HuggingFace:
- https://huggingface.co/TheBloke
- 1000+ models in GPTQ format
- Multiple group sizes (32, 128)
- Both CUDA and Marlin formats
Search: filter the Hub by library tag (https://huggingface.co/models?library=gptq) or search for "GPTQ" in the model name.
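Programmatic search is also possible with huggingface_hub's `list_models`; a small sketch (the sort and limit values are illustrative):

```python
from huggingface_hub import HfApi

api = HfApi()
# Most-downloaded repos with "GPTQ" in the name
for m in api.list_models(search="GPTQ", sort="downloads", direction=-1, limit=10):
    print(m.id)
```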
**Download**:
```python
from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0"
)
```
## Supported models
- LLaMA family: Llama 2, Llama 3, Code Llama
- Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: V2, V3
- Phi: Phi-2, Phi-3
- Yi, Falcon, BLOOM, OPT
- 100+ models on HuggingFace
## References
- Calibration Guide - Dataset selection, quantization process, quality optimization
- Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM
- Troubleshooting - Common issues, performance optimization
## Resources
- GitHub: https://github.com/AutoGPTQ/AutoGPTQ
- Paper: GPTQ: Accurate Post-Training Quantization (arXiv:2210.17323)
- Models: https://huggingface.co/models?library=gptq
- Discord: https://discord.gg/autogptq