# AWQ (Activation-aware Weight Quantization)
4-bit quantization that preserves salient weights based on activation patterns, achieving roughly 3x inference speedup over FP16 with minimal accuracy loss.
## When to use AWQ
Use AWQ when:
- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ generalizes better)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support
Use GPTQ instead when:
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with ExLlamaV2 backend specifically
- Have older GPUs without Marlin support
Use bitsandbytes instead when:
- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration
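The decision rules above can be collapsed into a toy chooser. This is purely illustrative — the function and parameter names are invented here and belong to none of these libraries:

```python
def pick_quantization(accuracy_sensitive, need_finetune_qlora, has_marlin_gpu, use_vllm):
    """Toy encoding of the decision rules above (illustrative only)."""
    if need_finetune_qlora:
        return "bitsandbytes"  # on-the-fly 4-bit, QLoRA-compatible
    if use_vllm or has_marlin_gpu or accuracy_sensitive:
        return "awq"           # best accuracy/speed trade-off for serving
    return "gptq"              # widest ecosystem compatibility
```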
## Quick start

### Installation
```bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]
```

**Requirements**: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

### Load pre-quantized model
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

### Generate
```python
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantize your own model
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,   # Use zero-point quantization
    "q_group_size": 128,  # Group size (128 recommended)
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # GEMM for batch, GEMV for single-token
}

# Quantize (uses the pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```

**Timing**: ~10-15 min for 7B models, ~1 hour for 70B models.
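Numerically, `zero_point: True`, `q_group_size: 128`, and `w_bit: 4` mean that each group of 128 weights is mapped onto a 0-15 integer grid with its own scale and zero-point. A minimal NumPy sketch of that scheme — illustrative only, not AutoAWQ's actual packing kernels:

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Asymmetric (zero-point) quantization of one weight group."""
    qmax = 2**n_bits - 1                # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax  # step size for this group
    zero = np.round(-w.min() / scale)   # integer zero-point offset
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q.astype(np.uint8), scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate weights from packed integers."""
    return (q.astype(np.float32) - zero) * scale

# Each row of a weight matrix is split into groups of q_group_size=128;
# every group stores its own (scale, zero) pair alongside the 4-bit ints.
w = np.random.randn(128).astype(np.float32)
q, scale, zero = quantize_group(w)
err = np.abs(dequantize_group(q, scale, zero) - w).max()
assert err <= scale / 2 + 1e-6  # error bounded by half a quantization step
```

Smaller groups give each scale less range to cover (better accuracy) at the cost of more stored scales, which is why 128 is the usual compromise.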
## AWQ vs GPTQ vs bitsandbytes
| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |
**Key insight**: AWQ assumes that not all weights are equally important. It protects the ~1% of salient weights identified from activation patterns, reducing quantization error without the overhead of mixed-precision storage.
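That idea can be sketched in a few lines: rank input channels by average activation magnitude, then scale the most salient ones up before quantization (folding the inverse scale into the preceding op), so they lose less precision. A toy NumPy illustration — the function names are invented here, and the real AWQ algorithm grid-searches per-channel scales rather than using a fixed factor:

```python
import numpy as np

def find_salient_channels(activations, top_frac=0.01):
    """Rank input channels by mean |activation|; the top ~1% are 'salient'."""
    importance = np.abs(activations).mean(axis=0)  # per-channel magnitude
    k = max(1, int(top_frac * importance.size))
    return np.argsort(importance)[-k:]             # indices of top-k channels

def protect_and_scale(weight, salient_idx, s=2.0):
    """Scale salient input channels of W up by s before quantization.

    Since W @ x == (W * s) @ (x / s) per channel, the inverse scale folds
    into the previous layer; the scaled-up salient weights occupy more of
    the quantization grid and therefore lose less precision.
    """
    w = weight.copy()
    w[:, salient_idx] *= s
    return w

acts = np.random.randn(1024, 4096)  # calibration activations [tokens, channels]
w = np.random.randn(4096, 4096)     # layer weight [out_features, in_features]
w_scaled = protect_and_scale(w, find_salient_channels(acts))
```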
## Kernel backends

### GEMM (default, batch inference)
```python
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}
```

### GEMV (single-token generation)
```python
quant_config = {
    "version": "GEMV"  # ~20% faster for batch_size=1
}
```

**Limitation**: supports batch size 1 only and is not well suited to long contexts.
### Marlin (Ampere+ GPUs)
```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # ~2x faster on A100/H100
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)
```

**Requirements**: Compute Capability 8.0+ (A100, H100, RTX 40xx)
### ExLlamaV2 (AMD compatible)
```python
config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill; AMD GPU support
)
```

## HuggingFace Transformers integration
### Direct loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
```

### Fused modules (recommended)
```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fused modules
    do_fuse=True           # Enable fused attention/MLP
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)
```

**Note**: Fused modules cannot be combined with FlashAttention-2.

## vLLM integration
```python
from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
```

## Performance benchmarks
### Memory reduction
| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
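The weight figures follow from simple arithmetic: 4-bit weights take 0.5 bytes per parameter, plus a small overhead for the per-group FP16 scale and zero-point (a fraction of a percent at group size 128). A back-of-envelope estimate — an assumed formula, not AutoAWQ's exact accounting; the gap to the measured totals above is runtime overhead such as embeddings, unquantized layers, and activations:

```python
def fp16_weight_bytes(n_params):
    """FP16 baseline: 2 bytes per parameter."""
    return n_params * 2

def awq_weight_bytes(n_params, w_bit=4, group_size=128):
    """Rough AWQ weight memory: packed ints + FP16 scale/zero per group."""
    packed = n_params * w_bit / 8           # 0.5 bytes/param at 4-bit
    overhead = (n_params / group_size) * 4  # 2-byte scale + 2-byte zero-point
    return packed + overhead

n = 7e9  # ~7B parameters
print(f"FP16 weights: {fp16_weight_bytes(n)/1e9:.1f} GB")    # 14.0 GB
print(f"AWQ 4-bit weights: {awq_weight_bytes(n)/1e9:.1f} GB")  # 3.7 GB
```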
### Inference speed (RTX 4090)
| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
### Accuracy (perplexity)
| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
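The degradation column is just the relative perplexity increase; a quick check of the table's arithmetic:

```python
def ppl_degradation(fp16_ppl, awq_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return 100 * (awq_ppl - fp16_ppl) / fp16_ppl

rows = {
    "Llama 3 8B": (8.20, 8.48),
    "Mistral 7B": (5.25, 5.42),
    "Qwen2 72B": (4.85, 4.95),
}
for name, (fp16, awq) in rows.items():
    print(f"{name}: +{ppl_degradation(fp16, awq):.1f}%")
# Llama 3 8B: +3.4%, Mistral 7B: +3.2%, Qwen2 72B: +2.1%
```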
## Custom calibration data
```python
# Use a custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",   # Or a custom list of strings
    max_calib_samples=256,   # More samples = better accuracy
    max_calib_seq_len=512    # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
```
## Multi-GPU deployment
```python
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
```

## Supported models
35+ architectures including:
- Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
- Qwen: Qwen, Qwen2, Qwen2.5-VL
- Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
- Multimodal: LLaVA, LLaVA-Next, Qwen2-VL
## Common issues
**CUDA OOM during quantization**:
```python
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
```

**Slow inference**:
```python
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
```

**AMD GPU support**:
```python
# Use the ExLlama backend
config = AwqConfig(bits=4, version="exllama")
```
undefinedDeprecation notice
弃用通知
AutoAWQ is officially deprecated. For new projects, consider:
- vLLM llm-compressor: https://github.com/vllm-project/llm-compressor
- MLX-LM: For Mac devices with Apple Silicon
Existing quantized models remain usable.
## References
- Paper: AWQ: Activation-aware Weight Quantization (arXiv:2306.00978) - MLSys 2024 Best Paper
- GitHub: https://github.com/casper-hansen/AutoAWQ
- MIT Han Lab: https://github.com/mit-han-lab/llm-awq
- Models: https://huggingface.co/models?library=awq