awq-quantization


AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

When to use AWQ

Use AWQ when:
  • Need 4-bit quantization with <5% accuracy loss
  • Deploying instruction-tuned or chat models (AWQ generalizes better)
  • Want ~2.5-3x inference speedup over FP16
  • Using vLLM for production serving
  • Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support
Use GPTQ instead when:
  • Need maximum ecosystem compatibility (more tools support GPTQ)
  • Working with ExLlamaV2 backend specifically
  • Have older GPUs without Marlin support
Use bitsandbytes instead when:
  • Need zero calibration overhead (quantize on-the-fly)
  • Want to fine-tune with QLoRA
  • Prefer simpler integration
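The decision points above can be condensed into a small helper. This is a purely illustrative sketch — the function name, parameters, and thresholds are ours, not from any library:

```python
def pick_quantization(accuracy_budget_pct: float,
                      needs_finetune: bool,
                      calibration_free: bool,
                      has_marlin_gpu: bool) -> str:
    """Toy decision helper mirroring the guidance above."""
    if needs_finetune or calibration_free:
        return "bitsandbytes"   # QLoRA fine-tuning / on-the-fly quantization
    if accuracy_budget_pct <= 5 and has_marlin_gpu:
        return "awq"            # <5% accuracy loss, Marlin kernels on Ampere+
    return "gptq"               # broadest ecosystem compatibility
```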

Quick start

Installation

```bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]
```

**Requirements**: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

Load pre-quantized model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Generate

```python
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantize your own model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,   # Use zero-point quantization
    "q_group_size": 128,  # Group size (128 recommended)
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # GEMM for batch, GEMV for single-token
}

# Quantize (uses the pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```

**Timing**: ~10-15 min for 7B models, ~1 hour for 70B models.
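The `zero_point` and `q_group_size` settings can be illustrated with a toy group-wise quantizer — a numpy sketch of the idea, not AutoAWQ's actual packed-kernel implementation:

```python
import numpy as np

def quantize_group(w, w_bit=4):
    # Asymmetric (zero-point) quantization of a single group of weights
    qmax = 2**w_bit - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = np.round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)  # stored low-bit codes
    return (q - zero) * scale                          # dequantized values

def groupwise_quantize(w, group_size=128, w_bit=4):
    # Each contiguous block of `group_size` weights gets its own scale and
    # zero-point: smaller groups cost more metadata but quantize more accurately.
    return np.concatenate([
        quantize_group(w[i:i + group_size], w_bit)
        for i in range(0, len(w), group_size)
    ])
```

With `group_size=128` and 4-bit codes, the per-weight reconstruction error is bounded by roughly one quantization step of that weight's group.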

AWQ vs GPTQ vs bitsandbytes

| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |

Key insight: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.
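That idea can be sketched in a few lines of numpy: scale up the weight channels that see large activations before quantizing, and fold the inverse scale into the preceding operation at runtime. This is a deliberate simplification — real AWQ searches the scaling exponent per layer and uses group-wise quantization:

```python
import numpy as np

def awq_scale_sketch(W, X, alpha=0.5):
    # W: [out_features, in_features] weights
    # X: [tokens, in_features] calibration activations
    s_x = np.abs(X).mean(axis=0)             # per-input-channel activation magnitude
    scales = np.maximum(s_x, 1e-5) ** alpha  # salient channels get larger scales
    W_scaled = W * scales                    # protect salient columns before quantizing
    step = np.abs(W_scaled).max() / 7        # toy 4-bit symmetric quantization
    W_q = np.round(W_scaled / step) * step
    # Runtime equivalence (exact before quantization):
    #   (X / scales) @ (W * scales).T == X @ W.T
    return W_q, scales
```

Because the inverse scale is absorbed into the previous layer's output, the transformation is mathematically free; only the quantization step introduces error, and that error is smallest on the channels that matter most.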

Kernel backends

GEMM (default, batch inference)

```python
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}
```

GEMV (single-token generation)

```python
quant_config = {
    "version": "GEMV"  # ~20% faster for batch_size=1
}
```

Limitation: supports batch size 1 only and is not well suited to large contexts.

Marlin (Ampere+ GPUs)

```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # ~2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)
```

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

```python
config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)
```

HuggingFace Transformers integration

Direct loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
```

Fused modules (recommended)

```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)
```

Note: fused modules cannot be combined with FlashAttention-2.

vLLM integration

```python
from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
```

Performance benchmarks

Memory reduction

| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
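These numbers are roughly reproducible by back-of-envelope arithmetic: 4-bit weights cost 0.5 bytes per parameter, plus per-group scale and zero-point metadata; the remainder is fp16 embeddings, norms, and runtime buffers. A rough sketch (our own approximation, not a published formula):

```python
def awq_weight_gb(n_params, w_bit=4, group_size=128):
    # Packed low-bit weight storage
    weight_bytes = n_params * w_bit / 8
    # Approximation: one fp16 scale + one fp16 zero-point per group
    meta_bytes = (n_params / group_size) * 2 * 2
    return (weight_bytes + meta_bytes) / 1e9

# For a 7B model this gives ~3.7 GB of quantized weights; the gap to the
# ~5.5 GB observed in practice is fp16 embeddings, activations, and CUDA overhead.
```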

Inference speed (RTX 4090)

| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |

Accuracy (perplexity)

| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |

Custom calibration data

```python
# Use a custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",   # Or a custom list of strings
    max_calib_samples=256,   # More samples = better accuracy
    max_calib_seq_len=512    # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
```

Multi-GPU deployment

```python
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
```

Supported models

35+ architectures including:
  • Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
  • Qwen: Qwen, Qwen2, Qwen2.5-VL
  • Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
  • Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

**CUDA OOM during quantization**:
```python
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
```

**Slow inference**:
```python
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
```

**AMD GPU support**:
```python
# Use the ExLlama backend
config = AwqConfig(bits=4, version="exllama")
```

Deprecation notice

AutoAWQ is officially deprecated; for new projects, consider an actively maintained alternative. Existing quantized models remain usable.

References
