awq-quantization


AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

When to use AWQ

Use AWQ when:
  • Need 4-bit quantization with <5% accuracy loss
  • Deploying instruction-tuned or chat models (AWQ generalizes better)
  • Want ~2.5-3x inference speedup over FP16
  • Using vLLM for production serving
  • Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support
Use GPTQ instead when:
  • Need maximum ecosystem compatibility (more tools support GPTQ)
  • Working with ExLlamaV2 backend specifically
  • Have older GPUs without Marlin support
Use bitsandbytes instead when:
  • Need zero calibration overhead (quantize on-the-fly)
  • Want to fine-tune with QLoRA
  • Prefer simpler integration
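The decision points above can be condensed into a small helper. This is a purely illustrative sketch — the function name, parameters, and thresholds are ours, not from any library:

```python
def pick_quantization(accuracy_budget_pct: float,
                      needs_finetune: bool,
                      calibration_free: bool,
                      has_marlin_gpu: bool) -> str:
    """Toy decision helper mirroring the guidance above."""
    if needs_finetune or calibration_free:
        return "bitsandbytes"   # QLoRA fine-tuning / on-the-fly quantization
    if accuracy_budget_pct <= 5 and has_marlin_gpu:
        return "awq"            # <5% accuracy loss, Marlin kernels on Ampere+
    return "gptq"               # broadest ecosystem compatibility
```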

Quick start

Installation

```bash
# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]
```

**Requirements**: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

Load pre-quantized model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Generate

```python
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantize your own model

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,   # Use zero-point quantization
    "q_group_size": 128,  # Group size (128 recommended)
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # GEMM for batch, GEMV for single-token
}

# Quantize (uses the pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
```

**Timing**: ~10-15 min for 7B models, ~1 hour for 70B models.
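The `zero_point` and `q_group_size` settings can be illustrated with a toy group-wise quantizer — a numpy sketch of the idea, not AutoAWQ's actual packed-kernel implementation:

```python
import numpy as np

def quantize_group(w, w_bit=4):
    # Asymmetric (zero-point) quantization of a single group of weights
    qmax = 2**w_bit - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = np.round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)  # stored low-bit codes
    return (q - zero) * scale                          # dequantized values

def groupwise_quantize(w, group_size=128, w_bit=4):
    # Each contiguous block of `group_size` weights gets its own scale and
    # zero-point: smaller groups cost more metadata but quantize more accurately.
    return np.concatenate([
        quantize_group(w[i:i + group_size], w_bit)
        for i in range(0, len(w), group_size)
    ])
```

With `group_size=128` and 4-bit codes, the per-weight reconstruction error is bounded by roughly one quantization step of that weight's group.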

AWQ vs GPTQ vs bitsandbytes

| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |

Key insight: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.
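That idea can be sketched in a few lines of numpy: scale up the weight channels that see large activations before quantizing, and fold the inverse scale into the preceding operation at runtime. This is a deliberate simplification — real AWQ searches the scaling exponent per layer and uses group-wise quantization:

```python
import numpy as np

def awq_scale_sketch(W, X, alpha=0.5):
    # W: [out_features, in_features] weights
    # X: [tokens, in_features] calibration activations
    s_x = np.abs(X).mean(axis=0)             # per-input-channel activation magnitude
    scales = np.maximum(s_x, 1e-5) ** alpha  # salient channels get larger scales
    W_scaled = W * scales                    # protect salient columns before quantizing
    step = np.abs(W_scaled).max() / 7        # toy 4-bit symmetric quantization
    W_q = np.round(W_scaled / step) * step
    # Runtime equivalence (exact before quantization):
    #   (X / scales) @ (W * scales).T == X @ W.T
    return W_q, scales
```

Because the inverse scale is absorbed into the previous layer's output, the transformation is mathematically free; only the quantization step introduces error, and that error is smallest on the channels that matter most.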

Kernel backends

GEMM (default, batch inference)

```python
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}
```

GEMV (single-token generation)

```python
quant_config = {
    "version": "GEMV"  # ~20% faster for batch_size=1
}
```

Limitation: supports batch size 1 only and is not well suited to large contexts.

Marlin (Ampere+ GPUs)

```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # ~2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)
```

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

```python
config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)
```

HuggingFace Transformers integration

Direct loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
```

Fused modules (recommended)

```python
from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)
```

Note: fused modules cannot be combined with FlashAttention-2.

vLLM integration

```python
from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
```

Performance benchmarks

Memory reduction

| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
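These numbers are roughly reproducible by back-of-envelope arithmetic: 4-bit weights cost 0.5 bytes per parameter, plus per-group scale and zero-point metadata; the remainder is fp16 embeddings, norms, and runtime buffers. A rough sketch (our own approximation, not a published formula):

```python
def awq_weight_gb(n_params, w_bit=4, group_size=128):
    # Packed low-bit weight storage
    weight_bytes = n_params * w_bit / 8
    # Approximation: one fp16 scale + one fp16 zero-point per group
    meta_bytes = (n_params / group_size) * 2 * 2
    return (weight_bytes + meta_bytes) / 1e9

# For a 7B model this gives ~3.7 GB of quantized weights; the gap to the
# ~5.5 GB observed in practice is fp16 embeddings, activations, and CUDA overhead.
```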

Inference speed (RTX 4090)

| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |

Accuracy (perplexity)

| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |

Custom calibration data

```python
# Use a custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",   # Or a custom list of strings
    max_calib_samples=256,   # More samples = better accuracy
    max_calib_seq_len=512    # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
```

Multi-GPU deployment

```python
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
```

Supported models

35+ architectures including:
  • Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
  • Qwen: Qwen, Qwen2, Qwen2.5-VL
  • Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
  • Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

**CUDA OOM during quantization**:
```python
# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)
```

**Slow inference**:
```python
# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
```

**AMD GPU support**:
```python
# Use the ExLlama backend
config = AwqConfig(bits=4, version="exllama")
```

Deprecation notice

AutoAWQ is officially deprecated; for new projects, consider an actively maintained alternative. Existing quantized models remain usable.

References
