# HQQ - Half-Quadratic Quantization
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
## When to use HQQ
Use HQQ when:
- You have no calibration data (no dataset needed)
- You need fast quantization (minutes rather than hours for GPTQ/AWQ)
- You are deploying with vLLM or HuggingFace Transformers
- You are fine-tuning quantized models with LoRA/PEFT
- You are experimenting with extreme quantization (2-bit, 1-bit)
Key advantages:
- No calibration: Quantize any model instantly without sample data
- Multiple backends: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- Flexible precision: 8/4/3/2/1-bit with configurable group sizes
- Framework integration: Native HuggingFace and vLLM support
- PEFT compatible: Fine-tune quantized models with LoRA
Use alternatives instead:
- AWQ: calibration-based accuracy for production serving
- GPTQ: maximum accuracy when calibration data is available
- bitsandbytes: simple 8-bit/4-bit without custom backends
- llama.cpp/GGUF: CPU inference, Apple Silicon deployment
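The practical impact of the bit-width and group-size knobs above can be estimated with simple arithmetic. A minimal sketch, assuming one fp16 scale and one fp16 zero point stored per group (32 bits of metadata per group) and ignoring non-quantized layers such as embeddings:

```python
def quantized_size_gb(n_params: float, nbits: int, group_size: int,
                      meta_bits: int = 32) -> float:
    """Approximate weight storage after group-wise quantization.

    meta_bits: per-group overhead (assumed fp16 scale + fp16 zero point).
    """
    effective_bits = nbits + meta_bits / group_size
    return n_params * effective_bits / 8 / 1e9

# An 8B-parameter model at common HQQ settings
for nbits, gs in [(8, 64), (4, 64), (3, 64), (2, 16)]:
    print(f"{nbits}-bit, group_size={gs}: {quantized_size_gb(8e9, nbits, gs):.2f} GB")
```

Note the trade-off this exposes: 2-bit with group_size=16 (4.0 effective bits per weight) actually stores more than 3-bit with group_size=64 (3.5 effective bits), because small groups multiply the metadata overhead.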
## Quick start
### Installation
```bash
pip install hqq
```

With specific backend:

```bash
pip install hqq[torch]    # PyTorch backend
pip install hqq[torchao]  # TorchAO int4 backend
pip install hqq[bitblas]  # BitBlas backend
pip install hqq[marlin]   # Marlin backend
```
### Basic quantization
```python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch.nn as nn

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,        # 4-bit quantization
    group_size=64,  # Group size for quantization
    axis=1          # Quantize along output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use normally (input_tensor: a (batch, 4096) tensor on the layer's device/dtype)
output = hqq_linear(input_tensor)
```
### Quantize a full model with HuggingFace
```python
from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)

# Model is quantized and ready to use
```
## Core concepts
### Quantization configuration
HQQ uses `BaseQuantizeConfig` to define quantization parameters:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,        # Bits per weight (1-8)
    group_size=64,  # Weights per quantization group
    axis=1          # 0=input dim, 1=output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups for low-bit
    axis=1
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```
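To make `nbits`, `group_size`, and the per-group scale/zero-point layout concrete, here is a minimal round-to-nearest group quantizer in pure Python. This is purely illustrative: HQQ itself optimizes the scale and zero point with a half-quadratic solver rather than plain min/max, but the stored artifacts (integer weights plus one scale and zero point per group) have the same shape:

```python
def quantize_group(weights, nbits=4):
    """Uniform affine quantization of one weight group (min/max, illustrative)."""
    qmax = 2 ** nbits - 1
    wmin, wmax = min(weights), max(weights)
    scale = (wmax - wmin) / qmax or 1e-8  # guard against all-equal groups
    zero = round(-wmin / scale)
    # Map each weight to an integer code in [0, qmax]
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate weights from integer codes."""
    return [(qi - zero) * scale for qi in q]

group = [round(0.1 * i, 1) for i in range(-8, 8)]  # one group of 16 weights
q, scale, zero = quantize_group(group, nbits=4)
recovered = dequantize_group(q, scale, zero)
max_err = max(abs(w - r) for w, r in zip(group, recovered))
```

Rounding bounds the per-weight error by `scale / 2`, which is why fewer bits (larger `scale`) or wider groups (larger min/max range) cost accuracy.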
### HQQLinear layer
The core quantized layer that replaces `nn.Linear`:

```python
from hqq.core.quantize import HQQLinear
import torch

# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access quantized weights
W_q = hqq_layer.W_q      # Quantized weights
scale = hqq_layer.scale  # Scale factors
zero = hqq_layer.zero    # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```
### Backends
HQQ supports multiple inference backends for different hardware:

```python
from hqq.core.quantize import HQQLinear

# Available backends
backends = [
    "pytorch",          # Pure PyTorch (default)
    "pytorch_compile",  # torch.compile optimized
    "aten",             # Custom CUDA kernels
    "torchao_int4",     # TorchAO int4 matmul
    "gemlite",          # GemLite CUDA kernels
    "bitblas",          # BitBlas optimized
    "marlin",           # Marlin 4-bit kernels
]

# Set backend globally
HQQLinear.set_backend("torchao_int4")

# Or per layer
hqq_layer.set_backend("marlin")
```
**Backend selection guide:**

| Backend | Best For | Requirements |
|---------|----------|--------------|
| pytorch | Compatibility | Any GPU |
| pytorch_compile | Moderate speedup | torch>=2.0 |
| aten | Good balance | CUDA GPU |
| torchao_int4 | 4-bit inference | torchao installed |
| marlin | Maximum 4-bit speed | Ampere+ GPU |
| bitblas | Flexible bit-widths | bitblas installed |

## HuggingFace integration
### Load pre-quantized models
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load HQQ-quantized model from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
### Quantize and save
```python
from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")

# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```
### Mixed precision quantization
```python
from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type:
# attention layers at higher precision, MLP layers lower for memory savings
config = HqqConfig(
    nbits=4,
    group_size=64,
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},
        "mlp": {"nbits": 2, "group_size": 32}
    }
)
```
## vLLM integration
### Serve HQQ models with vLLM
```python
from vllm import LLM, SamplingParams

# Load HQQ-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16"
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
```
### vLLM with a custom HQQ config
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    quantization="hqq",
    quantization_config={
        "nbits": 4,
        "group_size": 64
    }
)
```

## PEFT/LoRA fine-tuning
### Fine-tune quantized models
```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train normally with Trainer or a custom loop
```
### QLoRA-style training
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./hqq-lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)
trainer.train()
```

## Quantization workflows
### Workflow 1: Quick model compression
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)

# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
```
### Workflow 2: Optimize for inference speed
```python
import time

import torch
from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Set fast backend
HQQLinear.set_backend("marlin")  # or "torchao_int4"

# 3. Compile for additional speedup
model = torch.compile(model)

# 4. Benchmark
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
```
## Best practices
- Start with 4-bit: Best quality/size tradeoff for most models
- Use group_size=64: Good balance; smaller for extreme quantization
- Choose backend wisely: Marlin for 4-bit Ampere+, TorchAO for flexibility
- Verify quality: Always test generation quality after quantization
- Mixed precision: Keep attention at higher precision, compress MLP more
- PEFT training: Use LoRA r=16-32 for good fine-tuning results
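On the last point, LoRA's trainable-parameter overhead is easy to estimate. A sketch for a square 4096x4096 projection (Llama-8B's hidden size); `lora_params` is a hypothetical helper written here for illustration, not a PEFT API:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) + B (d_out x r)."""
    return r * d_in + d_out * r

d = 4096  # hidden size of a Llama-8B projection
for r in (16, 32):
    added = lora_params(d, d, r)
    frac = added / (d * d)
    print(f"r={r}: {added:,} trainable params per layer ({frac:.2%} of the frozen weight)")
```

At r=16 the adapter is under 1% of the frozen layer's parameters, which is why LoRA fine-tuning on top of a 4-bit base fits comfortably in memory.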
## Common issues
**Out of memory during quantization:**

```python
# Quantize layer-by-layer
from hqq.models.hf.base import AutoHQQHFModel

model = AutoHQQHFModel.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"  # Load layers sequentially
)
```

**Slow inference:**

```python
# Switch to an optimized backend
from hqq.core.quantize import HQQLinear
HQQLinear.set_backend("marlin")  # Requires Ampere+ GPU

# Or compile
model = torch.compile(model, mode="reduce-overhead")
```

**Poor quality at 2-bit:**

```python
# Use a smaller group size
config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups help at low bits
    axis=1
)
```
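The group-size advice can be checked empirically with a toy min/max group quantizer (illustrative only; HQQ's half-quadratic solver achieves lower error, but the trend is the same): smaller groups track the local weight range more tightly, at the cost of extra per-group metadata.

```python
import random

def group_quant_error(weights, nbits, group_size):
    """Mean abs reconstruction error of min/max group quantization (illustrative)."""
    qmax = 2 ** nbits - 1
    err, n = 0.0, 0
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        scale = (max(g) - min(g)) / qmax or 1e-8
        zero = round(-min(g) / scale)
        for w in g:
            q = min(qmax, max(0, round(w / scale) + zero))
            err += abs(w - (q - zero) * scale)
            n += 1
    return err / n

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic layer weights
err_64 = group_quant_error(weights, nbits=2, group_size=64)
err_16 = group_quant_error(weights, nbits=2, group_size=16)
print(f"2-bit error, group_size=64: {err_64:.5f}")
print(f"2-bit error, group_size=16: {err_16:.5f}")
```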
## References
- Advanced Usage - Custom backends, mixed precision, optimization
- Troubleshooting - Common issues, debugging, benchmarks
## Resources
- Repository: https://github.com/mobiusml/hqq
- Paper: Half-Quadratic Quantization
- HuggingFace Models: https://huggingface.co/mobiuslabsgmbh
- Version: 0.2.0+
- License: Apache 2.0