hqq-quantization


HQQ - Half-Quadratic Quantization


Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.

When to use HQQ


Use HQQ when:
  • Quantizing models without calibration data (no dataset needed)
  • Need fast quantization (minutes vs hours for GPTQ/AWQ)
  • Deploying with vLLM or HuggingFace Transformers
  • Fine-tuning quantized models with LoRA/PEFT
  • Experimenting with extreme quantization (2-bit, 1-bit)
Key advantages:
  • No calibration: Quantize any model instantly without sample data
  • Multiple backends: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
  • Flexible precision: 8/4/3/2/1-bit with configurable group sizes
  • Framework integration: Native HuggingFace and vLLM support
  • PEFT compatible: Fine-tune quantized models with LoRA
Use alternatives instead:
  • AWQ: Need calibration-based accuracy, production serving
  • GPTQ: Maximum accuracy with calibration data available
  • bitsandbytes: Simple 8-bit/4-bit without custom backends
  • llama.cpp/GGUF: CPU inference, Apple Silicon deployment

Quick start


Installation


```bash
pip install hqq
```

With specific backend


```bash
pip install hqq[torch]    # PyTorch backend
pip install hqq[torchao]  # TorchAO int4 backend
pip install hqq[bitblas]  # BitBlas backend
pip install hqq[marlin]   # Marlin backend
```

Basic quantization


```python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch
import torch.nn as nn

# Configure quantization
config = BaseQuantizeConfig(
    nbits=4,        # 4-bit quantization
    group_size=64,  # Group size for quantization
    axis=1          # Quantize along output dimension
)

# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Use normally — the input must match the layer's compute dtype/device
# (float16 on CUDA by default)
input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
output = hqq_linear(input_tensor)
```

Quantize full model with HuggingFace


```python
from transformers import AutoModelForCausalLM, HqqConfig

# Configure HQQ
quantization_config = HqqConfig(nbits=4, group_size=64, axis=1)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
    device_map="auto"
)

# Model is quantized and ready to use
```

Core concepts


Quantization configuration


HQQ uses `BaseQuantizeConfig` to define quantization parameters:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
    nbits=4,       # Bits per weight (8/4/3/2/1 supported)
    group_size=64, # Weights per quantization group
    axis=1         # 0 = input dim, 1 = output dim
)

# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16, # Smaller groups for low-bit quantization
    axis=1
)

# Mixed precision per layer type
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
```
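To apply a per-layer dict like `layer_configs` to a whole model, hqq's model-level API can be used. The sketch below assumes `AutoHQQHFModel.quantize_model` accepts a dict keyed by linear-layer tags, as in the hqq examples; the dtype and device are illustrative choices:

```python
# Sketch: apply layer_configs to a full model (assumes quantize_model
# accepts a per-layer config dict keyed by linear-layer tags).
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
AutoHQQHFModel.quantize_model(
    model,
    quant_config=layer_configs,   # the dict defined above
    compute_dtype=torch.float16,
    device="cuda",
)
```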

HQQLinear layer


The core quantized layer that replaces `nn.Linear`:

```python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch

config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Access quantized weights
W_q = hqq_layer.W_q      # Quantized weights
scale = hqq_layer.scale  # Scale factors
zero = hqq_layer.zero    # Zero points

# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
```
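A quick way to sanity-check a quantized layer is to compare its dequantized weights against the originals. A minimal sketch (it copies the weights first, since HQQLinear may free the source layer's weights during quantization):

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
linear = nn.Linear(4096, 4096)
W_ref = linear.weight.data.clone()  # keep a copy for comparison

hqq_layer = HQQLinear(linear, config)
W_dq = hqq_layer.dequantize()

# Mean absolute reconstruction error introduced by 4-bit quantization
err = (W_ref.to(W_dq.device, W_dq.dtype) - W_dq).abs().mean()
print(f"mean |W - dequant(W_q)| = {err.item():.5f}")
```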

Backends


HQQ supports multiple inference backends for different hardware:

```python
from hqq.core.quantize import HQQLinear

# Available backends
backends = [
    "pytorch",          # Pure PyTorch (default)
    "pytorch_compile",  # torch.compile optimized
    "aten",             # Custom CUDA kernels
    "torchao_int4",     # TorchAO int4 matmul
    "gemlite",          # GemLite CUDA kernels
    "bitblas",          # BitBlas optimized
    "marlin",           # Marlin 4-bit kernels
]

# Set backend globally
HQQLinear.set_backend("torchao_int4")

# Or per layer
hqq_layer.set_backend("marlin")
```

**Backend selection guide:**

| Backend | Best For | Requirements |
|---------|----------|--------------|
| pytorch | Compatibility | Any GPU |
| pytorch_compile | Moderate speedup | torch>=2.0 |
| aten | Good balance | CUDA GPU |
| torchao_int4 | 4-bit inference | torchao installed |
| marlin | Maximum 4-bit speed | Ampere+ GPU |
| bitblas | Flexible bit-widths | bitblas installed |
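Since backend availability depends on GPU architecture and installed extras, a try/fallback loop keeps scripts portable. A small sketch using the string names from the table above, assuming `set_backend` raises when a backend is unavailable:

```python
from hqq.core.quantize import HQQLinear

# Try the fastest candidate first, fall back to the portable default.
for backend in ("marlin", "torchao_int4", "pytorch"):
    try:
        HQQLinear.set_backend(backend)
        print(f"Using HQQ backend: {backend}")
        break
    except Exception as exc:
        print(f"{backend} unavailable ({exc}), trying next")
```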

HuggingFace integration


Load pre-quantized models


```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load HQQ-quantized model from Hub
model = AutoModelForCausalLM.from_pretrained(
    "mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

Quantize and save


```python
from transformers import AutoModelForCausalLM, HqqConfig

# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")

# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
```

Mixed precision quantization


```python
from transformers import AutoModelForCausalLM, HqqConfig

# Different precision per layer type:
# attention layers keep higher precision,
# MLP layers use lower precision for memory savings
config = HqqConfig(
    nbits=4,
    group_size=64,
    dynamic_config={
        "attn": {"nbits": 4, "group_size": 64},
        "mlp": {"nbits": 2, "group_size": 32}
    }
)
```
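To see why this split saves memory, a back-of-the-envelope estimate helps: each quantization group carries its own scale and zero point, so assuming unquantized fp16 metadata (~32 bits per group) the effective cost is roughly `nbits + 32 / group_size` bits per weight. Illustrative arithmetic only:

```python
# Illustrative arithmetic: effective bits per weight, assuming fp16
# scale + zero point (~32 bits of metadata per quantization group).
def effective_bits(nbits: int, group_size: int, meta_bits: int = 32) -> float:
    return nbits + meta_bits / group_size

print(effective_bits(4, 64))  # 4.5 bits/weight for the attention config
print(effective_bits(2, 32))  # 3.0 bits/weight for the MLP config
```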

vLLM integration


Serve HQQ models with vLLM


```python
from vllm import LLM, SamplingParams

# Load HQQ-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq",
    dtype="float16"
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
```

vLLM with custom HQQ config


```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    quantization="hqq",
    quantization_config={
        "nbits": 4,
        "group_size": 64
    }
)
```

PEFT/LoRA fine-tuning


Fine-tune quantized models


```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train normally with Trainer or a custom loop
```

QLoRA-style training


```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./hqq-lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

# train_dataset and data_collator are assumed to be defined elsewhere
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

trainer.train()
```
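After training, PEFT saves only the small LoRA adapter, not the quantized base weights; the adapter can later be re-attached to a freshly quantized base model. A minimal sketch using standard PEFT calls (the adapter path is a placeholder):

```python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import PeftModel

# Save just the LoRA adapter weights (a few MB)
model.save_pretrained("./hqq-lora-adapter")

# Later: re-quantize the base model and attach the trained adapter
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=HqqConfig(nbits=4, group_size=64),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./hqq-lora-adapter")
```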

Quantization workflows


Workflow 1: Quick model compression


```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)

# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
```
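For a quick quantitative check beyond eyeballing generations, you can measure the model's loss (and perplexity) on a short reference text; comparing against the fp16 baseline on the same text is most informative. A minimal sketch:

```python
import torch

text = "The capital of France is Paris, a city known for its museums."
enc = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"loss: {loss.item():.3f}  perplexity: {loss.exp().item():.1f}")
```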

Workflow 2: Optimize for inference speed


```python
import time
import torch
from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# 1. Quantize with optimal backend
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Set fast backend
HQQLinear.set_backend("marlin")  # or "torchao_int4"

# 3. Compile for additional speedup
model = torch.compile(model)

# 4. Benchmark (the first iteration includes compile warmup)
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
```

Best practices

最佳实践

  1. Start with 4-bit: Best quality/size tradeoff for most models (see the sizing sketch after this list)
  2. Use group_size=64: Good balance; smaller for extreme quantization
  3. Choose backend wisely: Marlin for 4-bit Ampere+, TorchAO for flexibility
  4. Verify quality: Always test generation quality after quantization
  5. Mixed precision: Keep attention at higher precision, compress MLP more
  6. PEFT training: Use LoRA r=16-32 for good fine-tuning results
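As a sizing sketch for practice 1 (illustrative arithmetic, not measured numbers): at 4-bit with group_size=64 and fp16 metadata, weights cost about 4.5 effective bits each, so an 8B-parameter model's linear weights shrink from roughly 16 GB in fp16 to about 4.5 GB:

```python
# Illustrative arithmetic only; ignores embeddings, activations, and any
# layers left unquantized (e.g. lm_head).
params = 8e9
effective_bits = 4 + 32 / 64   # nbits + fp16 scale/zero overhead per group
print(f"quantized: ~{params * effective_bits / 8 / 1e9:.1f} GB")  # ~4.5 GB
print(f"fp16:      ~{params * 16 / 8 / 1e9:.1f} GB")              # ~16.0 GB
```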

Common issues


**Out of memory during quantization:**

```python
# Quantize layer-by-layer
from hqq.models.hf.base import AutoHQQHFModel

model = AutoHQQHFModel.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"  # Load layers sequentially
)
```

**Slow inference:**

```python
# Switch to optimized backend
from hqq.core.quantize import HQQLinear
HQQLinear.set_backend("marlin")  # Requires Ampere+ GPU

# Or compile
import torch
model = torch.compile(model, mode="reduce-overhead")
```

**Poor quality at 2-bit:**

```python
# Use smaller group size
from hqq.core.quantize import BaseQuantizeConfig

config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,  # Smaller groups help at low bits
    axis=1
)
```

References


  • Advanced Usage - Custom backends, mixed precision, optimization
  • Troubleshooting - Common issues, debugging, benchmarks

Resources
