quantizing-models-bitsandbytes


bitsandbytes - LLM Quantization


Quick start


bitsandbytes reduces LLM memory by 50% (8-bit) or 75% (4-bit), typically costing under 1% accuracy for 8-bit and 1-2% for 4-bit.

Installation:

```bash
pip install bitsandbytes transformers accelerate
```

**8-bit quantization** (50% memory reduction):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
```

Memory: 14GB → 7GB

**4-bit quantization** (75% memory reduction):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
```


Memory: 14GB → 3.5GB
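
To confirm these numbers on your own hardware, Transformers exposes a footprint helper on loaded models; run it after either snippet above:

```python
# Weight memory of the loaded model, in GB (buffers included)
print(f"Footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```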


Common workflows


Workflow 1: Load large model in limited GPU memory


Copy this checklist:

Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model

**Step 1: Calculate memory requirements**

Estimate model memory:

```
FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9

Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
```
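
These formulas are easy to script; a minimal helper (the function name is illustrative, not a bitsandbytes API):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight memory only; activations and KV cache are extra."""
    return num_params * bytes_per_param / 1e9

for label, width in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{label}: {weight_memory_gb(7e9, width):.1f} GB")  # 14.0 / 7.0 / 3.5
```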
**Step 2: Choose quantization level**

| GPU VRAM | Model Size | Recommended |
|----------|------------|-------------|
| 8 GB     | 3B         | 4-bit       |
| 12 GB    | 7B         | 4-bit       |
| 16 GB    | 7B         | 8-bit or 4-bit |
| 24 GB    | 13B        | 8-bit (or 70B in 4-bit) |
| 40+ GB   | 70B        | 8-bit       |
**Step 3: Configure quantization**

For 8-bit (better accuracy):

```python
from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold
    llm_int8_has_fp16_weight=False
)
```

For 4-bit (maximum memory savings):

```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True  # Nested quantization
)
```

**Step 4: Load and verify model**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,
    device_map="auto",  # Automatic device placement
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
```

```python
# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

# Check memory
import torch
print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
```

Workflow 2: Fine-tune with QLoRA (4-bit training)


QLoRA enables fine-tuning large models on consumer GPUs.
Copy this checklist:
QLoRA Fine-tuning:
- [ ] Step 1: Install dependencies
- [ ] Step 2: Configure 4-bit base model
- [ ] Step 3: Add LoRA adapters
- [ ] Step 4: Train with standard Trainer
**Step 1: Install dependencies**

```bash
pip install bitsandbytes transformers peft accelerate datasets
```
**Step 2: Configure 4-bit base model**

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
```
**Step 3: Add LoRA adapters**

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,           # LoRA rank
    lora_alpha=32,  # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.06%
```


**Step 4: Train with standard Trainer**

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

trainer.train()

# Save LoRA adapters (only ~20MB)
model.save_pretrained("./qlora-adapters")
```
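
For later inference, the saved adapters can be re-attached to a freshly quantized base model; a short sketch using peft's `PeftModel` (paths and `bnb_config` match the steps above):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the 4-bit base (same bnb_config as Step 2), then attach the adapters
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./qlora-adapters")
```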

Workflow 3: 8-bit optimizer for memory-efficient training


Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.
8-bit Optimizer Setup:
- [ ] Step 1: Replace standard optimizer
- [ ] Step 2: Configure training
- [ ] Step 3: Monitor memory savings
**Step 1: Replace standard optimizer**

```python
import bitsandbytes as bnb  # must be installed; Trainer uses it via optim below
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Instead of torch.optim.AdamW
model = AutoModelForCausalLM.from_pretrained("model-name")

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # 8-bit optimizer
    learning_rate=5e-5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()
```

**Manual optimizer usage**:
```python
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8
)

# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

**Step 2: Configure training**

Compare memory:

```
Standard AdamW optimizer memory = model_params × 8 bytes (states)
8-bit AdamW memory              = model_params × 2 bytes
Savings                         = 75% of optimizer memory

Example (Llama 2 7B):
Standard: 7B × 8 = 56 GB
8-bit:    7B × 2 = 14 GB
Savings:  42 GB
```
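
The same arithmetic as a quick script (illustrative helper, not part of bitsandbytes):

```python
def optimizer_state_gb(num_params: float, bytes_per_param: float) -> float:
    """Optimizer-state memory only; weights and gradients are extra."""
    return num_params * bytes_per_param / 1e9

print(f"Standard AdamW: {optimizer_state_gb(7e9, 8):.0f} GB")  # 56 GB
print(f"8-bit AdamW:    {optimizer_state_gb(7e9, 2):.0f} GB")  # 14 GB
```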

**Step 3: Monitor memory savings**

```python
import torch

before = torch.cuda.memory_allocated()

# Training step
optimizer.step()

after = torch.cuda.memory_allocated()
print(f"Memory used: {(after-before)/1e9:.2f}GB")
```

When to use vs alternatives


Use bitsandbytes when:
  • GPU memory limited (need to fit larger model)
  • Training with QLoRA (fine-tune 70B on single GPU)
  • Inference only (50-75% memory reduction)
  • Using HuggingFace Transformers
  • Acceptable 0-2% accuracy degradation
Use alternatives instead:
  • GPTQ/AWQ: Production serving (faster inference than bitsandbytes)
  • GGUF: CPU inference (llama.cpp)
  • FP8: H100 GPUs (hardware FP8 faster)
  • Full precision: Accuracy critical, memory not constrained

Common issues


**Issue: CUDA error during loading**

Install a bitsandbytes build matching your CUDA version:

```bash
# Check CUDA version
nvcc --version

# Install matching bitsandbytes
pip install bitsandbytes --no-cache-dir
```
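
Recent bitsandbytes releases also ship a self-diagnostic that prints the detected CUDA setup; if your installed version supports it, this is the quickest check:

```bash
python -m bitsandbytes
```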

**Issue: Model loading slow**

Use CPU offload for large models:
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "30GB"}  # Offload to CPU
)
```

**Issue: Lower accuracy than expected**

Try 8-bit instead of 4-bit:

```python
config = BitsAndBytesConfig(load_in_8bit=True)
```

8-bit has <0.5% accuracy loss vs 1-2% for 4-bit



Or use NF4 with double quantization:

```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Better than fp4
    bnb_4bit_use_double_quant=True  # Extra accuracy
)
```
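
To quantify the gap, a quick loss spot-check can compare the quantized model against an FP16 baseline (a rough sketch; the checkpoint and sample text are placeholders, and the models are loaded one at a time to fit in memory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

configs = [
    ("FP16", None),
    ("4-bit NF4", BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )),
]

for label, cfg in configs:
    model = AutoModelForCausalLM.from_pretrained(
        name, quantization_config=cfg, device_map="auto",
        torch_dtype=torch.float16
    )
    inputs = batch.to(model.device)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{label} loss: {loss.item():.3f}")
    del model
    torch.cuda.empty_cache()  # free VRAM before the next load
```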
**Issue: OOM even with 4-bit**

Enable CPU/disk offload:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    offload_folder="offload",  # Disk offload
    offload_state_dict=True
)
```

Advanced topics


QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.
Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.
Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.

Hardware requirements


  • GPU: NVIDIA with compute capability 7.0+ (Turing, Ampere, Hopper)
  • VRAM: Depends on model and quantization
    • 4-bit Llama 2 7B: 4GB
    • 4-bit Llama 2 13B: 8GB
    • 4-bit Llama 2 70B: 24GB
  • CUDA: 11.1+ (12.0+ recommended)
  • PyTorch: 2.0+
Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)
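
A quick way to verify the compute-capability requirement from Python:

```python
import torch

major, minor = torch.cuda.get_device_capability(0)
ok = (major, minor) >= (7, 0)
print(f"Compute capability {major}.{minor}: {'supported' if ok else 'below 7.0'}")
```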

Resources
