bitsandbytes - LLM Quantization
Quick start
bitsandbytes reduces LLM weight memory by about 50% (8-bit) or 75% (4-bit) relative to FP16, with minimal accuracy loss (typically under 1% for 8-bit and 1-2% for 4-bit).
**Installation**:
```bash
pip install bitsandbytes transformers accelerate
```
**8-bit quantization** (50% memory reduction):
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Memory: 14GB → 7GB
```
**4-bit quantization** (75% memory reduction):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Memory: 14GB → 3.5GB
```
Common workflows
Workflow 1: Load large model in limited GPU memory
Copy this checklist:
Quantization Loading:
- [ ] Step 1: Calculate memory requirements
- [ ] Step 2: Choose quantization level (4-bit or 8-bit)
- [ ] Step 3: Configure quantization
- [ ] Step 4: Load and verify model
**Step 1: Calculate memory requirements**
Estimate model memory:
FP16 memory (GB) = Parameters × 2 bytes / 1e9
INT8 memory (GB) = Parameters × 1 byte / 1e9
INT4 memory (GB) = Parameters × 0.5 bytes / 1e9
Example (Llama 2 7B):
FP16: 7B × 2 / 1e9 = 14 GB
INT8: 7B × 1 / 1e9 = 7 GB
INT4: 7B × 0.5 / 1e9 = 3.5 GB
**Step 2: Choose quantization level**
| GPU VRAM | Model Size | Recommended |
|---|---|---|
| 8 GB | 3B | 4-bit |
| 12 GB | 7B | 4-bit |
| 16 GB | 7B | 8-bit or 4-bit |
| 24 GB | 13B | 8-bit or 4-bit |
| 40+ GB | 70B | 4-bit (8-bit needs ~70 GB for weights alone) |
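The Step 1 formulas can be wrapped in a small helper to sanity-check a choice before downloading a model. This is a minimal sketch, not part of bitsandbytes: the names `weight_memory_gb` and `choose_quantization` and the 0.8 headroom factor are assumptions made for illustration, and the estimate covers weights only (activations, KV cache, and CUDA overhead come on top). The table above is more conservative on small GPUs.

```python
# Bytes per parameter for each weight format, as in the Step 1 formulas.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Estimate weight memory in GB (weights only, no activations)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def choose_quantization(num_params: float, vram_gb: float, headroom: float = 0.8) -> str:
    """Return the highest-precision format whose weights fit in the budget.

    `headroom` is a hypothetical safety factor: only this fraction of VRAM
    is budgeted for weights, leaving the rest for activations and overhead.
    """
    budget_gb = vram_gb * headroom
    for fmt in ("fp16", "int8", "int4"):
        if weight_memory_gb(num_params, fmt) <= budget_gb:
            return fmt
    return "does not fit"


if __name__ == "__main__":
    for params, vram in [(7e9, 16), (13e9, 24), (70e9, 24)]:
        print(f"{params / 1e9:.0f}B on {vram} GB: {choose_quantization(params, vram)}")
```

A result of `"does not fit"` means even 4-bit weights exceed the budget, in which case offloading or a smaller model is needed.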
**Step 3: Configure quantization**
For 8-bit (better accuracy):
```python
from transformers import BitsAndBytesConfig
import torch
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold
    llm_int8_has_fp16_weight=False
)
```
For 4-bit (maximum memory savings):
```python
from transformers import BitsAndBytesConfig
import torch
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True  # Nested quantization
)
```
**Step 4: Load and verify model**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,  # Config from Step 3
    device_map="auto",  # Automatic device placement
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
# Check memory
print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
```
Workflow 2: Fine-tune with QLoRA (4-bit training)
QLoRA enables fine-tuning large models on consumer GPUs.
Copy this checklist:
QLoRA Fine-tuning:
- [ ] Step 1: Install dependencies
- [ ] Step 2: Configure 4-bit base model
- [ ] Step 3: Add LoRA adapters
- [ ] Step 4: Train with standard Trainer
**Step 1: Install dependencies**
```bash
pip install bitsandbytes transformers peft accelerate datasets
```
**Step 2: Configure 4-bit base model**
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
```
**Step 3: Add LoRA adapters**
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Prepare model for training
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; only the LoRA adapter
# weights (well under 1% of the model) are trainable
```
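The number of trainable parameters follows directly from the LoRA rank and the shapes of the adapted layers. The sketch below is a back-of-the-envelope count, not a PEFT API: the helper `lora_trainable_params` is hypothetical, and it assumes Llama 2 7B uses a hidden size of 4096, 32 decoder layers, and square 4096 × 4096 attention projections. The figure printed by `print_trainable_parameters()` depends on the rank and target modules actually used.

```python
def lora_trainable_params(hidden_size: int, num_layers: int,
                          num_target_modules: int, rank: int) -> int:
    """Count LoRA parameters when every target module is a square
    hidden_size x hidden_size projection.

    Each adapted module adds two matrices: A (rank x in) and B (out x rank).
    """
    per_module = rank * (hidden_size + hidden_size)
    return num_layers * num_target_modules * per_module


if __name__ == "__main__":
    # Assumed Llama 2 7B shapes; q_proj, k_proj, v_proj, o_proj are adapted.
    n = lora_trainable_params(hidden_size=4096, num_layers=32,
                              num_target_modules=4, rank=16)
    print(f"{n:,} trainable LoRA parameters")
    print(f"~{n * 4 / 1e6:.0f} MB if the adapters are saved in FP32")
```

Halving the rank or adapting fewer modules shrinks both the trainable count and the saved adapter size proportionally.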
**Step 4: Train with standard Trainer**
```python
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # Your tokenized dataset (not defined here)
    tokenizer=tokenizer  # Tokenizer loaded for the same base model
)
trainer.train()
# Save LoRA adapters only (tens of MB, not the full model)
model.save_pretrained("./qlora-adapters")
```
Workflow 3: 8-bit optimizer for memory-efficient training
Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.
8-bit Optimizer Setup:
- [ ] Step 1: Replace standard optimizer
- [ ] Step 2: Configure training
- [ ] Step 3: Monitor memory savings
**Step 1: Replace standard optimizer**
```python
import bitsandbytes as bnb  # Must be installed for the 8-bit optimizers
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
# Instead of torch.optim.AdamW, select an 8-bit optimizer via `optim`
model = AutoModelForCausalLM.from_pretrained("model-name")
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    optim="paged_adamw_8bit",  # 8-bit optimizer
    learning_rate=5e-5
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset  # Your tokenized dataset (not defined here)
)
trainer.train()
```
**Manual optimizer usage**:
```python
import bitsandbytes as bnb
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8
)
# Training loop
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
**Step 2: Configure training**
Compare memory:
Standard AdamW optimizer memory = model_params × 8 bytes (two FP32 states, 4 bytes each)
8-bit AdamW memory = model_params × 2 bytes (two 8-bit states, 1 byte each)
Savings = 75% optimizer memory
Example (Llama 2 7B):
Standard: 7B × 8 = 56 GB
8-bit: 7B × 2 = 14 GB
Savings: 42 GB
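The comparison above can be reproduced for any model size. This is a minimal sketch under the same simplification as the formulas: the helper `optimizer_state_gb` is a hypothetical name, it counts only the two Adam moment states per parameter, and it ignores the small per-block quantization constants that 8-bit optimizers also store.

```python
def optimizer_state_gb(num_params: float, bytes_per_state: float,
                       num_states: int = 2) -> float:
    """Adam/AdamW keeps two states per parameter (first and second moment)."""
    return num_params * bytes_per_state * num_states / 1e9


if __name__ == "__main__":
    num_params = 7e9  # Llama 2 7B, as in the example above
    standard = optimizer_state_gb(num_params, bytes_per_state=4)   # FP32 states
    eight_bit = optimizer_state_gb(num_params, bytes_per_state=1)  # 8-bit states
    savings = 1 - eight_bit / standard
    print(f"Standard: {standard:.0f} GB, 8-bit: {eight_bit:.0f} GB, "
          f"savings: {savings:.0%}")
```

These figures apply to full fine-tuning; with LoRA only the adapter parameters have optimizer states, so the absolute savings are much smaller.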
**Step 3: Monitor memory savings**
```python
import torch
before = torch.cuda.memory_allocated()
# Training step (optimizer states are allocated on the first step)
optimizer.step()
after = torch.cuda.memory_allocated()
print(f"Memory used: {(after-before)/1e9:.2f}GB")
```
When to use vs alternatives
Use bitsandbytes when:
- GPU memory is limited (need to fit a larger model)
- Training with QLoRA (fine-tune 70B on a single GPU)
- Running inference with 50-75% less memory
- Using HuggingFace Transformers
- A 0-2% accuracy degradation is acceptable
Use alternatives instead:
- GPTQ/AWQ: Production serving (faster inference than bitsandbytes)
- GGUF: CPU inference (llama.cpp)
- FP8: H100 GPUs (hardware FP8 faster)
- Full precision: Accuracy critical, memory not constrained
Common issues
**Issue: CUDA error during loading**
Install matching CUDA version:
```bash
# Check CUDA version
nvcc --version
# Install matching bitsandbytes
pip install bitsandbytes --no-cache-dir
```
**Issue: Model loading slow**
Use CPU offload for large models:
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "30GB"}  # Offload to CPU
)
```
**Issue: Lower accuracy than expected**
Try 8-bit instead of 4-bit:
```python
config = BitsAndBytesConfig(load_in_8bit=True)
# 8-bit has <0.5% accuracy loss vs 1-2% for 4-bit
```
Or use NF4 with double quantization:
```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Better than fp4
    bnb_4bit_use_double_quant=True  # Also quantizes the quantization constants to save memory
)
```
**Issue: OOM even with 4-bit**
Enable offloading to CPU and disk:
```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=config,
    device_map="auto",
    offload_folder="offload",  # Disk offload
    offload_state_dict=True
)
```
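Offloading works best when `device_map="auto"` is told how much memory each device may use, as with the `max_memory` argument shown under the model loading issue. The helper below is a minimal sketch for building that mapping: `build_max_memory` and its `reserve_gb` default are assumptions made for illustration, not part of Transformers or Accelerate, and the GPU sizes are supplied by the caller rather than queried.

```python
from typing import Dict, Union


def build_max_memory(gpu_total_gb: Dict[int, float],
                     reserve_gb: float = 2.0,
                     cpu_gb: float = 30.0) -> Dict[Union[int, str], str]:
    """Build a max_memory mapping that leaves `reserve_gb` free on each GPU.

    The reserve is a hypothetical margin for activations and CUDA overhead.
    """
    max_memory: Dict[Union[int, str], str] = {}
    for gpu_id, total in gpu_total_gb.items():
        usable = max(int(total - reserve_gb), 0)
        max_memory[gpu_id] = f"{usable}GB"
    max_memory["cpu"] = f"{int(cpu_gb)}GB"
    return max_memory


if __name__ == "__main__":
    # One 24 GB GPU with 4 GB held back, plus 30 GB of CPU RAM.
    print(build_max_memory({0: 24.0}, reserve_gb=4.0, cpu_gb=30.0))
```

The resulting dictionary can be passed as `max_memory=...` to `from_pretrained`; layers that exceed the GPU budget are then placed on the CPU or, with `offload_folder` set, on disk.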
Advanced topics
QLoRA training guide: See references/qlora-training.md for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.
Quantization formats: See references/quantization-formats.md for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.
Memory optimization: See references/memory-optimization.md for CPU offloading strategies, gradient checkpointing, and memory profiling.
Hardware requirements
- GPU: NVIDIA with compute capability 7.0+ (Volta, Turing, Ampere, Hopper)
- VRAM: Depends on model and quantization
  - 4-bit Llama 2 7B: 4GB
  - 4-bit Llama 2 13B: 8GB
  - 4-bit Llama 2 70B: ~35GB for weights alone (0.5 bytes per parameter), so a 40GB+ GPU
- CUDA: 11.1+ (12.0+ recommended)
- PyTorch: 2.0+
Supported platforms: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)
Resources
- GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
- HuggingFace docs: https://huggingface.co/docs/transformers/quantization/bitsandbytes
- QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
- LLM.int8() paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)