# QLoRA: Quantized Low-Rank Adaptation
QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.
Prerequisites: This skill assumes familiarity with LoRA. See the `lora` skill for LoRA fundamentals (`LoraConfig`, `target_modules`, training patterns).
## Core Innovations
QLoRA introduces three techniques that reduce memory usage without sacrificing performance:
### 4-bit NormalFloat (NF4)
NF4 is an information-theoretically optimal quantization data type for normally distributed weights. Neural network weights are typically normally distributed, making NF4 more efficient than standard 4-bit floats.
```
Storage:  4-bit NF4 (quantized weights)
Compute:  16-bit BF16 (dequantized for forward/backward pass)
```

The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; LoRA adapters remain in full precision.
NF4 vs FP4:
| Quantization | Description | Use Case |
|---|---|---|
| `nf4` | Normalized Float 4-bit, optimal for normal distributions | Default, recommended |
| `fp4` | Standard 4-bit float | Legacy, rarely needed |
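The idea behind NF4 can be sketched in a few lines: place the 16 quantization levels at evenly spaced quantiles of a standard normal distribution, so that each level is used about equally often when weights are normally distributed. The following is a simplified illustration, not the actual bitsandbytes codebook (the real NF4 scheme differs in details such as the exact quantile placement and an exact zero level):

```python
from statistics import NormalDist

def nf4_levels(bits: int = 4) -> list[float]:
    """Evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in qs)
    return [q / scale for q in qs]

def quantize(weights: list[float], levels: list[float]) -> list[float]:
    """Absmax-scale the block, then snap each weight to the nearest level."""
    absmax = max(abs(w) for w in weights)  # stored as the block's scaling constant
    return [absmax * min(levels, key=lambda l: abs(l - w / absmax)) for w in weights]

levels = nf4_levels()
print(len(levels))   # 16 levels for 4 bits
print(quantize([0.03, -0.2, 0.11, 0.0], levels))
```

Because the levels are quantiles rather than evenly spaced values, resolution is concentrated near zero, where most weights live.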
### Double Quantization
Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:
```
First quantization:  weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants
```

This saves approximately 0.37 bits per parameter, which is significant for billion-parameter models:
- 7B model: ~325 MB savings
- 70B model: ~3.2 GB savings
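The 0.37-bit figure follows directly from the block sizes used in the QLoRA paper (blocks of 64 weights, with the constants themselves quantized in groups of 256). A quick back-of-the-envelope check:

```python
BLOCK = 64    # weights per quantization block
GROUP = 256   # scaling constants per second-level block

# One fp32 constant per 64 weights, before double quantization
before = 32 / BLOCK                       # 0.5 bits per parameter

# After: 8-bit constants, plus one fp32 second-level constant per 256 of them
after = 8 / BLOCK + 32 / (BLOCK * GROUP)

savings_bits = before - after             # ≈ 0.373 bits per parameter
print(savings_bits)

for n_params in (7e9, 70e9):
    print(f"{n_params / 1e9:.0f}B: {n_params * savings_bits / 8 / 1e6:.0f} MB saved")
```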
### Paged Optimizers
During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:
```
Normal training:   OOM on memory spike
Paged optimizers:  GPU ↔ CPU transfer handles spikes gracefully
```

bitsandbytes handles the paging automatically once a paged optimizer is selected (e.g. `optim="paged_adamw_8bit"` in the training arguments).
## BitsAndBytesConfig Deep Dive
### All Parameters Explained
```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # "nf4" (recommended) or "fp4"

    # Double quantization
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants

    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute

    # Optional: specific storage type (usually auto-detected)
    bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
)
```

### Compute Dtype Selection
| Dtype | Hardware | Notes |
|---|---|---|
| `torch.bfloat16` | Ampere+ (RTX 30xx, A100) | Recommended, faster |
| `torch.float16` | Older GPUs (V100, RTX 20xx) | Use if bf16 not supported |
| `torch.float32` | Any | Slower, only for debugging |
Check bf16 support:

```python
import torch
print(torch.cuda.is_bf16_supported())  # True on Ampere+
```

### Comparison: Quantization Options
```python
# Recommended: NF4 + double quant + bf16
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Same savings on GPUs without bf16 support
max_savings_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16 for pre-Ampere GPUs
)

# 8-bit alternative (less compression, sometimes more stable)
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
```

## Memory Requirements
| Model Size | Full Fine-tuning | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 34B | ~272 GB | ~75 GB | ~20 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
Notes:
- QLoRA memory includes model + optimizer states + activations
- Actual usage varies with batch size, sequence length, and gradient checkpointing
- Add ~20% buffer for safe operation
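For the weights alone, QLoRA's footprint is easy to estimate: 4 bits per parameter plus roughly 0.13 bits of quantization-constant overhead after double quantization. A rough helper (weights only; optimizer states, activations, and CUDA overhead come on top, which is why the table's totals are higher):

```python
def qlora_weight_gb(n_params: float) -> float:
    """Approximate 4-bit weight storage in GB (decimal), after double quantization."""
    bits_per_param = 4 + 8 / 64 + 32 / (64 * 256)  # weights + quantization constants
    return n_params * bits_per_param / 8 / 1e9

for size in (7e9, 13e9, 70e9):
    print(f"{size / 1e9:.0f}B: {qlora_weight_gb(size):.1f} GB")
```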
### GPU Recommendations
| GPU VRAM | Max Model Size (QLoRA) |
|---|---|
| 8 GB | 7B (tight) |
| 16 GB | 7-13B |
| 24 GB | 13-34B |
| 48 GB | 34-70B |
| 80 GB | 70B+ comfortably |
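The table above can be expressed as a small helper for scripting hardware checks (thresholds copied from the table; the boundaries are rough guidelines, not hard limits):

```python
def max_qlora_model(vram_gb: float) -> str:
    """Largest model size the table suggests for QLoRA fine-tuning."""
    tiers = [(80, "70B+"), (48, "70B"), (24, "34B"), (16, "13B"), (8, "7B")]
    for threshold, size in tiers:
        if vram_gb >= threshold:
            return size
    return "below 7B"

print(max_qlora_model(24))   # 34B
print(max_qlora_model(48))   # 70B
```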
## Complete Training Example
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# 1. Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 2. Load quantized model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Optional: faster attention
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 3. Prepare for k-bit training (critical step!)
model = prepare_model_for_kbit_training(model)

# 4. LoRA config (see the lora skill for parameter details)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 5. Dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_example(example):
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)

# 6. Training
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_seq_length=512,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
    dataset_text_field="text",
)
trainer.train()

# 7. Save adapter
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")
```

## Inference and Merging
### Inference with Quantized Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-3.1-8B"

# Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
model.eval()

# Generate
inputs = tokenizer("### Instruction:\nExplain quantum computing.\n\n### Response:\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Merging to Full Precision
To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model in full precision (on CPU to avoid OOM)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")
```

**Note**: Merging requires enough RAM to hold the full-precision model. For 70B models, this means ~140 GB of RAM.

## Troubleshooting
### CUDA Version Issues
```bash
# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"

# bitsandbytes requires CUDA 11.7+
# If version mismatch, reinstall:
pip uninstall bitsandbytes
pip install bitsandbytes --upgrade
```

### "cannot find libcudart" or Missing Library Errors
```bash
# Find CUDA installation
find /usr -name "libcudart*" 2>/dev/null

# Set environment variable
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Or for conda:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```

### Slow Training
Common cause: compute dtype mismatch.

```python
# Check if model is using expected dtype
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.dtype}")
        break  # All LoRA params should match
```

Ensure `bf16=True` is set in the training args if `BitsAndBytesConfig` uses bf16; a mismatch causes constant dtype conversions.

### Out of Memory
```python
# 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 2. Reduce batch size, increase accumulation
per_device_train_batch_size = 1
gradient_accumulation_steps = 16

# 3. Use paged optimizer
optim = "paged_adamw_8bit"

# 4. Reduce sequence length
max_seq_length = 256

# 5. Target fewer modules
target_modules = ["q_proj", "v_proj"]  # Minimal set
```

### Model Loads But Training Fails
```python
# Ensure prepare_model_for_kbit_training is called
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # Don't skip this!

# Enable input gradients if needed
model.enable_input_require_grads()
```

## Best Practices
- **Always use `prepare_model_for_kbit_training`**: This enables gradient computation through the frozen quantized layers
- **Match compute dtype with training precision**: If `bnb_4bit_compute_dtype=torch.bfloat16`, use `bf16=True` in training args
- **Use paged optimizers for large models**: `optim="paged_adamw_8bit"` or `"paged_adamw_32bit"` handles memory spikes
- **Start with NF4 + double quantization**: This is the recommended default; only change if debugging
- **Gradient checkpointing is essential**: Always enable it for QLoRA training to fit larger batch sizes
- **Test inference before long training runs**: Load the model and generate a few tokens to catch configuration issues early
- **Monitor GPU memory**: Use `nvidia-smi` or `torch.cuda.memory_summary()` to track actual usage
- **Consider 8-bit for unstable training**: If 4-bit training shows instability, try `load_in_8bit=True` as a middle ground