
QLoRA: Quantized Low-Rank Adaptation


QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.
Prerequisites: This skill assumes familiarity with LoRA. See the `lora` skill for LoRA fundamentals (`LoraConfig`, `target_modules`, training patterns).


Core Innovations

QLoRA introduces three techniques that reduce memory usage without sacrificing performance:

4-bit NormalFloat (NF4)

NF4 is an information-theoretically optimal quantization data type for normally distributed weights. Neural network weights are typically normally distributed, making NF4 more efficient than standard 4-bit floats.
Storage: 4-bit NF4 (quantized weights)
Compute: 16-bit BF16 (dequantized for forward/backward pass)
The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; LoRA adapters remain in full precision.
NF4 vs FP4:
| Quantization | Description | Use Case |
|---|---|---|
| `nf4` | Normalized Float 4-bit, optimal for normal distributions | Default, recommended |
| `fp4` | Standard 4-bit float | Legacy, rarely needed |
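The mechanics of block-wise 4-bit quantization can be sketched in pure Python. This is illustrative only: the 16-level codebook below is a hypothetical stand-in that is denser near zero (as NF4's normal-quantile levels are), not the exact table bitsandbytes ships, and the block size and function names are made up for the example.

```python
# Illustrative sketch of NF4-style block quantization: weights are scaled
# by the block's absolute maximum, then snapped to the nearest entry in a
# fixed 16-level codebook whose levels cluster near zero, where normally
# distributed weights concentrate.

# Hypothetical 16-level codebook (the real NF4 levels are derived from
# quantiles of the standard normal distribution).
LEVELS = [-1.0, -0.7, -0.53, -0.39, -0.28, -0.18, -0.09, 0.0,
          0.08, 0.16, 0.25, 0.34, 0.44, 0.56, 0.72, 1.0]

def quantize_block(weights):
    """Quantize one block: return 4-bit codes plus one absmax scale."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(16), key=lambda i: abs(LEVELS[i] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Reconstruct approximate weights, as done for the bf16 compute step."""
    return [LEVELS[c] * absmax for c in codes]

block = [0.12, -0.05, 0.31, -0.22, 0.0, 0.45]
codes, scale = quantize_block(block)
recovered = dequantize_block(codes, scale)
# Each recovered weight is close to, but not exactly, the original.
```

Only the frozen base weights go through this round trip; the LoRA adapter weights never do, which is why adapter quality is preserved.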

Double Quantization

Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:
First quantization:  weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants
This saves approximately 0.37 bits per parameter—significant for billion-parameter models:
  • 7B model: ~325 MB savings
  • 70B model: ~3.2 GB savings
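The per-parameter figure can be sanity-checked with back-of-envelope arithmetic. The block sizes below (64 parameters per quantization block, 256 first-level constants per second-level constant) are the values used in the QLoRA paper; treat this as a rough check, not an exact accounting of the bitsandbytes implementation:

```python
# Back-of-envelope check of the ~0.37 bits/parameter saved by double
# quantization. With fp32 scaling constants and 64-parameter blocks, each
# parameter carries 32/64 = 0.5 extra bits; quantizing those constants to
# 8-bit (with one fp32 second-level constant per 256 blocks) cuts the
# overhead to 8/64 + 32/(64*256) ~= 0.127 bits per parameter.

BLOCK_SIZE = 64           # parameters per quantization block
SECOND_LEVEL_BLOCK = 256  # first-level constants per second-level constant

before = 32 / BLOCK_SIZE
after = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)
saved_bits = before - after  # ~= 0.373 bits per parameter

def savings_mb(n_params):
    """Approximate memory saved in megabytes (1 MB = 1e6 bytes)."""
    return saved_bits * n_params / 8 / 1e6

# ~326 MB for a 7B model and ~3.3 GB for a 70B model -- matching the
# ~325 MB and ~3.2 GB estimates above.
```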

Paged Optimizers

During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:
Normal training: OOM on memory spike
Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully
This is handled automatically by bitsandbytes when using 4-bit training.

BitsAndBytesConfig Deep Dive

All Parameters Explained

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,              # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",      # "nf4" (recommended) or "fp4"

    # Double quantization
    bnb_4bit_use_double_quant=True, # Quantize the quantization constants

    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute

    # Optional: specific storage type (usually auto-detected)
    bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
)
```

Compute Dtype Selection

| Dtype | Hardware | Notes |
|---|---|---|
| `torch.bfloat16` | Ampere+ (RTX 30xx, A100) | Recommended, faster |
| `torch.float16` | Older GPUs (V100, RTX 20xx) | Use if bf16 not supported |
| `torch.float32` | Any | Slower, only for debugging |
Check bf16 support:
```python
import torch
print(torch.cuda.is_bf16_supported())  # True on Ampere+
```

Comparison: Quantization Options


Recommended: NF4 + double quant + bf16

```python
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

Maximum memory savings (slightly slower)

```python
max_savings_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16 uses less memory than bf16
)
```

8-bit alternative (less compression, sometimes more stable)

```python
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
```

Memory Requirements

| Model Size | Full Fine-tuning | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 34B | ~272 GB | ~75 GB | ~20 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
Notes:
  • QLoRA memory includes model + optimizer states + activations
  • Actual usage varies with batch size, sequence length, and gradient checkpointing
  • Add ~20% buffer for safe operation
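The table entries can be sanity-checked against raw weight storage. A minimal sketch (assuming 1 GB = 1e9 bytes; the helper name is illustrative):

```python
# Base-model weight memory alone, ignoring adapters, optimizer states,
# and activations: n_params * bits / 8 bytes.

def model_weight_gb(n_params, bits):
    """Weight memory in GB for n_params parameters at a given bit width."""
    return n_params * bits / 8 / 1e9

# 7B at 4-bit -> 3.5 GB of quantized weights; at 16-bit -> 14 GB.
# The gap between these figures and the table (e.g. ~6 GB total for
# 7B QLoRA) is the adapters, optimizer states, and activations.
```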

GPU Recommendations

| GPU VRAM | Max Model Size (QLoRA) |
|---|---|
| 8 GB | 7B (tight) |
| 16 GB | 7-13B |
| 24 GB | 13-34B |
| 48 GB | 34-70B |
| 80 GB | 70B+ comfortably |
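For scripting hardware checks, the table can be encoded as a small lookup. The tiers are taken directly from the table above; the function name and structure are illustrative only:

```python
# The VRAM tiers above, as (minimum GB, model-size label) pairs.
VRAM_TIERS = [
    (8,  "7B (tight)"),
    (16, "7-13B"),
    (24, "13-34B"),
    (48, "34-70B"),
    (80, "70B+ comfortably"),
]

def max_qlora_model(vram_gb):
    """Return the largest model tier trainable with QLoRA at this VRAM."""
    best = "insufficient VRAM for QLoRA fine-tuning"
    for tier_gb, label in VRAM_TIERS:
        if vram_gb >= tier_gb:
            best = label
    return best
```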

Complete Training Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
```

1. Quantization config

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

2. Load quantized model

```python
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Optional: faster attention
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```

3. Prepare for k-bit training (critical step!)

```python
model = prepare_model_for_kbit_training(model)
```

4. LoRA config (see lora skill for parameter details)

```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

5. Dataset

```python
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_example(example):
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)
```

6. Training

```python
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_seq_length=512,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()
```

7. Save adapter

```python
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")
```

Inference and Merging

Inference with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-3.1-8B"
```

Load quantized base model

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Load adapter

```python
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
model.eval()
```

Generate

```python
inputs = tokenizer(
    "### Instruction:\nExplain quantum computing.\n\n### Response:\n",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Merging to Full Precision

To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch
```

Load base model in full precision (on CPU to avoid OOM)

```python
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
```

Load adapter

```python
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
```

Merge and unload

```python
merged_model = model.merge_and_unload()
```

Save merged model

```python
merged_model.save_pretrained("./merged-model")
```

**Note**: Merging requires enough RAM to hold the full-precision model. For 70B models, this means ~140GB RAM.
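That RAM figure follows directly from the parameter count. A quick arithmetic check (the helper name is illustrative):

```python
# Merging materializes the full-precision bf16 weights in RAM,
# at 2 bytes per parameter.

def merge_ram_gb(n_params):
    """Approximate RAM needed to hold bf16 weights during merging."""
    return n_params * 2 / 1e9

# 70e9 parameters -> 140 GB, matching the note above; an 8B model
# needs only ~16 GB, so it can merge on an ordinary workstation.
```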

Troubleshooting

CUDA Version Issues


Check CUDA version

```bash
nvcc --version
python -c "import torch; print(torch.version.cuda)"
```

bitsandbytes requires CUDA 11.7+

If version mismatch, reinstall:

```bash
pip uninstall bitsandbytes
pip install bitsandbytes --upgrade
```

"cannot find libcudart" or Missing Library Errors


Find CUDA installation

```bash
find /usr -name "libcudart*" 2>/dev/null
```

Set environment variable

```bash
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Or for conda:

```bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```

Slow Training

Common cause: compute dtype mismatch

Check if model is using expected dtype

```python
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.dtype}")
        break  # All LoRA params should match
```

Ensure bf16 is used in training args if BitsAndBytesConfig uses bf16

Mismatch causes constant dtype conversions


Out of Memory


1. Enable gradient checkpointing

```python
model.gradient_checkpointing_enable()
```

2. Reduce batch size, increase accumulation

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
```

3. Use paged optimizer

```python
optim = "paged_adamw_8bit"
```

4. Reduce sequence length

```python
max_seq_length = 256
```

5. Target fewer modules

```python
target_modules = ["q_proj", "v_proj"]  # Minimal set
```

Model Loads But Training Fails


Ensure prepare_model_for_kbit_training is called

```python
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # Don't skip this!
```

Enable input gradients if needed

```python
model.enable_input_require_grads()
```

Best Practices

  1. Always use `prepare_model_for_kbit_training`: this enables gradient computation through the frozen quantized layers
  2. Match compute dtype with training precision: if `bnb_4bit_compute_dtype=torch.bfloat16`, use `bf16=True` in training args
  3. Use paged optimizers for large models: `optim="paged_adamw_8bit"` or `"paged_adamw_32bit"` handles memory spikes
  4. Start with NF4 + double quantization: this is the recommended default; only change it if debugging
  5. Gradient checkpointing is essential: always enable it for QLoRA training to fit larger batch sizes
  6. Test inference before long training runs: load the model and generate a few tokens to catch configuration issues early
  7. Monitor GPU memory: use `nvidia-smi` or `torch.cuda.memory_summary()` to track actual usage
  8. Consider 8-bit for unstable training: if 4-bit training shows instability, try `load_in_8bit=True` as a middle ground