
QLoRA: Quantized Low-Rank Adaptation


QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.
Prerequisites: This skill assumes familiarity with LoRA. See the `lora` skill for LoRA fundamentals (`LoraConfig`, `target_modules`, training patterns).


Core Innovations

QLoRA introduces three techniques that reduce memory usage without sacrificing performance:

4-bit NormalFloat (NF4)

NF4 is an information-theoretically optimal quantization data type for normally distributed weights. Neural network weights are typically normally distributed, making NF4 more efficient than standard 4-bit floats.
Storage: 4-bit NF4 (quantized weights)
Compute: 16-bit BF16 (dequantized for forward/backward pass)
The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; LoRA adapters remain in full precision.
NF4 vs FP4:
| Quantization | Description | Use Case |
|---|---|---|
| `nf4` | Normalized Float 4-bit, optimal for normal distributions | Default, recommended |
| `fp4` | Standard 4-bit float | Legacy, rarely needed |
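The mechanics of block-wise 4-bit quantization can be sketched in pure Python. This is illustrative only: the 16-level codebook below is a hypothetical stand-in that is denser near zero (as NF4's normal-quantile levels are), not the exact table bitsandbytes ships, and the block size and function names are made up for the example.

```python
# Illustrative sketch of NF4-style block quantization: weights are scaled
# by the block's absolute maximum, then snapped to the nearest entry in a
# fixed 16-level codebook whose levels cluster near zero, where normally
# distributed weights concentrate.

# Hypothetical 16-level codebook (the real NF4 levels are derived from
# quantiles of the standard normal distribution).
LEVELS = [-1.0, -0.7, -0.53, -0.39, -0.28, -0.18, -0.09, 0.0,
          0.08, 0.16, 0.25, 0.34, 0.44, 0.56, 0.72, 1.0]

def quantize_block(weights):
    """Quantize one block: return 4-bit codes plus one absmax scale."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(16), key=lambda i: abs(LEVELS[i] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Reconstruct approximate weights, as done for the bf16 compute step."""
    return [LEVELS[c] * absmax for c in codes]

block = [0.12, -0.05, 0.31, -0.22, 0.0, 0.45]
codes, scale = quantize_block(block)
recovered = dequantize_block(codes, scale)
# Each recovered weight is close to, but not exactly, the original.
```

Only the frozen base weights go through this round trip; the LoRA adapter weights never do, which is why adapter quality is preserved.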

Double Quantization

Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:
First quantization:  weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants
This saves approximately 0.37 bits per parameter—significant for billion-parameter models:
  • 7B model: ~325 MB savings
  • 70B model: ~3.2 GB savings
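The per-parameter figure can be sanity-checked with back-of-envelope arithmetic. The block sizes below (64 parameters per quantization block, 256 first-level constants per second-level constant) are the values used in the QLoRA paper; treat this as a rough check, not an exact accounting of the bitsandbytes implementation:

```python
# Back-of-envelope check of the ~0.37 bits/parameter saved by double
# quantization. With fp32 scaling constants and 64-parameter blocks, each
# parameter carries 32/64 = 0.5 extra bits; quantizing those constants to
# 8-bit (with one fp32 second-level constant per 256 blocks) cuts the
# overhead to 8/64 + 32/(64*256) ~= 0.127 bits per parameter.

BLOCK_SIZE = 64           # parameters per quantization block
SECOND_LEVEL_BLOCK = 256  # first-level constants per second-level constant

before = 32 / BLOCK_SIZE
after = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)
saved_bits = before - after  # ~= 0.373 bits per parameter

def savings_mb(n_params):
    """Approximate memory saved in megabytes (1 MB = 1e6 bytes)."""
    return saved_bits * n_params / 8 / 1e6

# ~326 MB for a 7B model and ~3.3 GB for a 70B model -- matching the
# ~325 MB and ~3.2 GB estimates above.
```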

Paged Optimizers

During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:
Normal training: OOM on memory spike
Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully
This is handled automatically by bitsandbytes when using 4-bit training.

BitsAndBytesConfig Deep Dive

All Parameters Explained

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,              # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",      # "nf4" (recommended) or "fp4"

    # Double quantization
    bnb_4bit_use_double_quant=True, # Quantize the quantization constants

    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute

    # Optional: specific storage type (usually auto-detected)
    bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
)
```

Compute Dtype Selection

| Dtype | Hardware | Notes |
|---|---|---|
| `torch.bfloat16` | Ampere+ (RTX 30xx, A100) | Recommended, faster |
| `torch.float16` | Older GPUs (V100, RTX 20xx) | Use if bf16 not supported |
| `torch.float32` | Any | Slower, only for debugging |
Check bf16 support:
```python
import torch
print(torch.cuda.is_bf16_supported())  # True on Ampere+
```

Comparison: Quantization Options


Recommended: NF4 + double quant + bf16

```python
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

Maximum memory savings (slightly slower)

```python
max_savings_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16 uses less memory than bf16
)
```

8-bit alternative (less compression, sometimes more stable)

```python
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
```

Memory Requirements

| Model Size | Full Fine-tuning | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 34B | ~272 GB | ~75 GB | ~20 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
Notes:
  • QLoRA memory includes model + optimizer states + activations
  • Actual usage varies with batch size, sequence length, and gradient checkpointing
  • Add ~20% buffer for safe operation
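The table entries can be sanity-checked against raw weight storage. A minimal sketch (assuming 1 GB = 1e9 bytes; the helper name is illustrative):

```python
# Base-model weight memory alone, ignoring adapters, optimizer states,
# and activations: n_params * bits / 8 bytes.

def model_weight_gb(n_params, bits):
    """Weight memory in GB for n_params parameters at a given bit width."""
    return n_params * bits / 8 / 1e9

# 7B at 4-bit -> 3.5 GB of quantized weights; at 16-bit -> 14 GB.
# The gap between these figures and the table (e.g. ~6 GB total for
# 7B QLoRA) is the adapters, optimizer states, and activations.
```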

GPU Recommendations

| GPU VRAM | Max Model Size (QLoRA) |
|---|---|
| 8 GB | 7B (tight) |
| 16 GB | 7-13B |
| 24 GB | 13-34B |
| 48 GB | 34-70B |
| 80 GB | 70B+ comfortably |
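For scripting hardware checks, the table can be encoded as a small lookup. The tiers are taken directly from the table above; the function name and structure are illustrative only:

```python
# The VRAM tiers above, as (minimum GB, model-size label) pairs.
VRAM_TIERS = [
    (8,  "7B (tight)"),
    (16, "7-13B"),
    (24, "13-34B"),
    (48, "34-70B"),
    (80, "70B+ comfortably"),
]

def max_qlora_model(vram_gb):
    """Return the largest model tier trainable with QLoRA at this VRAM."""
    best = "insufficient VRAM for QLoRA fine-tuning"
    for tier_gb, label in VRAM_TIERS:
        if vram_gb >= tier_gb:
            best = label
    return best
```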

Complete Training Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
```

1. Quantization config

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

2. Load quantized model

```python
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Optional: faster attention
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```

3. Prepare for k-bit training (critical step!)

```python
model = prepare_model_for_kbit_training(model)
```

4. LoRA config (see lora skill for parameter details)

```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

5. Dataset

```python
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_example(example):
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)
```

6. Training

```python
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_seq_length=512,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()
```

7. Save adapter

```python
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")
```

Inference and Merging

Inference with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-3.1-8B"
```

Load quantized base model

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Load adapter

```python
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
model.eval()
```

Generate

```python
inputs = tokenizer(
    "### Instruction:\nExplain quantum computing.\n\n### Response:\n",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Merging to Full Precision

To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch
```

Load base model in full precision (on CPU to avoid OOM)

```python
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
```

Load adapter

```python
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
```

Merge and unload

```python
merged_model = model.merge_and_unload()
```

Save merged model

```python
merged_model.save_pretrained("./merged-model")
```

**Note**: Merging requires enough RAM to hold the full-precision model. For 70B models, this means ~140GB RAM.
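That RAM figure follows directly from the parameter count. A quick arithmetic check (the helper name is illustrative):

```python
# Merging materializes the full-precision bf16 weights in RAM,
# at 2 bytes per parameter.

def merge_ram_gb(n_params):
    """Approximate RAM needed to hold bf16 weights during merging."""
    return n_params * 2 / 1e9

# 70e9 parameters -> 140 GB, matching the note above; an 8B model
# needs only ~16 GB, so it can merge on an ordinary workstation.
```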

Troubleshooting

CUDA Version Issues


Check CUDA version

```bash
nvcc --version
python -c "import torch; print(torch.version.cuda)"
```

bitsandbytes requires CUDA 11.7+

If version mismatch, reinstall:

```bash
pip uninstall bitsandbytes
pip install bitsandbytes --upgrade
```

"cannot find libcudart" or Missing Library Errors


Find CUDA installation

```bash
find /usr -name "libcudart*" 2>/dev/null
```

Set environment variable

```bash
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Or for conda:

```bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```

Slow Training

Common cause: compute dtype mismatch

Check if model is using expected dtype

```python
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.dtype}")
        break  # All LoRA params should match
```

Ensure bf16 is used in training args if BitsAndBytesConfig uses bf16

Mismatch causes constant dtype conversions


Out of Memory


1. Enable gradient checkpointing

```python
model.gradient_checkpointing_enable()
```

2. Reduce batch size, increase accumulation

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
```

3. Use paged optimizer

```python
optim = "paged_adamw_8bit"
```

4. Reduce sequence length

```python
max_seq_length = 256
```

5. Target fewer modules

```python
target_modules = ["q_proj", "v_proj"]  # Minimal set
```

Model Loads But Training Fails


Ensure prepare_model_for_kbit_training is called

```python
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # Don't skip this!
```

Enable input gradients if needed

```python
model.enable_input_require_grads()
```

Best Practices

  1. Always use `prepare_model_for_kbit_training`: this enables gradient computation through the frozen quantized layers
  2. Match compute dtype with training precision: if `bnb_4bit_compute_dtype=torch.bfloat16`, use `bf16=True` in training args
  3. Use paged optimizers for large models: `optim="paged_adamw_8bit"` or `"paged_adamw_32bit"` handles memory spikes
  4. Start with NF4 + double quantization: this is the recommended default; only change it if debugging
  5. Gradient checkpointing is essential: always enable it for QLoRA training to fit larger batch sizes
  6. Test inference before long training runs: load the model and generate a few tokens to catch configuration issues early
  7. Monitor GPU memory: use `nvidia-smi` or `torch.cuda.memory_summary()` to track actual usage
  8. Consider 8-bit for unstable training: if 4-bit training shows instability, try `load_in_8bit=True` as a middle ground