# moai-ml-llm-fine-tuning
LLM Fine-Tuning Expert
Parameter-Efficient Fine-Tuning (PEFT) for Enterprise LLMs
Focus: LoRA, QLoRA, Domain Adaptation
Models: Llama 3.1, Mistral, Mixtral, Falcon
Stack: PyTorch, Transformers, PEFT, bitsandbytes
## Overview
Enterprise-grade fine-tuning strategies for customizing Large Language Models (LLMs) with minimal resource requirements.
## Core Capabilities
- Parameter-Efficient Fine-Tuning (PEFT): LoRA, QLoRA, Prefix Tuning
- Quantization: 4-bit/8-bit training with bitsandbytes
- Distributed Training: Multi-GPU, DeepSpeed, FSDP
- Optimization: Flash Attention 2, Gradient Checkpointing
- Evaluation: Perplexity, BLEU, ROUGE, Domain benchmarks
## Technology Stack
- PEFT 0.13+: Adapter management
- Transformers 4.45+: Model architecture
- TRL 0.11+: Supervised Fine-Tuning (SFT), DPO
- Accelerate 0.34+: Training loop orchestration
- bitsandbytes 0.45+: Low-precision optimization
## Fine-Tuning Strategies
| Method | Params Updated | VRAM (70B) | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | ~420GB | Foundation model creation |
| LoRA | 0.1-1% | ~180GB | Domain adaptation, Style transfer |
| QLoRA | 0.1-1% | ~24GB | Consumer GPU training, Cost efficiency |
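The 0.1-1% figure in the table can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming illustrative Llama-3.1-8B-style shapes (hidden 4096, MLP 14336, grouped-query KV dim 1024, 32 layers):

```python
def lora_param_count(r, shapes):
    # Each adapted weight W (d_out x d_in) gains two low-rank factors:
    # A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) new params.
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Per-layer (d_out, d_in) shapes for one decoder layer (illustrative):
layer_shapes = [
    (4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096),  # q, k, v, o
    (14336, 4096), (14336, 4096), (4096, 14336),             # gate, up, down
]
trainable = lora_param_count(r=16, shapes=layer_shapes) * 32  # 32 layers
print(f"{trainable:,}")                       # 41,943,040 (~42M)
print(f"{trainable / 8e9:.2%} of 8B params")  # ~0.52%
```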
## Implementation Patterns
### 1. QLoRA Configuration (Recommended)
Efficient 4-bit training for large models.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def setup_qlora_model(model_id="meta-llama/Llama-3.1-8B"):
    # 1. Quantization config: 4-bit NF4, bf16 compute, double quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    # 2. Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )

    # 3. Enable gradient checkpointing & prepare for k-bit training
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    # 4. LoRA config
    peft_config = LoraConfig(
        r=16,            # Rank
        lora_alpha=32,   # Alpha (scaling)
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # 5. Apply adapter
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model
```
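A rough memory estimate shows why this configuration fits 8B-class models on a single consumer GPU (a sketch with illustrative numbers; activations, KV cache, and CUDA overhead come on top, and gradient checkpointing keeps the activation share small):

```python
def qlora_static_memory_gb(n_params, adapter_params, bits=4):
    # Frozen base weights in 4-bit NF4 (~0.5 bytes/param); LoRA adapters
    # in bf16; Adam moment estimates kept only for the adapters.
    base = n_params * bits / 8
    adapters = adapter_params * 2    # bf16 adapter weights
    optimizer = adapter_params * 8   # fp32 m and v states (approx.)
    return (base + adapters + optimizer) / 1024**3

print(f"{qlora_static_memory_gb(8e9, 42e6):.1f} GB")  # ~4.1 GB of static state
```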
### 2. Training Loop with TRL
Supervised Fine-Tuning (SFT) using HuggingFace TRL.
```python
from trl import SFTTrainer
from transformers import TrainingArguments

def train_model(model, tokenizer, dataset, output_dir="./results"):
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        fp16=True,  # or bf16=True on Ampere+ GPUs
        logging_steps=10,
        save_strategy="epoch",
        optim="paged_adamw_8bit",
        report_to="tensorboard",
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text",  # Column containing the formatted prompt
        max_seq_length=2048,
        tokenizer=tokenizer,
        args=training_args,
        packing=False,
        # Note: newer TRL releases move dataset_text_field, max_seq_length,
        # and packing into SFTConfig; this signature matches TRL 0.11.x.
    )

    trainer.train()
    trainer.save_model()
```
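The hyperparameters above determine the effective batch size and warmup length; checking them before a run avoids surprises (a sketch; `dataset_size` and `n_gpus` are illustrative):

```python
import math

def training_schedule(dataset_size, per_device_bs=4, grad_accum=4,
                      n_gpus=1, epochs=3, warmup_ratio=0.03):
    # Effective batch = per-device batch x accumulation steps x GPU count
    effective_bs = per_device_bs * grad_accum * n_gpus
    steps_per_epoch = math.ceil(dataset_size / effective_bs)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return effective_bs, total_steps, warmup_steps

bs, steps, warmup = training_schedule(dataset_size=15_000)
print(bs, steps, warmup)  # 16 2814 84
```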
### 3. Data Preparation
Formatting data for instruction tuning.
```python
from datasets import load_dataset

def format_instruction(sample):
    # Alpaca-style instruction prompt
    return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""

def prepare_dataset(path="databricks/databricks-dolly-15k"):
    # Note: dolly-15k's actual column names are instruction/context/response;
    # rename columns or adapt the template to your dataset's schema.
    dataset = load_dataset(path, split="train")
    # Format into a single "text" column for SFTTrainer
    dataset = dataset.map(lambda x: {"text": format_instruction(x)})
    return dataset
```
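Applied to one record, the template yields a single prompt string. A self-contained restatement for illustration (the toy record is hypothetical):

```python
def format_instruction(sample):
    # Same Alpaca-style template as in prepare_dataset above
    return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""

record = {
    "instruction": "Translate to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
prompt = format_instruction(record)
print(prompt)
```

A held-out split (e.g. `dataset.train_test_split(test_size=0.1)` from the `datasets` library) is also worth creating at this stage, since the validation checklist expects one.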
---

## Advanced Techniques
### Multi-GPU Distributed Training
Using Accelerate and DeepSpeed for scaling.
`config.yaml` for accelerate:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero_stage: 2  # Accelerate's DeepSpeed plugin key for the ZeRO stage
```

Run command:

```bash
accelerate launch --config_file config.yaml train.py
```
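Why ZeRO stage 2 helps: with mixed-precision Adam, each parameter costs roughly 16 bytes (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and Adam moments); stage 2 shards the gradient and optimizer portions across GPUs, and the CPU offload options above can move them off-GPU entirely. A rough sketch of the per-GPU footprint:

```python
def zero2_per_gpu_gb(n_params, n_gpus):
    # fp16 weights are replicated on every GPU; fp16 gradients plus
    # fp32 Adam state (master weights, m, v) are sharded under ZeRO-2.
    replicated = 2 * n_params
    sharded = (2 + 12) * n_params / n_gpus
    return (replicated + sharded) / 1024**3

for gpus in (1, 4, 8):
    print(f"{gpus} GPUs: {zero2_per_gpu_gb(8e9, gpus):.0f} GB/GPU")
# 1 GPU ≈ 119 GB, 4 GPUs ≈ 41 GB, 8 GPUs ≈ 28 GB for an 8B model
```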
### Model Merging
Merging LoRA adapters back into the base model for deployment.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_adapter(base_model_id, adapter_path, output_path):
    # Load base model in FP16 (not 4-bit) so weights can be merged exactly
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # Load adapter
    model = PeftModel.from_pretrained(base_model, adapter_path)

    # Merge LoRA weights into the base model and drop the adapter wrappers
    model = model.merge_and_unload()

    # Save merged model plus tokenizer (models carry no .tokenizer attribute,
    # so load it separately from the base checkpoint)
    model.save_pretrained(output_path)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    tokenizer.save_pretrained(output_path)
```
## Validation Checklist
Setup:
- GPU environment verified (CUDA available)
- Dependencies installed (peft, trl, bitsandbytes)
- HuggingFace token configured
Data:
- Dataset formatted correctly (Instruction/Input/Output)
- Tokenization length checked (< context window)
- Train/Val split created
Training:
- QLoRA config applied (4-bit, nf4)
- Gradient checkpointing enabled
- Learning rate scheduled (warmup + decay)
- Loss monitoring active
Evaluation:
- Perplexity calculated
- Generation quality manually verified
- Adapter merged successfully
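For the perplexity item above: perplexity is the exponential of the mean per-token negative log-likelihood on held-out data. A minimal sketch, assuming equal token counts per batch (`eval_losses` stands in for per-batch mean losses from `trainer.evaluate()` or a manual eval loop):

```python
import math

def perplexity(mean_nll_values):
    # PPL = exp(average negative log-likelihood per token)
    return math.exp(sum(mean_nll_values) / len(mean_nll_values))

eval_losses = [2.31, 2.18, 2.40, 2.25]  # illustrative per-batch values
print(f"{perplexity(eval_losses):.2f}")  # 9.83
```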
## Related Skills
- moai-domain-ml: General ML workflows
- moai-domain-data-science: Data preparation
- moai-essentials-perf: Inference optimization
Last Updated: 2025-11-20