model_finetuning
TRL - Transformer Reinforcement Learning
Quick start
TRL provides post-training methods for aligning language models with human preferences.
Installation:
```bash
pip install trl transformers datasets peft accelerate
```

Supervised Fine-Tuning (instruction tuning):
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,  # Prompt-completion pairs
)
trainer.train()
```

DPO (align with preferences):
```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs
    processing_class=tokenizer
)
trainer.train()
```
Common workflows
Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
Complete pipeline from base model to human-aligned model.
Copy this checklist:
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model

**Step 1: Supervised fine-tuning**
Train base model on instruction-following data:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure training
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch"
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)
trainer.train()
trainer.save_model()
```
**Step 2: Train reward model**
Train model to predict human preferences:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen2.5-0.5B-SFT",
    num_labels=1  # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")

# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure training
training_args = RewardConfig(
    output_dir="Qwen2.5-0.5B-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train reward model
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
trainer.save_model()
```
**Step 3: PPO reinforcement learning**
Optimize policy using reward model:
```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen2.5-0.5B-SFT \
    --reward_model_path Qwen2.5-0.5B-Reward \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir Qwen2.5-0.5B-PPO \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000
```

**Step 4: Evaluate**
```python
from transformers import pipeline

# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")

# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
```
Workflow 2: Simple preference alignment with DPO
Align model with preferences without reward model.
Copy this checklist:
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment

**Step 1: Prepare preference dataset**
Dataset format:
```json
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I don't know."
}
```

Load dataset:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Or load your own
dataset = load_dataset("json", data_files="preferences.json")
```
**Step 2: Configure DPO**
```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty strength
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10
)
```

**Step 3: Train with DPOTrainer**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer
)
trainer.train()
trainer.save_model()
```

CLI alternative:

```bash
trl dpo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-7 \
    --beta 0.1
```
Workflow 3: Memory-efficient online RL with GRPO
Train with reinforcement learning using minimal memory.
Copy this checklist:
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer

**Step 1: Define reward function**
```python
def reward_function(completions, **kwargs):
    """
    Compute rewards for completions.
    Args:
        completions: List of generated texts
    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for completion in completions:
        # Example: reward based on length and unique words
        score = len(completion.split())  # Favor longer responses
        score += len(set(completion.lower().split()))  # Reward unique words
        rewards.append(score)
    return rewards
```

Or use a reward model:
```python
from transformers import pipeline

reward_model = pipeline("text-classification", model="reward-model-path")

def reward_from_model(completions, prompts, **kwargs):
    # Combine prompt + completion
    full_texts = [p + c for p, c in zip(prompts, completions)]
    # Get reward scores
    results = reward_model(full_texts)
    return [r["score"] for r in results]
```

**Step 2: Configure GRPO**
```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2-GRPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    num_generations=4,  # Generate 4 completions per prompt
    max_new_tokens=128
)
```

**Step 3: Train with GRPOTrainer**
```python
from datasets import load_dataset
from trl import GRPOTrainer

# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_function,  # Your reward function
    args=config,
    train_dataset=dataset
)
trainer.train()
```

**CLI**:

```bash
trl grpo \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/tldr \
    --output_dir Qwen2-GRPO \
    --num_generations 4
```

When to use vs alternatives
Use TRL when:
- Need to align model with human preferences
- Have preference data (chosen/rejected pairs)
- Want to use reinforcement learning (PPO, GRPO)
- Need reward model training
- Doing RLHF (full pipeline)
Method selection:
- SFT: Have prompt-completion pairs, want basic instruction following
- DPO: Have preferences, want simple alignment (no reward model needed)
- PPO: Have reward model, need maximum control over RL
- GRPO: Memory-constrained, want online RL
- Reward Model: Building RLHF pipeline, need to score generations
Use alternatives instead:
- HuggingFace Trainer: Basic fine-tuning without RL
- Axolotl: YAML-based training configuration
- LitGPT: Educational, minimal fine-tuning
- Unsloth: Fast LoRA training
Common issues
**Issue: OOM during DPO training**
Reduce batch size and sequence length:
```python
config = DPOConfig(
    per_device_train_batch_size=1,  # Reduce from 4
    max_length=512,  # Reduce from 1024
    gradient_accumulation_steps=8  # Maintain effective batch
)
```

Or use gradient checkpointing:

```python
model.gradient_checkpointing_enable()
```

**Issue: Poor alignment quality**
Tune beta parameter:
```python
# Higher beta = more conservative (stays closer to reference)
config = DPOConfig(beta=0.5)  # Default 0.1

# Lower beta = more aggressive alignment
config = DPOConfig(beta=0.01)
```
**Issue: Reward model not learning**
Check loss type and learning rate:
```python
config = RewardConfig(
    learning_rate=1e-5,  # Try different LR
    num_train_epochs=3  # Train longer
)
```

Ensure preference dataset has clear winners:

```python
# Verify dataset
print(dataset[0])
# Should have clear chosen > rejected
```
**Issue: PPO training unstable**
Adjust KL coefficient:
```python
config = PPOConfig(
    kl_coef=0.1,  # Increase from 0.05
    cliprange=0.1  # Reduce from 0.2
)
```

Advanced topics
SFT training guide: See references/sft-training.md for dataset formats, chat templates, packing strategies, and multi-GPU training.
DPO variants: See references/dpo-variants.md for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.
Reward modeling: See references/reward-modeling.md for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
Online RL methods: See references/online-rl.md for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
Hardware requirements
- GPU: NVIDIA (CUDA required)
- VRAM: Depends on model and method
  - SFT 7B: 16GB (with LoRA)
  - DPO 7B: 24GB (stores reference model)
  - PPO 7B: 40GB (policy + reward model)
  - GRPO 7B: 24GB (more memory efficient)
- Multi-GPU: Supported via `accelerate`
- Mixed precision: BF16 recommended (A100/H100)
Memory optimization (a combined sketch follows this list):
- Use LoRA/QLoRA for all methods
- Enable gradient checkpointing
- Use smaller batch sizes with gradient accumulation
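A minimal sketch combining these techniques for DPO with QLoRA. The 4-bit loading, LoRA rank, and batch settings are illustrative assumptions, and `preference_dataset`/`tokenizer` are assumed to be the objects from the DPO examples above:

```python
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Load the policy in 4-bit (QLoRA) to cut VRAM
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO-QLoRA",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keep the effective batch size
    gradient_checkpointing=True,     # trade compute for memory
    learning_rate=5e-7,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs, as above
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```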
Resources
- Docs: https://huggingface.co/docs/trl/
- GitHub: https://github.com/huggingface/trl
- Papers:
- "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (DPO, 2023)
- "Group Relative Policy Optimization" (GRPO, 2024)
- Examples: https://github.com/huggingface/trl/tree/main/examples/scripts
Model Finetuning v1.1 - Enhanced
🔄 Workflow
Stage 1: Preparation (Data-Centric AI)
- Data Cleaning: Deduplicate the dataset and run quality checks (remove PII).
- Format: Convert the dataset into a format the model expects (ShareGPT, Alpaca, etc.); see the sketch after this list.
- Baseline: Measure and record the base model's zero-shot performance.
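A minimal sketch of the cleaning and conversion step, assuming an Alpaca-style dataset with instruction/input/output columns and a hypothetical raw_data.json file; adapt the dedup key and column names to your data:

```python
from datasets import load_dataset

# Hypothetical local file; point this at your own raw data
dataset = load_dataset("json", data_files="raw_data.json", split="train")

# Exact deduplication on the normalized instruction text
seen = set()
def is_new(example):
    key = example["instruction"].strip().lower()
    if key in seen:
        return False
    seen.add(key)
    return True

dataset = dataset.filter(is_new)

# Convert Alpaca-style rows to the prompt-completion format used in the SFT examples above
def to_prompt_completion(example):
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n\n" + example["input"]
    return {"prompt": prompt, "completion": example["output"]}

dataset = dataset.map(to_prompt_completion, remove_columns=dataset.column_names)
print(dataset[0])
```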
Stage 2: Training (Parameter Efficient)
- LoRA/QLoRA: Use LoRA (rank 16-64) instead of full fine-tuning (far less VRAM, typically 95%+ of the performance); see the sketch after this list.
- Monitoring: Track loss curves and eval metrics live with WandB or MLflow.
- Checkpointing: Save model weights every epoch or at a fixed step interval.
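A minimal LoRA SFT sketch wiring these points together. The rank, learning rate, target modules, and save interval are illustrative, and `report_to="wandb"` assumes you have WandB (or MLflow) set up:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA adapter: rank 16-64 trades adapter capacity against VRAM
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-LoRA",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=500,        # checkpoint every 500 steps
    report_to="wandb",     # or "mlflow"
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,  # prompt-completion pairs from the preparation stage
    peft_config=peft_config,
)
trainer.train()
```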
Stage 3: Alignment & Evaluation
- Alignment: After SFT, align the model with human preferences via DPO or PPO if needed.
- Evaluation: Run automated and manual tests using the llm_evaluation skill.
- Merging: Merge the LoRA adapters into the base model and quantize (GGUF/AWQ); see the sketch after this list.
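A minimal merge sketch, assuming the adapter was saved to the Qwen2.5-0.5B-SFT-LoRA directory from the training sketch above; quantization to GGUF or AWQ happens afterwards with external tools such as llama.cpp or AutoAWQ:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the trained LoRA adapter
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model = PeftModel.from_pretrained(base, "Qwen2.5-0.5B-SFT-LoRA")

# Fold the adapter weights back into the base model
merged = model.merge_and_unload()

# Save the merged model and tokenizer for later quantization (GGUF/AWQ)
merged.save_pretrained("Qwen2.5-0.5B-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B").save_pretrained("Qwen2.5-0.5B-merged")
```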
Checkpoints

| Stage | Check |
|---|---|
| 1 | Is validation loss rising while training loss keeps falling (overfitting)? |
| 2 | Is the model showing catastrophic forgetting (losing its original capabilities)? |
| 3 | Do inference speed and memory (VRAM) usage fit the deployment environment? |