model_finetuning


TRL - Transformer Reinforcement Learning


Quick start


TRL provides post-training methods for aligning language models with human preferences.
Installation:
```bash
pip install trl transformers datasets peft accelerate
```
Supervised Fine-Tuning (instruction tuning):
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,  # Prompt-completion pairs
)
trainer.train()
```
DPO (align with preferences):
```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs
    processing_class=tokenizer
)
trainer.train()
```

Common workflows


Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)


Complete pipeline from base model to human-aligned model.
Copy this checklist:
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model

**Step 1: Supervised fine-tuning**

Train base model on instruction-following data:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure training
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch"
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)
trainer.train()
trainer.save_model()
```

**Step 2: Train reward model**

Train model to predict human preferences:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen2.5-0.5B-SFT",
    num_labels=1  # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")

# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure training
training_args = RewardConfig(
    output_dir="Qwen2.5-0.5B-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train reward model
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
trainer.save_model()
```

**Step 3: PPO reinforcement learning**

Optimize policy using reward model:

```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen2.5-0.5B-SFT \
    --reward_model_path Qwen2.5-0.5B-Reward \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir Qwen2.5-0.5B-PPO \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000
```

**Step 4: Evaluate**

```python
from transformers import pipeline

# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")

# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
```

Workflow 2: Simple preference alignment with DPO


Align model with preferences without reward model.
Copy this checklist:
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment

**Step 1: Prepare preference dataset**

Dataset format:
```json
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I don't know."
}
```
Load dataset:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Or load your own
dataset = load_dataset("json", data_files="preferences.json")
```


**Step 2: Configure DPO**

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty strength
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10
)
```

**Step 3: Train with DPOTrainer**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer
)

trainer.train()
trainer.save_model()
```

CLI alternative:
```bash
trl dpo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-7 \
    --beta 0.1
```
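
**Step 4: Evaluate alignment**

The checklist above also lists an evaluation step that isn't spelled out here. A minimal sketch, assuming the `Qwen2.5-0.5B-DPO` output directory from Step 3 and a couple of illustrative held-out prompts, is to generate from the base and tuned models side by side and compare the responses:

```python
from transformers import pipeline

# Illustrative comparison of base vs. DPO-tuned outputs; prompts are placeholders
base = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
tuned = pipeline("text-generation", model="Qwen2.5-0.5B-DPO")  # output_dir from Step 3

prompts = ["What is the capital of France?", "Explain overfitting in one sentence."]
for prompt in prompts:
    print("PROMPT:", prompt)
    print("BASE:", base(prompt, max_length=200)[0]["generated_text"])
    print("DPO: ", tuned(prompt, max_length=200)[0]["generated_text"])
```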


Workflow 3: Memory-efficient online RL with GRPO


Train with reinforcement learning using minimal memory.
Copy this checklist:
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer

**Step 1: Define reward function**

```python
def reward_function(completions, **kwargs):
    """
    Compute rewards for completions.

    Args:
        completions: List of generated texts

    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for completion in completions:
        # Example: reward based on length and unique words
        score = len(completion.split())  # Favor longer responses
        score += len(set(completion.lower().split()))  # Reward unique words
        rewards.append(score)
    return rewards
```

Or use a reward model:
```python
from transformers import pipeline

reward_model = pipeline("text-classification", model="reward-model-path")

def reward_from_model(completions, prompts, **kwargs):
    # Combine prompt + completion
    full_texts = [p + c for p, c in zip(prompts, completions)]
    # Get reward scores
    results = reward_model(full_texts)
    return [r["score"] for r in results]
```

**Step 2: Configure GRPO**

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2-GRPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    num_generations=4,  # Generate 4 completions per prompt
    max_new_tokens=128
)
```

**Step 3: Train with GRPOTrainer**

```python
from datasets import load_dataset
from trl import GRPOTrainer

# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_function,  # Your reward function
    args=config,
    train_dataset=dataset
)
trainer.train()
```

**CLI**:
```bash
trl grpo \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/tldr \
    --output_dir Qwen2-GRPO \
    --num_generations 4
```

When to use vs alternatives


Use TRL when:
  • Need to align model with human preferences
  • Have preference data (chosen/rejected pairs)
  • Want to use reinforcement learning (PPO, GRPO)
  • Need reward model training
  • Doing RLHF (full pipeline)
Method selection:
  • SFT: Have prompt-completion pairs, want basic instruction following
  • DPO: Have preferences, want simple alignment (no reward model needed)
  • PPO: Have reward model, need maximum control over RL
  • GRPO: Memory-constrained, want online RL
  • Reward Model: Building RLHF pipeline, need to score generations
Use alternatives instead:
  • HuggingFace Trainer: Basic fine-tuning without RL
  • Axolotl: YAML-based training configuration
  • LitGPT: Educational, minimal fine-tuning
  • Unsloth: Fast LoRA training

Common issues


**Issue: OOM during DPO training**

Reduce batch size and sequence length:
```python
config = DPOConfig(
    per_device_train_batch_size=1,  # Reduce from 4
    max_length=512,  # Reduce from 1024
    gradient_accumulation_steps=8  # Maintain effective batch
)
```
Or use gradient checkpointing:
```python
model.gradient_checkpointing_enable()
```

**Issue: Poor alignment quality**

Tune beta parameter:
```python
# Higher beta = more conservative (stays closer to reference)
config = DPOConfig(beta=0.5)  # Default 0.1

# Lower beta = more aggressive alignment
config = DPOConfig(beta=0.01)
```

**Issue: Reward model not learning**

Check loss type and learning rate:
```python
config = RewardConfig(
    learning_rate=1e-5,  # Try different LR
    num_train_epochs=3  # Train longer
)
```
Ensure preference dataset has clear winners:
```python
# Verify dataset
print(dataset[0])
# Should have clear chosen > rejected
```
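If many pairs are near-ties, a rough filter can help. This is a minimal sketch, assuming the `chosen`/`rejected` columns used above; it simply drops pairs where the two responses are identical:

```python
# Illustrative filter: drop pairs whose chosen and rejected responses are identical
def has_clear_winner(example):
    return example["chosen"] != example["rejected"]

dataset = dataset.filter(has_clear_winner)
print(f"{len(dataset)} pairs remain with distinct chosen/rejected responses")
```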


**Issue: PPO training unstable**

Adjust KL coefficient:
```python
config = PPOConfig(
    kl_coef=0.1,  # Increase from 0.05
    cliprange=0.1  # Reduce from 0.2
)
```

Advanced topics


SFT training guide: See references/sft-training.md for dataset formats, chat templates, packing strategies, and multi-GPU training.
DPO variants: See references/dpo-variants.md for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.
Reward modeling: See references/reward-modeling.md for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
Online RL methods: See references/online-rl.md for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.

Hardware requirements


  • GPU: NVIDIA (CUDA required)
  • VRAM: Depends on model and method
    • SFT 7B: 16GB (with LoRA)
    • DPO 7B: 24GB (stores reference model)
    • PPO 7B: 40GB (policy + reward model)
    • GRPO 7B: 24GB (more memory efficient)
  • Multi-GPU: Supported via accelerate
Memory optimization (see the sketch after this list):
  • Use LoRA/QLoRA for all methods
  • Enable gradient checkpointing
  • Use smaller batch sizes with gradient accumulation
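
A minimal sketch combining these options for SFT. The LoRA rank, alpha, and target modules below are illustrative defaults rather than tuned values, `dataset` reuses the SFT dataset from Workflow 1, and a multi-GPU run would launch the same script with `accelerate launch train.py`:

```python
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# LoRA adapter instead of full fine-tuning (fewer trainable parameters, less VRAM)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-LoRA",
    per_device_train_batch_size=1,   # Small per-device batch...
    gradient_accumulation_steps=16,  # ...accumulated to an effective batch of 16
    gradient_checkpointing=True,     # Trade recompute for activation memory
    bf16=True,                       # Mixed precision (A100/H100)
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```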

Resources


Model Finetuning v1.1 - Enhanced

🔄 Workflow


Stage 1: Preparation (Data-Centric AI)


  • Data Cleaning: Deduplicate the dataset and run quality checks (PII cleanup).
  • Format: Convert the dataset into a model-compatible format (ShareGPT, Alpaca, etc.); see the sketch after this list.
  • Baseline: Measure and record the base model's zero-shot performance.
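
A minimal sketch of the format step, assuming a generic `question`/`answer` source schema (the field names and file path are placeholders) converted to Alpaca-style records with `datasets.map`:

```python
from datasets import load_dataset

raw = load_dataset("json", data_files="raw_data.json", split="train")  # placeholder path

# Map a generic {"question", "answer"} record to Alpaca-style {"instruction", "input", "output"}
def to_alpaca(example):
    return {
        "instruction": example["question"],
        "input": "",
        "output": example["answer"],
    }

dataset = raw.map(to_alpaca, remove_columns=raw.column_names)
```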

Stage 2: Training (Parameter-Efficient)


  • LoRA/QLoRA: Use LoRA (rank 16-64) instead of full fine-tuning (less VRAM, 95%+ of the performance).
  • Monitoring: Track loss curves and eval metrics live with WandB or MLflow (see the sketch below).
  • Checkpointing: Save model weights every epoch or at fixed step intervals.
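
A minimal sketch of the monitoring and checkpointing knobs, assuming a WandB account and recent transformers/TRL argument names:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-LoRA",
    report_to="wandb",      # Stream loss curves and metrics to Weights & Biases
    logging_steps=10,
    eval_strategy="steps",  # Evaluate periodically during training
    eval_steps=200,
    save_strategy="steps",  # Save weights at fixed step intervals
    save_steps=200,
    save_total_limit=3,     # Keep only the most recent checkpoints
)
```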

Stage 3: Alignment & Evaluation


  • Alignment: After SFT, align to human preferences with DPO or PPO if needed.
  • Evaluation: Run automated and manual tests using the llm_evaluation skill.
  • Merging: Merge the LoRA adapters into the base model and quantize (GGUF/AWQ); see the sketch after this list.
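
A minimal sketch of the merge step with PEFT; the adapter directory is a placeholder, and GGUF/AWQ quantization is done afterwards on the merged checkpoint with external tooling (e.g. llama.cpp or AutoAWQ):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model = PeftModel.from_pretrained(base, "Qwen2.5-0.5B-SFT-LoRA")  # LoRA adapter dir (placeholder)

merged = model.merge_and_unload()  # Fold the LoRA weights into the base model
merged.save_pretrained("Qwen2.5-0.5B-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B").save_pretrained("Qwen2.5-0.5B-merged")
```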

Verification checkpoints

| Stage | Verification |
|-------|--------------|
| 1 | Is validation loss rising while training loss falls (overfitting)? |
| 2 | Is the model suffering catastrophic forgetting (losing its earlier capabilities)? |
| 3 | Do inference speed and memory use (VRAM) fit the deployment environment? |
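
For the first check, a minimal early-stopping sketch (assuming a held-out eval split and recent transformers argument names) stops training and rolls back when eval loss stops improving:

```python
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,        # Roll back to the best checkpoint by eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=train_dataset,  # assumed train/eval splits
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Stop if eval loss stops improving
)
trainer.train()
```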