trl-training

TRL Training Skill

You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.

Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
  • SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
  • DPO (Direct Preference Optimization): Align models using preference data
  • GRPO (Group Relative Policy Optimization): Train models by sampling several outputs per prompt and optimizing on each output's reward relative to the rest of its group
  • RLOO (Reinforce Leave One Out): Online RL training with generation-based rewards
  • Reward Model Training: Train reward models for RLHF
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.

Core Commands

trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.
Full training:
bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
Train with LoRA adapters:
bash
trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
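The LoRA run above sets `--lora_r 32` and `--lora_alpha 16`. As a rough sketch of what those two numbers control, under the standard LoRA formulation (this is illustrative arithmetic, not TRL or PEFT code, and the layer dimensions are hypothetical):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update in the standard formulation (alpha / r)."""
    return lora_alpha / r


# Hypothetical square projection layer; real layer sizes depend on the model.
d_in = d_out = 1024

full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, r=32)

print(f"full layer:   {full_params} parameters")
print(f"LoRA adapter: {adapter_params} parameters "
      f"({adapter_params / full_params:.2%} of the layer)")
print(f"update scaling: {lora_scaling(lora_alpha=16, r=32)}")
```

Because far fewer parameters are trained, the LoRA example can use a higher learning rate (2.0e-4) than the full fine-tuning example (2.0e-5).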

trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).
Full training:
bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns
Train with LoRA adapters:
bash
trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
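To make "chosen/rejected pairs" concrete, here is a minimal sketch of the per-pair DPO objective as it is usually written: the policy is rewarded for raising the log-probability of the chosen response, relative to a frozen reference model, more than it raises the rejected one. The log-probabilities and the `beta` value below are made-up illustration inputs, and this is not TRL's implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the margin is zero, so the loss is log(2).
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy now favors the chosen response more than the reference does.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)

print(at_start, improved)
```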

trl grpo - Group Relative Policy Optimization

Train models using reward functions or an LLM-as-a-judge to evaluate generations and assign rewards.
Basic usage:
bash
trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/gsm8k \
  --reward_funcs accuracy_reward \
  --output_dir Qwen2-0.5B-GRPO \
  --push_to_hub
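The "group relative" part of GRPO can be sketched in a few lines: several completions are sampled for one prompt, each is scored by the reward function, and each completion's advantage is its reward measured against the group's own statistics. The sketch below assumes mean/standard-deviation normalization with a population standard deviation; implementations differ on these details, so treat it as an illustration rather than TRL's exact computation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} -> advantage={a:+.3f}")
```

When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason to check the reward function when training stalls.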

trl rloo - Reinforce Leave One Out

Online RL training where the model generates text and receives rewards based on custom criteria.
Basic usage:
bash
trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
  --output_dir Qwen2-0.5B-RLOO \
  --push_to_hub
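The "leave one out" in the name refers to the baseline: with several completions sampled per prompt, each completion's reward is compared against the average reward of the other completions. A minimal sketch of that baseline follows; the reward values are hypothetical and this is not TRL's code.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample = its reward minus the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for three completions of the same prompt.
print(rloo_advantages([1.0, 2.0, 3.0]))
```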

trl reward - Reward Model Training

Train a reward model to score text quality for RLHF.
Full training:
bash
trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-5 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048
Train with LoRA adapters:
bash
trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward-LoRA \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-4 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048 \
  --use_peft \
  --lora_task_type SEQ_CLS \
  --lora_r 32 \
  --lora_alpha 16
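Reward models trained on chosen/rejected pairs are commonly fit with a pairwise (Bradley-Terry style) loss: the scalar score of the chosen response should exceed the score of the rejected one. The sketch below assumes that formulation and uses made-up scores; it is not taken from TRL's source.

```python
import math


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) == log(1 + exp(-m)) == max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


correct_order = pairwise_reward_loss(chosen_score=2.0, rejected_score=0.0)
wrong_order = pairwise_reward_loss(chosen_score=0.0, rejected_score=2.0)
tie = pairwise_reward_loss(chosen_score=1.0, rejected_score=1.0)

print(correct_order, tie, wrong_order)
```

The loss is smallest when the chosen response is scored well above the rejected one, and a tie gives log(2).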

Configuration Files

TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
Launch with config:
bash
trl sft --config sft_config.yaml
Override config values:
bash
trl sft --config sft_config.yaml --learning_rate 1.0e-5
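The override behavior above amounts to a simple precedence rule: values given on the command line win over values from the config file. The sketch below only illustrates that rule with plain dictionaries; `resolve_args` is a hypothetical helper, not part of TRL, and TRL's actual argument parsing is more involved.

```python
def resolve_args(config_values: dict, cli_overrides: dict) -> dict:
    """Merge two sources of settings; command-line values take precedence."""
    return {**config_values, **cli_overrides}


# A few of the values from sft_config.yaml above.
config_values = {
    "model_name_or_path": "Qwen/Qwen2.5-0.5B",
    "learning_rate": 2.0e-5,
    "num_train_epochs": 1,
}

# Equivalent of passing --learning_rate 1.0e-5 on the command line.
cli_overrides = {"learning_rate": 1.0e-5}

resolved = resolve_args(config_values, cli_overrides)
print(resolved)
```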

Distributed Training

TRL integrates with Accelerate for multi-GPU and multi-node training.
Multi-GPU training:
bash
trl sft \
  --config sft_config.yaml \
  --num_processes 4
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
bash
trl sft \
  --config sft_config.yaml \
  --accelerate_config zero2
Custom Accelerate config:
bash
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
bash
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
bash
trl sft --config sft_config.yaml --accelerate_config zero3

Troubleshooting

CUDA Out of Memory

  • Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps
  • Enable --use_peft for LoRA training
  • Use --gradient_checkpointing to save memory
  • Try a smaller model or truncate sequences to a shorter maximum length
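The first suggestion works because the number of samples per optimizer step is the product of per-device batch size, gradient accumulation steps, and the number of processes. A small sketch of that arithmetic, using a hypothetical helper to pick accumulation steps that keep the effective batch size unchanged:

```python
def accumulation_steps_for(target_effective_batch: int,
                           per_device_batch: int,
                           num_processes: int = 1) -> int:
    """Accumulation steps needed to reach the target samples per optimizer step."""
    per_forward_pass = per_device_batch * num_processes
    if target_effective_batch % per_forward_pass != 0:
        raise ValueError("target is not a multiple of per_device_batch * num_processes")
    return target_effective_batch // per_forward_pass


# The SFT example uses batch size 2 with 8 accumulation steps: 2 * 8 = 16 samples per step.
print(accumulation_steps_for(16, per_device_batch=2))

# Halving the per-device batch to fit in memory: double the accumulation steps.
print(accumulation_steps_for(16, per_device_batch=1))

# Same target spread over 4 processes.
print(accumulation_steps_for(16, per_device_batch=1, num_processes=4))
```

Peak activation memory scales with the per-device batch size, while accumulation steps only add wall-clock time, so this trade keeps the optimization behavior roughly the same at the cost of slower steps.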

Dataset Loading Issues

  • Verify dataset exists: check Hugging Face Hub or local path
  • Check dataset format matches expected columns
  • Use --dataset_config for multi-config datasets
  • Inspect dataset: from datasets import load_dataset; ds = load_dataset(name)
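Once a dataset is loaded, checking its columns against what the training method expects can be automated. The sketch below is a hypothetical helper, and the column sets in `EXPECTED_COLUMNS` are assumptions based on commonly used layouts (conversational `messages`, plain `text`, `prompt`/`completion`, and `chosen`/`rejected` preference pairs); confirm them against the TRL dataset format documentation for your version.

```python
# Assumed column layouts per training style; verify against the TRL docs.
EXPECTED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "preference": [{"chosen", "rejected"}],
}


def matches_format(columns: list[str], kind: str) -> bool:
    """True if the columns contain at least one accepted layout for this kind."""
    present = set(columns)
    return any(required <= present for required in EXPECTED_COLUMNS[kind])


# Hypothetical column lists, as you might read them from a loaded split.
print(matches_format(["prompt", "chosen", "rejected"], "preference"))
print(matches_format(["text"], "preference"))
print(matches_format(["messages"], "sft"))
```

In practice the column list would come from a loaded split, for example `ds["train"].column_names`.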

Model Loading Issues

  • Verify model exists on Hugging Face Hub
  • Check if gated model requires authentication: hf auth login
  • For local models, provide absolute path
  • Ensure sufficient disk space and memory

Slow Training

  • Enable dataset --packing for short sequences
  • Use larger --per_device_train_batch_size if memory allows
  • Enable --tf32 for faster computation on Ampere GPUs
  • Use --bf16 on supported hardware
  • Consider multi-GPU training with --num_processes

Generation Issues (GRPO/RLOO)

  • Check prompt format in dataset
  • Adjust --temperature and --top_p for generation
  • Verify the reward function (for GRPO/RLOO)

Additional Resources

Best Practices

  1. Start with SFT: Always fine-tune base models with SFT before preference alignment
  2. Use LoRA for efficiency: Enable --use_peft for faster training and lower memory
  3. Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
  4. Save checkpoints: TRL automatically saves checkpoints in --output_dir
  5. Test on small datasets first: Verify pipeline works before full training
  6. Use configuration files: Create YAML configs for reproducibility
  7. Leverage Accelerate: Use multi-GPU training for faster iteration
When helping users with TRL:
  • Always check which training method is appropriate for their use case
  • Verify dataset format matches the expected schema
  • Recommend starting with smaller models for testing
  • Suggest LoRA for resource-constrained environments
  • Point to specific documentation sections for advanced features