trl-training
TRL Training Skill
You are an expert at using the TRL (Transformer Reinforcement Learning) library to train and fine-tune large language models.
Overview
TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:
- SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
- DPO (Direct Preference Optimization): Align models using preference data
- GRPO (Group Relative Policy Optimization): Sample a group of completions per prompt and optimize the policy using each completion's reward relative to the rest of its group
- RLOO (Reinforce Leave One Out): Online RL training with generation-based rewards
- Reward Model Training: Train reward models for RLHF
TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.
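The group-relative idea behind GRPO in the list above can be sketched in a few lines. This is a minimal illustration, not TRL's implementation: the function name, the `eps` value, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of completions sampled for the same prompt.

    Each completion is scored relative to its group: above-average
    completions get a positive advantage, below-average ones a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt; the first and last were judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the advantages are centered on the group mean, a prompt where every completion gets the same reward contributes no learning signal.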
Core Commands
trl sft - Supervised Fine-Tuning
Fine-tune language models on instruction-following or conversational datasets.
Full training:
bash
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
Train with LoRA adapters:
bash
trl sft \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/Capybara \
--learning_rate 2.0e-4 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--eos_token '<|im_end|>' \
--eval_strategy steps \
--eval_steps 100 \
--use_peft \
--lora_r 32 \
--lora_alpha 16 \
--output_dir Qwen2-0.5B-SFT \
--push_to_hub
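To make the `--lora_r 32 --lora_alpha 16` choice above more concrete, the sketch below works out the standard LoRA scaling factor (`lora_alpha / lora_r`) and the number of trainable parameters an adapter adds to one weight matrix. The 896 x 896 projection size is a hypothetical value chosen for illustration, and the helper names are not part of TRL or PEFT.

```python
def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in, d_out, lora_r):
    """Parameters in A (lora_r x d_in) plus B (d_out x lora_r)."""
    return lora_r * (d_in + d_out)


scale = lora_scaling(lora_alpha=16, lora_r=32)

# Hypothetical square projection layer, for illustration only.
d_in = d_out = 896
full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, lora_r=32)

print(f"scaling factor: {scale}")
print(f"full matrix: {full_params} params, adapter: {adapter_params} params")
print(f"adapter / full: {adapter_params / full_params:.3f}")
```

The much smaller trainable parameter count is one reason LoRA recipes often use a higher learning rate than full fine-tuning, as the two commands above do (2.0e-4 versus 2.0e-5).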
trl dpo - Direct Preference Optimization
Align models using preference data (chosen/rejected pairs).
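For intuition about what a chosen/rejected pair contributes, here is a minimal sketch of the DPO loss for a single pair, computed from sequence log-probabilities under the policy and a frozen reference model. The function name and the `beta=0.1` value are illustrative assumptions; the trainer computes this in batches on tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference expressed yet.
print(dpo_loss(-5.0, -9.0, -5.0, -9.0))
# Policy has raised the chosen and lowered the rejected log-probability.
print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

The loss falls as the policy raises the chosen response's log-probability relative to the reference faster than it raises the rejected one.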
Full training:
bash
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-7 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns
Train with LoRA adapters:
bash
trl dpo \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--max_steps 1000 \
--gradient_accumulation_steps 8 \
--eval_strategy steps \
--eval_steps 50 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16
trl grpo - Group Relative Policy Optimization
Train models online: the policy generates completions, and reward functions or an LLM-as-a-judge score them to provide rewards.
Basic usage:
bash
trl grpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/gsm8k \
--reward_funcs accuracy_reward \
--output_dir Qwen2-0.5B-GRPO \
--push_to_hub
trl rloo - Reinforce Leave One Out
Online RL training where the model generates text and receives rewards based on custom criteria.
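The "leave one out" part of the name refers to the baseline: each sampled completion is compared against the average reward of the other completions for the same prompt. The sketch below illustrates that calculation only; the function name is hypothetical and this is not TRL's code.

```python
def rloo_advantages(rewards):
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four completions sampled for one prompt, with their rewards.
adv = rloo_advantages([1.0, 0.0, 0.5, 0.5])
print(adv)
```

A completion whose reward equals the average of its peers gets an advantage of zero, so only completions that stand out from the group move the policy.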
Basic usage:
bash
trl rloo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/tldr \
--reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
--output_dir Qwen2-0.5B-RLOO \
--push_to_hub
trl reward - Reward Model Training
Train a reward model to score text quality for RLHF.
Full training:
bash
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-5 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048
Train with LoRA adapters:
bash
trl reward \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--learning_rate 1.0e-4 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_task_type SEQ_CLS \
--lora_r 32 \
--lora_alpha 16
Configuration Files
TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.
Example config (sft_config.yaml):
yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio
Launch with config:
bash
trl sft --config sft_config.yaml
Override config values:
bash
trl sft --config sft_config.yaml --learning_rate 1.0e-5
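The precedence shown above (values from the YAML file, then command-line flags on top) can be pictured as a simple dictionary merge. This is only an illustration of the resulting values, with a hypothetical helper name; it says nothing about how TRL's argument parser is implemented.

```python
def resolve_args(config_values, cli_overrides):
    """Start from the config file values, then let CLI flags win."""
    merged = dict(config_values)
    merged.update(cli_overrides)
    return merged


# A subset of the values from sft_config.yaml.
config_values = {
    "model_name_or_path": "Qwen/Qwen2.5-0.5B",
    "learning_rate": 2.0e-5,
    "num_train_epochs": 1,
}
# Equivalent of passing --learning_rate 1.0e-5 on the command line.
resolved = resolve_args(config_values, {"learning_rate": 1.0e-5})
print(resolved)
```

Keeping the shared settings in the file and overriding one value per run makes it easier to see what changed between experiments.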
Distributed Training
TRL integrates with Accelerate for multi-GPU and multi-node training.
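The zero1, zero2, and zero3 profiles named in this section correspond to DeepSpeed ZeRO stages, which shard progressively more training state across GPUs. The estimate below is a rough sketch under stated assumptions: mixed-precision Adam at about 16 bytes per parameter (2 for weights, 2 for gradients, 12 for optimizer state), a hypothetical 0.5e9-parameter model, and no accounting for activations or buffers. The function name is illustrative.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam state."""
    params = 2 * num_params   # half-precision weights
    grads = 2 * num_params    # half-precision gradients
    optim = 12 * num_params   # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus     # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also shards the weights
    return (params + grads + optim) / 1e9


for stage in (0, 1, 2, 3):
    gb = zero_model_state_gb(num_params=0.5e9, num_gpus=4, stage=stage)
    print(f"ZeRO stage {stage}: ~{gb:.2f} GB per GPU")
```

Higher stages save more memory at the cost of more communication, so a lower stage is usually a reasonable first choice when the model already fits.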
Multi-GPU training:
bash
trl sft \
--config sft_config.yaml \
--num_processes 4
Use predefined Accelerate configs:
TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3
bash
trl sft \
--config sft_config.yaml \
--accelerate_config zero2
Custom Accelerate config:
bash
# Generate custom config
accelerate config
# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml
Fully Sharded Data Parallel (FSDP):
bash
trl sft --config sft_config.yaml --accelerate_config fsdp2
DeepSpeed ZeRO:
bash
trl sft --config sft_config.yaml --accelerate_config zero3
Troubleshooting
CUDA Out of Memory
- Reduce --per_device_train_batch_size and increase --gradient_accumulation_steps
- Enable --use_peft for LoRA training
- Use --gradient_checkpointing to save memory
- Try a smaller model or truncate sequences to a shorter maximum length
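Trading batch size for gradient accumulation keeps the effective batch size unchanged, so the optimization behaves similarly while peak memory drops. The arithmetic can be checked with a small helper; the function name is hypothetical.

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_processes=1):
    """Number of examples contributing to each optimizer step."""
    return (per_device_train_batch_size
            * gradient_accumulation_steps
            * num_processes)


# Settings from the SFT example: batch size 2, accumulation 8, one process.
before = effective_batch_size(2, 8)
# Halve the per-device batch and double the accumulation to save memory.
after = effective_batch_size(1, 16)
print(before, after)
```

The same formula shows why adding GPUs with --num_processes raises the effective batch size unless accumulation is reduced to compensate.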
Dataset Loading Issues
- Verify dataset exists: check Hugging Face Hub or local path
- Check dataset format matches expected columns
- Use --dataset_config for multi-config datasets
- Inspect dataset: from datasets import load_dataset; ds = load_dataset(name)
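A column check like the one suggested above can be automated before launching a run. The sketch below is an assumption-laden illustration: the required-column table reflects the dataset types TRL's documentation describes for each method as best understood here, and should be verified against the docs for your installed version. The table and function names are hypothetical.

```python
# Assumed column sets per training method; verify against the TRL docs.
REQUIRED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"chosen", "rejected"}],
    "reward": [{"chosen", "rejected"}],
    "grpo": [{"prompt"}],
    "rloo": [{"prompt"}],
}


def check_columns(column_names, method):
    """Return True if the columns satisfy one accepted layout for the method."""
    columns = set(column_names)
    return any(required <= columns for required in REQUIRED_COLUMNS[method])


print(check_columns(["prompt", "chosen", "rejected"], "dpo"))
print(check_columns(["text"], "dpo"))
```

With the `datasets` library, the list to pass in is available as `column_names` on a loaded split, for example `load_dataset(name, split="train").column_names`.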
Model Loading Issues
- Verify model exists on Hugging Face Hub
- Check if gated model requires authentication: hf auth login
- For local models, provide absolute path
- Ensure sufficient disk space and memory
Slow Training
- Enable dataset --packing for short sequences
- Use larger --per_device_train_batch_size if memory allows
- Enable --tf32 for faster computation on Ampere or newer GPUs
- Use --bf16 on supported hardware
- Consider multi-GPU training with --num_processes
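To see why packing helps with short sequences, the sketch below groups sequence lengths into rows of at most `max_length` tokens using a simple first-fit-decreasing strategy. It is a simplified illustration of the idea, not TRL's packing implementation, and the function name and example lengths are made up.

```python
def pack_lengths(lengths, max_length):
    """Greedily group sequence lengths into rows of at most max_length tokens."""
    rows = []
    for n in sorted(lengths, reverse=True):
        if n > max_length:
            raise ValueError(f"sequence of {n} tokens exceeds max_length")
        for row in rows:
            if sum(row) + n <= max_length:
                row.append(n)  # fits in an existing row
                break
        else:
            rows.append([n])   # no row had room; start a new one
    return rows


lengths = [300, 120, 500, 80, 200, 60]
rows = pack_lengths(lengths, max_length=512)
print(rows)
print(f"{len(lengths)} padded rows become {len(rows)} packed rows")
```

Fewer, fuller rows mean less compute spent on padding tokens, which is where the speedup comes from.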
Generation Issues (GRPO/RLOO)
- Check prompt format in dataset
- Adjust --temperature and --top_p for generation
- Verify the reward function (for GRPO/RLOO)
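One way to verify a reward function is to run it on a few hand-written completions before training. The example below assumes the `(completions, **kwargs) -> list[float]` shape that TRL's GRPO documentation describes for custom reward functions on plain-text completions; confirm this against your TRL version. The reward itself (checking for a `\boxed{...}` answer) is a made-up example.

```python
import re


def boxed_answer_reward(completions, **kwargs):
    """Return 1.0 when a completion contains a \\boxed{...} answer, else 0.0."""
    pattern = re.compile(r"\\boxed\{[^{}]+\}")
    return [1.0 if pattern.search(text) else 0.0 for text in completions]


samples = [
    "The answer is \\boxed{42}.",
    "I think it is 42.",
]
scores = boxed_answer_reward(samples)
print(scores)
```

Checking that the function returns one float per completion, and that good and bad examples get different scores, catches most reward bugs before any GPU time is spent.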
Additional Resources
- Documentation: https://huggingface.co/docs/trl
- GitHub: https://github.com/huggingface/trl
- Examples: https://github.com/huggingface/trl/tree/main/examples
Best Practices
- Start with SFT: Always fine-tune base models with SFT before preference alignment
- Use LoRA for efficiency: Enable --use_peft for faster training and lower memory
- Monitor training: Use --report_to trackio (or --report_to wandb or --report_to tensorboard) for tracking
- Save checkpoints: TRL automatically saves checkpoints in --output_dir
- Test on small datasets first: Verify pipeline works before full training
- Use configuration files: Create YAML configs for reproducibility
- Leverage Accelerate: Use multi-GPU training for faster iteration
When helping users with TRL:
- Always check which training method is appropriate for their use case
- Verify dataset format matches the expected schema
- Recommend starting with smaller models for testing
- Suggest LoRA for resource-constrained environments
- Point to specific documentation sections for advanced features