trl-training

Train and fine-tune transformer language models using TRL (Transformers Reinforcement Learning). Supports SFT, DPO, GRPO, KTO, RLOO and Reward Model training via CLI commands.

15 installs

Source: huggingface/skills

Added on 2026-06-10

NPX Install

npx skill4agent add huggingface/skills trl-training

SKILL.md Content

TRL Training Skill

You are an expert at using the TRL (Transformers Reinforcement Learning) library to train and fine-tune large language models.

Overview

TRL provides CLI commands for post-training foundation models using state-of-the-art techniques:

SFT (Supervised Fine-Tuning): Fine-tune models on instruction-following or conversational datasets
DPO (Direct Preference Optimization): Align models using preference data
GRPO (Group Relative Policy Optimization): Train models by sampling a group of outputs per prompt and optimizing each one according to its reward relative to the rest of the group
RLOO (Reinforce Leave One Out): Online RL training with generation-based rewards
Reward Model Training: Train reward models for RLHF

TRL is built on top of Hugging Face Transformers and Accelerate, providing seamless integration with the Hugging Face ecosystem.

Core Commands

trl sft - Supervised Fine-Tuning

Fine-tune language models on instruction-following or conversational datasets.

Full training:

bash

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-5 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub

Train with LoRA adapters:

bash

trl sft \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --dataset_name trl-lib/Capybara \
  --learning_rate 2.0e-4 \
  --num_train_epochs 1 \
  --packing \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --eos_token '<|im_end|>' \
  --eval_strategy steps \
  --eval_steps 100 \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16 \
  --output_dir Qwen2-0.5B-SFT \
  --push_to_hub
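
The LoRA run above sets `--lora_r 32` and `--lora_alpha 16`. As a rough sketch of why adapters are cheaper than full training, LoRA learns two low-rank factors per adapted weight matrix instead of the full matrix, and scales their product by `lora_alpha / r`. The helper names and the 1024 x 1024 layer size below are illustrative assumptions, not values read from the model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return lora_alpha / r


d_in = d_out = 1024  # hypothetical layer size, for illustration only
full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, r=32)

print(f"full matrix: {full_params} params")
print(f"LoRA factors: {adapter_params} params ({adapter_params / full_params:.2%})")
print(f"update scaling: {lora_scaling(16, 32)}")
```

With these flags the update is scaled by 0.5; raising `--lora_alpha` relative to `--lora_r` strengthens the adapter's contribution.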

trl dpo - Direct Preference Optimization

Align models using preference data (chosen/rejected pairs).

Full training:

bash

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-7 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns

Train with LoRA adapters:

bash

trl dpo \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --learning_rate 5.0e-6 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --max_steps 1000 \
  --gradient_accumulation_steps 8 \
  --eval_strategy steps \
  --eval_steps 50 \
  --output_dir Qwen2-0.5B-DPO \
  --no_remove_unused_columns \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 16
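
The objective behind preference training on chosen/rejected pairs can be sketched for a single pair. This is a minimal sketch of the sigmoid DPO loss as commonly described, not TRL's implementation; the function name, the log-probability values, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one chosen/rejected pair of sequence log-probs."""
    # How much more (in log space) the policy likes each answer than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted toward the chosen answer: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only falls when the policy raises the chosen answer relative to the rejected one by more than the reference model does.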

trl grpo - Group Relative Policy Optimization

Train models using reward functions or LLM-as-a-judge for evaluating generations and providing rewards.

Basic usage:

bash

trl grpo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/gsm8k \
  --reward_funcs accuracy_reward \
  --output_dir Qwen2-0.5B-GRPO \
  --push_to_hub
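
The "group relative" part of GRPO can be sketched as follows: several completions are sampled for one prompt, each gets a reward, and each reward is normalized against the group. This is a simplified illustration, not TRL's code; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 when the answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Completions that beat the group average get a positive advantage and the others a negative one; a group where every reward is identical yields no learning signal.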

trl rloo - Reinforce Leave One Out

Online RL training where the model generates text and receives rewards based on custom criteria.

Basic usage:

bash

trl rloo \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/tldr \
  --reward_model_name_or_path sentiment-analysis:nlptown/bert-base-multilingual-uncased-sentiment \
  --output_dir Qwen2-0.5B-RLOO \
  --push_to_hub
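
The "leave one out" idea can be sketched in a few lines: each sampled completion is compared against the average reward of the other completions for the same prompt. This is an illustrative sketch rather than TRL's implementation, and the function name is an assumption.

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean of the group with sample r left out.
    return [r - (total - r) / (k - 1) for r in rewards]


print(rloo_advantages([1.0, 2.0, 3.0]))  # [-1.5, 0.0, 1.5]
```

Because a sample's own reward is excluded from its baseline, the baseline does not depend on the sample being scored.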

trl reward - Reward Model Training

Train a reward model to score text quality for RLHF.

Full training:

bash

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-5 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048

Train with LoRA adapters:

bash

trl reward \
  --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
  --dataset_name trl-lib/ultrafeedback_binarized \
  --output_dir Qwen2-0.5B-Reward-LoRA \
  --per_device_train_batch_size 8 \
  --num_train_epochs 1 \
  --learning_rate 1.0e-4 \
  --eval_strategy steps \
  --eval_steps 50 \
  --max_length 2048 \
  --use_peft \
  --lora_task_type SEQ_CLS \
  --lora_r 32 \
  --lora_alpha 16
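
Reward models trained on chosen/rejected pairs are typically fit with a pairwise (Bradley-Terry style) loss that pushes the chosen score above the rejected one. The sketch below shows that loss for one pair; it is an assumption about the general technique, not a copy of TRL's trainer, and the function name and scores are made up.

```python
import math


def pairwise_reward_loss(chosen_score: float, rejected_score: float) -> float:
    """-log(sigmoid(chosen - rejected)) for one preference pair."""
    return math.log1p(math.exp(-(chosen_score - rejected_score)))


print(pairwise_reward_loss(0.0, 0.0))   # no preference learned yet
print(pairwise_reward_loss(2.0, -1.0))  # chosen scored clearly higher
```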

Configuration Files

TRL supports YAML configuration files for reproducible training. All CLI arguments can be specified in a config file.

Example config (sft_config.yaml):

yaml

model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/Capybara
learning_rate: 2.0e-5
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
output_dir: ./sft_output
use_peft: true
lora_r: 16
lora_alpha: 16
report_to: trackio

Launch with config:

bash

trl sft --config sft_config.yaml

Override config values:

bash

trl sft --config sft_config.yaml --learning_rate 1.0e-5
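
The override behaviour can be pictured as a simple merge in which command-line values win over values from the file. The sketch below only illustrates that precedence; `resolve_args` is a hypothetical helper and not how TRL parses arguments internally.

```python
def resolve_args(config: dict, cli_overrides: dict) -> dict:
    """Return the config with command-line values taking precedence."""
    resolved = dict(config)  # copy so the file values stay untouched
    resolved.update(cli_overrides)
    return resolved


config = {
    "model_name_or_path": "Qwen/Qwen2.5-0.5B",
    "learning_rate": 2.0e-5,
    "num_train_epochs": 1,
}
resolved = resolve_args(config, {"learning_rate": 1.0e-5})
print(resolved["learning_rate"])     # value from the command line
print(resolved["num_train_epochs"])  # value kept from the file
```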

Distributed Training

TRL integrates with Accelerate for multi-GPU and multi-node training.

Multi-GPU training:

bash

trl sft \
  --config sft_config.yaml \
  --num_processes 4

Use predefined Accelerate configs:

TRL provides predefined configs: single_gpu, multi_gpu, fsdp1, fsdp2, zero1, zero2, zero3

bash

trl sft \
  --config sft_config.yaml \
  --accelerate_config zero2

Custom Accelerate config:

bash

# Generate custom config
accelerate config

# Use custom config
trl sft --config sft_config.yaml --config_file ~/.cache/huggingface/accelerate/default_config.yaml

Fully Sharded Data Parallel (FSDP):

bash

trl sft --config sft_config.yaml --accelerate_config fsdp2

DeepSpeed ZeRO:

bash

trl sft --config sft_config.yaml --accelerate_config zero3

Troubleshooting

CUDA Out of Memory

Reduce `--per_device_train_batch_size` and increase `--gradient_accumulation_steps`
Enable `--use_peft` for LoRA training
Use `--gradient_checkpointing` to save memory
Try a smaller model or truncate sequences to a shorter maximum length
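
Trading batch size for gradient accumulation works because the number of examples per optimizer step stays the same while fewer are held in memory at once. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int = 1) -> int:
    """Examples contributing to one optimizer step across all processes."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_processes


# Settings from the SFT example: batch size 2, accumulation 8, one process.
print(effective_batch_size(2, 8))   # 16
# Halving the batch size and doubling accumulation keeps the same total.
print(effective_batch_size(1, 16))  # 16
```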

Dataset Loading Issues

Verify dataset exists: check Hugging Face Hub or local path
Check dataset format matches expected columns
Use `--dataset_config` for multi-config datasets

Inspect dataset:

from datasets import load_dataset; ds = load_dataset("trl-lib/Capybara"); print(ds)
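
Once the column names are known, a quick check against the columns a training method expects can catch format problems before a run starts. The sketch below is a rough guide only: the column sets are assumptions based on common TRL dataset layouts (the `chosen`/`rejected` pair comes from the DPO description above), and `matches_expected_format` is a hypothetical helper.

```python
# Assumed column layouts; consult the TRL dataset format docs for the full list.
EXPECTED_COLUMNS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"chosen", "rejected"}],
    "reward": [{"chosen", "rejected"}],
}


def matches_expected_format(method: str, column_names: list[str]) -> bool:
    """True if the columns contain at least one accepted layout for the method."""
    columns = set(column_names)
    return any(required <= columns for required in EXPECTED_COLUMNS[method])


print(matches_expected_format("dpo", ["prompt", "chosen", "rejected"]))  # True
print(matches_expected_format("dpo", ["text"]))                          # False
```

In practice the column names would come from the loaded dataset, for example `ds["train"].column_names`.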

Model Loading Issues

Verify model exists on Hugging Face Hub
Check whether a gated model requires authentication: `hf auth login`
For local models, provide absolute path
Ensure sufficient disk space and memory

Slow Training

Enable dataset packing with `--packing` for short sequences
Use a larger `--per_device_train_batch_size` if memory allows
Enable `--tf32` for faster computation on Ampere or newer GPUs
Use `--bf16` on supported hardware
Consider multi-GPU training with `--num_processes`
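
Packing helps with short sequences because it fills each training row with real tokens instead of padding. The sketch below uses a naive concatenate-and-chunk strategy purely for illustration; TRL's actual packing strategy may differ, and the function name and token IDs are made up.

```python
def pack_sequences(sequences: list[list[int]], max_length: int) -> list[list[int]]:
    """Concatenate token sequences and cut the stream into fixed-size rows."""
    flat = [token for seq in sequences for token in seq]
    return [flat[i:i + max_length] for i in range(0, len(flat), max_length)]


sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(sequences, max_length=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]

# Without packing, three rows padded to length 4 hold 9 real tokens in 12 slots.
padded_slots = len(sequences) * 4
real_tokens = sum(len(seq) for seq in sequences)
print(real_tokens, padded_slots)
```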

Generation Issues (GRPO/RLOO)

Check prompt format in dataset
Adjust `--temperature` and `--top_p` for generation
Verify the reward function (for GRPO/RLOO)

Additional Resources

Documentation: https://huggingface.co/docs/trl
GitHub: https://github.com/huggingface/trl
Examples: https://github.com/huggingface/trl/tree/main/examples

Best Practices

Start with SFT: Always fine-tune base models with SFT before preference alignment
Use LoRA for efficiency: Enable `--use_peft` for faster training and lower memory
Monitor training: Use `--report_to trackio` (or `--report_to wandb`, `--report_to tensorboard`) for tracking
Save checkpoints: TRL automatically saves checkpoints in `--output_dir`
Test on small datasets first: Verify pipeline works before full training
Use configuration files: Create YAML configs for reproducibility
Leverage Accelerate: Use multi-GPU training for faster iteration

When helping users with TRL:

Always check which training method is appropriate for their use case
Verify dataset format matches the expected schema
Recommend starting with smaller models for testing
Suggest LoRA for resource-constrained environments
Point to specific documentation sections for advanced features
