openrlhf-training
OpenRLHF - High-Performance RLHF Training
Quick start
OpenRLHF is a Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration.
Installation:

```bash
# Launch Docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Uninstall conflicting packages
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y

# Install OpenRLHF with vLLM support
pip install openrlhf[vllm]
```
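A quick sanity check after installation can save a failed Ray job later. The snippet below is a minimal sketch, not part of the official docs; it only assumes the package is importable and that PyTorch can see the GPUs inside the container.

```python
# sanity_check.py - minimal post-install check (illustrative, not from the OpenRLHF docs)
import importlib.metadata

import torch

# Confirm the package resolved (reads the installed version from pip metadata)
print("openrlhf version:", importlib.metadata.version("openrlhf"))

# Confirm CUDA is visible inside the container
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```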
**PPO Training** (Hybrid Engine):
```bash
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf"}' \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--critic_num_nodes 1 --critic_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--vllm_gpu_memory_utilization 0.5 \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path ./output/llama3-8b-rlhf \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
--zero_stage 3 --bf16 \
--actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
--init_kl_coef 0.01 --normalize_reward \
--gradient_checkpointing --packing_samples \
--vllm_enable_sleep --deepspeed_enable_sleep
```
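To reason about whether these batch settings fit, it helps to work out the implied gradient-accumulation steps. The sketch below assumes the common convention that --train_batch_size and --rollout_batch_size are global while the micro sizes are per GPU; this is an assumption for illustration, so check the OpenRLHF docs for the exact semantics.

```python
# Rough batch-size arithmetic for the 8-GPU example above (assumed semantics, not authoritative)
actor_gpus = 8
train_batch_size = 128          # global training batch per PPO update
micro_train_batch_size = 8      # per-GPU training micro batch
rollout_batch_size = 1024       # prompts generated per rollout phase
micro_rollout_batch_size = 16   # per-GPU generation micro batch

grad_accum_steps = train_batch_size // (micro_train_batch_size * actor_gpus)
rollout_steps = rollout_batch_size // (micro_rollout_batch_size * actor_gpus)

print(f"gradient accumulation steps: {grad_accum_steps}")  # 2
print(f"rollout micro-steps per GPU: {rollout_steps}")     # 8
```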
**GRPO Training** (Group Relative Policy Optimization):
Same command as PPO, but add:

```bash
--advantage_estimator group_norm
```

Common workflows
Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
Step 1: Train the reward model (DPO-style chosen/rejected preference data):

```bash
deepspeed --module openrlhf.cli.train_rm \
--save_path ./output/llama3-8b-rm \
--save_steps -1 --logging_steps 1 \
--eval_steps -1 --train_batch_size 256 \
--micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
--bf16 --max_epochs 1 --max_len 8192 \
--zero_stage 3 --learning_rate 9e-6 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template --chosen_key chosen \
--rejected_key rejected --flash_attn --gradient_checkpointing
```
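For context, reward-model training on chosen/rejected pairs typically minimizes a Bradley-Terry negative log-likelihood, pushing the score of the chosen response above the rejected one. A minimal PyTorch sketch of that loss (illustrative only, not OpenRLHF's actual code path):

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scalar rewards for a batch of 4 preference pairs
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.1, 0.0])
print(pairwise_rm_loss(chosen, rejected))
```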
Step 2: PPO training:

```bash
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--critic_num_nodes 1 --critic_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain ./output/llama3-8b-rm \
--save_path ./output/llama3-8b-ppo \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
--zero_stage 3 --bf16 \
--actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
--init_kl_coef 0.01 --normalize_reward \
--vllm_enable_sleep --deepspeed_enable_sleep
```
Workflow 2: GRPO training (no critic model needed)
Memory-efficient alternative to PPO:
```bash
ray job submit --address="http://127.0.0.1:8265" \
-- python3 -m openrlhf.cli.train_ppo_ray \
--advantage_estimator group_norm \
--ref_num_nodes 1 --ref_num_gpus_per_node 8 \
--reward_num_nodes 1 --reward_num_gpus_per_node 8 \
--actor_num_nodes 1 --actor_num_gpus_per_node 8 \
--vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
--colocate_all_models \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path ./output/llama3-8b-grpo \
--micro_train_batch_size 8 --train_batch_size 128 \
--micro_rollout_batch_size 16 --rollout_batch_size 1024 \
--max_epochs 1 --bf16 \
--actor_learning_rate 5e-7 \
--init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
--normalize_reward --no_advantage_std_norm
```

Key GRPO parameters:
- --advantage_estimator group_norm - Enables GRPO
- --use_kl_loss - KL loss from the GRPO paper
- --kl_estimator k3 - KL loss estimator (k2 ≈ k1)
- --no_advantage_std_norm - Disables advantage std normalization
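To make the group_norm estimator concrete: GRPO draws a group of responses per prompt, scores each with the reward model, and uses the group-standardized reward (only group-centered when --no_advantage_std_norm is set) as the advantage, adding an explicit KL loss against the reference policy instead of training a critic. The sketch below is illustrative only, with assumed tensor shapes; it is not OpenRLHF's implementation.

```python
import torch

def group_norm_advantages(rewards: torch.Tensor, use_std_norm: bool = True) -> torch.Tensor:
    """rewards: [num_prompts, group_size] scalar rewards for each sampled response.

    Advantage = reward centered (and optionally scaled) within its prompt group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    adv = rewards - mean
    if use_std_norm:  # disabled by --no_advantage_std_norm
        adv = adv / (rewards.std(dim=1, keepdim=True) + 1e-8)
    return adv

def k3_kl(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """k3 estimator of KL(policy || ref): r - 1 - log r, with r = ref/policy probability ratio."""
    log_ratio = logp_ref - logp_policy
    return log_ratio.exp() - 1.0 - log_ratio

# Toy example: 2 prompts, 4 sampled responses each
rewards = torch.tensor([[1.0, 0.2, 0.5, 0.9],
                        [0.1, 0.1, 0.8, 0.3]])
print(group_norm_advantages(rewards))
```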
Workflow 3: DPO training (preference optimization)
Simpler alternative without reward model:
```bash
deepspeed --module openrlhf.cli.train_dpo \
--save_path ./output/llama3-8b-dpo \
--save_steps -1 --logging_steps 1 \
--eval_steps -1 --train_batch_size 256 \
--micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
--bf16 --max_epochs 1 --max_len 8192 \
--zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template --chosen_key chosen \
--rejected_key rejected --flash_attn --gradient_checkpointing
```
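The underlying objective: DPO directly increases the policy's log-probability margin on the chosen response over the rejected one, relative to a frozen reference model, with --beta controlling how hard it pushes. A minimal sketch of the standard DPO loss (illustrative; not OpenRLHF's actual code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy example with sequence-level log-probs for a batch of 2 preference pairs
print(dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
               torch.tensor([-13.0, -9.8]), torch.tensor([-13.5, -9.2])))
```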
When to use vs alternatives
Use OpenRLHF when:
- Training large models (7B-70B+) with RL
- Need vLLM inference acceleration
- Want distributed architecture with Ray
- Have multi-node GPU cluster
- Need PPO/GRPO/RLOO/DPO in one framework
Algorithm selection:
- PPO: Maximum control, best for complex rewards
- GRPO: Memory-efficient, no critic needed
- RLOO: Modified PPO with per-token KL
- REINFORCE++: More stable than GRPO, faster than PPO
- DPO: Simplest, no reward model needed
Use alternatives instead:
- TRL: Single-node training, simpler API
- veRL: ByteDance's framework for 671B models
- DeepSpeedChat: Integrated with DeepSpeed ecosystem
Common issues
**Issue: GPU OOM with large models**

Disable model colocation:

```bash
# Remove the --colocate_all_models flag
# Allocate separate GPUs for each model
--actor_num_gpus_per_node 8 \
--critic_num_gpus_per_node 8 \
--reward_num_gpus_per_node 8 \
--ref_num_gpus_per_node 8
```
**Issue: DeepSpeed GPU index out of range**

Set environment variable:

```bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
```

**Issue: Training instability**
Use Hybrid Engine instead of async:

```bash
--colocate_all_models \
--vllm_enable_sleep \
--deepspeed_enable_sleep
```

Adjust KL coefficient:

```bash
--init_kl_coef 0.05  # Increase from 0.01
```
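Why raising --init_kl_coef helps: in PPO-style RLHF the per-token reward is typically the reward-model score minus a KL penalty against the reference policy, so a larger coefficient pulls a drifting policy back toward the SFT model. A rough sketch of that shaping under a common convention (not OpenRLHF's exact code):

```python
import torch

def shaped_rewards(rm_score: torch.Tensor, logp_policy: torch.Tensor,
                   logp_ref: torch.Tensor, kl_coef: float = 0.05) -> torch.Tensor:
    """Per-token reward = -kl_coef * (logp_policy - logp_ref); RM score added at the final token."""
    per_token_kl = logp_policy - logp_ref  # k1-style per-token KL estimate
    rewards = -kl_coef * per_token_kl
    rewards[..., -1] += rm_score           # sequence-level reward placed on the last generated token
    return rewards

# Toy example: one response of 5 generated tokens
print(shaped_rewards(torch.tensor(0.8),
                     torch.tensor([-2.1, -0.5, -1.0, -0.3, -0.7]),
                     torch.tensor([-2.0, -0.6, -1.2, -0.4, -0.6])))
```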
**Issue: Slow generation during PPO**

Enable vLLM acceleration:

```bash
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--vllm_gpu_memory_utilization 0.5
```

Advanced topics
Hybrid Engine GPU sharing: See references/hybrid-engine.md for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.
Algorithm comparison: See references/algorithm-comparison.md for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.
Multi-node setup: See references/multi-node-training.md for Ray cluster configuration and fault tolerance.
Custom reward functions: See references/custom-rewards.md for reinforced fine-tuning and agent RLHF.
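As a flavor of what a custom reward for reinforced fine-tuning might look like, here is a purely hypothetical rule-based scorer for math-style prompts. The function name, signature, and how it plugs into OpenRLHF are assumptions made for illustration; follow references/custom-rewards.md for the real interface.

```python
import re

import torch

def rule_based_reward(prompts: list[str], responses: list[str],
                      answers: list[str]) -> torch.Tensor:
    """Hypothetical verifier-style reward: 1.0 if the boxed answer matches, small bonus for showing work."""
    scores = []
    for response, answer in zip(responses, answers):
        match = re.search(r"\\boxed\{([^}]*)\}", response)
        correct = match is not None and match.group(1).strip() == answer.strip()
        has_steps = len(response.split("\n")) > 2
        scores.append(1.0 * correct + 0.1 * has_steps)
    return torch.tensor(scores)

# Toy usage with a single prompt/response/reference-answer triple
print(rule_based_reward(["What is 2+3?"],
                        ["2+3 = 5\nSo the answer is \\boxed{5}"],
                        ["5"]))
```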
Hardware requirements
- GPU: NVIDIA A100/H100 recommended
- VRAM:
  - 7B model: 8× A100 40GB (Hybrid Engine)
  - 70B model: 48× A100 80GB (vLLM:Actor:Critic = 1:1:1)
- Multi-node: Ray cluster with InfiniBand recommended
- Docker: NVIDIA PyTorch container 25.02+
Performance:
- 2× faster than DeepSpeedChat
- vLLM inference acceleration
- Hybrid Engine minimizes GPU idle time
Resources
- Docs: https://github.com/OpenRLHF/OpenRLHF
- Paper: https://arxiv.org/abs/2405.11143
- Examples: https://github.com/OpenRLHF/OpenRLHF/tree/main/examples
- Discord: Community support