verl-rl-training


verl: Volcano Engine Reinforcement Learning for LLMs


verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves o1-level performance on math benchmarks.

When to Use verl


Choose verl when you need:
  • Production-ready RL training at scale (tested up to 671B parameters)
  • Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
  • Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
  • Multi-turn rollout with tool calling for agentic workflows
  • Vision-language model RL training
Consider alternatives when:
  • You need Megatron-native training → use slime or miles
  • You want PyTorch-native abstractions with Monarch → use torchforge
  • You only need simple SFT/DPO → use TRL or Axolotl

Key Features


  • Training backends: FSDP, FSDP2, Megatron-LM
  • Rollout engines: vLLM, SGLang, HuggingFace Transformers
  • Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
  • Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
  • Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

Installation



Option 1: pip install


pip install "verl[vllm]"  # or "verl[sglang]" for the SGLang backend

Option 2: Docker (recommended for production)


docker pull verlai/verl:vllm011.latest

Option 3: From source


git clone https://github.com/volcengine/verl.git && cd verl && pip install -e ".[vllm,math]"

Quick Start: GRPO Training


bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8

Core Architecture


verl uses a HybridFlow programming model separating control flow from computation:
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)        │
│ ├── CriticWorker (value estimation, PPO only)          │
│ └── RewardManager (model-based or rule-based rewards)  │
└─────────────────────────────────────────────────────────┘

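The orchestration above can be sketched as a plain-Python control loop. This is a conceptual sketch only: in verl the controller dispatches each step to a Ray worker group, and the function names below are illustrative, not verl's actual API.

```python
# Conceptual sketch of the HybridFlow control loop. The single-process
# controller owns the algorithm logic; in verl each step below is
# dispatched to a distributed worker group via Ray.

def rl_training_step(prompts, rollout_worker, reward_manager, trainer_worker):
    # 1. Rollout: generate responses for each prompt with the current policy
    responses = rollout_worker(prompts)
    # 2. Reward: score responses (rule-based or model-based)
    rewards = reward_manager(prompts, responses)
    # 3. Train: update the policy from (prompt, response, reward) triples;
    #    the trainer then syncs updated weights back to the rollout engine
    metrics = trainer_worker(prompts, responses, rewards)
    return metrics

# Toy stand-ins for the worker groups (illustrative only)
step_metrics = rl_training_step(
    prompts=["What is 15 + 27?"],
    rollout_worker=lambda ps: [p + " -> 42" for p in ps],
    reward_manager=lambda ps, rs: [1.0 for _ in rs],
    trainer_worker=lambda ps, rs, ws: {"mean_reward": sum(ws) / len(ws)},
)
print(step_metrics)  # {'mean_reward': 1.0}
```

Because the control flow lives in one process, swapping the advantage estimator or reward source changes only the controller logic, not the distributed workers.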

Workflow 1: Math Reasoning with GRPO


Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

Prerequisites Checklist


  • GPU cluster with 8+ GPUs (H100 recommended)
  • Dataset in parquet format with prompt and reward_model columns
  • Base model from HuggingFace Hub

Step 1: Prepare Dataset


python
import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
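Before writing the parquet file, it can help to sanity-check the record schema. This is a minimal hand-rolled check, not a verl utility; the exact required columns can vary with the reward manager you configure.

```python
def validate_record(record):
    # prompt must be a non-empty chat-style list of {"role", "content"} messages
    assert isinstance(record["prompt"], list) and record["prompt"]
    for msg in record["prompt"]:
        assert {"role", "content"} <= set(msg)
    # reward_model must carry the ground truth for rule-based scoring
    assert "ground_truth" in record["reward_model"]

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"},
    },
]
for record in data:
    validate_record(record)
print(f"{len(data)} records OK")
```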

Step 2: Define Reward Function



reward_function.py


import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the final \boxed{...} answer from the response.
        # Note the escaped backslash: r'\boxed' would be a word-boundary
        # assertion, not a literal backslash, and would never match.
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

# e.g. compute_reward([r"... so the answer is \boxed{42}"], ["42"]) -> [1.0]

Step 3: Create Training Config



config/grpo_math.yaml


algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100

Step 4: Launch Training


bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b

Step 5: Monitor and Validate


  • Check WandB/TensorBoard for loss curves
  • Verify reward is increasing over steps
  • Run evaluation on held-out test set


Workflow 2: PPO with Critic Model


Use this workflow when you need value-based advantage estimation (GAE).

Key Differences from GRPO


  • Requires separate critic model
  • Uses Generalized Advantage Estimation (GAE)
  • Better for tasks with dense rewards
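For reference, GAE can be sketched in a few lines of plain Python. This is the standard textbook formulation; verl computes the same quantity over batched token-level tensors.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward at step t; values[t] is the critic's
    estimate V(s_t); last_value bootstraps V(s_T) at the horizon.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# With gamma=1 and lam=1 the advantage reduces to (return - value)
adv = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
print(adv)  # [0.5, 0.5, 0.5]
```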

Configuration


yaml
algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
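The clip_ratio above enters PPO's clipped surrogate objective. A scalar-per-token sketch in plain Python (real implementations vectorize this over token tensors):

```python
import math

def ppo_clip_objective(logp, old_logp, advantage, clip_ratio=0.2):
    # Importance ratio between the new and the behavior (old) policy
    ratio = math.exp(logp - old_logp)
    # Clamp the ratio into [1 - clip_ratio, 1 + clip_ratio]
    clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
    # Pessimistic (min) of the unclipped and clipped surrogate; the
    # training loss is the negative of this quantity
    return min(ratio * advantage, clipped * advantage)

# A ratio of 1.5 with positive advantage is clipped at 1 + 0.2
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))  # 1.2
```

The clipping removes the incentive to push the policy more than clip_ratio away from the rollout policy in a single update.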

Launch with Critic


bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8


Workflow 3: Large-Scale Training with Megatron


Use this workflow for models >70B parameters or when you need expert parallelism.

Prerequisites


  • Install the Megatron-LM bridge: pip install mbridge
  • Convert the model to Megatron format
  • Multi-node cluster with NVLink/InfiniBand

Configuration for 70B+ Models


yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8

Launch Multi-Node



On head node


ray start --head --port=6379

On worker nodes


ray start --address='head_ip:6379'

Launch training


python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8

---

Configuration Reference


Algorithm Selection


| Algorithm   | adv_estimator       | Use Case                        |
|-------------|---------------------|---------------------------------|
| GRPO        | grpo                | Critic-free, math/reasoning     |
| PPO/GAE     | gae                 | Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus | Variance reduction              |
| RLOO        | rloo                | Leave-one-out baseline          |
| ReMax       | remax               | Maximum reward baseline         |
| OPO         | opo                 | Optimal policy optimization     |
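For the critic-free estimators, the group-relative idea behind GRPO is simple enough to show inline. This is a sketch of the standard formulation: each of the n rollouts of a prompt is scored against the statistics of its own group, so no value model is needed.

```python
def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its prompt's rollout group (no critic needed)."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# n=8 rollouts of one prompt: correct answers stand out from the group
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
print([round(a, 2) for a in advs])
```

This is why actor_rollout_ref.rollout.n matters for GRPO: with n=1 every group is degenerate and all advantages collapse to zero.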

Key Parameters



Rollout parameters


actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

Training parameters


actor_rollout_ref.actor.lr: 1e-6                 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2          # PPO clip range

KL control


actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1  # For adaptive KL control
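use_kl_loss adds a per-token penalty that keeps the policy near the reference model. A plain-Python sketch of the commonly used low-variance "k3" KL estimator (which exact estimator applies depends on the configured KL penalty type, so treat this as illustrative):

```python
import math

def k3_kl(logp, ref_logp):
    """Low-variance per-token estimate of KL(pi || pi_ref):
    k3 = exp(ref - logp) - (ref - logp) - 1, which is always >= 0."""
    log_ratio = ref_logp - logp
    return math.exp(log_ratio) - log_ratio - 1.0

def kl_loss(logps, ref_logps, kl_loss_coef=0.001):
    # Mean per-token penalty, scaled by the configured coefficient
    per_token = [k3_kl(lp, rlp) for lp, rlp in zip(logps, ref_logps)]
    return kl_loss_coef * sum(per_token) / len(per_token)

# Identical policy and reference log-probs => zero penalty
print(kl_loss([-1.0, -2.0], [-1.0, -2.0]))  # 0.0
```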

---

Common Issues and Solutions


Issue: OOM During Rollout


Symptoms: CUDA out of memory during generation phase
Solutions:

Reduce batch size


actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

Enable gradient checkpointing


actor_rollout_ref.model.enable_gradient_checkpointing: true

Use FSDP2 with CPU offloading


actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true

Issue: Training Instability


Symptoms: Loss spikes, reward collapse
Solutions:

Reduce learning rate


actor_rollout_ref.actor.lr: 5e-7

Increase KL penalty


actor_rollout_ref.actor.kl_loss_coef: 0.01

Enable gradient clipping


actor_rollout_ref.actor.max_grad_norm: 1.0

Issue: Slow Weight Sync


Symptoms: Long pauses between rollout and training
Solutions:

Use FSDP2 for faster resharding


actor_rollout_ref.actor.strategy=fsdp2

Enable async weight transfer


trainer.async_weight_update=true

Issue: vLLM Version Mismatch


Symptoms: Import errors or generation failures
Solution: Use compatible versions:
bash
pip install "vllm>=0.8.5,<=0.12.0"

Avoid vLLM 0.7.x (known bugs)



---


Advanced Topics


Multi-Turn Tool Calling


See references/multi-turn.md for agentic workflows with tool use.

Vision-Language Models


yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true

LoRA Training


yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]

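The rank r controls the adapter size. As a rough illustration of why LoRA RL is cheap, the following counts trainable parameters for the config above, assuming Qwen2.5-7B-style shapes (hidden size 3584, GQA value projections of width 512, 28 layers). These dimensions are illustrative assumptions; check your model's config.

```python
def lora_param_count(in_dim, out_dim, r):
    # A LoRA adapter factors the weight update as B @ A,
    # where A is (r, in_dim) and B is (out_dim, r)
    return r * in_dim + out_dim * r

# Assumed Qwen2.5-7B-style shapes (illustrative): 28 layers,
# q_proj 3584 -> 3584, v_proj 3584 -> 512 (grouped-query attention)
r = 16
per_layer = lora_param_count(3584, 3584, r) + lora_param_count(3584, 512, r)
total = 28 * per_layer
print(f"trainable LoRA params: {total:,}")  # ~5M, versus billions of base weights
```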

Resources
