torchforge-rl-training

torchforge: PyTorch-Native Agentic RL Library

torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.

When to Use torchforge


Choose torchforge when you need:
  • Clean separation between RL algorithms and infrastructure
  • PyTorch-native abstractions (no Ray dependency)
  • Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
  • Scalable training with Monarch actor system
  • Integration with TorchTitan for model parallelism
Consider alternatives when:
  • You need production-ready stability → use miles or verl
  • You want Megatron-native training → use slime
  • torchforge is experimental and APIs may change

Key Features


  • Algorithm isolation: Implement RL algorithms without touching infrastructure
  • Scalability: From single GPU to thousands via Monarch
  • Modern stack: TorchTitan (training), vLLM (inference), TorchStore (sync)
  • Loss functions: GRPO, DAPO, CISPO, GSPO, SAPO built-in

Architecture Overview


┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                           │
│ - Define reward models, loss functions, sampling        │
└─────────────────────┬───────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                         │
│ - Episode, Group dataclasses                            │
│ - Service interfaces (async/await)                      │
└─────────────────────┬───────────────────────────────────┘
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                          │
│ ├── Trainer (TorchTitan FSDP)                           │
│ ├── Generator (vLLM inference)                          │
│ ├── Reference Model (frozen KL baseline)                │
│ └── Reward Actors (compute rewards)                     │
└─────────────────────────────────────────────────────────┘

Installation


```bash
# Create environment
conda create -n forge python=3.12
conda activate forge

# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh

# Verify
python -c "import torch, forge, vllm; print('OK')"
```

ROCm Installation

```bash
./scripts/install_rocm.sh
```

Quick Start


SFT Training (2+ GPUs)

```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```

GRPO Training (3+ GPUs)

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

Workflow 1: GRPO Training for Math Reasoning


Use this workflow for training reasoning models with group-relative advantages.
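
Group-relative advantages are what make GRPO value-function-free: each sampled response is scored against the other responses drawn for the same prompt. A minimal sketch of the standard normalization (torchforge's exact implementation may differ):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of responses to the same prompt.

    rewards: shape (n_samples,), one scalar reward per sampled response.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses to one prompt, two correct (1.0) and two wrong (0.0)
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# Correct responses get a positive advantage, wrong ones a negative one
```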

Prerequisites Checklist


  • 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)
  • Model from HuggingFace Hub
  • Training dataset (GSM8K, MATH, etc.)

Step 1: Create Configuration


```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8      # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1         # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```

Step 2: Define Reward Function


```python
# rewards.py
# Built-in reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward

import re

# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the answer from a \boxed{...} expression in the response
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
```

Step 3: Launch Training


```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```

Step 4: Monitor Progress


  • Check W&B dashboard for loss curves
  • Verify entropy is decreasing (policy becoming more deterministic)
  • Monitor KL divergence (should stay bounded)

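The entropy the dashboard reports can be approximated directly from the sampled-token log-probabilities the generator returns: H ≈ -E[log p(sampled token)]. A sketch of that Monte Carlo estimate (tensor shapes and the padding-mask convention are assumptions):

```python
import torch

def estimate_entropy(logprobs: torch.Tensor, padding_mask: torch.Tensor) -> float:
    """Monte Carlo entropy estimate over non-padding tokens."""
    return (-(logprobs * padding_mask).sum() / padding_mask.sum()).item()

# A near-deterministic policy has sampled logprobs close to 0, so entropy -> 0
logprobs = torch.tensor([-0.05, -0.01, -0.02, 0.0])
mask = torch.ones(4)
entropy = estimate_entropy(logprobs, mask)  # ~0.02
```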

Workflow 2: Custom Loss Function


Use this workflow to implement new RL algorithms.

Step 1: Create Loss Class


```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range,
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty
        kl = ref_logprobs - logprobs

        # Apply mask and aggregate
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()

        return loss
```

Step 2: Integrate into Application


```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```

---

Workflow 3: Multi-GPU Distributed Training


Use this workflow for scaling to multiple GPUs or nodes.

Configuration for Distributed


```yaml
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2     # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2      # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
```

Launch with SLURM


```bash
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
```
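
The contents of `run_grpo.sh` are not shown here; a hypothetical minimal version might look like the following (the SBATCH directives and config path are placeholders you would adapt to your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=grpo-training
#SBATCH --time=24:00:00

# Activate the environment created during installation
conda activate forge

# Launch the GRPO app; Monarch handles placement across the allocated nodes
python -m apps.grpo.main --config config/distributed.yaml
```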

Launch Locally (Multi-GPU)


```bash
# 8 GPU setup
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```

---

Core API Reference


Training Batch Format


torchforge uses dictionary-based batches for training:

```python
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]

# targets: list of dicts with training signals
targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor,
}]

# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
```

Completion


Generated output from vLLM:

```python
from dataclasses import dataclass

@dataclass
class Completion:
    text: str              # Generated text
    token_ids: list[int]   # Token IDs
    logprobs: list[float]  # Log probabilities
    metadata: dict         # Custom metadata
```

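
The per-token `logprobs` make sequence-level statistics cheap to derive. A small illustration with made-up values (plain Python, not a forge API):

```python
import math

# Hypothetical per-token log-probabilities from one Completion
logprobs = [-0.1, -0.3, -0.2]

# Sequence log-probability is the sum of per-token logprobs
seq_logprob = sum(logprobs)                          # -0.6
# Per-token perplexity of the generation
perplexity = math.exp(-seq_logprob / len(logprobs))  # exp(0.2) ~ 1.22
```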

Built-in Loss Functions


Loss Functions


Loss functions are in the `forge.losses` module:

```python
from forge.losses import SimpleGRPOLoss, ReinforceLoss

# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)

# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```

ReinforceLoss

```python
from forge.losses.reinforce_loss import ReinforceLoss

# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
```

---

Common Issues and Solutions


Issue: Not Enough GPUs


Symptoms: "Insufficient GPU resources" error

Solutions:

```yaml
# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true
```

Remove ref_model (uses generator weights), or run the reference model on CPU:

```yaml
ref_model:
  with_gpus: false
```

Issue: OOM During Generation


Symptoms: CUDA OOM in vLLM

Solutions:

```yaml
# Reduce batch size
grpo:
  n_samples: 4   # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048
```

Issue: Slow Weight Sync


Symptoms: Long pauses between training and generation

Solutions:

```bash
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
```

Or reduce sync frequency:

```yaml
training:
  sync_interval: 10   # Sync every 10 steps
```

Issue: Policy Collapse


Symptoms: Entropy drops to zero, reward stops improving

Solutions:

```yaml
# Increase KL penalty
grpo:
  beta: 0.2   # Increase from 0.1

# Or add entropy bonus
training:
  entropy_coef: 0.01
```

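An entropy bonus works by subtracting an extra term from the policy loss, so the optimizer is rewarded for staying stochastic. A minimal sketch of the idea (how torchforge wires `entropy_coef` in may differ):

```python
import torch

def loss_with_entropy_bonus(pg_loss: torch.Tensor,
                            logprobs: torch.Tensor,
                            padding_mask: torch.Tensor,
                            entropy_coef: float = 0.01) -> torch.Tensor:
    """Subtract an entropy bonus from the policy-gradient loss."""
    # Monte Carlo entropy estimate from sampled-token logprobs
    entropy = -(logprobs * padding_mask).sum() / padding_mask.sum()
    return pg_loss - entropy_coef * entropy
```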
---

Resources
