Understanding RLHF
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences. Rather than relying solely on next-token prediction, RLHF uses human judgment to guide model behavior toward helpful, harmless, and honest outputs.
Core Concepts
Why RLHF?
Pretraining produces models that predict likely text, not necessarily good text. A model trained on internet data learns to complete text in ways that reflect its training distribution—including toxic, unhelpful, or dishonest patterns. RLHF addresses this gap by optimizing for human preferences rather than likelihood.
The core insight: humans can often recognize good outputs more easily than they can specify what makes an output good. RLHF exploits this by collecting human judgments and using them to shape model behavior.
The Alignment Problem
Language models face several alignment challenges:
- Helpfulness: Following instructions and providing useful information
- Harmlessness: Avoiding toxic, dangerous, or inappropriate outputs
- Honesty: Acknowledging uncertainty and avoiding fabrication
- Intent alignment: Understanding what users actually want, not just what they say
RLHF provides a framework for encoding these properties through preference data.
Key Components
- Preference data: Human judgments comparing model outputs
- Reward model: A learned function approximating human preferences
- Policy optimization: RL algorithms that maximize expected reward
- Regularization: Constraints preventing deviation from the base model
The RLHF Pipeline
The standard RLHF pipeline consists of three main stages:
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pretrained language model and fine-tune it on high-quality demonstrations. This teaches the model the desired format and style of responses.
Input: Pretrained model + demonstration dataset
Output: SFT model that can follow instructions
Stage 2: Reward Model Training
Train a model to predict human preferences between pairs of outputs. The reward model learns to score outputs in a way that correlates with human judgment.
Input: SFT model + preference dataset (chosen/rejected pairs)
Output: Reward model that scores any output
Stage 3: Policy Optimization
Use reinforcement learning to optimize the SFT model against the reward model, while staying close to the original SFT distribution.
Input: SFT model + reward model
Output: Final aligned model
Alternative: Direct Alignment
Direct alignment algorithms (DPO, IPO, KTO) skip the reward model entirely, optimizing directly from preference data. This simplifies the pipeline but trades off some flexibility.
Preference Data
Preference data encodes human judgment about model outputs. The most common format is pairwise comparisons.
Pairwise Preferences
Given a prompt, collect two or more model outputs and have humans indicate which is better:
Prompt: "Explain quantum entanglement"
Response A: [technical explanation]
Response B: [simpler explanation with analogy]
Human preference: B > A
This creates (prompt, chosen, rejected) tuples for training.
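A preference pair can be represented as a simple record; a minimal sketch in Python (the field names and example text are illustrative, not tied to any particular dataset format):

```python
# Minimal sketch of a pairwise preference record.
# Field names and example strings are illustrative only.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response humans preferred
    rejected: str  # the response humans dispreferred

pair = PreferencePair(
    prompt="Explain quantum entanglement",
    chosen="Imagine two coins that always land on matching sides...",
    rejected="Entanglement is a nonclassical correlation of joint quantum states...",
)
print(pair.prompt)
```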
Collection Methods
Human annotation: Trained annotators compare outputs according to guidelines. Most reliable but expensive and slow.
AI feedback: Use a capable model to generate preferences. Faster and cheaper but may propagate biases. This is the basis for Constitutional AI (CAI) and RLAIF.
Implicit signals: User interactions like upvotes, regeneration requests, or conversation length. Noisy but abundant.
Data Quality Considerations
- Annotator agreement: Low agreement suggests ambiguous criteria or subjective preferences
- Distribution coverage: Preferences should cover the range of model behaviors
- Prompt diversity: Diverse prompts prevent overfitting to narrow scenarios
- Preference strength: Some comparisons are clear; others are nearly ties
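Annotator agreement can be checked with a simple raw agreement rate; a sketch with made-up labels (production pipelines often prefer chance-corrected measures such as Cohen's kappa):

```python
# Sketch: raw inter-annotator agreement on pairwise preference labels.
# Each label is "A" or "B", the response that annotator preferred.
def agreement_rate(labels_1, labels_2):
    """Fraction of items on which two annotators gave the same label."""
    assert len(labels_1) == len(labels_2)
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)

ann1 = ["A", "B", "B", "A", "B"]  # illustrative labels, not real data
ann2 = ["A", "B", "A", "A", "B"]
print(agreement_rate(ann1, ann2))  # 4 of 5 labels match -> 0.8
```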
Instruction Tuning
Instruction tuning (supervised fine-tuning on instruction-response pairs) serves as the foundation for RLHF.
Purpose
- Teaches the model to follow instructions rather than just complete text
- Establishes the format and style for responses
- Creates a starting point that already exhibits desired behaviors
- Provides the reference policy for KL regularization
Dataset Composition
Typical instruction tuning datasets include:
- Single-turn QA: Questions with direct answers
- Multi-turn dialogue: Conversational exchanges
- Task instructions: Specific tasks with examples
- Chain-of-thought: Reasoning traces for complex problems
Relationship to RLHF
The SFT model defines the "prior" that RLHF refines. A better SFT model means:
- The reward model has better starting outputs to compare
- Policy optimization has less work to do
- The KL penalty keeps the final model closer to this baseline
Reward Modeling
The reward model transforms pairwise preferences into a scalar signal for RL optimization.
The Bradley-Terry Model
Preferences are modeled using the Bradley-Terry framework:
P(A > B) = sigmoid(r(A) - r(B))
Where r(x) is the reward for output x. This assumes preferences depend only on the difference in rewards.
The loss function is:
L = -log(sigmoid(r(chosen) - r(rejected)))
This pushes the reward model to assign higher scores to chosen outputs.
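A self-contained sketch of the Bradley-Terry probability and loss, using plain scalar rewards (in practice r(·) is the reward model's output for a full (prompt, response) pair):

```python
import math

def bt_probability(r_a, r_b):
    """Bradley-Terry probability that output A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_probability(r_chosen, r_rejected))

# Equal rewards: the model is indifferent, P = 0.5 and loss = log 2.
print(bt_probability(1.0, 1.0))         # 0.5
print(round(bt_loss(1.0, 1.0), 4))      # 0.6931
# A larger margin r(chosen) - r(rejected) drives the loss toward zero.
print(bt_loss(3.0, 0.0) < bt_loss(1.0, 0.0))  # True
```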
Architecture
Reward models are typically:
- The SFT model with a scalar head instead of the language modeling head
- Trained on (prompt, chosen, rejected) tuples
- Designed to output a single scalar reward for any (prompt, response) pair
Considerations
- Scaling: Larger reward models generally produce better signals
- Calibration: Absolute reward values are less important than rankings
- Generalization: The model must score outputs it hasn't seen during training
- Over-optimization: Policies can exploit reward model weaknesses
See reference/reward-modeling.md for detailed training procedures.
Policy Optimization
Policy optimization uses RL to maximize expected reward while staying close to the reference policy.
The RLHF Objective
maximize E[R(x, y)] - β * KL(π || π_ref)
Where:
- R(x, y) is the reward for response y to prompt x
- KL(π || π_ref) measures deviation from the reference policy
- β controls the strength of the regularization
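In practice the KL term is typically estimated from per-token log-probabilities and folded into the reward signal; a sketch under that assumption (the log-prob values below are made up for illustration):

```python
# Sketch: KL-penalized reward, with the sequence-level KL estimated
# as the sum of per-token log-probability differences.
# All numeric values here are illustrative, not from a real model.
def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta):
    """R(x, y) - beta * sum_t (log pi(y_t) - log pi_ref(y_t))."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate

lp_policy = [-0.5, -1.0, -0.2]   # token log-probs under the current policy
lp_ref    = [-0.7, -1.1, -0.9]   # token log-probs under the reference (SFT) policy
print(round(kl_penalized_reward(2.0, lp_policy, lp_ref, beta=0.1), 4))  # 1.9
```

With beta = 0 the objective reduces to pure reward maximization; raising beta pulls the effective reward down whenever the policy drifts from the reference.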
PPO (Proximal Policy Optimization)
PPO is the most common algorithm for RLHF:
- Sample responses from the current policy
- Score responses with the reward model
- Compute advantage estimates
- Update policy with clipped surrogate objective
The clipping prevents large policy updates that could destabilize training.
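The clipped surrogate in the last step can be sketched for a single action (eps = 0.2 is a common default, not a fixed constant; advantage estimates come from the previous step):

```python
# Sketch of PPO's clipped surrogate objective for one action/token.
# ratio = pi_new(a|s) / pi_old(a|s), advantage from the advantage estimator.
def ppo_clip_objective(ratio, advantage, eps=0.2):
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    # Take the pessimistic (lower) value of the two candidates.
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is clipped at 1 + eps,
# capping how much a single update can benefit from moving the policy.
print(ppo_clip_objective(1.5, advantage=1.0))   # 1.2
print(ppo_clip_objective(1.05, advantage=1.0))  # 1.05
```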
KL Regularization
The KL penalty serves multiple purposes:
- Prevents reward hacking: Stops the policy from finding adversarial inputs to the reward model
- Maintains capabilities: Keeps the model close to the pretrained distribution
- Stabilizes training: Limits how far the policy can move per update
Higher β means more conservative updates; lower β allows more aggressive optimization.
REINFORCE vs PPO
REINFORCE is simpler but has higher variance:
- Uses raw returns without value function baseline
- Single-sample gradient estimates
- Can work for simpler problems
PPO adds complexity but improves stability:
- Clipped surrogate objective
- Multiple epochs per batch
- Better sample efficiency
See reference/policy-optimization.md for algorithm details.
Direct Alignment Algorithms
Direct alignment methods optimize the RLHF objective without training a separate reward model.
DPO (Direct Preference Optimization)
DPO reparameterizes the RLHF objective to derive a closed-form loss:
L = -log sigmoid(β * (log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))
Where y_w is the preferred response and y_l is the dispreferred response.
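A scalar sketch of this loss computed from sequence-level log-probabilities (the numeric values are made up; real implementations batch this over tensors and use numerically stable log-sigmoid):

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    # Implicit reward margin: beta * difference of log-ratios vs the reference.
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Illustrative values: the policy already favors the chosen response more
# than the reference does, so the loss falls below log 2 (the indifferent case).
loss = dpo_loss(logp_w=-5.0, logp_ref_w=-6.0, logp_l=-7.0, logp_ref_l=-6.5, beta=0.1)
print(loss < math.log(2))  # True
```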
Advantages:
- No separate reward model training
- Simpler pipeline with fewer hyperparameters
- More stable than RL-based methods
Trade-offs:
- Less flexible than explicit reward models
- Cannot reuse reward model for other purposes
- May be more sensitive to data quality
IPO (Identity Preference Optimization)
IPO addresses potential overfitting in DPO by using a different loss formulation that doesn't assume the Bradley-Terry model perfectly describes preferences.
KTO (Kahneman-Tversky Optimization)
KTO works with binary feedback (good/bad) rather than pairwise comparisons, making data collection easier. It's based on prospect theory from behavioral economics.
When to Use Direct Alignment
Direct alignment is preferred when:
- Simplicity is important
- Computational resources are limited
- The reward model won't be reused
Reward-based RLHF is preferred when:
- You need the reward model for other purposes (filtering, ranking)
- You have a large preference dataset
- You want maximum flexibility in optimization
See reference/direct-alignment.md for detailed algorithm comparisons.
Challenges
Over-Optimization
As optimization proceeds, the policy may exploit weaknesses in the reward model rather than improving on the true objective. Symptoms include:
- Rising reward model scores but declining human evaluation
- Increasingly verbose or formulaic outputs
- Sycophantic behavior (agreeing with users regardless of correctness)
Mitigations:
- Stronger KL regularization
- Reward model ensembles
- Early stopping based on held-out evaluation
Reward Hacking
The policy finds inputs that score highly with the reward model but don't represent genuine improvement:
- Length exploitation (longer responses score higher)
- Style mimicry (copying patterns from high-reward examples)
- Adversarial outputs that confuse the reward model
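Length exploitation in particular is easy to screen for: correlate reward model scores with response length over a sample of generations. A sketch with illustrative data (a strong positive correlation alone doesn't prove hacking, but it is a red flag worth investigating):

```python
# Sketch: diagnosing length exploitation by correlating reward with length.
# All data below is illustrative.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

lengths = [120, 340, 560, 810, 990]   # response lengths in tokens
rewards = [0.1, 0.4, 0.5, 0.8, 0.9]   # reward model scores
print(pearson(lengths, rewards) > 0.9)  # True: reward tracks length closely
```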
Evaluation
Evaluating aligned models is difficult because:
- Human preferences are subjective and context-dependent
- Automated metrics don't capture alignment properties well
- A/B testing is expensive and slow
- Models may perform differently on evaluation vs deployment
Distribution Shift
The preference data comes from a specific distribution of prompts and responses. The deployed model will encounter different inputs, and the reward model may not generalize well.
Best Practices
- Start with a strong SFT model: RLHF refines behavior; it works best when the base model already exhibits desired patterns
- Invest in preference data quality: Garbage in, garbage out—clear guidelines and trained annotators matter
- Use KL regularization: Don't optimize reward too aggressively; the reward model is an imperfect proxy
- Monitor for reward hacking: Track human evaluations alongside reward model scores
- Consider direct alignment first: DPO is simpler and often performs comparably to PPO
- Iterate on reward model: Improve the reward model as you discover its weaknesses
- Diverse prompts: Ensure preference data covers the distribution you care about
- Regularize appropriately: Higher β for safety-critical applications; lower β for capability-focused training
References
Reference Files
- reference/reward-modeling.md - Detailed reward model training procedures
- reference/policy-optimization.md - PPO and policy gradient algorithms for RLHF
- reference/direct-alignment.md - DPO, IPO, KTO and other direct methods
External Resources
- RLHF Book by Nathan Lambert - Comprehensive textbook on RLHF
- Training language models to follow instructions with human feedback - InstructGPT paper
- Direct Preference Optimization - DPO paper