Understanding RLHF
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences. Rather than relying solely on next-token prediction, RLHF uses human judgment to guide model behavior toward helpful, harmless, and honest outputs.
Core Concepts
Why RLHF?
Pretraining produces models that predict likely text, not necessarily good text. A model trained on internet data learns to complete text in ways that reflect its training distribution—including toxic, unhelpful, or dishonest patterns. RLHF addresses this gap by optimizing for human preferences rather than likelihood.
The core insight: humans can often recognize good outputs more easily than they can specify what makes an output good. RLHF exploits this by collecting human judgments and using them to shape model behavior.
The Alignment Problem
Language models face several alignment challenges:
- Helpfulness: Following instructions and providing useful information
- Harmlessness: Avoiding toxic, dangerous, or inappropriate outputs
- Honesty: Acknowledging uncertainty and avoiding fabrication
- Intent alignment: Understanding what users actually want, not just what they say
RLHF provides a framework for encoding these properties through preference data.
Key Components
- Preference data: Human judgments comparing model outputs
- Reward model: A learned function approximating human preferences
- Policy optimization: RL algorithms that maximize expected reward
- Regularization: Constraints preventing deviation from the base model
The RLHF Pipeline
The standard RLHF pipeline consists of three main stages:
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pretrained language model and fine-tune it on high-quality demonstrations. This teaches the model the desired format and style of responses.
Input: Pretrained model + demonstration dataset
Output: SFT model that can follow instructions
Stage 2: Reward Model Training
Train a model to predict human preferences between pairs of outputs. The reward model learns to score outputs in a way that correlates with human judgment.
Input: SFT model + preference dataset (chosen/rejected pairs)
Output: Reward model that scores any output
Stage 3: Policy Optimization
Use reinforcement learning to optimize the SFT model against the reward model, while staying close to the original SFT distribution.
Input: SFT model + reward model
Output: Final aligned model
Alternative: Direct Alignment
Direct alignment algorithms (DPO, IPO, KTO) skip the reward model entirely, optimizing directly from preference data. This simplifies the pipeline but trades off some flexibility.
Preference Data
Preference data encodes human judgment about model outputs. The most common format is pairwise comparisons.
Pairwise Preferences
Given a prompt, collect two or more model outputs and have humans indicate which is better:
Prompt: "Explain quantum entanglement"
Response A: [technical explanation]
Response B: [simpler explanation with analogy]
Human preference: B > A
This creates (prompt, chosen, rejected) tuples for training.
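A preference pair can be represented as a simple record; a minimal sketch in Python (the field names and example text are illustrative, not tied to any particular dataset format):

```python
# Minimal sketch of a pairwise preference record.
# Field names and example strings are illustrative only.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response humans preferred
    rejected: str  # the response humans dispreferred

pair = PreferencePair(
    prompt="Explain quantum entanglement",
    chosen="Imagine two coins that always land on matching sides...",
    rejected="Entanglement is a nonclassical correlation of joint quantum states...",
)
print(pair.prompt)
```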
Collection Methods
Human annotation: Trained annotators compare outputs according to guidelines. Most reliable but expensive and slow.
AI feedback: Use a capable model to generate preferences. Faster and cheaper but may propagate biases. This is the basis for Constitutional AI (CAI) and RLAIF.
Implicit signals: User interactions like upvotes, regeneration requests, or conversation length. Noisy but abundant.
Data Quality Considerations
- Annotator agreement: Low agreement suggests ambiguous criteria or subjective preferences
- Distribution coverage: Preferences should cover the range of model behaviors
- Prompt diversity: Diverse prompts prevent overfitting to narrow scenarios
- Preference strength: Some comparisons are clear; others are nearly ties
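Annotator agreement can be checked with a simple raw agreement rate; a sketch with made-up labels (production pipelines often prefer chance-corrected measures such as Cohen's kappa):

```python
# Sketch: raw inter-annotator agreement on pairwise preference labels.
# Each label is "A" or "B", the response that annotator preferred.
def agreement_rate(labels_1, labels_2):
    """Fraction of items on which two annotators gave the same label."""
    assert len(labels_1) == len(labels_2)
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)

ann1 = ["A", "B", "B", "A", "B"]  # illustrative labels, not real data
ann2 = ["A", "B", "A", "A", "B"]
print(agreement_rate(ann1, ann2))  # 4 of 5 labels match -> 0.8
```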
Instruction Tuning
Instruction tuning (supervised fine-tuning on instruction-response pairs) serves as the foundation for RLHF.
Purpose
- Teaches the model to follow instructions rather than just complete text
- Establishes the format and style for responses
- Creates a starting point that already exhibits desired behaviors
- Provides the reference policy for KL regularization
Dataset Composition
Typical instruction tuning datasets include:
- Single-turn QA: Questions with direct answers
- Multi-turn dialogue: Conversational exchanges
- Task instructions: Specific tasks with examples
- Chain-of-thought: Reasoning traces for complex problems
Relationship to RLHF
The SFT model defines the "prior" that RLHF refines. A better SFT model means:
- The reward model has better starting outputs to compare
- Policy optimization has less work to do
- The KL penalty keeps the final model closer to this baseline
Reward Modeling
The reward model transforms pairwise preferences into a scalar signal for RL optimization.
The Bradley-Terry Model
Preferences are modeled using the Bradley-Terry framework:
P(A > B) = sigmoid(r(A) - r(B))
Where r(x) is the reward for output x. This assumes preferences depend only on the difference in rewards.
The loss function is:
L = -log(sigmoid(r(chosen) - r(rejected)))
This pushes the reward model to assign higher scores to chosen outputs.
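A self-contained sketch of the Bradley-Terry probability and loss, using plain scalar rewards (in practice r(·) is the reward model's output for a full (prompt, response) pair):

```python
import math

def bt_probability(r_a, r_b):
    """Bradley-Terry probability that output A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_probability(r_chosen, r_rejected))

# Equal rewards: the model is indifferent, P = 0.5 and loss = log 2.
print(bt_probability(1.0, 1.0))         # 0.5
print(round(bt_loss(1.0, 1.0), 4))      # 0.6931
# A larger margin r(chosen) - r(rejected) drives the loss toward zero.
print(bt_loss(3.0, 0.0) < bt_loss(1.0, 0.0))  # True
```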
Architecture
Reward models are typically:
- The SFT model with a scalar head instead of the language modeling head
- Trained on (prompt, chosen, rejected) tuples
- Designed to output a single scalar reward for any (prompt, response) pair
Considerations
- Scaling: Larger reward models generally produce better signals
- Calibration: Absolute reward values are less important than rankings
- Generalization: The model must score outputs it hasn't seen during training
- Over-optimization: Policies can exploit reward model weaknesses
See reference/reward-modeling.md for detailed training procedures.
Policy Optimization
Policy optimization uses RL to maximize expected reward while staying close to the reference policy.
The RLHF Objective
maximize E[R(x, y)] - β * KL(π || π_ref)
Where:
- R(x, y) is the reward for response y to prompt x
- KL(π || π_ref) measures deviation from the reference policy
- β controls the strength of the regularization
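In practice the KL term is typically estimated from per-token log-probabilities and folded into the reward signal; a sketch under that assumption (the log-prob values below are made up for illustration):

```python
# Sketch: KL-penalized reward, with the sequence-level KL estimated
# as the sum of per-token log-probability differences.
# All numeric values here are illustrative, not from a real model.
def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta):
    """R(x, y) - beta * sum_t (log pi(y_t) - log pi_ref(y_t))."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate

lp_policy = [-0.5, -1.0, -0.2]   # token log-probs under the current policy
lp_ref    = [-0.7, -1.1, -0.9]   # token log-probs under the reference (SFT) policy
print(round(kl_penalized_reward(2.0, lp_policy, lp_ref, beta=0.1), 4))  # 1.9
```

With beta = 0 the objective reduces to pure reward maximization; raising beta pulls the effective reward down whenever the policy drifts from the reference.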
PPO (Proximal Policy Optimization)
PPO is the most common algorithm for RLHF:
- Sample responses from the current policy
- Score responses with the reward model
- Compute advantage estimates
- Update policy with clipped surrogate objective
The clipping prevents large policy updates that could destabilize training.
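The clipped surrogate in the last step can be sketched for a single action (eps = 0.2 is a common default, not a fixed constant; advantage estimates come from the previous step):

```python
# Sketch of PPO's clipped surrogate objective for one action/token.
# ratio = pi_new(a|s) / pi_old(a|s), advantage from the advantage estimator.
def ppo_clip_objective(ratio, advantage, eps=0.2):
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    # Take the pessimistic (lower) value of the two candidates.
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is clipped at 1 + eps,
# capping how much a single update can benefit from moving the policy.
print(ppo_clip_objective(1.5, advantage=1.0))   # 1.2
print(ppo_clip_objective(1.05, advantage=1.0))  # 1.05
```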
KL Regularization
The KL penalty serves multiple purposes:
- Prevents reward hacking: Stops the policy from finding adversarial inputs to the reward model
- Maintains capabilities: Keeps the model close to the pretrained distribution
- Stabilizes training: Limits how far the policy can move per update
Higher β means more conservative updates; lower β allows more aggressive optimization.
REINFORCE vs PPO
REINFORCE is simpler but has higher variance:
- Uses raw returns without value function baseline
- Single-sample gradient estimates
- Can work for simpler problems
PPO adds complexity but improves stability:
- Clipped surrogate objective
- Multiple epochs per batch
- Better sample efficiency
See reference/policy-optimization.md for algorithm details.
Direct Alignment Algorithms
Direct alignment methods optimize the RLHF objective without training a separate reward model.
DPO (Direct Preference Optimization)
DPO reparameterizes the RLHF objective to derive a closed-form loss:
L = -log sigmoid(β * (log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))
Where y_w is the preferred response and y_l is the dispreferred response.
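A scalar sketch of this loss computed from sequence-level log-probabilities (the numeric values are made up; real implementations batch this over tensors and use numerically stable log-sigmoid):

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    # Implicit reward margin: beta * difference of log-ratios vs the reference.
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Illustrative values: the policy already favors the chosen response more
# than the reference does, so the loss falls below log 2 (the indifferent case).
loss = dpo_loss(logp_w=-5.0, logp_ref_w=-6.0, logp_l=-7.0, logp_ref_l=-6.5, beta=0.1)
print(loss < math.log(2))  # True
```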
Advantages:
- No separate reward model training
- Simpler pipeline with fewer hyperparameters
- More stable than RL-based methods
Trade-offs:
- Less flexible than explicit reward models
- Cannot reuse reward model for other purposes
- May be more sensitive to data quality
IPO (Identity Preference Optimization)
IPO addresses potential overfitting in DPO by using a different loss formulation that doesn't assume the Bradley-Terry model perfectly describes preferences.
KTO (Kahneman-Tversky Optimization)
KTO works with binary feedback (good/bad) rather than pairwise comparisons, making data collection easier. It's based on prospect theory from behavioral economics.
When to Use Direct Alignment
Direct alignment is preferred when:
- Simplicity is important
- Computational resources are limited
- The reward model won't be reused
Reward-based RLHF is preferred when:
- You need the reward model for other purposes (filtering, ranking)
- You have a large preference dataset
- You want maximum flexibility in optimization
See reference/direct-alignment.md for detailed algorithm comparisons.
Challenges
Over-Optimization
As optimization proceeds, the policy may exploit weaknesses in the reward model rather than improving on the true objective. Symptoms include:
- Rising reward model scores but declining human evaluation
- Increasingly verbose or formulaic outputs
- Sycophantic behavior (agreeing with users regardless of correctness)
Mitigations:
- Stronger KL regularization
- Reward model ensembles
- Early stopping based on held-out evaluation
Reward Hacking
The policy finds inputs that score highly with the reward model but don't represent genuine improvement:
- Length exploitation (longer responses score higher)
- Style mimicry (copying patterns from high-reward examples)
- Adversarial outputs that confuse the reward model
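Length exploitation in particular is easy to screen for: correlate reward model scores with response length over a sample of generations. A sketch with illustrative data (a strong positive correlation alone doesn't prove hacking, but it is a red flag worth investigating):

```python
# Sketch: diagnosing length exploitation by correlating reward with length.
# All data below is illustrative.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

lengths = [120, 340, 560, 810, 990]   # response lengths in tokens
rewards = [0.1, 0.4, 0.5, 0.8, 0.9]   # reward model scores
print(pearson(lengths, rewards) > 0.9)  # True: reward tracks length closely
```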
Evaluation
Evaluating aligned models is difficult because:
- Human preferences are subjective and context-dependent
- Automated metrics don't capture alignment properties well
- A/B testing is expensive and slow
- Models may perform differently on evaluation vs deployment
Distribution Shift
The preference data comes from a specific distribution of prompts and responses. The deployed model will encounter different inputs, and the reward model may not generalize well.
Best Practices
- Start with a strong SFT model: RLHF refines behavior; it works best when the base model already exhibits desired patterns
- Invest in preference data quality: Garbage in, garbage out—clear guidelines and trained annotators matter
- Use KL regularization: Don't optimize reward too aggressively; the reward model is an imperfect proxy
- Monitor for reward hacking: Track human evaluations alongside reward model scores
- Consider direct alignment first: DPO is simpler and often performs comparably to PPO
- Iterate on reward model: Improve the reward model as you discover its weaknesses
- Diverse prompts: Ensure preference data covers the distribution you care about
- Regularize appropriately: Higher β for safety-critical applications; lower β for capability-focused training
References
Reference Files
- reference/reward-modeling.md - Detailed reward model training procedures
- reference/policy-optimization.md - PPO and policy gradient algorithms for RLHF
- reference/direct-alignment.md - DPO, IPO, KTO and other direct methods
External Resources
- RLHF Book by Nathan Lambert - Comprehensive textbook on RLHF
- Training language models to follow instructions with human feedback - InstructGPT paper
- Direct Preference Optimization - DPO paper