grpo-rl-training


GRPO/RL Training with TRL


Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.

When to Use This Skill


Use GRPO training when you need to:
  • Enforce specific output formats (e.g., XML tags, JSON, structured reasoning)
  • Teach verifiable tasks with objective correctness metrics (math, coding, fact-checking)
  • Improve reasoning capabilities by rewarding chain-of-thought patterns
  • Align models to domain-specific behaviors without labeled preference data
  • Optimize for multiple objectives simultaneously (format + correctness + style)
Do NOT use GRPO for:
  • Simple supervised fine-tuning tasks (use SFT instead)
  • Tasks without clear reward signals
  • When you already have high-quality preference pairs (use DPO/PPO instead)


Core Concepts


1. GRPO Algorithm Fundamentals


Key Mechanism:
  • Generates multiple completions for each prompt (group size: 4-16)
  • Compares completions within each group using reward functions
  • Updates policy to favor higher-rewarded responses relative to the group
Critical Difference from PPO:
  • No separate reward model needed
  • More sample-efficient (learns from within-group comparisons)
  • Simpler to implement and debug
Mathematical Intuition:
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
  2. Compute rewards: {r₁, r₂, ..., rₙ}
  3. Learn to increase probability of high-reward completions
     relative to low-reward ones in the same group
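The group-relative step above can be sketched in a few lines of plain Python. This is an illustrative toy, not TRL's internal implementation: rewards are normalized within each group, so completions above the group mean get positive advantages and those below get negative ones.

```python
import statistics

def group_relative_advantages(rewards):
    # Normalize rewards within one group: above-mean completions get
    # positive advantages, below-mean ones get negative advantages.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# One group of 4 completions for the same prompt:
print(group_relative_advantages([2.0, 0.0, 1.0, 1.0]))
```

The policy update then raises the log-probability of completions with positive advantages, which is why no separate reward model is needed.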

2. Reward Function Design Philosophy


Golden Rules:
  1. Compose multiple reward functions - Each handles one aspect (format, correctness, style)
  2. Scale rewards appropriately - Higher weight = stronger signal
  3. Use incremental rewards - Partial credit for partial compliance
  4. Test rewards independently - Debug each reward function in isolation
Reward Function Types:
| Type | Use Case | Example Weight |
|---|---|---|
| Correctness | Verifiable tasks (math, code) | 2.0 (highest) |
| Format | Strict structure enforcement | 0.5-1.0 |
| Length | Encourage verbosity/conciseness | 0.1-0.5 |
| Style | Penalize unwanted patterns | -0.5 to 0.5 |


Implementation Workflow


Step 1: Dataset Preparation


Critical Requirements:
  • Prompts in chat format (list of dicts with 'role' and 'content')
  • Include system prompts to set expectations
  • For verifiable tasks, include ground truth answers as additional columns
Example Structure:
python
from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])
    })
Pro Tips:
  • Use one-shot or few-shot examples in system prompt for complex formats
  • Keep prompts concise (max_prompt_length: 256-512 tokens)
  • Validate data quality before training (garbage in = garbage out)
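That last tip can be automated. A minimal validator sketch (a hypothetical helper, not part of TRL or datasets) that checks rows match the prompt schema above:

```python
def validate_grpo_rows(rows, max_rows=100):
    # Check that each row matches the GRPO-compatible schema:
    # 'prompt' is a list of {'role', 'content'} dicts ending with a user turn.
    for i, row in enumerate(rows):
        if i >= max_rows:
            break
        prompt = row.get('prompt')
        assert isinstance(prompt, list) and prompt, f"row {i}: prompt must be a non-empty list"
        for msg in prompt:
            assert {'role', 'content'} <= set(msg), f"row {i}: message missing role/content"
        assert prompt[-1]['role'] == 'user', f"row {i}: prompt should end with the user turn"
    return True

rows = [{
    'prompt': [
        {'role': 'system', 'content': 'Respond in the format above.'},
        {'role': 'user', 'content': 'What is 15 + 27?'},
    ],
    'answer': '42',
}]
validate_grpo_rows(rows)  # raises AssertionError on malformed rows
```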

Step 2: Reward Function Implementation


Template Structure:
python
def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
    completions,    # List[List[Dict]]: Model generations
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]

    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)

    return rewards
Example 1: Correctness Reward (Math/Coding)
python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]
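extract_final_answer is not defined in this guide; for the <answer> format used here, a minimal version (an assumption to adapt to your own tags) might look like:

```python
import re

def extract_final_answer(text):
    # Return the content of the last <answer>...</answer> block, or None.
    matches = re.findall(r'<answer>(.*?)</answer>', text, re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_final_answer("<reasoning>15 + 27 = 42</reasoning>\n<answer> 42 </answer>"))  # 42
```

Plain string equality is brittle for math; normalizing whitespace and, where possible, comparing parsed numbers makes the correctness reward less noisy.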
Example 2: Format Reward (Structured Output)
python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]
Example 3: Incremental Format Reward (Partial Credit)
python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)

    return rewards
Critical Insight: Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
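Per golden rule 4, each reward function can be unit-tested on hand-written completions before any GPU time is spent. A sketch (it repeats format_reward from above so it runs standalone):

```python
import re

def format_reward(completions, **kwargs):
    # Same format reward as above, repeated here so this test is self-contained.
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

# Fake completions in the same shape TRL passes: List[List[Dict]]
good = [[{'role': 'assistant', 'content':
          '<reasoning>2+2=4</reasoning>\n<answer>4</answer>'}]]
bad = [[{'role': 'assistant', 'content': 'The answer is 4.'}]]

assert format_reward(good) == [1.0]
assert format_reward(bad) == [0.0]
print("reward function behaves as expected on known inputs")
```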

Step 3: Training Configuration


Memory-Optimized Config (Small GPU)
python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",

    # Learning rate
    learning_rate=5e-6,          # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',

    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
    num_generations=8,            # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,               # Or set fixed steps (e.g., 500)

    # Optimization
    bf16=True,                    # Faster on A100/H100
    optim="adamw_8bit",          # Memory-efficient optimizer
    max_grad_norm=0.1,

    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",            # Or "none" for no logging
)
High-Performance Config (Large GPU)
python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,           # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                # Fast generation with vLLM
    logging_steps=10,
)
Critical Hyperparameters:
| Parameter | Impact | Tuning Advice |
|---|---|---|
| num_generations | Group size for comparison | Start with 8, increase to 16 if GPU allows |
| learning_rate | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| max_completion_length | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
| gradient_accumulation_steps | Effective batch size | Increase if GPU memory limited |
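A quick arithmetic check helps when sizing these together (illustrative only; consult your TRL version's documentation for its exact batching constraints, such as divisibility between the batch size and num_generations):

```python
# Plan GPU memory: how many prompts and completions one optimizer step touches.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_generations = 8

prompts_per_step = per_device_train_batch_size * gradient_accumulation_steps
completions_per_step = prompts_per_step * num_generations
print(prompts_per_step, completions_per_step)  # 4 32
```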

Step 4: Model Setup and Training


Standard Setup (Transformers)
python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,              # Rank (higher = more capacity)
    lora_alpha=32,     # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")

**Unsloth Setup (2-3x Faster)**
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)
```


Rest is identical to the standard setup:

python
trainer = GRPOTrainer(model=model, ...)
trainer.train()

---

Critical Training Insights


1. Loss Behavior (EXPECTED PATTERN)


  • Loss starts near 0 and INCREASES during training
  • This is CORRECT - loss measures KL divergence from initial policy
  • Model is learning (diverging from original behavior to optimize rewards)
  • Monitor reward metrics instead of loss for progress

2. Reward Tracking


Key metrics to watch:
  • reward: Average across all completions
  • reward_std: Diversity within groups (should remain > 0)
  • kl: KL divergence from reference (should grow moderately)
Healthy Training Pattern:
Step   Reward    Reward_Std   KL
100    0.5       0.3          0.02
200    0.8       0.25         0.05
300    1.2       0.2          0.08  ← Good progression
400    1.5       0.15         0.12
Warning Signs:
  • Reward std → 0 (model collapsing to single response)
  • KL exploding (> 0.5) (diverging too much, reduce LR)
  • Reward stuck (reward functions too harsh or model capacity issue)
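These warning signs are easy to check programmatically each time metrics are logged. A sketch using the heuristic thresholds from this guide (they are not TRL defaults):

```python
def training_warnings(reward_std, kl, std_floor=0.05, kl_ceiling=0.5):
    # Flag the failure modes listed above using heuristic thresholds.
    warnings = []
    if reward_std < std_floor:
        warnings.append("reward_std near 0: possible mode collapse")
    if kl > kl_ceiling:
        warnings.append("KL too large: policy diverging, consider lowering LR")
    return warnings

print(training_warnings(reward_std=0.2, kl=0.08))   # [] -> healthy row from the table above
print(training_warnings(reward_std=0.01, kl=0.6))   # both warnings fire
```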

3. Common Pitfalls and Solutions


| Problem | Symptom | Solution |
|---|---|---|
| Mode collapse | All completions identical | Increase num_generations, add diversity penalty |
| No learning | Flat rewards | Check reward function logic, increase LR |
| OOM errors | GPU memory exceeded | Reduce num_generations, enable gradient checkpointing |
| Slow training | < 1 it/s | Enable use_vllm=True, use Unsloth, reduce seq length |
| Format ignored | Model doesn't follow structure | Increase format reward weight, add incremental rewards |


Advanced Patterns


1. Multi-Stage Training


For complex tasks, train in stages:
python
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()

2. Adaptive Reward Scaling


python
class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9

3. Custom Dataset Integration


python
def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)

    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset


Deployment and Inference


Save and Merge LoRA


python
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")

Inference Example


python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)

result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])


Best Practices Checklist


Before Training:
  • Validate dataset format (prompts as List[Dict])
  • Test reward functions on sample data
  • Calculate expected max_prompt_length from data
  • Choose appropriate num_generations based on GPU memory
  • Set up logging (wandb recommended)
During Training:
  • Monitor reward progression (should increase)
  • Check reward_std (should stay > 0.1)
  • Watch for OOM errors (reduce batch size if needed)
  • Sample generations every 50-100 steps
  • Validate format compliance on holdout set
After Training:
  • Merge LoRA weights if using PEFT
  • Test on diverse prompts
  • Compare to baseline model
  • Document reward weights and hyperparameters
  • Save reproducibility config


Troubleshooting Guide


Debugging Workflow


  1. Isolate reward functions - Test each independently
  2. Check data distribution - Ensure diversity in prompts
  3. Reduce complexity - Start with single reward, add gradually
  4. Monitor generations - Print samples every N steps
  5. Validate extraction logic - Ensure answer parsing works

Quick Fixes


python
# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1])  # Generate without updating

---

References and Resources


Official Documentation:
Example Repositories:
Recommended Reading:
  • Progressive Disclosure Pattern for agent instructions
  • Reward shaping in RL (Ng et al.)
  • LoRA paper (Hu et al., 2021)


Usage Instructions for Agents


When this skill is loaded:
  1. Read this entire file before implementing GRPO training
  2. Start with the simplest reward function (e.g., length-based) to validate setup
  3. Use the templates in the templates/ directory as starting points
  4. Reference the examples in examples/ for task-specific implementations
  5. Follow the workflow sequentially (don't skip steps)
  6. Debug incrementally - add one reward function at a time
Critical Reminders:
  • Always use multiple reward functions (3-5 is optimal)
  • Monitor reward metrics, not loss
  • Test reward functions before training
  • Start small (num_generations=4), scale up gradually
  • Save checkpoints frequently (every 100 steps)
This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.