grpo-rl-training


GRPO/RL Training with TRL


Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.

When to Use This Skill


Use GRPO training when you need to:
  • Enforce specific output formats (e.g., XML tags, JSON, structured reasoning)
  • Teach verifiable tasks with objective correctness metrics (math, coding, fact-checking)
  • Improve reasoning capabilities by rewarding chain-of-thought patterns
  • Align models to domain-specific behaviors without labeled preference data
  • Optimize for multiple objectives simultaneously (format + correctness + style)
Do NOT use GRPO for:
  • Simple supervised fine-tuning tasks (use SFT instead)
  • Tasks without clear reward signals
  • When you already have high-quality preference pairs (use DPO/PPO instead)


Core Concepts


1. GRPO Algorithm Fundamentals


Key Mechanism:
  • Generates multiple completions for each prompt (group size: 4-16)
  • Compares completions within each group using reward functions
  • Updates policy to favor higher-rewarded responses relative to the group
Critical Difference from PPO:
  • No separate reward model needed
  • More sample-efficient (learns from within-group comparisons)
  • Simpler to implement and debug
Mathematical Intuition:
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
  2. Compute rewards: {r₁, r₂, ..., rₙ}
  3. Learn to increase probability of high-reward completions
     relative to low-reward ones in the same group
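The group-relative step above can be sketched in a few lines of plain Python. This is an illustrative toy, not TRL's internal implementation: rewards are normalized within each group, so completions above the group mean get positive advantages and those below get negative ones.

```python
import statistics

def group_relative_advantages(rewards):
    # Normalize rewards within one group: above-mean completions get
    # positive advantages, below-mean ones get negative advantages.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# One group of 4 completions for the same prompt:
print(group_relative_advantages([2.0, 0.0, 1.0, 1.0]))
```

The policy update then raises the log-probability of completions with positive advantages, which is why no separate reward model is needed.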

2. Reward Function Design Philosophy


Golden Rules:
  1. Compose multiple reward functions - Each handles one aspect (format, correctness, style)
  2. Scale rewards appropriately - Higher weight = stronger signal
  3. Use incremental rewards - Partial credit for partial compliance
  4. Test rewards independently - Debug each reward function in isolation
Reward Function Types:
| Type | Use Case | Example Weight |
|---|---|---|
| Correctness | Verifiable tasks (math, code) | 2.0 (highest) |
| Format | Strict structure enforcement | 0.5-1.0 |
| Length | Encourage verbosity/conciseness | 0.1-0.5 |
| Style | Penalize unwanted patterns | -0.5 to 0.5 |


Implementation Workflow


Step 1: Dataset Preparation


Critical Requirements:
  • Prompts in chat format (list of dicts with 'role' and 'content')
  • Include system prompts to set expectations
  • For verifiable tasks, include ground truth answers as additional columns
Example Structure:
python
from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])
    })
Pro Tips:
  • Use one-shot or few-shot examples in system prompt for complex formats
  • Keep prompts concise (max_prompt_length: 256-512 tokens)
  • Validate data quality before training (garbage in = garbage out)
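That last tip can be automated. A minimal validator sketch (a hypothetical helper, not part of TRL or datasets) that checks rows match the prompt schema above:

```python
def validate_grpo_rows(rows, max_rows=100):
    # Check that each row matches the GRPO-compatible schema:
    # 'prompt' is a list of {'role', 'content'} dicts ending with a user turn.
    for i, row in enumerate(rows):
        if i >= max_rows:
            break
        prompt = row.get('prompt')
        assert isinstance(prompt, list) and prompt, f"row {i}: prompt must be a non-empty list"
        for msg in prompt:
            assert {'role', 'content'} <= set(msg), f"row {i}: message missing role/content"
        assert prompt[-1]['role'] == 'user', f"row {i}: prompt should end with the user turn"
    return True

rows = [{
    'prompt': [
        {'role': 'system', 'content': 'Respond in the format above.'},
        {'role': 'user', 'content': 'What is 15 + 27?'},
    ],
    'answer': '42',
}]
validate_grpo_rows(rows)  # raises AssertionError on malformed rows
```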

Step 2: Reward Function Implementation


Template Structure:
python
def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
    completions,    # List[List[Dict]]: Model generations
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]

    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)

    return rewards
Example 1: Correctness Reward (Math/Coding)
python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]
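extract_final_answer is not defined in this guide; for the <answer> format used here, a minimal version (an assumption to adapt to your own tags) might look like:

```python
import re

def extract_final_answer(text):
    # Return the content of the last <answer>...</answer> block, or None.
    matches = re.findall(r'<answer>(.*?)</answer>', text, re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_final_answer("<reasoning>15 + 27 = 42</reasoning>\n<answer> 42 </answer>"))  # 42
```

Plain string equality is brittle for math; normalizing whitespace and, where possible, comparing parsed numbers makes the correctness reward less noisy.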
Example 2: Format Reward (Structured Output)
python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]
Example 3: Incremental Format Reward (Partial Credit)
python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)

    return rewards
Critical Insight: Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
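Per golden rule 4, each reward function can be unit-tested on hand-written completions before any GPU time is spent. A sketch (it repeats format_reward from above so it runs standalone):

```python
import re

def format_reward(completions, **kwargs):
    # Same format reward as above, repeated here so this test is self-contained.
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

# Fake completions in the same shape TRL passes: List[List[Dict]]
good = [[{'role': 'assistant', 'content':
          '<reasoning>2+2=4</reasoning>\n<answer>4</answer>'}]]
bad = [[{'role': 'assistant', 'content': 'The answer is 4.'}]]

assert format_reward(good) == [1.0]
assert format_reward(bad) == [0.0]
print("reward function behaves as expected on known inputs")
```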

Step 3: Training Configuration


Memory-Optimized Config (Small GPU)
python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",

    # Learning rate
    learning_rate=5e-6,          # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',

    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
    num_generations=8,            # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,               # Or set fixed steps (e.g., 500)

    # Optimization
    bf16=True,                    # Faster on A100/H100
    optim="adamw_8bit",          # Memory-efficient optimizer
    max_grad_norm=0.1,

    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",            # Or "none" for no logging
)
High-Performance Config (Large GPU)
python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,           # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                # Fast generation with vLLM
    logging_steps=10,
)
Critical Hyperparameters:
| Parameter | Impact | Tuning Advice |
|---|---|---|
| num_generations | Group size for comparison | Start with 8, increase to 16 if GPU allows |
| learning_rate | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| max_completion_length | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
| gradient_accumulation_steps | Effective batch size | Increase if GPU memory limited |
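A quick arithmetic check helps when sizing these together (illustrative only; consult your TRL version's documentation for its exact batching constraints, such as divisibility between the batch size and num_generations):

```python
# Plan GPU memory: how many prompts and completions one optimizer step touches.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_generations = 8

prompts_per_step = per_device_train_batch_size * gradient_accumulation_steps
completions_per_step = prompts_per_step * num_generations
print(prompts_per_step, completions_per_step)  # 4 32
```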

Step 4: Model Setup and Training


Standard Setup (Transformers)
python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,              # Rank (higher = more capacity)
    lora_alpha=32,     # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")

**Unsloth Setup (2-3x Faster)**
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)
```


Rest is identical to the standard setup:

python
trainer = GRPOTrainer(model=model, ...)
trainer.train()

---

Critical Training Insights


1. Loss Behavior (EXPECTED PATTERN)


  • Loss starts near 0 and INCREASES during training
  • This is CORRECT - loss measures KL divergence from initial policy
  • Model is learning (diverging from original behavior to optimize rewards)
  • Monitor reward metrics instead of loss for progress

2. Reward Tracking


Key metrics to watch:
  • reward: Average across all completions
  • reward_std: Diversity within groups (should remain > 0)
  • kl: KL divergence from reference (should grow moderately)
Healthy Training Pattern:
Step   Reward    Reward_Std   KL
100    0.5       0.3          0.02
200    0.8       0.25         0.05
300    1.2       0.2          0.08  ← Good progression
400    1.5       0.15         0.12
Warning Signs:
  • Reward std → 0 (model collapsing to single response)
  • KL exploding (> 0.5) (diverging too much, reduce LR)
  • Reward stuck (reward functions too harsh or model capacity issue)
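These warning signs are easy to check programmatically each time metrics are logged. A sketch using the heuristic thresholds from this guide (they are not TRL defaults):

```python
def training_warnings(reward_std, kl, std_floor=0.05, kl_ceiling=0.5):
    # Flag the failure modes listed above using heuristic thresholds.
    warnings = []
    if reward_std < std_floor:
        warnings.append("reward_std near 0: possible mode collapse")
    if kl > kl_ceiling:
        warnings.append("KL too large: policy diverging, consider lowering LR")
    return warnings

print(training_warnings(reward_std=0.2, kl=0.08))   # [] -> healthy row from the table above
print(training_warnings(reward_std=0.01, kl=0.6))   # both warnings fire
```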

3. Common Pitfalls and Solutions


| Problem | Symptom | Solution |
|---|---|---|
| Mode collapse | All completions identical | Increase num_generations, add diversity penalty |
| No learning | Flat rewards | Check reward function logic, increase LR |
| OOM errors | GPU memory exceeded | Reduce num_generations, enable gradient checkpointing |
| Slow training | < 1 it/s | Enable use_vllm=True, use Unsloth, reduce seq length |
| Format ignored | Model doesn't follow structure | Increase format reward weight, add incremental rewards |


Advanced Patterns


1. Multi-Stage Training


For complex tasks, train in stages:
python
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()

2. Adaptive Reward Scaling


python
class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9

3. Custom Dataset Integration


python
def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)

    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset


Deployment and Inference


Save and Merge LoRA


python
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")

Inference Example


python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)

result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])


Best Practices Checklist


Before Training:
  • Validate dataset format (prompts as List[Dict])
  • Test reward functions on sample data
  • Calculate expected max_prompt_length from data
  • Choose appropriate num_generations based on GPU memory
  • Set up logging (wandb recommended)
During Training:
  • Monitor reward progression (should increase)
  • Check reward_std (should stay > 0.1)
  • Watch for OOM errors (reduce batch size if needed)
  • Sample generations every 50-100 steps
  • Validate format compliance on holdout set
After Training:
  • Merge LoRA weights if using PEFT
  • Test on diverse prompts
  • Compare to baseline model
  • Document reward weights and hyperparameters
  • Save reproducibility config


Troubleshooting Guide


Debugging Workflow


  1. Isolate reward functions - Test each independently
  2. Check data distribution - Ensure diversity in prompts
  3. Reduce complexity - Start with single reward, add gradually
  4. Monitor generations - Print samples every N steps
  5. Validate extraction logic - Ensure answer parsing works

Quick Fixes


python
# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1])  # Generate without updating

---

References and Resources


Official Documentation:
Example Repositories:
Recommended Reading:
  • Progressive Disclosure Pattern for agent instructions
  • Reward shaping in RL (Ng et al.)
  • LoRA paper (Hu et al., 2021)


Usage Instructions for Agents


When this skill is loaded:
  1. Read this entire file before implementing GRPO training
  2. Start with the simplest reward function (e.g., length-based) to validate setup
  3. Use the templates in the templates/ directory as starting points
  4. Reference the examples in examples/ for task-specific implementations
  5. Follow the workflow sequentially (don't skip steps)
  6. Debug incrementally - add one reward function at a time
Critical Reminders:
  • Always use multiple reward functions (3-5 is optimal)
  • Monitor reward metrics, not loss
  • Test reward functions before training
  • Start small (num_generations=4), scale up gradually
  • Save checkpoints frequently (every 100 steps)
This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.