unsloth-training

<objective> Guide LLM fine-tuning using Unsloth:
  1. GRPO - RL with reward functions (no labeled outputs needed)
  2. SFT - Supervised fine-tuning with input/output pairs
  3. Vision - VLM fine-tuning (Qwen3-VL, Gemma3, Llama 3.2 Vision)
Key capabilities:
  • FP8 Training - 60% less VRAM, 1.4x faster (RTX 40+, H100)
  • 3x Packing - Automatic 2-5x speedup for mixed-length data
  • Docker - Official unsloth/unsloth image
  • Mobile - QAT → ExecuTorch → iOS/Android (~40 tok/s)
  • Export - GGUF, Ollama, vLLM, LM Studio, SGLang </objective>
<quick_start> GRPO with FP8 (60% less VRAM):
python
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1"  # Shared memory
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048, load_in_fp8=True, fast_inference=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def correctness_reward(completions, answer, **kwargs):
    return [2.0 if extract_answer(c) == a else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(num_generations=4, beta=0.04, learning_rate=5e-6),
    train_dataset=dataset, reward_funcs=[correctness_reward],
)
trainer.train()
SFT with Packing (2-5x faster):
python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model, train_dataset=dataset, processing_class=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2, num_train_epochs=3,
        learning_rate=2e-4, packing=True,  # 2-5x speedup
    ),
)
trainer.train()
</quick_start>
<success_criteria> A training run is successful when:
  • Model loads without OOM errors
  • Reward (GRPO) or loss (SFT) shows improvement trend
  • Generated outputs match expected format
  • Model exported to desired format (LoRA, merged, GGUF)
  • Test inference produces reasonable outputs </success_criteria>
<activation_triggers> Explicit triggers:
  • /unsloth grpo - GRPO (RL) training
  • /unsloth sft - SFT training
  • /unsloth fp8 - FP8 training setup
  • /unsloth vision - VLM fine-tuning
  • /unsloth mobile - Phone deployment (QAT)
  • /unsloth docker - Docker container setup
  • /unsloth troubleshoot - Debug issues
Natural language:
  • "train with GRPO", "fine-tune", "reward functions"
  • "FP8 training", "fp8", "less VRAM"
  • "vision fine-tuning", "VLM", "image training"
  • "phone deployment", "mobile LLM", "ExecuTorch"
  • "docker training", "container", "unsloth docker"
  • "packing", "faster training", "500k context"
  • "export GGUF", "Ollama", "vLLM", "SGLang" </activation_triggers>
<file_locations> Core references:
  • reference/reward-design.md - Reward function patterns
  • reference/domain-examples.md - Voice AI, Sales Agent examples
  • reference/hyperparameters.md - GRPOConfig reference
  • reference/troubleshooting.md - Common fixes
New feature references:
  • reference/fp8-training.md - FP8 setup, VRAM savings
  • reference/deployment.md - Docker, vLLM, LoRA hot-swap, SGLang
  • reference/export-formats.md - GGUF, Ollama, LM Studio, Dynamic 2.0
  • reference/advanced-training.md - 500K context, packing, checkpoints
  • reference/vision-training.md - VLM fine-tuning
  • reference/mobile-deployment.md - QAT, ExecuTorch, iOS/Android
Code examples: reference/grpo/, reference/sft/
</file_locations>
<core_concepts>

When to Use GRPO vs SFT

| Method | Use When                           | Data Needed                  |
|--------|------------------------------------|------------------------------|
| GRPO   | Improving reasoning quality        | Prompts + verifiable answers |
| GRPO   | Aligning behavior with preferences | Reward functions             |
| GRPO   | When you can verify correctness    | Verifiable outputs           |
| SFT    | Teaching a specific output format  | Input/output pairs           |
| SFT    | Following new instructions         | Conversation examples        |
| SFT    | Learning domain knowledge          | Labeled examples             |

Model Selection

| Model                          | Size | VRAM | Use Case                        |
|--------------------------------|------|------|---------------------------------|
| unsloth/Qwen2.5-0.5B-Instruct  | 0.5B | 5GB  | Mobile deployment (~200MB GGUF) |
| unsloth/Qwen2.5-1.5B-Instruct  | 1.5B | 5GB  | Learning/prototyping            |
| Qwen/Qwen2.5-3B-Instruct       | 3B   | 8GB  | Good balance (recommended start)|
| unsloth/Qwen2.5-7B-Instruct    | 7B   | 16GB | Production quality              |
| unsloth/Phi-4                  | 14B  | 20GB | Strong reasoning                |

Core Hyperparameters


GRPO (RL):
python
GRPOConfig(
    num_generations=4,        # Completions per prompt (2-8)
    beta=0.04,                # KL penalty (0.01-0.1)
    learning_rate=5e-6,       # 10x smaller than SFT!
    max_completion_length=512,
    max_steps=300,            # Minimum for results
)
SFT:
python
TrainingArguments(
    learning_rate=2e-4,       # Standard SFT rate
    num_train_epochs=3,       # 2-4 typical
    per_device_train_batch_size=2,
)
</core_concepts>
<reward_functions>

Reward Function Design


Reward functions are the core of GRPO. Each function receives the batch of completions and returns a list of floats, one score per completion.

Pattern 1: Correctness (Primary Signal)


python
def correctness_reward(completions, answer, **kwargs):
    """
    +2.0 for correct answer, 0.0 otherwise.
    This should be your highest-weighted reward.
    """
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards
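A reward function like this is cheap to sanity-check offline before launching a run. The sketch below inlines a minimal `extract_answer` (matching the helper defined in the prompt-format section) so it runs standalone:

```python
import re

def extract_answer(text: str) -> str:
    # Minimal inline version of the XML-tag extraction helper
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completions, answer, **kwargs):
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards

# Exercise the three paths: numeric match, numeric mismatch, unparseable
completions = [
    "<reasoning>15 + 27 = 42</reasoning><answer>42</answer>",
    "<reasoning>guess</reasoning><answer>40</answer>",
    "no tags at all",
]
print(correctness_reward(completions, answer=["42", "42", "42"]))
# [2.0, 0.0, 0.0]
```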

Pattern 2: Format Compliance


python
def format_reward(completions, **kwargs):
    """
    +0.5 for proper XML structure with reasoning and answer tags.
    """
    rewards = []
    for completion in completions:
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
        if has_reasoning and has_answer:
            rewards.append(0.5)
        elif has_answer:
            rewards.append(0.2)
        else:
            rewards.append(0.0)
    return rewards

Pattern 3: Reasoning Quality


python
def reasoning_length_reward(completions, **kwargs):
    """
    +0.3 for substantive reasoning (30-200 words).
    """
    rewards = []
    for completion in completions:
        reasoning = extract_reasoning(completion)
        word_count = len(reasoning.split()) if reasoning else 0
        if 30 <= word_count <= 200:
            rewards.append(0.3)
        elif 15 <= word_count < 30:
            rewards.append(0.1)
        else:
            rewards.append(0.0)
    return rewards

Pattern 4: Negative Constraints


python
def no_hedging_reward(completions, **kwargs):
    """
    -0.3 penalty for uncertainty language.
    """
    hedging = ["i think", "maybe", "perhaps", "possibly", "i'm not sure"]
    rewards = []
    for completion in completions:
        has_hedging = any(phrase in completion.lower() for phrase in hedging)
        rewards.append(-0.3 if has_hedging else 0.0)
    return rewards

Typical Reward Stack


python
reward_funcs = [
    correctness_reward,      # +2.0 max (primary signal)
    format_reward,           # +0.5 max (structure)
    reasoning_length_reward, # +0.3 max (quality)
    no_hedging_reward,       # -0.3 max (constraint)
]

Total range: -0.3 to +2.8
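GRPOTrainer scores every completion with each function in `reward_funcs` and combines the results (by default, a sum). The aggregation can be mimicked locally to see how the stack produces one scalar per completion — a simplified sketch with two toy rewards standing in for the full stack, not TRL's internal code:

```python
# Toy stand-ins for the reward stack above
def format_reward(completions, **kwargs):
    return [0.5 if "<answer>" in c else 0.0 for c in completions]

def no_hedging_reward(completions, **kwargs):
    hedging = ["i think", "maybe", "perhaps"]
    return [-0.3 if any(p in c.lower() for p in hedging) else 0.0 for c in completions]

reward_funcs = [format_reward, no_hedging_reward]

def combined_reward(completions, **kwargs):
    # Element-wise sum across the stack: one scalar per completion
    per_func = [f(completions, **kwargs) for f in reward_funcs]
    return [round(sum(scores), 2) for scores in zip(*per_func)]

print(combined_reward(["Maybe <answer>42</answer>", "<answer>7</answer>"]))
# [0.2, 0.5]
```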



> **For domain-specific rewards:** See `reference/domain-examples.md` for Voice AI, Sales Agent, and Support patterns.
</reward_functions>

<prompt_format>


Prompt Structure


System Prompt with XML Tags


python
SYSTEM_PROMPT = """You are a helpful assistant that thinks step-by-step.

Always respond in this exact format:
<reasoning>
[Your step-by-step thinking process]
</reasoning>
<answer>
[Your final answer - just the number or short response]
</answer>
"""

Extraction Helpers


python
import re

def extract_answer(text: str) -> str:
    """Extract answer from XML tags"""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_reasoning(text: str) -> str:
    """Extract reasoning from XML tags"""
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    return match.group(1).strip() if match else ""
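A quick round-trip check that completions shaped like the system prompt parse cleanly with these helpers (self-contained; repeats the two regex helpers so it runs standalone — `re.DOTALL` is what lets the match span the newlines inside the tags):

```python
import re

def extract_answer(text: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_reasoning(text: str) -> str:
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

# A completion shaped exactly as the system prompt requests
sample = """<reasoning>
Add the tens (10 + 20 = 30), then the ones (5 + 7 = 12), so 42.
</reasoning>
<answer>
42
</answer>"""

print(extract_answer(sample))  # 42
print(extract_reasoning(sample).startswith("Add the tens"))  # True
```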

Dataset Format


GRPO (prompt-only):
python
dataset = dataset.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["question"]}
    ],
    "answer": ex["answer"]  # Ground truth for verification
})
SFT (full conversations):
python
dataset = dataset.map(lambda ex: {
    "conversations": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["input"]},
        {"role": "assistant", "content": ex["output"]}
    ]
})
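Both `.map` calls above assume a Hugging Face `datasets` object, but the transform itself is plain-dict logic and can be checked without loading anything (the field names `question`, `input`, and `output` are assumptions about your raw data):

```python
SYSTEM_PROMPT = "You are a helpful assistant that thinks step-by-step."

def to_grpo_example(ex):
    # Same shape the GRPO .map call produces: prompt messages + ground truth
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ex["question"]},
        ],
        "answer": ex["answer"],
    }

def to_sft_example(ex):
    # Same shape the SFT .map call produces: a full conversation
    return {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ex["input"]},
            {"role": "assistant", "content": ex["output"]},
        ]
    }

row = {"question": "15 + 27 = ?", "answer": "42",
       "input": "15 + 27 = ?", "output": "42"}
print(to_grpo_example(row)["answer"])             # 42
print(len(to_sft_example(row)["conversations"]))  # 3
```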
</prompt_format>
<model_export>

Save and Deploy


Save LoRA Only (~100MB)


python
model.save_lora("grpo_lora")

Merge and Save Full Model


python
model.save_pretrained_merged(
    "grpo_merged", tokenizer,
    save_method="merged_16bit",
)

Export to GGUF for Ollama


python
model.save_pretrained_gguf(
    "grpo_gguf", tokenizer,
    quantization_method="q4_k_m",  # Options: q4_k_m, q8_0, q5_k_m
)

Test with Ollama

bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./grpo_gguf/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}

User: {{ .Prompt }}

Assistant: """
PARAMETER temperature 0.7
EOF

ollama create my-model -f Modelfile
ollama run my-model "Solve: 15 + 27 = ?"
</model_export>

<routing>

Request Routing

GRPO training: → GRPOConfig, reward functions, dataset prep → Reference: reference/grpo/basic_grpo.py
SFT training: → SFTTrainer, dataset formatting → Reference: reference/sft/sales_extractor_training.py
Reward function design: → 4 patterns (correctness, format, quality, constraints) → Reference: reference/reward-design.md, reference/domain-examples.md
FP8 training: → 60% VRAM savings, env vars, pre-quantized models → Reference: reference/fp8-training.md
Docker setup: → Official image, volumes, Jupyter/SSH → Reference: reference/deployment.md
Vision fine-tuning: → FastVisionModel, VLM data format → Reference: reference/vision-training.md
Mobile deployment: → QAT, ExecuTorch, iOS/Android → Reference: reference/mobile-deployment.md
Long context / packing: → 500K context, 2-5x speedup → Reference: reference/advanced-training.md
Export formats: → GGUF methods, Ollama, vLLM, SGLang → Reference: reference/export-formats.md
Training issues: → Reference: reference/troubleshooting.md
</routing>
<troubleshooting_quick>

Quick Troubleshooting

| Symptom                | Fix                                                  |
|------------------------|------------------------------------------------------|
| Reward not increasing  | Wait 300+ steps, then increase learning_rate 2x      |
| Reward spiky/unstable  | Decrease learning_rate 0.5x, increase beta           |
| Model outputs garbage  | Increase beta 2-4x, check prompt format              |
| Out of memory          | Reduce max_completion_length, set num_generations=2  |
| No reasoning appearing | Train 500+ steps, use a model >= 1.5B                |

For detailed troubleshooting: See reference/troubleshooting.md
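The out-of-memory row maps onto concrete GRPOConfig knobs. A sketch with illustrative values (requires trl; `gradient_accumulation_steps` is added here to preserve the effective batch size, an assumption rather than a fix stated by this skill):

```python
from trl import GRPOConfig

low_mem_args = GRPOConfig(
    num_generations=2,              # fewer sampled completions per prompt
    max_completion_length=256,      # shorter generations
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # keep the effective batch size
    learning_rate=5e-6,
)
```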
</troubleshooting_quick>
<training_checklist>

Pre-Training Checklist

GRPO:
  • Model loads without OOM
  • LoRA configured with use_gradient_checkpointing="unsloth"
  • Dataset has prompt and answer fields
  • At least one reward function defined and tested
  • num_generations >= 2
  • beta set (0.01-0.1, start at 0.04)
  • learning_rate set (1e-6 to 1e-5)
  • At least 300 steps planned
SFT:
  • Model loads without OOM
  • Dataset has conversations or text field
  • Chat template applied correctly
  • learning_rate ~2e-4
  • 2-4 epochs planned </training_checklist>
<cost_estimates>

Cost Estimates


Training Data Generation:
  • 50 seeds x 10 variations x ~1000 tokens = ~500K tokens
  • Claude Sonnet: ~$1.50
  • GPT-4o-mini: ~$0.30
Training Compute:
  • Google Colab T4: Free (30-60 min for 1000 examples)
  • RunPod A40 (48GB): ~$0.79/hr
Total per experiment: ~$0.50-$1.00 on RunPod </cost_estimates>
<example_session>

Example Session


User: "I want to train a model that gives concise voice responses"
Claude:
  1. First, let's set up the model:
python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024, load_in_4bit=True,
)
  2. For voice AI, you need brevity rewards. Here's a custom stack:
python
def brevity_reward(completions, **kwargs):
    """Voice responses under 50 words get +0.5"""
    return [0.5 if len(c.split()) <= 50 else -0.3 for c in completions]

def speakable_reward(completions, **kwargs):
    """Penalize markdown that sounds bad spoken"""
    bad = ["**", "- ", "```", "http"]
    return [-0.2 * sum(1 for b in bad if b in c) for c in completions]

reward_funcs = [correctness_reward, brevity_reward, speakable_reward]
  3. See reference/domain-examples.md for complete Voice AI reward patterns. </example_session>