unsloth-training

<objective> Guide LLM fine-tuning using Unsloth:
  1. GRPO - RL with reward functions (no labeled outputs needed)
  2. SFT - Supervised fine-tuning with input/output pairs
  3. Vision - VLM fine-tuning (Qwen3-VL, Gemma3, Llama 3.2 Vision)
Key capabilities:
  • FP8 Training - 60% less VRAM, 1.4x faster (RTX 40+, H100)
  • 3x Packing - Automatic 2-5x speedup for mixed-length data
  • Docker - Official unsloth/unsloth image
  • Mobile - QAT → ExecuTorch → iOS/Android (~40 tok/s)
  • Export - GGUF, Ollama, vLLM, LM Studio, SGLang </objective>
<quick_start> GRPO with FP8 (60% less VRAM):
python
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1"  # Shared memory
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048, load_in_fp8=True, fast_inference=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def correctness_reward(completions, answer, **kwargs):
    return [2.0 if extract_answer(c) == a else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(num_generations=4, beta=0.04, learning_rate=5e-6),
    train_dataset=dataset, reward_funcs=[correctness_reward],
)
trainer.train()
SFT with Packing (2-5x faster):
python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model, train_dataset=dataset, processing_class=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2, num_train_epochs=3,
        learning_rate=2e-4, packing=True,  # 2-5x speedup
    ),
)
trainer.train()
</quick_start>
<success_criteria> A training run is successful when:
  • Model loads without OOM errors
  • Reward (GRPO) or loss (SFT) shows improvement trend
  • Generated outputs match expected format
  • Model exported to desired format (LoRA, merged, GGUF)
  • Test inference produces reasonable outputs </success_criteria>
<activation_triggers> Explicit triggers:
  • /unsloth grpo - GRPO (RL) training
  • /unsloth sft - SFT training
  • /unsloth fp8 - FP8 training setup
  • /unsloth vision - VLM fine-tuning
  • /unsloth mobile - Phone deployment (QAT)
  • /unsloth docker - Docker container setup
  • /unsloth troubleshoot - Debug issues
Natural language:
  • "train with GRPO", "fine-tune", "reward functions"
  • "FP8 training", "fp8", "less VRAM"
  • "vision fine-tuning", "VLM", "image training"
  • "phone deployment", "mobile LLM", "ExecuTorch"
  • "docker training", "container", "unsloth docker"
  • "packing", "faster training", "500k context"
  • "export GGUF", "Ollama", "vLLM", "SGLang" </activation_triggers>
<file_locations> Core references:
  • reference/reward-design.md - Reward function patterns
  • reference/domain-examples.md - Voice AI, Sales Agent examples
  • reference/hyperparameters.md - GRPOConfig reference
  • reference/troubleshooting.md - Common fixes
New feature references:
  • reference/fp8-training.md - FP8 setup, VRAM savings
  • reference/deployment.md - Docker, vLLM, LoRA hot-swap, SGLang
  • reference/export-formats.md - GGUF, Ollama, LM Studio, Dynamic 2.0
  • reference/advanced-training.md - 500K context, packing, checkpoints
  • reference/vision-training.md - VLM fine-tuning
  • reference/mobile-deployment.md - QAT, ExecuTorch, iOS/Android
Code examples: reference/grpo/, reference/sft/
</file_locations>
<core_concepts>

When to Use GRPO vs SFT

| Method | Use When                           | Data Needed                  |
|--------|------------------------------------|------------------------------|
| GRPO   | Improving reasoning quality        | Prompts + verifiable answers |
| GRPO   | Aligning behavior with preferences | Reward functions             |
| GRPO   | When you can verify correctness    | Verifiable outputs           |
| SFT    | Teaching a specific output format  | Input/output pairs           |
| SFT    | Following new instructions         | Conversation examples        |
| SFT    | Learning domain knowledge          | Labeled examples             |

Model Selection

| Model                          | Size | VRAM | Use Case                        |
|--------------------------------|------|------|---------------------------------|
| unsloth/Qwen2.5-0.5B-Instruct  | 0.5B | 5GB  | Mobile deployment (~200MB GGUF) |
| unsloth/Qwen2.5-1.5B-Instruct  | 1.5B | 5GB  | Learning/prototyping            |
| Qwen/Qwen2.5-3B-Instruct       | 3B   | 8GB  | Good balance (recommended start)|
| unsloth/Qwen2.5-7B-Instruct    | 7B   | 16GB | Production quality              |
| unsloth/Phi-4                  | 14B  | 20GB | Strong reasoning                |

Core Hyperparameters


GRPO (RL):
python
GRPOConfig(
    num_generations=4,        # Completions per prompt (2-8)
    beta=0.04,                # KL penalty (0.01-0.1)
    learning_rate=5e-6,       # 10x smaller than SFT!
    max_completion_length=512,
    max_steps=300,            # Minimum for results
)
SFT:
python
TrainingArguments(
    learning_rate=2e-4,       # Standard SFT rate
    num_train_epochs=3,       # 2-4 typical
    per_device_train_batch_size=2,
)
</core_concepts>
<reward_functions>

Reward Function Design


Reward functions are the core of GRPO. Each function receives the batch of completions and returns a list of floats, one score per completion.

Pattern 1: Correctness (Primary Signal)


python
def correctness_reward(completions, answer, **kwargs):
    """
    +2.0 for correct answer, 0.0 otherwise.
    This should be your highest-weighted reward.
    """
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards
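A reward function like this is cheap to sanity-check offline before launching a run. The sketch below inlines a minimal `extract_answer` (matching the helper defined in the prompt-format section) so it runs standalone:

```python
import re

def extract_answer(text: str) -> str:
    # Minimal inline version of the XML-tag extraction helper
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completions, answer, **kwargs):
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards

# Exercise the three paths: numeric match, numeric mismatch, unparseable
completions = [
    "<reasoning>15 + 27 = 42</reasoning><answer>42</answer>",
    "<reasoning>guess</reasoning><answer>40</answer>",
    "no tags at all",
]
print(correctness_reward(completions, answer=["42", "42", "42"]))
# [2.0, 0.0, 0.0]
```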

Pattern 2: Format Compliance


python
def format_reward(completions, **kwargs):
    """
    +0.5 for proper XML structure with reasoning and answer tags.
    """
    rewards = []
    for completion in completions:
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
        if has_reasoning and has_answer:
            rewards.append(0.5)
        elif has_answer:
            rewards.append(0.2)
        else:
            rewards.append(0.0)
    return rewards

Pattern 3: Reasoning Quality


python
def reasoning_length_reward(completions, **kwargs):
    """
    +0.3 for substantive reasoning (30-200 words).
    """
    rewards = []
    for completion in completions:
        reasoning = extract_reasoning(completion)
        word_count = len(reasoning.split()) if reasoning else 0
        if 30 <= word_count <= 200:
            rewards.append(0.3)
        elif 15 <= word_count < 30:
            rewards.append(0.1)
        else:
            rewards.append(0.0)
    return rewards

Pattern 4: Negative Constraints


python
def no_hedging_reward(completions, **kwargs):
    """
    -0.3 penalty for uncertainty language.
    """
    hedging = ["i think", "maybe", "perhaps", "possibly", "i'm not sure"]
    rewards = []
    for completion in completions:
        has_hedging = any(phrase in completion.lower() for phrase in hedging)
        rewards.append(-0.3 if has_hedging else 0.0)
    return rewards

Typical Reward Stack


python
reward_funcs = [
    correctness_reward,      # +2.0 max (primary signal)
    format_reward,           # +0.5 max (structure)
    reasoning_length_reward, # +0.3 max (quality)
    no_hedging_reward,       # -0.3 max (constraint)
]

Total range: -0.3 to +2.8
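GRPOTrainer scores every completion with each function in `reward_funcs` and combines the results (by default, a sum). The aggregation can be mimicked locally to see how the stack produces one scalar per completion — a simplified sketch with two toy rewards standing in for the full stack, not TRL's internal code:

```python
# Toy stand-ins for the reward stack above
def format_reward(completions, **kwargs):
    return [0.5 if "<answer>" in c else 0.0 for c in completions]

def no_hedging_reward(completions, **kwargs):
    hedging = ["i think", "maybe", "perhaps"]
    return [-0.3 if any(p in c.lower() for p in hedging) else 0.0 for c in completions]

reward_funcs = [format_reward, no_hedging_reward]

def combined_reward(completions, **kwargs):
    # Element-wise sum across the stack: one scalar per completion
    per_func = [f(completions, **kwargs) for f in reward_funcs]
    return [round(sum(scores), 2) for scores in zip(*per_func)]

print(combined_reward(["Maybe <answer>42</answer>", "<answer>7</answer>"]))
# [0.2, 0.5]
```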



> **For domain-specific rewards:** See `reference/domain-examples.md` for Voice AI, Sales Agent, and Support patterns.
</reward_functions>

<prompt_format>


Prompt Structure


System Prompt with XML Tags


python
SYSTEM_PROMPT = """You are a helpful assistant that thinks step-by-step.

Always respond in this exact format:
<reasoning>
[Your step-by-step thinking process]
</reasoning>
<answer>
[Your final answer - just the number or short response]
</answer>
"""

Extraction Helpers


python
import re

def extract_answer(text: str) -> str:
    """Extract answer from XML tags"""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_reasoning(text: str) -> str:
    """Extract reasoning from XML tags"""
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    return match.group(1).strip() if match else ""
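A quick round-trip check that completions shaped like the system prompt parse cleanly with these helpers (self-contained; repeats the two regex helpers so it runs standalone — `re.DOTALL` is what lets the match span the newlines inside the tags):

```python
import re

def extract_answer(text: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_reasoning(text: str) -> str:
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

# A completion shaped exactly as the system prompt requests
sample = """<reasoning>
Add the tens (10 + 20 = 30), then the ones (5 + 7 = 12), so 42.
</reasoning>
<answer>
42
</answer>"""

print(extract_answer(sample))  # 42
print(extract_reasoning(sample).startswith("Add the tens"))  # True
```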

Dataset Format


GRPO (prompt-only):
python
dataset = dataset.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["question"]}
    ],
    "answer": ex["answer"]  # Ground truth for verification
})
SFT (full conversations):
python
dataset = dataset.map(lambda ex: {
    "conversations": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["input"]},
        {"role": "assistant", "content": ex["output"]}
    ]
})
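Both `.map` calls above assume a Hugging Face `datasets` object, but the transform itself is plain-dict logic and can be checked without loading anything (the field names `question`, `input`, and `output` are assumptions about your raw data):

```python
SYSTEM_PROMPT = "You are a helpful assistant that thinks step-by-step."

def to_grpo_example(ex):
    # Same shape the GRPO .map call produces: prompt messages + ground truth
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ex["question"]},
        ],
        "answer": ex["answer"],
    }

def to_sft_example(ex):
    # Same shape the SFT .map call produces: a full conversation
    return {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ex["input"]},
            {"role": "assistant", "content": ex["output"]},
        ]
    }

row = {"question": "15 + 27 = ?", "answer": "42",
       "input": "15 + 27 = ?", "output": "42"}
print(to_grpo_example(row)["answer"])             # 42
print(len(to_sft_example(row)["conversations"]))  # 3
```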
</prompt_format>
<model_export>

Save and Deploy


Save LoRA Only (~100MB)


python
model.save_lora("grpo_lora")

Merge and Save Full Model


python
model.save_pretrained_merged(
    "grpo_merged", tokenizer,
    save_method="merged_16bit",
)

Export to GGUF for Ollama


python
model.save_pretrained_gguf(
    "grpo_gguf", tokenizer,
    quantization_method="q4_k_m",  # Options: q4_k_m, q8_0, q5_k_m
)

Test with Ollama

bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./grpo_gguf/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}

User: {{ .Prompt }}

Assistant: """
PARAMETER temperature 0.7
EOF

ollama create my-model -f Modelfile
ollama run my-model "Solve: 15 + 27 = ?"
</model_export>

<routing>

Request Routing

GRPO training: → GRPOConfig, reward functions, dataset prep → Reference: reference/grpo/basic_grpo.py
SFT training: → SFTTrainer, dataset formatting → Reference: reference/sft/sales_extractor_training.py
Reward function design: → 4 patterns (correctness, format, quality, constraints) → Reference: reference/reward-design.md, reference/domain-examples.md
FP8 training: → 60% VRAM savings, env vars, pre-quantized models → Reference: reference/fp8-training.md
Docker setup: → Official image, volumes, Jupyter/SSH → Reference: reference/deployment.md
Vision fine-tuning: → FastVisionModel, VLM data format → Reference: reference/vision-training.md
Mobile deployment: → QAT, ExecuTorch, iOS/Android → Reference: reference/mobile-deployment.md
Long context / packing: → 500K context, 2-5x speedup → Reference: reference/advanced-training.md
Export formats: → GGUF methods, Ollama, vLLM, SGLang → Reference: reference/export-formats.md
Training issues: → Reference: reference/troubleshooting.md
</routing>
<troubleshooting_quick>

Quick Troubleshooting

| Symptom                | Fix                                                  |
|------------------------|------------------------------------------------------|
| Reward not increasing  | Wait 300+ steps, then increase learning_rate 2x      |
| Reward spiky/unstable  | Decrease learning_rate 0.5x, increase beta           |
| Model outputs garbage  | Increase beta 2-4x, check prompt format              |
| Out of memory          | Reduce max_completion_length, set num_generations=2  |
| No reasoning appearing | Train 500+ steps, use a model >= 1.5B                |

For detailed troubleshooting: See reference/troubleshooting.md
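The out-of-memory row maps onto concrete GRPOConfig knobs. A sketch with illustrative values (requires trl; `gradient_accumulation_steps` is added here to preserve the effective batch size, an assumption rather than a fix stated by this skill):

```python
from trl import GRPOConfig

low_mem_args = GRPOConfig(
    num_generations=2,              # fewer sampled completions per prompt
    max_completion_length=256,      # shorter generations
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # keep the effective batch size
    learning_rate=5e-6,
)
```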
</troubleshooting_quick>
<training_checklist>

Pre-Training Checklist

GRPO:
  • Model loads without OOM
  • LoRA configured with use_gradient_checkpointing="unsloth"
  • Dataset has prompt and answer fields
  • At least one reward function defined and tested
  • num_generations >= 2
  • beta set (0.01-0.1, start at 0.04)
  • learning_rate set (1e-6 to 1e-5)
  • At least 300 steps planned
SFT:
  • Model loads without OOM
  • Dataset has conversations or text field
  • Chat template applied correctly
  • learning_rate ~2e-4
  • 2-4 epochs planned </training_checklist>
<cost_estimates>

Cost Estimates


Training Data Generation:
  • 50 seeds x 10 variations x ~1000 tokens = ~500K tokens
  • Claude Sonnet: ~$1.50
  • GPT-4o-mini: ~$0.30
Training Compute:
  • Google Colab T4: Free (30-60 min for 1000 examples)
  • RunPod A40 (48GB): ~$0.79/hr
Total per experiment: ~$0.50-$1.00 on RunPod </cost_estimates>
<example_session>

Example Session


User: "I want to train a model that gives concise voice responses"
Claude:
  1. First, let's set up the model:
python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024, load_in_4bit=True,
)
  2. For voice AI, you need brevity rewards. Here's a custom stack:
python
def brevity_reward(completions, **kwargs):
    """Voice responses under 50 words get +0.5"""
    return [0.5 if len(c.split()) <= 50 else -0.3 for c in completions]

def speakable_reward(completions, **kwargs):
    """Penalize markdown that sounds bad spoken"""
    bad = ["**", "- ", "```", "http"]
    return [-0.2 * sum(1 for b in bad if b in c) for c in completions]

reward_funcs = [correctness_reward, brevity_reward, speakable_reward]
  3. See reference/domain-examples.md for complete Voice AI reward patterns. </example_session>