unsloth-training
<objective>
Guide LLM fine-tuning using Unsloth:
- GRPO - RL with reward functions (no labeled outputs needed)
- SFT - Supervised fine-tuning with input/output pairs
- Vision - VLM fine-tuning (Qwen3-VL, Gemma3, Llama 3.2 Vision)
Key capabilities:
- FP8 Training - 60% less VRAM, 1.4x faster (RTX 40+, H100)
- 3x Packing - Automatic 2-5x speedup for mixed-length data
- Docker - Official image unsloth/unsloth
- Mobile - QAT → ExecuTorch → iOS/Android (~40 tok/s)
- Export - GGUF, Ollama, vLLM, LM Studio, SGLang
</objective>
<quick_start>
GRPO with FP8 (60% less VRAM):
```python
import os
os.environ['UNSLOTH_VLLM_STANDBY'] = "1"  # Shared memory
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048, load_in_fp8=True, fast_inference=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def correctness_reward(completions, answer, **kwargs):
    return [2.0 if extract_answer(c) == a else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(num_generations=4, beta=0.04, learning_rate=5e-6),
    train_dataset=dataset, reward_funcs=[correctness_reward],
)
trainer.train()
```

SFT with Packing (2-5x faster):
```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model, train_dataset=dataset, processing_class=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2, num_train_epochs=3,
        learning_rate=2e-4, packing=True,  # 2-5x speedup
    ),
)
trainer.train()
```
</quick_start>
<success_criteria>
A training run is successful when:
- Model loads without OOM errors
- Reward (GRPO) or loss (SFT) shows improvement trend
- Generated outputs match expected format
- Model exported to desired format (LoRA, merged, GGUF)
- Test inference produces reasonable outputs
</success_criteria>
<activation_triggers>
Explicit triggers:
- /unsloth grpo - GRPO (RL) training
- /unsloth sft - SFT training
- /unsloth fp8 - FP8 training setup
- /unsloth vision - VLM fine-tuning
- /unsloth mobile - Phone deployment (QAT)
- /unsloth docker - Docker container setup
- /unsloth troubleshoot - Debug issues
Natural language:
- "train with GRPO", "fine-tune", "reward functions"
- "FP8 training", "fp8", "less VRAM"
- "vision fine-tuning", "VLM", "image training"
- "phone deployment", "mobile LLM", "ExecuTorch"
- "docker training", "container", "unsloth docker"
- "packing", "faster training", "500k context"
- "export GGUF", "Ollama", "vLLM", "SGLang" </activation_triggers>
<file_locations>
Core references:
- reference/reward-design.md - Reward function patterns
- reference/domain-examples.md - Voice AI, Sales Agent examples
- reference/hyperparameters.md - GRPOConfig reference
- reference/troubleshooting.md - Common fixes
New feature references:
- reference/fp8-training.md - FP8 setup, VRAM savings
- reference/deployment.md - Docker, vLLM, LoRA hot-swap, SGLang
- reference/export-formats.md - GGUF, Ollama, LM Studio, Dynamic 2.0
- reference/advanced-training.md - 500K context, packing, checkpoints
- reference/vision-training.md - VLM fine-tuning
- reference/mobile-deployment.md - QAT, ExecuTorch, iOS/Android
Code examples: reference/grpo/, reference/sft/
</file_locations>
<core_concepts>
When to Use GRPO vs SFT
| Method | Use When | Data Needed |
|---|---|---|
| GRPO | Improving reasoning quality | Prompts + verifiable answers |
| GRPO | Aligning behavior with preferences | Reward functions |
| GRPO | When you can verify correctness | Verifiable outputs |
| SFT | Teaching specific output format | Input/output pairs |
| SFT | Following new instructions | Conversation examples |
| SFT | Learning domain knowledge | Labeled examples |
Model Selection
| Size | VRAM | Use Case |
|---|---|---|
| 0.5B | 5GB | Mobile deployment (~200MB GGUF) |
| 1.5B | 5GB | Learning/prototyping |
| 3B | 8GB | Good balance (recommended start) |
| 7B | 16GB | Production quality |
| 14B | 20GB | Strong reasoning |
Core Hyperparameters
GRPO (RL):
```python
GRPOConfig(
    num_generations=4,          # Completions per prompt (2-8)
    beta=0.04,                  # KL penalty (0.01-0.1)
    learning_rate=5e-6,         # 10x smaller than SFT!
    max_completion_length=512,
    max_steps=300,              # Minimum for results
)
```

SFT:
```python
TrainingArguments(
    learning_rate=2e-4,             # Standard SFT rate
    num_train_epochs=3,             # 2-4 typical
    per_device_train_batch_size=2,
)
```
</core_concepts>
<reward_functions>
Reward Function Design
Reward functions are the core of GRPO. Each one returns a list of floats, one score per completion.
Pattern 1: Correctness (Primary Signal)
```python
def correctness_reward(completions, answer, **kwargs):
    """
    +2.0 for a correct answer, 0.0 otherwise.
    This should be your highest-weighted reward.
    """
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            reward = 2.0 if abs(pred - true) < 0.01 else 0.0
        except ValueError:
            reward = 2.0 if extracted.strip() == str(true_answer).strip() else 0.0
        rewards.append(reward)
    return rewards
```
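Before wiring a reward into the trainer, it is worth smoke-testing it on stub data. This is a self-contained sketch: it re-declares `correctness_reward` together with a minimal stand-in for `extract_answer` (the full helper is defined in the prompt_format section), so the exact inputs and outputs here are illustrative:

```python
import re

def extract_answer(text: str) -> str:
    # Minimal stand-in for the helper defined in the prompt_format section
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completions, answer, **kwargs):
    rewards = []
    for completion, true_answer in zip(completions, answer):
        extracted = extract_answer(completion)
        try:
            pred = float(extracted.replace(",", "").strip())
            true = float(true_answer.replace(",", "").strip())
            rewards.append(2.0 if abs(pred - true) < 0.01 else 0.0)
        except ValueError:
            rewards.append(2.0 if extracted.strip() == str(true_answer).strip() else 0.0)
    return rewards

completions = ["<answer>42</answer>", "<answer>41</answer>", "no tags here"]
print(correctness_reward(completions, ["42", "42", "42"]))  # → [2.0, 0.0, 0.0]
```

Note that a completion with no tags falls through the `float("")` ValueError and scores 0.0 rather than crashing, which matters mid-training.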
Pattern 2: Format Compliance
```python
import re

def format_reward(completions, **kwargs):
    """
    +0.5 for proper XML structure with reasoning and answer tags.
    """
    rewards = []
    for completion in completions:
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
        if has_reasoning and has_answer:
            rewards.append(0.5)
        elif has_answer:
            rewards.append(0.2)
        else:
            rewards.append(0.0)
    return rewards
```

Pattern 3: Reasoning Quality
```python
def reasoning_length_reward(completions, **kwargs):
    """
    +0.3 for substantive reasoning (30-200 words).
    """
    rewards = []
    for completion in completions:
        reasoning = extract_reasoning(completion)
        word_count = len(reasoning.split()) if reasoning else 0
        if 30 <= word_count <= 200:
            rewards.append(0.3)
        elif 15 <= word_count < 30:
            rewards.append(0.1)
        else:
            rewards.append(0.0)
    return rewards
```

Pattern 4: Negative Constraints
```python
def no_hedging_reward(completions, **kwargs):
    """
    -0.3 penalty for uncertainty language.
    """
    hedging = ["i think", "maybe", "perhaps", "possibly", "i'm not sure"]
    rewards = []
    for completion in completions:
        has_hedging = any(phrase in completion.lower() for phrase in hedging)
        rewards.append(-0.3 if has_hedging else 0.0)
    return rewards
```

Typical Reward Stack
```python
reward_funcs = [
    correctness_reward,       # +2.0 max (primary signal)
    format_reward,            # +0.5 max (structure)
    reasoning_length_reward,  # +0.3 max (quality)
    no_hedging_reward,        # -0.3 max (constraint)
]
```
Total range: -0.3 to +2.8
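The stacked range can be sanity-checked offline before training. A minimal self-contained sketch using the two rewards that need no external helpers (format and hedging), run on a hypothetical completion:

```python
import re

def format_reward(completions, **kwargs):
    rewards = []
    for c in completions:
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", c, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", c, re.DOTALL))
        rewards.append(0.5 if has_reasoning and has_answer else 0.2 if has_answer else 0.0)
    return rewards

def no_hedging_reward(completions, **kwargs):
    hedging = ["i think", "maybe", "perhaps", "possibly", "i'm not sure"]
    return [-0.3 if any(p in c.lower() for p in hedging) else 0.0 for c in completions]

sample = "<reasoning>Add the tens, then the ones.</reasoning>\n<answer>42</answer>"
total = sum(fn([sample])[0] for fn in (format_reward, no_hedging_reward))
print(total)  # → 0.5 (well-formed structure, no hedging penalty)
```

GRPO normalizes rewards within each group of generations, but keeping the summed range small and bounded like this makes the individual signals easier to debug.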
> **For domain-specific rewards:** See `reference/domain-examples.md` for Voice AI, Sales Agent, and Support patterns.
</reward_functions>
<prompt_format>
Prompt Structure
System Prompt with XML Tags
```python
SYSTEM_PROMPT = """You are a helpful assistant that thinks step-by-step.
Always respond in this exact format:
<reasoning>
[Your step-by-step thinking process]
</reasoning>
<answer>
[Your final answer - just the number or short response]
</answer>
"""
```

Extraction Helpers
```python
import re

def extract_answer(text: str) -> str:
    """Extract answer from XML tags"""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_reasoning(text: str) -> str:
    """Extract reasoning from XML tags"""
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    return match.group(1).strip() if match else ""
```
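These helpers fail soft: malformed output yields an empty string rather than an exception, so downstream reward functions keep running. A quick self-contained check (re-declaring `extract_answer` so the snippet runs alone, on a made-up reply in the SYSTEM_PROMPT format):

```python
import re

def extract_answer(text: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

reply = "<reasoning>\n15 + 27: add the tens, then the ones.\n</reasoning>\n<answer>\n42\n</answer>"
print(extract_answer(reply))      # → 42
print(extract_answer("no tags"))  # → "" (empty string, never an exception)
```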
Dataset Format
GRPO (prompt-only):
```python
dataset = dataset.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["question"]}
    ],
    "answer": ex["answer"]  # Ground truth for verification
})
```

SFT (full conversations):
```python
dataset = dataset.map(lambda ex: {
    "conversations": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["input"]},
        {"role": "assistant", "content": ex["output"]}
    ]
})
```
</prompt_format>
<model_export>
Save and Deploy
Save LoRA Only (~100MB)
```python
model.save_lora("grpo_lora")
```

Merge and Save Full Model
```python
model.save_pretrained_merged(
    "grpo_merged", tokenizer,
    save_method="merged_16bit",
)
```

Export to GGUF for Ollama
```python
model.save_pretrained_gguf(
    "grpo_gguf", tokenizer,
    quantization_method="q4_k_m",  # Options: q4_k_m, q8_0, q5_k_m
)
```

Test with Ollama
```bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./grpo_gguf/unsloth.Q4_K_M.gguf
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
PARAMETER temperature 0.7
EOF
ollama create my-model -f Modelfile
ollama run my-model "Solve: 15 + 27 = ?"
```
</model_export>
<routing>
Request Routing
GRPO training: → GRPOConfig, reward functions, dataset prep
→ Reference: reference/grpo/basic_grpo.py
SFT training: → SFTTrainer, dataset formatting
→ Reference: reference/sft/sales_extractor_training.py
Reward function design: → 4 patterns (correctness, format, quality, constraints)
→ Reference: reference/reward-design.md, reference/domain-examples.md
FP8 training: → 60% VRAM savings, env vars, pre-quantized models
→ Reference: reference/fp8-training.md
Docker setup: → Official image, volumes, Jupyter/SSH
→ Reference: reference/deployment.md
Vision fine-tuning: → FastVisionModel, VLM data format
→ Reference: reference/vision-training.md
Mobile deployment: → QAT, ExecuTorch, iOS/Android
→ Reference: reference/mobile-deployment.md
Long context / packing: → 500K context, 2-5x speedup
→ Reference: reference/advanced-training.md
Export formats: → GGUF methods, Ollama, vLLM, SGLang
→ Reference: reference/export-formats.md
Training issues: → reference/troubleshooting.md
</routing>
<troubleshooting_quick>
Quick Troubleshooting
| Symptom | Fix |
|---|---|
| Reward not increasing | Wait 300+ steps, then increase learning_rate 2x |
| Reward spiky/unstable | Decrease learning_rate 0.5x, increase beta |
| Model outputs garbage | Increase beta 2-4x, check prompt format |
| Out of memory | Reduce max_completion_length, num_generations=2 |
| No reasoning appearing | Train 500+ steps, use model >= 1.5B |
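For the out-of-memory row, the fixes translate into concrete config knobs. The values below are illustrative starting points (an assumption, not verified defaults) that would be passed into `GRPOConfig(**overrides)`:

```python
# Illustrative low-memory overrides (hypothetical starting values)
overrides = {
    "num_generations": 2,              # fewer completions per prompt
    "max_completion_length": 256,      # shorter rollouts
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,  # preserve effective batch size
}
effective_batch = overrides["per_device_train_batch_size"] * overrides["gradient_accumulation_steps"]
print(effective_batch)  # → 4, same effective batch as 4 x 1 without accumulation
```

Raising gradient_accumulation_steps as you lower the batch size keeps the optimizer seeing a comparable number of examples per step, so memory drops without changing training dynamics much.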
For detailed troubleshooting: See reference/troubleshooting.md
</troubleshooting_quick>
<training_checklist>
Pre-Training Checklist
GRPO:
- Model loads without OOM
- LoRA configured with use_gradient_checkpointing="unsloth"
- Dataset has prompt and answer fields
- At least one reward function defined and tested
- num_generations >= 2
- beta set (0.01-0.1, start at 0.04)
- learning_rate set (1e-6 to 1e-5)
- At least 300 steps planned
SFT:
- Model loads without OOM
- Dataset has conversations or text field
- Chat template applied correctly
- learning_rate ~2e-4
- 2-4 epochs planned
</training_checklist>
<cost_estimates>
Cost Estimates
Training Data Generation:
- 50 seeds x 10 variations x ~1000 tokens = ~500K tokens
- Claude Sonnet: ~$1.50
- GPT-4o-mini: ~$0.30
Training Compute:
- Google Colab T4: Free (30-60 min for 1000 examples)
- RunPod A40 (48GB): ~$0.79/hr
Total per experiment: ~$0.50-$1.00 on RunPod
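The token arithmetic above, spelled out; the per-million-token rates are back-calculated from this section's dollar figures (an assumption, and API prices drift):

```python
seeds, variations, tokens_each = 50, 10, 1000
total_tokens = seeds * variations * tokens_each  # 500,000 tokens

# Implied $/1M-token rates, derived from the estimates above (assumption)
rates = {"Claude Sonnet": 3.00, "GPT-4o-mini": 0.60}
for name, usd_per_million in rates.items():
    cost = total_tokens / 1_000_000 * usd_per_million
    print(f"{name}: ${cost:.2f}")  # → Claude Sonnet: $1.50, GPT-4o-mini: $0.30
```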
</cost_estimates>
<example_session>
Example Session
User: "I want to train a model that gives concise voice responses"
Claude:
- First, let's set up the model:
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024, load_in_4bit=True,
)
```
- For voice AI, you need brevity rewards. Here's a custom stack:
````python
def brevity_reward(completions, **kwargs):
    """Voice responses under 50 words get +0.5"""
    return [0.5 if len(c.split()) <= 50 else -0.3 for c in completions]

def speakable_reward(completions, **kwargs):
    """Penalize markdown that sounds bad spoken"""
    bad = ["**", "- ", "```", "http"]
    return [-0.2 * sum(1 for b in bad if b in c) for c in completions]

reward_funcs = [correctness_reward, brevity_reward, speakable_reward]
````
- See reference/domain-examples.md for complete Voice AI reward patterns.
</example_session>