ai-fine-tuning
Fine-Tune Models on Your Data
Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.
Should you fine-tune?
Before writing any code, walk through these questions with the user:
- Have you optimized prompts first? If not, use /ai-improving-accuracy — prompt optimization is 10x cheaper and often sufficient.
- Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
- Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
- What's the goal — quality or cost?
  - Quality: You've maxed out prompt optimization and need more accuracy
  - Cost: You want a small cheap model to match an expensive one
When to fine-tune
- You've already optimized prompts with MIPROv2 and hit a ceiling
- You have 500+ labeled examples (1000+ is better)
- Your baseline is >50% and you need to push higher
- You want to distill an expensive model into a cheaper one (10-50x cost savings)
- Your domain has specialized vocabulary or patterns the base model doesn't know
- You need faster inference (smaller fine-tuned models are faster)
When NOT to fine-tune
- You haven't tried prompt optimization yet — start with /ai-improving-accuracy
- You have fewer than 500 examples — need more data? Use /ai-generating-data to bootstrap synthetic examples, or use BootstrapFewShot or MIPROv2 instead
- Your baseline is below 50% — your data or task definition needs work
- You're still iterating on what the task is — fine-tuning locks you in
- You don't have a clear metric — you can't evaluate fine-tuning without one
- Your use case changes frequently — fine-tuned models don't adapt to new instructions easily
Prerequisites checklist
Before starting, confirm:
- Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
- Baseline: Prompt-optimized program with measured accuracy (use /ai-improving-accuracy)
- Metric: Clear, automated metric that scores predictions
- Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
- Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU
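The budget line can be sanity-checked with simple arithmetic: total trained tokens times the per-1K-token price. A minimal sketch (the epoch count and example sizes below are illustrative assumptions, not quotes):

```python
def estimate_finetune_cost(num_examples, avg_tokens_per_example,
                           epochs=3, price_per_1k_tokens=0.008):
    """Rough fine-tuning cost: total trained tokens x price per 1K tokens."""
    total_tokens = num_examples * avg_tokens_per_example * epochs
    return total_tokens / 1000 * price_per_1k_tokens

# 1,000 examples averaging 500 tokens each, 3 epochs, at ~$0.008/1K tokens
print(f"~${estimate_finetune_cost(1000, 500):.2f}")  # ~$12.00
```

Check the current pricing page before committing; rates and epoch defaults change.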
Step 1: Prepare your data and baseline
Build a strong baseline first
Always compare fine-tuning against a prompt-optimized baseline:
```python
import dspy

lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm)
```

Define your program
```python
class Classify(dspy.Signature):
    """Classify the support ticket."""

    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)
```
Prepare data
```python
import json

with open("labeled_data.json") as f:
    data = json.load(f)

examples = [
    dspy.Example(text=x["text"], category=x["category"]).with_inputs("text")
    for x in data
]
```
```python
# Split: 80% train, 10% dev, 10% test
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]
```
Measure baseline
```python
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")
```
Optimize prompts first (your comparison point)

```python
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")
```

If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.
Step 2: BootstrapFinetune (core fine-tuning)
The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.
```python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)
```

Evaluate the fine-tuned model
```python
finetuned_score = evaluator(finetuned)
print(f"Baseline: {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned: {finetuned_score:.1f}%")
```

How it works
- Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
- Filter by metric: Only successful traces become training data
- Fine-tune weights: Sends traces to the model provider's fine-tuning API
- Return optimized program: The program now uses the fine-tuned model
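The bootstrap-and-filter steps can be sketched in a few lines of plain Python. This is an illustration of the idea only, not DSPy's internal code; the toy `program`, `metric`, and dict-based examples below are stand-ins:

```python
def bootstrap_traces(program, trainset, metric):
    """Run the program on each example; keep only traces the metric accepts."""
    kept = []
    for example in trainset:
        prediction = program(example)           # run the current program
        if metric(example, prediction):         # filter by the task metric
            kept.append((example, prediction))  # successful trace -> training data
    return kept

# Toy stand-ins: an "uppercase" task with a trivial program and exact-match metric
program = lambda ex: ex["text"].upper()
metric = lambda ex, pred: pred == ex["label"]
trainset = [
    {"text": "ok", "label": "OK"},
    {"text": "bad", "label": "nope"},  # metric fails -> trace discarded
]
traces = bootstrap_traces(program, trainset, metric)
print(len(traces))  # 1
```

The filtered traces are what gets sent to the provider's fine-tuning API, which is why a reliable metric matters so much here.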
Requirements
- A fine-tunable model (OpenAI gpt-4o-mini, gpt-4o; or local open-source models)
- 500+ training examples (more traces bootstrapped = better fine-tuning)
- A metric that reliably identifies good outputs
Step 3: Model distillation (expensive to cheap)
Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.
Teacher-student pattern
```python
# Step 1: Teacher — expensive model, high quality
teacher_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=teacher_lm)

# Build and optimize the teacher
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)
teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

# Step 2: Student — fine-tune cheap model on teacher's outputs
student_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)
student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")
```
Typical results
| Model | Quality | Cost per 1M tokens |
|---|---|---|
| GPT-4o (teacher) | 85% | ~$5.00 |
| GPT-4o-mini (no tuning) | 70% | ~$0.15 |
| GPT-4o-mini (fine-tuned) | 81% | ~$0.15 |
The fine-tuned student costs 33x less and retains ~95% of teacher quality.
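The summary figures follow from the table with simple arithmetic:

```python
teacher_cost, student_cost = 5.00, 0.15    # $ per 1M tokens, from the table above
teacher_quality, student_quality = 85, 81  # accuracy (%)

cost_reduction = teacher_cost / student_cost                # ~33x
quality_retained = 100 * student_quality / teacher_quality  # ~95%
print(f"{cost_reduction:.0f}x cheaper, {quality_retained:.0f}% of teacher quality")
```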
Step 4: BetterTogether (maximum quality)
BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.
```python
optimizer = dspy.BetterTogether(
    metric=metric,
    prompt_optimizer=dspy.MIPROv2,
    weight_optimizer=dspy.BootstrapFinetune,
)
best = optimizer.compile(program, trainset=trainset)
best_score = evaluator(best)
print(f"Prompt-only: {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")
```

How it works
- Round 1: Optimize prompts (instructions + few-shot examples)
- Round 2: Fine-tune weights using the optimized prompts
- Round 3: Re-optimize prompts for the fine-tuned model
- Each round builds on the previous, creating synergy between prompt and weight optimization
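Conceptually, the alternation is just a loop that hands the current program back and forth between the two optimizers. This is an illustrative sketch of the schedule, not DSPy's actual BetterTogether implementation; `optimize_prompts` and `finetune_weights` are hypothetical stand-ins:

```python
def better_together(program, trainset, optimize_prompts, finetune_weights, rounds=3):
    """Alternate prompt and weight optimization; each round starts from the last."""
    for i in range(rounds):
        if i % 2 == 0:
            program = optimize_prompts(program, trainset)   # rounds 1, 3, ...
        else:
            program = finetune_weights(program, trainset)   # round 2, ...
    return program

# Toy stand-ins that just log the schedule and pass the program through
log = []
p = better_together(
    "base",
    trainset=None,
    optimize_prompts=lambda prog, ts: log.append("prompts") or prog,
    finetune_weights=lambda prog, ts: log.append("weights") or prog,
)
print(log)  # ['prompts', 'weights', 'prompts']
```

The synergy comes from each stage optimizing against the artifact the previous stage produced, rather than against the original base model.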
When to use BetterTogether
- You want the absolute best quality and have the compute budget
- Fine-tuning alone didn't close the gap to your quality target
- You have 500+ examples and a reliable metric
Step 5: Evaluate and deploy
Thorough evaluation
Always evaluate on the held-out test set (not dev set):
```python
test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)
print("Test set results:")
print(f"  Baseline: {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned: {test_evaluator(finetuned):.1f}%")
```

Save and load for production
```python
# Save
finetuned.save("finetuned_program.json")

# Load later
from my_module import MyProgram

production = MyProgram()
production.load("finetuned_program.json")
result = production(text="New support ticket...")
```
When fine-tuning goes wrong
Can't bootstrap enough traces
If the base model fails on most training examples, there aren't enough successful traces to fine-tune on.
Fixes:
- Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
- Relax your metric during bootstrapping (accept partial credit)
- Simplify your task (break multi-step into single steps)
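The second fix can use the `trace` argument DSPy passes to metrics: it is `None` during plain evaluation but populated during bootstrapping, so one metric can stay strict for scoring while accepting partial credit when collecting traces. A sketch (the containment check is an illustrative relaxation; tune it to your task):

```python
from types import SimpleNamespace

def relaxed_metric(example, prediction, trace=None):
    """Strict exact match for evaluation; partial credit during bootstrapping."""
    gold = example.category.lower()
    pred = prediction.category.lower()
    if trace is not None:  # bootstrapping: accept predictions containing the gold label
        return gold in pred
    return pred == gold    # evaluation: exact match only

# Stand-in objects in place of dspy.Example / dspy.Prediction
ex = SimpleNamespace(category="billing")
loose = SimpleNamespace(category="Billing question")
print(relaxed_metric(ex, loose))            # False — strict evaluation
print(relaxed_metric(ex, loose, trace=[]))  # True — relaxed during bootstrapping
```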
Model overfits (high train accuracy, low test accuracy)
Fixes:
- Add more training data
- Reduce fine-tuning epochs (if provider allows)
- Use a larger base model (less prone to overfitting)
- Simplify your output format
Fine-tuning didn't improve over prompt optimization
Fixes:
- Check that bootstrapping produced enough successful traces (need 200+)
- Try BetterTogether instead of BootstrapFinetune alone
- Verify your metric actually correlates with quality
- Try a different base model
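The first check can be run before paying for a fine-tune: count how many training examples the bootstrap model already handles. A minimal sketch with toy stand-ins for the program, metric, and data:

```python
def count_successful_traces(program, trainset, metric, minimum=200):
    """Count training examples whose program output passes the metric."""
    successes = sum(1 for ex in trainset if metric(ex, program(ex)))
    return successes, successes >= minimum

# Toy stand-ins: a "program" that only solves the easy half of the data
program = lambda ex: ex["label"] if ex["easy"] else "wrong"
metric = lambda ex, pred: pred == ex["label"]
trainset = [{"label": "a", "easy": i % 2 == 0} for i in range(500)]

n, enough = count_successful_traces(program, trainset, metric)
print(n, enough)  # 250 True
```

If the count falls below ~200, switch to a stronger bootstrap model or relax the metric before fine-tuning.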
Infrastructure choices
OpenAI API (easiest)
Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:

```python
lm = dspy.LM("openai/gpt-4o-mini")  # fine-tunable via API
```

- Pros: No GPU needed, simple setup, fast
- Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices
Local fine-tuning (own your model)
For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA:
```python
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
```

- Pros: Data stays private, no per-token costs after training, full control
- Cons: Needs GPU(s), more setup, slower iteration
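For the LoRA/QLoRA route, one common option is Hugging Face's `peft` library. A hedged config sketch only (the library choice, target modules, and hyperparameters here are illustrative assumptions, not recommendations from this guide):

```python
from peft import LoraConfig

# Illustrative LoRA settings; tune per model and task
lora = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
```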
Cloud GPU platforms
AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:
- Pros: Scalable, no hardware to manage
- Cons: Costs vary, setup per platform
Additional resources
- For worked examples (classification, distillation, BetterTogether), see examples.md
- Use /ai-improving-accuracy to build a strong baseline before fine-tuning
- Use /ai-cutting-costs for other cost reduction strategies beyond distillation
- Use /ai-fixing-errors if fine-tuning or evaluation errors occur