
Fine-Tune Models on Your Data

Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.

Should you fine-tune?

Before writing any code, walk through these questions with the user:
  1. Have you optimized prompts first? If not, use
    /ai-improving-accuracy
    — prompt optimization is 10x cheaper and often sufficient.
  2. Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
  3. Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
  4. What's the goal — quality or cost?
    • Quality: You've maxed out prompt optimization and need more accuracy
    • Cost: You want a small cheap model to match an expensive one
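The checklist above can be distilled into a small helper you can run against your own numbers. This is an illustrative sketch: the 500-example and 50% thresholds come from this guide, not from anything built into DSPy.

```python
# Hypothetical helper distilling the go/no-go checklist above.
# Thresholds (500 examples, 50% baseline) are this guide's heuristics.
def should_finetune(prompts_optimized: bool, n_examples: int, baseline_accuracy: float) -> str:
    if not prompts_optimized:
        return "No: optimize prompts first (roughly 10x cheaper)"
    if n_examples < 500:
        return "No: collect or synthesize more labeled data"
    if baseline_accuracy <= 0.50:
        return "No: fix the task definition or data first"
    return "Yes: fine-tuning is worth trying"

print(should_finetune(True, 1200, 0.72))  # Yes: fine-tuning is worth trying
```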

When to fine-tune

  • You've already optimized prompts with MIPROv2 and hit a ceiling
  • You have 500+ labeled examples (1000+ is better)
  • Your baseline is >50% and you need to push higher
  • You want to distill an expensive model into a cheaper one (10-50x cost savings)
  • Your domain has specialized vocabulary or patterns the base model doesn't know
  • You need faster inference (smaller fine-tuned models are faster)

When NOT to fine-tune

  • You haven't tried prompt optimization yet — start with
    /ai-improving-accuracy
  • You have fewer than 500 examples — need more data? Use
    /ai-generating-data
    to bootstrap synthetic examples, or use BootstrapFewShot or MIPROv2 instead
  • Your baseline is below 50% — your data or task definition needs work
  • You're still iterating on what the task is — fine-tuning locks you in
  • You don't have a clear metric — you can't evaluate fine-tuning without one
  • Your use case changes frequently — fine-tuned models don't adapt to new instructions easily

Prerequisites checklist

Before starting, confirm:
  • Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
  • Baseline: Prompt-optimized program with measured accuracy (use
    /ai-improving-accuracy
    )
  • Metric: Clear, automated metric that scores predictions
  • Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
  • Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU
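To sanity-check the budget item, a back-of-envelope estimate using the ~$0.008/1K-token figure quoted above (the 3-epoch default is an assumption; verify both against your provider's current pricing):

```python
# Rough fine-tuning cost estimate. price_per_1k_tokens is the ~$0.008
# figure this guide quotes for GPT-4o-mini; epochs=3 is an assumed default.
def estimate_finetune_cost(n_examples, avg_tokens_per_example, epochs=3,
                           price_per_1k_tokens=0.008):
    total_tokens = n_examples * avg_tokens_per_example * epochs
    return total_tokens / 1000 * price_per_1k_tokens

cost = estimate_finetune_cost(n_examples=1000, avg_tokens_per_example=500)
print(f"~${cost:.2f}")  # ~$12.00
```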

Step 1: Prepare your data and baseline

Build a strong baseline first

Always compare fine-tuning against a prompt-optimized baseline:
python
import dspy

lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm)

Define your program

python
class Classify(dspy.Signature):
    """Classify the support ticket."""

    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)
class Classify(dspy.Signature): """对支持工单进行分类。""" text: str = dspy.InputField() category: str = dspy.OutputField()
program = dspy.ChainOfThought(Classify)

Prepare data

python
import json

with open("labeled_data.json") as f:
    data = json.load(f)

examples = [
    dspy.Example(text=x["text"], category=x["category"]).with_inputs("text")
    for x in data
]

Split: 80% train, 10% dev, 10% test

python
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]

Measure baseline

python
from dspy.evaluate import Evaluate

def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")

Optimize prompts first (your comparison point)

python
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")
If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.

Step 2: BootstrapFinetune (core fine-tuning)

The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.
python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

Evaluate the fine-tuned model

python
finetuned_score = evaluator(finetuned)
print(f"Baseline:         {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned:       {finetuned_score:.1f}%")

How it works

  1. Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
  2. Filter by metric: Only successful traces become training data
  3. Fine-tune weights: Sends traces to the model provider's fine-tuning API
  4. Return optimized program: The program now uses the fine-tuned model
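The bootstrap-and-filter steps above can be sketched in plain Python. This is a simplified illustration with dict stand-ins, not the real BootstrapFinetune internals (which operate on dspy.Example objects and captured reasoning traces):

```python
# Simplified sketch of the bootstrap-and-filter step: run the program on
# every training example and keep only traces the metric accepts.
def bootstrap_traces(program, trainset, metric):
    traces = []
    for example in trainset:
        prediction = program(example["text"])
        if metric(example, prediction):
            traces.append({"text": example["text"], "category": prediction})
    return traces

# Toy program and metric, for demonstration only
def toy_program(text):
    return "billing" if "invoice" in text else "other"

def toy_metric(example, prediction):
    return prediction == example["category"]

trainset = [
    {"text": "Wrong invoice amount", "category": "billing"},  # passes the metric
    {"text": "App crashes on login", "category": "login"},    # fails, filtered out
]
traces = bootstrap_traces(toy_program, trainset, toy_metric)
print(len(traces))  # 1
```

Only the surviving traces become fine-tuning data, which is why a reliable metric matters so much here.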

Requirements

  • A fine-tunable model (OpenAI gpt-4o-mini or gpt-4o; or a local open-source model)
  • 500+ training examples (more traces bootstrapped = better fine-tuning)
  • A metric that reliably identifies good outputs

Step 3: Model distillation (expensive to cheap)

Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.

Teacher-student pattern


Step 1: Teacher — expensive model, high quality

python
teacher_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=teacher_lm)

Build and optimize the teacher

python
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)

teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

Step 2: Student — fine-tune cheap model on teacher's outputs

python
student_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)

student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")

Typical results

Model                       Quality   Cost per 1M tokens
GPT-4o (teacher)            85%       ~$5.00
GPT-4o-mini (no tuning)     70%       ~$0.15
GPT-4o-mini (fine-tuned)    81%       ~$0.15

The fine-tuned student costs 33x less and retains ~95% of teacher quality.
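The savings and retention figures follow directly from the table and are easy to verify:

```python
# Quick arithmetic check of the claims above ($/1M tokens and accuracy).
teacher_cost, student_cost = 5.00, 0.15
teacher_quality, student_quality = 0.85, 0.81

print(round(teacher_cost / student_cost))            # 33  (x cheaper)
print(round(student_quality / teacher_quality, 2))   # 0.95 (quality retained)
```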

Step 4: BetterTogether (maximum quality)

BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.
python
optimizer = dspy.BetterTogether(
    metric=metric,
    prompt_optimizer=dspy.MIPROv2,
    weight_optimizer=dspy.BootstrapFinetune,
)
best = optimizer.compile(program, trainset=trainset)

best_score = evaluator(best)
print(f"Prompt-only:    {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")

How it works

  1. Round 1: Optimize prompts (instructions + few-shot examples)
  2. Round 2: Fine-tune weights using the optimized prompts
  3. Round 3: Re-optimize prompts for the fine-tuned model
  4. Each round builds on the previous, creating synergy between prompt and weight optimization
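The alternation above can be sketched conceptually. This is not DSPy's actual implementation; the point is only that each round starts from the program the previous round produced (toy optimizers stand in for MIPROv2 and BootstrapFinetune):

```python
# Conceptual sketch of BetterTogether's alternation: 'p' = prompt round,
# 'w' = weight round; each round compiles the previous round's program.
def better_together(program, trainset, make_prompt_opt, make_weight_opt, schedule="pwp"):
    current = program
    for step in schedule:
        optimizer = make_prompt_opt() if step == "p" else make_weight_opt()
        current = optimizer.compile(current, trainset=trainset)
    return current

# Toy optimizers that just record which rounds ran, in order
class ToyOptimizer:
    def __init__(self, tag):
        self.tag = tag
    def compile(self, program, trainset):
        return program + [self.tag]

rounds = better_together([], [],
                         lambda: ToyOptimizer("prompt"),
                         lambda: ToyOptimizer("weights"))
print(rounds)  # ['prompt', 'weights', 'prompt']
```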

When to use BetterTogether

  • You want the absolute best quality and have the compute budget
  • Fine-tuning alone didn't close the gap to your quality target
  • You have 500+ examples and a reliable metric

Step 5: Evaluate and deploy

Thorough evaluation

Always evaluate on the held-out test set (not dev set):
python
test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)

print(f"Test set results:")
print(f"  Baseline:         {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned:       {test_evaluator(finetuned):.1f}%")

Save and load for production


Save

python
finetuned.save("finetuned_program.json")

Load later

python
from my_module import MyProgram

production = MyProgram()
production.load("finetuned_program.json")
result = production(text="New support ticket...")

When fine-tuning goes wrong

Can't bootstrap enough traces

If the base model fails on most training examples, there aren't enough successful traces to fine-tune on.
Fixes:
  • Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
  • Relax your metric during bootstrapping (accept partial credit)
  • Simplify your task (break multi-step into single steps)
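One way to relax the metric during bootstrapping only: DSPy metrics receive a non-None `trace` argument while an optimizer is bootstrapping, so the metric can accept partial credit there while staying strict for evaluation. The dict stand-ins below are illustrative; a real metric receives dspy.Example and Prediction objects.

```python
# Metric that is lenient during bootstrapping (trace is not None) but
# strict during evaluation. Dict stand-ins used for illustration.
def metric(example, prediction, trace=None):
    if trace is not None:
        # Bootstrapping: tolerate case and whitespace slop
        return prediction["category"].strip().lower() == example["category"].strip().lower()
    # Evaluation: exact match only
    return prediction["category"] == example["category"]

example = {"category": "billing"}
print(metric(example, {"category": " Billing "}, trace=[]))  # True  (bootstrapping)
print(metric(example, {"category": " Billing "}))            # False (evaluation)
```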

Model overfits (high train accuracy, low test accuracy)

Fixes:
  • Add more training data
  • Reduce fine-tuning epochs (if provider allows)
  • Use a larger base model (less prone to overfitting)
  • Simplify your output format

Fine-tuning didn't improve over prompt optimization

Fixes:
  • Check that bootstrapping produced enough successful traces (need 200+)
  • Try BetterTogether instead of BootstrapFinetune alone
  • Verify your metric actually correlates with quality
  • Try a different base model

Infrastructure choices

OpenAI API (easiest)

Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:
python
lm = dspy.LM("openai/gpt-4o-mini")  # fine-tunable via API
  • Pros: No GPU needed, simple setup, fast
  • Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices

Local fine-tuning (own your model)

For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA:
python
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
  • Pros: Data stays private, no per-token costs after training, full control
  • Cons: Needs GPU(s), more setup, slower iteration

Cloud GPU platforms

AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:
  • Pros: Scalable, no hardware to manage
  • Cons: Costs vary, setup per platform

Additional resources

  • For worked examples (classification, distillation, BetterTogether), see examples.md
  • Use
    /ai-improving-accuracy
    to build a strong baseline before fine-tuning
  • Use
    /ai-cutting-costs
    for other cost reduction strategies beyond distillation
  • Use
    /ai-fixing-errors
    if fine-tuning or evaluation errors occur