ai-fine-tuning
Fine-Tune Models on Your Data
Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.
Should you fine-tune?
Before writing any code, walk through these questions with the user:
- Have you optimized prompts first? If not, use /ai-improving-accuracy — prompt optimization is 10x cheaper and often sufficient.
- Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
- Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
- What's the goal — quality or cost?
  - Quality: You've maxed out prompt optimization and need more accuracy
  - Cost: You want a small cheap model to match an expensive one
When to fine-tune
- You've already optimized prompts with MIPROv2 and hit a ceiling
- You have 500+ labeled examples (1000+ is better)
- Your baseline is >50% and you need to push higher
- You want to distill an expensive model into a cheaper one (10-50x cost savings)
- Your domain has specialized vocabulary or patterns the base model doesn't know
- You need faster inference (smaller fine-tuned models are faster)
When NOT to fine-tune
- You haven't tried prompt optimization yet — start with /ai-improving-accuracy
- You have fewer than 500 examples — need more data? Use /ai-generating-data to bootstrap synthetic examples, or use BootstrapFewShot or MIPROv2 instead
- Your baseline is below 50% — your data or task definition needs work
- You're still iterating on what the task is — fine-tuning locks you in
- You don't have a clear metric — you can't evaluate fine-tuning without one
- Your use case changes frequently — fine-tuned models don't adapt to new instructions easily
Prerequisites checklist
Before starting, confirm:
- Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
- Baseline: Prompt-optimized program with measured accuracy (use /ai-improving-accuracy)
- Metric: Clear, automated metric that scores predictions
- Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
- Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU
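The budget line can be sanity-checked with simple arithmetic: total trained tokens times the per-1K-token price. A minimal sketch (the epoch count and example sizes below are illustrative assumptions, not quotes):

```python
def estimate_finetune_cost(num_examples, avg_tokens_per_example,
                           epochs=3, price_per_1k_tokens=0.008):
    """Rough fine-tuning cost: total trained tokens x price per 1K tokens."""
    total_tokens = num_examples * avg_tokens_per_example * epochs
    return total_tokens / 1000 * price_per_1k_tokens

# 1,000 examples averaging 500 tokens each, 3 epochs, at ~$0.008/1K tokens
print(f"~${estimate_finetune_cost(1000, 500):.2f}")  # ~$12.00
```

Check the current pricing page before committing; rates and epoch defaults change.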
Step 1: Prepare your data and baseline
Build a strong baseline first
Always compare fine-tuning against a prompt-optimized baseline:
```python
import dspy

lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm)
```

Define your program
```python
class Classify(dspy.Signature):
    """Classify the support ticket."""

    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)
```
Prepare data
```python
import json

with open("labeled_data.json") as f:
    data = json.load(f)

examples = [
    dspy.Example(text=x["text"], category=x["category"]).with_inputs("text")
    for x in data
]
```
```python
# Split: 80% train, 10% dev, 10% test
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]
```
Measure baseline
```python
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")
```
Optimize prompts first (your comparison point)

```python
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")
```

If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.
Step 2: BootstrapFinetune (core fine-tuning)
The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.
```python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)
```

Evaluate the fine-tuned model
```python
finetuned_score = evaluator(finetuned)
print(f"Baseline: {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned: {finetuned_score:.1f}%")
```

How it works
- Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
- Filter by metric: Only successful traces become training data
- Fine-tune weights: Sends traces to the model provider's fine-tuning API
- Return optimized program: The program now uses the fine-tuned model
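The bootstrap-and-filter steps can be sketched in a few lines of plain Python. This is an illustration of the idea only, not DSPy's internal code; the toy `program`, `metric`, and dict-based examples below are stand-ins:

```python
def bootstrap_traces(program, trainset, metric):
    """Run the program on each example; keep only traces the metric accepts."""
    kept = []
    for example in trainset:
        prediction = program(example)           # run the current program
        if metric(example, prediction):         # filter by the task metric
            kept.append((example, prediction))  # successful trace -> training data
    return kept

# Toy stand-ins: an "uppercase" task with a trivial program and exact-match metric
program = lambda ex: ex["text"].upper()
metric = lambda ex, pred: pred == ex["label"]
trainset = [
    {"text": "ok", "label": "OK"},
    {"text": "bad", "label": "nope"},  # metric fails -> trace discarded
]
traces = bootstrap_traces(program, trainset, metric)
print(len(traces))  # 1
```

The filtered traces are what gets sent to the provider's fine-tuning API, which is why a reliable metric matters so much here.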
Requirements
- A fine-tunable model (OpenAI gpt-4o-mini, gpt-4o; or local open-source models)
- 500+ training examples (more traces bootstrapped = better fine-tuning)
- A metric that reliably identifies good outputs
Step 3: Model distillation (expensive to cheap)
Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.
Teacher-student pattern
```python
# Step 1: Teacher — expensive model, high quality
teacher_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=teacher_lm)

# Build and optimize the teacher
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)
teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

# Step 2: Student — fine-tune cheap model on teacher's outputs
student_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)
student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")
```
Typical results
| Model | Quality | Cost per 1M tokens |
|---|---|---|
| GPT-4o (teacher) | 85% | ~$5.00 |
| GPT-4o-mini (no tuning) | 70% | ~$0.15 |
| GPT-4o-mini (fine-tuned) | 81% | ~$0.15 |
The fine-tuned student costs 33x less and retains ~95% of teacher quality.
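The summary figures follow from the table with simple arithmetic:

```python
teacher_cost, student_cost = 5.00, 0.15    # $ per 1M tokens, from the table above
teacher_quality, student_quality = 85, 81  # accuracy (%)

cost_reduction = teacher_cost / student_cost                # ~33x
quality_retained = 100 * student_quality / teacher_quality  # ~95%
print(f"{cost_reduction:.0f}x cheaper, {quality_retained:.0f}% of teacher quality")
```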
Step 4: BetterTogether (maximum quality)
BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.
```python
optimizer = dspy.BetterTogether(
    metric=metric,
    prompt_optimizer=dspy.MIPROv2,
    weight_optimizer=dspy.BootstrapFinetune,
)
best = optimizer.compile(program, trainset=trainset)
best_score = evaluator(best)
print(f"Prompt-only: {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")
```

How it works
- Round 1: Optimize prompts (instructions + few-shot examples)
- Round 2: Fine-tune weights using the optimized prompts
- Round 3: Re-optimize prompts for the fine-tuned model
- Each round builds on the previous, creating synergy between prompt and weight optimization
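Conceptually, the alternation is just a loop that hands the current program back and forth between the two optimizers. This is an illustrative sketch of the schedule, not DSPy's actual BetterTogether implementation; `optimize_prompts` and `finetune_weights` are hypothetical stand-ins:

```python
def better_together(program, trainset, optimize_prompts, finetune_weights, rounds=3):
    """Alternate prompt and weight optimization; each round starts from the last."""
    for i in range(rounds):
        if i % 2 == 0:
            program = optimize_prompts(program, trainset)   # rounds 1, 3, ...
        else:
            program = finetune_weights(program, trainset)   # round 2, ...
    return program

# Toy stand-ins that just log the schedule and pass the program through
log = []
p = better_together(
    "base",
    trainset=None,
    optimize_prompts=lambda prog, ts: log.append("prompts") or prog,
    finetune_weights=lambda prog, ts: log.append("weights") or prog,
)
print(log)  # ['prompts', 'weights', 'prompts']
```

The synergy comes from each stage optimizing against the artifact the previous stage produced, rather than against the original base model.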
When to use BetterTogether
- You want the absolute best quality and have the compute budget
- Fine-tuning alone didn't close the gap to your quality target
- You have 500+ examples and a reliable metric
Step 5: Evaluate and deploy
Thorough evaluation
Always evaluate on the held-out test set (not dev set):
```python
test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)
print("Test set results:")
print(f"  Baseline: {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned: {test_evaluator(finetuned):.1f}%")
```

Save and load for production
```python
# Save
finetuned.save("finetuned_program.json")

# Load later
from my_module import MyProgram

production = MyProgram()
production.load("finetuned_program.json")
result = production(text="New support ticket...")
```
When fine-tuning goes wrong
Can't bootstrap enough traces
If the base model fails on most training examples, there aren't enough successful traces to fine-tune on.
Fixes:
- Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
- Relax your metric during bootstrapping (accept partial credit)
- Simplify your task (break multi-step into single steps)
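The second fix can use the `trace` argument DSPy passes to metrics: it is `None` during plain evaluation but populated during bootstrapping, so one metric can stay strict for scoring while accepting partial credit when collecting traces. A sketch (the containment check is an illustrative relaxation; tune it to your task):

```python
from types import SimpleNamespace

def relaxed_metric(example, prediction, trace=None):
    """Strict exact match for evaluation; partial credit during bootstrapping."""
    gold = example.category.lower()
    pred = prediction.category.lower()
    if trace is not None:  # bootstrapping: accept predictions containing the gold label
        return gold in pred
    return pred == gold    # evaluation: exact match only

# Stand-in objects in place of dspy.Example / dspy.Prediction
ex = SimpleNamespace(category="billing")
loose = SimpleNamespace(category="Billing question")
print(relaxed_metric(ex, loose))            # False — strict evaluation
print(relaxed_metric(ex, loose, trace=[]))  # True — relaxed during bootstrapping
```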
Model overfits (high train accuracy, low test accuracy)
Fixes:
- Add more training data
- Reduce fine-tuning epochs (if provider allows)
- Use a larger base model (less prone to overfitting)
- Simplify your output format
Fine-tuning didn't improve over prompt optimization
Fixes:
- Check that bootstrapping produced enough successful traces (need 200+)
- Try BetterTogether instead of BootstrapFinetune alone
- Verify your metric actually correlates with quality
- Try a different base model
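The first check can be run before paying for a fine-tune: count how many training examples the bootstrap model already handles. A minimal sketch with toy stand-ins for the program, metric, and data:

```python
def count_successful_traces(program, trainset, metric, minimum=200):
    """Count training examples whose program output passes the metric."""
    successes = sum(1 for ex in trainset if metric(ex, program(ex)))
    return successes, successes >= minimum

# Toy stand-ins: a "program" that only solves the easy half of the data
program = lambda ex: ex["label"] if ex["easy"] else "wrong"
metric = lambda ex, pred: pred == ex["label"]
trainset = [{"label": "a", "easy": i % 2 == 0} for i in range(500)]

n, enough = count_successful_traces(program, trainset, metric)
print(n, enough)  # 250 True
```

If the count falls below ~200, switch to a stronger bootstrap model or relax the metric before fine-tuning.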
Infrastructure choices
OpenAI API (easiest)
Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:

```python
lm = dspy.LM("openai/gpt-4o-mini")  # fine-tunable via API
```

- Pros: No GPU needed, simple setup, fast
- Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices
Local fine-tuning (own your model)
For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA:
```python
lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
```

- Pros: Data stays private, no per-token costs after training, full control
- Cons: Needs GPU(s), more setup, slower iteration
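For the LoRA/QLoRA route, one common option is Hugging Face's `peft` library. A hedged config sketch only (the library choice, target modules, and hyperparameters here are illustrative assumptions, not recommendations from this guide):

```python
from peft import LoraConfig

# Illustrative LoRA settings; tune per model and task
lora = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
```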
Cloud GPU platforms
AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:
- Pros: Scalable, no hardware to manage
- Cons: Costs vary, setup per platform
Additional resources
- For worked examples (classification, distillation, BetterTogether), see examples.md
- Use /ai-improving-accuracy to build a strong baseline before fine-tuning
- Use /ai-cutting-costs for other cost reduction strategies beyond distillation
- Use /ai-fixing-errors if fine-tuning or evaluation errors occur