ai-improving-accuracy
# Measure and Improve Your AI
Guide the user through measuring how well their AI works, then systematically improving it. This is a loop: define "good" -> measure -> improve -> verify.
## The Workflow
1. Define what "good" means — write a metric
2. Measure current quality — run an evaluation
3. Improve — choose an optimizer, run it
4. Verify — re-evaluate to confirm improvement
5. Iterate or ship
## Step 1: Define what "good" means (write a metric)
A metric takes an expected answer and the AI's answer, and returns a score.
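To see the calling convention, here is a minimal sketch that invokes an exact-match metric on plain objects. It uses `types.SimpleNamespace` as a stand-in for dspy's `Example` and `Prediction` objects (which expose fields as attributes); the values are hypothetical:

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    # Score is 1.0/True for an exact string match, else 0.0/False
    return prediction.answer == example.answer

gold = SimpleNamespace(answer="Paris")
print(metric(gold, SimpleNamespace(answer="Paris")))  # True
print(metric(gold, SimpleNamespace(answer="paris")))  # False: exact match is case-sensitive
```

The same three-argument shape (`example`, `prediction`, `trace`) is what every metric below follows.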
### Exact match (simplest)
```python
def metric(example, prediction, trace=None):
    return prediction.answer == example.answer
```

### Normalized match (handles capitalization/whitespace)
```python
def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()
```

### Partial credit (for multi-field outputs)
```python
def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").lower() == getattr(example, f, "").lower()
    )
    return correct / len(fields)
```

### F1 score (for text overlap)
```python
def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
```
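To sanity-check the arithmetic, here is a worked example with hypothetical strings: gold "the cat sat" and prediction "the cat" share two tokens, so precision = 2/2 = 1.0, recall = 2/3, and F1 = 0.8:

```python
# Worked token-F1 example (hypothetical strings)
gold_tokens = set("the cat sat".split())  # {'the', 'cat', 'sat'}
pred_tokens = set("the cat".split())      # {'the', 'cat'}

overlap = len(gold_tokens & pred_tokens)  # 2 shared tokens
precision = overlap / len(pred_tokens)    # 2/2 = 1.0
recall = overlap / len(gold_tokens)       # 2/3

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.8
```

Note that the prediction scores well despite missing a word; F1 rewards overlap rather than exact agreement.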
### AI-as-judge (for open-ended tasks)
When exact match is too strict (summaries, creative tasks, open-ended Q&A):
```python
class AssessQuality(dspy.Signature):
    """Assess if the predicted answer is correct and complete."""

    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

def metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct
```

### Composite metric (multiple criteria)
```python
def metric(example, prediction, trace=None):
    correct = float(prediction.answer.lower() == example.answer.lower())
    concise = float(len(prediction.answer.split()) < 50)
    has_reasoning = float(len(getattr(prediction, 'reasoning', '')) > 20)
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning
```

### Training-aware metric
The `trace` parameter is not `None` during optimization. Use it for stricter requirements during training:

```python
def metric(example, prediction, trace=None):
    correct = prediction.answer == example.answer
    if trace is not None:
        # During optimization, also require good reasoning
        has_reasoning = len(prediction.reasoning) > 50
        return correct and has_reasoning
    return correct
```

## Step 2: Measure current quality (run evaluation)
### Prepare test data
If you don't have enough examples, use /ai-generating-data to generate synthetic training data.

#### Manual creation

```python
import dspy

devset = [
    dspy.Example(question="What is DSPy?", answer="A framework for LM programs").with_inputs("question"),
    # 20-100+ examples for reliable evaluation
]
```
#### From CSV/JSON

```python
import json

with open("test_data.json") as f:
    data = json.load(f)
devset = [dspy.Example(**x).with_inputs("question") for x in data]
```
#### From HuggingFace

```python
from datasets import load_dataset

dataset = load_dataset("squad", split="validation[:100]")
devset = [
    dspy.Example(question=x["question"], answer=x["answers"]["text"][0]).with_inputs("question")
    for x in dataset
]
```

### Run evaluation
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=metric,
    num_threads=4,
    display_progress=True,
    display_table=5,  # show 5 example results
)
baseline_score = evaluator(my_program)
print(f"Baseline: {baseline_score}")
```

## Step 3: Improve (choose an optimizer)
### Quick guide: which optimizer?
| Training examples | Recommended optimizer | Expected improvement | Typical cost |
|---|---|---|---|
| <20 | GEPA (instruction tuning) | 5-15% | ~$0.50 |
| 20-50 | BootstrapFewShot | 5-20% | ~$0.50-2 |
| 50-200 | BootstrapFewShot, then MIPROv2 | 15-35% | ~$2-10 |
| 200-500 | MIPROv2 (auto="medium") | 20-40% | ~$5-15 |
| 500+ | MIPROv2 (auto="heavy") or BootstrapFinetune | 25-50% | ~$15-50+ |
```
Start here
|
+- Just getting started (<50 examples)? -> BootstrapFewShot
|    Quick, cheap, usually gives a solid boost.
|
+- Want better prompts (50+ examples)? -> MIPROv2
|    Optimizes both instructions and examples.
|    Best general-purpose prompt optimizer.
|
+- Want to tune instructions only (<50 examples)? -> GEPA
|    Good when you have few examples.
|
+- Need maximum quality (500+ examples)? -> BootstrapFinetune
|    Fine-tunes the model weights.
|    Best for production with smaller/cheaper models.
|
+- Want to combine approaches? -> BetterTogether
     Jointly optimizes prompts and weights.
```

**Stacking tip:** Run BootstrapFewShot first, then MIPROv2 on the result. This often beats either alone — bootstrap finds good examples, then MIPRO refines the instructions.

Optimized prompts are model-specific. If you change models, re-run your optimizer. See /ai-switching-models.
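The stacking tip can be written as a two-stage pipeline. This is a sketch, not a prescribed recipe; it assumes `metric`, `trainset`, and `my_program` are defined as in the earlier steps:

```python
import dspy

# Stage 1: bootstrap few-shot demos from successful runs
bootstrap = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
bootstrapped = bootstrap.compile(my_program, trainset=trainset)

# Stage 2: refine instructions on top of the bootstrapped program
mipro = dspy.MIPROv2(metric=metric, auto="light")
optimized = mipro.compile(bootstrapped, trainset=trainset)
```

Evaluate after each stage so you know which step contributed the gain.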
### BootstrapFewShot (start here)
Fast, cheap. Finds good examples by running your program and keeping successful traces.

```python
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
optimized = optimizer.compile(my_program, trainset=trainset)
```

**Cost:** Minimal (one pass through trainset). **Expected improvement:** 5-20%.
### MIPROv2 (recommended for most cases)
Optimizes both instructions and examples. Best general-purpose optimizer.

```python
optimizer = dspy.MIPROv2(
    metric=metric,
    auto="medium",  # "light", "medium", "heavy"
)
optimized = optimizer.compile(my_program, trainset=trainset)
```

- `"light"`: Quick, ~$1-2
- `"medium"`: Balanced, ~$5-10
- `"heavy"`: Thorough, ~$15-30

**Expected improvement:** 15-35%.
### GEPA (instruction tuning)
Good with few examples or when you want to focus on instruction quality:

```python
optimizer = dspy.GEPA()
optimized = optimizer.compile(my_program, trainset=trainset, metric=metric)
```

### BootstrapFinetune (maximum quality)
Fine-tunes model weights for the biggest accuracy gains. Requires 500+ examples and a fine-tunable model:

```python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
optimized = optimizer.compile(my_program, trainset=trainset)
```

For the full fine-tuning workflow (decision framework, prerequisites, model distillation, BetterTogether), see /ai-fine-tuning.

### When optimization plateaus
If your score stops improving, check these common causes:
| Symptom | Likely cause | Fix |
|---|---|---|
| Score stuck at 60-70% despite optimization | Task too complex for single step | |
| Optimizer overfits (train score high, dev score flat) | Too little training data | |
| Score varies wildly between runs | Non-deterministic metric or small devset | Increase devset to 100+, set temperature=0 |
| Diminishing returns from more demos | Prompt is maxed out; model is the limit | |
| Score high but real users complain | Metric doesn't match real quality | Rewrite metric based on actual failure patterns |
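For the "score varies wildly" row, pinning the sampling temperature looks like this. A sketch only; the OpenAI-hosted model name is an illustrative assumption (any dspy.LM model string works):

```python
import dspy

# Deterministic decoding reduces run-to-run variance during evaluation
lm = dspy.LM("openai/gpt-4o-mini", temperature=0)
dspy.configure(lm=lm)
```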
## Step 4: Verify improvement
```python
optimized_score = evaluator(optimized)
print(f"Baseline: {baseline_score:.1f}%")
print(f"Optimized: {optimized_score:.1f}%")
print(f"Improvement: {optimized_score - baseline_score:.1f}%")
```

## Step 5: Save and ship
```python
optimized.save("optimized_program.json")
```

### Load later

```python
my_program = MyProgram()
my_program.load("optimized_program.json")
```

## Key patterns
- Start simple: exact match metric + BootstrapFewShot, then upgrade if needed
- Validate your metric: manually check 10-20 examples to make sure the metric scores correctly
- More data helps: optimizers work better with more training examples
- Never evaluate on trainset: always use a held-out devset
- Use `display_table`: looking at actual predictions reveals metric bugs
- Iterate: run optimization, check results, improve metric, re-optimize
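A held-out devset ("never evaluate on trainset" above) can come from a seeded shuffle and a slice. A minimal sketch using integers as stand-ins for your `dspy.Example` objects:

```python
import random

examples = list(range(100))          # stand-ins for dspy.Example objects
random.Random(0).shuffle(examples)   # fixed seed so the split is reproducible

split = int(0.7 * len(examples))
trainset = examples[:split]          # used only by the optimizer
devset = examples[split:]            # used only for evaluation
```

The fixed seed matters: if the split changes between runs, score differences can come from the data rather than your changes.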
## Additional resources
- For optimizer comparison table and metric patterns, see reference.md
- Once quality is good, use /ai-cutting-costs to reduce your AI bill
- Use /ai-monitoring to track quality in production after deployment
- Use /ai-tracking-experiments to log, compare, and manage multiple optimization runs
- Accuracy plateaued despite optimization? Try /ai-decomposing-tasks to restructure your task
- If things are broken, use /ai-fixing-errors to diagnose