ai-improving-accuracy

Measure and Improve Your AI

Guide the user through measuring how well their AI works, then systematically improving it. This is a loop: define "good" -> measure -> improve -> verify.

The Workflow

  1. Define what "good" means — write a metric
  2. Measure current quality — run an evaluation
  3. Improve — choose an optimizer, run it
  4. Verify — re-evaluate to confirm improvement
  5. Iterate or ship

Step 1: Define what "good" means (write a metric)

A metric takes an expected answer and the AI's answer, and returns a score.

Exact match (simplest)

python
def metric(example, prediction, trace=None):
    return prediction.answer == example.answer
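
Before wiring a metric into an evaluator, it helps to sanity-check it by hand. A minimal sketch, using `SimpleNamespace` objects to stand in for DSPy examples and predictions (the values here are illustrative):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    return prediction.answer == example.answer

example = SimpleNamespace(answer="Paris")
assert metric(example, SimpleNamespace(answer="Paris")) is True
assert metric(example, SimpleNamespace(answer="paris")) is False  # case-sensitive
```

The second assertion shows why a normalized variant is often the better default.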

Normalized match (handles capitalization/whitespace)

python
def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

Partial credit (for multi-field outputs)

python
def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").lower() == getattr(example, f, "").lower()
    )
    return correct / len(fields)
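
Hand-run example: a prediction that matches two of the three fields, with the third absent, scores 2/3 because `getattr` falls back to an empty string. A runnable sketch with stand-in objects (field values are illustrative):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").lower() == getattr(example, f, "").lower()
    )
    return correct / len(fields)

example = SimpleNamespace(name="Ada Lovelace", email="ada@example.com", phone="555-0100")
prediction = SimpleNamespace(name="ada lovelace", email="ada@example.com")  # phone missing
assert abs(metric(example, prediction) - 2 / 3) < 1e-9  # 2 of 3 fields match
```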

F1 score (for text overlap)

python
def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
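
Worked by hand: with gold answer "The capital of France" and prediction "capital of France", the token overlap is 3, so precision = 3/3 = 1.0, recall = 3/4 = 0.75, and F1 = 2(1.0)(0.75)/1.75 ≈ 0.857. A runnable check of that arithmetic (stand-in objects; the strings are illustrative):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

example = SimpleNamespace(answer="The capital of France")
prediction = SimpleNamespace(answer="capital of France")
# precision = 3/3 = 1.0, recall = 3/4 = 0.75
assert round(metric(example, prediction), 3) == 0.857
```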

AI-as-judge (for open-ended tasks)

When exact match is too strict (summaries, creative tasks, open-ended Q&A):
python
class AssessQuality(dspy.Signature):
    """Assess if the predicted answer is correct and complete."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

def metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Composite metric (multiple criteria)

python
def metric(example, prediction, trace=None):
    correct = float(prediction.answer.lower() == example.answer.lower())
    concise = float(len(prediction.answer.split()) < 50)
    has_reasoning = float(len(getattr(prediction, 'reasoning', '')) > 20)
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning
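
Hand-run example: a correct, one-word answer with no reasoning field scores 0.7 + 0.2 + 0.0 = 0.9 (stand-in objects; the weights match the metric above):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    correct = float(prediction.answer.lower() == example.answer.lower())
    concise = float(len(prediction.answer.split()) < 50)
    has_reasoning = float(len(getattr(prediction, 'reasoning', '')) > 20)
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning

example = SimpleNamespace(answer="Paris")
prediction = SimpleNamespace(answer="paris")  # correct and concise, no reasoning
assert abs(metric(example, prediction) - 0.9) < 1e-9
```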

Training-aware metric

The `trace` parameter is not `None` during optimization. Use it for stricter requirements during training:
python
def metric(example, prediction, trace=None):
    correct = prediction.answer == example.answer
    if trace is not None:
        # During optimization, also require good reasoning
        has_reasoning = len(prediction.reasoning) > 50
        return correct and has_reasoning
    return correct

Step 2: Measure current quality (run evaluation)

Prepare test data

If you don't have enough examples, use /ai-generating-data to generate synthetic training data.
python
import dspy

Manual creation

python
devset = [
    dspy.Example(
        question="What is DSPy?",
        answer="A framework for LM programs",
    ).with_inputs("question"),
    # 20-100+ examples for reliable evaluation
]

From CSV/JSON

python
import json

with open("test_data.json") as f:
    data = json.load(f)
devset = [dspy.Example(**x).with_inputs("question") for x in data]

From HuggingFace

python
from datasets import load_dataset

dataset = load_dataset("squad", split="validation[:100]")
devset = [
    dspy.Example(
        question=x["question"],
        answer=x["answers"]["text"][0],
    ).with_inputs("question")
    for x in dataset
]

Run evaluation

python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=metric,
    num_threads=4,
    display_progress=True,
    display_table=5,   # show 5 example results
)

baseline_score = evaluator(my_program)
print(f"Baseline: {baseline_score}")

Step 3: Improve (choose an optimizer)

Quick guide: which optimizer?

| Training examples | Recommended optimizer | Expected improvement | Typical cost |
| --- | --- | --- | --- |
| <20 | GEPA (instruction tuning) | 5-15% | ~$0.50 |
| 20-50 | BootstrapFewShot | 5-20% | ~$0.50-2 |
| 50-200 | BootstrapFewShot, then MIPROv2 | 15-35% | ~$2-10 |
| 200-500 | MIPROv2 (auto="medium") | 20-40% | ~$5-15 |
| 500+ | MIPROv2 (auto="heavy") or BootstrapFinetune | 25-50% | ~$15-50+ |
Start here
|
+- Just getting started (<50 examples)? -> BootstrapFewShot
|   Quick, cheap, usually gives a solid boost.
|
+- Want better prompts (50+ examples)? -> MIPROv2
|   Optimizes both instructions and examples.
|   Best general-purpose prompt optimizer.
|
+- Want to tune instructions only (<50 examples)? -> GEPA
|   Good when you have few examples.
|
+- Need maximum quality (500+ examples)? -> BootstrapFinetune
|   Fine-tunes the model weights.
|   Best for production with smaller/cheaper models.
|
+- Want to combine approaches? -> BetterTogether
    Jointly optimizes prompts and weights.
Stacking tip: Run BootstrapFewShot first, then MIPROv2 on the result. This often beats either alone — bootstrap finds good examples, then MIPRO refines the instructions.
Optimized prompts are model-specific. If you change models, re-run your optimizer. See /ai-switching-models.

BootstrapFewShot (start here)

Fast, cheap. Finds good examples by running your program and keeping successful traces.
python
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
optimized = optimizer.compile(my_program, trainset=trainset)
Cost: Minimal (one pass through trainset). Expected improvement: 5-20%.

MIPROv2 (recommended for most cases)

Optimizes both instructions and examples. Best general-purpose optimizer.
python
optimizer = dspy.MIPROv2(
    metric=metric,
    auto="medium",    # "light", "medium", "heavy"
)
optimized = optimizer.compile(my_program, trainset=trainset)
  • "light": Quick, ~$1-2
  • "medium": Balanced, ~$5-10
  • "heavy": Thorough, ~$15-30
Expected improvement: 15-35%.

GEPA (instruction tuning)

Good with few examples or when you want to focus on instruction quality:
python
optimizer = dspy.GEPA()
optimized = optimizer.compile(my_program, trainset=trainset, metric=metric)

BootstrapFinetune (maximum quality)

Fine-tunes model weights for the biggest accuracy gains. Requires 500+ examples and a fine-tunable model:
python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
optimized = optimizer.compile(my_program, trainset=trainset)
For the full fine-tuning workflow (decision framework, prerequisites, model distillation, BetterTogether), see /ai-fine-tuning.

When optimization plateaus

If your score stops improving, check these common causes:
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Score stuck at 60-70% despite optimization | Task too complex for a single step | /ai-decomposing-tasks — break into subtasks |
| Optimizer overfits (train score high, dev score flat) | Too little training data | /ai-generating-data — generate more examples |
| Score varies wildly between runs | Non-deterministic metric or small devset | Increase devset to 100+, set temperature=0 |
| Diminishing returns from more demos | Prompt is maxed out; model is the limit | /ai-switching-models — try a more capable model |
| Score high but real users complain | Metric doesn't match real quality | Rewrite metric based on actual failure patterns |

Step 4: Verify improvement

python
optimized_score = evaluator(optimized)
print(f"Baseline: {baseline_score:.1f}%")
print(f"Optimized: {optimized_score:.1f}%")
print(f"Improvement: {optimized_score - baseline_score:.1f}%")

Step 5: Save and ship

python
optimized.save("optimized_program.json")

Load later

python
my_program = MyProgram()
my_program.load("optimized_program.json")

Key patterns

  • Start simple: exact match metric + BootstrapFewShot, then upgrade if needed
  • Validate your metric: manually check 10-20 examples to make sure the metric scores correctly
  • More data helps: optimizers work better with more training examples
  • Never evaluate on trainset: always use a held-out devset
  • Use `display_table`: looking at actual predictions reveals metric bugs
  • Iterate: run optimization, check results, improve metric, re-optimize
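
"Never evaluate on trainset" can be enforced with a simple shuffled split before you start. A minimal sketch (the 80/20 ratio and fixed seed are illustrative choices; in practice `examples` would be your list of dspy.Example objects):

```python
import random

examples = [f"example_{i}" for i in range(100)]  # stand-in for dspy.Example objects
random.Random(42).shuffle(examples)  # fixed seed so the split is reproducible

split = int(0.8 * len(examples))
trainset, devset = examples[:split], examples[split:]

assert len(trainset) == 80 and len(devset) == 20
assert not set(trainset) & set(devset)  # held-out: no overlap
```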

Additional resources

  • For the optimizer comparison table and metric patterns, see reference.md
  • Once quality is good, use /ai-cutting-costs to reduce your AI bill
  • Use /ai-monitoring to track quality in production after deployment
  • Use /ai-tracking-experiments to log, compare, and manage multiple optimization runs
  • Accuracy plateaued despite optimization? Try /ai-decomposing-tasks to restructure your task
  • If things are broken, use /ai-fixing-errors to diagnose