ai-improving-accuracy

Measure and Improve Your AI

Guide the user through measuring how well their AI works, then systematically improving it. This is a loop: define "good" -> measure -> improve -> verify.

The Workflow

  1. Define what "good" means — write a metric
  2. Measure current quality — run an evaluation
  3. Improve — choose an optimizer, run it
  4. Verify — re-evaluate to confirm improvement
  5. Iterate or ship

Step 1: Define what "good" means (write a metric)

A metric takes an expected answer and the AI's answer, and returns a score.

Exact match (simplest)

python
def metric(example, prediction, trace=None):
    return prediction.answer == example.answer
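
Before wiring a metric into an evaluator, it helps to sanity-check it by hand. A minimal sketch, using `SimpleNamespace` objects to stand in for DSPy examples and predictions (the values here are illustrative):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    return prediction.answer == example.answer

example = SimpleNamespace(answer="Paris")
assert metric(example, SimpleNamespace(answer="Paris")) is True
assert metric(example, SimpleNamespace(answer="paris")) is False  # case-sensitive
```

The second assertion shows why a normalized variant is often the better default.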

Normalized match (handles capitalization/whitespace)

python
def metric(example, prediction, trace=None):
    return prediction.answer.strip().lower() == example.answer.strip().lower()

Partial credit (for multi-field outputs)

python
def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").lower() == getattr(example, f, "").lower()
    )
    return correct / len(fields)
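
Hand-run example: a prediction that matches two of the three fields, with the third absent, scores 2/3 because `getattr` falls back to an empty string. A runnable sketch with stand-in objects (field values are illustrative):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    fields = ["name", "email", "phone"]
    correct = sum(
        1 for f in fields
        if getattr(prediction, f, "").lower() == getattr(example, f, "").lower()
    )
    return correct / len(fields)

example = SimpleNamespace(name="Ada Lovelace", email="ada@example.com", phone="555-0100")
prediction = SimpleNamespace(name="ada lovelace", email="ada@example.com")  # phone missing
assert abs(metric(example, prediction) - 2 / 3) < 1e-9  # 2 of 3 fields match
```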

F1 score (for text overlap)

python
def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
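
Worked by hand: with gold answer "The capital of France" and prediction "capital of France", the token overlap is 3, so precision = 3/3 = 1.0, recall = 3/4 = 0.75, and F1 = 2(1.0)(0.75)/1.75 ≈ 0.857. A runnable check of that arithmetic (stand-in objects; the strings are illustrative):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    gold_tokens = set(example.answer.lower().split())
    pred_tokens = set(prediction.answer.lower().split())
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    precision = len(gold_tokens & pred_tokens) / len(pred_tokens)
    recall = len(gold_tokens & pred_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

example = SimpleNamespace(answer="The capital of France")
prediction = SimpleNamespace(answer="capital of France")
# precision = 3/3 = 1.0, recall = 3/4 = 0.75
assert round(metric(example, prediction), 3) == 0.857
```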

AI-as-judge (for open-ended tasks)

When exact match is too strict (summaries, creative tasks, open-ended Q&A):
python
class AssessQuality(dspy.Signature):
    """Assess if the predicted answer is correct and complete."""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

def metric(example, prediction, trace=None):
    judge = dspy.Predict(AssessQuality)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Composite metric (multiple criteria)

python
def metric(example, prediction, trace=None):
    correct = float(prediction.answer.lower() == example.answer.lower())
    concise = float(len(prediction.answer.split()) < 50)
    has_reasoning = float(len(getattr(prediction, 'reasoning', '')) > 20)
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning
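
Hand-run example: a correct, one-word answer with no reasoning field scores 0.7 + 0.2 + 0.0 = 0.9 (stand-in objects; the weights match the metric above):

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    correct = float(prediction.answer.lower() == example.answer.lower())
    concise = float(len(prediction.answer.split()) < 50)
    has_reasoning = float(len(getattr(prediction, 'reasoning', '')) > 20)
    return 0.7 * correct + 0.2 * concise + 0.1 * has_reasoning

example = SimpleNamespace(answer="Paris")
prediction = SimpleNamespace(answer="paris")  # correct and concise, no reasoning
assert abs(metric(example, prediction) - 0.9) < 1e-9
```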

Training-aware metric

The `trace` parameter is not `None` during optimization. Use it for stricter requirements during training:
python
def metric(example, prediction, trace=None):
    correct = prediction.answer == example.answer
    if trace is not None:
        # During optimization, also require good reasoning
        has_reasoning = len(prediction.reasoning) > 50
        return correct and has_reasoning
    return correct

Step 2: Measure current quality (run evaluation)

Prepare test data

If you don't have enough examples, use /ai-generating-data to generate synthetic training data.
python
import dspy

Manual creation

python
devset = [
    dspy.Example(
        question="What is DSPy?",
        answer="A framework for LM programs",
    ).with_inputs("question"),
    # 20-100+ examples for reliable evaluation
]

From CSV/JSON

python
import json

with open("test_data.json") as f:
    data = json.load(f)
devset = [dspy.Example(**x).with_inputs("question") for x in data]

From HuggingFace

python
from datasets import load_dataset

dataset = load_dataset("squad", split="validation[:100]")
devset = [
    dspy.Example(
        question=x["question"],
        answer=x["answers"]["text"][0],
    ).with_inputs("question")
    for x in dataset
]

Run evaluation

python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=metric,
    num_threads=4,
    display_progress=True,
    display_table=5,   # show 5 example results
)

baseline_score = evaluator(my_program)
print(f"Baseline: {baseline_score}")

Step 3: Improve (choose an optimizer)

Quick guide: which optimizer?

| Training examples | Recommended optimizer | Expected improvement | Typical cost |
| --- | --- | --- | --- |
| <20 | GEPA (instruction tuning) | 5-15% | ~$0.50 |
| 20-50 | BootstrapFewShot | 5-20% | ~$0.50-2 |
| 50-200 | BootstrapFewShot, then MIPROv2 | 15-35% | ~$2-10 |
| 200-500 | MIPROv2 (auto="medium") | 20-40% | ~$5-15 |
| 500+ | MIPROv2 (auto="heavy") or BootstrapFinetune | 25-50% | ~$15-50+ |
Start here
|
+- Just getting started (<50 examples)? -> BootstrapFewShot
|   Quick, cheap, usually gives a solid boost.
|
+- Want better prompts (50+ examples)? -> MIPROv2
|   Optimizes both instructions and examples.
|   Best general-purpose prompt optimizer.
|
+- Want to tune instructions only (<50 examples)? -> GEPA
|   Good when you have few examples.
|
+- Need maximum quality (500+ examples)? -> BootstrapFinetune
|   Fine-tunes the model weights.
|   Best for production with smaller/cheaper models.
|
+- Want to combine approaches? -> BetterTogether
    Jointly optimizes prompts and weights.
Stacking tip: Run BootstrapFewShot first, then MIPROv2 on the result. This often beats either alone — bootstrap finds good examples, then MIPRO refines the instructions.
Optimized prompts are model-specific. If you change models, re-run your optimizer. See /ai-switching-models.

BootstrapFewShot (start here)

Fast, cheap. Finds good examples by running your program and keeping successful traces.
python
optimizer = dspy.BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
optimized = optimizer.compile(my_program, trainset=trainset)
Cost: Minimal (one pass through trainset). Expected improvement: 5-20%.

MIPROv2 (recommended for most cases)

Optimizes both instructions and examples. Best general-purpose optimizer.
python
optimizer = dspy.MIPROv2(
    metric=metric,
    auto="medium",    # "light", "medium", "heavy"
)
optimized = optimizer.compile(my_program, trainset=trainset)
  • "light": Quick, ~$1-2
  • "medium": Balanced, ~$5-10
  • "heavy": Thorough, ~$15-30
Expected improvement: 15-35%.

GEPA (instruction tuning)

Good with few examples or when you want to focus on instruction quality:
python
optimizer = dspy.GEPA()
optimized = optimizer.compile(my_program, trainset=trainset, metric=metric)

BootstrapFinetune (maximum quality)

Fine-tunes model weights for the biggest accuracy gains. Requires 500+ examples and a fine-tunable model:
python
optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
optimized = optimizer.compile(my_program, trainset=trainset)
For the full fine-tuning workflow (decision framework, prerequisites, model distillation, BetterTogether), see /ai-fine-tuning.

When optimization plateaus

If your score stops improving, check these common causes:
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Score stuck at 60-70% despite optimization | Task too complex for a single step | /ai-decomposing-tasks — break into subtasks |
| Optimizer overfits (train score high, dev score flat) | Too little training data | /ai-generating-data — generate more examples |
| Score varies wildly between runs | Non-deterministic metric or small devset | Increase devset to 100+, set temperature=0 |
| Diminishing returns from more demos | Prompt is maxed out; model is the limit | /ai-switching-models — try a more capable model |
| Score high but real users complain | Metric doesn't match real quality | Rewrite metric based on actual failure patterns |

Step 4: Verify improvement

python
optimized_score = evaluator(optimized)
print(f"Baseline: {baseline_score:.1f}%")
print(f"Optimized: {optimized_score:.1f}%")
print(f"Improvement: {optimized_score - baseline_score:.1f}%")

Step 5: Save and ship

python
optimized.save("optimized_program.json")

Load later

python
my_program = MyProgram()
my_program.load("optimized_program.json")

Key patterns

  • Start simple: exact match metric + BootstrapFewShot, then upgrade if needed
  • Validate your metric: manually check 10-20 examples to make sure the metric scores correctly
  • More data helps: optimizers work better with more training examples
  • Never evaluate on trainset: always use a held-out devset
  • Use `display_table`: looking at actual predictions reveals metric bugs
  • Iterate: run optimization, check results, improve metric, re-optimize
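
"Never evaluate on trainset" can be enforced with a simple shuffled split before you start. A minimal sketch (the 80/20 ratio and fixed seed are illustrative choices; in practice `examples` would be your list of dspy.Example objects):

```python
import random

examples = [f"example_{i}" for i in range(100)]  # stand-in for dspy.Example objects
random.Random(42).shuffle(examples)  # fixed seed so the split is reproducible

split = int(0.8 * len(examples))
trainset, devset = examples[:split], examples[split:]

assert len(trainset) == 80 and len(devset) == 20
assert not set(trainset) & set(devset)  # held-out: no overlap
```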

Additional resources

  • For the optimizer comparison table and metric patterns, see reference.md
  • Once quality is good, use /ai-cutting-costs to reduce your AI bill
  • Use /ai-monitoring to track quality in production after deployment
  • Use /ai-tracking-experiments to log, compare, and manage multiple optimization runs
  • Accuracy plateaued despite optimization? Try /ai-decomposing-tasks to restructure your task
  • If things are broken, use /ai-fixing-errors to diagnose