ai-generating-data
Generate Synthetic Training Data
Guide the user through generating high-quality synthetic training data with DSPy. This solves the "I don't have data" problem that blocks every other AI workflow.
When you need synthetic data
- Cold start: You're building a new feature and have zero labeled examples
- Not enough for optimization: You have 10-30 examples but optimizers need 200+
- Privacy/compliance: You can't use real customer data for training
- Edge cases: Your AI works on common inputs but fails on rare ones
- Unbalanced categories: Some categories have 500 examples, others have 10
- New categories: You added a category and have no examples for it
- Schema changed: Your input/output format changed, old data doesn't fit
- Proof of concept: PM wants a demo by Friday, no time to collect real data
The core idea
Define a generator signature whose outputs match your task's input/output fields. Use an LM to produce examples. Filter for quality. Use for optimization.
Research shows this works surprisingly well:
- Optimized generator prompts match models trained on 100K+ human labels using only 10 gold labels (arXiv 2406.11706)
- DSPy-optimized Chain-of-Thought generation outperforms hand-written static templates (arXiv 2508.13930)
The key insight: the prompt used to generate data is a critical hyperparameter — optimizing it matters more than generating more data.
Step 1: Define what an example looks like
Your generator's outputs should match your task's inputs and expected outputs.
```python
import dspy

# Your task: what the AI will do in production
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket into a category."""
    ticket_text: str = dspy.InputField()
    category: str = dspy.OutputField()

# Generator: produces examples for your task
class GenerateTicketExample(dspy.Signature):
    """Generate a realistic support ticket with its correct category."""
    category: str = dspy.InputField(desc="the target category to generate an example for")
    ticket_text: str = dspy.OutputField(desc="a realistic support ticket for this category")
```
The generator's output fields become inputs to your task. Think of it as: "given what I want the answer to be, generate a realistic input."
Multi-field tasks
If your task has multiple inputs or outputs, mirror all of them:
```python
# Task: extract structured data from text
class ExtractContact(dspy.Signature):
    """Extract contact info from a message."""
    message: str = dspy.InputField()
    name: str = dspy.OutputField()
    email: str = dspy.OutputField()
    phone: str = dspy.OutputField()

# Generator: produce realistic messages with known contact info
class GenerateContactExample(dspy.Signature):
    """Generate a realistic message that contains contact information."""
    name: str = dspy.InputField(desc="the person's name to embed in the message")
    email: str = dspy.InputField(desc="the email address to embed in the message")
    phone: str = dspy.InputField(desc="the phone number to embed in the message")
    message: str = dspy.OutputField(desc="a realistic message containing this contact info")
```
Step 2: Write seed examples
Start with 5-10 hand-written examples. These anchor the generator's understanding of what "realistic" means for your domain.
```python
seeds = [
    dspy.Example(
        ticket_text="I was charged twice for my subscription this month. Order #4521.",
        category="billing"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="The app crashes when I try to upload a profile photo on Android.",
        category="bug"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="How do I export my data to CSV? I can't find the option anywhere.",
        category="how-to"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="I'd love to see dark mode added. The white background hurts my eyes.",
        category="feature-request"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="My account got locked after too many login attempts. Please help.",
        category="account"
    ).with_inputs("ticket_text"),
]
```
Even 5 seeds dramatically improve generation quality over zero.
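Before generating, it can help to confirm that the seeds touch every category you plan to generate for. The sketch below is a plain-Python check under stated assumptions: it uses `(ticket_text, category)` tuples as stand-ins for the `dspy.Example` seeds, and the `missing_seed_categories` helper name is hypothetical rather than part of DSPy.

```python
from collections import Counter

def missing_seed_categories(seed_pairs, categories):
    """Return the categories that have no seed example, in the given order."""
    counts = Counter(category for _text, category in seed_pairs)
    return [c for c in categories if counts[c] == 0]

# Stand-in seeds: (ticket_text, category) tuples instead of dspy.Example objects.
seed_pairs = [
    ("I was charged twice for my subscription this month.", "billing"),
    ("The app crashes when I upload a profile photo.", "bug"),
    ("How do I export my data to CSV?", "how-to"),
]
categories = ["billing", "bug", "how-to", "feature-request", "account"]

# Categories listed here still need at least one hand-written seed.
print(missing_seed_categories(seed_pairs, categories))
```

With real seeds, the same check could be fed `[(s.ticket_text, s.category) for s in seeds]`.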
Step 3: Generate in batches
Two patterns depending on your LM provider:
Pattern A: `n=N` batch generation
When your provider supports the `n` parameter (OpenAI does), this generates multiple completions in one call, which is faster and often more diverse:
```python
generator = dspy.Predict(GenerateTicketExample, n=20)
response = generator(category="billing")
examples = [
    dspy.Example(ticket_text=t, category="billing").with_inputs("ticket_text")
    for t in response.completions.ticket_text
]
```
Pattern B: Loop generation
Works with any provider. More control over each example:
```python
examples = []
categories = ["billing", "bug", "how-to", "feature-request", "account"]
generator = dspy.Predict(GenerateTicketExample)  # build once, reuse for every call

for category in categories:
    for i in range(40):
        # Note: identical inputs may be served from DSPy's LM cache, so repeated
        # calls can return the same ticket; the diversity trick in this guide helps.
        result = generator(category=category)
        examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category)
            .with_inputs("ticket_text")
        )

print(f"Generated {len(examples)} examples")
```
The `n` parameter isn't supported by all providers; use the loop pattern as a reliable fallback.
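The extra control in the loop pattern can include skipping repeats and capping the number of attempts. A minimal sketch, assuming a hypothetical `generate_one(category)` callable that wraps your DSPy generator and returns the ticket text; a deterministic stand-in is used here so the loop runs without an LM.

```python
import itertools

def collect_unique(generate_one, category, target, max_attempts):
    """Call generate_one until `target` distinct texts are collected or attempts run out."""
    seen = set()
    texts = []
    for _ in range(max_attempts):
        text = generate_one(category).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            texts.append(text)
        if len(texts) >= target:
            break
    return texts

# Stand-in generator (hypothetical): cycles through canned tickets, one of them repeated.
_canned = itertools.cycle([
    "I was charged twice.",
    "I was charged twice.",
    "Refund has not arrived.",
    "My invoice shows the wrong amount.",
])

def fake_generate_one(category):
    return next(_canned)

tickets = collect_unique(fake_generate_one, "billing", target=3, max_attempts=10)
print(len(tickets))
```

Capping attempts keeps a generator that keeps repeating itself from looping, and spending, indefinitely.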
Generation strategies
Pick the strategy that fits your gap:
Category-driven (generate N per category to fix imbalance):
```python
for category in categories:
    for i in range(50):
        result = generator(category=category)
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
```
Seed-and-vary (pass a seed example with a variation instruction):
```python
class GenerateVariation(dspy.Signature):
    """Generate a variation of this support ticket with a different tone and phrasing."""
    original_ticket: str = dspy.InputField(desc="the original ticket to vary")
    variation_type: str = dspy.InputField(desc="how to vary it: tone, length, complexity, or language")
    ticket_text: str = dspy.OutputField(desc="a new ticket with the same meaning but different style")

vary = dspy.Predict(GenerateVariation)
for seed in seeds:
    for variation in ["angry tone", "very brief", "verbose and detailed", "non-native English"]:
        result = vary(original_ticket=seed.ticket_text, variation_type=variation)
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=seed.category).with_inputs("ticket_text"))
```
Scenario-driven (specify edge case scenarios):
```python
class GenerateScenarioTicket(dspy.Signature):
    """Generate a support ticket matching a specific scenario."""
    category: str = dspy.InputField()
    scenario: str = dspy.InputField(desc="the specific scenario to generate")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateScenarioTicket)
scenarios = [
    ("billing", "customer charged in wrong currency"),
    ("billing", "refund for a cancelled subscription"),
    ("bug", "issue only happens on slow network connections"),
    ("bug", "multi-step reproduction involving two features"),
    ("how-to", "customer is non-technical and confused by jargon"),
]
for category, scenario in scenarios:
    result = gen(category=category, scenario=scenario)
    examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
```
Difficulty-driven (generate easy, medium, hard examples separately):
```python
class GenerateByDifficulty(dspy.Signature):
    """Generate a support ticket at a specific difficulty level for classification."""
    category: str = dspy.InputField()
    difficulty: str = dspy.InputField(desc="easy (clear-cut), medium (some ambiguity), or hard (could be multiple categories)")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateByDifficulty)
for category in categories:
    for difficulty in ["easy", "medium", "hard"]:
        for i in range(15):
            result = gen(category=category, difficulty=difficulty)
            examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
```
Diversity trick (add a random `sindex` field to push the LM toward varied outputs):
```python
import random

class GenerateDiverse(dspy.Signature):
    """Generate a unique and realistic support ticket."""
    category: str = dspy.InputField()
    sindex: str = dspy.InputField(desc="a unique seed index for diversity")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateDiverse)
for category in categories:
    for i in range(50):
        result = gen(category=category, sindex=str(random.randint(0, 1_000_000)))
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))
```
The random `sindex` prevents the LM from falling into repetitive patterns.
Step 4: Filter for quality
Generated data always contains some bad examples. Filter aggressively — aim to generate 2-3x what you need and keep ~50%.
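As a rough worked example of this guidance: if about half of the generated examples survive filtering, a target of 200 kept examples implies generating about 400. The helper below is a hypothetical convenience for that arithmetic, not part of DSPy, and the keep rate is an estimate you would adjust after a first filtering pass.

```python
import math

def generation_budget(target, keep_rate=0.5):
    """Number of examples to generate so that about `target` survive filtering."""
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    return math.ceil(target / keep_rate)

print(generation_budget(200))       # 400 when half of the examples are kept
print(generation_budget(50, 0.25))  # 200 when only a quarter are kept
```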
Simple: metric-based filtering
Run each generated example through your task program and check with your metric:
```python
program = dspy.ChainOfThought(ClassifyTicket)
filtered = []
for ex in examples:
    pred = program(**ex.inputs())
    if metric(ex, pred):
        filtered.append(ex)

print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")
```
This works when your program is already decent: it filters out examples that are confusing or mislabeled.
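The filtering snippet relies on a `metric` function that this guide does not define. One possible definition for the ticket classifier is a case-insensitive exact match on the category. This is an assumption about what "correct" means for your task, and the `SimpleNamespace` objects below are stand-ins for a `dspy.Example` and a prediction.

```python
from types import SimpleNamespace

def metric(example, prediction, trace=None):
    """Return True when the predicted category matches the expected one."""
    expected = str(example.category).strip().lower()
    predicted = str(prediction.category).strip().lower()
    return expected == predicted

# Stand-ins for a labeled example and two predictions.
example = SimpleNamespace(ticket_text="I was charged twice.", category="billing")
good_pred = SimpleNamespace(category=" Billing ")
bad_pred = SimpleNamespace(category="bug")

print(metric(example, good_pred))  # True
print(metric(example, bad_pred))   # False
```

The `(example, prediction, trace=None)` shape matches the metric signature used later in this guide, so the same function can be passed to an optimizer or evaluator.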
Robust: LM-based assessment
Use a separate assessment step to check realism and correctness:
```python
class AssessExample(dspy.Signature):
    """Is this a realistic and correctly labeled example?"""
    ticket_text: str = dspy.InputField()
    category: str = dspy.InputField()
    is_realistic: bool = dspy.OutputField(desc="true if this looks like a real support ticket")
    is_correctly_labeled: bool = dspy.OutputField(desc="true if the category matches the ticket")

assessor = dspy.Predict(AssessExample)
filtered = []
for ex in examples:
    result = assessor(ticket_text=ex.ticket_text, category=ex.category)
    if result.is_realistic and result.is_correctly_labeled:
        filtered.append(ex)

print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")
```
Quality gates with `dspy.Suggest`
For tighter integration, build quality checks into the generator itself. When a `Suggest` constraint fails, DSPy retries the generation:
```python
class QualityGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateTicketExample)
        self.assess = dspy.Predict(AssessExample)

    def forward(self, category):
        result = self.generate(category=category)
        assessment = self.assess(ticket_text=result.ticket_text, category=category)
        # Assumes a DSPy version that still ships dspy.Suggest; newer releases
        # offer dspy.Refine and dspy.BestOfN for retry-style checks instead.
        dspy.Suggest(assessment.is_realistic, "Generated ticket should be realistic")
        dspy.Suggest(assessment.is_correctly_labeled, "Category label should be correct")
        return result

generator = QualityGenerator()
# DSPy retries generation when Suggest constraints fail
```
Check for duplicates
Remove duplicates to keep your dataset diverse. The check below normalizes case and surrounding whitespace, so it catches repeated tickets rather than paraphrases:
```python
seen = set()
unique = []
for ex in filtered:
    # Normalize and check
    key = ex.ticket_text.strip().lower()
    if key not in seen:
        seen.add(key)
        unique.append(ex)

print(f"Removed {len(filtered) - len(unique)} duplicates")
filtered = unique
```
Step 5: Optimize the generator itself (advanced)
Research (arXiv 2406.11706) shows that optimizing the prompt used to generate data dramatically improves downstream quality. This is meta-optimization: optimizing the generator so it produces better training data.
```python
class DataGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateTicketExample)

    def forward(self, category):
        return self.generate(category=category)

# Define a metric that measures generated data quality
def generator_metric(example, prediction, trace=None):
    # Check if a downstream classifier gets the right answer on this generated example
    classifier = dspy.Predict(ClassifyTicket)
    task_example = dspy.Example(ticket_text=prediction.ticket_text, category=example.category).with_inputs("ticket_text")
    task_pred = classifier(**task_example.inputs())
    return task_pred.category.lower() == example.category.lower()

# The generator takes `category` as its input, so re-mark the seeds accordingly
# (the original seeds declare `ticket_text` as the input field).
generator_trainset = [
    dspy.Example(category=s.category, ticket_text=s.ticket_text).with_inputs("category")
    for s in seeds
]

# Optimize the generator's prompts
optimizer = dspy.BootstrapFewShot(metric=generator_metric)
optimized_generator = optimizer.compile(DataGenerator(), trainset=generator_trainset)

# Now generate with the optimized generator
better_examples = []
for category in categories:
    for i in range(50):
        result = optimized_generator(category=category)
        better_examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text")
        )
```
This closes the loop: better generator prompts produce better data, which produces better task programs.
Step 6: Use generated data for optimization
Full pipeline: generate, filter, split, optimize, evaluate.
```python
import random
from dspy.evaluate import Evaluate

# Shuffle and split
random.shuffle(filtered)
split = int(len(filtered) * 0.8)
trainset = filtered[:split]
devset = filtered[split:]
print(f"Train: {len(trainset)}, Dev: {len(devset)}")

# Configure your task LM (can be cheaper than the generator LM)
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Build and optimize your task program
program = dspy.ChainOfThought(ClassifyTicket)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)

# Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
score = evaluator(optimized)
print(f"Score on synthetic dev set: {score:.1f}%")

# Save
optimized.save("optimized_program.json")
```
If you have even a small number of real examples, use them as the dev set instead; real data gives more trustworthy evaluation.
Common scenarios
Cold start: zero real data. Write 5-10 seeds. Generate 200+ synthetic examples across all categories. Filter and optimize. See examples.md for a full walkthrough.
Edge case gaps: your AI works at 85% but fails on specific scenarios. Run error analysis, identify the failure patterns, then use scenario-driven generation targeting those gaps. Re-optimize with the augmented dataset.
Privacy/compliance: can't use real customer data. Generate synthetic examples with realistic patterns but no PII. Validate with domain-specific assessments. The `dspy.Suggest` quality gate pattern helps ensure generated data meets your standards.
New categories: added a category with no examples. Use category-driven generation to produce 50+ examples for the new category, then retrain.
Rebalancing: some categories have 500 examples, others have 10. Generate more for underrepresented categories until all are roughly balanced.
Schema changed: your input/output format changed. Generate new examples matching the new schema rather than manually converting old data.
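For the rebalancing scenario, the number of examples to generate per category can be computed from the current label counts. A small sketch, assuming the labels are available as a plain list of category strings; `rebalance_plan` is a hypothetical helper name, and the default target (the size of the largest category) is one choice among several.

```python
from collections import Counter

def rebalance_plan(labels, target_per_category=None):
    """Map each category to how many extra examples it needs to reach the target.

    If no target is given, the size of the largest category is used.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    target = target_per_category or max(counts.values())
    return {category: max(0, target - n) for category, n in counts.items()}

labels = ["billing"] * 500 + ["bug"] * 10 + ["how-to"] * 120

print(rebalance_plan(labels))
print(rebalance_plan(labels, target_per_category=200))
```

Each value in the returned dictionary can then serve as the loop count for category-driven generation, scaled up by however much you expect filtering to discard.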
Tips and pitfalls
- Always validate generated data — LMs produce plausible but wrong labels. Filter aggressively.
- Mix synthetic with real data when available — even 20 real examples mixed in improve quality significantly.
- Use a stronger model to generate, cheaper model for your task — e.g., generate with GPT-4o, run your task on GPT-4o-mini.
- Generate more than you need — aim for 2-3x your target, keep ~50% after filtering.
- Check for duplicates — LMs tend to repeat themselves, especially without the diversity trick.
- Iterate — generate, optimize, evaluate, identify gaps, generate more for gaps.
- Don't trust synthetic eval scores blindly — if possible, validate final quality on real data.
- The `n` parameter for batch generation isn't supported by all providers; use the loop pattern as a reliable fallback.
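The duplicate check in Step 4 only catches repeats that are identical after lowercasing and trimming. To catch lightly reworded repeats as well, one option is a similarity threshold. The sketch below uses `difflib.SequenceMatcher` from the standard library; the 0.9 threshold and the `drop_near_duplicates` name are assumptions to tune, and the pairwise comparison is quadratic, so it is better suited to smaller datasets.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(texts, threshold=0.9):
    """Keep the first occurrence of each text; drop later ones that are too similar."""
    kept = []  # list of (original_text, normalized_text) pairs
    for text in texts:
        normalized = " ".join(text.lower().split())
        is_duplicate = any(
            SequenceMatcher(None, normalized, other).ratio() >= threshold
            for _, other in kept
        )
        if not is_duplicate:
            kept.append((text, normalized))
    return [original for original, _ in kept]

tickets = [
    "I was charged twice for my subscription this month.",
    "I was charged twice for my subscription this month!",
    "The app crashes when I upload a profile photo.",
]

# The second ticket differs from the first only in punctuation and is dropped.
print(drop_near_duplicates(tickets))
```

For larger datasets, an embedding-based or hashing-based approach would likely scale better than comparing every pair.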
Additional resources
- For end-to-end worked examples (cold start, edge cases, privacy), see examples.md
- Use /ai-improving-accuracy to measure and improve your optimized program
- Use /ai-fine-tuning once you have enough generated data for weight optimization
- Use /ai-kickoff to scaffold a project, then fill data with this skill