ai-generating-data

Generate Synthetic Training Data


Guide the user through generating high-quality synthetic training data with DSPy. This solves the "I don't have data" problem that blocks every other AI workflow.


When you need synthetic data


Cold start: You're building a new feature and have zero labeled examples
Not enough for optimization: You have 10-30 examples but optimizers need 200+
Privacy/compliance: You can't use real customer data for training
Edge cases: Your AI works on common inputs but fails on rare ones
Unbalanced categories: Some categories have 500 examples, others have 10
New categories: You added a category and have no examples for it
Schema changed: Your input/output format changed, old data doesn't fit
Proof of concept: PM wants a demo by Friday, no time to collect real data


The core idea


Define a generator signature whose outputs match your task's input/output fields. Use an LM to produce examples. Filter for quality. Use for optimization.

Research shows this works surprisingly well:

Optimized generator prompts match models trained on 100K+ human labels using only 10 gold labels (arXiv 2406.11706)
DSPy-optimized Chain-of-Thought generation outperforms hand-written static templates (arXiv 2508.13930)

The key insight: the prompt used to generate data is a critical hyperparameter — optimizing it matters more than generating more data.


Step 1: Define what an example looks like


Your generator's outputs should match your task's inputs and expected outputs.

python

import dspy

# Your task: what the AI will do in production
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket into a category."""
    ticket_text: str = dspy.InputField()
    category: str = dspy.OutputField()

# Generator: produces examples for your task
class GenerateTicketExample(dspy.Signature):
    """Generate a realistic support ticket with its correct category."""
    category: str = dspy.InputField(desc="the target category to generate an example for")
    ticket_text: str = dspy.OutputField(desc="a realistic support ticket for this category")


The generator's output fields become inputs to your task. Think of it as: "given what I want the answer to be, generate a realistic input."



Multi-field tasks


If your task has multiple inputs or outputs, mirror all of them:

python

# Task: extract structured data from text
class ExtractContact(dspy.Signature):
    """Extract contact info from a message."""
    message: str = dspy.InputField()
    name: str = dspy.OutputField()
    email: str = dspy.OutputField()
    phone: str = dspy.OutputField()

# Generator: produce realistic messages with known contact info
class GenerateContactExample(dspy.Signature):
    """Generate a realistic message that contains contact information."""
    name: str = dspy.InputField(desc="the person's name to embed in the message")
    email: str = dspy.InputField(desc="the email address to embed in the message")
    phone: str = dspy.InputField(desc="the phone number to embed in the message")
    message: str = dspy.OutputField(desc="a realistic message containing this contact info")

Step 2: Write seed examples


Start with 5-10 hand-written examples. These anchor the generator's understanding of what "realistic" means for your domain.

python

seeds = [
    dspy.Example(
        ticket_text="I was charged twice for my subscription this month. Order #4521.",
        category="billing"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="The app crashes when I try to upload a profile photo on Android.",
        category="bug"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="How do I export my data to CSV? I can't find the option anywhere.",
        category="how-to"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="I'd love to see dark mode added. The white background hurts my eyes.",
        category="feature-request"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="My account got locked after too many login attempts. Please help.",
        category="account"
    ).with_inputs("ticket_text"),
]

Even 5 seeds dramatically improve generation quality over zero.
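Before generating, it can help to confirm that every category has at least one seed. The sketch below is a minimal, standard-library check; `seed_pairs` and `missing_categories` are hypothetical names, and the plain tuples stand in for the `dspy.Example` seeds above.

```python
from collections import Counter

# Hypothetical stand-ins for the seed examples: (ticket_text, category) pairs.
seed_pairs = [
    ("I was charged twice for my subscription this month.", "billing"),
    ("The app crashes when I upload a profile photo.", "bug"),
    ("How do I export my data to CSV?", "how-to"),
]
categories = ["billing", "bug", "how-to", "feature-request", "account"]


def missing_categories(pairs, expected):
    """Return the expected categories that have no seed example, in order."""
    counts = Counter(category for _, category in pairs)
    return [c for c in expected if counts[c] == 0]


missing = missing_categories(seed_pairs, categories)
print(missing)  # ['feature-request', 'account']
```

Any category reported here is one the generator has no anchor for, so it is a good place to write another seed by hand.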


Step 3: Generate in batches

Two patterns depending on your LM provider:

Pattern A: `n=N` batch generation

When your provider supports the `n` parameter (OpenAI does), this generates multiple completions in one call, which is faster and often more diverse:

python

generator = dspy.Predict(GenerateTicketExample, n=20)
response = generator(category="billing")
examples = [
    dspy.Example(ticket_text=t, category="billing").with_inputs("ticket_text")
    for t in response.completions.ticket_text
]


Pattern B: Loop generation


Works with any provider. More control over each example:

python

examples = []
categories = ["billing", "bug", "how-to", "feature-request", "account"]

for category in categories:
    generator = dspy.Predict(GenerateTicketExample)
    for i in range(40):
        result = generator(category=category)
        examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category)
            .with_inputs("ticket_text")
        )

print(f"Generated {len(examples)} examples")

The `n` parameter isn't supported by all providers; use the loop pattern as a reliable fallback.

Generation strategies


Pick the strategy that fits your gap:

Category-driven — generate N per category (fixes imbalance):

python

for category in categories:
    for i in range(50):
        result = generator(category=category)
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))

Seed-and-vary — pass a seed example with a variation instruction:

python

class GenerateVariation(dspy.Signature):
    """Generate a variation of this support ticket with a different tone and phrasing."""
    original_ticket: str = dspy.InputField(desc="the original ticket to vary")
    variation_type: str = dspy.InputField(desc="how to vary it: tone, length, complexity, or language")
    ticket_text: str = dspy.OutputField(desc="a new ticket with the same meaning but different style")

vary = dspy.Predict(GenerateVariation)
for seed in seeds:
    for variation in ["angry tone", "very brief", "verbose and detailed", "non-native English"]:
        result = vary(original_ticket=seed.ticket_text, variation_type=variation)
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=seed.category).with_inputs("ticket_text"))

Scenario-driven — specify edge case scenarios:

python

class GenerateScenarioTicket(dspy.Signature):
    """Generate a support ticket matching a specific scenario."""
    category: str = dspy.InputField()
    scenario: str = dspy.InputField(desc="the specific scenario to generate")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateScenarioTicket)
scenarios = [
    ("billing", "customer charged in wrong currency"),
    ("billing", "refund for a cancelled subscription"),
    ("bug", "issue only happens on slow network connections"),
    ("bug", "multi-step reproduction involving two features"),
    ("how-to", "customer is non-technical and confused by jargon"),
]
for category, scenario in scenarios:
    result = gen(category=category, scenario=scenario)
    examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))

Difficulty-driven — generate easy, medium, hard examples separately:

python

class GenerateByDifficulty(dspy.Signature):
    """Generate a support ticket at a specific difficulty level for classification."""
    category: str = dspy.InputField()
    difficulty: str = dspy.InputField(desc="easy (clear-cut), medium (some ambiguity), or hard (could be multiple categories)")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateByDifficulty)
for category in categories:
    for difficulty in ["easy", "medium", "hard"]:
        for i in range(15):
            result = gen(category=category, difficulty=difficulty)
            examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))

Diversity trick: add a random `sindex` field to push the LM toward varied outputs:

python

import random

class GenerateDiverse(dspy.Signature):
    """Generate a unique and realistic support ticket."""
    category: str = dspy.InputField()
    sindex: str = dspy.InputField(desc="a unique seed index for diversity")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateDiverse)
for category in categories:
    for i in range(50):
        result = gen(category=category, sindex=str(random.randint(0, 1_000_000)))
        examples.append(dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text"))

The random `sindex` prevents the LM from falling into repetitive patterns.


Step 4: Filter for quality


Generated data always contains some bad examples. Filter aggressively — aim to generate 2-3x what you need and keep ~50%.
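The "generate 2-3x, keep ~50%" rule can be turned into a per-category budget. The helper below is a rough sketch of that arithmetic; `per_category_budget` and its parameters are illustrative names, and the 50% keep rate is the guide's rule of thumb, not a measured value.

```python
import math


def per_category_budget(target_total, num_categories, keep_rate=0.5):
    """Raw examples to generate per category so that roughly
    target_total examples survive filtering at the given keep rate."""
    if not 0 < keep_rate <= 1:
        raise ValueError("keep_rate must be in (0, 1]")
    raw_total = math.ceil(target_total / keep_rate)
    return math.ceil(raw_total / num_categories)


# 200 usable examples across 5 categories at a 50% keep rate.
budget = per_category_budget(target_total=200, num_categories=5, keep_rate=0.5)
print(budget)  # 80
```

After a first filtering pass, replace `keep_rate` with the fraction you actually observed and recompute.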


Simple: metric-based filtering


Run each generated example through your task program and check with your metric:

python

program = dspy.ChainOfThought(ClassifyTicket)
filtered = []

for ex in examples:
    pred = program(**ex.inputs())
    if metric(ex, pred):
        filtered.append(ex)

print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")

This works when your program is already decent — it filters out examples that are confusing or mislabeled.
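The filtering loop above relies on a `metric` function that the guide does not define. One plausible definition for the ticket classifier is an exact match on the category after normalization. This is an assumption, not the guide's own metric, and `SimpleNamespace` objects stand in for DSPy examples and predictions here.

```python
from types import SimpleNamespace


def metric(example, prediction, trace=None):
    """Exact match on category, ignoring case and surrounding whitespace."""
    expected = str(example.category).strip().lower()
    predicted = str(getattr(prediction, "category", "")).strip().lower()
    return expected == predicted


# Stand-in objects exposing the same attribute the metric reads.
example = SimpleNamespace(ticket_text="I was charged twice.", category="billing")
print(metric(example, SimpleNamespace(category=" Billing ")))  # True
print(metric(example, SimpleNamespace(category="bug")))        # False
```

The `(example, prediction, trace=None)` signature mirrors the one used by `generator_metric` in Step 5, so the same function can be reused in the filter, the optimizer, and the evaluator.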


Robust: LM-based assessment


Use a separate assessment step to check realism and correctness:

python

class AssessExample(dspy.Signature):
    """Is this a realistic and correctly labeled example?"""
    ticket_text: str = dspy.InputField()
    category: str = dspy.InputField()
    is_realistic: bool = dspy.OutputField(desc="true if this looks like a real support ticket")
    is_correctly_labeled: bool = dspy.OutputField(desc="true if the category matches the ticket")

assessor = dspy.Predict(AssessExample)
filtered = []

for ex in examples:
    result = assessor(ticket_text=ex.ticket_text, category=ex.category)
    if result.is_realistic and result.is_correctly_labeled:
        filtered.append(ex)

print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")


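Quality gates with `dspy.Suggest`

For tighter integration, build quality checks into the generator itself. When a `Suggest` constraint fails, DSPy retries the generation. Note that `dspy.Suggest` belongs to the older assertions API; DSPy 2.6 and later replace it with `dspy.Refine` and `dspy.BestOfN`, so check which one your installed version provides: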

python

class QualityGenerator(dspy.Module):
    def __init__(self):
        self.generate = dspy.ChainOfThought(GenerateTicketExample)
        self.assess = dspy.Predict(AssessExample)

    def forward(self, category):
        result = self.generate(category=category)
        assessment = self.assess(ticket_text=result.ticket_text, category=category)
        dspy.Suggest(assessment.is_realistic, "Generated ticket should be realistic")
        dspy.Suggest(assessment.is_correctly_labeled, "Category label should be correct")
        return result

generator = QualityGenerator()


Check for duplicates


Remove duplicates to keep your dataset diverse. The check below only catches tickets that are identical after trimming whitespace and lowercasing:

python

seen = set()
unique = []
for ex in filtered:
    # Normalize and check
    key = ex.ticket_text.strip().lower()
    if key not in seen:
        seen.add(key)
        unique.append(ex)

print(f"Removed {len(filtered) - len(unique)} near-duplicates")
filtered = unique
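The set-based check only removes tickets that are identical after normalization. To catch tickets that differ by a word or a punctuation mark, a similarity threshold is one option. The sketch below uses the standard library's `difflib.SequenceMatcher`; the function name `drop_near_duplicates` and the 0.9 threshold are assumptions to tune for your data.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.9):
    """Keep the first occurrence of each text; drop later texts whose
    similarity to any kept text reaches the threshold."""
    kept = []  # (original text, normalized text) pairs
    for text in texts:
        norm = " ".join(text.lower().split())
        is_new = all(
            SequenceMatcher(None, norm, kept_norm).ratio() < threshold
            for _, kept_norm in kept
        )
        if is_new:
            kept.append((text, norm))
    return [original for original, _ in kept]


tickets = [
    "I was charged twice for my subscription this month.",
    "I was charged twice for my subscription this month!",
    "The app crashes when I upload a photo.",
]
unique_tickets = drop_near_duplicates(tickets)
print(len(unique_tickets))  # 2
```

This compares every candidate against every kept text, so the cost grows quadratically. That is fine for a few thousand examples; larger sets would need a hashing or embedding-based approach.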


Step 5: Optimize the generator itself (advanced)


Research (arXiv 2406.11706) shows that optimizing the prompt used to generate data dramatically improves downstream quality. This is meta-optimization: optimizing the generator so it produces better training data.

python

class DataGenerator(dspy.Module):
    def __init__(self):
        self.generate = dspy.ChainOfThought(GenerateTicketExample)

    def forward(self, category):
        return self.generate(category=category)

# Define a metric that measures generated data quality
def generator_metric(example, prediction, trace=None):
    # Check if a downstream classifier gets the right answer on this generated example
    classifier = dspy.Predict(ClassifyTicket)
    task_example = dspy.Example(ticket_text=prediction.ticket_text, category=example.category).with_inputs("ticket_text")
    task_pred = classifier(**task_example.inputs())
    return task_pred.category.lower() == example.category.lower()

# The generator takes `category` as its input, so re-mark the seeds accordingly
# (the seeds from Step 2 mark `ticket_text` as the input, which suits the task, not the generator)
generator_trainset = [
    dspy.Example(category=s.category, ticket_text=s.ticket_text).with_inputs("category")
    for s in seeds
]

# Optimize the generator's prompts
optimizer = dspy.BootstrapFewShot(metric=generator_metric)
optimized_generator = optimizer.compile(DataGenerator(), trainset=generator_trainset)

# Now generate with the optimized generator
better_examples = []
for category in categories:
    for i in range(50):
        result = optimized_generator(category=category)
        better_examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text")
        )


This closes the loop: better generator prompts produce better data, which produces better task programs.



Step 6: Use generated data for optimization


Full pipeline: generate, filter, split, optimize, evaluate.

python

import random
from dspy.evaluate import Evaluate

# Shuffle and split
random.shuffle(filtered)
split = int(len(filtered) * 0.8)
trainset = filtered[:split]
devset = filtered[split:]

print(f"Train: {len(trainset)}, Dev: {len(devset)}")

# Configure your task LM (can be cheaper than the generator LM)
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Build and optimize your task program
program = dspy.ChainOfThought(ClassifyTicket)

optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)

# Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
score = evaluator(optimized)
print(f"Score on synthetic dev set: {score:.1f}%")

# Save
optimized.save("optimized_program.json")


If you have even a small number of real examples, use them as the dev set instead — real data gives more trustworthy evaluation.
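A plain shuffle-and-slice split can leave a small category entirely out of the dev set. If that matters for your evaluation, a per-category (stratified) split is one alternative. The sketch below is standard-library only; `stratified_split` and `label_of` are hypothetical names, and the tuples stand in for `dspy.Example` objects.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, train_frac=0.8, seed=0):
    """Split items into (train, dev), splitting each label group separately.
    Groups with two or more items contribute to both sides."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, dev = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        if len(group) >= 2:
            # Keep at least one item on each side of the split.
            cut = min(max(cut, 1), len(group) - 1)
        train.extend(group[:cut])
        dev.extend(group[cut:])
    return train, dev


items = [(f"ticket {i}", "billing") for i in range(10)]
items += [(f"ticket {i}", "bug") for i in range(10, 15)]
train, dev = stratified_split(items, label_of=lambda pair: pair[1])
print(len(train), len(dev))  # 12 3
```

With `dspy.Example` objects, `label_of=lambda ex: ex.category` would play the same role. A category with a single example ends up only in the dev set, which is another reason to generate a minimum number of examples per category.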


Common scenarios


Cold start: zero real data. Write 5-10 seeds. Generate 200+ synthetic examples across all categories. Filter and optimize. See examples.md for a full walkthrough.

Edge case gaps: your AI works at 85% but fails on specific scenarios. Run error analysis, identify the failure patterns, then use scenario-driven generation targeting those gaps. Re-optimize with the augmented dataset.

Privacy/compliance: can't use real customer data. Generate synthetic examples with realistic patterns but no PII. Validate with domain-specific assessments. The `dspy.Suggest` quality gate pattern helps keep generated data within your standards.

New categories: added a category with no examples. Use category-driven generation to produce 50+ examples for the new category, then retrain.

Rebalancing: some categories have 500 examples, others have 10. Generate more for underrepresented categories until all are roughly balanced.

Schema changed: your input/output format changed. Generate new examples matching the new schema rather than manually converting old data.
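For the rebalancing scenario, the number of examples to generate per category follows directly from the current label counts. The helper below is a small sketch of that calculation; `generation_deficit` is a hypothetical name, and by default it balances every category up to the size of the largest one.

```python
from collections import Counter


def generation_deficit(labels, target=None):
    """Map each underrepresented category to the number of extra examples
    needed to reach the target (default: size of the largest category)."""
    counts = Counter(labels)
    if not counts:
        return {}
    goal = target if target is not None else max(counts.values())
    return {category: goal - n for category, n in counts.items() if n < goal}


labels = ["billing"] * 500 + ["bug"] * 120 + ["account"] * 10
deficit = generation_deficit(labels)
print(deficit)  # {'bug': 380, 'account': 490}
```

Since filtering discards part of what is generated, treat these numbers as the post-filter goal and generate correspondingly more.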


Tips and pitfalls


Always validate generated data — LMs produce plausible but wrong labels. Filter aggressively.
Mix synthetic with real data when available — even 20 real examples mixed in improve quality significantly.
Use a stronger model to generate, cheaper model for your task — e.g., generate with GPT-4o, run your task on GPT-4o-mini.
Generate more than you need — aim for 2-3x your target, keep ~50% after filtering.
Check for duplicates — LMs tend to repeat themselves, especially without the diversity trick.
Iterate — generate, optimize, evaluate, identify gaps, generate more for gaps.
Don't trust synthetic eval scores blindly — if possible, validate final quality on real data.
The `n` parameter for batch generation isn't supported by all providers; use the loop pattern as a reliable fallback.


Additional resources


For end-to-end worked examples (cold start, edge cases, privacy), see examples.md
Use `/ai-improving-accuracy` to measure and improve your optimized program
Use `/ai-fine-tuning` once you have enough generated data for weight optimization
Use `/ai-kickoff` to scaffold a project, then fill data with this skill
