generate-synthetic-data

Generate Synthetic Data

Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline.

Prerequisites

Before generating synthetic data, identify where the pipeline is likely to fail. Ask the user about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.

Core Process

Step 1: Define Dimensions

Dimensions are axes of variation specific to your application. Choose dimensions based on where you expect failures.
Dimension 1: [Name] — [What it captures]
  Values: [value_a, value_b, value_c, ...]

Dimension 2: [Name] — [What it captures]
  Values: [value_a, value_b, value_c, ...]

Dimension 3: [Name] — [What it captures]
  Values: [value_a, value_b, value_c, ...]
Example for a real estate assistant:
Feature: what task the user wants
  Values: [property search, scheduling, email drafting]

Client Persona: who the user serves
  Values: [first-time buyer, investor, luxury buyer]

Scenario Type: query clarity
  Values: [well-specified, ambiguous, out-of-scope]
Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
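
One reason to start small is that the number of possible tuples is the product of the value counts, so each added dimension multiplies the space. A minimal sketch using the real estate example; the dict layout and the helper name `all_combinations` are illustrative choices, not part of any required format:

```python
from itertools import product

# Dimensions from the real estate example above.
DIMENSIONS = {
    "Feature": ["property search", "scheduling", "email drafting"],
    "Client Persona": ["first-time buyer", "investor", "luxury buyer"],
    "Scenario Type": ["well-specified", "ambiguous", "out-of-scope"],
}


def all_combinations(dimensions):
    """Return every combination of dimension values as a list of dicts."""
    names = list(dimensions)
    return [dict(zip(names, values)) for values in product(*dimensions.values())]


combos = all_combinations(DIMENSIONS)
print(len(combos))  # 3 * 3 * 3 = 27 possible tuples
```

With three dimensions of three values each, the full space is 27 tuples. Not every combination is realistic, which is why the tuples still need review.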

Step 2: Draft 20 Tuples with the User

A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the user and iterate until they confirm the tuples reflect realistic scenarios. The user's domain knowledge is essential here: they know which combinations actually occur and which are unrealistic. For example:
(Feature: Property Search, Persona: Investor, Scenario: Ambiguous)
(Feature: Scheduling, Persona: First-time Buyer, Scenario: Well-specified)
(Feature: Email Drafting, Persona: Luxury Buyer, Scenario: Out-of-scope)
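
Before showing drafts to the user, it can help to check mechanically that each tuple only uses values from the agreed dimensions. The sketch below assumes tuples are stored as dicts and compared case-insensitively; `validate_tuple` is a hypothetical helper. It checks well-formedness only. Whether a combination is realistic remains the user's call.

```python
# Allowed values per dimension, lowercased for case-insensitive comparison.
DIMENSIONS = {
    "Feature": {"property search", "scheduling", "email drafting"},
    "Persona": {"first-time buyer", "investor", "luxury buyer"},
    "Scenario": {"well-specified", "ambiguous", "out-of-scope"},
}


def validate_tuple(candidate, dimensions=DIMENSIONS):
    """Return a list of problems; an empty list means the tuple is well-formed."""
    problems = []
    for name, allowed in dimensions.items():
        if name not in candidate:
            problems.append(f"missing dimension: {name}")
        elif candidate[name].lower() not in allowed:
            problems.append(f"unknown {name} value: {candidate[name]}")
    for name in candidate:
        if name not in dimensions:
            problems.append(f"unexpected dimension: {name}")
    return problems


drafts = [
    {"Feature": "Property Search", "Persona": "Investor", "Scenario": "Ambiguous"},
    {"Feature": "Scheduling", "Persona": "First-time Buyer", "Scenario": "Well-specified"},
    {"Feature": "Email Drafting", "Persona": "Luxury Buyer", "Scenario": "Out-of-scope"},
]
for draft in drafts:
    print(draft, validate_tuple(draft))
```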

Step 3: Generate More Tuples with an LLM

Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
for a {your application description}.

The dimensions are:
{dim1}: {description}. Possible values: {values}
{dim2}: {description}. Possible values: {values}
{dim3}: {description}. Possible values: {values}

Output each tuple in the format: ({dim1}, {dim2}, {dim3})
Avoid duplicates. Vary values across dimensions.
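
The prompt above can be filled in and its output parsed programmatically. This is a sketch under stated assumptions: `build_tuple_prompt` and `parse_tuples` are hypothetical helpers, the model call itself is left out, and the parser assumes dimension values contain no commas or parentheses.

```python
import re

PROMPT_TEMPLATE = """Generate {n} random combinations of ({names})
for a {app}.

The dimensions are:
{dimension_lines}

Output each tuple in the format: ({names})
Avoid duplicates. Vary values across dimensions."""


def build_tuple_prompt(app, dimensions, n=10):
    """Fill the template. `dimensions` maps name -> (description, list of values)."""
    lines = [
        f"{name}: {desc}. Possible values: {', '.join(values)}"
        for name, (desc, values) in dimensions.items()
    ]
    return PROMPT_TEMPLATE.format(
        n=n, names=", ".join(dimensions), app=app, dimension_lines="\n".join(lines)
    )


def parse_tuples(text, n_dims):
    """Extract unique '(a, b, c)' tuples from a model reply, preserving order."""
    seen, result = set(), []
    for match in re.findall(r"\(([^()]*)\)", text):
        parts = tuple(part.strip() for part in match.split(","))
        key = tuple(part.lower() for part in parts)  # case-insensitive dedup key
        if len(parts) == n_dims and key not in seen:
            seen.add(key)
            result.append(parts)
    return result


dims = {
    "Feature": ("what task the user wants", ["property search", "scheduling", "email drafting"]),
    "Client Persona": ("who the user serves", ["first-time buyer", "investor", "luxury buyer"]),
    "Scenario Type": ("query clarity", ["well-specified", "ambiguous", "out-of-scope"]),
}
prompt = build_tuple_prompt("real estate assistant", dims)

# A made-up reply standing in for real model output.
reply = (
    "1. (scheduling, investor, ambiguous)\n"
    "2. (Scheduling, Investor, Ambiguous)\n"
    "3. (email drafting, luxury buyer)"
)
print(parse_tuples(reply, n_dims=3))
```

The parser drops the second line as a case-insensitive duplicate and the third because it has only two values.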

Step 4: Convert Each Tuple to a Natural Language Query

Use a separate prompt for this step. Single-step generation (tuples + queries together) produces repetitive phrasing.
We are generating synthetic user queries for a {your application}.
{Brief description of what it does.}

Given:
{dim1}: {value}
{dim2}: {value}
{dim3}: {value}

Write a realistic query that a user might enter. The query should
reflect the specified persona and scenario characteristics.

Example: "{one of your hand-written examples}"

Now generate a new query.
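
A sketch of running this second prompt once per tuple, which keeps query generation separate from tuple generation. The `generate` argument is a placeholder for whatever model client is in use. The application description and the example query below are invented for illustration.

```python
QUERY_TEMPLATE = """We are generating synthetic user queries for a {app}.
{description}

Given:
{given}

Write a realistic query that a user might enter. The query should
reflect the specified persona and scenario characteristics.

Example: "{example}"

Now generate a new query."""


def build_query_prompt(app, description, tuple_values, example):
    """Render the query prompt for one tuple (a dict of dimension -> value)."""
    given = "\n".join(f"{name}: {value}" for name, value in tuple_values.items())
    return QUERY_TEMPLATE.format(
        app=app, description=description, given=given, example=example
    )


def generate_queries(tuples, generate, **prompt_kwargs):
    """Call `generate` once per tuple so every query gets its own prompt."""
    return [
        {"tuple": t, "query": generate(build_query_prompt(tuple_values=t, **prompt_kwargs))}
        for t in tuples
    ]


def fake_generate(prompt):
    """Stand-in for a real model call; returns a placeholder string."""
    return f"[stub reply to a {len(prompt)}-character prompt]"


tuples = [
    {"Feature": "property search", "Client Persona": "investor", "Scenario Type": "ambiguous"},
    {"Feature": "scheduling", "Client Persona": "first-time buyer", "Scenario Type": "well-specified"},
]
results = generate_queries(
    tuples,
    fake_generate,
    app="real estate assistant",
    description="Helps agents search listings, schedule showings, and draft emails.",  # assumed
    example="Find me something with good rental potential near downtown",  # made up
)
for row in results:
    print(row["tuple"], "->", row["query"])
```

Storing the originating tuple next to each query makes it possible to check later whether the query still matches the tuple's intent.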

Step 5: Filter for Quality

Review generated queries. Discard and regenerate when:
  • Phrasing is awkward or unrealistic
  • Content doesn't match the tuple's intent
  • Queries are too similar to each other
Optional: use an LLM to rate realism on a 1-5 scale and discard queries that score below 3.
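
Two of these checks can be partly automated. The sketch below drops queries whose realism score is below 3 and queries that are near-duplicates of one already kept, using `difflib.SequenceMatcher` from the standard library. The 0.85 similarity cutoff is an assumed starting point, not a recommended value, and the sample queries and scores are invented. Whether a query matches its tuple's intent still needs a human or LLM reviewer.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def filter_queries(queries, scores=None, min_score=3, max_similarity=0.85):
    """Split queries into (kept, rejected).

    scores: optional list of 1-5 realism ratings aligned with `queries`.
    rejected: list of (query, reason) pairs, candidates for regeneration.
    """
    kept, rejected = [], []
    for i, query in enumerate(queries):
        if scores is not None and scores[i] < min_score:
            rejected.append((query, "low realism score"))
            continue
        is_duplicate = any(
            SequenceMatcher(None, _normalize(query), _normalize(other)).ratio() > max_similarity
            for other in kept
        )
        if is_duplicate:
            rejected.append((query, "too similar to a kept query"))
            continue
        kept.append(query)
    return kept, rejected


queries = [
    "Show me 3-bed homes under $500k in Austin",
    "show me 3-bed homes under $500k in Austin!",
    "Can you reschedule Friday's showing to next week?",
    "Property investment yield query request",
]
scores = [5, 5, 4, 2]  # illustrative ratings, e.g. from an LLM judge
kept, rejected = filter_queries(queries, scores)
print(kept)
print(rejected)
```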

Step 6: Run Queries Through the Pipeline

Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
Target: ~100 high-quality, diverse traces. This is a rough heuristic for reaching saturation (where new traces stop revealing new failure categories). The number depends on system complexity.
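
One possible shape for a captured trace, plus a simple saturation signal, is sketched below. The `Trace` fields mirror the list above but the schema is an assumption, as are the JSONL storage format and the failure-category labels, which would come from reviewing the traces afterwards.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Trace:
    query: str                                         # pipeline input
    steps: list = field(default_factory=list)          # intermediate steps, in order
    tool_calls: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    output: str = ""                                   # final output


def write_traces(traces, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for trace in traces:
            f.write(json.dumps(asdict(trace), ensure_ascii=False) + "\n")


def traces_since_new_category(labels_per_trace):
    """Count how many of the most recent traces revealed no new failure category."""
    seen, since = set(), 0
    for labels in labels_per_trace:
        new = set(labels) - seen
        seen |= new
        since = 0 if new else since + 1
    return since


# Hypothetical failure labels assigned while reviewing six traces in order.
reviewed = [
    ["hallucinated listing"], [], ["hallucinated listing"],
    ["wrong date"], [], ["wrong date"],
]
print(traces_since_new_category(reviewed))  # 2: the last two traces added nothing new
```

A long run of traces with no new category suggests saturation. How long a run is enough depends on system complexity.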

Sampling Real User Data

When you have real queries available, don't sample randomly. Use stratified sampling:
  1. Identify high-variance dimensions — read through queries and find ways they differ (length, topic, complexity, presence of constraints).
  2. Assign labels — for small sets, with the user; for large sets, use K-means clustering on query embeddings.
  3. Sample from each group — ensures coverage across query types, not just the most common ones.
When both real and synthetic data are available, use synthetic data to fill gaps in underrepresented query types.
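
The sampling step can be sketched with the standard library once every query has a group label, whether assigned with the user or taken from cluster IDs. `stratified_sample` is a hypothetical helper and the queries are placeholders; the clustering itself is not shown.

```python
import random
from collections import defaultdict


def stratified_sample(queries, labels, per_group, seed=0):
    """Sample up to `per_group` queries from each label group."""
    groups = defaultdict(list)
    for query, label in zip(queries, labels):
        groups[label].append(query)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        label: rng.sample(groups[label], min(per_group, len(groups[label])))
        for label in sorted(groups)
    }


# Placeholder data: one dominant query type and two rare ones.
queries = [f"search query {i}" for i in range(10)] + [
    "scheduling query 0", "scheduling query 1", "out-of-scope query 0",
]
labels = ["search"] * 10 + ["scheduling"] * 2 + ["out-of-scope"]

sample = stratified_sample(queries, labels, per_group=2)
for label, picked in sample.items():
    print(label, picked)

# Groups with fewer real queries than requested are candidates for synthetic fill-in.
underfilled = [label for label, picked in sample.items() if len(picked) < 2]
print(underfilled)
```

Random sampling from this set would mostly return search queries. Sampling per group guarantees the rare types appear.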

Anti-Patterns

  • Unstructured generation. Prompting "give me test queries" without the dimension/tuple structure produces generic, repetitive, happy-path examples.
  • Single-step generation. Generating tuples and queries in one prompt produces less diverse results than the two-step separation.
  • Arbitrary dimensions. Dimensions that don't target failure-prone regions waste test budget.
  • Skipping user review of tuples. Without the user validating tuples first, you can't judge whether LLM-generated tuples are realistic.
  • Synthetic data when no one can judge realism. If no one can judge whether a synthetic trace is realistic, use real data instead.
  • Synthetic data for complex domain-specific content (legal filings, medical records) where LLMs miss structural nuance.
  • Synthetic data for low-resource languages or dialects where LLM-generated samples are unrealistic.