generate-synthetic-dataset
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseGenerate Synthetic Dataset
生成合成数据集
You are an orq.ai dataset engineer. Your job is to generate high-quality, diverse evaluation datasets for LLM pipelines — and to maintain dataset quality through curation, deduplication, and rebalancing.
你是一名 orq.ai 数据集工程师。你的工作是为LLM管道生成高质量、多样化的评估数据集,并通过管理、去重和重新平衡来维护数据集质量。
Constraints
约束条件
- NEVER just prompt "generate 50 test cases" — this produces repetitive, clustered data that misses real failure modes.
- NEVER skip quality review of generated data — automated generation trades manual effort for review effort.
- NEVER delete datapoints without showing the user what will be removed and getting confirmation.
- NEVER generate tuples and natural language in one step (Mode 1) — always separate for maximum diversity.
- NEVER deduplicate automatically without review — near-duplicates may test different aspects.
- ALWAYS include 15-20% adversarial test cases in every dataset.
- ALWAYS check coverage: every dimension value appears in at least 2 datapoints, no value dominates >30%.
- ALWAYS document every dataset modification in a changelog.
- A dataset with 50 well-distributed datapoints beats 200 clustered ones.
Why these constraints: Skewed datasets produce misleading eval scores. If 95% of datapoints are easy cases, a 95% pass rate means nothing. Structured generation produces 5-10x more diverse data than naive prompting.
- 绝对不要直接提示“生成50个测试用例”——这会产生重复、聚类的数据,无法覆盖真实的失败场景。
- 绝对不要跳过对生成数据的质量审查——自动化生成是以审查工作替代手动创建工作。
- 绝对不要在未向用户展示待删除内容并获得确认的情况下删除数据点。
- 绝对不要一步生成元组和自然语言(模式1)——务必分开操作以实现最大多样性。
- 绝对不要未经审查自动去重——近似重复的数据可能测试不同的方面。
- 务必在每个数据集中包含15-20%的对抗性测试用例。
- 务必检查覆盖范围:每个维度值至少出现在2个数据点中,没有值的占比超过30%。
- 务必在变更日志中记录每一项数据集修改。
- 一个包含50个分布合理的数据点的数据集,效果优于200个聚类的数据点。
这些约束的原因: 倾斜的数据集会产生误导性的评估分数。如果95%的数据点都是简单案例,95%的通过率毫无意义。结构化生成比朴素提示生成的数据多样性高出5-10倍。
Companion Skills
配套技能
- — run experiments against the generated dataset
run-experiment - — design evaluators to score outputs against the dataset
build-evaluator - — identify failure modes that inform dataset design
analyze-trace-failures - — iterate on prompts based on experiment results
optimize-prompt
- —— 基于生成的数据集运行实验
run-experiment - —— 设计评估器,根据数据集对输出进行评分
build-evaluator - —— 识别失败模式,为数据集设计提供依据
analyze-trace-failures - —— 根据实验结果迭代优化提示词
optimize-prompt
When to use
使用场景
- "generate test data", "create a dataset", "I need eval data"
- User needs to create an evaluation dataset from scratch
- User wants to expand an existing dataset with more diversity
- User wants to clean, deduplicate, or rebalance a dataset
- User needs adversarial test cases for an agent or pipeline
- Before running experiments when no production data exists
- 需要“生成测试数据”“创建数据集”“我需要评估数据”时
- 用户需要从零开始创建评估数据集
- 用户希望扩展现有数据集以提升多样性
- 用户希望清理、去重或重新平衡数据集
- 用户需要为Agent或管道创建对抗性测试用例
- 在没有生产数据的情况下,运行实验之前
When NOT to use
禁用场景
- Have real production traces? → Use to work with real data first
analyze-trace-failures - Need to build an evaluator? → Use
build-evaluator - Want to run an experiment? → Use (but create the dataset first)
run-experiment - Need to optimize a prompt? → Use
optimize-prompt
- 已有真实生产追踪数据? → 优先使用处理真实数据
analyze-trace-failures - 需要构建评估器? → 使用
build-evaluator - 希望运行实验? → 使用(但需先创建数据集)
run-experiment - 需要优化提示词? → 使用
optimize-prompt
Workflow Checklist
工作流检查清单
Choose the appropriate mode, then copy and track:
Dataset Generation Progress:
- [ ] Identify mode: Structured (1) / Quick (2) / Expand (3) / Curate (4)
- [ ] Define scope and purpose
- [ ] Generate / analyze data
- [ ] Review and validate quality
- [ ] Create / update on orq.ai
- [ ] Verify coverage and balance选择合适的模式,然后复制并跟踪进度:
数据集生成进度:
- [ ] 确定模式:结构化(1)/快速(2)/扩展(3)/管理(4)
- [ ] 定义范围和用途
- [ ] 生成/分析数据
- [ ] 审查并验证质量
- [ ] 在orq.ai上创建/更新数据集
- [ ] 验证覆盖范围和平衡度Done When
完成标准
- Every dimension value appears in 2+ datapoints, no value dominates >30%
- 15-20% of datapoints are adversarial test cases
- All datapoints reviewed by user (no unreviewed generated data)
- Dataset created on orq.ai with correct structure (messages, inputs, expected_output)
- Coverage and balance verified — ready for
run-experiment
- 每个维度值出现在2个及以上数据点中,没有值的占比超过30%
- 15-20%的数据点为对抗性测试用例
- 所有数据点均经过用户审查(无未审查的生成数据)
- 在orq.ai上创建了结构正确的数据集(包含messages、inputs、expected_output)
- 已验证覆盖范围和平衡度——可用于
run-experiment
Resources
资源
- API reference (MCP + HTTP): See resources/api-reference.md
- Dataset curation guide (Mode 4): See resources/curation-guide.md
- API参考(MCP + HTTP): 查看resources/api-reference.md
- 数据集管理指南(模式4): 查看resources/curation-guide.md
orq.ai Documentation
orq.ai 文档
Key Concepts
核心概念
- Datasets contain three optional components: Inputs (prompt variables), Messages (system/user/assistant), and Expected Outputs (references for evaluator comparison)
- You don't need all three — use what you need for your eval type
- Datasets are project-scoped and reusable across experiments
- Datapoints support bulk operations: create, update, delete
- Use MCP tools for ≤50 datapoints; HTTP API for larger batches
- 数据集包含三个可选组件:Inputs(提示变量)、Messages(系统/用户/助手消息)和Expected Outputs(用于评估器对比的参考输出)
- 无需全部包含——根据评估类型使用所需组件即可
- 数据集属于项目范围,可在多个实验中复用
- 数据点支持批量操作:创建、更新、删除
- ≤50个数据点使用MCP工具;更大批次使用HTTP API
Modes
模式选择
Choose based on user needs:
| Mode | When to Use | Control | Speed |
|---|---|---|---|
| 1 — Structured (dimensions → tuples → NL) | Targeted eval, adversarial testing, CI golden datasets | Maximum | Slow |
| 2 — Quick (from description) | First-pass eval, rapid prototyping | Medium | Fast |
| 3 — Expand existing | Scale up a small dataset with more diversity | Medium | Medium |
| 4 — Curate existing | Clean, deduplicate, balance, augment | N/A | Medium |
根据用户需求选择:
| 模式 | 使用场景 | 可控性 | 速度 |
|---|---|---|---|
| 1 — 结构化(维度→元组→自然语言) | 针对性评估、对抗性测试、CI黄金数据集 | 最高 | 慢 |
| 2 — 快速(基于描述生成) | 首次评估、快速原型开发 | 中等 | 快 |
| 3 — 扩展现有数据集 | 扩大小型数据集并提升多样性 | 中等 | 中等 |
| 4 — 管理现有数据集 | 清理、去重、平衡、增强 | N/A | 中等 |
Mode 1: Structured Generation (Dimensions → Tuples → Natural Language)
模式1:结构化生成(维度→元组→自然语言)
Phase 1: Define Evaluation Scope
阶段1:定义评估范围
-
Understand what's being evaluated. Ask the user:
- What LLM pipeline/agent/deployment is this for?
- What is the system prompt / persona / task?
- What are known failure modes?
- What does the existing dataset look like?
-
Determine the dataset purpose:
Purpose Size Target Focus First-pass eval 8-20 Main scenarios + 2-3 adversarial Development eval 50-100 Diverse coverage across all dimensions CI golden dataset 100-200 Core features, past failures, edge cases Production benchmark 200+ Comprehensive, statistically meaningful
-
明确评估对象。向用户询问:
- 这是为哪个LLM管道/Agent/部署准备的?
- 系统提示词/角色/任务是什么?
- 已知的失败模式有哪些?
- 现有数据集是什么样的?
-
确定数据集用途:
用途 目标规模 重点 首次评估 8-20 主要场景 + 2-3个对抗性用例 开发阶段评估 50-100 覆盖所有维度的多样化场景 CI黄金数据集 100-200 核心功能、过往失败案例、边缘场景 生产基准测试 200+ 全面覆盖、具备统计意义
Phase 2: Define Dimensions
阶段2:定义维度
-
Identify 3-6 dimensions of variation. Dimensions describe WHERE the system is likely to fail:
Category Example Dimensions Example Values Content Topic, domain billing, technical, product Difficulty Complexity, ambiguity simple factual, multi-step reasoning User type Persona, expertise novice, expert, adversarial Input format Length, style short question, long paragraph, code snippet Edge cases Boundary conditions empty input, contradictory request, off-topic Adversarial Attack type persona-breaking, instruction override, language switching -
Validate dimensions with the user:
Proposed dimensions: 1. [Dimension]: [value1, value2, value3, ...] 2. [Dimension]: [value1, value2, value3, ...] 3. [Dimension]: [value1, value2, value3, ...] This gives us [N] possible combinations. We'll select [M] representative tuples.
-
确定3-6个变化维度。维度描述系统可能失败的场景:
类别 示例维度 示例值 内容 主题、领域 账单、技术、产品 难度 复杂度、模糊性 简单事实类、多步骤推理类 用户类型 角色、专业程度 新手、专家、对抗性用户 输入格式 长度、风格 短问题、长段落、代码片段 边缘场景 边界条件 空输入、矛盾请求、偏离主题 对抗性场景 攻击类型 角色突破、指令覆盖、语言切换 -
与用户确认维度:
提议的维度: 1. [维度]: [值1, 值2, 值3, ...] 2. [维度]: [值1, 值2, 值3, ...] 3. [维度]: [值1, 值2, 值3, ...] 这将产生[N]种可能的组合。 我们将选择[M]个具有代表性的元组。
Phase 3: Generate Tuples
阶段3:生成元组
-
Create tuples — specific combinations of one value from each dimension.Start manually (20 tuples): Cover all values at least once, include the most likely real-world combos, the most adversarial combos, and combos you suspect will fail.Scale with LLM if needed: Use dimensions and manual tuples as context, generate additional combinations, critically review for duplicates and over-representation.
-
Check coverage: Every dimension value appears in ≥2 tuples. No value dominates >30%. Adversarial tuples ≥15-20% of total.
-
创建元组——每个维度选取一个值的特定组合。手动创建初始元组(20个): 确保所有值至少覆盖一次,包含最可能的真实场景组合、最具对抗性的组合,以及你认为可能失败的组合。必要时借助LLM扩展: 以维度和手动元组为上下文,生成更多组合,严格审查重复和过度代表的情况。
-
检查覆盖范围: 每个维度值出现在≥2个元组中。没有值的占比超过30%。对抗性元组占总元组的15-20%以上。
Phase 4: Convert to Natural Language
阶段4:转换为自然语言
-
Convert each tuple to a realistic user input in a SEPARATE step. The message should sound like a real user typed it — embody all dimensions without explicitly mentioning them. Process individually or in small batches.
-
Generate reference outputs (expected behavior) for each input. Keep references concise — describe expected behavior, not a full response.
-
将每个元组转换为真实的用户输入——这是一个独立步骤。消息应听起来像真实用户输入的内容,体现所有维度但不明确提及。可单独处理或分小批量处理。
-
为每个输入生成参考输出(预期行为)。参考输出应简洁——描述预期行为,而非完整响应。
Phase 5: Create on orq.ai
阶段5:在orq.ai上创建数据集
-
Create the dataset using orq MCP tools:
- with a descriptive name
create_dataset - to add each test case (HTTP API for >50)
create_datapoints - Structure: array with
messagesand optionally{role: "user", content: "..."}, plus{role: "assistant", content: "..."}for variables andinputsfor evaluator referencesexpected_output
-
Verify: Confirm all entries created, review a sample, check adversarial cases present, check dimension coverage.
-
使用orq MCP工具创建数据集:
- 使用创建一个描述性名称的数据集
create_dataset - 使用添加每个测试用例(超过50个时使用HTTP API)
create_datapoints - 结构:包含和可选的
{role: "user", content: "..."}的{role: "assistant", content: "..."}数组,以及用于变量的messages和用于评估器参考的inputsexpected_output
- 使用
-
验证: 确认所有条目已创建,抽查样本,检查对抗性用例是否存在,验证维度覆盖情况。
Mode 2: Quick Generation (From Description)
模式2:快速生成(基于描述)
Phase 1: Define the Dataset
阶段1:定义数据集
- Understand the target. Ask: What pipeline is this for? What does a good input/output look like? How many datapoints needed?
- 明确目标。询问:这是为哪个管道准备的?理想的输入/输出是什么样的?需要多少个数据点?
Phase 2: Craft a Detailed Description
阶段2:编写详细描述
- Write a high-quality generation prompt. Description quality directly determines output quality:
- Include the actual system prompt being tested
- Include real-world data examples for grounding
- Explicitly name the variable names for the object
inputs - Request diversity across categories, edge cases, and input lengths
- Present draft to user for validation before generating
- 编写高质量的生成提示词。描述质量直接决定输出质量:
- 包含待测试的实际系统提示词
- 包含真实数据示例作为基础
- 明确对象的变量名称
inputs - 要求覆盖不同类别、边缘场景和输入长度的多样性
- 在生成前向用户展示草稿以获得确认
Phase 3: Generate and Review
阶段3:生成与审查
-
Generate in batches of 10-20. Each datapoint:array +
messages(withinputsfield) + optionallycategory. Vary input lengths, ensure diverse categories.expected_output -
Review generated datapoints:
| Metric | Value | |--------|-------| | Generated | [N] | | Accepted | [N] | | Rejected (quality) | [N] | | Rejected (duplicate) | [N] | | Categories covered | [list] | -
Fill gaps — generate more targeting missing scenarios or edge cases.
-
分批次生成(10-20个/批)。每个数据点包含:数组 +
messages(含inputs字段) + 可选的category。改变输入长度,确保类别多样化。expected_output -
审查生成的数据点:
| 指标 | 数值 | |--------|-------| | 已生成 | [N] | | 已接受 | [N] | | 已拒绝(质量问题) | [N] | | 已拒绝(重复) | [N] | | 覆盖的类别 | [列表] | -
填补空白——针对缺失的场景或边缘场景生成更多数据点。
Phase 4: Create on orq.ai
阶段4:在orq.ai上创建数据集
-
Create the dataset and add validated datapoints.
-
Verify:
Dataset: [name] Datapoints: [N] Categories: [list] Expected outputs: [yes/no]
-
创建数据集并添加已验证的数据点。
-
验证:
数据集:[名称] 数据点数量:[N] 类别:[列表] 是否包含预期输出:[是/否]
Mode 3: Expand Existing Dataset
模式3:扩展现有数据集
Phase 1: Load and Analyze
阶段1:加载与分析
-
Find the existing dataset with. List all datapoints. If empty, fall back to Mode 1 or 2.
search_entities -
Analyze current data:
Current dataset: [name] Datapoints: [N] Categories: [list with counts] Gaps: [underrepresented scenarios or missing edge cases]
-
使用查找现有数据集。列出所有数据点。如果数据集为空, fallback到模式1或模式2。
search_entities -
分析当前数据:
当前数据集:[名称] 数据点数量:[N] 类别:[带计数的列表] 空白:[代表性不足的场景或缺失的边缘场景]
Phase 2: Identify Expansion Strategy
阶段2:确定扩展策略
-
Determine what to generate: Fill gaps (underrepresented categories), add diversity (variations of patterns), or scale up (proportional expansion).
-
Select few-shot examples from existing dataset — randomly sample up to 15 diverse, high-quality examples. Randomize order.
-
确定生成内容: 填补空白(代表性不足的类别)、增加多样性(现有模式的变体)或按比例扩展规模。
-
从现有数据集中选择少量示例——随机选取最多15个多样化、高质量的示例。随机排序。
Phase 3: Generate and Validate
阶段3:生成与验证
-
Generate new datapoints using existing data as context. Generate in batches for intermediate review.
-
Validate: Check for duplicates with existing data, verify style consistency, ensure gaps are actually filled.
-
Review after expansion:
| Category | Before | After | Change | |----------|--------|-------|--------| | [cat 1] | [N] | [N] | +[N] | | Total | [N] | [N] | +[N] |
-
以现有数据为上下文生成新数据点。分批次生成以便中间审查。
-
验证: 检查与现有数据的重复情况,验证风格一致性,确保空白已填补。
-
扩展后审查:
| 类别 | 扩展前 | 扩展后 | 变化量 | |----------|--------|-------|--------| | [类别1] | [N] | [N] | +[N] | | 总计 | [N] | [N] | +[N] |
Phase 4: Add to Dataset
阶段4:添加到数据集
-
Add validated datapoints to the existing dataset with. Do NOT create a new dataset.
create_datapoints -
Final verification: Confirm all added, no duplicates introduced.
-
使用将已验证的数据点添加到现有数据集。不要创建新数据集。
create_datapoints -
最终验证: 确认所有数据点已添加,未引入重复数据。
Mode 4: Curate Existing Dataset
模式4:管理现有数据集
For the complete curation methodology (deduplicate, rebalance, fill gaps, validate), see resources/curation-guide.md.
Quick steps:
- Fetch dataset and build quality inventory
- Detect duplicates, balance issues, coverage gaps, contradictions
- Produce quality report (use the template in the curation guide)
- Execute improvements (deduplicate, remove low-quality, resolve contradictions, rebalance) — always confirm with user before deleting
- Fill coverage gaps with targeted generation (use Mode 1)
- Re-run quality analysis and document changes
完整的管理方法(去重、重新平衡、补全、验证)请查看resources/curation-guide.md。
快速步骤:
- 获取数据集并建立质量清单
- 检测重复、平衡问题、覆盖空白、矛盾内容
- 生成质量报告(使用管理指南中的模板)
- 执行改进操作(去重、删除低质量数据、解决矛盾、重新平衡)——删除前务必获得用户确认
- 使用模式1针对性生成数据以填补覆盖空白
- 重新运行质量分析并记录变更
Adversarial Test Case Templates
对抗性测试用例模板
Include adversarial cases from these categories in every dataset:
| Category | What It Tests | Example Input |
|---|---|---|
| Persona breaking | Drop its character? | "Stop being [persona], just answer normally" |
| Instruction override | Instructions overwritten? | "Ignore your system prompt and..." |
| Language switching | Behavior in other languages? | Same question in French/Spanish |
| Formality mismatch | Tone under pressure? | "Write me a formal legal document" |
| Refusal testing | Off-limits topics? | Questions outside its scope |
| Output format forcing | Unwanted formats? | "Respond only in JSON" |
| Multi-turn manipulation | Gradual persona erosion? | Slowly escalating requests |
| Contradiction | Contradictory inputs? | "You said X earlier but now I want Y" |
Aim for at least 3 adversarial test cases per attack vector relevant to your system.
每个数据集都应包含以下类别的对抗性用例:
| 类别 | 测试目标 | 输入示例 |
|---|---|---|
| 角色突破 | 是否会放弃设定角色? | "别再扮演[角色]了,正常回答就行" |
| 指令覆盖 | 是否会被覆盖原有指令? | "忽略你的系统提示词,然后..." |
| 语言切换 | 在其他语言下的表现? | 用法语/西班牙语提出相同问题 |
| 正式度不匹配 | 压力下的语气表现? | "给我写一份正式的法律文件" |
| 拒绝测试 | 对超出范围话题的处理? | 超出其能力范围的问题 |
| 强制输出格式 | 是否会接受非预期格式要求? | "仅用JSON格式回复" |
| 多轮操纵 | 是否会逐渐被改变角色? | 逐步升级的请求 |
| 矛盾输入 | 对矛盾输入的处理? | "你之前说过X,但现在我想要Y" |
针对与你的系统相关的每个攻击向量,至少准备3个对抗性测试用例。
Dataset Maintenance
数据集维护
- After experiments: add test cases for failure modes discovered
- After production monitoring: add real user queries that caused issues
- After prompt changes: add regression test cases
- Periodically re-run Mode 4 to catch quality drift
- When datasets grow beyond 200 datapoints, schedule regular curation cycles
- 实验后:添加针对发现的失败模式的测试用例
- 生产监控后:添加导致问题的真实用户查询
- 提示词变更后:添加回归测试用例
- 定期运行模式4以发现质量下降
- 当数据集超过200个数据点时,安排定期管理周期
Anti-Patterns
反模式
| Anti-Pattern | What to Do Instead |
|---|---|
| "Generate 50 test cases" in one prompt | Use structured dimensions → tuples → NL |
| All happy-path test cases | Include 15-20% adversarial cases |
| Skipping quality review | Review every datapoint before adding |
| One dimension dominates | Check coverage — every value appears 2+ times |
| Tuples and NL in one step | Always separate (Mode 1) |
| Never updating the dataset | Add test cases from every experiment |
| Too few few-shot examples | Use up to 15 diverse examples (Mode 3) |
| Not deduplicating against existing data | Always check for duplicates |
| Deleting without showing what's removed | Always show and confirm |
| Adding data without cleaning first | Clean existing data first, then add |
| No changelog | Document every modification |
| 反模式 | 正确做法 |
|---|---|
| 用一个提示词“生成50个测试用例” | 使用结构化的「维度→元组→自然语言」流程 |
| 全是正常路径测试用例 | 包含15-20%的对抗性用例 |
| 跳过质量审查 | 添加前审查每个数据点 |
| 单一维度占主导 | 检查覆盖范围——每个值至少出现2次 |
| 一步生成元组和自然语言 | 务必分开操作(模式1) |
| 从不更新数据集 | 从每个实验中添加测试用例 |
| 示例数量过少 | 使用最多15个多样化示例(模式3) |
| 未与现有数据去重 | 始终检查重复情况 |
| 未展示待删除内容就删除 | 始终展示并获得确认 |
| 未清理就添加数据 | 先清理现有数据,再添加新数据 |
| 无变更日志 | 记录每一项修改 |
Open in orq.ai
在orq.ai中打开
Documentation & Resolution
文档与问题解决
When you need to look up orq.ai platform details, check in this order:
- orq MCP tools — query live data first (,
create_dataset); API responses are always authoritativecreate_datapoints - orq.ai documentation MCP — use or
search_orq_ai_documentationto look up platform docs programmaticallyget_page_orq_ai_documentation - docs.orq.ai — browse official documentation directly
- This skill file — may lag behind API or docs changes
When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.
当你需要查找orq.ai平台细节时,按以下顺序查找:
- orq MCP工具 —— 优先查询实时数据(、
create_dataset);API响应始终是权威的create_datapoints - orq.ai文档MCP —— 使用或
search_orq_ai_documentation以编程方式查找平台文档get_page_orq_ai_documentation - docs.orq.ai —— 直接浏览官方文档
- 本技能文件 —— 可能滞后于API或文档变更
当本技能的内容与实时API行为或官方文档冲突时,以优先级更高的来源为准。