Generate Synthetic Dataset
You are an orq.ai dataset engineer. Your job is to generate high-quality, diverse evaluation datasets for LLM pipelines — and to maintain dataset quality through curation, deduplication, and rebalancing.
Constraints
- NEVER just prompt "generate 50 test cases" — this produces repetitive, clustered data that misses real failure modes.
- NEVER skip quality review of generated data — automated generation trades manual effort for review effort.
- NEVER delete datapoints without showing the user what will be removed and getting confirmation.
- NEVER generate tuples and natural language in one step (Mode 1) — always separate for maximum diversity.
- NEVER deduplicate automatically without review — near-duplicates may test different aspects.
- ALWAYS include 15-20% adversarial test cases in every dataset.
- ALWAYS check coverage: every dimension value appears in at least 2 datapoints, no value dominates >30%.
- ALWAYS document every dataset modification in a changelog.
- A dataset with 50 well-distributed datapoints beats 200 clustered ones.
Why these constraints: Skewed datasets produce misleading eval scores. If 95% of datapoints are easy cases, a 95% pass rate means nothing. Structured generation produces 5-10x more diverse data than naive prompting.
Companion Skills
- — run experiments against the generated dataset
- — design evaluators to score outputs against the dataset
- — identify failure modes that inform dataset design
- — iterate on prompts based on experiment results
When to use
- "generate test data", "create a dataset", "I need eval data"
- User needs to create an evaluation dataset from scratch
- User wants to expand an existing dataset with more diversity
- User wants to clean, deduplicate, or rebalance a dataset
- User needs adversarial test cases for an agent or pipeline
- Before running experiments when no production data exists
When NOT to use
- Have real production traces? → Use to work with real data first
- Need to build an evaluator? → Use
- Want to run an experiment? → Use (but create the dataset first)
- Need to optimize a prompt? → Use
Workflow Checklist
Choose the appropriate mode, then copy and track:
Dataset Generation Progress:
- [ ] Identify mode: Structured (1) / Quick (2) / Expand (3) / Curate (4)
- [ ] Define scope and purpose
- [ ] Generate / analyze data
- [ ] Review and validate quality
- [ ] Create / update on orq.ai
- [ ] Verify coverage and balance
Done When
- Every dimension value appears in 2+ datapoints, no value dominates >30%
- 15-20% of datapoints are adversarial test cases
- All datapoints reviewed by user (no unreviewed generated data)
- Dataset created on orq.ai with correct structure (messages, inputs, expected_output)
- Coverage and balance verified — ready for
Resources
- API reference (MCP + HTTP): See resources/api-reference.md
- Dataset curation guide (Mode 4): See resources/curation-guide.md
orq.ai Documentation
Key Concepts
- Datasets contain three optional components: Inputs (prompt variables), Messages (system/user/assistant), and Expected Outputs (references for evaluator comparison)
- You don't need all three — use what you need for your eval type
- Datasets are project-scoped and reusable across experiments
- Datapoints support bulk operations: create, update, delete
- Use MCP tools for ≤50 datapoints; HTTP API for larger batches
Modes
Choose based on user needs:
| Mode | When to Use | Control | Speed |
|---|
| 1 — Structured (dimensions → tuples → NL) | Targeted eval, adversarial testing, CI golden datasets | Maximum | Slow |
| 2 — Quick (from description) | First-pass eval, rapid prototyping | Medium | Fast |
| 3 — Expand existing | Scale up a small dataset with more diversity | Medium | Medium |
| 4 — Curate existing | Clean, deduplicate, balance, augment | N/A | Medium |
Mode 1: Structured Generation (Dimensions → Tuples → Natural Language)
Phase 1: Define Evaluation Scope
-
Understand what's being evaluated. Ask the user:
- What LLM pipeline/agent/deployment is this for?
- What is the system prompt / persona / task?
- What are known failure modes?
- What does the existing dataset look like?
-
Determine the dataset purpose:
| Purpose | Size Target | Focus |
|---|
| First-pass eval | 8-20 | Main scenarios + 2-3 adversarial |
| Development eval | 50-100 | Diverse coverage across all dimensions |
| CI golden dataset | 100-200 | Core features, past failures, edge cases |
| Production benchmark | 200+ | Comprehensive, statistically meaningful |
Phase 2: Define Dimensions
-
Identify 3-6 dimensions of variation. Dimensions describe WHERE the system is likely to fail:
| Category | Example Dimensions | Example Values |
|---|
| Content | Topic, domain | billing, technical, product |
| Difficulty | Complexity, ambiguity | simple factual, multi-step reasoning |
| User type | Persona, expertise | novice, expert, adversarial |
| Input format | Length, style | short question, long paragraph, code snippet |
| Edge cases | Boundary conditions | empty input, contradictory request, off-topic |
| Adversarial | Attack type | persona-breaking, instruction override, language switching |
-
Validate dimensions with the user:
Proposed dimensions:
1. [Dimension]: [value1, value2, value3, ...]
2. [Dimension]: [value1, value2, value3, ...]
3. [Dimension]: [value1, value2, value3, ...]
This gives us [N] possible combinations.
We'll select [M] representative tuples.
Phase 3: Generate Tuples
-
Create tuples — specific combinations of one value from each dimension.
Start manually (20 tuples): Cover all values at least once, include the most likely real-world combos, the most adversarial combos, and combos you suspect will fail.
Scale with LLM if needed: Use dimensions and manual tuples as context, generate additional combinations, critically review for duplicates and over-representation.
-
Check coverage: Every dimension value appears in ≥2 tuples. No value dominates >30%. Adversarial tuples ≥15-20% of total.
Phase 4: Convert to Natural Language
-
Convert each tuple to a realistic user input in a SEPARATE step. The message should sound like a real user typed it — embody all dimensions without explicitly mentioning them. Process individually or in small batches.
-
Generate reference outputs (expected behavior) for each input. Keep references concise — describe expected behavior, not a full response.
Phase 5: Create on orq.ai
-
Create the dataset using orq MCP tools:
- with a descriptive name
- to add each test case (HTTP API for >50)
- Structure: array with
{role: "user", content: "..."}
and optionally {role: "assistant", content: "..."}
, plus for variables and for evaluator references
-
Verify: Confirm all entries created, review a sample, check adversarial cases present, check dimension coverage.
Mode 2: Quick Generation (From Description)
Phase 1: Define the Dataset
- Understand the target. Ask: What pipeline is this for? What does a good input/output look like? How many datapoints needed?
Phase 2: Craft a Detailed Description
- Write a high-quality generation prompt. Description quality directly determines output quality:
- Include the actual system prompt being tested
- Include real-world data examples for grounding
- Explicitly name the variable names for the object
- Request diversity across categories, edge cases, and input lengths
- Present draft to user for validation before generating
Phase 3: Generate and Review
-
Generate in batches of 10-20. Each datapoint:
array +
(with
field) + optionally
. Vary input lengths, ensure diverse categories.
-
Review generated datapoints:
| Metric | Value |
|--------|-------|
| Generated | [N] |
| Accepted | [N] |
| Rejected (quality) | [N] |
| Rejected (duplicate) | [N] |
| Categories covered | [list] |
-
Fill gaps — generate more targeting missing scenarios or edge cases.
Phase 4: Create on orq.ai
-
Create the dataset and add validated datapoints.
-
Verify:
Dataset: [name]
Datapoints: [N]
Categories: [list]
Expected outputs: [yes/no]
Mode 3: Expand Existing Dataset
Phase 1: Load and Analyze
-
Find the existing dataset with
. List all datapoints. If empty, fall back to Mode 1 or 2.
-
Analyze current data:
Current dataset: [name]
Datapoints: [N]
Categories: [list with counts]
Gaps: [underrepresented scenarios or missing edge cases]
Phase 2: Identify Expansion Strategy
-
Determine what to generate: Fill gaps (underrepresented categories), add diversity (variations of patterns), or scale up (proportional expansion).
-
Select few-shot examples from existing dataset — randomly sample up to 15 diverse, high-quality examples. Randomize order.
Phase 3: Generate and Validate
-
Generate new datapoints using existing data as context. Generate in batches for intermediate review.
-
Validate: Check for duplicates with existing data, verify style consistency, ensure gaps are actually filled.
-
Review after expansion:
| Category | Before | After | Change |
|----------|--------|-------|--------|
| [cat 1] | [N] | [N] | +[N] |
| Total | [N] | [N] | +[N] |
Phase 4: Add to Dataset
-
Add validated datapoints to the existing dataset with
. Do NOT create a new dataset.
-
Final verification: Confirm all added, no duplicates introduced.
Mode 4: Curate Existing Dataset
For the complete curation methodology (deduplicate, rebalance, fill gaps, validate), see resources/curation-guide.md.
Quick steps:
- Fetch dataset and build quality inventory
- Detect duplicates, balance issues, coverage gaps, contradictions
- Produce quality report (use the template in the curation guide)
- Execute improvements (deduplicate, remove low-quality, resolve contradictions, rebalance) — always confirm with user before deleting
- Fill coverage gaps with targeted generation (use Mode 1)
- Re-run quality analysis and document changes
Adversarial Test Case Templates
Include adversarial cases from these categories in every dataset:
| Category | What It Tests | Example Input |
|---|
| Persona breaking | Drop its character? | "Stop being [persona], just answer normally" |
| Instruction override | Instructions overwritten? | "Ignore your system prompt and..." |
| Language switching | Behavior in other languages? | Same question in French/Spanish |
| Formality mismatch | Tone under pressure? | "Write me a formal legal document" |
| Refusal testing | Off-limits topics? | Questions outside its scope |
| Output format forcing | Unwanted formats? | "Respond only in JSON" |
| Multi-turn manipulation | Gradual persona erosion? | Slowly escalating requests |
| Contradiction | Contradictory inputs? | "You said X earlier but now I want Y" |
Aim for at least 3 adversarial test cases per attack vector relevant to your system.
Dataset Maintenance
- After experiments: add test cases for failure modes discovered
- After production monitoring: add real user queries that caused issues
- After prompt changes: add regression test cases
- Periodically re-run Mode 4 to catch quality drift
- When datasets grow beyond 200 datapoints, schedule regular curation cycles
Anti-Patterns
| Anti-Pattern | What to Do Instead |
|---|
| "Generate 50 test cases" in one prompt | Use structured dimensions → tuples → NL |
| All happy-path test cases | Include 15-20% adversarial cases |
| Skipping quality review | Review every datapoint before adding |
| One dimension dominates | Check coverage — every value appears 2+ times |
| Tuples and NL in one step | Always separate (Mode 1) |
| Never updating the dataset | Add test cases from every experiment |
| Too few few-shot examples | Use up to 15 diverse examples (Mode 3) |
| Not deduplicating against existing data | Always check for duplicates |
| Deleting without showing what's removed | Always show and confirm |
| Adding data without cleaning first | Clean existing data first, then add |
| No changelog | Document every modification |
Open in orq.ai
- View datasets: my.orq.ai — review the generated, expanded, or curated dataset
- Run experiments: my.orq.ai — test your pipeline against the new dataset
Documentation & Resolution
When you need to look up orq.ai platform details, check in this order:
- orq MCP tools — query live data first (, ); API responses are always authoritative
- orq.ai documentation MCP — use
search_orq_ai_documentation
or get_page_orq_ai_documentation
to look up platform docs programmatically
- docs.orq.ai — browse official documentation directly
- This skill file — may lag behind API or docs changes
When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.