Loading...
Loading...
Generate and curate evaluation datasets — structured generation via dimensions-tuples-NL, quick from description, expansion from existing data, plus dataset maintenance through deduplication, rebalancing, and gap-filling. Use when creating eval data, expanding test coverage, or cleaning datasets. Do NOT use when sufficient real production data exists (use analyze-trace-failures instead). Do NOT use for evaluator creation (use build-evaluator).
npx skill4agent add orq-ai/assistant-plugins generate-synthetic-datasetrun-experimentbuild-evaluatoranalyze-trace-failuresoptimize-promptanalyze-trace-failuresbuild-evaluatorrun-experimentoptimize-promptDataset Generation Progress:
- [ ] Identify mode: Structured (1) / Quick (2) / Expand (3) / Curate (4)
- [ ] Define scope and purpose
- [ ] Generate / analyze data
- [ ] Review and validate quality
- [ ] Create / update on orq.ai
- [ ] Verify coverage and balancerun-experiment| Mode | When to Use | Control | Speed |
|---|---|---|---|
| 1 — Structured (dimensions → tuples → NL) | Targeted eval, adversarial testing, CI golden datasets | Maximum | Slow |
| 2 — Quick (from description) | First-pass eval, rapid prototyping | Medium | Fast |
| 3 — Expand existing | Scale up a small dataset with more diversity | Medium | Medium |
| 4 — Curate existing | Clean, deduplicate, balance, augment | N/A | Medium |
| Purpose | Size Target | Focus |
|---|---|---|
| First-pass eval | 8-20 | Main scenarios + 2-3 adversarial |
| Development eval | 50-100 | Diverse coverage across all dimensions |
| CI golden dataset | 100-200 | Core features, past failures, edge cases |
| Production benchmark | 200+ | Comprehensive, statistically meaningful |
| Category | Example Dimensions | Example Values |
|---|---|---|
| Content | Topic, domain | billing, technical, product |
| Difficulty | Complexity, ambiguity | simple factual, multi-step reasoning |
| User type | Persona, expertise | novice, expert, adversarial |
| Input format | Length, style | short question, long paragraph, code snippet |
| Edge cases | Boundary conditions | empty input, contradictory request, off-topic |
| Adversarial | Attack type | persona-breaking, instruction override, language switching |
Proposed dimensions:
1. [Dimension]: [value1, value2, value3, ...]
2. [Dimension]: [value1, value2, value3, ...]
3. [Dimension]: [value1, value2, value3, ...]
This gives us [N] possible combinations.
We'll select [M] representative tuples.create_datasetcreate_datapointsmessages{role: "user", content: "..."}{role: "assistant", content: "..."}inputsexpected_outputinputsmessagesinputscategoryexpected_output| Metric | Value |
|--------|-------|
| Generated | [N] |
| Accepted | [N] |
| Rejected (quality) | [N] |
| Rejected (duplicate) | [N] |
| Categories covered | [list] |Dataset: [name]
Datapoints: [N]
Categories: [list]
Expected outputs: [yes/no]search_entitiesCurrent dataset: [name]
Datapoints: [N]
Categories: [list with counts]
Gaps: [underrepresented scenarios or missing edge cases]| Category | Before | After | Change |
|----------|--------|-------|--------|
| [cat 1] | [N] | [N] | +[N] |
| Total | [N] | [N] | +[N] |create_datapoints| Category | What It Tests | Example Input |
|---|---|---|
| Persona breaking | Drop its character? | "Stop being [persona], just answer normally" |
| Instruction override | Instructions overwritten? | "Ignore your system prompt and..." |
| Language switching | Behavior in other languages? | Same question in French/Spanish |
| Formality mismatch | Tone under pressure? | "Write me a formal legal document" |
| Refusal testing | Off-limits topics? | Questions outside its scope |
| Output format forcing | Unwanted formats? | "Respond only in JSON" |
| Multi-turn manipulation | Gradual persona erosion? | Slowly escalating requests |
| Contradiction | Contradictory inputs? | "You said X earlier but now I want Y" |
| Anti-Pattern | What to Do Instead |
|---|---|
| "Generate 50 test cases" in one prompt | Use structured dimensions → tuples → NL |
| All happy-path test cases | Include 15-20% adversarial cases |
| Skipping quality review | Review every datapoint before adding |
| One dimension dominates | Check coverage — every value appears 2+ times |
| Tuples and NL in one step | Always separate (Mode 1) |
| Never updating the dataset | Add test cases from every experiment |
| Too few few-shot examples | Use up to 15 diverse examples (Mode 3) |
| Not deduplicating against existing data | Always check for duplicates |
| Deleting without showing what's removed | Always show and confirm |
| Adding data without cleaning first | Clean existing data first, then add |
| No changelog | Document every modification |
create_datasetcreate_datapointssearch_orq_ai_documentationget_page_orq_ai_documentation