Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.
`npx skill4agent add sundial-org/skills training-data-curation`

`{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}`

`{"role": "system", "content": "..."}`

`{"prompt": "...", "chosen": "...", "rejected": "..."}`

`{"prompt": "...", "completion": "...", "label": true/false}`

`{"prompt": "...", "responses": ["best", "second", "worst"]}`

| Issue | Detection | Fix | Source |
|---|---|---|---|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
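
The checks in the table above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenization, the choice to flag rather than strip boilerplate, and the 100-word default are assumptions made for the example. The 30% and 50% thresholds come from the table. Near-duplicate detection (MinHash) and language detection (fastText) need external libraries and are left out.

```python
import hashlib

# Keywords taken from the table; the list is illustrative and should be extended.
BOILERPLATE_KEYWORDS = ("subscribe", "cookie policy")


def unique_trigram_ratio(text: str) -> float:
    """Fraction of distinct word trigrams among all word trigrams."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0  # too short to judge repetition
    return len(set(trigrams)) / len(trigrams)


def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def quality_issues(text: str, min_words: int = 100) -> list[str]:
    """Return the names of the table's checks that this text fails."""
    issues = []
    lowered = text.lower()
    if any(keyword in lowered for keyword in BOILERPLATE_KEYWORDS):
        issues.append("boilerplate")
    if unique_trigram_ratio(text) < 0.30:
        issues.append("repetitive")
    if alpha_ratio(text) < 0.50:
        issues.append("low_alpha")
    if len(text.split()) < min_words:
        issues.append("too_short")
    return issues


def dedup_exact(texts: list[str]) -> list[str]:
    """Drop exact duplicates (after case and whitespace normalization)."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

A typical pass would call `dedup_exact` on the whole corpus first, then keep only documents for which `quality_issues` returns an empty list.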

| Dataset Size | Use Case | Source |
|---|---|---|
| 100-1K | Quick experiments, specific behaviors | — |
| 1K-10K | Production SFT, domain adaptation | — |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |
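
Whatever the dataset size, it helps to check that every record matches one of the JSON shapes listed earlier before training. The sketch below is one way to do that, assuming records are already parsed into Python dicts. The function name `detect_format` and the returned labels (`"chat"`, `"preference_pair"`, `"labeled_completion"`, `"ranked_responses"`) are hypothetical names chosen for this example, not terms from the skill itself.

```python
def detect_format(record: dict) -> str:
    """Classify a record by its keys; raise ValueError if it fits no known shape."""
    keys = set(record)

    if "messages" in keys:
        messages = record["messages"]
        valid = (
            isinstance(messages, list)
            and len(messages) > 0
            and all(
                isinstance(m, dict) and {"role", "content"} <= set(m)
                for m in messages
            )
        )
        if not valid:
            raise ValueError("each message needs 'role' and 'content'")
        return "chat"

    if {"prompt", "chosen", "rejected"} <= keys:
        return "preference_pair"

    if {"prompt", "completion", "label"} <= keys:
        # The label is a boolean: true for a desirable completion, false otherwise.
        if not isinstance(record["label"], bool):
            raise ValueError("'label' must be true or false")
        return "labeled_completion"

    if {"prompt", "responses"} <= keys:
        if not isinstance(record["responses"], list):
            raise ValueError("'responses' must be a list ordered best to worst")
        return "ranked_responses"

    raise ValueError(f"unrecognized record keys: {sorted(keys)}")
```

Running this over a JSONL file and counting the labels gives a quick view of how many records of each kind a dataset holds, which can then be compared against the size ranges in the table above.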