Training Data Curation Guidelines
Best practices for gathering and preparing training data for LLM fine-tuning.
Data Quality Principles
Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.
Format Requirements
Supervised Fine-Tuning (SFT)
Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Each sample is a complete conversation
- Multi-turn: alternate user/assistant messages
- System prompts optional: {"role": "system", "content": "..."}
- JSONL format, one sample per line
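As a rough sketch of enforcing the schema above (the function name and the specific checks are illustrative, not a standard tool), a JSONL file of SFT samples can be validated line by line:

```python
import json

def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    # An optional system prompt may only appear as the first message.
    body = messages[1:] if messages[0].get("role") == "system" else messages
    for i, msg in enumerate(body):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(f"message {i}: expected role '{expected}', got {msg.get('role')!r}")
        if not msg.get("content"):
            problems.append(f"message {i}: empty content")
    if body and body[-1].get("role") != "assistant":
        problems.append("conversation does not end with an assistant message")
    return problems
```

Running this over every line before training catches role-ordering mistakes and empty completions, two issues that otherwise fail silently during tokenization.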
Preference Learning (DPO/ORPO/KTO)
Requires paired comparisons [2]:
{"prompt": "...", "chosen": "...", "rejected": "..."}
- chosen and rejected must respond to the same prompt
- Quality difference should be clear and consistent
- Annotator agreement >70% indicates usable samples [1]
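A minimal sketch of a per-record check for this paired schema (the function name is illustrative; real pipelines would add length and dedup checks on top):

```python
import json

def validate_preference_pair(line: str) -> list[str]:
    """Check one DPO/ORPO-style JSONL record for the paired-comparison schema."""
    record = json.loads(line)
    problems = []
    for field in ("prompt", "chosen", "rejected"):
        if not str(record.get(field, "")).strip():
            problems.append(f"missing or empty '{field}'")
    # A pair where both completions are identical carries no preference signal.
    if record.get("chosen") and record.get("chosen") == record.get("rejected"):
        problems.append("'chosen' and 'rejected' are identical")
    return problems
```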
For KTO, pairs aren't required—just binary labels on completions [7]:
{"prompt": "...", "completion": "...", "label": true/false}
Reward Modeling (RLHF)
Needs ranked responses [1]:
{"prompt": "...", "responses": ["best", "second", "worst"]}
Quality Checklist
Before training, verify:
- No duplicates — exact and near-duplicate removal [3]
- No empty fields — all required fields populated
- Consistent format — schema matches throughout
- Appropriate length — not too short (noise) or too long (truncation)
- Clean text — proper encoding, no HTML/boilerplate artifacts [8]
- Manual inspection — reviewed random sample of 50-100 examples
- No PII/sensitive data — unless intentionally included
- License verified — legal to use for training
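Several of these checks are mechanical and can run in one pass. A minimal sketch, assuming the messages format above (function name and word-count thresholds are illustrative):

```python
import hashlib
import json

def run_basic_checks(lines, min_words=3, max_words=4096):
    """Apply a few checklist items: exact dedup, empty fields, length bounds."""
    seen = set()
    kept, dropped = [], []
    for line in lines:
        record = json.loads(line)
        text = " ".join(m.get("content", "") for m in record.get("messages", []))
        if not text.strip():
            dropped.append((line, "empty fields"))
            continue
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            dropped.append((line, "bad length"))
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # exact-duplicate key
        if digest in seen:
            dropped.append((line, "duplicate"))
            continue
        seen.add(digest)
        kept.append(line)
    return kept, dropped
```

Manual inspection and PII/license review still have to be done by a human; this only automates the mechanical items.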
Common Quality Issues
| Issue | Detection | Fix | Source |
|---|---|---|---|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
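The repetition and alpha-ratio rows of the table translate directly into code. A sketch using the thresholds above (helper names are illustrative):

```python
def unique_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that are unique; low values indicate repetition."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)

def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)

def is_low_quality(text: str) -> bool:
    # Thresholds from the table: <30% unique trigrams or <50% alphabetic.
    return unique_trigram_ratio(text) < 0.30 or alpha_ratio(text) < 0.50
```

Hash-based and MinHash deduplication, and fastText language filtering, need external tooling (e.g. the toolkits in [3] and [4]) and are not sketched here.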
Data Sources
High quality:
Medium quality:
- Synthetic data from stronger models (distillation)
- Community Q&A with voting signals
- Filtered user-generated content
Use with caution:
- Raw web scrapes
- Unfiltered synthetic data
- Data without clear provenance [6]
Sizing Guidelines
File Format
- JSONL — one JSON object per line, human-readable
- Parquet — efficient for large datasets, built-in compression [3]
- Sharding — split files >500MB into chunks
References
- Llama 2 Paper — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
- TRL Library — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
- FineWeb Paper — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
- Data-Juicer — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
- Tinker API — Training API using messages format for SFT, DPO/RLHF support
- Data Provenance Initiative — Longpre et al. (2023). Dataset licensing and attribution audit
- KTO Paper — Ethayarajh et al. (2024). Binary preference learning without pairs
- C4/T5 Paper — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal