
Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  • Each sample is a complete conversation
  • Multi-turn: alternate user/assistant messages
  • System prompts optional:
    {"role": "system", "content": "..."}
  • JSONL format, one sample per line
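A minimal sketch of writing samples in this format (the filename and sample content here are illustrative, not prescribed by the guidelines):

```python
import json

# Each sample is one complete conversation in the messages format.
samples = [
    {"messages": [
        {"role": "system", "content": "You are a concise assistant."},  # optional
        {"role": "user", "content": "What is JSONL?"},
        {"role": "assistant", "content": "JSON Lines: one JSON object per line."},
    ]},
]

# JSONL: one JSON object per line, no enclosing array.
with open("sft_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```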

Preference Learning (DPO/ORPO/KTO)

Requires paired comparisons [2]:
{"prompt": "...", "chosen": "...", "rejected": "..."}
  • chosen and rejected must respond to the same prompt
  • Quality difference should be clear and consistent
  • Annotator agreement >70% indicates usable samples [1]
For KTO, pairs aren't required—just binary labels on completions [7]:
{"prompt": "...", "completion": "...", "label": true/false}

Reward Modeling (RLHF)

Needs ranked responses [1]:
{"prompt": "...", "responses": ["best", "second", "worst"]}

Quality Checklist

Before training, verify:
  • No duplicates — exact and near-duplicate removal [3]
  • No empty fields — all required fields populated
  • Consistent format — schema matches throughout
  • Appropriate length — not too short (noise) or too long (truncation)
  • Clean text — proper encoding, no HTML/boilerplate artifacts [8]
  • Manual inspection — reviewed random sample of 50-100 examples
  • No PII/sensitive data — unless intentionally included
  • License verified — legal to use for training
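The first two checks are mechanical and can be scripted; a sketch, assuming messages-format records (near-duplicate MinHash and PII detection need dedicated tooling and are not covered here):

```python
import hashlib
import json

def checklist_pass(records: list[dict], required: tuple = ("messages",)) -> list[dict]:
    """Drop records with empty required fields, then drop exact duplicates.

    Exact duplicates are found by hashing a canonical (sorted-key) JSON
    serialization of each record.
    """
    seen, clean = set(), []
    for rec in records:
        if any(not rec.get(field) for field in required):
            continue  # missing or empty required field
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        clean.append(rec)
    return clean
```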

Common Quality Issues

Issue | Detection | Fix | Source
Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3]
Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8]
Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4]
Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8]
Wrong language | Language detection | fastText classifier, filter to target | [3]
Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8]
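The repetition and alpha-ratio heuristics from the table are a few lines each; a sketch using the thresholds above (how you tokenize and combine filters is your choice):

```python
def unique_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that are unique; low values flag repetitive text."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return len(set(trigrams)) / len(trigrams) if trigrams else 1.0

def alpha_ratio(text: str) -> float:
    """Fraction of alphabetic characters; low values flag markup/table debris."""
    return sum(ch.isalpha() for ch in text) / len(text) if text else 0.0

def passes_filters(text: str) -> bool:
    # Thresholds from the table: <30% unique trigrams or <50% alphabetic fails.
    return unique_trigram_ratio(text) >= 0.3 and alpha_ratio(text) >= 0.5
```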

Data Sources

High quality:
  • Curated human annotations [1]
  • Expert-written examples
  • Filtered high-quality web data [3]
Medium quality:
  • Synthetic data from stronger models (distillation)
  • Community Q&A with voting signals
  • Filtered user-generated content
Use with caution:
  • Raw web scrapes
  • Unfiltered synthetic data
  • Data without clear provenance [6]

Sizing Guidelines

Dataset Size | Use Case | Source
100-1K | Quick experiments, specific behaviors |
1K-10K | Production SFT, domain adaptation |
10K-100K | Comprehensive instruction tuning | [1]
1M+ preference pairs | Large-scale RLHF | [1]
Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].

File Format

  • JSONL — one JSON object per line, human-readable
  • Parquet — efficient for large datasets, built-in compression [3]
  • Sharding — split files >500MB into chunks
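Sharding can be done while writing; a sketch where the size cap is a parameter and the `prefix-00000.jsonl` naming is a convention assumed here, not a standard:

```python
import json

def write_sharded_jsonl(records, prefix: str, max_bytes: int = 500 * 1024 * 1024):
    """Write records to JSONL shards, rolling to a new file before exceeding max_bytes."""
    shard, size, paths, f = 0, 0, [], None
    for rec in records:
        line = json.dumps(rec, ensure_ascii=False) + "\n"
        encoded = line.encode("utf-8")
        # Start a new shard if this line would push the current file over the cap.
        if f is None or size + len(encoded) > max_bytes:
            if f:
                f.close()
            path = f"{prefix}-{shard:05d}.jsonl"
            f = open(path, "w", encoding="utf-8")
            paths.append(path)
            shard, size = shard + 1, 0
        f.write(line)
        size += len(encoded)
    if f:
        f.close()
    return paths
```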

References

  1. Llama 2 Paper — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
  2. TRL Library — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
  3. FineWeb Paper — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
  4. Data-Juicer — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
  5. Tinker API — Training API using messages format for SFT, DPO/RLHF support
  6. Data Provenance Initiative — Longpre et al. (2023). Dataset licensing and attribution audit
  7. KTO Paper — Ethayarajh et al. (2024). Binary preference learning without pairs
  8. C4/T5 Paper — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal