
Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  • Each sample is a complete conversation
  • Multi-turn: alternate user/assistant messages
  • System prompts optional:
    {"role": "system", "content": "..."}
  • JSONL format, one sample per line
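A minimal sketch of writing samples in this format (the filename and sample content here are illustrative, not prescribed by the guidelines):

```python
import json

# Each sample is one complete conversation in the messages format.
samples = [
    {"messages": [
        {"role": "system", "content": "You are a concise assistant."},  # optional
        {"role": "user", "content": "What is JSONL?"},
        {"role": "assistant", "content": "JSON Lines: one JSON object per line."},
    ]},
]

# JSONL: one JSON object per line, no enclosing array.
with open("sft_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```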

Preference Learning (DPO/ORPO/KTO)

Requires paired comparisons [2]:
{"prompt": "...", "chosen": "...", "rejected": "..."}
  • chosen and rejected must respond to the same prompt
  • Quality difference should be clear and consistent
  • Annotator agreement >70% indicates usable samples [1]
For KTO, pairs aren't required—just binary labels on completions [7]:
{"prompt": "...", "completion": "...", "label": true/false}

Reward Modeling (RLHF)

Needs ranked responses [1]:
{"prompt": "...", "responses": ["best", "second", "worst"]}

Quality Checklist

Before training, verify:
  • No duplicates — exact and near-duplicate removal [3]
  • No empty fields — all required fields populated
  • Consistent format — schema matches throughout
  • Appropriate length — not too short (noise) or too long (truncation)
  • Clean text — proper encoding, no HTML/boilerplate artifacts [8]
  • Manual inspection — reviewed random sample of 50-100 examples
  • No PII/sensitive data — unless intentionally included
  • License verified — legal to use for training
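The first two checks are mechanical and can be scripted; a sketch, assuming messages-format records (near-duplicate MinHash and PII detection need dedicated tooling and are not covered here):

```python
import hashlib
import json

def checklist_pass(records: list[dict], required: tuple = ("messages",)) -> list[dict]:
    """Drop records with empty required fields, then drop exact duplicates.

    Exact duplicates are found by hashing a canonical (sorted-key) JSON
    serialization of each record.
    """
    seen, clean = set(), []
    for rec in records:
        if any(not rec.get(field) for field in required):
            continue  # missing or empty required field
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        clean.append(rec)
    return clean
```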

Common Quality Issues

Issue | Detection | Fix | Source
Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3]
Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8]
Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4]
Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8]
Wrong language | Language detection | fastText classifier, filter to target | [3]
Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8]
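The repetition and alpha-ratio heuristics from the table are a few lines each; a sketch using the thresholds above (how you tokenize and combine filters is your choice):

```python
def unique_trigram_ratio(text: str) -> float:
    """Fraction of word trigrams that are unique; low values flag repetitive text."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return len(set(trigrams)) / len(trigrams) if trigrams else 1.0

def alpha_ratio(text: str) -> float:
    """Fraction of alphabetic characters; low values flag markup/table debris."""
    return sum(ch.isalpha() for ch in text) / len(text) if text else 0.0

def passes_filters(text: str) -> bool:
    # Thresholds from the table: <30% unique trigrams or <50% alphabetic fails.
    return unique_trigram_ratio(text) >= 0.3 and alpha_ratio(text) >= 0.5
```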

Data Sources

High quality:
  • Curated human annotations [1]
  • Expert-written examples
  • Filtered high-quality web data [3]
Medium quality:
  • Synthetic data from stronger models (distillation)
  • Community Q&A with voting signals
  • Filtered user-generated content
Use with caution:
  • Raw web scrapes
  • Unfiltered synthetic data
  • Data without clear provenance [6]

Sizing Guidelines

Dataset Size | Use Case | Source
100-1K | Quick experiments, specific behaviors |
1K-10K | Production SFT, domain adaptation |
10K-100K | Comprehensive instruction tuning | [1]
1M+ preference pairs | Large-scale RLHF | [1]
Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].

File Format

  • JSONL — one JSON object per line, human-readable
  • Parquet — efficient for large datasets, built-in compression [3]
  • Sharding — split files >500MB into chunks
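Sharding can be done while writing; a sketch where the size cap is a parameter and the `prefix-00000.jsonl` naming is a convention assumed here, not a standard:

```python
import json

def write_sharded_jsonl(records, prefix: str, max_bytes: int = 500 * 1024 * 1024):
    """Write records to JSONL shards, rolling to a new file before exceeding max_bytes."""
    shard, size, paths, f = 0, 0, [], None
    for rec in records:
        line = json.dumps(rec, ensure_ascii=False) + "\n"
        encoded = line.encode("utf-8")
        # Start a new shard if this line would push the current file over the cap.
        if f is None or size + len(encoded) > max_bytes:
            if f:
                f.close()
            path = f"{prefix}-{shard:05d}.jsonl"
            f = open(path, "w", encoding="utf-8")
            paths.append(path)
            shard, size = shard + 1, 0
        f.write(line)
        size += len(encoded)
    if f:
        f.close()
    return paths
```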

References

  1. Llama 2 Paper — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
  2. TRL Library — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
  3. FineWeb Paper — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
  4. Data-Juicer — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
  5. Tinker API — Training API using messages format for SFT, DPO/RLHF support
  6. Data Provenance Initiative — Longpre et al. (2023). Dataset licensing and attribution audit
  7. KTO Paper — Ethayarajh et al. (2024). Binary preference learning without pairs
  8. C4/T5 Paper — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal