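| Need | Resource |
|---|---|
| Planning my dataset - requirements, strategy, quality checklist | [`resources/dataset-strategy.md`](resources/dataset-strategy.md) |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | [`resources/generation-techniques.md`](resources/generation-techniques.md) |
| ChatML format details - structure, specification, common issues, framework compatibility | [`resources/chatml-format.md`](resources/chatml-format.md) |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | [`resources/examples.md`](resources/examples.md) |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | [`resources/quality-validation.md`](resources/quality-validation.md) |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | [`resources/framework-integration.md`](resources/framework-integration.md) |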
**Quality Checklist:**
- [ ] JSON validation passed (no errors)
- [ ] Analysis shows good diversity metrics
- [ ] Manual sample review passed
- [ ] No duplicate or near-duplicate examples
- [ ] All required fields present
- [ ] Realistic user queries
- [ ] Accurate, helpful responses
- [ ] Balanced category distribution
- [ ] Dataset metadata documented
See [`resources/quality-validation.md`](resources/quality-validation.md) for validation details, troubleshooting, and documentation templates.
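Some of the checklist items above (valid JSON, required fields present, no exact duplicates) can be checked mechanically. The following is a minimal sketch, not the project's `scripts/validate_chatml.py`: the names `check_records`, `message_problem`, and `ALLOWED_ROLES` are hypothetical, and it assumes each JSONL line is an object with a `messages` list of `role`/`content` pairs using the roles `system`, `user`, and `assistant`. It only catches exact duplicates after trivial normalization; near-duplicate detection and manual review remain separate steps.

```python
import json

# Assumption: only these three roles are expected in a record.
ALLOWED_ROLES = {"system", "user", "assistant"}


def message_problem(message):
    """Return a description of what is wrong with one message, or None."""
    if not isinstance(message, dict):
        return "message is not an object"
    if message.get("role") not in ALLOWED_ROLES:
        return f"unexpected role: {message.get('role')!r}"
    content = message.get("content")
    if not isinstance(content, str) or not content.strip():
        return "missing or empty 'content'"
    return None


def check_records(lines):
    """Return (line_number, problem) tuples for an iterable of JSONL lines."""
    problems = []
    first_seen = {}  # normalized record -> line number of first occurrence
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        found = [p for p in map(message_problem, messages) if p]
        if found:
            problems.append((number, found[0]))
            continue
        # Lowercase and collapse whitespace so trivially different copies match.
        key = " ".join(json.dumps(messages, sort_keys=True).lower().split())
        if key in first_seen:
            problems.append((number, f"duplicate of line {first_seen[key]}"))
        else:
            first_seen[key] = number
    return problems


sample = [
    '{"messages": [{"role": "user", "content": "How do I reverse a string in Python?"}, '
    '{"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}',
    '{"messages": []}',
]
for number, problem in check_records(sample):
    print(f"line {number}: {problem}")
```

To run it against a file, pass an open file handle, for example `check_records(open("training_data.jsonl", encoding="utf-8"))`.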
**Dataset files:**
- `training_data.jsonl`
- `validation_data.jsonl`
- `dataset_info.txt`

**Framework integration:** see `resources/framework-integration.md` for the `type: chat_template` and `chat_template: chatml` settings and for `apply_chat_template()`.

**Record format:** each line is a JSON object with a `messages` field:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
```

The roles are `system`, `user`, and `assistant`. See `resources/chatml-format.md` for details.

**Scripts** (in `scripts/`):

```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

| Task Complexity | Recommended Size | Notes |
|---|---|---|
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |
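As a rough illustration, the size guidance in the table can be turned into a quick sanity check on an example count. This is only a sketch: the `SIZE_RANGES` keys and the `size_advice` function are hypothetical names, and the open-ended "10,000+" bound for complex tasks is modeled as no upper limit.

```python
# Recommended example counts from the table; None means no upper bound.
SIZE_RANGES = {
    "simple": (100, 500),
    "medium": (500, 2_000),
    "complex": (2_000, None),  # table lists 2,000-10,000+
    "domain_adaptation": (1_000, 5_000),
}


def size_advice(task_type, example_count):
    """Compare an example count with the recommended range for a task type."""
    low, high = SIZE_RANGES[task_type]
    if example_count < low:
        return f"below the recommended range: {low - example_count} more examples needed"
    if high is not None and example_count > high:
        return "above the recommended range: check for redundant examples"
    return "within the recommended range"


print(size_advice("medium", 350))
```

The ranges are guidance rather than hard limits, so a result outside the range is a prompt to review the dataset, not an error.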
**Resources:**
- `resources/dataset-strategy.md`
- `resources/generation-techniques.md`
- `resources/chatml-format.md`
- `resources/examples.md`
- `resources/quality-validation.md`
- `resources/framework-integration.md`