golden-dataset-management
Golden Dataset Management
Protect and maintain high-quality test datasets for AI/ML systems
Overview
A golden dataset is a curated collection of high-quality examples used for:
- Regression testing: Ensure new code doesn't break existing functionality
- Retrieval evaluation: Measure search quality (precision, recall, MRR)
- Model benchmarking: Compare different models/approaches
- Reproducibility: Consistent results across environments
When to use this skill:
- Building test datasets for RAG systems
- Implementing backup/restore for critical data
- Validating data integrity (URL contracts, embeddings)
- Migrating data between environments
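The retrieval metrics named above (precision, recall, MRR) can be computed from a ranked result list and a set of expected IDs. The following is a minimal sketch, not OrchestKit's evaluation code: the function names and the choice of string IDs are assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Each test query in a golden dataset would supply one `(retrieved, relevant)` pair; how a per-query result is turned into a pass or fail is a project decision not covered here.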
OrchestKit's Golden Dataset
Stats (Production):
- 98 analyses (completed content analyses)
- 415 chunks (embedded text segments)
- 203 test queries (with expected results)
- 91.6% pass rate (retrieval quality metric)
Purpose:
- Test hybrid search (vector + BM25 + RRF)
- Validate metadata boosting strategies
- Detect regressions in retrieval quality
- Benchmark new embedding models
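For readers unfamiliar with the RRF step in hybrid search, here is a minimal sketch of reciprocal rank fusion over two ranked ID lists (for example, one from vector search and one from BM25). The function name and the constant `k = 60` are assumptions: 60 is a commonly used default, and the value OrchestKit uses is not stated in this document.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists into one list ordered by RRF score.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document ranked well by both retrievers outranks one that only a single retriever favors, which is the behavior the golden dataset's test queries are meant to exercise.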
Core Concepts
Data Integrity Contracts
The URL Contract:
Golden dataset analyses MUST store real canonical URLs, not placeholders.
```python
# WRONG - Placeholder URL (breaks restore)
analysis.url = "https://orchestkit.dev/placeholder/123"

# CORRECT - Real canonical URL (enables re-fetch if needed)
analysis.url = "https://docs.python.org/3/library/asyncio.html"
```
**Why this matters:**
- Enables re-fetching content if embeddings need regeneration
- Allows validation that source content hasn't changed
- Provides audit trail for data provenance
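The URL contract can be enforced with a small check before an analysis is written to a backup. This is a sketch under assumptions: the function name is hypothetical, and the two markers mirror the placeholder rule described in this document (URLs containing `orchestkit.dev` or `placeholder`), not the actual backup script.

```python
from urllib.parse import urlparse

# Substrings that mark a URL as a placeholder (assumed from the rule in this document).
PLACEHOLDER_MARKERS = ("orchestkit.dev", "placeholder")


def violates_url_contract(url: str) -> bool:
    """Return True if the URL is malformed or looks like a placeholder."""
    parsed = urlparse(url)
    # A canonical URL needs an http(s) scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return True
    lowered = url.lower()
    return any(marker in lowered for marker in PLACEHOLDER_MARKERS)
```

Substring matching is deliberately strict: it would also reject a legitimate URL whose path happens to contain the word "placeholder", so a real implementation may want a narrower rule.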
---
Backup Strategy Comparison
| Strategy | Version Control | Restore Speed | Portability | Inspection |
|---|---|---|---|---|
| JSON (recommended) | Yes | Slower (regen embeddings) | High | Easy |
| SQL Dump | No (binary) | Fast | DB-version dependent | Hard |
OrchestKit uses JSON backup for version control and portability.
Quick Reference
Backup Format
```json
{
  "version": "1.0",
  "created_at": "2025-12-19T10:30:00Z",
  "metadata": {
    "total_analyses": 98,
    "total_chunks": 415,
    "total_artifacts": 98
  },
  "analyses": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "url": "https://docs.python.org/3/library/asyncio.html",
      "content_type": "documentation",
      "status": "completed",
      "created_at": "2025-11-15T08:20:00Z",
      "chunks": [
        {
          "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
          "content": "asyncio is a library...",
          "section_title": "Introduction to asyncio"
          // embedding NOT included (regenerated on restore)
        }
      ]
    }
  ]
}
```
Key Design Decisions:
- Embeddings excluded (regenerate on restore with current model)
- Nested structure (analyses -> chunks -> artifacts)
- Metadata for validation
- ISO timestamps for reproducibility
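The design decisions above can be sketched as a small export routine that strips embeddings and records counts for later validation. This is an illustration, not the contents of `backup_golden_dataset.py`: `build_backup` and `dump_backup` are hypothetical names, the input is assumed to be a list of plain dicts, and `total_artifacts` is omitted because the artifact structure is not shown in this document.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_backup(analyses: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a backup document from analyses, dropping chunk embeddings."""
    cleaned = []
    for analysis in analyses:
        # Copy each chunk without its "embedding" key; the input is not mutated.
        chunks = [
            {key: value for key, value in chunk.items() if key != "embedding"}
            for chunk in analysis.get("chunks", [])
        ]
        cleaned.append({**analysis, "chunks": chunks})
    return {
        "version": "1.0",
        # ISO 8601 timestamp in UTC, matching the format shown above.
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "metadata": {
            "total_analyses": len(cleaned),
            "total_chunks": sum(len(a["chunks"]) for a in cleaned),
        },
        "analyses": cleaned,
    }


def dump_backup(backup: Dict[str, Any]) -> str:
    """Serialize with stable key order and indentation so git diffs stay readable."""
    return json.dumps(backup, indent=2, sort_keys=True)
```

Sorting keys and using a fixed indent keeps successive backups diffable, which is the main reason to prefer JSON over a SQL dump for version control.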
CLI Commands
```bash
cd backend

# Backup golden dataset
poetry run python scripts/backup_golden_dataset.py backup

# Verify backup integrity
poetry run python scripts/backup_golden_dataset.py verify

# Restore from backup (WARNING: Deletes existing data)
poetry run python scripts/backup_golden_dataset.py restore --replace

# Restore without deleting (adds to existing)
poetry run python scripts/backup_golden_dataset.py restore
```
Validation Checks
| Check | Error/Warning | Description |
|---|---|---|
| Count mismatch | Error | Analysis/chunk count differs from metadata |
| Placeholder URLs | Error | URLs containing orchestkit.dev or placeholder |
| Missing embeddings | Error | Chunks without embeddings after restore |
| Orphaned chunks | Warning | Chunks with no parent analysis |
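Three of the checks in the table (count mismatch, missing embeddings, orphaned chunks) might look like the following. This is a hedged sketch: the function names, the message wording, and the row shape (`id`, `analysis_id`, `embedding`) are assumptions, not the behavior of the `verify` command.

```python
from typing import Any, Dict, List, Set, Tuple


def check_counts(backup: Dict[str, Any]) -> List[str]:
    """Compare actual analysis/chunk counts in a backup against its metadata."""
    errors: List[str] = []
    analyses = backup.get("analyses", [])
    metadata = backup.get("metadata", {})
    actual_analyses = len(analyses)
    actual_chunks = sum(len(a.get("chunks", [])) for a in analyses)
    if metadata.get("total_analyses") != actual_analyses:
        errors.append(
            f"analysis count mismatch: metadata={metadata.get('total_analyses')}, "
            f"actual={actual_analyses}"
        )
    if metadata.get("total_chunks") != actual_chunks:
        errors.append(
            f"chunk count mismatch: metadata={metadata.get('total_chunks')}, "
            f"actual={actual_chunks}"
        )
    return errors


def check_restored_chunks(
    chunk_rows: List[Dict[str, Any]], analysis_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Check restored chunk rows; returns (errors, warnings)."""
    errors: List[str] = []
    warnings: List[str] = []
    for row in chunk_rows:
        # Missing embeddings are errors: retrieval cannot work without them.
        if row.get("embedding") is None:
            errors.append(f"chunk {row['id']}: missing embedding")
        # Orphaned chunks are only warnings, following the table above.
        if row.get("analysis_id") not in analysis_ids:
            warnings.append(f"chunk {row['id']}: orphaned (no parent analysis)")
    return errors, warnings
```

Keeping errors and warnings in separate lists lets a caller fail a CI job on errors while only logging warnings.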
Best Practices Summary
- Version control backups - Commit to git for history and diffs
- Validate before deployment - Run verify before production changes
- Test restore in staging - Never test restore in production first
- Document changes - Track additions/removals in metadata
Disaster Recovery Quick Guide
| Scenario | Steps |
|---|---|
| Accidental deletion | |
| Migration failure | |
| New environment | Clone repo -> setup DB -> |
References
For detailed implementation patterns, see:
- references/storage-patterns.md - Backup strategies, JSON format, backup script implementation, CI/CD automation
- references/versioning.md - Restore implementation, embedding regeneration, validation checklist, disaster recovery scenarios
Related Skills
- golden-dataset-validation - Schema and integrity validation
- golden-dataset-curation - Quality criteria and curation workflows
- pgvector-search - Retrieval evaluation using golden dataset
- ai-native-development - Embedding generation for restore
Version: 1.0.0 (December 2025)
Status: Production-ready patterns from OrchestKit's 98-analysis golden dataset
Capability Details
backup
Keywords: golden dataset, backup, export, json backup, version control data
Solves:
- How do I backup the golden dataset?
- Export analyses to JSON for version control
- Protect critical test datasets
- Create portable database snapshots
restore
Keywords: restore dataset, import analyses, regenerate embeddings, disaster recovery, new environment
Solves:
- How do I restore from backup?
- Import golden dataset to new environment
- Regenerate embeddings after restore
- Disaster recovery procedures
validation
Keywords: verify dataset, url contract, data integrity, validate backup, placeholder urls
Solves:
- How do I validate dataset integrity?
- Check URL contracts (no placeholders)
- Verify embeddings exist
- Detect orphaned chunks
ci-cd-automation
Keywords: automated backup, github actions, ci cd backup, scheduled backup
Solves:
- How do I automate dataset backups?
- Set up GitHub Actions for weekly backups
- Commit backups to git automatically
- CI/CD integration patterns
disaster-recovery
Keywords: disaster recovery, accidental deletion, migration failure, rollback
Solves:
- What if I accidentally delete the dataset?
- Database migration gone wrong
- Restore after data corruption
- Rollback procedures
orchestkit-golden-dataset
Keywords: orchestkit, 98 analyses, 415 chunks, retrieval evaluation, real world
Solves:
- What is OrchestKit's golden dataset?
- How does OrchestKit protect test data?
- Real-world backup/restore examples
- Production golden dataset stats