golden-dataset-management


Golden Dataset Management

Protect and maintain high-quality test datasets for AI/ML systems

Overview

A golden dataset is a curated collection of high-quality examples used for:
  • Regression testing: Ensure new code doesn't break existing functionality
  • Retrieval evaluation: Measure search quality (precision, recall, MRR)
  • Model benchmarking: Compare different models/approaches
  • Reproducibility: Consistent results across environments
When to use this skill:
  • Building test datasets for RAG systems
  • Implementing backup/restore for critical data
  • Validating data integrity (URL contracts, embeddings)
  • Migrating data between environments

OrchestKit's Golden Dataset

Stats (Production):
  • 98 analyses (completed content analyses)
  • 415 chunks (embedded text segments)
  • 203 test queries (with expected results)
  • 91.6% pass rate (retrieval quality metric)
Purpose:
  • Test hybrid search (vector + BM25 + RRF)
  • Validate metadata boosting strategies
  • Detect regressions in retrieval quality
  • Benchmark new embedding models
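The RRF fusion step named above can be sketched as follows. This is the textbook Reciprocal Rank Fusion algorithm, not OrchestKit's actual search code; `k=60` is the conventional smoothing constant.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF): each result list contributes
# 1/(k + rank) per document, and documents are re-ranked by the summed score.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists; best fused score first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c1", "c2", "c3"]  # ranked by embedding similarity
bm25_hits = ["c1", "c4", "c2"]    # ranked by keyword match
assert rrf_fuse([vector_hits, bm25_hits]) == ["c1", "c2", "c4", "c3"]
```

A golden dataset with expected results per query is what makes regressions in this fused ranking detectable.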

Core Concepts

Data Integrity Contracts

The URL Contract: Golden dataset analyses MUST store real canonical URLs, not placeholders.

- **WRONG:** placeholder URL (breaks restore)
- **CORRECT:** real canonical URL (enables re-fetch if needed)
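The contract can be expressed as a small check. This is an illustrative sketch: the function name is hypothetical, and the placeholder markers mirror the validation checks later in this skill.

```python
# Illustrative URL-contract check; the function name is hypothetical.
PLACEHOLDER_MARKERS = ("orchestkit.dev", "placeholder")

def violates_url_contract(url: str) -> bool:
    """True if the stored URL looks like a placeholder, not a real source."""
    lowered = url.lower()
    return any(marker in lowered for marker in PLACEHOLDER_MARKERS)

# WRONG - placeholder URL (breaks restore)
assert violates_url_contract("https://orchestkit.dev/analysis/123")

# CORRECT - real canonical URL (enables re-fetch if needed)
assert not violates_url_contract("https://docs.python.org/3/library/asyncio.html")
```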


**Why this matters:**
- Enables re-fetching content if embeddings need regeneration
- Allows validation that source content hasn't changed
- Provides audit trail for data provenance

---


Backup Strategy Comparison

| Strategy | Version control | Restore speed | Portability | Inspection |
|---|---|---|---|---|
| JSON (recommended) | Yes | Slower (embeddings regenerated) | High | Easy |
| SQL dump | No (binary) | Fast | DB-version dependent | Hard |

OrchestKit uses JSON backups for version control and portability.

Quick Reference

Backup Format

```json
{
  "version": "1.0",
  "created_at": "2025-12-19T10:30:00Z",
  "metadata": {
    "total_analyses": 98,
    "total_chunks": 415,
    "total_artifacts": 98
  },
  "analyses": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "url": "https://docs.python.org/3/library/asyncio.html",
      "content_type": "documentation",
      "status": "completed",
      "created_at": "2025-11-15T08:20:00Z",
      "chunks": [
        {
          "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
          "content": "asyncio is a library...",
          "section_title": "Introduction to asyncio"
          // embedding NOT included (regenerated on restore)
        }
      ]
    }
  ]
}
```
Key Design Decisions:
  • Embeddings excluded (regenerate on restore with current model)
  • Nested structure (analyses -> chunks -> artifacts)
  • Metadata for validation
  • ISO timestamps for reproducibility
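The metadata block exists so the payload can be validated against declared totals. A sketch of that check, using the field names from the format above (the function itself is illustrative, not the actual verify script):

```python
# Illustrative count check against a backup file's metadata block.
import json

def verify_counts(backup: dict) -> list[str]:
    """Return error strings for any metadata/payload count mismatch."""
    analyses = backup.get("analyses", [])
    actual_chunks = sum(len(a.get("chunks", [])) for a in analyses)
    meta = backup.get("metadata", {})
    errors = []
    if meta.get("total_analyses") != len(analyses):
        errors.append("count mismatch: analyses")
    if meta.get("total_chunks") != actual_chunks:
        errors.append("count mismatch: chunks")
    return errors

backup = json.loads(
    '{"metadata": {"total_analyses": 1, "total_chunks": 2},'
    ' "analyses": [{"id": "a1", "chunks": [{"id": "c1"}, {"id": "c2"}]}]}'
)
assert verify_counts(backup) == []
```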

CLI Commands

```bash
cd backend

# Backup golden dataset
poetry run python scripts/backup_golden_dataset.py backup

# Verify backup integrity
poetry run python scripts/backup_golden_dataset.py verify

# Restore from backup (WARNING: deletes existing data)
poetry run python scripts/backup_golden_dataset.py restore --replace

# Restore without deleting (adds to existing data)
poetry run python scripts/backup_golden_dataset.py restore
```

Validation Checks

| Check | Severity | Description |
|---|---|---|
| Count mismatch | Error | Analysis/chunk count differs from metadata |
| Placeholder URLs | Error | URLs containing `orchestkit.dev` or `placeholder` |
| Missing embeddings | Error | Chunks without embeddings after restore |
| Orphaned chunks | Warning | Chunks with no parent analysis |
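The orphaned-chunk warning amounts to a set-membership check. A sketch, where the `analysis_id` field name is an assumption for illustration:

```python
# Illustrative orphaned-chunk check (warning level); `analysis_id` is an
# assumed field name for this sketch.
def find_orphaned_chunks(analyses: list[dict], chunks: list[dict]) -> list[str]:
    """Return ids of chunks whose parent analysis is missing."""
    known = {a["id"] for a in analyses}
    return [c["id"] for c in chunks if c["analysis_id"] not in known]

analyses = [{"id": "a1"}]
chunks = [
    {"id": "c1", "analysis_id": "a1"},  # has a parent
    {"id": "c2", "analysis_id": "a9"},  # orphaned
]
assert find_orphaned_chunks(analyses, chunks) == ["c2"]
```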

Best Practices Summary

  1. Version control backups - Commit to git for history and diffs
  2. Validate before deployment - Run verify before production changes
  3. Test restore in staging - Never test restore in production first
  4. Document changes - Track additions/removals in metadata

Disaster Recovery Quick Guide

| Scenario | Steps |
|---|---|
| Accidental deletion | `restore --replace` -> `verify` -> run tests |
| Migration failure | `alembic downgrade -1` -> `restore --replace` -> fix migration |
| New environment | Clone repo -> set up DB -> `restore` -> run tests |

References

For detailed implementation patterns, see:
  • references/storage-patterns.md
    - Backup strategies, JSON format, backup script implementation, CI/CD automation
  • references/versioning.md
    - Restore implementation, embedding regeneration, validation checklist, disaster recovery scenarios

Related Skills

  • golden-dataset-validation
    - Schema and integrity validation
  • golden-dataset-curation
    - Quality criteria and curation workflows
  • pgvector-search
    - Retrieval evaluation using golden dataset
  • ai-native-development
    - Embedding generation for restore


Version: 1.0.0 (December 2025)
Status: Production-ready patterns from OrchestKit's 98-analysis golden dataset

Capability Details

backup

Keywords: golden dataset, backup, export, json backup, version control data

Solves:
  • How do I backup the golden dataset?
  • Export analyses to JSON for version control
  • Protect critical test datasets
  • Create portable database snapshots

restore

Keywords: restore dataset, import analyses, regenerate embeddings, disaster recovery, new environment

Solves:
  • How do I restore from backup?
  • Import golden dataset to new environment
  • Regenerate embeddings after restore
  • Disaster recovery procedures

validation

Keywords: verify dataset, url contract, data integrity, validate backup, placeholder urls

Solves:
  • How do I validate dataset integrity?
  • Check URL contracts (no placeholders)
  • Verify embeddings exist
  • Detect orphaned chunks

ci-cd-automation

Keywords: automated backup, github actions, ci cd backup, scheduled backup

Solves:
  • How do I automate dataset backups?
  • Set up GitHub Actions for weekly backups
  • Commit backups to git automatically
  • CI/CD integration patterns

disaster-recovery

Keywords: disaster recovery, accidental deletion, migration failure, rollback

Solves:
  • What if I accidentally delete the dataset?
  • Database migration gone wrong
  • Restore after data corruption
  • Rollback procedures

orchestkit-golden-dataset

Keywords: orchestkit, 98 analyses, 415 chunks, retrieval evaluation, real world

Solves:
  • What is OrchestKit's golden dataset?
  • How does OrchestKit protect test data?
  • Real-world backup/restore examples
  • Production golden dataset stats