golden-dataset

Golden Dataset

Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in `rules/`, loaded on demand.

Quick Reference

| Category | Rules | Impact | When to Use |
| --- | --- | --- | --- |
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |

Total: 10 rules across 4 categories

Curation

Content collection, multi-agent annotation, and diversity analysis for golden datasets.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Collection | `rules/curation-collection.md` | Content type classification, quality thresholds, duplicate prevention |
| Annotation | `rules/curation-annotation.md` | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | `rules/curation-diversity.md` | Difficulty stratification, domain coverage, balance guidelines |

Management

Versioning, storage, and CI/CD automation for golden datasets.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Versioning | `rules/management-versioning.md` | JSON backup format, embedding regeneration, disaster recovery |
| Storage | `rules/management-storage.md` | Backup strategies, URL contract, data integrity checks |
| CI Integration | `rules/management-ci.md` | GitHub Actions automation, pre-deployment validation, weekly backups |
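
The backup pattern above (JSON on disk, embeddings left out and regenerated on restore) might look roughly like the sketch below. The field names `embedding`, `document_count`, and `needs_embedding`, and both function names, are assumptions made for illustration; the actual format is defined in `rules/management-versioning.md`.

```python
import json
from typing import Any


def build_backup(documents: list[dict[str, Any]], version: str) -> str:
    """Serialize documents to deterministic JSON, leaving embeddings out."""
    # Copy each document without its (assumed) "embedding" field.
    stripped = [
        {key: value for key, value in doc.items() if key != "embedding"}
        for doc in documents
    ]
    payload = {
        "version": version,
        "document_count": len(stripped),
        "documents": stripped,
    }
    # sort_keys and a fixed indent keep diffs stable under version control.
    return json.dumps(payload, indent=2, sort_keys=True, ensure_ascii=False)


def restore_backup(raw: str) -> list[dict[str, Any]]:
    """Load documents and flag each one as needing a fresh embedding."""
    payload = json.loads(raw)
    documents = payload["documents"]
    if len(documents) != payload["document_count"]:
        raise ValueError("document_count does not match the number of documents")
    for doc in documents:
        doc["needs_embedding"] = True  # hypothetical flag: regenerate before use
    return documents
```

Storing a count next to the documents gives the restore step a cheap integrity check before any embeddings are regenerated.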

Validation

Quality scoring, drift detection, and regression testing for golden datasets.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Quality | `rules/validation-quality.md` | Schema validation, content quality, referential integrity |
| Drift | `rules/validation-drift.md` | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | `rules/validation-regression.md` | Difficulty distribution, pre-commit hooks, full dataset validation |
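
As an illustration of the duplicate-detection pattern, here is a minimal sketch that applies the duplicate thresholds listed under Key Decisions (block at >= 0.90, warn at >= 0.85). It assumes "similarity" means cosine similarity between embedding vectors, which this document does not state; the function names and return shape are likewise hypothetical.

```python
import math

BLOCK_THRESHOLD = 0.90  # from Key Decisions: block at or above this similarity
WARN_THRESHOLD = 0.85   # from Key Decisions: warn at or above this similarity


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_duplicate(candidate: list[float], existing: dict[str, list[float]]) -> dict:
    """Compare a candidate embedding against existing ones and classify the result."""
    best_id, best_score = None, 0.0
    for doc_id, vector in existing.items():
        score = cosine_similarity(candidate, vector)
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_score >= BLOCK_THRESHOLD:
        status = "block"
    elif best_score >= WARN_THRESHOLD:
        status = "warn"
    else:
        status = "ok"
    return {"status": status, "closest_id": best_id, "similarity": best_score}
```

A linear scan like this is only practical for small datasets. A larger one would more likely delegate the nearest-neighbour lookup to a vector index.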

Add Workflow

Structured workflow for adding new documents to the golden dataset.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Add Document | `rules/curation-add-workflow.md` | 9-phase curation, parallel quality analysis, bias detection |

Quick Start Example

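```python
from app.shared.services.embeddings import embed_text  # project-local helper; unused in this excerpt


async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries.

    Returns {"valid": bool, "errors": list[str]}. `source_url_map` is
    accepted but not consulted by the checks shown here.
    """
    errors = []

    # 1. URL contract check: reject placeholder URLs (a missing or None URL passes this check)
    if "placeholder" in (document.get("source_url") or ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality: titles need at least 10 characters
    if len(document.get("title") or "") < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements: at least 2 domain tags per entry
    if len(document.get("tags") or []) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": not errors, "errors": errors}
```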

Key Decisions

| Decision | Recommendation |
| --- | --- |
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
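
One way to encode the quality, confidence, and difficulty-balance decisions as code is sketched below. The table does not say how the quality and confidence thresholds combine, so the three-way outcome (`reject`, `manual_review`, `auto_include`) is an assumption, as are the function names and the `difficulty` field on each query.

```python
from collections import Counter

QUALITY_THRESHOLD = 0.70      # minimum quality score for inclusion
CONFIDENCE_THRESHOLD = 0.65   # minimum confidence for auto-include
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def inclusion_decision(quality: float, confidence: float) -> str:
    """Classify a candidate entry; the manual_review branch is an assumed fallback."""
    if quality < QUALITY_THRESHOLD:
        return "reject"
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto_include"
    return "manual_review"


def difficulty_shortfalls(queries: list[dict]) -> dict[str, int]:
    """Return how many queries each difficulty level still needs to reach its minimum."""
    counts = Counter(query.get("difficulty") for query in queries)
    return {
        level: minimum - counts[level]
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts[level] < minimum
    }
```

An empty dictionary from `difficulty_shortfalls` means every level meets its minimum.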

Common Mistakes

  1. Using placeholder URLs instead of canonical source URLs
  2. Skipping embedding regeneration after restore
  3. Not validating referential integrity between documents and queries
  4. Over-indexing on articles (neglecting tutorials, research papers)
  5. Missing difficulty distribution balance in test queries
  6. Not running verification after backup/restore operations
  7. Testing restore procedures in production instead of staging
  8. Committing SQL dumps instead of JSON (not version-control friendly)
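
Mistake 3 (unchecked referential integrity) can be caught with a small check like the one sketched here. It also enforces the minimum of 3 test queries per document from Key Decisions. The `id` and `expected_doc_ids` field names are hypothetical; the real schema may differ.

```python
MIN_QUERIES_PER_DOCUMENT = 3  # from Key Decisions


def check_referential_integrity(documents: list[dict], queries: list[dict]) -> list[str]:
    """Report queries that point at unknown documents and documents with too few queries."""
    errors = []
    doc_ids = {doc["id"] for doc in documents}
    query_counts = {doc_id: 0 for doc_id in doc_ids}

    for query in queries:
        for doc_id in query.get("expected_doc_ids", []):
            if doc_id not in doc_ids:
                errors.append(f"query {query['id']} references unknown document {doc_id}")
            else:
                query_counts[doc_id] += 1

    # Sorted so the report is stable from run to run.
    for doc_id in sorted(query_counts):
        if query_counts[doc_id] < MIN_QUERIES_PER_DOCUMENT:
            errors.append(
                f"document {doc_id} has {query_counts[doc_id]} test queries "
                f"(minimum {MIN_QUERIES_PER_DOCUMENT})"
            )
    return errors
```

Running a check of this kind after every restore would also cover mistake 6, since a restore that drops documents or queries shows up as dangling references.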

Evaluations

See `test-cases.json` for 9 test cases across all categories.

Related Skills

  • `rag-retrieval` - Retrieval evaluation using golden dataset
  • `langfuse-observability` - Tracing patterns for curation workflows
  • `testing-patterns` - General testing patterns and strategies
  • `ai-native-development` - Embedding generation for restore

Capability Details

curation

Keywords: golden dataset, curation, content collection, annotation, quality criteria
Solves:
  • Classify document content types for golden dataset
  • Run multi-agent quality analysis pipelines
  • Generate test queries for new documents

management

Keywords: golden dataset, backup, restore, versioning, disaster recovery
Solves:
  • Backup and restore golden datasets with JSON
  • Regenerate embeddings after restore
  • Automate backups with CI/CD

validation

Keywords: golden dataset, validation, schema, duplicate detection, quality metrics
Solves:
  • Validate entries against document schema
  • Detect duplicate or near-duplicate entries
  • Analyze dataset coverage and distribution gaps