golden-dataset
Golden Dataset
Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in `rules/`, loaded on demand.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |
Total: 10 rules across 4 categories
Curation
Content collection, multi-agent annotation, and diversity analysis for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Collection | | Content type classification, quality thresholds, duplicate prevention |
| Annotation | | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | | Difficulty stratification, domain coverage, balance guidelines |
Management
Versioning, storage, and CI/CD automation for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Versioning | | JSON backup format, embedding regeneration, disaster recovery |
| Storage | | Backup strategies, URL contract, data integrity checks |
| CI Integration | | GitHub Actions automation, pre-deployment validation, weekly backups |
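The JSON backup pattern above can be sketched as follows. This is a minimal illustration, not the rule file's actual implementation: the `embedding` field name, the `version` key, and both function names are assumptions made for the example.

```python
import json


def to_backup_json(documents: list[dict]) -> str:
    """Serialize documents to JSON, dropping the (assumed) "embedding" field."""
    stripped = [
        {key: value for key, value in doc.items() if key != "embedding"}
        for doc in documents
    ]
    # Sorted keys and fixed indentation keep diffs stable under version control.
    return json.dumps(
        {"version": 1, "documents": stripped},  # "version" is a hypothetical key
        indent=2,
        sort_keys=True,
        ensure_ascii=False,
    )


def from_backup_json(payload: str) -> list[dict]:
    """Load documents from a backup; embeddings still need regenerating afterwards."""
    return json.loads(payload)["documents"]
```

Because embeddings are excluded, a restore is only complete once they have been regenerated for every loaded document.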
Validation
Quality scoring, drift detection, and regression testing for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Quality | | Schema validation, content quality, referential integrity |
| Drift | | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | | Difficulty distribution, pre-commit hooks, full dataset validation |
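As a rough sketch of the duplicate detection pattern, the snippet below compares a candidate embedding against existing ones using cosine similarity and applies the 0.90 (block) and 0.85 (warn) thresholds listed under Key Decisions. Cosine similarity and the function names are assumptions; the rule files may use a different similarity measure.

```python
import math

BLOCK_THRESHOLD = 0.90  # from Key Decisions: >= 0.90 similarity blocks
WARN_THRESHOLD = 0.85   # from Key Decisions: >= 0.85 warns


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_duplicate(candidate: list[float], existing: list[list[float]]) -> str:
    """Return "block", "warn", or "ok" based on the closest existing embedding."""
    best = max((cosine_similarity(candidate, e) for e in existing), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return "block"
    if best >= WARN_THRESHOLD:
        return "warn"
    return "ok"
```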
Add Workflow
Structured workflow for adding new documents to the golden dataset.
| Rule | File | Key Pattern |
|---|---|---|
| Add Document | | 9-phase curation, parallel quality analysis, bias detection |
Quick Start Example
```python
from app.shared.services.embeddings import embed_text


async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []

    # 1. URL contract check
    if "placeholder" in document.get("source_url", ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
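One way to read the thresholds in this table is sketched below. How the quality and confidence thresholds combine is an assumption here (entries that pass quality but fall short on confidence go to manual review), and the field names `quality_score`, `confidence`, `queries`, and `difficulty` are hypothetical.

```python
from collections import Counter

MIN_QUALITY = 0.70      # quality score required for inclusion
MIN_CONFIDENCE = 0.65   # confidence required for auto-include
MIN_TAGS = 2            # domain tags per entry
MIN_QUERIES = 3         # test queries per document
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def inclusion_decision(entry: dict) -> str:
    """Return "include", "review", or "reject" for one candidate entry."""
    if len(entry.get("tags", [])) < MIN_TAGS:
        return "reject"
    if len(entry.get("queries", [])) < MIN_QUERIES:
        return "reject"
    if entry.get("quality_score", 0.0) < MIN_QUALITY:
        return "reject"
    # Assumption: good quality with low confidence is routed to a human.
    if entry.get("confidence", 0.0) >= MIN_CONFIDENCE:
        return "include"
    return "review"


def difficulty_gaps(queries: list[dict]) -> dict:
    """Map each under-represented difficulty level to the number of queries still missing."""
    counts = Counter(query.get("difficulty") for query in queries)
    return {
        level: minimum - counts[level]
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts[level] < minimum
    }
```

An empty result from `difficulty_gaps` means the dataset meets the minimum difficulty balance.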
Common Mistakes
- Using placeholder URLs instead of canonical source URLs
- Skipping embedding regeneration after restore
- Not validating referential integrity between documents and queries
- Over-indexing on articles (neglecting tutorials, research papers)
- Missing difficulty distribution balance in test queries
- Not running verification after backup/restore operations
- Testing restore procedures in production instead of staging
- Committing SQL dumps instead of JSON (not version-control friendly)
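The referential integrity check mentioned in the list above could look roughly like this. The schema is assumed for illustration: documents carry an `id`, and each query lists the documents it expects in a hypothetical `expected_doc_ids` field.

```python
def referential_errors(documents: list[dict], queries: list[dict]) -> list[str]:
    """List every query reference that points at a document missing from the dataset."""
    doc_ids = {doc["id"] for doc in documents}
    errors = []
    for query in queries:
        for ref in query.get("expected_doc_ids", []):
            if ref not in doc_ids:
                errors.append(
                    f"query {query['id']} references missing document {ref}"
                )
    return errors
```

Running a check like this after every backup or restore would also cover the verification step that the list warns against skipping.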
Evaluations
See `test-cases.json` for 9 test cases across all categories.
Related Skills
- `rag-retrieval` - Retrieval evaluation using golden dataset
- `langfuse-observability` - Tracing patterns for curation workflows
- `testing-patterns` - General testing patterns and strategies
- `ai-native-development` - Embedding generation for restore
Capability Details
curation
Keywords: golden dataset, curation, content collection, annotation, quality criteria
Solves:
- Classify document content types for golden dataset
- Run multi-agent quality analysis pipelines
- Generate test queries for new documents
management
Keywords: golden dataset, backup, restore, versioning, disaster recovery
Solves:
- Backup and restore golden datasets with JSON
- Regenerate embeddings after restore
- Automate backups with CI/CD
validation
Keywords: golden dataset, validation, schema, duplicate detection, quality metrics
Solves:
- Validate entries against document schema
- Detect duplicate or near-duplicate entries
- Analyze dataset coverage and distribution gaps