golden-dataset-validation


Golden Dataset Validation


Ensure data integrity, prevent duplicates, and maintain quality standards

Overview


This skill provides comprehensive validation patterns for the golden dataset, ensuring every entry meets quality standards before inclusion.
When to use this skill:
  • Validating new documents before adding
  • Running integrity checks on existing dataset
  • Detecting duplicate or similar content
  • Analyzing coverage gaps
  • Pre-commit validation hooks


Schema Validation


Document Schema (v2.0)


```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["id", "title", "source_url", "content_type", "sections"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^[a-z0-9-]+$",
      "description": "Unique kebab-case identifier"
    },
    "title": {
      "type": "string",
      "minLength": 10,
      "maxLength": 200
    },
    "source_url": {
      "type": "string",
      "format": "uri",
      "description": "Canonical source URL (NOT placeholder)"
    },
    "content_type": {
      "type": "string",
      "enum": ["article", "tutorial", "research_paper", "documentation", "video_transcript", "code_repository"]
    },
    "bucket": {
      "type": "string",
      "enum": ["short", "long"]
    },
    "tags": {
      "type": "array",
      "items": {"type": "string"},
      "minItems": 2,
      "maxItems": 10
    },
    "sections": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["id", "title", "content"],
        "properties": {
          "id": {"type": "string", "pattern": "^[a-z0-9-/]+$"},
          "title": {"type": "string"},
          "content": {"type": "string", "minLength": 50},
          "granularity": {"enum": ["coarse", "fine", "summary"]}
        }
      }
    }
  }
}
```
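A full JSON Schema validator (e.g. the `jsonschema` library) can enforce the schema above directly; the core checks are also simple enough to sketch in plain Python. The field names and limits below come from the schema; the helper itself is illustrative, not part of the skill's codebase:

```python
import re

REQUIRED = ("id", "title", "source_url", "content_type", "sections")
ID_RE = re.compile(r"^[a-z0-9-]+$")
CONTENT_TYPES = {"article", "tutorial", "research_paper",
                 "documentation", "video_transcript", "code_repository"}

def validate_document(doc: dict) -> list[str]:
    """Return a list of error strings; an empty list means the doc passes."""
    errors = [f"missing required field: {k}" for k in REQUIRED if k not in doc]
    if errors:
        return errors
    if not ID_RE.match(doc["id"]):
        errors.append("id must be kebab-case (^[a-z0-9-]+$)")
    if not 10 <= len(doc["title"]) <= 200:
        errors.append("title must be 10-200 characters")
    if doc["content_type"] not in CONTENT_TYPES:
        errors.append(f"unknown content_type: {doc['content_type']}")
    if not doc["sections"]:
        errors.append("sections must contain at least one entry")
    for sec in doc["sections"]:
        missing = [k for k in ("id", "title", "content") if k not in sec]
        if missing:
            errors.append(f"section missing fields: {missing}")
        elif len(sec["content"]) < 50:
            errors.append(f"section {sec['id']}: content under 50 chars")
    return errors
```

Returning a list of errors rather than raising on the first failure lets a pre-commit hook report every problem in one pass.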

Query Schema


```json
{
  "type": "object",
  "required": ["id", "query", "difficulty", "expected_chunks", "min_score"],
  "properties": {
    "id": {"type": "string", "pattern": "^q-[a-z0-9-]+$"},
    "query": {"type": "string", "minLength": 5, "maxLength": 500},
    "modes": {"type": "array", "items": {"enum": ["semantic", "keyword", "hybrid"]}},
    "category": {"enum": ["specific", "broad", "negative", "edge", "coarse-to-fine"]},
    "difficulty": {"enum": ["trivial", "easy", "medium", "hard", "adversarial"]},
    "expected_chunks": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    "min_score": {"type": "number", "minimum": 0, "maximum": 1}
  }
}
```

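The same pattern works for query entries. Again, the required fields, the `q-` id prefix, and the 0-1 score range are taken from the schema; the helper is a sketch:

```python
import re

Q_ID_RE = re.compile(r"^q-[a-z0-9-]+$")
DIFFICULTIES = {"trivial", "easy", "medium", "hard", "adversarial"}

def validate_query(q: dict) -> list[str]:
    """Return a list of error strings for one query entry."""
    errors = [f"missing required field: {k}"
              for k in ("id", "query", "difficulty", "expected_chunks", "min_score")
              if k not in q]
    if errors:
        return errors
    if not Q_ID_RE.match(q["id"]):
        errors.append("query id must match ^q-[a-z0-9-]+$")
    if not 5 <= len(q["query"]) <= 500:
        errors.append("query text must be 5-500 characters")
    if q["difficulty"] not in DIFFICULTIES:
        errors.append(f"unknown difficulty: {q['difficulty']}")
    if not q["expected_chunks"]:
        errors.append("expected_chunks must list at least one chunk id")
    if not 0 <= q["min_score"] <= 1:
        errors.append("min_score must be between 0 and 1")
    return errors
```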

Validation Rules Summary


| Rule | Purpose | Severity |
| --- | --- | --- |
| No Placeholder URLs | Ensure real canonical URLs | Error |
| Unique Identifiers | No duplicate doc/query/section IDs | Error |
| Referential Integrity | Query chunks reference valid sections | Error |
| Content Quality | Title/content length, tag count | Warning |
| Difficulty Distribution | Balanced query difficulty levels | Warning |

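The referential-integrity rule is the one that spans both schemas: every id in a query's `expected_chunks` must resolve to a real section. A sketch, assuming chunk ids take the form `<doc_id>/<section_id>` (the actual convention may differ):

```python
def check_referential_integrity(documents, queries):
    """Return (query_id, chunk_id) pairs that reference no known section.

    Assumes expected chunk ids are '<doc_id>/<section_id>'; adjust the
    key format if the dataset uses a different convention.
    """
    known = {f"{d['id']}/{s['id']}" for d in documents for s in d["sections"]}
    return [
        (q["id"], chunk)
        for q in queries
        for chunk in q["expected_chunks"]
        if chunk not in known
    ]
```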

Quick Reference


Duplicate Detection Thresholds


| Similarity | Action |
| --- | --- |
| >= 0.90 | Block - Content too similar |
| >= 0.85 | Warn - High similarity detected |
| >= 0.80 | Note - Similar content exists |
| < 0.80 | Allow - Sufficiently unique |
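The threshold bands above map directly to a small decision function. The input is a similarity score (e.g. cosine similarity of embeddings, as used by pgvector-search); the function name is illustrative:

```python
def duplicate_action(similarity: float) -> str:
    """Map a similarity score in [0, 1] to a validation action."""
    if similarity >= 0.90:
        return "block"   # content too similar to an existing entry
    if similarity >= 0.85:
        return "warn"    # high similarity detected
    if similarity >= 0.80:
        return "note"    # similar content exists
    return "allow"       # sufficiently unique
```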

Coverage Requirements


| Metric | Minimum |
| --- | --- |
| Tutorials | >= 15% of documents |
| Research papers | >= 5% of documents |
| Domain coverage | >= 5 docs per expected domain |
| Hard queries | >= 10% of queries |
| Adversarial queries | >= 5% of queries |
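The percentage-based minimums can be checked with a couple of counters. A sketch covering the content-type and query ratios (it assumes non-empty document and query lists; the domain-coverage rule would need a per-domain tag count and is omitted here):

```python
from collections import Counter

def coverage_report(documents, queries):
    """Return a dict of failed coverage checks; empty means all pass."""
    doc_types = Counter(d["content_type"] for d in documents)
    q_levels = Counter(q["difficulty"] for q in queries)
    n_docs, n_queries = len(documents), len(queries)
    failures = {}
    if doc_types["tutorial"] / n_docs < 0.15:
        failures["tutorials"] = "below 15% of documents"
    if doc_types["research_paper"] / n_docs < 0.05:
        failures["research_papers"] = "below 5% of documents"
    if q_levels["hard"] / n_queries < 0.10:
        failures["hard_queries"] = "below 10% of queries"
    if q_levels["adversarial"] / n_queries < 0.05:
        failures["adversarial_queries"] = "below 5% of queries"
    return failures
```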

Difficulty Distribution Requirements


| Level | Minimum Count |
| --- | --- |
| trivial | 3 |
| easy | 3 |
| medium | 5 |
| hard | 3 |

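These absolute minimums are a straightforward count check. The sketch below reports each level that falls short, as `(actual, required)`:

```python
from collections import Counter

MIN_COUNTS = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}

def check_difficulty_distribution(queries):
    """Return {level: (actual, required)} for every under-represented level."""
    counts = Counter(q["difficulty"] for q in queries)
    return {level: (counts[level], need)
            for level, need in MIN_COUNTS.items()
            if counts[level] < need}
```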

References


For detailed implementation patterns, see:
  • references/validation-rules.md
    - URL validation, ID uniqueness, referential integrity, content quality, and duplicate detection code
  • references/quality-metrics.md
    - Coverage analysis, pre-addition validation workflow, full dataset validation, and CLI/hook integration


Related Skills


  • golden-dataset-curation
    - Quality criteria and workflows
  • golden-dataset-management
    - Backup/restore operations
  • pgvector-search
    - Embedding-based duplicate detection

Version: 1.0.0 (December 2025) Issue: #599

Capability Details


schema-validation


Keywords: schema, validation, schema check, format validation
Solves:
  • Validate entries against document schema
  • Check required fields are present
  • Verify data types and constraints

duplicate-detection


Keywords: duplicate, detection, deduplication, similarity check
Solves:
  • Detect duplicate or near-duplicate entries
  • Use semantic similarity for fuzzy matching
  • Prevent redundant entries in dataset

referential-integrity


Keywords: referential, integrity, foreign key, relationship
Solves:
  • Verify relationships between documents and queries
  • Check source URL mappings are valid
  • Ensure cross-references are consistent

coverage-analysis


Keywords: coverage, analysis, distribution, completeness
Solves:
  • Analyze dataset coverage across domains
  • Identify gaps in difficulty distribution
  • Report coverage metrics and recommendations