golden-dataset-validation


Golden Dataset Validation


Ensure data integrity, prevent duplicates, and maintain quality standards

Overview


This skill provides comprehensive validation patterns for the golden dataset, ensuring every entry meets quality standards before inclusion.
When to use this skill:
  • Validating new documents before adding
  • Running integrity checks on existing dataset
  • Detecting duplicate or similar content
  • Analyzing coverage gaps
  • Pre-commit validation hooks


Schema Validation


Document Schema (v2.0)


```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["id", "title", "source_url", "content_type", "sections"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^[a-z0-9-]+$",
      "description": "Unique kebab-case identifier"
    },
    "title": {
      "type": "string",
      "minLength": 10,
      "maxLength": 200
    },
    "source_url": {
      "type": "string",
      "format": "uri",
      "description": "Canonical source URL (NOT placeholder)"
    },
    "content_type": {
      "type": "string",
      "enum": ["article", "tutorial", "research_paper", "documentation", "video_transcript", "code_repository"]
    },
    "bucket": {
      "type": "string",
      "enum": ["short", "long"]
    },
    "tags": {
      "type": "array",
      "items": {"type": "string"},
      "minItems": 2,
      "maxItems": 10
    },
    "sections": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["id", "title", "content"],
        "properties": {
          "id": {"type": "string", "pattern": "^[a-z0-9-/]+$"},
          "title": {"type": "string"},
          "content": {"type": "string", "minLength": 50},
          "granularity": {"enum": ["coarse", "fine", "summary"]}
        }
      }
    }
  }
}
```
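A full JSON Schema validator (e.g. the `jsonschema` library) can enforce the schema above directly; the core checks are also simple enough to sketch in plain Python. The field names and limits below come from the schema; the helper itself is illustrative, not part of the skill's codebase:

```python
import re

REQUIRED = ("id", "title", "source_url", "content_type", "sections")
ID_RE = re.compile(r"^[a-z0-9-]+$")
CONTENT_TYPES = {"article", "tutorial", "research_paper",
                 "documentation", "video_transcript", "code_repository"}

def validate_document(doc: dict) -> list[str]:
    """Return a list of error strings; an empty list means the doc passes."""
    errors = [f"missing required field: {k}" for k in REQUIRED if k not in doc]
    if errors:
        return errors
    if not ID_RE.match(doc["id"]):
        errors.append("id must be kebab-case (^[a-z0-9-]+$)")
    if not 10 <= len(doc["title"]) <= 200:
        errors.append("title must be 10-200 characters")
    if doc["content_type"] not in CONTENT_TYPES:
        errors.append(f"unknown content_type: {doc['content_type']}")
    if not doc["sections"]:
        errors.append("sections must contain at least one entry")
    for sec in doc["sections"]:
        missing = [k for k in ("id", "title", "content") if k not in sec]
        if missing:
            errors.append(f"section missing fields: {missing}")
        elif len(sec["content"]) < 50:
            errors.append(f"section {sec['id']}: content under 50 chars")
    return errors
```

Returning a list of errors rather than raising on the first failure lets a pre-commit hook report every problem in one pass.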

Query Schema


```json
{
  "type": "object",
  "required": ["id", "query", "difficulty", "expected_chunks", "min_score"],
  "properties": {
    "id": {"type": "string", "pattern": "^q-[a-z0-9-]+$"},
    "query": {"type": "string", "minLength": 5, "maxLength": 500},
    "modes": {"type": "array", "items": {"enum": ["semantic", "keyword", "hybrid"]}},
    "category": {"enum": ["specific", "broad", "negative", "edge", "coarse-to-fine"]},
    "difficulty": {"enum": ["trivial", "easy", "medium", "hard", "adversarial"]},
    "expected_chunks": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    "min_score": {"type": "number", "minimum": 0, "maximum": 1}
  }
}
```

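The same pattern works for query entries. Again, the required fields, the `q-` id prefix, and the 0-1 score range are taken from the schema; the helper is a sketch:

```python
import re

Q_ID_RE = re.compile(r"^q-[a-z0-9-]+$")
DIFFICULTIES = {"trivial", "easy", "medium", "hard", "adversarial"}

def validate_query(q: dict) -> list[str]:
    """Return a list of error strings for one query entry."""
    errors = [f"missing required field: {k}"
              for k in ("id", "query", "difficulty", "expected_chunks", "min_score")
              if k not in q]
    if errors:
        return errors
    if not Q_ID_RE.match(q["id"]):
        errors.append("query id must match ^q-[a-z0-9-]+$")
    if not 5 <= len(q["query"]) <= 500:
        errors.append("query text must be 5-500 characters")
    if q["difficulty"] not in DIFFICULTIES:
        errors.append(f"unknown difficulty: {q['difficulty']}")
    if not q["expected_chunks"]:
        errors.append("expected_chunks must list at least one chunk id")
    if not 0 <= q["min_score"] <= 1:
        errors.append("min_score must be between 0 and 1")
    return errors
```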

Validation Rules Summary


| Rule | Purpose | Severity |
| --- | --- | --- |
| No Placeholder URLs | Ensure real canonical URLs | Error |
| Unique Identifiers | No duplicate doc/query/section IDs | Error |
| Referential Integrity | Query chunks reference valid sections | Error |
| Content Quality | Title/content length, tag count | Warning |
| Difficulty Distribution | Balanced query difficulty levels | Warning |

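The referential-integrity rule is the one that spans both schemas: every id in a query's `expected_chunks` must resolve to a real section. A sketch, assuming chunk ids take the form `<doc_id>/<section_id>` (the actual convention may differ):

```python
def check_referential_integrity(documents, queries):
    """Return (query_id, chunk_id) pairs that reference no known section.

    Assumes expected chunk ids are '<doc_id>/<section_id>'; adjust the
    key format if the dataset uses a different convention.
    """
    known = {f"{d['id']}/{s['id']}" for d in documents for s in d["sections"]}
    return [
        (q["id"], chunk)
        for q in queries
        for chunk in q["expected_chunks"]
        if chunk not in known
    ]
```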

Quick Reference


Duplicate Detection Thresholds


| Similarity | Action |
| --- | --- |
| >= 0.90 | Block - Content too similar |
| >= 0.85 | Warn - High similarity detected |
| >= 0.80 | Note - Similar content exists |
| < 0.80 | Allow - Sufficiently unique |
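The threshold bands above map directly to a small decision function. The input is a similarity score (e.g. cosine similarity of embeddings, as used by pgvector-search); the function name is illustrative:

```python
def duplicate_action(similarity: float) -> str:
    """Map a similarity score in [0, 1] to a validation action."""
    if similarity >= 0.90:
        return "block"   # content too similar to an existing entry
    if similarity >= 0.85:
        return "warn"    # high similarity detected
    if similarity >= 0.80:
        return "note"    # similar content exists
    return "allow"       # sufficiently unique
```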

Coverage Requirements


| Metric | Minimum |
| --- | --- |
| Tutorials | >= 15% of documents |
| Research papers | >= 5% of documents |
| Domain coverage | >= 5 docs per expected domain |
| Hard queries | >= 10% of queries |
| Adversarial queries | >= 5% of queries |
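The percentage-based minimums can be checked with a couple of counters. A sketch covering the content-type and query ratios (it assumes non-empty document and query lists; the domain-coverage rule would need a per-domain tag count and is omitted here):

```python
from collections import Counter

def coverage_report(documents, queries):
    """Return a dict of failed coverage checks; empty means all pass."""
    doc_types = Counter(d["content_type"] for d in documents)
    q_levels = Counter(q["difficulty"] for q in queries)
    n_docs, n_queries = len(documents), len(queries)
    failures = {}
    if doc_types["tutorial"] / n_docs < 0.15:
        failures["tutorials"] = "below 15% of documents"
    if doc_types["research_paper"] / n_docs < 0.05:
        failures["research_papers"] = "below 5% of documents"
    if q_levels["hard"] / n_queries < 0.10:
        failures["hard_queries"] = "below 10% of queries"
    if q_levels["adversarial"] / n_queries < 0.05:
        failures["adversarial_queries"] = "below 5% of queries"
    return failures
```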

Difficulty Distribution Requirements


| Level | Minimum Count |
| --- | --- |
| trivial | 3 |
| easy | 3 |
| medium | 5 |
| hard | 3 |

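These absolute minimums are a straightforward count check. The sketch below reports each level that falls short, as `(actual, required)`:

```python
from collections import Counter

MIN_COUNTS = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}

def check_difficulty_distribution(queries):
    """Return {level: (actual, required)} for every under-represented level."""
    counts = Counter(q["difficulty"] for q in queries)
    return {level: (counts[level], need)
            for level, need in MIN_COUNTS.items()
            if counts[level] < need}
```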

References


For detailed implementation patterns, see:
  • references/validation-rules.md
    - URL validation, ID uniqueness, referential integrity, content quality, and duplicate detection code
  • references/quality-metrics.md
    - Coverage analysis, pre-addition validation workflow, full dataset validation, and CLI/hook integration


Related Skills


  • golden-dataset-curation
    - Quality criteria and workflows
  • golden-dataset-management
    - Backup/restore operations
  • pgvector-search
    - Embedding-based duplicate detection

Version: 1.0.0 (December 2025) Issue: #599

Capability Details


schema-validation


Keywords: schema, validation, schema check, format validation
Solves:
  • Validate entries against document schema
  • Check required fields are present
  • Verify data types and constraints

duplicate-detection


Keywords: duplicate, detection, deduplication, similarity check
Solves:
  • Detect duplicate or near-duplicate entries
  • Use semantic similarity for fuzzy matching
  • Prevent redundant entries in dataset

referential-integrity


Keywords: referential, integrity, foreign key, relationship
Solves:
  • Verify relationships between documents and queries
  • Check source URL mappings are valid
  • Ensure cross-references are consistent

coverage-analysis


Keywords: coverage, analysis, distribution, completeness
Solves:
  • Analyze dataset coverage across domains
  • Identify gaps in difficulty distribution
  • Report coverage metrics and recommendations