golden-dataset-validation
Golden Dataset Validation
Ensure data integrity, prevent duplicates, and maintain quality standards
Overview
This skill provides comprehensive validation patterns for the golden dataset, ensuring every entry meets quality standards before inclusion.
When to use this skill:
- Validating new documents before adding them to the dataset
- Running integrity checks on the existing dataset
- Detecting duplicate or similar content
- Analyzing coverage gaps
- Pre-commit validation hooks
Schema Validation
Document Schema (v2.0)
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["id", "title", "source_url", "content_type", "sections"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^[a-z0-9-]+$",
      "description": "Unique kebab-case identifier"
    },
    "title": {
      "type": "string",
      "minLength": 10,
      "maxLength": 200
    },
    "source_url": {
      "type": "string",
      "format": "uri",
      "description": "Canonical source URL (NOT placeholder)"
    },
    "content_type": {
      "type": "string",
      "enum": ["article", "tutorial", "research_paper", "documentation", "video_transcript", "code_repository"]
    },
    "bucket": {
      "type": "string",
      "enum": ["short", "long"]
    },
    "tags": {
      "type": "array",
      "items": {"type": "string"},
      "minItems": 2,
      "maxItems": 10
    },
    "sections": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["id", "title", "content"],
        "properties": {
          "id": {"type": "string", "pattern": "^[a-z0-9-/]+$"},
          "title": {"type": "string"},
          "content": {"type": "string", "minLength": 50},
          "granularity": {"enum": ["coarse", "fine", "summary"]}
        }
      }
    }
  }
}
```
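As a rough illustration of what the document schema enforces, the sketch below re-implements a handful of its constraints with the Python standard library only. The function name `document_errors` and the error messages are hypothetical and not part of this skill; the sketch also assumes field values already have the right JSON types. A real pipeline would more likely hand the schema to a dedicated JSON Schema validator.

```python
import re
from urllib.parse import urlparse

# Patterns and enums copied from the document schema (v2.0).
DOC_ID = re.compile(r"[a-z0-9-]+")
SECTION_ID = re.compile(r"[a-z0-9/-]+")  # same character set as the schema pattern
CONTENT_TYPES = {
    "article", "tutorial", "research_paper",
    "documentation", "video_transcript", "code_repository",
}
REQUIRED_FIELDS = ("id", "title", "source_url", "content_type", "sections")


def document_errors(doc: dict) -> list[str]:
    """Return readable schema violations; an empty list means the document passed."""
    errors = [f"missing required field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    if errors:
        return errors  # the remaining checks assume the required fields exist

    if not DOC_ID.fullmatch(doc["id"]):
        errors.append("id must be kebab-case (a-z, 0-9, hyphen)")
    if not 10 <= len(doc["title"]) <= 200:
        errors.append("title must be 10 to 200 characters long")
    parsed = urlparse(doc["source_url"])
    if not (parsed.scheme and parsed.netloc):
        errors.append("source_url must be an absolute URI")
    if doc["content_type"] not in CONTENT_TYPES:
        errors.append(f"unknown content_type: {doc['content_type']}")
    if "tags" in doc and not 2 <= len(doc["tags"]) <= 10:
        errors.append("tags must contain 2 to 10 items")

    if not doc["sections"]:
        errors.append("sections must contain at least one entry")
    for index, section in enumerate(doc["sections"]):
        missing = [name for name in ("id", "title", "content") if name not in section]
        if missing:
            errors.append(f"sections[{index}] is missing: {', '.join(missing)}")
            continue
        if not SECTION_ID.fullmatch(section["id"]):
            errors.append(f"sections[{index}].id has invalid characters")
        if len(section["content"]) < 50:
            errors.append(f"sections[{index}].content is shorter than 50 characters")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass, which tends to suit pre-commit style checks.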
Query Schema
```json
{
  "type": "object",
  "required": ["id", "query", "difficulty", "expected_chunks", "min_score"],
  "properties": {
    "id": {"type": "string", "pattern": "^q-[a-z0-9-]+$"},
    "query": {"type": "string", "minLength": 5, "maxLength": 500},
    "modes": {"type": "array", "items": {"enum": ["semantic", "keyword", "hybrid"]}},
    "category": {"enum": ["specific", "broad", "negative", "edge", "coarse-to-fine"]},
    "difficulty": {"enum": ["trivial", "easy", "medium", "hard", "adversarial"]},
    "expected_chunks": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    "min_score": {"type": "number", "minimum": 0, "maximum": 1}
  }
}
```
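The query schema can be spot-checked in a similar way. The following is a minimal sketch using only the standard library; `query_errors` and its messages are hypothetical names chosen for illustration, and the optional `modes` and `category` fields are only checked when present.

```python
import re

# Values copied from the query schema.
QUERY_ID = re.compile(r"q-[a-z0-9-]+")
DIFFICULTIES = {"trivial", "easy", "medium", "hard", "adversarial"}
CATEGORIES = {"specific", "broad", "negative", "edge", "coarse-to-fine"}
MODES = {"semantic", "keyword", "hybrid"}
REQUIRED_FIELDS = ("id", "query", "difficulty", "expected_chunks", "min_score")


def query_errors(query: dict) -> list[str]:
    """Return readable schema violations for one query entry."""
    errors = [f"missing required field: {name}" for name in REQUIRED_FIELDS if name not in query]
    if errors:
        return errors

    if not QUERY_ID.fullmatch(query["id"]):
        errors.append("id must look like q-<kebab-case>")
    if not 5 <= len(query["query"]) <= 500:
        errors.append("query must be 5 to 500 characters long")
    if query["difficulty"] not in DIFFICULTIES:
        errors.append(f"unknown difficulty: {query['difficulty']}")
    if not query["expected_chunks"]:
        errors.append("expected_chunks must contain at least one entry")

    score = query["min_score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0 <= score <= 1:
        errors.append("min_score must be a number between 0 and 1")

    # Optional fields are validated only when supplied.
    if "category" in query and query["category"] not in CATEGORIES:
        errors.append(f"unknown category: {query['category']}")
    unknown_modes = [mode for mode in query.get("modes", []) if mode not in MODES]
    if unknown_modes:
        errors.append(f"unknown modes: {', '.join(unknown_modes)}")
    return errors
```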
Validation Rules Summary
| Rule | Purpose | Severity |
|---|---|---|
| No Placeholder URLs | Ensure real canonical URLs | Error |
| Unique Identifiers | No duplicate doc/query/section IDs | Error |
| Referential Integrity | Query chunks reference valid sections | Error |
| Content Quality | Title/content length, tag count | Warning |
| Difficulty Distribution | Balanced query difficulty levels | Warning |
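The three error-level rules in the table above might be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the helper names are hypothetical, the `PLACEHOLDER_HOSTS` set is an invented example of what "placeholder" could mean, and `expected_chunks` is assumed to hold section IDs.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of hosts treated as placeholders; adjust to the dataset's needs.
PLACEHOLDER_HOSTS = {"example.com", "example.org", "localhost"}


def placeholder_urls(documents: list[dict]) -> list[str]:
    """Return IDs of documents whose source_url points at a placeholder host."""
    flagged = []
    for doc in documents:
        host = (urlparse(doc["source_url"]).hostname or "").removeprefix("www.")
        if host in PLACEHOLDER_HOSTS:
            flagged.append(doc["id"])
    return flagged


def duplicate_ids(ids: list[str]) -> list[str]:
    """Return every ID that occurs more than once, sorted for stable output."""
    return sorted(item for item, count in Counter(ids).items() if count > 1)


def dangling_chunk_refs(documents: list[dict], queries: list[dict]) -> dict[str, list[str]]:
    """Map each query ID to the expected chunks that match no known section ID."""
    known_sections = {section["id"] for doc in documents for section in doc["sections"]}
    dangling = {}
    for query in queries:
        missing = [chunk for chunk in query["expected_chunks"] if chunk not in known_sections]
        if missing:
            dangling[query["id"]] = missing
    return dangling
```

`duplicate_ids` takes a plain list so the same helper can be run separately over document, query, and section IDs.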
Quick Reference
Duplicate Detection Thresholds
| Similarity | Action |
|---|---|
| >= 0.90 | Block - Content too similar |
| >= 0.85 | Warn - High similarity detected |
| >= 0.80 | Note - Similar content exists |
| < 0.80 | Allow - Sufficiently unique |
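The thresholds above are meant to be read top-down, with the first matching row winning. A minimal sketch of that mapping, assuming a similarity score in the range 0 to 1 (the function name `duplicate_action` and the lowercase action labels are hypothetical):

```python
def duplicate_action(similarity: float) -> str:
    """Map a similarity score to the action from the threshold table."""
    # Checks run from the strictest threshold down, so each score gets one action.
    if similarity >= 0.90:
        return "block"  # content too similar
    if similarity >= 0.85:
        return "warn"   # high similarity detected
    if similarity >= 0.80:
        return "note"   # similar content exists
    return "allow"      # sufficiently unique
```

For example, a score of 0.87 falls in the warn band because it clears 0.85 but not 0.90.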
Coverage Requirements
| Metric | Minimum |
|---|---|
| Tutorials | >= 15% of documents |
| Research papers | >= 5% of documents |
| Domain coverage | >= 5 docs per expected domain |
| Hard queries | >= 10% of queries |
| Adversarial queries | >= 5% of queries |
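The percentage-based requirements can be checked with simple counting. The sketch below is an assumption-laden illustration: `coverage_gaps` is a hypothetical name, and the per-domain requirement is left out because the schemas shown here do not define a domain field.

```python
from collections import Counter


def coverage_gaps(documents: list[dict], queries: list[dict]) -> list[str]:
    """Return one message per percentage-based coverage requirement that is not met."""
    doc_types = Counter(doc["content_type"] for doc in documents)
    difficulties = Counter(query["difficulty"] for query in queries)

    # (label, matching count, population size, minimum share) from the coverage table.
    checks = [
        ("tutorials", doc_types["tutorial"], len(documents), 0.15),
        ("research papers", doc_types["research_paper"], len(documents), 0.05),
        ("hard queries", difficulties["hard"], len(queries), 0.10),
        ("adversarial queries", difficulties["adversarial"], len(queries), 0.05),
    ]

    gaps = []
    for label, count, total, minimum in checks:
        share = count / total if total else 0.0  # an empty dataset counts as 0% coverage
        if share < minimum:
            gaps.append(f"{label}: {share:.0%} is below the {minimum:.0%} minimum")
    return gaps
```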
Difficulty Distribution Requirements
| Level | Minimum Count |
|---|---|
| trivial | 3 |
| easy | 3 |
| medium | 5 |
| hard | 3 |
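The minimum counts above lend themselves to a small lookup table. As a sketch (the name `difficulty_shortfalls` is hypothetical), the function below reports how many more queries each level would need:

```python
from collections import Counter

# Minimum number of queries per difficulty level, from the table above.
MIN_COUNTS = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def difficulty_shortfalls(queries: list[dict]) -> dict[str, int]:
    """Return, per under-represented level, how many queries are still missing."""
    counts = Counter(query["difficulty"] for query in queries)
    return {
        level: minimum - counts[level]
        for level, minimum in MIN_COUNTS.items()
        if counts[level] < minimum
    }
```

An empty result means every listed level meets its minimum; levels without a minimum in the table, such as adversarial, are ignored here.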
References
For detailed implementation patterns, see:
- references/validation-rules.md - URL validation, ID uniqueness, referential integrity, content quality, and duplicate detection code
- references/quality-metrics.md - Coverage analysis, pre-addition validation workflow, full dataset validation, and CLI/hook integration
Related Skills
- golden-dataset-curation - Quality criteria and workflows
- golden-dataset-management - Backup/restore operations
- pgvector-search - Embedding-based duplicate detection
Version: 1.0.0 (December 2025)
Issue: #599
Capability Details
schema-validation
Keywords: schema, validation, schema check, format validation
Solves:
- Validate entries against document schema
- Check required fields are present
- Verify data types and constraints
duplicate-detection
Keywords: duplicate, detection, deduplication, similarity check
Solves:
- Detect duplicate or near-duplicate entries
- Use semantic similarity for fuzzy matching
- Prevent redundant entries in dataset
referential-integrity
Keywords: referential, integrity, foreign key, relationship
Solves:
- Verify relationships between documents and queries
- Check source URL mappings are valid
- Ensure cross-references are consistent
coverage-analysis
Keywords: coverage, analysis, distribution, completeness
Solves:
- Analyze dataset coverage across domains
- Identify gaps in difficulty distribution
- Report coverage metrics and recommendations