golden-dataset-curation
Golden Dataset Curation
Curate high-quality documents for the golden dataset with multi-agent validation
Overview
This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements `golden-dataset-management`, which handles backup/restore.
When to use this skill:
- Adding new documents to the golden dataset
- Classifying content types and difficulty levels
- Generating test queries for new documents
- Running multi-agent quality analysis
Content Types
| Type | Description | Quality Focus |
|---|---|---|
| | Technical articles, blog posts | Depth, accuracy, actionability |
| | Step-by-step guides | Completeness, clarity, code quality |
| | Academic papers, whitepapers | Rigor, citations, methodology |
| | API docs, reference materials | Accuracy, completeness, examples |
| | Transcribed video content | Structure, coherence, key points |
| | README, code analysis | Code quality, documentation |
Difficulty Levels
| Level | Semantic Complexity | Expected Score | Characteristics |
|---|---|---|---|
| trivial | Direct keyword match | >0.85 | Technical terms, exact phrases |
| easy | Common synonyms | >0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
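The expected-score column can be checked mechanically. The following is a minimal sketch, assuming the thresholds are strict lower bounds as the `>` signs suggest; the names `EXPECTED_MIN_SCORE` and `meets_expectation` are hypothetical and not part of the skill.

```python
from typing import Optional

# Expected minimum scores copied from the Difficulty Levels table.
# "adversarial" has no numeric floor: the table only asks for graceful degradation.
EXPECTED_MIN_SCORE = {
    "trivial": 0.85,
    "easy": 0.70,
    "medium": 0.55,
    "hard": 0.40,
}


def meets_expectation(level: str, score: float) -> Optional[bool]:
    """Return True/False for scored levels, or None when the level has no numeric threshold."""
    if level == "adversarial":
        return None  # must be judged qualitatively (assumption)
    if level not in EXPECTED_MIN_SCORE:
        raise ValueError(f"unknown difficulty level: {level!r}")
    # The table uses ">" so a score exactly on the threshold does not pass.
    return score > EXPECTED_MIN_SCORE[level]


print(meets_expectation("medium", 0.61))
```

Returning `None` for adversarial queries keeps "no numeric expectation" distinct from "failed the expectation".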
Quality Dimensions
| Dimension | Weight | Perfect | Acceptable | Failing |
|---|---|---|---|---|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |
Evaluation focuses:
- Accuracy: Technical correctness, code validity, up-to-date info
- Coherence: Logical structure, clear flow, consistent terminology
- Depth: Comprehensive coverage, edge cases, appropriate detail
- Relevance: Alignment with AI/ML, backend, frontend, DevOps domains
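The weights and bands in the table combine into a single score in a straightforward way. The sketch below assumes the total is a plain weighted sum of the four dimension scores and that values falling between the printed bands (for example 0.945 for accuracy) count as acceptable; `weighted_quality` and `band` are hypothetical helper names.

```python
# Weights from the Quality Dimensions table (they sum to 1.0).
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

# (perfect_min, acceptable_min) per dimension, taken from the same table.
BANDS = {
    "accuracy": (0.95, 0.70),
    "coherence": (0.90, 0.60),
    "depth": (0.90, 0.55),
    "relevance": (0.95, 0.70),
}


def band(dimension: str, score: float) -> str:
    """Classify one dimension score as perfect, acceptable, or failing."""
    perfect_min, acceptable_min = BANDS[dimension]
    if score >= perfect_min:
        return "perfect"
    if score >= acceptable_min:
        return "acceptable"
    return "failing"


def weighted_quality(scores: dict) -> float:
    """Weighted sum of the four dimension scores (assumed aggregation rule)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


sample = {"accuracy": 0.80, "coherence": 0.70, "depth": 0.60, "relevance": 0.90}
print(round(weighted_quality(sample), 2), {d: band(d, s) for d, s in sample.items()})
```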
Multi-Agent Pipeline
```
INPUT: URL/Content
         |
         v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|           PARALLEL ANALYSIS AGENTS            |
|   Quality | Difficulty | Domain | Query Gen   |
+-----------------------------------------------+
         |
         v
+------------------+
|    CONSENSUS     |  Weighted score + confidence
|    AGGREGATOR    |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
```
Decision Thresholds
| Quality Score | Confidence | Decision |
|---|---|---|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
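Read top to bottom, the table is a first-match rule. A minimal sketch under that assumption follows; the function name `decide` is hypothetical. Note that a high quality score with low confidence falls through to `review` instead of `include`.

```python
def decide(quality: float, confidence: float) -> str:
    """Apply the decision table rows in order; the first matching row wins (assumption)."""
    if quality >= 0.75 and confidence >= 0.70:
        return "include"
    if quality >= 0.55:
        return "review"
    return "exclude"


for quality, confidence in [(0.80, 0.75), (0.80, 0.50), (0.60, 0.90), (0.50, 0.99)]:
    print(quality, confidence, decide(quality, confidence))
```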
Quality Thresholds
```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2      # At least 2 domain tags
required_queries: 3   # At least 3 test queries
```
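These thresholds can act as a gate before an entry is written. The sketch below hard-codes the four values and assumes a hypothetical `Candidate` shape; how the real pipeline loads the configuration or represents a candidate is not specified here.

```python
from dataclasses import dataclass, field

# Values copied from the recommended thresholds.
THRESHOLDS = {
    "minimum_quality_score": 0.70,
    "minimum_confidence": 0.65,
    "required_tags": 2,
    "required_queries": 3,
}


@dataclass
class Candidate:  # hypothetical container for one document under review
    quality_score: float
    confidence: float
    tags: list = field(default_factory=list)
    queries: list = field(default_factory=list)


def inclusion_problems(candidate: Candidate, thresholds: dict = THRESHOLDS) -> list:
    """Return a human-readable list of unmet thresholds; empty means the gate passes."""
    problems = []
    if candidate.quality_score < thresholds["minimum_quality_score"]:
        problems.append("quality score below minimum")
    if candidate.confidence < thresholds["minimum_confidence"]:
        problems.append("confidence below minimum")
    if len(candidate.tags) < thresholds["required_tags"]:
        problems.append("too few domain tags")
    if len(candidate.queries) < thresholds["required_queries"]:
        problems.append("too few test queries")
    return problems


weak = Candidate(quality_score=0.60, confidence=0.50, tags=["backend"], queries=["q1"])
print(inclusion_problems(weak))
```

Returning every unmet threshold, instead of stopping at the first, lets the approval step show the full picture at once.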
Coverage Balance Guidelines
Maintain balanced coverage across:
- Content types: Don't over-index on articles
- Difficulty levels: Need trivial AND hard queries
- Domains: Spread across AI/ML, backend, frontend, etc.
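One way to monitor balance is to compute the share of each value along a dimension and flag anything dominant or absent. This is a sketch only: the entry field name `difficulty` and the 50% ceiling are assumptions chosen for illustration, not values defined by this skill.

```python
from collections import Counter

# Levels from the Difficulty Levels table.
REQUIRED_DIFFICULTIES = ("trivial", "easy", "medium", "hard", "adversarial")


def shares(entries: list, key: str) -> dict:
    """Fraction of entries carrying each value of `key`."""
    counts = Counter(entry[key] for entry in entries)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def over_represented(entries: list, key: str, max_share: float = 0.5) -> list:
    """Values whose share exceeds max_share (0.5 is an illustrative ceiling)."""
    return sorted(v for v, s in shares(entries, key).items() if s > max_share)


def missing_values(entries: list, key: str, required: tuple) -> list:
    """Required values that no entry covers yet."""
    present = {entry[key] for entry in entries}
    return [value for value in required if value not in present]


entries = [{"difficulty": "trivial"}] * 3 + [{"difficulty": "hard"}]
print(over_represented(entries, "difficulty"))
print(missing_values(entries, "difficulty", REQUIRED_DIFFICULTIES))
```

The same helpers apply to content types or domains by changing `key`.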
Duplicate Prevention Checklist
Before adding:
- Check URL against existing `source_url_map.json`
- Run semantic similarity against existing document embeddings
- Warn if >80% similar to existing document
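The checklist can be sketched as a single pre-insert check. Several assumptions apply: `source_url_map.json` is treated as a JSON object mapping URL to document ID (its real schema is not shown here), similarity is taken to be cosine similarity, and embeddings are compared in memory where a production setup would more likely query a vector store. All function names are hypothetical.

```python
import json
import math
from pathlib import Path

SIMILARITY_WARN = 0.80  # "Warn if >80% similar"


def load_url_map(path: str) -> dict:
    """Load the URL map; assumed to be a JSON object of URL -> document ID."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def cosine_similarity(a: list, b: list) -> float:
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def duplicate_warnings(url: str, embedding: list, url_map: dict, existing: dict) -> list:
    """Return warnings for an exact URL match or a near-duplicate embedding."""
    warnings = []
    if url in url_map:
        warnings.append(f"URL already curated as {url_map[url]}")
    for doc_id, other in existing.items():
        similarity = cosine_similarity(embedding, other)
        if similarity > SIMILARITY_WARN:
            warnings.append(f"{similarity:.0%} similar to {doc_id}")
    return warnings


url_map = {"https://example.com/a": "doc-1"}
existing = {"doc-1": [1.0, 0.0], "doc-2": [0.0, 1.0]}
print(duplicate_warnings("https://example.com/a", [1.0, 0.0], url_map, existing))
```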
Provenance Tracking
Always record:
- Source URL (canonical)
- Curation date
- Agent scores (for audit trail)
- Langfuse trace ID
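A small record type keeps these four fields together and makes them easy to serialize. The class name, field names, and JSON layout below are hypothetical; the skill does not prescribe a storage format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class ProvenanceRecord:  # hypothetical shape covering the four required fields
    source_url: str          # canonical source URL
    curation_date: date
    agent_scores: dict       # kept for the audit trail
    langfuse_trace_id: str

    def to_json(self) -> str:
        """Serialize with an ISO 8601 date so the record round-trips through JSON."""
        payload = asdict(self)
        payload["curation_date"] = self.curation_date.isoformat()
        return json.dumps(payload, sort_keys=True)


record = ProvenanceRecord(
    source_url="https://example.com/post",
    curation_date=date(2025, 12, 1),
    agent_scores={"accuracy": 0.85, "relevance": 0.92},
    langfuse_trace_id="trace-123",  # placeholder ID for illustration
)
print(record.to_json())
```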
Langfuse Integration
Trace Structure
```python
from langfuse import Langfuse

# Assumes a Langfuse Python SDK version that exposes trace(), with
# credentials supplied via the LANGFUSE_* environment variables.
langfuse = Langfuse()

# url and doc_id identify the document being curated
trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id}
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})
```
Managed Prompts
| Prompt Name | Purpose |
|---|---|
| | Classify content_type |
| | Assign difficulty |
| | Extract tags |
| | Generate test queries |
References
For detailed implementation patterns, see:
- `references/selection-criteria.md` - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
- `references/annotation-patterns.md` - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration
Related Skills
- `golden-dataset-management` - Backup/restore operations
- `golden-dataset-validation` - Validation rules and checks
- `langfuse-observability` - Tracing patterns
- `pgvector-search` - Duplicate detection
Version: 1.0.0 (December 2025)
Issue: #599
Capability Details
content-classification
Keywords: content type, classification, document type, golden dataset
Solves:
- Classify document content types for golden dataset
- Categorize entries by domain and purpose
- Identify content requiring special handling
difficulty-stratification
Keywords: difficulty, stratification, complexity level, challenge rating
Solves:
- Assign difficulty levels to golden dataset entries
- Ensure balanced difficulty distribution
- Identify edge cases and challenging examples
quality-evaluation
Keywords: quality, evaluation, quality dimensions, quality criteria
Solves:
- Evaluate entry quality against defined criteria
- Score entries on multiple quality dimensions
- Identify entries needing improvement
multi-agent-analysis
Keywords: multi-agent, parallel analysis, consensus, agent evaluation
Solves:
- Run parallel agent evaluations on entries
- Aggregate consensus from multiple analysts
- Resolve disagreements in classifications
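As a rough illustration of parallel evaluation with consensus, the sketch below runs several analyst coroutines concurrently and takes a majority vote over their labels. The stub analysts, the majority-vote rule, and the tie handling are all assumptions; the actual aggregation logic is described in references/annotation-patterns.md and may differ.

```python
import asyncio
from collections import Counter


async def run_analysts(analysts: list, content: str) -> list:
    """Run every analyst coroutine concurrently and collect their labels in order."""
    return await asyncio.gather(*(analyst(content) for analyst in analysts))


def consensus(labels: list) -> dict:
    """Majority vote; a tie for first place yields no label and flags the entry for review."""
    if not labels:
        raise ValueError("at least one label is required")
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    return {
        "label": None if tied else top_label,
        "agreement": top_count / len(labels),
        "needs_review": tied,
    }


# Stub analysts standing in for real model-backed agents (hypothetical).
async def analyst_a(content: str) -> str:
    return "hard"


async def analyst_b(content: str) -> str:
    return "hard"


async def analyst_c(content: str) -> str:
    return "medium"


labels = asyncio.run(run_analysts([analyst_a, analyst_b, analyst_c], "document text"))
print(consensus(labels))
```

The `agreement` fraction could feed the confidence value used by the decision thresholds, though that mapping is also an assumption.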