Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.
```shell
npx skill4agent add yonatangross/orchestkit golden-dataset-curation
```

Content types and their quality focus:

| Type | Description | Quality Focus |
|---|---|---|
| article | Technical articles, blog posts | Depth, accuracy, actionability |
| tutorial | Step-by-step guides | Completeness, clarity, code quality |
| paper | Academic papers, whitepapers | Rigor, citations, methodology |
| reference | API docs, reference materials | Accuracy, completeness, examples |
| transcript | Transcribed video content | Structure, coherence, key points |
| repository | README, code analysis | Code quality, documentation |
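The fetch stage must map each source onto one of these types before the analysis agents run. A minimal heuristic sketch; the `detect_content_type` helper and its URL/body rules are illustrative assumptions, not part of the skill:

```python
def detect_content_type(url: str, text: str) -> str:
    """Rough content-type heuristic from URL and body cues (illustrative only)."""
    url = url.lower()
    head = text.lower()[:500]
    if url.endswith("readme.md") or "github.com" in url:
        return "repository"
    if "arxiv.org" in url or "abstract" in head:
        return "paper"
    if "tutorial" in url or "step 1" in head:
        return "tutorial"
    if "/docs/" in url or "/reference/" in url:
        return "reference"
    if "transcript" in head:
        return "transcript"
    return "article"  # default bucket for prose content
```

In practice this heuristic would only seed the classification; the managed `content_type` prompt makes the final call.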
Difficulty levels for generated test queries:

| Level | Semantic Complexity | Expected Score | Characteristics |
|---|---|---|---|
| trivial | Direct keyword match | >0.85 | Technical terms, exact phrases |
| easy | Common synonyms | >0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
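These expected-score floors can be checked mechanically when validating retrieval results. A sketch, assuming scores are similarities in [0, 1]; the helper name is illustrative:

```python
# Minimum expected retrieval score per difficulty level (from the table above).
EXPECTED_SCORE = {"trivial": 0.85, "easy": 0.70, "medium": 0.55, "hard": 0.40}

def meets_expectation(difficulty: str, score: float) -> bool:
    """True if a retrieval score clears the floor for its difficulty level.

    Adversarial queries have no numeric floor; they are judged on graceful
    degradation instead, so any score passes this numeric check.
    """
    if difficulty == "adversarial":
        return True
    return score > EXPECTED_SCORE[difficulty]  # floors are strict (>)
```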
Quality scoring rubric (dimension weights sum to 1.0):

| Dimension | Weight | Perfect | Acceptable | Failing |
|---|---|---|---|---|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |
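The per-dimension scores combine into a single quality score via the weights above. A minimal sketch; the function name is illustrative:

```python
# Dimension weights from the rubric above; they sum to 1.0.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

def quality_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimension scores (each in [0, 1])."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# With the dimension scores logged in the Langfuse snippet:
quality_score({"accuracy": 0.85, "coherence": 0.90,
               "depth": 0.78, "relevance": 0.92})  # ≈ 0.86
```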
Curation workflow:

```
INPUT: URL/Content
        |
        v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|           PARALLEL ANALYSIS AGENTS            |
|   Quality | Difficulty | Domain | Query Gen   |
+-----------------------------------------------+
         |
         v
+------------------+
|    CONSENSUS     |  Weighted score + confidence
|    AGGREGATOR    |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
```

| Quality Score | Confidence | Decision |
|---|---|---|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
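A sketch of the decision logic, assuming the rows are evaluated top to bottom with first match winning (that ordering is implied by the table, not stated explicitly):

```python
def curation_decision(quality: float, confidence: float) -> str:
    """Map consensus quality and confidence to include, review, or exclude.

    Rules are applied top-down, first match wins, mirroring the table above.
    """
    if quality >= 0.75 and confidence >= 0.70:
        return "include"
    if quality >= 0.55:
        return "review"  # also catches high quality with low confidence
    return "exclude"
```

Note that a high-quality document with low consensus confidence falls through to `review` rather than `include`.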
Recommended configuration:

```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2      # at least 2 domain tags
required_queries: 3   # at least 3 test queries
```

Source URLs are mapped to document IDs in `source_url_map.json`.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

# url and doc_id come from the fetch stage
trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id},
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})
```

Managed prompts cover four tasks:

- Classify content_type
- Assign difficulty
- Extract tags
- Generate test queries
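Taken together, the recommended thresholds can gate an entry before it is written to the dataset. A sketch, assuming a curated-entry dict with `quality_score`, `confidence`, `tags`, and `queries` fields; the field names and helper are illustrative:

```python
# Mirrors the recommended thresholds from the YAML config above.
THRESHOLDS = {
    "minimum_quality_score": 0.70,
    "minimum_confidence": 0.65,
    "required_tags": 2,
    "required_queries": 3,
}

def passes_thresholds(entry: dict) -> bool:
    """True if a curated entry clears every recommended inclusion threshold."""
    return (
        entry["quality_score"] >= THRESHOLDS["minimum_quality_score"]
        and entry["confidence"] >= THRESHOLDS["minimum_confidence"]
        and len(entry["tags"]) >= THRESHOLDS["required_tags"]
        and len(entry["queries"]) >= THRESHOLDS["required_queries"]
    )
```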
Reference files:

- references/selection-criteria.md
- references/annotation-patterns.md

Related skills:

- golden-dataset-management
- golden-dataset-validation
- langfuse-observability
- pgvector-search