dataset-curator
Dataset Curator
This skill ensures that the data you feed to your AI is clean, accurate, and safe.
Capabilities
1. Data Cleaning & Structuring
- Removes duplicates, boilerplate, and noisy text from knowledge bases.
- Converts unstructured documents into clean Markdown or JSON/Vector-friendly formats.
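The cleaning and structuring steps above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the boilerplate patterns, the function names (`strip_boilerplate`, `curate`, `to_jsonl`), and the choice of JSON Lines as the output format are all assumptions made for the example.

```python
import hashlib
import json
import re

# Hypothetical boilerplate patterns; a real knowledge base needs its own list.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*copyright\b.*$", re.IGNORECASE),
    re.compile(r"^\s*all rights reserved\.?\s*$", re.IGNORECASE),
    re.compile(r"^\s*page \d+( of \d+)?\s*$", re.IGNORECASE),
]


def strip_boilerplate(text):
    """Drop lines that match any known boilerplate pattern."""
    kept = [
        line
        for line in text.splitlines()
        if not any(p.match(line) for p in BOILERPLATE_PATTERNS)
    ]
    return "\n".join(kept).strip()


def normalize(text):
    """Collapse whitespace and case so near-identical texts hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def curate(documents):
    """Take (doc_id, raw_text) pairs; return cleaned, de-duplicated records."""
    seen = set()
    records = []
    for doc_id, raw in documents:
        cleaned = strip_boilerplate(raw)
        if not cleaned:
            continue  # nothing left once boilerplate is removed
        digest = hashlib.sha256(normalize(cleaned).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier document
        seen.add(digest)
        records.append({"id": doc_id, "text": cleaned})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Hashing a normalized copy of the text catches only exact duplicates after whitespace and case folding; near-duplicate detection (for example, by shingling or embeddings) would need a different technique.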
2. Privacy Audit
- Scans datasets for PII (Personally Identifiable Information) before they are sent to LLMs or vector databases.
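As a rough illustration of what such a scan might look like, the sketch below checks each record against a few regular expressions. The pattern set, the US-centric phone and SSN formats, and the function names (`scan_for_pii`, `redact`) are assumptions for the example; the skill's real detection logic is not described in this document, and regex matching alone will miss many kinds of PII.

```python
import re

# Illustrative patterns only; real audits need broader, locale-aware coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def scan_for_pii(records):
    """Return one finding per match: record index, PII type, matched text."""
    findings = []
    for index, text in enumerate(records):
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append(
                    {"record": index, "type": label, "match": match.group()}
                )
    return findings


def redact(text):
    """Replace every match with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Running `scan_for_pii` before any upload gives a report to review, while `redact` offers one possible remediation for records that must still be sent.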
Usage
- "Clean up the knowledge/ directory and structure it for better RAG performance."
- "Audit this customer feedback dataset for sensitive info before we use it for AI training."
Knowledge Protocol
- This skill adheres to the protocol in knowledge/orchestration/knowledge-protocol.md. It automatically integrates Public, Confidential (Company/Client), and Personal knowledge tiers, prioritizing the most specific secrets while ensuring that none of them leak into public outputs.
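The contents of the protocol file are not shown here, so the following is only one possible reading of the tiering rule. It assumes that Personal is the most specific tier, followed by Confidential, then Public; that each tier can be modeled as a plain dictionary; and that "no leaks" means public outputs draw from the Public tier alone. The names `TIER_ORDER`, `resolve`, and `render_public` are hypothetical.

```python
# Tier names come from the text; ordering and data model are assumptions.
TIER_ORDER = ["personal", "confidential", "public"]  # most specific first


def resolve(key, tiers):
    """Return (tier, value) from the most specific tier that defines key."""
    for tier in TIER_ORDER:
        if key in tiers.get(tier, {}):
            return tier, tiers[tier][key]
    return None


def render_public(keys, tiers):
    """Build a public output using only values stored in the public tier."""
    public = tiers.get("public", {})
    return {key: public[key] for key in keys if key in public}
```

Under these assumptions, internal work sees the most specific value for a key, while anything rendered for public consumption falls back to the Public value or omits the key entirely.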