dataset-curator

Dataset Curator

This skill ensures that the data you feed to your AI is clean, accurate, and safe.

Capabilities

1. Data Cleaning & Structuring

  • Removes duplicates, boilerplate, and noisy text from knowledge bases.
  • Converts unstructured documents into clean Markdown or JSON/Vector-friendly formats.
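
The cleaning step described above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the boilerplate patterns and the function names `strip_boilerplate` and `deduplicate` are assumptions made for the example, and exact-match hashing is only one of several possible deduplication strategies.

```python
import hashlib
import re

# Hypothetical boilerplate patterns; a real knowledge base would need its own list.
BOILERPLATE_PATTERNS = [
    re.compile(r"^all rights reserved\b", re.IGNORECASE),
    re.compile(r"^page \d+ of \d+$", re.IGNORECASE),
]


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_boilerplate(document: str) -> str:
    """Drop empty lines and lines that match a known boilerplate pattern."""
    kept = []
    for line in document.splitlines():
        candidate = normalize(line)
        if candidate and not any(p.search(candidate) for p in BOILERPLATE_PATTERNS):
            kept.append(candidate)
    return "\n".join(kept)


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first copy of each document, comparing case-insensitively after cleaning."""
    seen = set()
    unique = []
    for doc in documents:
        cleaned = strip_boilerplate(doc)
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique
```

Hashing the normalized text only catches exact duplicates; near-duplicates (reworded or partially overlapping documents) would need a fuzzier comparison.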

2. Privacy Audit

  • Scans datasets for PII (Personally Identifiable Information) before they are sent to LLMs or vector databases.
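
A privacy audit of this kind might look something like the sketch below. It assumes a simple regex-based scan; the two patterns, the `PII_PATTERNS` table, and the functions `scan_for_pii` and `redact` are illustrative names rather than the skill's real interface, and regexes alone will miss many kinds of PII (names, addresses, free-form identifiers).

```python
import re

# Illustrative patterns only; real PII detection needs broader, locale-aware rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_for_pii(records):
    """Return (record_index, pii_type, matched_text) for every hit."""
    findings = []
    for index, record in enumerate(records):
        for pii_type, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(record):
                findings.append((index, pii_type, match.group()))
    return findings


def redact(record):
    """Replace each match with a typed placeholder such as [EMAIL]."""
    for pii_type, pattern in PII_PATTERNS.items():
        record = pattern.sub(f"[{pii_type.upper()}]", record)
    return record
```

Running the scan first and redacting second keeps a record of what was found, which is useful when the audit has to be reviewed before the data goes to an LLM or vector database.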

Usage

  • "Clean up the knowledge/ directory and structure it for better RAG performance."
  • "Audit this customer feedback dataset for sensitive info before we use it for AI training."

Knowledge Protocol

  • This skill adheres to knowledge/orchestration/knowledge-protocol.md. It automatically integrates the Public, Confidential (Company/Client), and Personal knowledge tiers, prioritizing the most specific secrets while ensuring they never leak into public outputs.
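
One way to picture the tier behavior is the sketch below. The contents of knowledge-protocol.md are not shown here, so everything in it is an assumption: the `Tier` ordering (Personal as most specific, Public as least), the `resolve` function, and the rule that an output may only see tiers at or below its own clearance.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Higher value = more specific and more sensitive (assumed ordering).
    PUBLIC = 0
    CONFIDENTIAL = 1
    PERSONAL = 2


def resolve(entries, key, output_tier):
    """Return the most specific value for key that the output is allowed to see.

    entries: list of (Tier, key, value) tuples.
    output_tier: the highest tier the destination is cleared for.
    """
    visible = [(tier, value) for tier, k, value in entries
               if k == key and tier <= output_tier]
    if not visible:
        return None
    # The most specific visible tier wins.
    return max(visible, key=lambda item: item[0])[1]
```

Filtering by clearance before picking the most specific entry means a public output falls back to the Public value instead of exposing a Confidential or Personal one.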