dataset-evaluation
Workflow Instruction
Follow the workflow shown below. Locate the dataset, check the file type, and resolve any issues with missing files or wrong file types. Determine the fine-tuning model and fine-tuning strategy. Run scripts/format_detector.py to evaluate whether the file is formatted correctly for the currently selected model and strategy. Summarize the results: is the dataset ready for fine-tuning?
Workflow
- Locate Dataset:
  - The full path may be a local file path or an S3 URI
  - Resolve the full path to the dataset file, make sure read permissions are available, and help the user if the file is not found
- Determine strategy and model:
  - File formatting depends on the currently selected fine-tuning strategy and fine-tuning base model.
  - If the strategy and model are already known from the conversation context (e.g., selected via the finetuning-setup skill), use them.
  - If not available in context, activate the finetuning-setup skill to determine them before proceeding.
- Check File Formatting: Run the tool format_detector.py to make sure the file conforms to formatting requirements.
  - Send the full path directly to the format_detector script as an argument
  - Do not send the model and strategy as arguments
  - Do not download data from S3
  - Do not make local copies of data
- Summarize Results: Tell the user if their data is ready
  - Examine the output of format_detector and compare to the known strategy and model
  - Important: training datasets and evaluation datasets have different format requirements.
    - Training datasets must match the fine-tuning strategy format (SFT, DPO, RLVR) per references/strategy_data_requirements.md
    - Evaluation datasets (for model evaluation) must match one of the SageMaker evaluation dataset formats.
  - Report back to the user if their current dataset is valid for its intended purpose
  - Warn the user if their dataset is valid, but for a different strategy or model
  - Warn the user if their dataset is not valid for any strategy/model pair
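
The first and last steps of this workflow each come down to a small decision: whether the dataset path is local or an S3 URI, and which of the three outcomes to report. The Python sketch below illustrates both under stated assumptions. The helper names (`classify_dataset_path`, `local_file_is_readable`, `summarize_verdict`) are hypothetical and not part of the skill. The output format of format_detector.py is not described here, so the sketch assumes its result has already been reduced to a set of strategy names the dataset is valid for, and it ignores the base model for brevity.

```python
import os
from urllib.parse import urlparse


def classify_dataset_path(path: str) -> str:
    """Return "s3" for an S3 URI, otherwise "local" (hypothetical helper)."""
    parsed = urlparse(path)
    if parsed.scheme == "s3" and parsed.netloc:
        return "s3"
    return "local"


def local_file_is_readable(path: str) -> bool:
    """Check that a local dataset file exists and is readable."""
    full_path = os.path.abspath(os.path.expanduser(path))
    return os.path.isfile(full_path) and os.access(full_path, os.R_OK)


def summarize_verdict(selected_strategy: str, valid_for: set) -> str:
    """Map detector findings to one of the three outcomes in the workflow.

    `valid_for` is an ASSUMED representation: the set of strategy names
    (for example "SFT", "DPO", "RLVR") whose format the dataset satisfies.
    """
    if selected_strategy in valid_for:
        return "ready"
    if valid_for:
        # Valid, but for a different strategy than the one selected.
        return "valid for a different strategy: " + ", ".join(sorted(valid_for))
    return "not valid for any known strategy"
```

An S3 URI is only classified here, never downloaded, which keeps the sketch consistent with the rule against making local copies of data.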
Messages to the User
- Introduction: "This skill checks the structure of your dataset for model fine-tuning."
- File types: This skill applies to files that are formatted according to the Amazon SageMaker AI Developer Guide
Resources
- scripts/format_detector.py is a self-contained format validation script that can be run independently
- finetuning-setup skill should have already determined the fine-tuning strategy and base model
- references/strategy_data_requirements.md contains data format requirements per strategy
Script Details
- scripts/format_detector.py is a self-contained format validation script that can be run independently:
```bash
# With the file path argument identified in workflow step 1
python scripts/format_detector.py local_path/to/dataset
```
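
When the script is invoked from Python rather than a shell, the same command could be wrapped roughly as follows. This is a sketch, not part of the skill: `build_detector_command` and `run_format_detector` are hypothetical names, and because the script's output format and exit codes are not documented here, the wrapper returns whatever the script prints without interpreting it.

```python
import subprocess
import sys


def build_detector_command(dataset_path: str) -> list:
    """Build the argument list; the dataset path is the only argument.

    The model and strategy are deliberately not passed, per the workflow.
    """
    return [sys.executable, "scripts/format_detector.py", dataset_path]


def run_format_detector(dataset_path: str) -> subprocess.CompletedProcess:
    """Run the script and capture its output for later inspection."""
    return subprocess.run(
        build_detector_command(dataset_path),
        capture_output=True,
        text=True,
        check=False,  # exit-code semantics are unknown, so do not raise
    )
```

A caller would read `result.stdout` and `result.stderr` from the returned object and compare the findings against the known strategy and model, as the Summarize Results step describes.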