# ML Training Validation Scripts
Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
- Validating training datasets before fine-tuning
- Checking model checkpoints and outputs
- Testing end-to-end training pipelines
- Verifying system dependencies and GPU availability
- Debugging training failures or data issues
- Ensuring data format compliance
- Validating model configurations
- Testing inference pipelines
Key Resources:
- scripts/validate-data.sh - Comprehensive dataset validation
- scripts/validate-model.sh - Model checkpoint and config validation
- scripts/test-pipeline.sh - End-to-end pipeline testing
- scripts/check-dependencies.sh - System and package dependency checking
- templates/test-config.yaml - Pipeline test configuration template
- templates/validation-schema.json - Data validation schema template
- examples/data-validation-example.md - Dataset validation workflow
- examples/pipeline-testing-example.md - Complete pipeline testing guide
## Quick Start
### Validate Training Data
```bash
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json
```

### Validate Model Checkpoint
```bash
bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights
```

### Test Complete Pipeline
```bash
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose
```

### Check System Dependencies
```bash
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16
```

## Validation Scripts
### 1. Data Validation (validate-data.sh)
Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:

```bash
bash scripts/validate-data.sh [OPTIONS]
```

Options:
- --data-path PATH - Path to dataset file or directory (required)
- --format FORMAT - Data format: jsonl, csv, parquet, arrow (default: jsonl)
- --schema PATH - Path to validation schema JSON file
- --sample-size N - Number of samples to validate (default: all)
- --check-duplicates - Check for duplicate entries
- --check-null - Check for null/missing values
- --check-length - Validate text length constraints
- --check-tokens - Validate tokenization compatibility
- --tokenizer MODEL - Tokenizer to use for token validation (default: gpt2)
- --max-length N - Maximum sequence length (default: 2048)
- --output REPORT - Output validation report path (default: validation-report.json)
Validation Checks:
- Format Compliance: Verify file format and structure
- Schema Validation: Check against JSON schema if provided
- Data Types: Validate field types (string, int, float, etc.)
- Required Fields: Ensure all required fields are present
- Value Ranges: Check numeric values within expected ranges
- Text Quality: Detect empty strings, excessive whitespace, encoding issues
- Duplicates: Identify duplicate entries (exact or near-duplicate)
- Token Counts: Verify sequences fit within model context length
- Label Distribution: Check class balance for classification tasks
- Missing Values: Detect and report null/missing values
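The heavier checks above run inside validate-data.sh, but plain shell tools can give a quick smoke test first. A minimal sketch, assuming a JSONL file with a required `text` field (the sample records and the grep-based check are illustrative only; the real --check-null and --check-duplicates options parse JSON properly):

```shell
# Rough JSONL pre-check with standard tools (illustrative only).
data=$(mktemp)
cat > "$data" <<'EOF'
{"text": "great product", "label": "positive"}
{"text": "great product", "label": "positive"}
{"label": "negative"}
EOF

# Records that never mention a "text" key (crude missing-field check).
missing=$(grep -cv '"text"' "$data")
# Exact duplicate lines (crude duplicate check).
dups=$(sort "$data" | uniq -d | wc -l | tr -d ' ')
echo "missing=$missing duplicates=$dups"
```

This catches only literal duplicates and absent keys; near-duplicates and null values still need the full validator.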
Example Output:

```json
{
  "status": "PASS",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}
```

Exit Codes:
- 0 - All validations passed
- 1 - Validation errors found
- 2 - Script error (invalid arguments, file not found)
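These exit codes make the validator easy to gate on in a launcher script. A sketch of that pattern, where `validate_data` is a stand-in function for the real scripts/validate-data.sh invocation:

```shell
# Branch on the documented exit codes before launching training.
validate_data() { return 1; }   # stand-in; simulates "validation errors found"

rc=0
validate_data || rc=$?
case $rc in
  0) outcome="proceed to training" ;;
  1) outcome="fix data errors first" ;;
  2) outcome="check script arguments or paths" ;;
  *) outcome="unexpected exit code: $rc" ;;
esac
echo "$outcome"
```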
### 2. Model Validation (validate-model.sh)
Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:

```bash
bash scripts/validate-model.sh [OPTIONS]
```

Options:
- --model-path PATH - Path to model checkpoint directory (required)
- --framework FRAMEWORK - Framework: pytorch, tensorflow, jax (default: pytorch)
- --check-weights - Verify model weights are loadable
- --check-config - Validate model configuration file
- --check-tokenizer - Verify tokenizer files are present
- --check-inference - Test inference with sample input
- --sample-input TEXT - Sample text for inference test
- --expected-output TEXT - Expected output for verification
- --output REPORT - Output validation report path
Validation Checks:
- File Structure: Verify required files are present
- Config Validation: Check model_config.json for correctness
- Weight Integrity: Load and verify model weights
- Tokenizer Files: Ensure tokenizer files exist and are loadable
- Model Architecture: Validate architecture matches config
- Memory Requirements: Estimate GPU/CPU memory needed
- Inference Test: Run sample inference if requested
- Quantization: Verify quantized models load correctly
- LoRA/PEFT: Validate adapter weights and configuration
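The file-structure check is the simplest of these and can be sketched in a few lines of shell. The filenames below are typical of a Hugging Face-style PyTorch checkpoint but are assumptions here; real checkpoints may ship sharded or safetensors weight files instead:

```shell
# Minimal file-structure check for a hypothetical checkpoint layout.
ckpt=$(mktemp -d)
touch "$ckpt/config.json" "$ckpt/tokenizer_config.json"   # weights deliberately absent

structure=PASS
for f in config.json pytorch_model.bin tokenizer_config.json; do
  if [ ! -e "$ckpt/$f" ]; then
    echo "missing: $f"
    structure=FAIL
  fi
done
echo "file_structure: $structure"
```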
Example Output:

```json
{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}
```

### 3. Pipeline Testing (test-pipeline.sh)
Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:

```bash
bash scripts/test-pipeline.sh [OPTIONS]
```

Options:
- --config PATH - Pipeline test configuration file (required)
- --data PATH - Sample data for testing
- --steps STEPS - Comma-separated steps to test (default: all)
- --verbose - Enable detailed output
- --output REPORT - Output test report path
- --fail-fast - Stop on first failure
- --cleanup - Clean up temporary files after test
Pipeline Steps:
- data_loading: Test data loading and preprocessing
- tokenization: Verify tokenization pipeline
- model_loading: Load model and verify initialization
- training_step: Run single training step
- validation_step: Run single validation step
- checkpoint_save: Test checkpoint saving
- checkpoint_load: Test checkpoint loading
- inference: Test inference pipeline
- metrics: Verify metrics calculation
Test Configuration (test-config.yaml):

```yaml
pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
```

Example Output:

```
Pipeline Test Report
====================
Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (9/9 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
```
### 4. Dependency Checking (check-dependencies.sh)
Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:

```bash
bash scripts/check-dependencies.sh [OPTIONS]
```

Options:
- --framework FRAMEWORK - ML framework: pytorch, tensorflow, jax (default: pytorch)
- --gpu-required - Require GPU availability
- --min-vram GB - Minimum GPU VRAM required (GB)
- --cuda-version VERSION - Required CUDA version
- --packages FILE - Path to requirements.txt or packages list
- --platform PLATFORM - Platform: modal, lambda, runpod, local (default: local)
- --fix - Attempt to install missing packages
- --output REPORT - Output dependency report path
Dependency Checks:
- Python Version: Verify compatible Python version
- ML Framework: Check PyTorch/TensorFlow/JAX installation
- CUDA/cuDNN: Verify CUDA toolkit and cuDNN
- GPU Availability: Detect and validate GPUs
- VRAM: Check GPU memory capacity
- System Packages: Verify system-level dependencies
- Python Packages: Check required pip packages
- Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
- Storage: Check available disk space
- Network: Test internet connectivity for model downloads
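The Python-version check is representative of how these checks work. A standalone sketch (the >=3.9 minimum matches the example report's "required" field; the comparison logic is an illustration, not the script's actual implementation):

```shell
# Compare the installed Python version against a 3.9 minimum.
ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
major=${ver%%.*}
minor=${ver#*.}
if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }; then
  python_ok=PASS
else
  python_ok=FAIL
fi
echo "python $ver: $python_ok"
```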
Example Output:

```json
{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": {
      "status": "PASS",
      "version": "3.10.12",
      "required": ">=3.9"
    },
    "pytorch": {
      "status": "PASS",
      "version": "2.1.0",
      "cuda_available": true,
      "cuda_version": "12.1"
    },
    "gpu": {
      "status": "PASS",
      "count": 1,
      "type": "NVIDIA A100",
      "vram_gb": 40,
      "driver_version": "535.129.03"
    },
    "packages": {
      "status": "PASS",
      "installed": 42,
      "missing": 0,
      "outdated": 3
    },
    "storage": {
      "status": "PASS",
      "available_gb": 128,
      "required_gb": 50
    }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}
```

## Templates
### Validation Schema Template (templates/validation-schema.json)
Defines expected data structure and validation rules for training datasets.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string"
        },
        "timestamp": {
          "type": "string",
          "format": "date-time"
        }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}
```

Customization:
- Modify required fields for your use case
- Add custom properties and validation rules
- Set appropriate length constraints
- Define label enums for classification
- Configure tokenizer and max length
### Test Configuration Template (templates/test-config.yaml)
Defines pipeline testing parameters and expected behaviors.
```yaml
pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false
```
## Integration with ML Training Workflow
### Pre-Training Validation
```bash
# 1. Check system dependencies
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

# 2. Validate training data
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens

# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose
```

### Post-Training Validation
```bash
# 1. Validate model checkpoint
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference

# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics
```

### CI/CD Integration
```yaml
# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast
```

## Error Handling and Debugging
### Common Validation Failures
Data Validation Failures:
- Token length exceeded → Truncate or filter long sequences
- Missing required fields → Fix data preprocessing
- Duplicate entries → Deduplicate dataset
- Invalid labels → Clean label data
Model Validation Failures:
- Missing config files → Ensure complete checkpoint
- Weight loading errors → Check framework compatibility
- Tokenizer mismatch → Verify tokenizer version
Pipeline Test Failures:
- Out of memory → Reduce batch size, enable gradient checkpointing
- CUDA errors → Check GPU drivers, CUDA version
- Slow performance → Profile and optimize data loading
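For the out-of-memory case, a common fix is to shrink batch_size while raising gradient_accumulation so the effective batch size is unchanged. A sketch of the arithmetic (key names follow test-config.yaml; the target of 8 is an arbitrary example):

```shell
# Keep effective batch size constant while fitting in less VRAM.
effective_batch=8      # original batch_size * gradient_accumulation
batch_size=1           # reduced per-step batch that fits after the OOM
gradient_accumulation=$((effective_batch / batch_size))
echo "batch_size=$batch_size gradient_accumulation=$gradient_accumulation"
```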
### Debug Mode
All scripts support verbose debugging:
```bash
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
```

Outputs:
- Detailed progress logs
- Stack traces for errors
- Performance metrics
- Memory usage statistics
## Best Practices
- Always validate before training - Catch data issues early
- Use schema validation - Enforce data quality standards
- Test with sample data first - Verify pipeline before full training
- Check dependencies on new platforms - Avoid runtime failures
- Validate checkpoints - Ensure model integrity before deployment
- Monitor validation metrics - Track data quality over time
- Automate validation in CI/CD - Prevent bad data from entering pipeline
- Keep validation schemas updated - Reflect current data requirements
## Performance Tips
- Use --sample-size for quick validation of large datasets
- Run dependency checks once per environment and cache the results
- Parallelize validation with GNU parallel for multi-file datasets
- Use --fail-fast in CI/CD to save time
- Cache tokenizers to avoid re-downloading
## Troubleshooting
**Script Not Found:**

```bash
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
```

**Permission Denied:**

```bash
# Make scripts executable
chmod +x scripts/*.sh
```

**Missing Dependencies:**

```bash
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
```

---

**Plugin**: ml-training
**Version**: 1.0.0
**Supported Frameworks**: PyTorch, TensorFlow, JAX
**Platforms**: Modal, Lambda Labs, RunPod, Local