validation-scripts


ML Training Validation Scripts


Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
  • Validating training datasets before fine-tuning
  • Checking model checkpoints and outputs
  • Testing end-to-end training pipelines
  • Verifying system dependencies and GPU availability
  • Debugging training failures or data issues
  • Ensuring data format compliance
  • Validating model configurations
  • Testing inference pipelines
Key Resources:
  • scripts/validate-data.sh
    - Comprehensive dataset validation
  • scripts/validate-model.sh
    - Model checkpoint and config validation
  • scripts/test-pipeline.sh
    - End-to-end pipeline testing
  • scripts/check-dependencies.sh
    - System and package dependency checking
  • templates/test-config.yaml
    - Pipeline test configuration template
  • templates/validation-schema.json
    - Data validation schema template
  • examples/data-validation-example.md
    - Dataset validation workflow
  • examples/pipeline-testing-example.md
    - Complete pipeline testing guide

Quick Start


Validate Training Data


```bash
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json
```

Validate Model Checkpoint


```bash
bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights
```

Test Complete Pipeline


```bash
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose
```

Check System Dependencies


```bash
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16
```

Validation Scripts


1. Data Validation (validate-data.sh)


Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:
```bash
bash scripts/validate-data.sh [OPTIONS]
```
Options:
  • --data-path PATH
    - Path to dataset file or directory (required)
  • --format FORMAT
    - Data format: jsonl, csv, parquet, arrow (default: jsonl)
  • --schema PATH
    - Path to validation schema JSON file
  • --sample-size N
    - Number of samples to validate (default: all)
  • --check-duplicates
    - Check for duplicate entries
  • --check-null
    - Check for null/missing values
  • --check-length
    - Validate text length constraints
  • --check-tokens
    - Validate tokenization compatibility
  • --tokenizer MODEL
    - Tokenizer to use for token validation (default: gpt2)
  • --max-length N
    - Maximum sequence length (default: 2048)
  • --output REPORT
    - Output validation report path (default: validation-report.json)
Validation Checks:
  • Format Compliance: Verify file format and structure
  • Schema Validation: Check against JSON schema if provided
  • Data Types: Validate field types (string, int, float, etc.)
  • Required Fields: Ensure all required fields are present
  • Value Ranges: Check numeric values within expected ranges
  • Text Quality: Detect empty strings, excessive whitespace, encoding issues
  • Duplicates: Identify duplicate entries (exact or near-duplicate)
  • Token Counts: Verify sequences fit within model context length
  • Label Distribution: Check class balance for classification tasks
  • Missing Values: Detect and report null/missing values
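A few of these checks are easy to approximate with standard Unix tools. As a sketch only (the script's own duplicate detection is more thorough, and near-duplicates need fuzzier matching), exact duplicate rows in a JSONL file show up as repeated lines after sorting:

```shell
# Write a tiny illustrative dataset containing one exact duplicate.
cat > /tmp/dup-demo.jsonl <<'EOF'
{"text": "the same example", "label": "positive"}
{"text": "a unique example", "label": "negative"}
{"text": "the same example", "label": "positive"}
EOF

# uniq -d prints each duplicated line once; count them.
dups=$(sort /tmp/dup-demo.jsonl | uniq -d | wc -l)
echo "distinct duplicated lines: $dups"
```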
Example Output:
```json
{
  "status": "PASS",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}
```
Exit Codes:
  • 0
    - All validations passed
  • 1
    - Validation errors found
  • 2
    - Script error (invalid arguments, file not found)
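These exit codes make the validator easy to gate on in automation. A minimal wrapper sketch (a stub stands in for the real script here so the snippet is self-contained; it simulates exit code 1):

```shell
# Stub in place of `bash scripts/validate-data.sh ...`; pretend
# validation errors were found (exit code 1).
validate_data() { return 1; }

validate_data
rc=$?
case $rc in
  0) msg="data OK - proceed to training" ;;
  1) msg="validation errors - see validation-report.json" ;;
  2) msg="script error - check arguments and file paths" ;;
esac
echo "$msg"
```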

2. Model Validation (validate-model.sh)


Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:
```bash
bash scripts/validate-model.sh [OPTIONS]
```
Options:
  • --model-path PATH
    - Path to model checkpoint directory (required)
  • --framework FRAMEWORK
    - Framework: pytorch, tensorflow, jax (default: pytorch)
  • --check-weights
    - Verify model weights are loadable
  • --check-config
    - Validate model configuration file
  • --check-tokenizer
    - Verify tokenizer files are present
  • --check-inference
    - Test inference with sample input
  • --sample-input TEXT
    - Sample text for inference test
  • --expected-output TEXT
    - Expected output for verification
  • --output REPORT
    - Output validation report path
Validation Checks:
  • File Structure: Verify required files are present
  • Config Validation: Check model_config.json for correctness
  • Weight Integrity: Load and verify model weights
  • Tokenizer Files: Ensure tokenizer files exist and are loadable
  • Model Architecture: Validate architecture matches config
  • Memory Requirements: Estimate GPU/CPU memory needed
  • Inference Test: Run sample inference if requested
  • Quantization: Verify quantized models load correctly
  • LoRA/PEFT: Validate adapter weights and configuration
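The file-structure check boils down to asserting that expected files exist in the checkpoint directory. A sketch of that idea (the filenames are typical Hugging Face conventions and are assumptions here, not necessarily the exact set validate-model.sh checks):

```shell
# Simulate an incomplete checkpoint directory: config and tokenizer
# present, weights missing.
ckpt=$(mktemp -d)
touch "$ckpt/config.json" "$ckpt/tokenizer.json"

status=PASS
for f in config.json tokenizer.json model.safetensors; do
  if [ ! -e "$ckpt/$f" ]; then
    echo "missing: $f"
    status=FAIL
  fi
done
echo "file_structure: $status"
```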
Example Output:
```json
{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}
```

3. Pipeline Testing (test-pipeline.sh)


Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:
```bash
bash scripts/test-pipeline.sh [OPTIONS]
```
Options:
  • --config PATH
    - Pipeline test configuration file (required)
  • --data PATH
    - Sample data for testing
  • --steps STEPS
    - Comma-separated steps to test (default: all)
  • --verbose
    - Enable detailed output
  • --output REPORT
    - Output test report path
  • --fail-fast
    - Stop on first failure
  • --cleanup
    - Clean up temporary files after test
Pipeline Steps:
  • data_loading: Test data loading and preprocessing
  • tokenization: Verify tokenization pipeline
  • model_loading: Load model and verify initialization
  • training_step: Run single training step
  • validation_step: Run single validation step
  • checkpoint_save: Test checkpoint saving
  • checkpoint_load: Test checkpoint loading
  • inference: Test inference pipeline
  • metrics: Verify metrics calculation
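The interaction between `--steps` and `--fail-fast` can be pictured as a loop over the comma-separated step list that stops on the first failure. An illustrative sketch, not the script's actual implementation (run_step is a stub that makes tokenization fail so the fail-fast path is visible):

```shell
STEPS="data_loading,tokenization,model_loading"

run_step() {
  # Stub: succeed for every step except tokenization.
  [ "$1" != "tokenization" ]
}

result=PASS
old_ifs=$IFS; IFS=','
for step in $STEPS; do
  if run_step "$step"; then
    echo "ok   $step"
  else
    echo "FAIL $step"
    result=FAIL
    break   # --fail-fast: stop at the first failing step
  fi
done
IFS=$old_ifs
echo "overall: $result"
```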
Test Configuration (test-config.yaml):
```yaml
pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
```
Example Output:
```
Pipeline Test Report
====================

Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (8/8 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
```

4. Dependency Checking (check-dependencies.sh)


Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:
```bash
bash scripts/check-dependencies.sh [OPTIONS]
```
Options:
  • --framework FRAMEWORK
    - ML framework: pytorch, tensorflow, jax (default: pytorch)
  • --gpu-required
    - Require GPU availability
  • --min-vram GB
    - Minimum GPU VRAM required (GB)
  • --cuda-version VERSION
    - Required CUDA version
  • --packages FILE
    - Path to requirements.txt or packages list
  • --platform PLATFORM
    - Platform: modal, lambda, runpod, local (default: local)
  • --fix
    - Attempt to install missing packages
  • --output REPORT
    - Output dependency report path
Dependency Checks:
  • Python Version: Verify compatible Python version
  • ML Framework: Check PyTorch/TensorFlow/JAX installation
  • CUDA/cuDNN: Verify CUDA toolkit and cuDNN
  • GPU Availability: Detect and validate GPUs
  • VRAM: Check GPU memory capacity
  • System Packages: Verify system-level dependencies
  • Python Packages: Check required pip packages
  • Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
  • Storage: Check available disk space
  • Network: Test internet connectivity for model downloads
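Two of these checks are simple enough to sketch inline. This self-contained version covers the Python-version and disk-space checks only (GPU checks need nvidia-smi and are omitted so the sketch runs anywhere; the thresholds are illustrative, not the script's defaults):

```shell
# Python version, major.minor only.
py_ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
echo "python: $py_ver"

# Available disk space in GB on the current filesystem
# (POSIX df -P reports 1K blocks in column 4).
avail_gb=$(df -P . | awk 'NR==2 { printf "%d", $4 / 1024 / 1024 }')
echo "storage_available_gb: $avail_gb"
```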
Example Output:
```json
{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": {
      "status": "PASS",
      "version": "3.10.12",
      "required": ">=3.9"
    },
    "pytorch": {
      "status": "PASS",
      "version": "2.1.0",
      "cuda_available": true,
      "cuda_version": "12.1"
    },
    "gpu": {
      "status": "PASS",
      "count": 1,
      "type": "NVIDIA A100",
      "vram_gb": 40,
      "driver_version": "535.129.03"
    },
    "packages": {
      "status": "PASS",
      "installed": 42,
      "missing": 0,
      "outdated": 3
    },
    "storage": {
      "status": "PASS",
      "available_gb": 128,
      "required_gb": 50
    }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}
```

Templates


Validation Schema Template (templates/validation-schema.json)


Defines expected data structure and validation rules for training datasets.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string"
        },
        "timestamp": {
          "type": "string",
          "format": "date-time"
        }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}
```
Customization:
  • Modify `required` fields for your use case
  • Add custom properties and validation rules
  • Set appropriate length constraints
  • Define label enums for classification
  • Configure tokenizer and max length
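To see what the schema's `required` list enforces, here is a hand-rolled sketch of just that one rule, using only the Python standard library (the real validator can apply the full schema via the jsonschema package; file path and sample rows are illustrative):

```shell
# Two sample rows: one valid, one missing its label.
cat > /tmp/schema-demo.jsonl <<'EOF'
{"text": "a long enough training example", "label": "positive"}
{"text": "this row is missing its label"}
EOF

out=$(python3 - <<'EOF'
import json

required = ["text", "label"]  # mirrors the schema's "required" list
for i, line in enumerate(open("/tmp/schema-demo.jsonl")):
    row = json.loads(line)
    missing = [f for f in required if f not in row]
    print(f"sample {i}: " + ("PASS" if not missing else "FAIL missing " + ",".join(missing)))
EOF
)
echo "$out"
```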

Test Configuration Template (templates/test-config.yaml)


Defines pipeline testing parameters and expected behaviors.
```yaml
pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false
```

Integration with ML Training Workflow


Pre-Training Validation


```bash
# 1. Check system dependencies
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

# 2. Validate training data
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens

# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose
```

Post-Training Validation


```bash
# 1. Validate model checkpoint
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference

# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics
```

CI/CD Integration


```yaml
# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast
```

Error Handling and Debugging


Common Validation Failures


Data Validation Failures:
  • Token length exceeded → Truncate or filter long sequences
  • Missing required fields → Fix data preprocessing
  • Duplicate entries → Deduplicate dataset
  • Invalid labels → Clean label data
Model Validation Failures:
  • Missing config files → Ensure complete checkpoint
  • Weight loading errors → Check framework compatibility
  • Tokenizer mismatch → Verify tokenizer version
Pipeline Test Failures:
  • Out of memory → Reduce batch size, enable gradient checkpointing
  • CUDA errors → Check GPU drivers, CUDA version
  • Slow performance → Profile and optimize data loading
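As a sketch of the first remedy ("truncate or filter long sequences"), the snippet below drops rows over a length budget, using character count as a rough stand-in for token count (the file, budget, and rows are illustrative; a real fix would count tokens with the training tokenizer):

```shell
cat > /tmp/filter-demo.jsonl <<'EOF'
{"text": "short example", "label": "positive"}
{"text": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", "label": "negative"}
EOF

kept=$(python3 - <<'EOF'
import json
MAX_CHARS = 40  # illustrative budget; a real check counts tokens
rows = [json.loads(line) for line in open("/tmp/filter-demo.jsonl")]
print(sum(len(r["text"]) <= MAX_CHARS for r in rows))
EOF
)
echo "kept $kept of 2 samples"
```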

Debug Mode


All scripts support verbose debugging:
```bash
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
```
Outputs:
  • Detailed progress logs
  • Stack traces for errors
  • Performance metrics
  • Memory usage statistics

Best Practices


  1. Always validate before training - Catch data issues early
  2. Use schema validation - Enforce data quality standards
  3. Test with sample data first - Verify pipeline before full training
  4. Check dependencies on new platforms - Avoid runtime failures
  5. Validate checkpoints - Ensure model integrity before deployment
  6. Monitor validation metrics - Track data quality over time
  7. Automate validation in CI/CD - Prevent bad data from entering pipeline
  8. Keep validation schemas updated - Reflect current data requirements

Performance Tips


  • Use `--sample-size` for quick validation of large datasets
  • Run dependency checks once per environment and cache the results
  • Parallelize validation with GNU parallel for multi-file datasets
  • Use `--fail-fast` in CI/CD to save time
  • Cache tokenizers to avoid re-downloading
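The parallelization tip can also be done with plain `xargs -P` where GNU parallel is not installed. A self-contained sketch (echo stands in for the real validate-data.sh invocation, and the shard files are illustrative):

```shell
# Create three empty shard files to "validate".
shards=$(mktemp -d)
touch "$shards/a.jsonl" "$shards/b.jsonl" "$shards/c.jsonl"

# Run up to 2 jobs at once; sort, since completion order varies between runs.
out=$(ls "$shards"/*.jsonl | xargs -P 2 -I{} echo "validated {}" | sort)
echo "$out"
```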

Troubleshooting


**Script Not Found:**
```bash
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
```

**Permission Denied:**
```bash
# Make scripts executable
chmod +x scripts/*.sh
```

**Missing Dependencies:**
```bash
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
```

---

**Plugin**: ml-training
**Version**: 1.0.0
**Supported Frameworks**: PyTorch, TensorFlow, JAX
**Platforms**: Modal, Lambda Labs, RunPod, Local