ML Training Validation Scripts

Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.

Activation Triggers:

Validating training datasets before fine-tuning
Checking model checkpoints and outputs
Testing end-to-end training pipelines
Verifying system dependencies and GPU availability
Debugging training failures or data issues
Ensuring data format compliance
Validating model configurations
Testing inference pipelines

Key Resources:

```
scripts/validate-data.sh
```
- Comprehensive dataset validation
```
scripts/validate-model.sh
```
- Model checkpoint and config validation
```
scripts/test-pipeline.sh
```
- End-to-end pipeline testing
```
scripts/check-dependencies.sh
```
- System and package dependency checking
```
templates/test-config.yaml
```
- Pipeline test configuration template
```
templates/validation-schema.json
```
- Data validation schema template
```
examples/data-validation-example.md
```
- Dataset validation workflow
```
examples/pipeline-testing-example.md
```
- Complete pipeline testing guide

Quick Start

Validate Training Data

bash

bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json

Validate Model Checkpoint

bash

bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights

Test Complete Pipeline

bash

bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose

Check System Dependencies

bash

bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

Validation Scripts

1. Data Validation (validate-data.sh)

Purpose: Validate training datasets for format compliance, data quality, and schema conformance.

Usage:

bash

bash scripts/validate-data.sh [OPTIONS]

Options:

```
--data-path PATH
```
- Path to dataset file or directory (required)
```
--format FORMAT
```
- Data format: jsonl, csv, parquet, arrow (default: jsonl)
```
--schema PATH
```
- Path to validation schema JSON file
```
--sample-size N
```
- Number of samples to validate (default: all)
```
--check-duplicates
```
- Check for duplicate entries
```
--check-null
```
- Check for null/missing values
```
--check-length
```
- Validate text length constraints
```
--check-tokens
```
- Validate tokenization compatibility
```
--tokenizer MODEL
```
- Tokenizer to use for token validation (default: gpt2)
```
--max-length N
```
- Maximum sequence length (default: 2048)
```
--output REPORT
```
- Output validation report path (default: validation-report.json)

Validation Checks:

Format Compliance: Verify file format and structure
Schema Validation: Check against JSON schema if provided
Data Types: Validate field types (string, int, float, etc.)
Required Fields: Ensure all required fields are present
Value Ranges: Check numeric values within expected ranges
Text Quality: Detect empty strings, excessive whitespace, encoding issues
Duplicates: Identify duplicate entries (exact or near-duplicate)
Token Counts: Verify sequences fit within model context length
Label Distribution: Check class balance for classification tasks
Missing Values: Detect and report null/missing values

Example Output:

json

{
  "status": "PASS",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}

Exit Codes:

```
0
```
- All validations passed
```
1
```
- Validation errors found
```
2
```
- Script error (invalid arguments, file not found)

2. Model Validation (validate-model.sh)

Purpose: Validate model checkpoints, configurations, and inference readiness.

Usage:

bash

bash scripts/validate-model.sh [OPTIONS]

Options:

```
--model-path PATH
```
- Path to model checkpoint directory (required)
```
--framework FRAMEWORK
```
- Framework: pytorch, tensorflow, jax (default: pytorch)
```
--check-weights
```
- Verify model weights are loadable
```
--check-config
```
- Validate model configuration file
```
--check-tokenizer
```
- Verify tokenizer files are present
```
--check-inference
```
- Test inference with sample input
```
--sample-input TEXT
```
- Sample text for inference test
```
--expected-output TEXT
```
- Expected output for verification
```
--output REPORT
```
- Output validation report path

Validation Checks:

File Structure: Verify required files are present
Config Validation: Check model_config.json for correctness
Weight Integrity: Load and verify model weights
Tokenizer Files: Ensure tokenizer files exist and are loadable
Model Architecture: Validate architecture matches config
Memory Requirements: Estimate GPU/CPU memory needed
Inference Test: Run sample inference if requested
Quantization: Verify quantized models load correctly
LoRA/PEFT: Validate adapter weights and configuration

Example Output:

json

{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}

3. Pipeline Testing (test-pipeline.sh)

Purpose: Test end-to-end ML training and inference pipelines with sample data.

Usage:

bash

bash scripts/test-pipeline.sh [OPTIONS]

Options:

```
--config PATH
```
- Pipeline test configuration file (required)
```
--data PATH
```
- Sample data for testing
```
--steps STEPS
```
- Comma-separated steps to test (default: all)
```
--verbose
```
- Enable detailed output
```
--output REPORT
```
- Output test report path
```
--fail-fast
```
- Stop on first failure
```
--cleanup
```
- Clean up temporary files after test

Pipeline Steps:

data_loading: Test data loading and preprocessing
tokenization: Verify tokenization pipeline
model_loading: Load model and verify initialization
training_step: Run single training step
validation_step: Run single validation step
checkpoint_save: Test checkpoint saving
checkpoint_load: Test checkpoint loading
inference: Test inference pipeline
metrics: Verify metrics calculation

Test Configuration (test-config.yaml):

yaml

pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]

Example Output:

Pipeline Test Report
====================

Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (8/8 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput

4. Dependency Checking (check-dependencies.sh)

Purpose: Verify all required system dependencies, packages, and GPU availability.

Usage:

bash

bash scripts/check-dependencies.sh [OPTIONS]

Options:

```
--framework FRAMEWORK
```
- ML framework: pytorch, tensorflow, jax (default: pytorch)
```
--gpu-required
```
- Require GPU availability
```
--min-vram GB
```
- Minimum GPU VRAM required (GB)
```
--cuda-version VERSION
```
- Required CUDA version
```
--packages FILE
```
- Path to requirements.txt or packages list
```
--platform PLATFORM
```
- Platform: modal, lambda, runpod, local (default: local)
```
--fix
```
- Attempt to install missing packages
```
--output REPORT
```
- Output dependency report path

Dependency Checks:

Python Version: Verify compatible Python version
ML Framework: Check PyTorch/TensorFlow/JAX installation
CUDA/cuDNN: Verify CUDA toolkit and cuDNN
GPU Availability: Detect and validate GPUs
VRAM: Check GPU memory capacity
System Packages: Verify system-level dependencies
Python Packages: Check required pip packages
Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
Storage: Check available disk space
Network: Test internet connectivity for model downloads

Example Output:

json

{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": {
      "status": "PASS",
      "version": "3.10.12",
      "required": ">=3.9"
    },
    "pytorch": {
      "status": "PASS",
      "version": "2.1.0",
      "cuda_available": true,
      "cuda_version": "12.1"
    },
    "gpu": {
      "status": "PASS",
      "count": 1,
      "type": "NVIDIA A100",
      "vram_gb": 40,
      "driver_version": "535.129.03"
    },
    "packages": {
      "status": "PASS",
      "installed": 42,
      "missing": 0,
      "outdated": 3
    },
    "storage": {
      "status": "PASS",
      "available_gb": 128,
      "required_gb": 50
    }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}

Templates

Validation Schema Template (templates/validation-schema.json)

Defines expected data structure and validation rules for training datasets.

json

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string"
        },
        "timestamp": {
          "type": "string",
          "format": "date-time"
        }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}

Customization:

Modify
```
required
```
fields for your use case
Add custom properties and validation rules
Set appropriate length constraints
Define label enums for classification
Configure tokenizer and max length

Test Configuration Template (templates/test-config.yaml)

Defines pipeline testing parameters and expected behaviors.

yaml

pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false

Integration with ML Training Workflow

Pre-Training Validation

bash

# 1. Check system dependencies
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

# 2. Validate training data
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens

# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose

Post-Training Validation

bash

# 1. Validate model checkpoint
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference

# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics

CI/CD Integration

bash

# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast

Error Handling and Debugging

Common Validation Failures

Data Validation Failures:

Token length exceeded → Truncate or filter long sequences
Missing required fields → Fix data preprocessing
Duplicate entries → Deduplicate dataset
Invalid labels → Clean label data

Model Validation Failures:

Missing config files → Ensure complete checkpoint
Weight loading errors → Check framework compatibility
Tokenizer mismatch → Verify tokenizer version

Pipeline Test Failures:

Out of memory → Reduce batch size, enable gradient checkpointing
CUDA errors → Check GPU drivers, CUDA version
Slow performance → Profile and optimize data loading

Debug Mode

All scripts support verbose debugging:

bash

bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug

Outputs:

Detailed progress logs
Stack traces for errors
Performance metrics
Memory usage statistics

Best Practices

Always validate before training - Catch data issues early
Use schema validation - Enforce data quality standards
Test with sample data first - Verify pipeline before full training
Check dependencies on new platforms - Avoid runtime failures
Validate checkpoints - Ensure model integrity before deployment
Monitor validation metrics - Track data quality over time
Automate validation in CI/CD - Prevent bad data from entering pipeline
Keep validation schemas updated - Reflect current data requirements

Performance Tips

Use
```
--sample-size
```
for quick validation of large datasets
Run dependency checks once per environment, cache results
Parallelize validation with GNU parallel for multi-file datasets
Use
```
--fail-fast
```
in CI/CD to save time
Cache tokenizers to avoid re-downloading

Troubleshooting

Script Not Found:

bash

# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help

Permission Denied:

bash

# Make scripts executable
chmod +x scripts/*.sh

Missing Dependencies:

bash

# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc

Plugin: ml-training Version: 1.0.0 Supported Frameworks: PyTorch, TensorFlow, JAX Platforms: Modal, Lambda Labs, RunPod, Local

validation-scripts

NPX Install

Tags

SKILL.md Content

ML Training Validation Scripts

Quick Start

Validate Training Data

Validate Model Checkpoint

Test Complete Pipeline

Check System Dependencies

Validation Scripts

1. Data Validation (validate-data.sh)

2. Model Validation (validate-model.sh)

3. Pipeline Testing (test-pipeline.sh)

4. Dependency Checking (check-dependencies.sh)

Templates

Validation Schema Template (templates/validation-schema.json)

Test Configuration Template (templates/test-config.yaml)

Integration with ML Training Workflow

Pre-Training Validation

Post-Training Validation

CI/CD Integration

Error Handling and Debugging

Common Validation Failures

Debug Mode

Best Practices

Performance Tips

Troubleshooting