validation-scripts


ML Training Validation Scripts


Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
  • Validating training datasets before fine-tuning
  • Checking model checkpoints and outputs
  • Testing end-to-end training pipelines
  • Verifying system dependencies and GPU availability
  • Debugging training failures or data issues
  • Ensuring data format compliance
  • Validating model configurations
  • Testing inference pipelines
Key Resources:
  • scripts/validate-data.sh
    - Comprehensive dataset validation
  • scripts/validate-model.sh
    - Model checkpoint and config validation
  • scripts/test-pipeline.sh
    - End-to-end pipeline testing
  • scripts/check-dependencies.sh
    - System and package dependency checking
  • templates/test-config.yaml
    - Pipeline test configuration template
  • templates/validation-schema.json
    - Data validation schema template
  • examples/data-validation-example.md
    - Dataset validation workflow
  • examples/pipeline-testing-example.md
    - Complete pipeline testing guide

Quick Start


Validate Training Data


```bash
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json
```

Validate Model Checkpoint


```bash
bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights
```

Test Complete Pipeline


```bash
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose
```

Check System Dependencies


```bash
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16
```

Validation Scripts


1. Data Validation (validate-data.sh)


Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:
```bash
bash scripts/validate-data.sh [OPTIONS]
```
Options:
  • --data-path PATH
    - Path to dataset file or directory (required)
  • --format FORMAT
    - Data format: jsonl, csv, parquet, arrow (default: jsonl)
  • --schema PATH
    - Path to validation schema JSON file
  • --sample-size N
    - Number of samples to validate (default: all)
  • --check-duplicates
    - Check for duplicate entries
  • --check-null
    - Check for null/missing values
  • --check-length
    - Validate text length constraints
  • --check-tokens
    - Validate tokenization compatibility
  • --tokenizer MODEL
    - Tokenizer to use for token validation (default: gpt2)
  • --max-length N
    - Maximum sequence length (default: 2048)
  • --output REPORT
    - Output validation report path (default: validation-report.json)
Validation Checks:
  • Format Compliance: Verify file format and structure
  • Schema Validation: Check against JSON schema if provided
  • Data Types: Validate field types (string, int, float, etc.)
  • Required Fields: Ensure all required fields are present
  • Value Ranges: Check numeric values within expected ranges
  • Text Quality: Detect empty strings, excessive whitespace, encoding issues
  • Duplicates: Identify duplicate entries (exact or near-duplicate)
  • Token Counts: Verify sequences fit within model context length
  • Label Distribution: Check class balance for classification tasks
  • Missing Values: Detect and report null/missing values
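A few of these checks are easy to approximate with standard Unix tools. As a sketch only (the script's own duplicate detection is more thorough, and near-duplicates need fuzzier matching), exact duplicate rows in a JSONL file show up as repeated lines after sorting:

```shell
# Write a tiny illustrative dataset containing one exact duplicate.
cat > /tmp/dup-demo.jsonl <<'EOF'
{"text": "the same example", "label": "positive"}
{"text": "a unique example", "label": "negative"}
{"text": "the same example", "label": "positive"}
EOF

# uniq -d prints each duplicated line once; count them.
dups=$(sort /tmp/dup-demo.jsonl | uniq -d | wc -l)
echo "distinct duplicated lines: $dups"
```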
Example Output:
```json
{
  "status": "PASS",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}
```
Exit Codes:
  • 0
    - All validations passed
  • 1
    - Validation errors found
  • 2
    - Script error (invalid arguments, file not found)
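These exit codes make the validator easy to gate on in automation. A minimal wrapper sketch (a stub stands in for the real script here so the snippet is self-contained; it simulates exit code 1):

```shell
# Stub in place of `bash scripts/validate-data.sh ...`; pretend
# validation errors were found (exit code 1).
validate_data() { return 1; }

validate_data
rc=$?
case $rc in
  0) msg="data OK - proceed to training" ;;
  1) msg="validation errors - see validation-report.json" ;;
  2) msg="script error - check arguments and file paths" ;;
esac
echo "$msg"
```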

2. Model Validation (validate-model.sh)


Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:
```bash
bash scripts/validate-model.sh [OPTIONS]
```
Options:
  • --model-path PATH
    - Path to model checkpoint directory (required)
  • --framework FRAMEWORK
    - Framework: pytorch, tensorflow, jax (default: pytorch)
  • --check-weights
    - Verify model weights are loadable
  • --check-config
    - Validate model configuration file
  • --check-tokenizer
    - Verify tokenizer files are present
  • --check-inference
    - Test inference with sample input
  • --sample-input TEXT
    - Sample text for inference test
  • --expected-output TEXT
    - Expected output for verification
  • --output REPORT
    - Output validation report path
Validation Checks:
  • File Structure: Verify required files are present
  • Config Validation: Check model_config.json for correctness
  • Weight Integrity: Load and verify model weights
  • Tokenizer Files: Ensure tokenizer files exist and are loadable
  • Model Architecture: Validate architecture matches config
  • Memory Requirements: Estimate GPU/CPU memory needed
  • Inference Test: Run sample inference if requested
  • Quantization: Verify quantized models load correctly
  • LoRA/PEFT: Validate adapter weights and configuration
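The file-structure check boils down to asserting that expected files exist in the checkpoint directory. A sketch of that idea (the filenames are typical Hugging Face conventions and are assumptions here, not necessarily the exact set validate-model.sh checks):

```shell
# Simulate an incomplete checkpoint directory: config and tokenizer
# present, weights missing.
ckpt=$(mktemp -d)
touch "$ckpt/config.json" "$ckpt/tokenizer.json"

status=PASS
for f in config.json tokenizer.json model.safetensors; do
  if [ ! -e "$ckpt/$f" ]; then
    echo "missing: $f"
    status=FAIL
  fi
done
echo "file_structure: $status"
```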
Example Output:
```json
{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}
```

3. Pipeline Testing (test-pipeline.sh)


Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:
```bash
bash scripts/test-pipeline.sh [OPTIONS]
```
Options:
  • --config PATH
    - Pipeline test configuration file (required)
  • --data PATH
    - Sample data for testing
  • --steps STEPS
    - Comma-separated steps to test (default: all)
  • --verbose
    - Enable detailed output
  • --output REPORT
    - Output test report path
  • --fail-fast
    - Stop on first failure
  • --cleanup
    - Clean up temporary files after test
Pipeline Steps:
  • data_loading: Test data loading and preprocessing
  • tokenization: Verify tokenization pipeline
  • model_loading: Load model and verify initialization
  • training_step: Run single training step
  • validation_step: Run single validation step
  • checkpoint_save: Test checkpoint saving
  • checkpoint_load: Test checkpoint loading
  • inference: Test inference pipeline
  • metrics: Verify metrics calculation
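The interaction between `--steps` and `--fail-fast` can be pictured as a loop over the comma-separated step list that stops on the first failure. An illustrative sketch, not the script's actual implementation (run_step is a stub that makes tokenization fail so the fail-fast path is visible):

```shell
STEPS="data_loading,tokenization,model_loading"

run_step() {
  # Stub: succeed for every step except tokenization.
  [ "$1" != "tokenization" ]
}

result=PASS
old_ifs=$IFS; IFS=','
for step in $STEPS; do
  if run_step "$step"; then
    echo "ok   $step"
  else
    echo "FAIL $step"
    result=FAIL
    break   # --fail-fast: stop at the first failing step
  fi
done
IFS=$old_ifs
echo "overall: $result"
```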
Test Configuration (test-config.yaml):
```yaml
pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
```
Example Output:
```
Pipeline Test Report
====================

Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (8/8 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
```

4. Dependency Checking (check-dependencies.sh)


Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:
```bash
bash scripts/check-dependencies.sh [OPTIONS]
```
Options:
  • --framework FRAMEWORK
    - ML framework: pytorch, tensorflow, jax (default: pytorch)
  • --gpu-required
    - Require GPU availability
  • --min-vram GB
    - Minimum GPU VRAM required (GB)
  • --cuda-version VERSION
    - Required CUDA version
  • --packages FILE
    - Path to requirements.txt or packages list
  • --platform PLATFORM
    - Platform: modal, lambda, runpod, local (default: local)
  • --fix
    - Attempt to install missing packages
  • --output REPORT
    - Output dependency report path
Dependency Checks:
  • Python Version: Verify compatible Python version
  • ML Framework: Check PyTorch/TensorFlow/JAX installation
  • CUDA/cuDNN: Verify CUDA toolkit and cuDNN
  • GPU Availability: Detect and validate GPUs
  • VRAM: Check GPU memory capacity
  • System Packages: Verify system-level dependencies
  • Python Packages: Check required pip packages
  • Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
  • Storage: Check available disk space
  • Network: Test internet connectivity for model downloads
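Two of these checks are simple enough to sketch inline. This self-contained version covers the Python-version and disk-space checks only (GPU checks need nvidia-smi and are omitted so the sketch runs anywhere; the thresholds are illustrative, not the script's defaults):

```shell
# Python version, major.minor only.
py_ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
echo "python: $py_ver"

# Available disk space in GB on the current filesystem
# (POSIX df -P reports 1K blocks in column 4).
avail_gb=$(df -P . | awk 'NR==2 { printf "%d", $4 / 1024 / 1024 }')
echo "storage_available_gb: $avail_gb"
```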
Example Output:
```json
{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": {
      "status": "PASS",
      "version": "3.10.12",
      "required": ">=3.9"
    },
    "pytorch": {
      "status": "PASS",
      "version": "2.1.0",
      "cuda_available": true,
      "cuda_version": "12.1"
    },
    "gpu": {
      "status": "PASS",
      "count": 1,
      "type": "NVIDIA A100",
      "vram_gb": 40,
      "driver_version": "535.129.03"
    },
    "packages": {
      "status": "PASS",
      "installed": 42,
      "missing": 0,
      "outdated": 3
    },
    "storage": {
      "status": "PASS",
      "available_gb": 128,
      "required_gb": 50
    }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}
```

Templates


Validation Schema Template (templates/validation-schema.json)


Defines expected data structure and validation rules for training datasets.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string"
        },
        "timestamp": {
          "type": "string",
          "format": "date-time"
        }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}
```
Customization:
  • Modify `required` fields for your use case
  • Add custom properties and validation rules
  • Set appropriate length constraints
  • Define label enums for classification
  • Configure tokenizer and max length
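To see what the schema's `required` list enforces, here is a hand-rolled sketch of just that one rule, using only the Python standard library (the real validator can apply the full schema via the jsonschema package; file path and sample rows are illustrative):

```shell
# Two sample rows: one valid, one missing its label.
cat > /tmp/schema-demo.jsonl <<'EOF'
{"text": "a long enough training example", "label": "positive"}
{"text": "this row is missing its label"}
EOF

out=$(python3 - <<'EOF'
import json

required = ["text", "label"]  # mirrors the schema's "required" list
for i, line in enumerate(open("/tmp/schema-demo.jsonl")):
    row = json.loads(line)
    missing = [f for f in required if f not in row]
    print(f"sample {i}: " + ("PASS" if not missing else "FAIL missing " + ",".join(missing)))
EOF
)
echo "$out"
```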

Test Configuration Template (templates/test-config.yaml)


Defines pipeline testing parameters and expected behaviors.
```yaml
pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false
```

Integration with ML Training Workflow


Pre-Training Validation


```bash
# 1. Check system dependencies
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

# 2. Validate training data
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens

# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose
```

Post-Training Validation


```bash
# 1. Validate model checkpoint
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference

# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics
```

CI/CD Integration


```yaml
# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast
```

Error Handling and Debugging


Common Validation Failures


Data Validation Failures:
  • Token length exceeded → Truncate or filter long sequences
  • Missing required fields → Fix data preprocessing
  • Duplicate entries → Deduplicate dataset
  • Invalid labels → Clean label data
Model Validation Failures:
  • Missing config files → Ensure complete checkpoint
  • Weight loading errors → Check framework compatibility
  • Tokenizer mismatch → Verify tokenizer version
Pipeline Test Failures:
  • Out of memory → Reduce batch size, enable gradient checkpointing
  • CUDA errors → Check GPU drivers, CUDA version
  • Slow performance → Profile and optimize data loading
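As a sketch of the first remedy ("truncate or filter long sequences"), the snippet below drops rows over a length budget, using character count as a rough stand-in for token count (the file, budget, and rows are illustrative; a real fix would count tokens with the training tokenizer):

```shell
cat > /tmp/filter-demo.jsonl <<'EOF'
{"text": "short example", "label": "positive"}
{"text": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", "label": "negative"}
EOF

kept=$(python3 - <<'EOF'
import json
MAX_CHARS = 40  # illustrative budget; a real check counts tokens
rows = [json.loads(line) for line in open("/tmp/filter-demo.jsonl")]
print(sum(len(r["text"]) <= MAX_CHARS for r in rows))
EOF
)
echo "kept $kept of 2 samples"
```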

Debug Mode


All scripts support verbose debugging:
```bash
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
```
Outputs:
  • Detailed progress logs
  • Stack traces for errors
  • Performance metrics
  • Memory usage statistics

Best Practices


  1. Always validate before training - Catch data issues early
  2. Use schema validation - Enforce data quality standards
  3. Test with sample data first - Verify pipeline before full training
  4. Check dependencies on new platforms - Avoid runtime failures
  5. Validate checkpoints - Ensure model integrity before deployment
  6. Monitor validation metrics - Track data quality over time
  7. Automate validation in CI/CD - Prevent bad data from entering pipeline
  8. Keep validation schemas updated - Reflect current data requirements

Performance Tips


  • Use `--sample-size` for quick validation of large datasets
  • Run dependency checks once per environment and cache the results
  • Parallelize validation with GNU parallel for multi-file datasets
  • Use `--fail-fast` in CI/CD to save time
  • Cache tokenizers to avoid re-downloading
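The parallelization tip can also be done with plain `xargs -P` where GNU parallel is not installed. A self-contained sketch (echo stands in for the real validate-data.sh invocation, and the shard files are illustrative):

```shell
# Create three empty shard files to "validate".
shards=$(mktemp -d)
touch "$shards/a.jsonl" "$shards/b.jsonl" "$shards/c.jsonl"

# Run up to 2 jobs at once; sort, since completion order varies between runs.
out=$(ls "$shards"/*.jsonl | xargs -P 2 -I{} echo "validated {}" | sort)
echo "$out"
```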

Troubleshooting


**Script Not Found:**
```bash
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
```

**Permission Denied:**
```bash
# Make scripts executable
chmod +x scripts/*.sh
```

**Missing Dependencies:**
```bash
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
```

---

**Plugin**: ml-training
**Version**: 1.0.0
**Supported Frameworks**: PyTorch, TensorFlow, JAX
**Platforms**: Modal, Lambda Labs, RunPod, Local