ml-model-training
ML Model Training
Train machine learning models with proper data handling and evaluation.
Training Workflow
1. Data Preparation → 2. Feature Engineering → 3. Model Selection → 4. Training → 5. Evaluation
Data Preparation
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and clean data
df = pd.read_csv('data.csv')
df = df.dropna()

# Encode categorical variables
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Split data (70/15/15); fixed seeds keep the split reproducible
X = df.drop('target', axis=1)
y = df['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features (fit on training data only to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```
Scikit-learn Training
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))
```
PyTorch Training
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

# Convert the scaled arrays to tensors; targets reshaped to match the (N, 1) output
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

model = Model(X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
```
Evaluation Metrics
| Task | Metrics |
|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC |
| Regression | MSE, RMSE, MAE, R² |
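As a quick illustration, the classification metrics above map directly onto scikit-learn functions (a minimal sketch; the label arrays are toy values, not from this document):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth and predictions for a binary classification task
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print(acc, prec, rec, f1)  # → 0.75 0.75 0.75 0.75
```

The regression metrics (MSE, MAE, R²) follow the same pattern via `sklearn.metrics.mean_squared_error`, `mean_absolute_error`, and `r2_score`.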
Complete Framework Examples
- PyTorch: see references/pytorch-training.md for complete training with:
  - Custom model classes with BatchNorm and Dropout
  - Training/validation loops with early stopping
  - Learning rate scheduling
  - Model checkpointing
  - Full evaluation with classification report
- TensorFlow/Keras: see references/tensorflow-keras.md for:
  - Sequential model architecture
  - Callbacks (EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard)
  - Training history visualization
  - TFLite conversion for mobile deployment
  - Custom training loops
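The early-stopping loop those references cover can be sketched in plain PyTorch (a minimal stand-alone version; the linear model, random data, and patience value are illustrative assumptions, not from the reference files):

```python
import copy
import torch
import torch.nn as nn

# Toy model and data so the sketch runs on its own
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
X_train_t, y_train_t = torch.randn(64, 4), torch.randn(64, 1)
X_val_t, y_val_t = torch.randn(16, 4), torch.randn(16, 1)

best_loss, best_state, patience, bad_epochs = float('inf'), None, 10, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

    # Validation pass without gradients
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val_t), y_val_t).item()

    if val_loss < best_loss:
        best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs

model.load_state_dict(best_state)  # restore the best weights
```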
Best Practices
Do:
- Use cross-validation for robust evaluation
- Track experiments with MLflow
- Save model checkpoints regularly
- Monitor for overfitting
- Document hyperparameters
- Use 70/15/15 train/val/test split
Don't:
- Train without a validation set
- Ignore class imbalance
- Skip feature scaling
- Use test set for hyperparameter tuning
- Forget to set random seeds
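The checkpoint-saving advice above can be sketched for PyTorch (a minimal example; the model, file name, and epoch number are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters())

# Save model and optimizer state together so training can resume
checkpoint = {
    'epoch': 10,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pt')

# Restore later
state = torch.load('checkpoint.pt')
model.load_state_dict(state['model_state'])
optimizer.load_state_dict(state['optimizer_state'])
```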
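The cross-validation recommendation can be sketched with scikit-learn (synthetic data for illustration; the dataset and fold count are assumptions, not from this document):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 5-fold CV yields one score per fold instead of trusting a single split
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores.mean(), scores.std())  # report mean and spread across folds
```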
Known Issues Prevention
1. Data Leakage
Problem: Scaling or transforming data before splitting leads to test set information leaking into training.
Solution: Always split data first, then fit transformers only on training data:
```python
# ✅ Correct: fit on train, transform train/val/test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)    # Only transform
X_test = scaler.transform(X_test)  # Only transform

# ❌ Wrong: fitting on all data
X_all = scaler.fit_transform(X)  # Leaks test info!
```
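A safer variant is to bundle the scaler and model into a scikit-learn `Pipeline`, so that cross-validation or grid search re-fits the scaler on each training fold only (a sketch on synthetic data; the dataset and estimator choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# The scaler is re-fit inside each fold, so no fold ever sees held-out statistics
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)
```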
2. Class Imbalance Ignored
Problem: Training on imbalanced datasets (e.g., 95% class A, 5% class B) leads to models that predict only the majority class.
Solution: Use class weights or resampling:
```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
model = RandomForestClassifier(class_weight='balanced')

# Or use SMOTE to oversample the minority class
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```
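For the PyTorch path, a comparable fix is to weight the loss itself: `BCEWithLogitsLoss` accepts a `pos_weight` set to the negative/positive count ratio (a sketch on toy labels; note this loss takes raw logits, so the model's final `Sigmoid` layer would be dropped):

```python
import torch
import torch.nn as nn

# Toy imbalanced labels: 9 negatives, 3 positives
y = torch.tensor([0.] * 9 + [1.] * 3)

# Up-weight the positive class by the negative/positive count ratio
pos_weight = (y == 0).sum() / (y == 1).sum()
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.zeros_like(y)  # stand-in for raw model outputs
loss = criterion(logits, y)
```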
3. Overfitting Due to No Regularization
Problem: Complex models memorize training data, perform poorly on validation/test sets.
Solution: Add regularization techniques:
```python
# Dropout in PyTorch
nn.Dropout(0.3)

# Constrain model complexity in scikit-learn (limiting tree depth and split size
# acts as regularization for random forests)
RandomForestClassifier(max_depth=10, min_samples_split=20)

# Early stopping in Keras
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```
4. Not Setting Random Seeds
Problem: Results are not reproducible across runs, making debugging and comparison impossible.
Solution: Set all random seeds:
```python
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
```
5. Using Test Set for Hyperparameter Tuning
Problem: Optimizing hyperparameters on test set leads to overfitting to test data.
Solution: Use validation set for tuning, test set only for final evaluation:
```python
from sklearn.model_selection import GridSearchCV

# ✅ Correct: tune on the training set with cross-validation, evaluate on test
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Cross-validation on training set
best_model = grid_search.best_estimator_

# Final evaluation on held-out test set
final_score = best_model.score(X_test, y_test)
```
When to Load References
Load reference files when you need:
- PyTorch implementation details: load references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing
- TensorFlow/Keras patterns: load references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite