machine-learning


Machine Learning


Production-grade machine learning with scikit-learn, XGBoost, and modern ML engineering practices.

Quick Start

python
# Production ML pipeline with scikit-learn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import joblib

# Load and split data
df = pd.read_csv("data/customers.csv")
X = df.drop("churn", axis=1)
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define feature types
numeric_features = ["age", "tenure", "monthly_charges"]
categorical_features = ["contract_type", "payment_method"]

# Build preprocessing pipeline
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

# Full pipeline
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

# Save model
joblib.dump(model, "models/churn_model.joblib")
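Once saved, the pipeline can be reloaded for inference exactly as a serving process would. A minimal sketch of the round trip, using a toy model and a temporary file rather than the `models/churn_model.joblib` path above:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the churn model above
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)

# Save, then reload as a fresh object
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions
assert np.array_equal(pipe.predict(X), restored.predict(X))
```

Note that joblib artifacts are not portable across scikit-learn versions, so pin the library version used at training time and load with the same one.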

Core Concepts


1. Feature Engineering Pipeline


python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    """Custom transformer for date features."""

    def __init__(self, date_column: str):
        self.date_column = date_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        dates = pd.to_datetime(X[self.date_column])
        X["day_of_week"] = dates.dt.dayofweek
        X["month"] = dates.dt.month
        X["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
        X["days_since_epoch"] = (dates - pd.Timestamp("1970-01-01")).dt.days
        return X.drop(self.date_column, axis=1)

class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clip outliers to percentile bounds learned during fit."""

    def __init__(self, lower_percentile=1, upper_percentile=99):
        self.lower_percentile = lower_percentile
        self.upper_percentile = upper_percentile

    def fit(self, X, y=None):
        # Learn bounds in fit (not __init__) so refitting resets state,
        # per the scikit-learn convention for fitted attributes
        self.bounds_ = {
            col: (
                np.percentile(X[col], self.lower_percentile),
                np.percentile(X[col], self.upper_percentile),
            )
            for col in X.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, (lower, upper) in self.bounds_.items():
            X[col] = X[col].clip(lower, upper)
        return X

# Log transform for skewed features
log_transformer = FunctionTransformer(
    func=lambda x: np.log1p(np.maximum(x, 0)),
    inverse_func=lambda x: np.expm1(x)
)

2. Cross-Validation Strategies


python
from sklearn.model_selection import (
    StratifiedKFold, TimeSeriesSplit, GroupKFold,
    cross_val_score, cross_validate
)

# Stratified K-Fold (for imbalanced classification)
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    model, X, y, cv=stratified_cv, scoring="roc_auc", n_jobs=-1
)
print(f"ROC-AUC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# Time Series Split (for temporal data)
ts_cv = TimeSeriesSplit(n_splits=5, gap=7)  # 7-day gap
for train_idx, test_idx in ts_cv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # Train and evaluate...

# Group K-Fold (prevent data leakage by user/entity)
group_cv = GroupKFold(n_splits=5)
groups = df["user_id"]  # Same user never in train and test
scores = cross_val_score(
    model, X, y, cv=group_cv, groups=groups, scoring="roc_auc"
)

# Multiple metrics at once
results = cross_validate(
    model, X, y, cv=stratified_cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
    return_train_score=True
)

3. Hyperparameter Tuning


python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from xgboost import XGBClassifier
import optuna

# RandomizedSearchCV (good baseline)
param_dist = {
    "classifier__n_estimators": randint(100, 500),
    "classifier__max_depth": randint(3, 15),
    "classifier__min_samples_split": randint(2, 20),
    "classifier__min_samples_leaf": randint(1, 10),
}

random_search = RandomizedSearchCV(
    model, param_distributions=param_dist, n_iter=50,
    cv=stratified_cv, scoring="roc_auc", n_jobs=-1,
    random_state=42, verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")

# Optuna (modern, efficient)
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        # min_child_weight is XGBoost's analogue of sklearn's min_samples_split
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    return scores.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, n_jobs=-1)
print(f"Best params: {study.best_params}")

4. XGBoost Production Pattern


python
import xgboost as xgb
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Prepare DMatrix for efficiency
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

params = {
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "auc"],
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 1,
    "tree_method": "hist",  # Fast histogram-based
    "device": "cuda",       # GPU if available; use "cpu" otherwise
    "random_state": 42,
}

# Train with early stopping
evals = [(dtrain, "train"), (dtest, "eval")]
model = xgb.train(
    params, dtrain, num_boost_round=1000,
    evals=evals, early_stopping_rounds=50, verbose_eval=100
)

# Feature importance
importance = model.get_score(importance_type="gain")
sorted_importance = dict(
    sorted(importance.items(), key=lambda x: x[1], reverse=True)
)

# SHAP values for interpretability
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

5. Handling Imbalanced Data


python
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.utils.class_weight import compute_class_weight

# Option 1: Class weights
class_weights = compute_class_weight(
    "balanced", classes=np.unique(y_train), y=y_train
)
weight_dict = dict(zip(np.unique(y_train), class_weights))
model = RandomForestClassifier(class_weight=weight_dict)

# Option 2: SMOTE oversampling
smote = SMOTE(random_state=42, sampling_strategy=0.5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Option 3: Combined pipeline (recommended)
resampling_pipeline = ImbPipeline([
    ("preprocessor", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier())
])

# Option 4: Threshold tuning
from sklearn.metrics import precision_recall_curve

y_prob = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

# Find the threshold for a target recall
# (recalls has one more entry than thresholds, so drop the last)
target_recall = 0.8
idx = np.argmin(np.abs(recalls[:-1] - target_recall))
optimal_threshold = thresholds[idx]
y_pred_adjusted = (y_prob >= optimal_threshold).astype(int)

Tools & Technologies


| Tool | Purpose | Version (2025) |
| --- | --- | --- |
| scikit-learn | Core ML library | 1.4+ |
| XGBoost | Gradient boosting | 2.0+ |
| LightGBM | Fast gradient boosting | 4.2+ |
| CatBoost | Categorical boosting | 1.2+ |
| imbalanced-learn | Sampling strategies | 0.12+ |
| SHAP | Model interpretability | 0.44+ |
| Optuna | Hyperparameter tuning | 3.5+ |
| MLflow | Experiment tracking | 2.10+ |

Learning Path


Phase 1: Foundations (Weeks 1-4)


Week 1: Supervised learning concepts, bias-variance
Week 2: Linear/logistic regression, evaluation metrics
Week 3: Decision trees, ensemble methods
Week 4: Cross-validation, train/test methodology

Phase 2: Intermediate (Weeks 5-8)


Week 5: Feature engineering, preprocessing
Week 6: Gradient boosting (XGBoost, LightGBM)
Week 7: Hyperparameter tuning strategies
Week 8: Handling imbalanced data

Phase 3: Advanced (Weeks 9-12)


Week 9: Unsupervised learning (clustering, PCA)
Week 10: Model interpretability (SHAP, LIME)
Week 11: Time series forecasting
Week 12: Anomaly detection
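Week 9's unsupervised tools compose naturally: reduce dimensionality with PCA, then cluster in the reduced space and check cluster quality. A minimal sketch on synthetic blobs (the data and cluster count are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=42)

# Project 10 features down to 2 principal components
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Cluster in the reduced space; silhouette near 1 means tight, separated clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)
print(f"Silhouette score: {silhouette_score(X_2d, labels):.3f}")
```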

Phase 4: Production (Weeks 13-16)


Week 13: ML pipelines with scikit-learn
Week 14: Model serialization, versioning
Week 15: A/B testing for ML models
Week 16: Monitoring and retraining
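Week 15's A/B testing usually reduces to comparing success rates between the incumbent and candidate model on split traffic. A minimal two-proportion z-test, hand-rolled with scipy (the traffic counts are made-up illustration numbers):

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference of two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return z, p_value

# Hypothetical 50/50 traffic split: model A (control) vs model B (candidate)
z, p = two_proportion_ztest(success_a=420, n_a=5000, success_b=480, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

In practice you would also fix the sample size in advance (power analysis) rather than peeking at the p-value as traffic accumulates.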

Troubleshooting Guide


Common Failure Modes


| Issue | Symptoms | Root Cause | Fix |
| --- | --- | --- | --- |
| Overfitting | Train >> test score | Model too complex | Regularization, cross-validation |
| Underfitting | Both scores low | Model too simple | More features, more complex model |
| Data leakage | Perfect CV, bad prod | Future info in features | Check feature timing |
| Class imbalance | Low minority recall | Skewed class distribution | SMOTE, class weights, threshold |
| Covariate shift | Model degrades over time | Data distribution changed | Monitor, retrain regularly |
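The first row of the table can be checked mechanically: a large gap between train and test score signals overfitting, and regularization shrinks it. A minimal sketch with a deliberately overgrown tree on synthetic data (label noise added via `flip_y`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Unconstrained depth memorizes the noisy training labels
tree = DecisionTreeClassifier(random_state=42)  # no max_depth
tree.fit(X_train, y_train)
gap = tree.score(X_train, y_train) - tree.score(X_test, y_test)
print(f"Train-test gap (full tree): {gap:.3f}")

# Regularizing (limiting depth) shrinks the gap
pruned = DecisionTreeClassifier(max_depth=4, random_state=42)
pruned.fit(X_train, y_train)
pruned_gap = pruned.score(X_train, y_train) - pruned.score(X_test, y_test)
print(f"Train-test gap (depth 4): {pruned_gap:.3f}")
```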

Debug Checklist


python
# 1. Check data distribution
print(y.value_counts(normalize=True))

# 2. Verify no data leakage:
#    - Features computed before the target event
#    - No future information
#    - No target encoding on the full dataset

# 3. Learning curves
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="roc_auc"
)

# 4. Feature importance analysis (reach inside the pipeline)
importances = model.named_steps["classifier"].feature_importances_
sorted_idx = np.argsort(importances)[::-1]

# 5. Error analysis: look for patterns in misclassifications
errors = X_test[y_test != y_pred]

Unit Test Template


python
import pytest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from your_module import create_pipeline, train_model

@pytest.fixture
def sample_data():
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )
    return train_test_split(X, y, test_size=0.2, random_state=42)

class TestMLPipeline:

    def test_pipeline_trains_successfully(self, sample_data):
        X_train, X_test, y_train, y_test = sample_data
        model = create_pipeline()
        model.fit(X_train, y_train)
        assert hasattr(model, "predict")

    def test_predictions_valid_range(self, sample_data):
        X_train, X_test, y_train, y_test = sample_data
        model = create_pipeline()
        model.fit(X_train, y_train)
        predictions = model.predict_proba(X_test)[:, 1]

        assert np.all(predictions >= 0)
        assert np.all(predictions <= 1)

    def test_model_better_than_random(self, sample_data):
        X_train, X_test, y_train, y_test = sample_data
        model = create_pipeline()
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)

        assert score > 0.5  # Better than random

    def test_handles_missing_values(self):
        X = np.array([[1, 2], [np.nan, 3], [4, np.nan]])
        y = np.array([0, 1, 0])

        model = create_pipeline()
        model.fit(X, y)
        predictions = model.predict(X)

        assert len(predictions) == len(y)

Best Practices


Model Development


python
# ✅ DO: Use pipelines for reproducibility
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])

# ✅ DO: Stratify splits for classification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# ✅ DO: Use appropriate metrics
#    Classification: ROC-AUC, PR-AUC, F1
#    Regression: RMSE, MAE, R²

# ❌ DON'T: Tune hyperparameters on the test set
# ❌ DON'T: Engineer features on the full dataset
# ❌ DON'T: Ignore class imbalance

Production Readiness


python
# ✅ DO: Version your models
import mlflow

mlflow.sklearn.log_model(model, "model")
mlflow.log_params(params)
mlflow.log_metrics({"auc": auc_score})

# ✅ DO: Monitor predictions for distribution drift
def monitor_predictions(predictions, reference_dist):
    from scipy.stats import ks_2samp
    stat, p_value = ks_2samp(predictions, reference_dist)
    if p_value < 0.05:
        alert("Distribution shift detected")  # alert() = your alerting hook

Resources


Official Documentation


Courses


Books


  • "Hands-On Machine Learning" by Aurélien Géron
  • "The Elements of Statistical Learning"
  • "Feature Engineering for Machine Learning" by Alice Zheng and Amanda Casari

Next Skills


After mastering Machine Learning:
  • deep-learning
    - Neural networks with PyTorch
  • mlops
    - Production ML systems
  • llms-generative-ai
    - Large language models
  • statistics-math
    - Deeper mathematical foundations

Skill Certification Checklist:
  • Can build end-to-end ML pipelines with scikit-learn
  • Can tune hyperparameters with cross-validation
  • Can handle imbalanced datasets appropriately
  • Can interpret models with SHAP values
  • Can deploy models with proper versioning