machine-learning


Machine Learning


Production-grade machine learning with scikit-learn, XGBoost, and modern ML engineering practices.

Quick Start

python
# Production ML pipeline with scikit-learn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import joblib

# Load and split data
df = pd.read_csv("data/customers.csv")
X = df.drop("churn", axis=1)
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define feature types
numeric_features = ["age", "tenure", "monthly_charges"]
categorical_features = ["contract_type", "payment_method"]

# Build preprocessing pipeline
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

# Full pipeline
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

# Save model
joblib.dump(model, "models/churn_model.joblib")
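Once saved, the pipeline can be reloaded for inference exactly as a serving process would. A minimal sketch of the round trip, using a toy model and a temporary file rather than the `models/churn_model.joblib` path above:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the churn model above
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)

# Save, then reload as a fresh object
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions
assert np.array_equal(pipe.predict(X), restored.predict(X))
```

Note that joblib artifacts are not portable across scikit-learn versions, so pin the library version used at training time and load with the same one.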

Core Concepts


1. Feature Engineering Pipeline


python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    """Custom transformer for date features."""

    def __init__(self, date_column: str):
        self.date_column = date_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        dates = pd.to_datetime(X[self.date_column])
        X["day_of_week"] = dates.dt.dayofweek
        X["month"] = dates.dt.month
        X["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
        X["days_since_epoch"] = (dates - pd.Timestamp("1970-01-01")).dt.days
        return X.drop(self.date_column, axis=1)

class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clip outliers to percentile bounds learned during fit."""

    def __init__(self, lower_percentile=1, upper_percentile=99):
        self.lower_percentile = lower_percentile
        self.upper_percentile = upper_percentile

    def fit(self, X, y=None):
        # Learn bounds in fit (not __init__) so refitting resets state,
        # per the scikit-learn convention for fitted attributes
        self.bounds_ = {
            col: (
                np.percentile(X[col], self.lower_percentile),
                np.percentile(X[col], self.upper_percentile),
            )
            for col in X.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, (lower, upper) in self.bounds_.items():
            X[col] = X[col].clip(lower, upper)
        return X

# Log transform for skewed features
log_transformer = FunctionTransformer(
    func=lambda x: np.log1p(np.maximum(x, 0)),
    inverse_func=lambda x: np.expm1(x)
)

2. Cross-Validation Strategies


python
from sklearn.model_selection import (
    StratifiedKFold, TimeSeriesSplit, GroupKFold,
    cross_val_score, cross_validate
)

# Stratified K-Fold (for imbalanced classification)
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    model, X, y, cv=stratified_cv, scoring="roc_auc", n_jobs=-1
)
print(f"ROC-AUC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# Time Series Split (for temporal data)
ts_cv = TimeSeriesSplit(n_splits=5, gap=7)  # 7-day gap
for train_idx, test_idx in ts_cv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # Train and evaluate...

# Group K-Fold (prevent data leakage by user/entity)
group_cv = GroupKFold(n_splits=5)
groups = df["user_id"]  # Same user never in train and test
scores = cross_val_score(
    model, X, y, cv=group_cv, groups=groups, scoring="roc_auc"
)

# Multiple metrics at once
results = cross_validate(
    model, X, y, cv=stratified_cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
    return_train_score=True
)

3. Hyperparameter Tuning


python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from xgboost import XGBClassifier
import optuna

# RandomizedSearchCV (good baseline)
param_dist = {
    "classifier__n_estimators": randint(100, 500),
    "classifier__max_depth": randint(3, 15),
    "classifier__min_samples_split": randint(2, 20),
    "classifier__min_samples_leaf": randint(1, 10),
}

random_search = RandomizedSearchCV(
    model, param_distributions=param_dist, n_iter=50,
    cv=stratified_cv, scoring="roc_auc", n_jobs=-1,
    random_state=42, verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")

# Optuna (modern, efficient)
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        # min_child_weight is XGBoost's analogue of sklearn's min_samples_split
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    return scores.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, n_jobs=-1)
print(f"Best params: {study.best_params}")

4. XGBoost Production Pattern


python
import xgboost as xgb
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Prepare DMatrix for efficiency
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

params = {
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "auc"],
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 1,
    "tree_method": "hist",  # Fast histogram-based
    "device": "cuda",       # GPU if available; use "cpu" otherwise
    "random_state": 42,
}

# Train with early stopping
evals = [(dtrain, "train"), (dtest, "eval")]
model = xgb.train(
    params, dtrain, num_boost_round=1000,
    evals=evals, early_stopping_rounds=50, verbose_eval=100
)

# Feature importance
importance = model.get_score(importance_type="gain")
sorted_importance = dict(
    sorted(importance.items(), key=lambda x: x[1], reverse=True)
)

# SHAP values for interpretability
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

5. Handling Imbalanced Data


python
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.utils.class_weight import compute_class_weight

# Option 1: Class weights
class_weights = compute_class_weight(
    "balanced", classes=np.unique(y_train), y=y_train
)
weight_dict = dict(zip(np.unique(y_train), class_weights))
model = RandomForestClassifier(class_weight=weight_dict)

# Option 2: SMOTE oversampling
smote = SMOTE(random_state=42, sampling_strategy=0.5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Option 3: Combined pipeline (recommended)
resampling_pipeline = ImbPipeline([
    ("preprocessor", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier())
])

# Option 4: Threshold tuning
from sklearn.metrics import precision_recall_curve

y_prob = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

# Find the threshold for a target recall
# (recalls has one more entry than thresholds, so drop the last)
target_recall = 0.8
idx = np.argmin(np.abs(recalls[:-1] - target_recall))
optimal_threshold = thresholds[idx]
y_pred_adjusted = (y_prob >= optimal_threshold).astype(int)

Tools & Technologies


| Tool | Purpose | Version (2025) |
| --- | --- | --- |
| scikit-learn | Core ML library | 1.4+ |
| XGBoost | Gradient boosting | 2.0+ |
| LightGBM | Fast gradient boosting | 4.2+ |
| CatBoost | Categorical boosting | 1.2+ |
| imbalanced-learn | Sampling strategies | 0.12+ |
| SHAP | Model interpretability | 0.44+ |
| Optuna | Hyperparameter tuning | 3.5+ |
| MLflow | Experiment tracking | 2.10+ |

Learning Path


Phase 1: Foundations (Weeks 1-4)


Week 1: Supervised learning concepts, bias-variance
Week 2: Linear/logistic regression, evaluation metrics
Week 3: Decision trees, ensemble methods
Week 4: Cross-validation, train/test methodology

Phase 2: Intermediate (Weeks 5-8)


Week 5: Feature engineering, preprocessing
Week 6: Gradient boosting (XGBoost, LightGBM)
Week 7: Hyperparameter tuning strategies
Week 8: Handling imbalanced data

Phase 3: Advanced (Weeks 9-12)


Week 9: Unsupervised learning (clustering, PCA)
Week 10: Model interpretability (SHAP, LIME)
Week 11: Time series forecasting
Week 12: Anomaly detection
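Week 9's unsupervised tools compose naturally: reduce dimensionality with PCA, then cluster in the reduced space and check cluster quality. A minimal sketch on synthetic blobs (the data and cluster count are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=42)

# Project 10 features down to 2 principal components
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Cluster in the reduced space; silhouette near 1 means tight, separated clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)
print(f"Silhouette score: {silhouette_score(X_2d, labels):.3f}")
```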

Phase 4: Production (Weeks 13-16)


Week 13: ML pipelines with scikit-learn
Week 14: Model serialization, versioning
Week 15: A/B testing for ML models
Week 16: Monitoring and retraining
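Week 15's A/B testing usually reduces to comparing success rates between the incumbent and candidate model on split traffic. A minimal two-proportion z-test, hand-rolled with scipy (the traffic counts are made-up illustration numbers):

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference of two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return z, p_value

# Hypothetical 50/50 traffic split: model A (control) vs model B (candidate)
z, p = two_proportion_ztest(success_a=420, n_a=5000, success_b=480, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

In practice you would also fix the sample size in advance (power analysis) rather than peeking at the p-value as traffic accumulates.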

Troubleshooting Guide


Common Failure Modes


| Issue | Symptoms | Root Cause | Fix |
| --- | --- | --- | --- |
| Overfitting | Train >> test score | Model too complex | Regularization, cross-validation |
| Underfitting | Both scores low | Model too simple | More features, more complex model |
| Data leakage | Perfect CV, bad prod | Future info in features | Check feature timing |
| Class imbalance | Low minority recall | Skewed class distribution | SMOTE, class weights, threshold |
| Covariate shift | Model degrades over time | Data distribution changed | Monitor, retrain regularly |
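The first row of the table can be checked mechanically: a large gap between train and test score signals overfitting, and regularization shrinks it. A minimal sketch with a deliberately overgrown tree on synthetic data (label noise added via `flip_y`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Unconstrained depth memorizes the noisy training labels
tree = DecisionTreeClassifier(random_state=42)  # no max_depth
tree.fit(X_train, y_train)
gap = tree.score(X_train, y_train) - tree.score(X_test, y_test)
print(f"Train-test gap (full tree): {gap:.3f}")

# Regularizing (limiting depth) shrinks the gap
pruned = DecisionTreeClassifier(max_depth=4, random_state=42)
pruned.fit(X_train, y_train)
pruned_gap = pruned.score(X_train, y_train) - pruned.score(X_test, y_test)
print(f"Train-test gap (depth 4): {pruned_gap:.3f}")
```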

Debug Checklist


python
# 1. Check data distribution
print(y.value_counts(normalize=True))

# 2. Verify no data leakage:
#    - Features computed before the target event
#    - No future information
#    - No target encoding on the full dataset

# 3. Learning curves
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="roc_auc"
)

# 4. Feature importance analysis (reach inside the pipeline)
importances = model.named_steps["classifier"].feature_importances_
sorted_idx = np.argsort(importances)[::-1]

# 5. Error analysis: look for patterns in misclassifications
errors = X_test[y_test != y_pred]

Unit Test Template


python
import pytest
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from your_module import create_pipeline, train_model

@pytest.fixture
def sample_data():
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )
    return train_test_split(X, y, test_size=0.2, random_state=42)

class TestMLPipeline:

    def test_pipeline_trains_successfully(self, sample_data):
        X_train, X_test, y_train, y_test = sample_data
        model = create_pipeline()
        model.fit(X_train, y_train)
        assert hasattr(model, "predict")

    def test_predictions_valid_range(self, sample_data):
        X_train, X_test, y_train, y_test = sample_data
        model = create_pipeline()
        model.fit(X_train, y_train)
        predictions = model.predict_proba(X_test)[:, 1]

        assert np.all(predictions >= 0)
        assert np.all(predictions <= 1)

    def test_model_better_than_random(self, sample_data):
        X_train, X_test, y_train, y_test = sample_data
        model = create_pipeline()
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)

        assert score > 0.5  # Better than random

    def test_handles_missing_values(self):
        X = np.array([[1, 2], [np.nan, 3], [4, np.nan]])
        y = np.array([0, 1, 0])

        model = create_pipeline()
        model.fit(X, y)
        predictions = model.predict(X)

        assert len(predictions) == len(y)

Best Practices


Model Development


python
# ✅ DO: Use pipelines for reproducibility
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])

# ✅ DO: Stratify splits for classification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# ✅ DO: Use appropriate metrics
#    Classification: ROC-AUC, PR-AUC, F1
#    Regression: RMSE, MAE, R²

# ❌ DON'T: Tune hyperparameters on the test set
# ❌ DON'T: Engineer features on the full dataset
# ❌ DON'T: Ignore class imbalance

Production Readiness


python
# ✅ DO: Version your models
import mlflow

mlflow.sklearn.log_model(model, "model")
mlflow.log_params(params)
mlflow.log_metrics({"auc": auc_score})

# ✅ DO: Monitor predictions for distribution drift
def monitor_predictions(predictions, reference_dist):
    from scipy.stats import ks_2samp
    stat, p_value = ks_2samp(predictions, reference_dist)
    if p_value < 0.05:
        alert("Distribution shift detected")  # alert() = your alerting hook

Resources


Official Documentation


Courses


Books


  • "Hands-On Machine Learning" by Aurélien Géron
  • "The Elements of Statistical Learning"
  • "Feature Engineering for Machine Learning" by Alice Zheng and Amanda Casari

Next Skills


After mastering Machine Learning:
  • deep-learning
    - Neural networks with PyTorch
  • mlops
    - Production ML systems
  • llms-generative-ai
    - Large language models
  • statistics-math
    - Deeper mathematical foundations

Skill Certification Checklist:
  • Can build end-to-end ML pipelines with scikit-learn
  • Can tune hyperparameters with cross-validation
  • Can handle imbalanced datasets appropriately
  • Can interpret models with SHAP values
  • Can deploy models with proper versioning