Scikit-learn Machine Learning

Industry-standard Python library for classical machine learning.

When to Use

  • Classification or regression tasks
  • Clustering or dimensionality reduction
  • Preprocessing and feature engineering
  • Model evaluation and cross-validation
  • Hyperparameter tuning
  • Building ML pipelines

Algorithm Selection

Classification

Algorithm           | Best For                | Strengths
Logistic Regression | Baseline, interpretable | Fast, probabilistic
Random Forest       | General purpose         | Handles non-linear data, feature importance
Gradient Boosting   | Best accuracy           | State-of-the-art for tabular data
SVM                 | High-dimensional data   | Works well with few samples
KNN                 | Simple problems         | No training phase, instance-based

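A minimal sketch of comparing two of these options with cross-validation; the bundled breast-cancer dataset and the hyperparameters are illustrative placeholders, not recommendations.

```python
# Illustrative comparison of a linear baseline vs. a tree ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Logistic regression benefits from scaled features, so wrap it in a pipeline.
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # Tree ensembles are insensitive to feature scale.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```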

Regression

Algorithm         | Best For                 | Notes
Linear Regression | Baseline                 | Interpretable coefficients
Ridge/Lasso       | Regularization needed    | L2 vs L1 penalty
Random Forest     | Non-linear relationships | Robust to outliers
Gradient Boosting | Best accuracy            | XGBoost and LightGBM provide scikit-learn-compatible wrappers

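A minimal sketch comparing Ridge and Lasso under cross-validation; the toy dataset and alpha values are placeholders one would normally tune.

```python
# Illustrative regularized regression; alpha values are placeholders to tune.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

for estimator in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    # Scale first so the L1/L2 penalty treats all coefficients comparably.
    model = make_pipeline(StandardScaler(), estimator)
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(type(estimator).__name__, f"CV RMSE: {rmse:.1f}")
```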

Clustering

Algorithm        | Best For           | Key Parameter
KMeans           | Spherical clusters | n_clusters (must specify)
DBSCAN           | Arbitrary shapes   | eps (density)
Agglomerative    | Hierarchical       | n_clusters or distance threshold
Gaussian Mixture | Soft clustering    | n_components

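A minimal sketch on synthetic blobs; n_clusters, eps and min_samples are illustrative and must be chosen per dataset.

```python
# Illustrative clustering on synthetic data; eps/n_clusters depend on your data.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)  # distance-based methods need scaled features

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # label -1 = noise

print("KMeans clusters:", np.unique(kmeans_labels))
print("DBSCAN clusters (incl. -1 for noise):", np.unique(dbscan_labels))
```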

Dimensionality Reduction

Method | Preserves             | Use Case
PCA    | Global variance       | Feature reduction
t-SNE  | Local structure       | 2D/3D visualization
UMAP   | Both local and global | Visualization + downstream

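Note that UMAP ships in the separate umap-learn package rather than scikit-learn itself. A minimal sketch of PCA for feature reduction and t-SNE for visualization, using a bundled toy dataset:

```python
# Illustrative: PCA for feature reduction, t-SNE for 2D visualization only.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Keep enough components to retain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# t-SNE output is for plotting; don't feed it to downstream models.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)
print(X_2d.shape)
```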

Pipeline Concepts

Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.

Component         | Purpose
Pipeline          | Sequential steps (transform → model)
ColumnTransformer | Apply different transforms to different columns
FeatureUnion      | Combine multiple feature extraction methods

Common preprocessing flow:
  1. Impute missing values (SimpleImputer)
  2. Scale numeric features (StandardScaler, MinMaxScaler)
  3. Encode categoricals (OneHotEncoder, OrdinalEncoder)
  4. Optional: feature selection or polynomial features

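A sketch of this flow with Pipeline and ColumnTransformer; the column names and the final estimator are hypothetical placeholders.

```python
# Sketch of the preprocessing flow; column names here are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]        # placeholder column names
categorical_cols = ["city", "segment"]  # placeholder column names

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", categorical, categorical_cols),
])

# fit() learns imputation/scaling/encoding from training data only,
# which is what prevents leakage into the test set.
model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])
# model.fit(X_train, y_train); model.predict(X_test)
```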

Model Evaluation

Cross-Validation Strategies

Strategy        | Use Case
KFold           | General purpose
StratifiedKFold | Imbalanced classification
TimeSeriesSplit | Temporal data
LeaveOneOut     | Very small datasets

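A minimal sketch of matching the splitter to the data; the estimator and dataset are arbitrary and only illustrate the mechanics.

```python
# Illustrative: match the CV splitter to the data, not the other way around.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio in every split (imbalanced classification).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv).mean())

# Temporal data: forward-chaining splits, never shuffled.
ts_cv = TimeSeriesSplit(n_splits=5)
```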

Metrics

Task           | Metric             | When to Use
Classification | Accuracy           | Balanced classes
               | F1-score           | Imbalanced classes
               | ROC-AUC            | Ranking, threshold tuning
               | Precision/Recall   | Domain-specific costs
Regression     | RMSE               | Penalize large errors
               | MAE                | Robust to outliers
               | Explained variance | Overall goodness of fit

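A minimal sketch reporting several classification metrics on a held-out split; the dataset and model are illustrative.

```python
# Illustrative: several classification metrics on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Precision, recall, F1 and accuracy in one report.
print(classification_report(y_test, clf.predict(X_test)))
# ROC-AUC needs probabilities or decision scores, not hard labels.
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```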

Hyperparameter Tuning

Method              | Pros       | Cons
GridSearchCV        | Exhaustive | Slow for many parameters
RandomizedSearchCV  | Faster     | May miss the optimum
HalvingGridSearchCV | Efficient  | Requires scikit-learn 0.24+ (still behind an experimental import)

Key concept: Always tune on a validation set (or via cross-validation) and evaluate the final model on a held-out test set.

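A minimal sketch of GridSearchCV over a pipeline; the parameter grid is a placeholder to adapt to the model at hand.

```python
# Illustrative grid search; the parameter grid is a placeholder to adapt.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

# Tuning runs cross-validation on the training split only.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_)

# The held-out test set is touched exactly once, at the very end.
print("test F1:", search.score(X_test, y_test))
```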

Best Practices

Practice                 | Why
Split data first         | Prevent leakage
Use pipelines            | Reproducible, no leakage
Scale for distance-based | KNN, SVM, PCA need scaled features
Stratify imbalanced      | Preserve class distribution
Cross-validate           | Reliable performance estimates
Check learning curves    | Diagnose over/underfitting

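A minimal sketch of the learning-curve check from the last row; the model and dataset are illustrative.

```python
# Illustrative learning curve to diagnose over- vs. underfitting.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

# A large, persistent train/validation gap suggests overfitting;
# two low, converging curves suggest underfitting.
print("train:", train_scores.mean(axis=1).round(3))
print("valid:", val_scores.mean(axis=1).round(3))
```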

Common Pitfalls

Pitfall                       | Solution
Fitting scaler on all data    | Use pipeline or fit only on train
Using accuracy for imbalanced | Use F1, ROC-AUC, or balanced accuracy
Too many hyperparameters      | Start simple, add complexity
Ignoring feature importance   | Use feature_importances_ or permutation importance

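A minimal sketch of permutation importance as the model-agnostic alternative mentioned above; the dataset and model are illustrative.

```python
# Illustrative: permutation importance as a model-agnostic importance measure.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Computed on held-out data, so it reflects generalization rather than training fit.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
top5 = result.importances_mean.argsort()[::-1][:5]
print("top feature indices:", top5)
```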

Resources
