Scikit-learn Machine Learning
Industry-standard Python library for classical machine learning.
When to Use
- Classification or regression tasks
- Clustering or dimensionality reduction
- Preprocessing and feature engineering
- Model evaluation and cross-validation
- Hyperparameter tuning
- Building ML pipelines
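The tasks above all share the same basic workflow: split the data, fit an estimator, evaluate on held-out samples. A minimal sketch using the built-in iris dataset (any tabular X, y works the same way):

```python
# Minimal end-to-end workflow: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out split
print(f"test accuracy: {accuracy:.2f}")
```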
Algorithm Selection
Classification
| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic |
| Random Forest | General purpose | Handles non-linear, feature importance |
| Gradient Boosting | Best accuracy | State-of-the-art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training, instance-based |
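Rather than trusting the table blindly, the candidates can be compared empirically under identical cross-validation. A sketch on the built-in breast-cancer dataset (distance-based models get a scaler in their pipeline, per the best practices below):

```python
# Compare the table's classifiers with the same 5-fold CV protocol.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} {s:.3f}")
```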
Regression
| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost, LightGBM provide sklearn-compatible wrappers |
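The L2-vs-L1 contrast in the Ridge/Lasso row is easy to see directly: the L1 penalty drives some coefficients exactly to zero, giving implicit feature selection. A sketch on synthetic data (alpha=5.0 is an illustrative choice, not a recommendation):

```python
# Ridge (L2) shrinks coefficients; Lasso (L1) zeroes some out entirely.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(f"zero coefficients  ridge: {ridge_zeros}  lasso: {lasso_zeros}")
```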
Clustering
| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (density) |
| Agglomerative | Hierarchical | n_clusters or distance threshold |
| Gaussian Mixture | Soft clustering | n_components |
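The key-parameter column above captures the main practical difference: KMeans must be told n_clusters up front, while DBSCAN infers the cluster count from density (eps) and labels outliers as -1. A sketch on synthetic blobs:

```python
# KMeans with a fixed k versus DBSCAN's density-based cluster discovery.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN uses -1 for noise points, so exclude it when counting clusters.
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print("kmeans clusters:", len(set(km_labels)))
print("dbscan clusters:", n_db_clusters)
```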
Dimensionality Reduction
| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local/global | Visualization + downstream (separate umap-learn package) |
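For the PCA row, explained_variance_ratio_ quantifies how much global variance the projection keeps, which is the usual check before committing to a reduced representation. A sketch on iris:

```python
# Project 4-D iris data to 2 components and check retained variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("shape:", X_2d.shape)
print("variance kept:", pca.explained_variance_ratio_.sum().round(3))
```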
Pipeline Concepts
Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.
| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |
Common preprocessing flow:
- Impute missing values (SimpleImputer)
- Scale numeric features (StandardScaler, MinMaxScaler)
- Encode categoricals (OneHotEncoder, OrdinalEncoder)
- Optional: feature selection or polynomial features
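The preprocessing flow above can be sketched as a leakage-safe Pipeline: a ColumnTransformer routes numeric columns through imputation and scaling and categorical columns through one-hot encoding, then feeds a model. The DataFrame and column names here are illustrative only:

```python
# Pipeline + ColumnTransformer: all transforms are fit only on training data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 41, 38, 29],          # numeric, with a missing value
    "income": [40000, 52000, 61000, None, 58000, 45000],
    "city": ["NY", "SF", "NY", "LA", "SF", "LA"],  # categorical
    "bought": [0, 1, 1, 1, 0, 0],
})
numeric, categorical = ["age", "income"], ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[numeric + categorical], df["bought"])
acc = model.score(df[numeric + categorical], df["bought"])
print("train accuracy:", acc)
```

Because every transform lives inside the pipeline, calling model.fit on the training split alone is enough to rule out leakage from the test split.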
Model Evaluation
Cross-Validation Strategies
| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
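The point of StratifiedKFold is that every fold keeps the class ratio, which plain KFold does not guarantee on imbalanced data. A sketch with a 10% positive class:

```python
# StratifiedKFold preserves the class ratio in each fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 10% positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
ratios = [y[test].mean() for _, test in skf.split(X, y)]
print("positive ratio per fold:", ratios)  # 0.1 in every fold
```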
Metrics
| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| Classification | F1-score | Imbalanced classes |
| Classification | ROC-AUC | Ranking, threshold tuning |
| Classification | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| Regression | MAE | Robust to outliers |
| Regression | R² | Explained variance |
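The accuracy-vs-F1 distinction is easiest to see on a degenerate model: with 95% negatives, always predicting the majority class scores high accuracy while F1 exposes the failure.

```python
# Accuracy looks fine on imbalanced labels; F1 reveals the useless model.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # degenerate model: always predicts 0

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no positives predicted
print("accuracy:", acc)  # 0.95 despite never finding a positive
print("f1:", f1)         # 0.0
```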
Hyperparameter Tuning
| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow for many params |
| RandomizedSearchCV | Faster | May miss optimal |
| HalvingGridSearchCV | Efficient | Experimental: needs the enable_halving_search_cv import (sklearn 0.24+) |
Key concept: Always tune on validation set, evaluate final model on held-out test set.
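Putting both ideas together: GridSearchCV tunes with cross-validation on the training split only, and the held-out test split is touched once at the end. The double-underscore names address parameters of a named pipeline step. The small grid here is illustrative:

```python
# Grid search over a pipeline, then a single final evaluation on test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}
search = GridSearchCV(pipe, grid, cv=5)  # tuning uses CV on the train split
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```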
Best Practices
| Practice | Why |
|---|---|
| Split data first | Prevent leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based | KNN, SVM, PCA need scaled features |
| Stratify imbalanced | Preserve class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over/underfitting |
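For the learning-curve check, the usual reading is: a persistent gap between training and validation score suggests overfitting, while two low, converged curves suggest underfitting. A sketch using learning_curve on the digits dataset:

```python
# Train vs validation score at increasing training-set sizes.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
est = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sizes, train_scores, val_scores = learning_curve(
    est, X, y, train_sizes=[0.2, 0.5, 1.0], cv=5
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```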
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Fitting scaler on all data | Use pipeline or fit only on train |
| Using accuracy for imbalanced | Use F1, ROC-AUC, or balanced accuracy |
| Too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use feature_importances_ or permutation_importance |
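On the feature-importance point, permutation_importance is a model-agnostic check on which features actually drive predictions, complementing tree models' built-in feature_importances_. A sketch:

```python
# Permutation importance: how much does shuffling each feature hurt the score?
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Evaluate importance on held-out data, not the training split.
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
top = result.importances_mean.argsort()[::-1][:3]
print("top feature indices:", top.tolist())
```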
Resources
- Docs: https://scikit-learn.org/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- Algorithm Cheat Sheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/