Scikit-learn Machine Learning
Industry-standard Python library for classical machine learning.
When to Use
- Classification or regression tasks
- Clustering or dimensionality reduction
- Preprocessing and feature engineering
- Model evaluation and cross-validation
- Hyperparameter tuning
- Building ML pipelines
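The tasks above all share the same basic workflow: split the data, fit an estimator, evaluate on held-out samples. A minimal sketch using the built-in iris dataset (any tabular X, y works the same way):

```python
# Minimal end-to-end workflow: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out split
print(f"test accuracy: {accuracy:.2f}")
```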
Algorithm Selection
Classification
| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic |
| Random Forest | General purpose | Handles non-linear, feature importance |
| Gradient Boosting | Best accuracy | State-of-the-art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training, instance-based |
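Rather than trusting the table blindly, the candidates can be compared empirically under identical cross-validation. A sketch on the built-in breast-cancer dataset (distance-based models get a scaler in their pipeline, per the best practices below):

```python
# Compare the table's classifiers with the same 5-fold CV protocol.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} {s:.3f}")
```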
Regression
| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost, LightGBM provide sklearn-compatible wrappers |
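The L2-vs-L1 contrast in the Ridge/Lasso row is easy to see directly: the L1 penalty drives some coefficients exactly to zero, giving implicit feature selection. A sketch on synthetic data (alpha=5.0 is an illustrative choice, not a recommendation):

```python
# Ridge (L2) shrinks coefficients; Lasso (L1) zeroes some out entirely.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(f"zero coefficients  ridge: {ridge_zeros}  lasso: {lasso_zeros}")
```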
Clustering
| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (density) |
| Agglomerative | Hierarchical | n_clusters or distance threshold |
| Gaussian Mixture | Soft clustering | n_components |
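The key-parameter column above captures the main practical difference: KMeans must be told n_clusters up front, while DBSCAN infers the cluster count from density (eps) and labels outliers as -1. A sketch on synthetic blobs:

```python
# KMeans with a fixed k versus DBSCAN's density-based cluster discovery.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN uses -1 for noise points, so exclude it when counting clusters.
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print("kmeans clusters:", len(set(km_labels)))
print("dbscan clusters:", n_db_clusters)
```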
Dimensionality Reduction
| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local/global | Visualization + downstream (separate umap-learn package) |
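For the PCA row, explained_variance_ratio_ quantifies how much global variance the projection keeps, which is the usual check before committing to a reduced representation. A sketch on iris:

```python
# Project 4-D iris data to 2 components and check retained variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("shape:", X_2d.shape)
print("variance kept:", pca.explained_variance_ratio_.sum().round(3))
```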
Pipeline Concepts
Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.
| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |
Common preprocessing flow:
- Impute missing values (SimpleImputer)
- Scale numeric features (StandardScaler, MinMaxScaler)
- Encode categoricals (OneHotEncoder, OrdinalEncoder)
- Optional: feature selection or polynomial features
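The preprocessing flow above can be sketched as a leakage-safe Pipeline: a ColumnTransformer routes numeric columns through imputation and scaling and categorical columns through one-hot encoding, then feeds a model. The DataFrame and column names here are illustrative only:

```python
# Pipeline + ColumnTransformer: all transforms are fit only on training data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 41, 38, 29],          # numeric, with a missing value
    "income": [40000, 52000, 61000, None, 58000, 45000],
    "city": ["NY", "SF", "NY", "LA", "SF", "LA"],  # categorical
    "bought": [0, 1, 1, 1, 0, 0],
})
numeric, categorical = ["age", "income"], ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[numeric + categorical], df["bought"])
acc = model.score(df[numeric + categorical], df["bought"])
print("train accuracy:", acc)
```

Because every transform lives inside the pipeline, calling model.fit on the training split alone is enough to rule out leakage from the test split.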
Model Evaluation
Cross-Validation Strategies
| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
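The point of StratifiedKFold is that every fold keeps the class ratio, which plain KFold does not guarantee on imbalanced data. A sketch with a 10% positive class:

```python
# StratifiedKFold preserves the class ratio in each fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 10% positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
ratios = [y[test].mean() for _, test in skf.split(X, y)]
print("positive ratio per fold:", ratios)  # 0.1 in every fold
```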
Metrics
| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| Classification | F1-score | Imbalanced classes |
| Classification | ROC-AUC | Ranking, threshold tuning |
| Classification | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| Regression | MAE | Robust to outliers |
| Regression | R² | Explained variance |
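The accuracy-vs-F1 distinction is easiest to see on a degenerate model: with 95% negatives, always predicting the majority class scores high accuracy while F1 exposes the failure.

```python
# Accuracy looks fine on imbalanced labels; F1 reveals the useless model.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # degenerate model: always predicts 0

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no positives predicted
print("accuracy:", acc)  # 0.95 despite never finding a positive
print("f1:", f1)         # 0.0
```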
Hyperparameter Tuning
| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow for many params |
| RandomizedSearchCV | Faster | May miss optimal |
| HalvingGridSearchCV | Efficient | Experimental: needs the enable_halving_search_cv import (sklearn 0.24+) |
Key concept: Always tune on validation set, evaluate final model on held-out test set.
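Putting both ideas together: GridSearchCV tunes with cross-validation on the training split only, and the held-out test split is touched once at the end. The double-underscore names address parameters of a named pipeline step. The small grid here is illustrative:

```python
# Grid search over a pipeline, then a single final evaluation on test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}
search = GridSearchCV(pipe, grid, cv=5)  # tuning uses CV on the train split
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```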
Best Practices
| Practice | Why |
|---|---|
| Split data first | Prevent leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based | KNN, SVM, PCA need scaled features |
| Stratify imbalanced | Preserve class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over/underfitting |
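For the learning-curve check, the usual reading is: a persistent gap between training and validation score suggests overfitting, while two low, converged curves suggest underfitting. A sketch using learning_curve on the digits dataset:

```python
# Train vs validation score at increasing training-set sizes.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
est = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sizes, train_scores, val_scores = learning_curve(
    est, X, y, train_sizes=[0.2, 0.5, 1.0], cv=5
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```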
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Fitting scaler on all data | Use pipeline or fit only on train |
| Using accuracy for imbalanced | Use F1, ROC-AUC, or balanced accuracy |
| Too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use feature_importances_ or permutation_importance |
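On the feature-importance point, permutation_importance is a model-agnostic check on which features actually drive predictions, complementing tree models' built-in feature_importances_. A sketch:

```python
# Permutation importance: how much does shuffling each feature hurt the score?
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Evaluate importance on held-out data, not the training split.
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
top = result.importances_mean.argsort()[::-1][:3]
print("top feature indices:", top.tolist())
```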
Resources
- Docs: https://scikit-learn.org/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- Algorithm Cheat Sheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/