scikit-learn-best-practices

Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in machine learning workflows
  • Use functional programming for data pipelines
  • Use object-oriented programming for custom estimators
  • Prefer vectorized operations over explicit loops
  • Follow PEP 8 style guidelines

Machine Learning Workflow

Data Preparation

  • Always split data before any preprocessing: train/validation/test
  • Use `train_test_split()` with `random_state` for reproducibility
  • Stratify splits for imbalanced classification: `stratify=y`
  • Keep test set completely separate until final evaluation
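
A minimal sketch of the split above; the dataset is synthetic and the 90/10 class weights are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (roughly 90/10 class split) for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split before any preprocessing; stratify to preserve the class ratio,
# and fix random_state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```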

Feature Engineering

  • Scale features appropriately for distance-based algorithms
  • Use `StandardScaler` for normally distributed features
  • Use `MinMaxScaler` for bounded features
  • Use `RobustScaler` for data with outliers
  • Encode categorical variables: `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`
  • Handle missing values: `SimpleImputer`, `KNNImputer`
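
A quick sketch of the three scalers plus imputation on a tiny hand-made matrix (the values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Tiny illustrative matrix: one missing value, one outlier row
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, np.nan], [100.0, 400.0]])

# Impute missing values first (median is robust to the outlier)
X_imp = SimpleImputer(strategy="median").fit_transform(X)

X_std = StandardScaler().fit_transform(X_imp)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X_imp)    # rescaled to [0, 1]
X_robust = RobustScaler().fit_transform(X_imp)    # median/IQR, outlier-resistant
```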

Pipelines

  • Always use `Pipeline` to chain preprocessing and modeling
  • Prevents data leakage by fitting transformers only on training data
  • Makes code cleaner and more reproducible
  • Enables easy deployment and serialization

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```

Column Transformers

  • Use `ColumnTransformer` for different preprocessing per feature type
  • Combine numeric and categorical preprocessing in a single pipeline
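
A minimal sketch of per-type preprocessing; the column names (`age`, `income`, `city`) and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; substitute your own numeric/categorical columns
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40_000.0, 55_000.0, None],
    "city": ["NY", "SF", "NY"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocessor.fit_transform(df)  # 2 scaled numeric + 2 one-hot columns
```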

Model Selection and Tuning

Cross-Validation

  • Use cross-validation for reliable performance estimates
  • `cross_val_score()` for quick evaluation
  • `cross_validate()` for multiple metrics
  • Use appropriate CV strategy:
    • `KFold` for regression
    • `StratifiedKFold` for classification
    • `TimeSeriesSplit` for temporal data
    • `GroupKFold` for grouped data
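
The classification case above can be sketched like so (the iris dataset and random forest are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```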

Hyperparameter Tuning

  • Use `GridSearchCV` for exhaustive search
  • Use `RandomizedSearchCV` for large parameter spaces
  • Always tune on training/validation data, never test data
  • Set `n_jobs=-1` for parallel processing
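
A minimal `GridSearchCV` sketch; the grid is deliberately tiny for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative grid; real grids are usually larger
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    n_jobs=-1,  # use all available cores
)
search.fit(X, y)  # in practice, fit on training/validation data only
```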

Model Evaluation

Classification Metrics

  • Use appropriate metrics for your problem:
    • `accuracy_score` for balanced classes
    • `precision_score`, `recall_score`, `f1_score` for imbalanced classes
    • `roc_auc_score` for ranking ability
  • Use `classification_report()` for a comprehensive overview
  • Examine `confusion_matrix()` for error analysis
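
A small sketch with hand-made labels so the numbers are easy to verify:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hand-made labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

print(classification_report(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
f1 = f1_score(y_true, y_pred)          # 2 * P * R / (P + R)
```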

Regression Metrics

  • `mean_squared_error` (MSE) for general use
  • `mean_absolute_error` (MAE) for interpretability
  • `r2_score` for explained variance
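
The three metrics on a tiny hand-made example:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors quadratically
mae = mean_absolute_error(y_true, y_pred)  # same units as the target
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
```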

Evaluation Best Practices

  • Report confidence intervals, not just point estimates
  • Use multiple metrics to understand model behavior
  • Compare against meaningful baselines
  • Evaluate on held-out test set only once, at the end
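
One way to sketch the baseline-and-spread points, assuming a majority-class dummy as the baseline and CV standard deviation as a rough spread measure:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Majority-class baseline: any real model should beat this
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), X, y, cv=5
)
# Report spread alongside the mean, not just a point estimate
print(f"baseline {baseline.mean():.3f} | model {model.mean():.3f} +/- {model.std():.3f}")
```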

Handling Imbalanced Data

  • Use stratified splitting and cross-validation
  • Consider class weights:
    class_weight='balanced'
  • Use appropriate metrics (F1, AUC-PR, not accuracy)
  • Adjust decision threshold based on business needs
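
A sketch combining class weights and threshold adjustment; the dataset is synthetic and the 0.3 threshold is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic ~95/5 imbalance for illustration
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Reweight errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Move the decision threshold off the default 0.5 (0.3 is illustrative)
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)
print(f"F1 at threshold 0.3: {f1_score(y_test, y_pred):.3f}")
```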

Feature Selection

  • Use `SelectKBest` with statistical tests
  • Use `RFE` (Recursive Feature Elimination)
  • Use model-based selection: `SelectFromModel`
  • Examine feature importances from tree-based models
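
Univariate and model-based selection side by side (iris and the random forest are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Univariate selection: keep the 2 features with the highest ANOVA F-score
X_kbest = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Model-based selection driven by tree feature importances
selector = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
X_model = selector.transform(X)
print("importances:", selector.estimator_.feature_importances_)
```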

Model Persistence

  • Use `joblib` for saving and loading models
  • Save entire pipelines, not just models
  • Version control model artifacts
  • Document model metadata
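
Saving the whole pipeline, not just the fitted estimator; the temp path is illustrative:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Persist the whole pipeline so preprocessing travels with the model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipeline, path)
loaded = joblib.load(path)
```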

Performance Optimization

  • Use `n_jobs=-1` for parallel processing where available
  • Consider `warm_start=True` for iterative training
  • Use sparse matrices for high-dimensional sparse data
  • Consider incremental learning with `partial_fit()` for large data
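
A sketch of incremental learning with `SGDClassifier`, one of the estimators that supports `partial_fit()`; the chunk size of 200 is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=42)
classes = np.unique(y)  # partial_fit needs the full label set up front

# Feed the data in chunks, as if streaming from disk
clf = SGDClassifier(random_state=42)
for start in range(0, len(X), 200):
    clf.partial_fit(X[start:start + 200], y[start:start + 200], classes=classes)
```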

Key Conventions

  • Import from submodules: `from sklearn.ensemble import RandomForestClassifier`
  • Set `random_state` for reproducibility
  • Use pipelines to prevent data leakage
  • Document model choices and hyperparameters