scikit-learn-best-practices
Scikit-learn Best Practices
Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.
Code Style and Structure
- Write concise, technical responses with accurate Python examples
- Prioritize reproducibility in machine learning workflows
- Use functional programming for data pipelines
- Use object-oriented programming for custom estimators
- Prefer vectorized operations over explicit loops
- Follow PEP 8 style guidelines
Machine Learning Workflow
Data Preparation
- Always split data before any preprocessing: train/validation/test
- Use `train_test_split()` with `random_state` for reproducibility
- Stratify splits for imbalanced classification: `stratify=y`
- Keep the test set completely separate until final evaluation
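The splitting rules above can be sketched as follows; the dataset is a hypothetical toy set from `make_classification`, with `weights` simulating class imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy dataset (roughly 80% / 20% classes).
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=42)

# Split before any preprocessing; stratify=y keeps the class ratio in
# both splits, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```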
Feature Engineering
- Scale features appropriately for distance-based algorithms
- Use `StandardScaler` for normally distributed features
- Use `MinMaxScaler` for bounded features
- Use `RobustScaler` for data with outliers
- Encode categorical variables: `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`
- Handle missing values: `SimpleImputer`, `KNNImputer`
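A minimal sketch of the scalers and imputers above on a small synthetic array (the values are made up; note how `RobustScaler` tolerates the outlier in the first column):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical data: one missing value, one outlier (100.0) in column 0.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0],
              [100.0, 300.0]])

# Fill the missing value with the column mean before scaling.
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

X_std = StandardScaler().fit_transform(X_imputed)    # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X_imputed)   # rescaled to [0, 1]
X_robust = RobustScaler().fit_transform(X_imputed)   # median/IQR, outlier-resistant
```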
Pipelines
- Always use `Pipeline` to chain preprocessing and modeling
- Prevents data leakage by fitting transformers only on training data
- Makes code cleaner and more reproducible
- Enables easy deployment and serialization

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```
Column Transformers
- Use `ColumnTransformer` for different preprocessing per feature type
- Combine numeric and categorical preprocessing in a single pipeline
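As a sketch, assuming a small hypothetical DataFrame with one numeric and one categorical column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical mixed-type data.
df = pd.DataFrame({'age': [25, 32, 47, 51],
                   'city': ['NY', 'SF', 'NY', 'LA']})

# Scale the numeric column, one-hot encode the categorical one.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
X = preprocessor.fit_transform(df)  # 1 scaled column + 3 one-hot columns
```

The preprocessor can then become the first step of a `Pipeline`, so the same transformations are fit on training data only.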
Model Selection and Tuning
Cross-Validation
- Use cross-validation for reliable performance estimates
- `cross_val_score()` for quick evaluation
- `cross_validate()` for multiple metrics
- Use an appropriate CV strategy:
  - `KFold` for regression
  - `StratifiedKFold` for classification
  - `TimeSeriesSplit` for temporal data
  - `GroupKFold` for grouped data
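A sketch of scoring a classifier with `cross_val_score()` and an explicit `StratifiedKFold` (synthetic data, F1 as the metric):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# StratifiedKFold preserves the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```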
Hyperparameter Tuning
- Use `GridSearchCV` for exhaustive search
- Use `RandomizedSearchCV` for large parameter spaces
- Always tune on training/validation data, never test data
- Set `n_jobs=-1` for parallel processing
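A minimal `GridSearchCV` sketch; the grid here is deliberately tiny and hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=200, random_state=0)

# Hypothetical small grid; real searches usually cover more values.
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,  # parallelize across all cores
)
search.fit(X_train, y_train)  # tune on training data only, never the test set
best = search.best_params_
```

For large spaces, `RandomizedSearchCV` takes `param_distributions` and an `n_iter` budget instead of an exhaustive grid.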
Model Evaluation
Classification Metrics
- Use appropriate metrics for your problem:
  - `accuracy_score` for balanced classes
  - `precision_score`, `recall_score`, `f1_score` for imbalanced classes
  - `roc_auc_score` for ranking ability
- Use `classification_report()` for a comprehensive overview
- Examine `confusion_matrix()` for error analysis
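The metrics above on a small made-up imbalanced example (labels chosen by hand):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             classification_report, confusion_matrix)

# Hypothetical imbalanced labels: six negatives, four positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)   # can look fine even when classes are skewed
f1 = f1_score(y_true, y_pred)          # balances precision and recall
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted
print(classification_report(y_true, y_pred))
```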
Regression Metrics
- `mean_squared_error` (MSE) for general use
- `mean_absolute_error` (MAE) for interpretability
- `r2_score` for explained variance
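The same three metrics on hand-picked numbers:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true, y_pred)   # squares errors, punishes big misses
mae = mean_absolute_error(y_true, y_pred)  # same units as the target
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
```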
Evaluation Best Practices
- Report confidence intervals, not just point estimates
- Use multiple metrics to understand model behavior
- Compare against meaningful baselines
- Evaluate on held-out test set only once, at the end
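A sketch of comparing against a baseline and reporting spread rather than a single point estimate (synthetic data, with `DummyClassifier` as the baseline):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
baseline_scores = cross_val_score(DummyClassifier(strategy='most_frequent'),
                                  X, y, cv=5)

# Mean plus spread gives a rough interval, not a single number.
print(f"model:    {model_scores.mean():.3f} +/- {model_scores.std():.3f}")
print(f"baseline: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
```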
Handling Imbalanced Data
- Use stratified splitting and cross-validation
- Consider class weights: `class_weight='balanced'`
- Use appropriate metrics (F1, AUC-PR, not accuracy)
- Adjust the decision threshold based on business needs
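A sketch combining `class_weight='balanced'` with a custom decision threshold; the 0.3 cutoff is hypothetical, standing in for one derived from business costs:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 9:1 imbalanced problem.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# 'balanced' reweights errors inversely to class frequency.
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Move the threshold below the default 0.5 to catch more positives.
proba = clf.predict_proba(X)[:, 1]
threshold = 0.3  # hypothetical, set from business needs
y_pred = (proba >= threshold).astype(int)
```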
Feature Selection
- Use `SelectKBest` with statistical tests
- Use `RFE` (Recursive Feature Elimination)
- Use model-based selection: `SelectFromModel`
- Examine feature importances from tree-based models
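Two of the selectors above in a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# SelectKBest: keep the k features with the highest ANOVA F-scores.
X_kbest = SelectKBest(f_classif, k=5).fit_transform(X, y)

# SelectFromModel: keep features whose importance clears the threshold
# (by default, the mean feature importance of the fitted forest).
selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
X_model = selector.transform(X)
```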
Model Persistence
- Use `joblib` for saving and loading models
- Save entire pipelines, not just models
- Version control model artifacts
- Document model metadata
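A persistence sketch; the file path is hypothetical, and the whole pipeline (scaler included) is what gets saved:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# Save the whole pipeline so preprocessing travels with the model.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), 'model.joblib')  # hypothetical location
joblib.dump(pipeline, path)
loaded = joblib.load(path)
```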
Performance Optimization
- Use `n_jobs=-1` for parallel processing where available
- Consider `warm_start=True` for iterative training
- Use sparse matrices for high-dimensional sparse data
- Consider incremental learning with `partial_fit()` for large data
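A sketch of incremental learning with `partial_fit()`; the streamed batches are synthetic, and note that all class labels must be declared on the first call:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs every class up front

# Simulate a data stream that never fits in memory at once.
for _ in range(10):
    X_batch = rng.normal(size=(100, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # hypothetical labeling rule
    clf.partial_fit(X_batch, y_batch, classes=classes)
```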
Key Conventions
- Import from submodules: `from sklearn.ensemble import RandomForestClassifier`
- Set `random_state` for reproducibility
- Use pipelines to prevent data leakage
- Document model choices and hyperparameters