scikit-learn-best-practices

Scikit-learn Best Practices

Expert guidelines for scikit-learn development, focusing on machine learning workflows, model development, evaluation, and best practices.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in machine learning workflows
  • Use functional programming for data pipelines
  • Use object-oriented programming for custom estimators
  • Prefer vectorized operations over explicit loops
  • Follow PEP 8 style guidelines

Machine Learning Workflow

Data Preparation

  • Always split data before any preprocessing: train/validation/test
  • Use `train_test_split()` with `random_state` for reproducibility
  • Stratify splits for imbalanced classification: `stratify=y`
  • Keep test set completely separate until final evaluation
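
A minimal sketch of the split above; the dataset is synthetic and the 90/10 class weights are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (roughly 90/10 class split) for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split before any preprocessing; stratify to preserve the class ratio,
# and fix random_state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```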

Feature Engineering

  • Scale features appropriately for distance-based algorithms
  • Use `StandardScaler` for normally distributed features
  • Use `MinMaxScaler` for bounded features
  • Use `RobustScaler` for data with outliers
  • Encode categorical variables: `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`
  • Handle missing values: `SimpleImputer`, `KNNImputer`
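
A quick sketch of the three scalers plus imputation on a tiny hand-made matrix (the values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Tiny illustrative matrix: one missing value, one outlier row
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, np.nan], [100.0, 400.0]])

# Impute missing values first (median is robust to the outlier)
X_imp = SimpleImputer(strategy="median").fit_transform(X)

X_std = StandardScaler().fit_transform(X_imp)     # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X_imp)    # rescaled to [0, 1]
X_robust = RobustScaler().fit_transform(X_imp)    # median/IQR, outlier-resistant
```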

Pipelines

  • Always use `Pipeline` to chain preprocessing and modeling
  • Prevents data leakage by fitting transformers only on training data
  • Makes code cleaner and more reproducible
  • Enables easy deployment and serialization

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])
```

Column Transformers

  • Use `ColumnTransformer` for different preprocessing per feature type
  • Combine numeric and categorical preprocessing in a single pipeline
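
A minimal sketch of per-type preprocessing; the column names (`age`, `income`, `city`) and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; substitute your own numeric/categorical columns
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40_000.0, 55_000.0, None],
    "city": ["NY", "SF", "NY"],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocessor.fit_transform(df)  # 2 scaled numeric + 2 one-hot columns
```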

Model Selection and Tuning

Cross-Validation

  • Use cross-validation for reliable performance estimates
  • `cross_val_score()` for quick evaluation
  • `cross_validate()` for multiple metrics
  • Use appropriate CV strategy:
    • `KFold` for regression
    • `StratifiedKFold` for classification
    • `TimeSeriesSplit` for temporal data
    • `GroupKFold` for grouped data
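
The classification case above can be sketched like so (the iris dataset and random forest are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```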

Hyperparameter Tuning

  • Use `GridSearchCV` for exhaustive search
  • Use `RandomizedSearchCV` for large parameter spaces
  • Always tune on training/validation data, never test data
  • Set `n_jobs=-1` for parallel processing
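
A minimal `GridSearchCV` sketch; the grid is deliberately tiny for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative grid; real grids are usually larger
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    n_jobs=-1,  # use all available cores
)
search.fit(X, y)  # in practice, fit on training/validation data only
```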

Model Evaluation

Classification Metrics

  • Use appropriate metrics for your problem:
    • `accuracy_score` for balanced classes
    • `precision_score`, `recall_score`, `f1_score` for imbalanced classes
    • `roc_auc_score` for ranking ability
  • Use `classification_report()` for a comprehensive overview
  • Examine `confusion_matrix()` for error analysis
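
A small sketch with hand-made labels so the numbers are easy to verify:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hand-made labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

print(classification_report(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
f1 = f1_score(y_true, y_pred)          # 2 * P * R / (P + R)
```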

Regression Metrics

  • `mean_squared_error` (MSE) for general use
  • `mean_absolute_error` (MAE) for interpretability
  • `r2_score` for explained variance
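
The three metrics on a tiny hand-made example:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors quadratically
mae = mean_absolute_error(y_true, y_pred)  # same units as the target
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
```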

Evaluation Best Practices

  • Report confidence intervals, not just point estimates
  • Use multiple metrics to understand model behavior
  • Compare against meaningful baselines
  • Evaluate on held-out test set only once, at the end
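
One way to sketch the baseline-and-spread points, assuming a majority-class dummy as the baseline and CV standard deviation as a rough spread measure:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Majority-class baseline: any real model should beat this
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), X, y, cv=5
)
# Report spread alongside the mean, not just a point estimate
print(f"baseline {baseline.mean():.3f} | model {model.mean():.3f} +/- {model.std():.3f}")
```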

Handling Imbalanced Data

  • Use stratified splitting and cross-validation
  • Consider class weights:
    class_weight='balanced'
  • Use appropriate metrics (F1, AUC-PR, not accuracy)
  • Adjust decision threshold based on business needs
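
A sketch combining class weights and threshold adjustment; the dataset is synthetic and the 0.3 threshold is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic ~95/5 imbalance for illustration
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Reweight errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Move the decision threshold off the default 0.5 (0.3 is illustrative)
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)
print(f"F1 at threshold 0.3: {f1_score(y_test, y_pred):.3f}")
```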

Feature Selection

  • Use `SelectKBest` with statistical tests
  • Use `RFE` (Recursive Feature Elimination)
  • Use model-based selection: `SelectFromModel`
  • Examine feature importances from tree-based models
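
Univariate and model-based selection side by side (iris and the random forest are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Univariate selection: keep the 2 features with the highest ANOVA F-score
X_kbest = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Model-based selection driven by tree feature importances
selector = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
X_model = selector.transform(X)
print("importances:", selector.estimator_.feature_importances_)
```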

Model Persistence

  • Use `joblib` for saving and loading models
  • Save entire pipelines, not just models
  • Version control model artifacts
  • Document model metadata
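
Saving the whole pipeline, not just the fitted estimator; the temp path is illustrative:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Persist the whole pipeline so preprocessing travels with the model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipeline, path)
loaded = joblib.load(path)
```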

Performance Optimization

  • Use `n_jobs=-1` for parallel processing where available
  • Consider `warm_start=True` for iterative training
  • Use sparse matrices for high-dimensional sparse data
  • Consider incremental learning with `partial_fit()` for large data
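
A sketch of incremental learning with `SGDClassifier`, one of the estimators that supports `partial_fit()`; the chunk size of 200 is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=42)
classes = np.unique(y)  # partial_fit needs the full label set up front

# Feed the data in chunks, as if streaming from disk
clf = SGDClassifier(random_state=42)
for start in range(0, len(X), 200):
    clf.partial_fit(X[start:start + 200], y[start:start + 200], classes=classes)
```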

Key Conventions

  • Import from submodules: `from sklearn.ensemble import RandomForestClassifier`
  • Set `random_state` for reproducibility
  • Use pipelines to prevent data leakage
  • Document model choices and hyperparameters