scikit-learn

scikit-learn - Machine Learning in Python

A robust library for classical machine learning. It features a uniform API: all objects share the same interface for fitting, transforming, and predicting.

When to Use

  • Classification: Detecting categories (Spam vs. Ham, Disease diagnosis).
  • Regression: Predicting continuous values (House prices, Stock trends).
  • Clustering: Grouping similar objects (Market segmentation, Image compression).
  • Dimensionality Reduction: Reducing feature count while keeping info (PCA, Visualization).
  • Model Selection: Comparing models and tuning hyperparameters (Cross-validation, Grid search).
  • Preprocessing: Transforming raw data into features (Scaling, Encoding, Imputation).

Reference Documentation

Official docs: https://scikit-learn.org/stable/
User Guide: https://scikit-learn.org/stable/user_guide.html
Search patterns: sklearn.pipeline.Pipeline, sklearn.model_selection, sklearn.ensemble, sklearn.preprocessing

Core Principles

The "Estimator" Interface

  • Estimators: Implement fit(X, y). They learn from data.
  • Transformers: Implement transform(X) (and fit_transform(X)). They modify data.
  • Predictors: Implement predict(X). They provide estimates for new data.
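The uniform interface means a scaler and a classifier are driven the same way. A minimal sketch (the toy arrays below are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: 4 samples, 2 features, binary labels (illustrative only)
X = np.array([[1.0, 200.0], [2.0, 300.0], [8.0, 900.0], [9.0, 1000.0]])
y = np.array([0, 0, 1, 1])

# Transformer: fit learns column means/stds, transform applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Predictor: fit learns coefficients, predict estimates labels
clf = LogisticRegression()
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)
```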

Use scikit-learn For

  • Tabular data (Excel-like, CSVs).
  • Traditional ML (Random Forests, SVMs, Linear Models).
  • Feature engineering and pipeline automation.
  • Small to medium-sized datasets.

Do NOT Use For

  • Deep Learning / Neural Networks (use PyTorch or TensorFlow).
  • Natural Language Processing at scale (use spaCy or HuggingFace).
  • Large-scale "Big Data" (use Spark MLlib or Dask-ML).
  • Real-time streaming predictions (consider specialized inference engines).

Quick Reference

Installation

bash
pip install scikit-learn

Standard Imports

python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, mean_squared_error

Basic Pattern - Train/Predict

python
from sklearn.ensemble import RandomForestClassifier

1. Prepare data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

2. Instantiate and fit

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

3. Predict and evaluate

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
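Put together, the pattern runs end to end; a sketch using the built-in iris dataset so it is reproducible:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

# 1. Prepare data (fixed random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Instantiate and fit
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 3. Predict and evaluate
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```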

Critical Rules

✅ DO

  • Split before anything - Always use train_test_split before looking at data properties.
  • Use Pipelines - Combine preprocessing and modeling to prevent data leakage.
  • Scale your data - Models like SVM, KNN, and regularized linear models are sensitive to feature scale.
  • Check for Imbalance - Use stratify=y in train_test_split for classification.
  • Cross-Validate - Don't trust a single train/test split; use cross_val_score.
  • Handle Missing Values - Use SimpleImputer or similar before fitting models.
  • Standardize Categories - Use OneHotEncoder for nominal or OrdinalEncoder for ordinal data.
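The stratify=y advice is easy to verify: with an imbalanced label vector, a stratified split preserves the class ratio in both halves. A small sketch (labels invented here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 of class 0, 10 of class 1 (illustrative)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the 90/10 ratio: 8 positives in train, 2 in test
print((y_tr == 1).sum(), (y_te == 1).sum())
```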

❌ DON'T

  • Fit on test data - Never call .fit() or .fit_transform() on the test set.
  • Use Categorical data as-is - Scikit-learn requires numerical input; encode strings first.
  • Ignore Class Imbalance - Accuracy is misleading for imbalanced datasets; use F1-score or AUC.
  • Overfit - Don't keep tuning hyperparameters until the test score is perfect.
  • Ignore Random State - Set random_state for reproducibility during experiments.

Anti-Patterns (NEVER)

❌ BAD: Data Leakage (Fitting scaler on the whole dataset)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data from "future" test set leaks into training!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

✅ GOOD: Fit scaler only on training data

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training mean/std

❌ BAD: Repeating preprocessing manually

(Error-prone and hard to maintain)

✅ GOOD: Use Pipelines (Automates everything safely)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
pipe.fit(X_train, y_train)

Preprocessing (sklearn.preprocessing)

Scaling and Encoding

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

Scaling numerical data

scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_numeric)

Encoding categorical data

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_categorical)

Handling missing values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_nan)
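A runnable version of the imputation step, with a small invented array so the medians are easy to check by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with missing values (illustrative only)
X_with_nan = np.array([[1.0, np.nan],
                       [3.0, 4.0],
                       [np.nan, 6.0]])

imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_nan)
# NaNs replaced by each column's median: col 0 -> 2.0, col 1 -> 5.0
print(X_imputed)
```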

Column Transformer (The Pro Way)

python
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'salary']
categorical_features = ['city', 'job_type']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

Now use this in a pipeline

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression())
])
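A minimal end-to-end run of this preprocessor inside a pipeline; the DataFrame below is invented to match the column names above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical data matching the column names used above
df = pd.DataFrame({
    'age':      [25, 32, 47, 51, 38, 29],
    'salary':   [40_000, 52_000, 90_000, 110_000, 75_000, 48_000],
    'city':     ['NY', 'SF', 'NY', 'LA', 'SF', 'LA'],
    'job_type': ['eng', 'eng', 'mgr', 'mgr', 'eng', 'mgr'],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'job_type']),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression()),
])
pipeline.fit(df, y)
print(pipeline.predict(df.head(2)))
```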

Classification

Common Algorithms

python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

Logistic Regression (Baseline)

log_reg = LogisticRegression(max_iter=1000)

Support Vector Machine

svm = SVC(kernel='rbf', probability=True)

Gradient Boosting

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

Regression

Common Algorithms

python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor

Regularized Linear Models

ridge = Ridge(alpha=1.0)  # L2
lasso = Lasso(alpha=0.1)  # L1

Non-linear Regression

rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10)

Model Evaluation

Metrics

python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, r2_score, mean_absolute_error

Classification

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
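On fixed toy labels the metrics can be checked by hand; a small sketch:

```python
from sklearn.metrics import accuracy_score, f1_score

# Fixed toy labels so the numbers are easy to verify by hand
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)   # 3 of 4 correct -> 0.75
f1 = f1_score(y_true, y_pred)          # precision 2/3, recall 1 -> 0.8
print(acc, f1)
```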

Regression

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

Cross-Validation

python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_macro')
print(f"Mean F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

Hyperparameter Tuning

Grid Search and Randomized Search

python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
best_model = grid_search.best_estimator_
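The heading also covers randomized search: RandomizedSearchCV samples a fixed number of candidates (n_iter) instead of trying every combination, which scales better for large grids. A sketch on the iris dataset (the parameter ranges are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# n_iter candidates sampled at random, instead of all 36 combinations
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions, n_iter=5, cv=3, random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```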

Dimensionality Reduction

PCA (Principal Component Analysis)

python
from sklearn.decomposition import PCA

Reduce to 2 components for visualization

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
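n_components can also be given as a fraction between 0 and 1, in which case PCA keeps however many components are needed to explain that share of variance. A sketch on the iris features:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components for 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape[1], pca.explained_variance_ratio_.sum())
```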

Clustering

K-Means and DBSCAN

python
from sklearn.cluster import KMeans, DBSCAN

K-Means (Requires specifying K)

kmeans = KMeans(n_clusters=3, n_init='auto')
clusters = kmeans.fit_predict(X)

DBSCAN (Density-based, finds K automatically)

dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)

Practical Workflows

1. End-to-End Classification Pipeline

python
def build_and_train_model(X, y):
    # 1. Identify types
    num_cols = X.select_dtypes(include=['int64', 'float64']).columns
    cat_cols = X.select_dtypes(include=['object', 'category']).columns

    # 2. Setup Preprocessing
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ])

    # 3. Create Pipeline
    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    # 4. Train
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    clf.fit(X_train, y_train)
    
    return clf, X_test, y_test

model, X_test, y_test = build_and_train_model(df.drop('target', axis=1), df['target'])


2. Custom Feature Engineering (Transformer)

python
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        for col in self.columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy
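The custom transformer drops into a Pipeline like any built-in step. A usage sketch (the class is repeated so the snippet stands alone; the income column is invented):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

class LogTransformer(BaseEstimator, TransformerMixin):
    """log1p selected columns (same class as above, repeated for self-containment)."""
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        for col in self.columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy

# Hypothetical right-skewed feature
df = pd.DataFrame({'income': [20_000, 35_000, 50_000, 1_000_000]})
y = [1.0, 2.0, 3.0, 4.0]

pipe = Pipeline([
    ('log', LogTransformer(columns=['income'])),
    ('model', LinearRegression()),
])
pipe.fit(df, y)
print(pipe.predict(df))
```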

Performance Optimization

Using n_jobs


Use all CPU cores for training/tuning

model = RandomForestClassifier(n_jobs=-1)
grid = GridSearchCV(model, param_grid, n_jobs=-1)

Working with Large Data (partial_fit)

python
from sklearn.linear_model import SGDClassifier

Online learning (incremental fit)

model = SGDClassifier()
for X_chunk, y_chunk in data_stream:
    model.partial_fit(X_chunk, y_chunk, classes=np.unique(y_all))
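A runnable version of the streaming loop, with the stream simulated by chunking a synthetic dataset (data_stream and y_all in the snippet above are stand-ins):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X_all, y_all = make_classification(n_samples=1000, random_state=0)
classes = np.unique(y_all)  # must be supplied on the first partial_fit call

model = SGDClassifier(random_state=0)
# Simulated stream: feed the data in chunks of 100 rows
for X_chunk, y_chunk in zip(np.array_split(X_all, 10), np.array_split(y_all, 10)):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print(model.score(X_all, y_all))
```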

Common Pitfalls and Solutions

Imbalanced Classes


❌ Problem: Model predicts only the majority class

✅ Solution: Adjust class weights

model = RandomForestClassifier(class_weight='balanced')

OR use SMOTE from imbalanced-learn library


Convergence Warnings


❌ Problem: "ConvergenceWarning: Liblinear failed to converge"

✅ Solution: Increase max_iter or scale data

model = LogisticRegression(max_iter=2000)

Often solved by applying StandardScaler first!


Categorical Values in Test Set not in Train


❌ Problem: ValueError when unseen categories appear in test

✅ Solution: Use handle_unknown in OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')

Scikit-learn is the backbone of Python ML. Its API is so successful that many other libraries (XGBoost, LightGBM) mimic it.