scikit-learn

scikit-learn - Machine Learning in Python

A robust library for classical machine learning. It features a uniform API: all objects share the same interface for fitting, transforming, and predicting.

When to Use

  • Classification: Detecting categories (Spam vs. Ham, Disease diagnosis).
  • Regression: Predicting continuous values (House prices, Stock trends).
  • Clustering: Grouping similar objects (Market segmentation, Image compression).
  • Dimensionality Reduction: Reducing feature count while keeping info (PCA, Visualization).
  • Model Selection: Comparing models and tuning hyperparameters (Cross-validation, Grid search).
  • Preprocessing: Transforming raw data into features (Scaling, Encoding, Imputation).

Reference Documentation

Official docs: https://scikit-learn.org/stable/
User Guide: https://scikit-learn.org/stable/user_guide.html
Search patterns: sklearn.pipeline.Pipeline, sklearn.model_selection, sklearn.ensemble, sklearn.preprocessing

Core Principles

The "Estimator" Interface

  • Estimators: Implement fit(X, y). They learn from data.
  • Transformers: Implement transform(X) (and fit_transform(X)). They modify data.
  • Predictors: Implement predict(X). They provide estimates for new data.
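The uniform interface means a scaler and a classifier are driven the same way. A minimal sketch (the toy arrays below are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: 4 samples, 2 features, binary labels (illustrative only)
X = np.array([[1.0, 200.0], [2.0, 300.0], [8.0, 900.0], [9.0, 1000.0]])
y = np.array([0, 0, 1, 1])

# Transformer: fit learns column means/stds, transform applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Predictor: fit learns coefficients, predict estimates labels
clf = LogisticRegression()
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)
```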

Use scikit-learn For

  • Tabular data (Excel-like, CSVs).
  • Traditional ML (Random Forests, SVMs, Linear Models).
  • Feature engineering and pipeline automation.
  • Small to medium-sized datasets.

Do NOT Use For

  • Deep Learning / Neural Networks (use PyTorch or TensorFlow).
  • Natural Language Processing at scale (use spaCy or HuggingFace).
  • Large-scale "Big Data" (use Spark MLlib or Dask-ML).
  • Real-time streaming predictions (consider specialized inference engines).

Quick Reference

Installation

bash
pip install scikit-learn

Standard Imports

python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, mean_squared_error

Basic Pattern - Train/Predict

python
from sklearn.ensemble import RandomForestClassifier

1. Prepare data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

2. Instantiate and fit

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

3. Predict and evaluate

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
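Put together, the pattern runs end to end; a sketch using the built-in iris dataset so it is reproducible:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

# 1. Prepare data (fixed random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Instantiate and fit
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 3. Predict and evaluate
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```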

Critical Rules

✅ DO

  • Split before anything - Always use train_test_split before looking at data properties.
  • Use Pipelines - Combine preprocessing and modeling to prevent data leakage.
  • Scale your data - Models like SVM, KNN, and regularized linear models are sensitive to feature scale.
  • Check for Imbalance - Use stratify=y in train_test_split for classification.
  • Cross-Validate - Don't trust a single train/test split; use cross_val_score.
  • Handle Missing Values - Use SimpleImputer or similar before fitting models.
  • Standardize Categories - Use OneHotEncoder for nominal or OrdinalEncoder for ordinal data.
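The stratify=y advice is easy to verify: with an imbalanced label vector, a stratified split preserves the class ratio in both halves. A small sketch (labels invented here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 of class 0, 10 of class 1 (illustrative)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the 90/10 ratio: 8 positives in train, 2 in test
print((y_tr == 1).sum(), (y_te == 1).sum())
```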

❌ DON'T

  • Fit on test data - Never call .fit() or .fit_transform() on the test set.
  • Use Categorical data as-is - Scikit-learn requires numerical input; encode strings first.
  • Ignore Class Imbalance - Accuracy is misleading for imbalanced datasets; use F1-score or AUC.
  • Overfit - Don't keep tuning hyperparameters until the test score is perfect.
  • Ignore Random State - Set random_state for reproducibility during experiments.

Anti-Patterns (NEVER)

❌ BAD: Data Leakage (Fitting scaler on the whole dataset)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data from "future" test set leaks into training!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

✅ GOOD: Fit scaler only on training data

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training mean/std

❌ BAD: Repeating preprocessing manually

(Error-prone and hard to maintain)

✅ GOOD: Use Pipelines (Automates everything safely)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
pipe.fit(X_train, y_train)

Preprocessing (sklearn.preprocessing)

Scaling and Encoding

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

Scaling numerical data

scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_numeric)

Encoding categorical data

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_categorical)

Handling missing values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_nan)
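A runnable version of the imputation step, with a small invented array so the medians are easy to check by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with missing values (illustrative only)
X_with_nan = np.array([[1.0, np.nan],
                       [3.0, 4.0],
                       [np.nan, 6.0]])

imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_nan)
# NaNs replaced by each column's median: col 0 -> 2.0, col 1 -> 5.0
print(X_imputed)
```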

Column Transformer (The Pro Way)

python
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'salary']
categorical_features = ['city', 'job_type']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

Now use this in a pipeline

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression())
])
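A minimal end-to-end run of this preprocessor inside a pipeline; the DataFrame below is invented to match the column names above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical data matching the column names used above
df = pd.DataFrame({
    'age':      [25, 32, 47, 51, 38, 29],
    'salary':   [40_000, 52_000, 90_000, 110_000, 75_000, 48_000],
    'city':     ['NY', 'SF', 'NY', 'LA', 'SF', 'LA'],
    'job_type': ['eng', 'eng', 'mgr', 'mgr', 'eng', 'mgr'],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'job_type']),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression()),
])
pipeline.fit(df, y)
print(pipeline.predict(df.head(2)))
```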

Classification

Common Algorithms

python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

Logistic Regression (Baseline)

log_reg = LogisticRegression(max_iter=1000)

Support Vector Machine

svm = SVC(kernel='rbf', probability=True)

Gradient Boosting

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

Regression

Common Algorithms

python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor

Regularized Linear Models

ridge = Ridge(alpha=1.0)  # L2
lasso = Lasso(alpha=0.1)  # L1

Non-linear Regression

rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10)

Model Evaluation

Metrics

python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, r2_score, mean_absolute_error

Classification

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
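On fixed toy labels the metrics can be checked by hand; a small sketch:

```python
from sklearn.metrics import accuracy_score, f1_score

# Fixed toy labels so the numbers are easy to verify by hand
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)   # 3 of 4 correct -> 0.75
f1 = f1_score(y_true, y_pred)          # precision 2/3, recall 1 -> 0.8
print(acc, f1)
```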

Regression

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

Cross-Validation

python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_macro')
print(f"Mean F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

Hyperparameter Tuning

Grid Search and Randomized Search

python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
best_model = grid_search.best_estimator_
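The heading also covers randomized search: RandomizedSearchCV samples a fixed number of candidates (n_iter) instead of trying every combination, which scales better for large grids. A sketch on the iris dataset (the parameter ranges are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# n_iter candidates sampled at random, instead of all 36 combinations
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions, n_iter=5, cv=3, random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```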

Dimensionality Reduction

PCA (Principal Component Analysis)

python
from sklearn.decomposition import PCA

Reduce to 2 components for visualization

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
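n_components can also be given as a fraction between 0 and 1, in which case PCA keeps however many components are needed to explain that share of variance. A sketch on the iris features:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components for 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape[1], pca.explained_variance_ratio_.sum())
```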

Clustering

K-Means and DBSCAN

python
from sklearn.cluster import KMeans, DBSCAN

K-Means (Requires specifying K)

kmeans = KMeans(n_clusters=3, n_init='auto')
clusters = kmeans.fit_predict(X)

DBSCAN (Density-based, finds K automatically)

dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)

Practical Workflows

1. End-to-End Classification Pipeline

python
def build_and_train_model(X, y):
    # 1. Identify types
    num_cols = X.select_dtypes(include=['int64', 'float64']).columns
    cat_cols = X.select_dtypes(include=['object', 'category']).columns

    # 2. Setup Preprocessing
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ])

    # 3. Create Pipeline
    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    # 4. Train
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    clf.fit(X_train, y_train)
    
    return clf, X_test, y_test

model, X_test, y_test = build_and_train_model(df.drop('target', axis=1), df['target'])


2. Custom Feature Engineering (Transformer)

python
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        for col in self.columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy
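The custom transformer drops into a Pipeline like any built-in step. A usage sketch (the class is repeated so the snippet stands alone; the income column is invented):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

class LogTransformer(BaseEstimator, TransformerMixin):
    """log1p selected columns (same class as above, repeated for self-containment)."""
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        for col in self.columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy

# Hypothetical right-skewed feature
df = pd.DataFrame({'income': [20_000, 35_000, 50_000, 1_000_000]})
y = [1.0, 2.0, 3.0, 4.0]

pipe = Pipeline([
    ('log', LogTransformer(columns=['income'])),
    ('model', LinearRegression()),
])
pipe.fit(df, y)
print(pipe.predict(df))
```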

Performance Optimization

Using n_jobs


Use all CPU cores for training/tuning

model = RandomForestClassifier(n_jobs=-1)
grid = GridSearchCV(model, param_grid, n_jobs=-1)

Working with Large Data (partial_fit)

python
from sklearn.linear_model import SGDClassifier

Online learning (incremental fit)

model = SGDClassifier()
for X_chunk, y_chunk in data_stream:
    model.partial_fit(X_chunk, y_chunk, classes=np.unique(y_all))
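A runnable version of the streaming loop, with the stream simulated by chunking a synthetic dataset (data_stream and y_all in the snippet above are stand-ins):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X_all, y_all = make_classification(n_samples=1000, random_state=0)
classes = np.unique(y_all)  # must be supplied on the first partial_fit call

model = SGDClassifier(random_state=0)
# Simulated stream: feed the data in chunks of 100 rows
for X_chunk, y_chunk in zip(np.array_split(X_all, 10), np.array_split(y_all, 10)):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print(model.score(X_all, y_all))
```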

Common Pitfalls and Solutions

Imbalanced Classes


❌ Problem: Model predicts only the majority class

✅ Solution: Adjust class weights

model = RandomForestClassifier(class_weight='balanced')

OR use SMOTE from imbalanced-learn library


Convergence Warnings


❌ Problem: "ConvergenceWarning: Liblinear failed to converge"

✅ Solution: Increase max_iter or scale data

model = LogisticRegression(max_iter=2000)

Often solved by applying StandardScaler first!


Categorical Values in Test Set not in Train


❌ Problem: ValueError when unseen categories appear in test

✅ Solution: Use handle_unknown in OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')

Scikit-learn is the backbone of Python ML. Its API is so successful that many other libraries (XGBoost, LightGBM) mimic it.