# scikit-learn - Machine Learning in Python

A robust library for classical machine learning. It features a uniform API: all objects share the same interface for fitting, transforming, and predicting.
## When to Use
- Classification: Detecting categories (Spam vs. Ham, Disease diagnosis).
- Regression: Predicting continuous values (House prices, Stock trends).
- Clustering: Grouping similar objects (Market segmentation, Image compression).
- Dimensionality Reduction: Reducing feature count while keeping info (PCA, Visualization).
- Model Selection: Comparing models and tuning hyperparameters (Cross-validation, Grid search).
- Preprocessing: Transforming raw data into features (Scaling, Encoding, Imputation).
## Reference Documentation
- Official docs: https://scikit-learn.org/stable/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- Search patterns: `sklearn.pipeline.Pipeline`, `sklearn.model_selection`, `sklearn.ensemble`, `sklearn.preprocessing`

## Core Principles
### The "Estimator" Interface
- Estimators: Implement `fit(X, y)`. They learn from data.
- Transformers: Implement `transform(X)` (and `fit_transform(X)`). They modify data.
- Predictors: Implement `predict(X)`. They provide estimates for new data.
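A minimal sketch of this shared interface — a transformer and a predictor expose the same `fit`-based API, which is why they compose into pipelines. The toy dataset from `make_classification` is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

scaler = StandardScaler().fit(X)             # transformer: fit() learns mean/std
X_scaled = scaler.transform(X)               # transform() applies them

clf = LogisticRegression().fit(X_scaled, y)  # predictor: fit() learns coefficients
preds = clf.predict(X_scaled)                # predict() estimates labels

print(preds.shape)
```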
### Use scikit-learn For
- Tabular data (Excel-like, CSVs).
- Traditional ML (Random Forests, SVMs, Linear Models).
- Feature engineering and pipeline automation.
- Small to medium-sized datasets.
### Do NOT Use For
- Deep Learning / Neural Networks (use PyTorch or TensorFlow).
- Natural Language Processing at scale (use spaCy or HuggingFace).
- Large-scale "Big Data" (use Spark MLlib or Dask-ML).
- Real-time streaming predictions (consider specialized inference engines).
## Quick Reference
### Installation
```bash
pip install scikit-learn
```

### Standard Imports
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, mean_squared_error
```

### Basic Pattern - Train/Predict
```python
from sklearn.ensemble import RandomForestClassifier

# 1. Prepare data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 2. Instantiate and fit
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# 3. Predict and evaluate
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```

## Critical Rules
### ✅ DO
- Split before anything - Always use `train_test_split` before looking at data properties.
- Use Pipelines - Combine preprocessing and modeling to prevent data leakage.
- Scale your data - Models like SVM, KNN, and Linear Regression require feature scaling.
- Check for Imbalance - Use `stratify=y` in `train_test_split` for classification.
- Cross-Validate - Don't trust a single train/test split; use `cross_val_score`.
- Handle Missing Values - Use `SimpleImputer` or similar before fitting models.
- Standardize Categories - Use `OneHotEncoder` for nominal or `OrdinalEncoder` for ordinal data.
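The DO rules above can be sketched together on synthetic data; `make_classification` and the 80/20 class weighting here are illustrative assumptions, not part of the library's API guidance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, weights=[0.8], random_state=42)

# stratify=y keeps the 80/20 class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),   # fitted only on each CV training fold
    ('model', LogisticRegression())
])

scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")
```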
### ❌ DON'T
- Fit on test data - Never call `.fit()` or `.fit_transform()` on the test set.
- Use Categorical data as-is - Scikit-learn requires numerical input; encode strings first.
- Ignore Class Imbalance - Accuracy is misleading for imbalanced datasets; use F1-score or AUC.
- Overfit - Don't keep tuning hyperparameters until the test score is perfect.
- Ignore Random State - Set `random_state` for reproducibility during experiments.
## Anti-Patterns (NEVER)
```python
# ❌ BAD: Data leakage (fitting the scaler on the whole dataset)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data from "future" test set leaks into training!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ GOOD: Fit the scaler only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training mean/std

# ❌ BAD: Repeating preprocessing manually (error-prone and hard to maintain)

# ✅ GOOD: Use Pipelines (automates everything safely)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
pipe.fit(X_train, y_train)
```

## Preprocessing (sklearn.preprocessing)
### Scaling and Encoding
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Scaling numerical data
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_numeric)

# Encoding categorical data
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_categorical)

# Handling missing values
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_nan)
```

### Column Transformer (The Pro Way)
```python
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'salary']
categorical_features = ['city', 'job_type']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Now use this in a pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression())
])
```
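As a hedged, runnable sketch of this pattern end to end — the tiny DataFrame, its column values, and the 0/1 target are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'age':      [25, 40, 31, 58, 22, 45],
    'salary':   [40_000, 85_000, 52_000, 99_000, 38_000, 76_000],
    'city':     ['NY', 'SF', 'NY', 'LA', 'SF', 'LA'],
    'job_type': ['eng', 'mgr', 'eng', 'mgr', 'eng', 'mgr'],
})
y = [0, 1, 0, 1, 0, 1]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'job_type'])
])

pipeline = Pipeline([('prep', preprocessor), ('clf', LogisticRegression())])
pipeline.fit(df, y)       # preprocessing and model are fitted together
preds = pipeline.predict(df)
print(preds)
```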
## Classification
### Common Algorithms
```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Logistic Regression (baseline)
log_reg = LogisticRegression(max_iter=1000)

# Support Vector Machine
svm = SVC(kernel='rbf', probability=True)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
```
## Regression
### Common Algorithms
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor

# Regularized linear models
ridge = Ridge(alpha=1.0)  # L2
lasso = Lasso(alpha=0.1)  # L1

# Non-linear regression
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10)
```

## Model Evaluation
### Metrics
```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, r2_score, mean_absolute_error

# Classification
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# Regression
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

### Cross-Validation
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_macro')
print(f"Mean F1: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

## Hyperparameter Tuning
### Grid Search and Randomized Search
```python
from sklearn.model_selection import GridSearchCV

# Keys use the <step>__<param> convention to reach inside pipeline steps
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
best_model = grid_search.best_estimator_
```

## Dimensionality Reduction
### PCA (Principal Component Analysis)
```python
from sklearn.decomposition import PCA

# Reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
```
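If the goal is a variance budget rather than a fixed component count, `PCA` also accepts a float in (0, 1) as `n_components` and keeps the smallest number of components explaining that fraction of variance. A sketch on synthetic data (the random matrix here is an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "components keep",
      round(pca.explained_variance_ratio_.sum(), 3), "of the variance")
```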
## Clustering
### K-Means and DBSCAN
```python
from sklearn.cluster import KMeans, DBSCAN

# K-Means (requires specifying K)
kmeans = KMeans(n_clusters=3, n_init='auto')
clusters = kmeans.fit_predict(X)

# DBSCAN (density-based, finds the number of clusters automatically)
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)
```
## Practical Workflows
### 1. End-to-End Classification Pipeline
```python
def build_and_train_model(X, y):
    # 1. Identify types
    num_cols = X.select_dtypes(include=['int64', 'float64']).columns
    cat_cols = X.select_dtypes(include=['object', 'category']).columns

    # 2. Set up preprocessing
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ])

    # 3. Create pipeline
    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    # 4. Train
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    clf.fit(X_train, y_train)
    return clf, X_test, y_test

model, X_test, y_test = build_and_train_model(df.drop('target', axis=1), df['target'])
```

### 2. Custom Feature Engineering (Transformer)
```python
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        for col in self.columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy
```
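A hedged usage sketch of this custom transformer inside a Pipeline; the DataFrame, column names, and target values are invented for illustration, and the class is repeated so the snippet runs standalone:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        for col in self.columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy

df = pd.DataFrame({'income': [30_000.0, 60_000.0, 120_000.0, 240_000.0],
                   'age':    [22, 35, 48, 61]})
y = [1.0, 2.0, 3.0, 4.0]

pipe = Pipeline([
    ('log', LogTransformer(columns=['income'])),  # log-compress the skewed column
    ('model', LinearRegression())
])
pipe.fit(df, y)
preds = pipe.predict(df)
print(preds.round(2))
```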
## Performance Optimization
### Using n_jobs
```python
# Use all CPU cores for training/tuning
model = RandomForestClassifier(n_jobs=-1)
grid = GridSearchCV(model, param_grid, n_jobs=-1)
```

### Working with Large Data (partial_fit)
```python
from sklearn.linear_model import SGDClassifier

# Online learning (incremental fit)
model = SGDClassifier()
for X_chunk, y_chunk in data_stream:
    model.partial_fit(X_chunk, y_chunk, classes=np.unique(y_all))
```
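The `data_stream` above is a placeholder. A runnable sketch with synthetic chunks standing in for the stream (dataset and chunk count are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
classes = np.unique(y)  # must be passed so the model knows all labels up front

model = SGDClassifier(random_state=0)
# Feed the data in 10 chunks, as if it arrived incrementally
for X_chunk, y_chunk in zip(np.array_split(X, 10), np.array_split(y, 10)):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print(model.score(X, y))
```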
## Common Pitfalls and Solutions
### Imbalanced Classes
```python
# ❌ Problem: Model predicts only the majority class
# ✅ Solution: Adjust class weights
model = RandomForestClassifier(class_weight='balanced')

# Or use SMOTE from the imbalanced-learn library
```
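A sketch of why this matters, on a synthetic 95/5 dataset (dataset and model choice are illustrative): plain accuracy often looks high while minority-class F1 suffers, and `class_weight='balanced'` pushes the model to predict the minority class more often:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_train, y_train)

# Compare accuracy vs minority-class F1 for both models
for name, m in [('plain', plain), ('balanced', balanced)]:
    pred = m.predict(X_test)
    print(name, round(accuracy_score(y_test, pred), 2),
          round(f1_score(y_test, pred), 2))
```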
### Convergence Warnings
```python
# ❌ Problem: "ConvergenceWarning: Liblinear failed to converge"
# ✅ Solution: Increase max_iter or scale the data
model = LogisticRegression(max_iter=2000)

# Often solved by applying StandardScaler first!
```

### Categorical Values in Test Set not in Train
```python
# ❌ Problem: ValueError when unseen categories appear in the test set
# ✅ Solution: Use handle_unknown in OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
```

Scikit-learn is the backbone of Python ML. Its API is so successful that many other libraries (XGBoost, LightGBM) mimic it.