scikit-learn


Use when "scikit-learn", "sklearn", "machine learning", "classification", "regression", "clustering", or asking about "train test split", "cross validation", "hyperparameter tuning", "ML pipeline", "random forest", "SVM", "preprocessing"


NPX Install

```shell
npx skill4agent add eyadsibai/ltk scikit-learn
```

Scikit-learn Machine Learning

Industry-standard Python library for classical machine learning.

When to Use

  • Classification or regression tasks
  • Clustering or dimensionality reduction
  • Preprocessing and feature engineering
  • Model evaluation and cross-validation
  • Hyperparameter tuning
  • Building ML pipelines

Algorithm Selection

Classification

| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic |
| Random Forest | General purpose | Handles non-linearity, feature importance |
| Gradient Boosting | Best accuracy | State-of-the-art for tabular data |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training, instance-based |
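A minimal sketch of the trade-off between a fast baseline and a stronger ensemble, using a synthetic dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fast, interpretable baseline vs. a general-purpose ensemble
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

print("logistic:", baseline.score(X_test, y_test))
print("forest:  ", forest.score(X_test, y_test))
```

Starting with logistic regression gives a reference point; only keep the heavier model if it beats the baseline on held-out data.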

Regression

| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost, LightGBM wrappers |
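The L2-vs-L1 distinction is easiest to see on data with genuinely irrelevant features; this sketch uses made-up coefficients to show that Lasso can zero some of them out while Ridge only shrinks:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Tiny synthetic regression problem: two of five features are irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: can drive coefficients to exactly 0

print("ridge coefs:", ridge.coef_.round(2))
print("lasso coefs:", lasso.coef_.round(2))
```

This built-in sparsity is why Lasso doubles as a feature selector.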

Clustering

| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (density) |
| Agglomerative | Hierarchical | n_clusters or distance threshold |
| Gaussian Mixture | Soft clustering | n_components |
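A short sketch of the KMeans case, on synthetic blobs where the required n_clusters is known in advance:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated spherical blobs (synthetic, for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Unlike DBSCAN, KMeans needs n_clusters up front
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```

When the true number of clusters is unknown, the elbow method or silhouette score is commonly used to pick n_clusters.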

Dimensionality Reduction

| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local/global | Visualization + downstream (separate umap-learn package) |
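A minimal PCA sketch on the built-in iris dataset, projecting four features to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # shape (150, 4)

# Project onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Checking explained_variance_ratio_ tells you how much information the reduction discards.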

Pipeline Concepts

Key concept: Pipelines prevent data leakage by ensuring transformations are fit only on training data.

| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |

Common preprocessing flow:
  1. Impute missing values (SimpleImputer)
  2. Scale numeric features (StandardScaler, MinMaxScaler)
  3. Encode categoricals (OneHotEncoder, OrdinalEncoder)
  4. Optional: feature selection or polynomial features
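The flow above can be sketched as a Pipeline wrapping a ColumnTransformer; the tiny DataFrame and its column names here are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, with gaps
df = pd.DataFrame({
    "age": [25, 32, np.nan, 47, 51, 38],
    "city": ["a", "b", "a", np.nan, "b", "a"],
})
y = [0, 1, 0, 1, 1, 0]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

# Fitting the whole pipeline fits every transform on training data only,
# which is what prevents leakage
clf = Pipeline([("pre", pre), ("model", LogisticRegression())]).fit(df, y)
print(clf.predict(df))
```

Because the imputer, scaler, and encoder live inside the pipeline, cross-validation and grid search refit them per fold automatically.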

Model Evaluation

Cross-Validation Strategies

| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
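A sketch of stratified cross-validation on a synthetic imbalanced problem (the 80/20 class weights are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: roughly 80% class 0, 20% class 1
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# StratifiedKFold keeps the class ratio the same inside every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print("per-fold accuracy:", scores.round(3))
```

With plain KFold a fold could, by chance, contain almost no minority samples; stratification rules that out.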

Metrics

| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| | F1-score | Imbalanced classes |
| | ROC-AUC | Ranking, threshold tuning |
| | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| | MAE | Robust to outliers |
| | Explained variance | Interpretable fraction of variance |
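A tiny hand-made example of why accuracy misleads on imbalanced labels (the toy labels and scores below are invented for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1]              # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 1]              # misses one of two positives
y_prob = [0.1, 0.2, 0.3, 0.2, 0.4, 0.9]  # predicted probabilities

# Accuracy still looks decent; F1 exposes the missed positive
print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:      ", f1_score(y_true, y_pred))
print("roc-auc: ", roc_auc_score(y_true, y_prob))
```

ROC-AUC uses the probabilities rather than the hard predictions, which is why it suits threshold tuning.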

Hyperparameter Tuning

| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow for many params |
| RandomizedSearchCV | Faster | May miss optimal |
| HalvingGridSearchCV | Efficient | Requires sklearn 0.24+ (experimental import) |

Key concept: Always tune on validation set, evaluate final model on held-out test set.
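A sketch of that split: the grid search cross-validates on the training portion only, and the held-out test set is scored once at the end (the small parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tuning happens on CV folds carved out of the training data only
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X_train, y_train)

# The test set is touched exactly once, for the final estimate
print("best params:", grid.best_params_)
print("test score: ", grid.score(X_test, y_test))
```

Reusing the test set to pick hyperparameters would turn it into a second training signal and inflate the reported score.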

Best Practices

| Practice | Why |
|---|---|
| Split data first | Prevent leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based | KNN, SVM, PCA need scaled features |
| Stratify imbalanced | Preserve class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over/underfitting |

Common Pitfalls

| Pitfall | Solution |
|---|---|
| Fitting scaler on all data | Use pipeline or fit only on train |
| Using accuracy for imbalanced | Use F1, ROC-AUC, or balanced accuracy |
| Too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use `feature_importances_` or permutation importance |
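A sketch of both importance flavors from the last row, on synthetic data with a few deliberately informative features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 of 6 features carry signal
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances come for free on tree ensembles
print("built-in:   ", model.feature_importances_.round(3))

# Permutation importance is model-agnostic and measured on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print("permutation:", result.importances_mean.round(3))
```

Permutation importance is slower but less biased toward high-cardinality features, and it works for any fitted estimator, not just trees.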
