Use when "scikit-learn", "sklearn", "machine learning", "classification", "regression", "clustering", or asking about "train test split", "cross validation", "hyperparameter tuning", "ML pipeline", "random forest", "SVM", "preprocessing"
npx skill4agent add eyadsibai/ltk scikit-learn

## Classification

| Algorithm | Best For | Strengths |
|---|---|---|
| Logistic Regression | Baseline, interpretable | Fast, probabilistic |
| Random Forest | General purpose | Handles non-linear, feature importance |
| Gradient Boosting | Best accuracy | State-of-art for tabular |
| SVM | High-dimensional data | Works well with few samples |
| KNN | Simple problems | No training, instance-based |
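A minimal sketch of trying two of the classifiers above on synthetic data (dataset and all parameter choices are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

results = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = model.score(X_test, y_test)

print(results)  # test accuracy per model
```

Starting with a fast, interpretable baseline (logistic regression) before reaching for an ensemble makes it obvious whether the extra complexity buys anything.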
## Regression

| Algorithm | Best For | Notes |
|---|---|---|
| Linear Regression | Baseline | Interpretable coefficients |
| Ridge/Lasso | Regularization needed | L2 vs L1 penalty |
| Random Forest | Non-linear relationships | Robust to outliers |
| Gradient Boosting | Best accuracy | XGBoost, LightGBM wrappers |
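The L2 vs L1 distinction can be seen directly in the coefficients: Ridge shrinks them, Lasso can zero them out. A small sketch on synthetic data (sizes and `alpha` are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of 10 features are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: drives some to exactly zero

print("ridge nonzero coefs:", int(np.sum(ridge.coef_ != 0)))
print("lasso nonzero coefs:", int(np.sum(lasso.coef_ != 0)))
```

Lasso's exact zeros make it a cheap built-in feature selector when many features are irrelevant.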
## Clustering

| Algorithm | Best For | Key Parameter |
|---|---|---|
| KMeans | Spherical clusters | n_clusters (must specify) |
| DBSCAN | Arbitrary shapes | eps (density) |
| Agglomerative | Hierarchical | n_clusters or distance threshold |
| Gaussian Mixture | Soft clustering | n_components |
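The key-parameter contrast above in code: KMeans needs `n_clusters` up front, while DBSCAN infers the count from density via `eps` (values here are tuned to this toy data, not defaults to copy):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)  # k chosen by us
db = DBSCAN(eps=0.8).fit(X)  # k inferred from density; label -1 marks noise

print("kmeans clusters:", len(set(km.labels_)))
print("dbscan clusters:", len(set(db.labels_) - {-1}))
```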
## Dimensionality Reduction

| Method | Preserves | Use Case |
|---|---|---|
| PCA | Global variance | Feature reduction |
| t-SNE | Local structure | 2D/3D visualization |
| UMAP | Both local/global | Visualization + downstream |
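A PCA sketch showing what "preserves global variance" means in practice — `explained_variance_ratio_` reports how much variance the kept components retain (iris is just a convenient built-in example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

print(X_2d.shape)                            # (150, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

Note that t-SNE is in `sklearn.manifold`, but UMAP lives in the separate `umap-learn` package.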
## Pipelines & Composition

| Component | Purpose |
|---|---|
| Pipeline | Sequential steps (transform → model) |
| ColumnTransformer | Apply different transforms to different columns |
| FeatureUnion | Combine multiple feature extraction methods |
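A minimal sketch combining `ColumnTransformer` and `Pipeline` (the tiny DataFrame and column names are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric column, one categorical column
df = pd.DataFrame({"age": [25, 32, 47, 51], "city": ["a", "b", "a", "b"]})
y = [0, 1, 0, 1]

# Different transforms routed to different columns
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Transform then model, as one estimator
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])
pipe.fit(df, y)
preds = pipe.predict(df)
print(preds)
```

Because the whole thing is one estimator, it can be cross-validated and grid-searched without leaking preprocessing statistics across folds.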
## Cross-Validation

| Strategy | Use Case |
|---|---|
| KFold | General purpose |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| LeaveOneOut | Very small datasets |
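A sketch of `StratifiedKFold` on an imbalanced synthetic problem — stratification keeps the class ratio consistent across folds (the 80/20 imbalance is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem: ~80% class 0, ~20% class 1
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores.mean())  # mean accuracy across the 5 stratified folds
```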
## Metrics

| Task | Metric | When to Use |
|---|---|---|
| Classification | Accuracy | Balanced classes |
| Classification | F1-score | Imbalanced classes |
| Classification | ROC-AUC | Ranking, threshold tuning |
| Classification | Precision/Recall | Domain-specific costs |
| Regression | RMSE | Penalize large errors |
| Regression | MAE | Robust to outliers |
| Regression | R² | Explained variance |
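Why accuracy misleads on imbalanced classes, in two lines — a majority-class predictor scores well on accuracy but F1 exposes it (labels are a contrived toy example):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1]   # imbalanced: one positive in five
y_pred = [0, 0, 0, 0, 0]   # "predict the majority class" baseline

print(accuracy_score(y_true, y_pred))  # 0.8 — looks respectable
print(f1_score(y_true, y_pred))        # 0.0 — never finds the positive class
```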
## Hyperparameter Tuning

| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Slow for many params |
| RandomizedSearchCV | Faster | May miss optimal |
| HalvingGridSearchCV | Efficient | Experimental; sklearn 0.24+, needs `from sklearn.experimental import enable_halving_search_cv` |
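A minimal `GridSearchCV` sketch (the parameter grid is deliberately tiny and illustrative; real grids are where RandomizedSearchCV starts to pay off):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Exhaustive search over a small grid: 4 combinations x 3 folds = 12 fits
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```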
## Best Practices

| Practice | Why |
|---|---|
| Split data first | Prevent leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based | KNN, SVM, PCA need scaled features |
| Stratify imbalanced | Preserve class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over/underfitting |
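Several of these practices in one sketch: split first, stratify, and let a pipeline fit the scaler on training data only so the distance-based SVM never sees test statistics (data and model choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Split before any fitting; stratify to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaler is fit only on the training portion inside the pipeline: no leakage
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)

print(pipe.score(X_test, y_test))
```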
## Common Pitfalls

| Pitfall | Solution |
|---|---|
| Fitting scaler on all data | Use pipeline or fit only on train |
| Using accuracy for imbalanced | Use F1, ROC-AUC, or balanced accuracy |
| Too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use `feature_importances_` or `permutation_importance` |
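A sketch of both importance tools: the tree ensemble's built-in impurity-based `feature_importances_` and the model-agnostic `sklearn.inspection.permutation_importance` (synthetic data; `n_repeats` is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Only 3 of the 20 default features carry signal
X, y = make_classification(n_samples=300, n_informative=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importance: free, but biased toward high-cardinality features
print(clf.feature_importances_[:5])

# Permutation importance: shuffle each feature, measure the score drop
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean[:5])
```

Permutation importance is slower but works with any fitted estimator and is measured on actual predictive performance.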