Senior Data Scientist
Overview
Build end-to-end data science workflows from data exploration through model deployment. This skill covers data preprocessing, feature engineering, model selection, hyperparameter tuning, cross-validation, experiment tracking with MLflow/W&B, statistical testing, visualization with matplotlib/seaborn/plotly, and Jupyter notebook best practices.
Announce at start: "I'm using the senior-data-scientist skill for data science workflow."
Phase 1: Data Understanding
Goal: Profile the dataset and establish a baseline before any modeling.
Actions
- Load and profile the dataset (shape, types, distributions)
- Identify missing values, outliers, and data quality issues
- Perform exploratory data analysis (EDA)
- Define the target variable and success metrics
- Establish baseline performance
Baseline Models (Always Start Here)
| Task | Baseline Model | Why |
|---|---|---|
| Classification | Majority class classifier | Lower bound for accuracy |
| Classification | Logistic regression | Simple, interpretable |
| Regression | Mean predictor | Lower bound for RMSE |
| Regression | Linear regression | Simple, interpretable |
| Time series | Naive forecast (previous value) | Lower bound for MAE |
| Time series | Seasonal naive | Captures basic seasonality |
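The classification and regression baselines above can be sketched with scikit-learn's Dummy estimators (a minimal sketch; the toy arrays are illustrative only):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.arange(20).reshape(-1, 1)
y_cls = np.array([0] * 15 + [1] * 5)   # imbalanced labels (75% class 0)
y_reg = np.linspace(0.0, 10.0, 20)

# Majority-class baseline: the lower bound for classification accuracy
clf = DummyClassifier(strategy="most_frequent").fit(X, y_cls)
print(clf.score(X, y_cls))  # 0.75 — accuracy of always predicting class 0

# Mean predictor: the lower bound for regression (R^2 = 0 by construction)
reg = DummyRegressor(strategy="mean").fit(X, y_reg)
print(reg.predict(X[:1]))  # predicts the training mean
```

Any candidate model that cannot beat these numbers is not learning anything useful from the features.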
STOP — Do NOT proceed to Phase 2 until:
- The dataset is profiled and data quality issues are documented
- The target variable and success metrics are defined
- Baseline performance is established and recorded
Phase 2: Feature Engineering
Goal: Transform raw data into features that improve model performance.
Actions
- Handle missing values (imputation strategy)
- Encode categorical variables
- Scale/normalize numerical features
- Create derived features
- Feature selection (remove redundant/irrelevant)
Missing Value Strategy Decision Table
| Strategy | When to Use | Implementation |
|---|---|---|
| Drop rows | < 5% missing, missing completely at random (MCAR) | `df.dropna()` |
| Mean/Median | Numerical, no outliers | `SimpleImputer(strategy='median')` |
| Mode | Categorical | `SimpleImputer(strategy='most_frequent')` |
| KNN Imputer | Structured missing patterns | `KNNImputer(n_neighbors=5)` |
| Iterative | Complex relationships | `IterativeImputer()` |
| Flag + Impute | Missingness is informative | Add indicator column, then impute |
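The "Flag + Impute" row maps directly onto scikit-learn's `add_indicator` option, sketched here on a toy matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan],
              [5.0, 14.0]])

# Median imputation; add_indicator=True appends a binary "was missing"
# flag per column that had missing values (the Flag + Impute strategy).
imp = SimpleImputer(strategy="median", add_indicator=True)
X_t = imp.fit_transform(X)
print(X_t.shape)  # (4, 4): 2 imputed columns + 2 missingness flags
```

Fit the imputer on the training split only and reuse it to transform validation and test data, otherwise statistics leak across the split.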
Categorical Encoding Decision Table
| Method | When | Cardinality |
|---|---|---|
| One-Hot | Nominal, low cardinality | < 10 categories |
| Label/Ordinal | Ordinal features | Any |
| Target Encoding | High cardinality nominal | > 10 categories |
| Frequency Encoding | When frequency matters | Any |
| Binary Encoding | Very high cardinality | > 50 categories |
Scaling Decision Table
| Scaler | When | Robust to Outliers? |
|---|---|---|
| StandardScaler | Default choice (mean=0, std=1) | No |
| RobustScaler | Outliers present (median/IQR) | Yes |
| MinMaxScaler | Neural networks, distance-based [0,1] | No |
Feature Types and Engineering
| Feature Type | Techniques |
|---|---|
| Numerical | Log transform, polynomial, binning, interactions (A*B, A/B) |
| Temporal | Hour, day-of-week, is_weekend, time_since_event, cyclical (sin/cos), lags |
| Text | TF-IDF, word count, sentiment scores, named entities, embeddings |
| Categorical | Encoding (above), interaction with numerical features |
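The cyclical (sin/cos) technique for temporal features can be sketched in a few lines; the key property is that the ends of the cycle meet:

```python
import numpy as np

def encode_cyclical(values, period):
    """Map a cyclical feature (hour, month, day-of-week) onto the unit
    circle so that, e.g., hour 23 and hour 0 end up close together."""
    radians = 2 * np.pi * values / period
    return np.sin(radians), np.cos(radians)

hours = np.array([0, 6, 12, 23])
hour_sin, hour_cos = encode_cyclical(hours, period=24)
# In (sin, cos) space, hour 23 is now adjacent to hour 0, whereas the
# raw integer encoding put them 23 apart.
```

A plain integer hour would tell a linear model that 23:00 and 00:00 are maximally far apart; the circular embedding fixes that distortion.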
Feature Selection Decision Table
| Method | Type | Use When |
|---|---|---|
| Correlation matrix | Filter | Initial exploration |
| Mutual information | Filter | Non-linear relationships |
| Recursive Feature Elimination | Wrapper | Model-specific selection |
| L1 Regularization | Embedded | Linear models |
| Feature importance | Embedded | Tree-based models |
| Permutation importance | Model-agnostic | Final validation |
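As one filter-method example, mutual information scoring takes a single call; the synthetic dataset below is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: 5 informative features out of 10
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# Mutual information captures non-linear dependence with the target,
# which a Pearson correlation matrix would miss.
mi = mutual_info_classif(X, y, random_state=0)
top5 = np.argsort(mi)[::-1][:5]
print(top5)  # indices of the five highest-scoring features
```

Filter scores like this are cheap and model-agnostic, which is why they sit at the top of the table; confirm the final selection with permutation importance on the chosen model.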
STOP — Do NOT proceed to Phase 3 until:
- Missing values, encodings, and scaling are handled in a pipeline fit on training data only
- The feature set is selected and documented
Phase 3: Modeling
Goal: Select, train, and evaluate candidate models.
Actions
- Select candidate algorithms
- Set up cross-validation strategy
- Train and evaluate candidates
- Hyperparameter tuning
- Final model selection and evaluation
Algorithm Decision Table
| Data Characteristics | Try First | Also Consider |
|---|---|---|
| Tabular, < 10K rows | Random Forest, XGBoost | Logistic/Linear Regression |
| Tabular, > 10K rows | XGBoost, LightGBM | CatBoost, Neural Network |
| High dimensionality | Lasso/Ridge, SVM | Random Forest with selection |
| Time series | Prophet, ARIMA | LSTM, XGBoost with lag features |
| Text classification | Fine-tuned transformer | TF-IDF + Logistic Regression |
| Image classification | Pre-trained CNN (ResNet, EfficientNet) | Vision Transformer |
| Regression | XGBoost, Random Forest | Linear Regression, Neural Network |
| Anomaly detection | Isolation Forest | LOF, Autoencoder |
Cross-Validation Strategy Decision Table
| Strategy | When | Code |
|---|---|---|
| K-Fold (k=5) | Default, balanced data | `KFold(n_splits=5, shuffle=True)` |
| Stratified K-Fold | Classification, imbalanced | `StratifiedKFold(n_splits=5)` |
| Time Series Split | Temporal data | `TimeSeriesSplit(n_splits=5)` |
| Group K-Fold | Grouped observations | `GroupKFold(n_splits=5)` |
| Leave-One-Out | Very small datasets | `LeaveOneOut()` |
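The stratified strategy from the table, wired into `cross_val_score` (a minimal sketch on synthetic imbalanced data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 80/20 imbalanced synthetic classification data
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# Stratified folds preserve the 80/20 class ratio in every fold,
# so no fold ends up with too few minority-class examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Report the mean and the spread across folds; a large standard deviation is itself a finding about model stability.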
Evaluation Metrics Decision Table
| Task | Primary Metric | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC | F1, Precision, Recall, AP |
| Multiclass | Macro F1 | Accuracy, Confusion Matrix |
| Regression | RMSE | MAE, R-squared, MAPE |
| Ranking | NDCG | MAP, MRR |
| Anomaly Detection | F1, AP | Precision@K, Recall@K |
Hyperparameter Tuning Decision Table
| Method | Compute Budget | Search Space | Implementation |
|---|---|---|---|
| Grid Search | Low (< 100 combos) | Small, known ranges | `GridSearchCV` |
| Random Search | Medium | Large, uncertain | `RandomizedSearchCV` |
| Bayesian (Optuna) | Any | Large, expensive | `optuna.create_study()` |
| Successive Halving | Large | Many candidates | `HalvingRandomSearchCV` |
Common Hyperparameters (XGBoost/LightGBM)
```python
param_space = {
    'n_estimators': [100, 300, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'min_child_weight': [1, 3, 5],
}
```
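Random search over a space like this is one `RandomizedSearchCV` call. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in (it accepts `n_estimators`, `max_depth`, `learning_rate`, and `subsample`; the XGBoost-specific keys `colsample_bytree` and `min_child_weight` are dropped), and the tiny dataset and search budget are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Reduced space: only keys GradientBoostingClassifier also accepts
param_space = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 0.9],
}

# n_iter samples 5 random combinations instead of the full 16-point grid
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            param_space, n_iter=5, cv=3,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With XGBoost or LightGBM the same call works unchanged over the full space above, since both expose scikit-learn-compatible estimators.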
STOP — Do NOT proceed to Phase 4 until:
- The final model beats the baseline on validation metrics
- Hyperparameters were tuned on validation data, never the test set
- Final evaluation on the held-out test set is complete
Phase 4: Deployment
Goal: Serialize, serve, and monitor the model in production.
Actions
- Serialize model and preprocessing pipeline
- Create prediction API or batch pipeline
- Set up monitoring for data drift and model degradation
- Document model card (inputs, outputs, limitations, biases)
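Serializing the preprocessing and the model as one artifact can be sketched with `joblib` (the pipeline contents and data here are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# Bundle preprocessing + model into ONE object so training-time and
# serving-time transforms can never diverge.
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipeline, path)

# The serving side loads a single file and gets identical predictions
restored = joblib.load(path)
assert (restored.predict(X) == pipeline.predict(X)).all()
```

Shipping the scaler separately from the model is a classic source of silent train/serve skew; one artifact removes that failure mode.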
STOP — Deployment complete when:
- The model and preprocessing pipeline are serialized as one artifact
- Drift and degradation monitoring is in place
- The model card is written
Experiment Tracking
MLflow Pattern
```python
import mlflow

mlflow.set_experiment("customer-churn-prediction")

with mlflow.start_run(run_name="xgboost-v2"):
    mlflow.log_params(params)
    mlflow.log_metrics({"auc": auc_score, "f1": f1_score})
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.sklearn.log_model(pipeline, "model")
    mlflow.set_tag("version", "2.1")
```
What to Track
| Category | Items |
|---|---|
| Parameters | All hyperparameters, random seed |
| Metrics | Train and validation metrics |
| Data | Data version/hash, feature list |
| Artifacts | Plots, reports, model files |
| Metadata | Training duration, model size |
Statistical Tests Decision Table
| Question | Test | Assumption |
|---|---|---|
| Two group means different? | t-test (independent) | Normal distribution |
| Two groups (non-normal)? | Mann-Whitney U | None |
| Paired measurements? | Paired t-test | Normal differences |
| 3+ group means? | ANOVA | Normal, equal variance |
| Categorical association? | Chi-squared | Expected freq > 5 |
| Distribution normal? | Shapiro-Wilk | n < 5000 |
| Two distributions different? | Kolmogorov-Smirnov | Continuous data |
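The first two rows of the table map onto two SciPy calls; a sketch on synthetic A/B-test-style data (the group sizes and effect are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=100.0, scale=10.0, size=200)
treatment = rng.normal(loc=104.0, scale=10.0, size=200)

# Independent t-test: assumes roughly normal groups
t_stat, p_t = stats.ttest_ind(control, treatment)

# Mann-Whitney U: rank-based, no normality assumption
u_stat, p_u = stats.mannwhitneyu(control, treatment)

print(f"t-test p={p_t:.4f}, Mann-Whitney p={p_u:.4f}")
```

When the two tests disagree sharply, inspect the group distributions; heavy tails or outliers usually explain the gap and favor the rank-based result.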
P-Value Guidelines
- p < 0.05: statistically significant (conventional)
- Always report effect size alongside p-value
- Adjust for multiple comparisons (Bonferroni, FDR)
- Statistical significance is not practical significance
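Two of these guidelines take only a few lines of NumPy: a pooled-variance Cohen's d for effect size, and the Bonferroni comparison for multiple tests (the p-values below are illustrative):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Bonferroni: with m tests, compare each p-value against alpha / m
p_values = np.array([0.001, 0.02, 0.04])
alpha, m = 0.05, len(p_values)
significant = p_values < alpha / m
print(significant)  # [ True False False]
```

All three raw p-values clear 0.05, but only one survives the corrected threshold of 0.05 / 3 ≈ 0.0167, which is exactly the trap the guideline warns about.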
Visualization Decision Table
| Data Type | Plot | Library |
|---|---|---|
| Distribution | Histogram, KDE, Box plot | seaborn |
| Comparison | Bar chart, Grouped bar | matplotlib |
| Correlation | Scatter, Heatmap | seaborn |
| Trend | Line chart | matplotlib/plotly |
| Composition | Stacked bar, Pie (max 5 slices) | matplotlib |
| Interactive | Scatter, Line, Dashboard | plotly |
Visualization Rules
- Title every plot descriptively
- Label axes with units
- Use colorblind-safe palettes (e.g., viridis or seaborn's `colorblind` palette)
- Start y-axis at 0 for bar charts
- Annotate key findings directly on plots
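A bar chart applying these rules might look like the following sketch (the revenue figures and file path are illustrative):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 13)
revenue = np.array([12, 14, 13, 15, 18, 21, 24, 23, 22, 19, 17, 16])

fig, ax = plt.subplots()
ax.bar(months, revenue, color="#0173b2")       # colorblind-safe blue
ax.set_title("Monthly revenue peaks in July")  # title states the finding
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (USD, thousands)")      # units on the axis
ax.set_ylim(bottom=0)                          # bar charts start at zero
ax.annotate("Peak: 24k", xy=(7, 24), xytext=(8.5, 25),
            arrowprops=dict(arrowstyle="->"))  # finding annotated on-plot
fig.savefig(os.path.join(tempfile.gettempdir(), "revenue.png"))
```

Note the title carries the conclusion, not just the variable names: a reader who sees only the chart still gets the finding.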
Jupyter Notebook Structure
1. ## Setup (imports, configuration)
2. ## Data Loading
3. ## Exploratory Data Analysis
4. ## Data Preprocessing
5. ## Feature Engineering
6. ## Modeling
7. ## Evaluation
8. ## Conclusions
Notebook Best Practices
- Restart and run all before sharing
- Keep cells focused and sequential
- Use markdown cells for explanations
- Extract reusable code to modules
- Version control notebooks (e.g., pair with `jupytext` or strip outputs with `nbstripout`)
- Pin all dependency versions
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Training on test data | Data leakage, inflated metrics | Strict train/test separation |
| Feature engineering before split | Leaks test information into features | Engineer on training data only |
| Reporting training metrics | Not generalizable | Report validation/test metrics |
| Accuracy on imbalanced data | Misleading (majority class wins) | Use F1, AUC-ROC, or AP |
| Tuning on test set | Overfitting to test data | Use validation set for tuning |
| No baseline comparison | Cannot measure improvement | Always establish baseline first |
| Cherry-picking evaluation examples | Selection bias | Report on full evaluation set |
| Deploying without drift monitoring | Silent model degradation | Monitor input distributions |
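The first two anti-patterns share one cure: put every preprocessing step inside a Pipeline, so cross-validation re-fits the transforms on each fold's training split only. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# WRONG: scaler.fit_transform(X) before splitting leaks test-fold
# statistics (mean, std) into the features seen at training time.
# RIGHT: inside a Pipeline, the scaler is re-fit on each fold's
# training split, so validation folds never influence the transform.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The same Pipeline object is what gets serialized in Phase 4, so the leak-free behavior carries straight through to serving.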
Integration Points
- Prompt evaluation uses this skill's statistical testing methods
- ML testing follows the evaluation methodology
- Model inference optimization follows the measurement cycle
- Model performance thresholds become acceptance criteria
- Subjective output evaluation uses LLM-as-judge
- Notebook and pipeline code are reviewed for quality
Skill Type
FLEXIBLE — Adapt preprocessing, modeling, and evaluation approaches to the specific data characteristics, business requirements, and compute constraints. The four-phase process and experiment tracking are strongly recommended. Always establish a baseline before modeling.