senior-data-scientist
Senior Data Scientist
Overview
Build end-to-end data science workflows from data exploration through model deployment. This skill covers data preprocessing, feature engineering, model selection, hyperparameter tuning, cross-validation, experiment tracking with MLflow/W&B, statistical testing, visualization with matplotlib/seaborn/plotly, and Jupyter notebook best practices.
Announce at start: "I'm using the senior-data-scientist skill for data science workflow."
Phase 1: Data Understanding
Goal: Profile the dataset and establish a baseline before any modeling.
Actions
- Load and profile the dataset (shape, types, distributions)
- Identify missing values, outliers, and data quality issues
- Perform exploratory data analysis (EDA)
- Define the target variable and success metrics
- Establish baseline performance
Baseline Models (Always Start Here)
| Task | Baseline Model | Why |
|---|---|---|
| Classification | Majority class classifier | Lower bound for accuracy |
| Classification | Logistic regression | Simple, interpretable |
| Regression | Mean predictor | Lower bound for RMSE |
| Regression | Linear regression | Simple, interpretable |
| Time series | Naive forecast (previous value) | Lower bound for MAE |
| Time series | Seasonal naive | Captures basic seasonality |
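The classification baselines above can be sketched with scikit-learn's `DummyClassifier` (a hedged illustration on synthetic data; the dataset, sizes, and split are placeholders):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset: binary target driven by a linear rule.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Lower bound: always predict the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Simple, interpretable baseline to beat before reaching for complex models.
logreg = LogisticRegression().fit(X_tr, y_tr)

print(f"majority acc: {majority.score(X_te, y_te):.3f}")
print(f"logreg acc:   {logreg.score(X_te, y_te):.3f}")
```

Any candidate model in Phase 3 should clear both of these numbers before it is worth tuning.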
STOP — Do NOT proceed to Phase 2 until:
- Dataset is profiled (shape, types, distributions)
- Missing values and outliers are documented
- Target variable is defined
- Success metrics are chosen
- Baseline performance is established
Phase 2: Feature Engineering
Goal: Transform raw data into features that improve model performance.
Actions
- Handle missing values (imputation strategy)
- Encode categorical variables
- Scale/normalize numerical features
- Create derived features
- Feature selection (remove redundant/irrelevant)
Missing Value Strategy Decision Table
| Strategy | When to Use | Implementation |
|---|---|---|
| Drop rows | < 5% missing, MCAR | df.dropna() |
| Mean/Median | Numerical, no outliers | SimpleImputer(strategy="median") |
| Mode | Categorical | SimpleImputer(strategy="most_frequent") |
| KNN Imputer | Structured missing patterns | KNNImputer |
| Iterative | Complex relationships | IterativeImputer |
| Flag + Impute | Missingness is informative | Add indicator column (add_indicator=True), then impute |
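A minimal sketch of the "Flag + Impute" row using scikit-learn (the array is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])

# add_indicator=True appends one binary column per feature that had missing
# values during fit, so the model can learn from the missingness pattern.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed.shape)  # 2 imputed columns + 2 indicator columns -> (4, 4)
```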
Categorical Encoding Decision Table
| Method | When | Cardinality |
|---|---|---|
| One-Hot | Nominal, low cardinality | < 10 categories |
| Label/Ordinal | Ordinal features | Any |
| Target Encoding | High cardinality nominal | > 10 categories |
| Frequency Encoding | When frequency matters | Any |
| Binary Encoding | Very high cardinality | > 50 categories |
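Two of the encodings above can be sketched with plain pandas (the column and category names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# One-hot: fine at low cardinality, one column per category.
onehot = pd.get_dummies(df["city"], prefix="city")

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].map(df["city"].value_counts(normalize=True))

print(onehot.shape)      # (6, 3)
print(freq.iloc[0])      # NY appears in 3 of 6 rows -> 0.5
```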
Scaling Decision Table
| Scaler | When | Robust to Outliers? |
|---|---|---|
| StandardScaler | Default choice (mean=0, std=1) | No |
| RobustScaler | Outliers present (median/IQR) | Yes |
| MinMaxScaler | Neural networks, distance-based [0,1] | No |
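The outlier-robustness column can be seen directly in a small sketch (illustrative data with one injected outlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# The outlier inflates the mean/std, squashing the inliers under
# StandardScaler; RobustScaler (median/IQR) keeps their spread.
print(std[:4].ravel())
print(rob[:4].ravel())
```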
Feature Types and Engineering
| Feature Type | Techniques |
|---|---|
| Numerical | Log transform, polynomial, binning, interactions (A*B, A/B) |
| Temporal | Hour, day-of-week, is_weekend, time_since_event, cyclical (sin/cos), lags |
| Text | TF-IDF, word count, sentiment scores, named entities, embeddings |
| Categorical | Encoding (above), interaction with numerical features |
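The cyclical (sin/cos) technique from the temporal row, sketched for hour-of-day: it maps hours onto a circle so 23:00 and 00:00 end up close together instead of 23 units apart.

```python
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Euclidean distance between 23:00 and 00:00 in the encoded space is small.
d = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
print(round(d, 3))  # 0.261, versus |23 - 0| = 23 on the raw feature
```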
Feature Selection Decision Table
| Method | Type | Use When |
|---|---|---|
| Correlation matrix | Filter | Initial exploration |
| Mutual information | Filter | Non-linear relationships |
| Recursive Feature Elimination | Wrapper | Model-specific selection |
| L1 Regularization | Embedded | Linear models |
| Feature importance | Embedded | Tree-based models |
| Permutation importance | Model-agnostic | Final validation |
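A sketch of the mutual-information filter from the table: it flags an informative feature that linear correlation would miss (a quadratic relationship, on synthetic data):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 > 1.0).astype(int)  # depends on feature 0 non-linearly

mi = mutual_info_classif(X, y, random_state=0)
print(mi)  # feature 0 scores well above the irrelevant feature 1
```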
STOP — Do NOT proceed to Phase 3 until:
- Missing values are handled with justified strategy
- Categorical variables are encoded appropriately
- Numerical features are scaled
- Feature transforms (imputers, encoders, scalers) are fit on the training split only, after the train/test split
- Feature selection has reduced dimensionality if needed
Phase 3: Modeling
Goal: Select, train, and evaluate candidate models.
Actions
- Select candidate algorithms
- Set up cross-validation strategy
- Train and evaluate candidates
- Hyperparameter tuning
- Final model selection and evaluation
Algorithm Decision Table
| Data Characteristics | Try First | Also Consider |
|---|---|---|
| Tabular, < 10K rows | Random Forest, XGBoost | Logistic/Linear Regression |
| Tabular, > 10K rows | XGBoost, LightGBM | CatBoost, Neural Network |
| High dimensionality | Lasso/Ridge, SVM | Random Forest with selection |
| Time series | Prophet, ARIMA | LSTM, XGBoost with lag features |
| Text classification | Fine-tuned transformer | TF-IDF + Logistic Regression |
| Image classification | Pre-trained CNN (ResNet, EfficientNet) | Vision Transformer |
| Regression | XGBoost, Random Forest | Linear Regression, Neural Network |
| Anomaly detection | Isolation Forest | LOF, Autoencoder |
Cross-Validation Strategy Decision Table
| Strategy | When | Code |
|---|---|---|
| K-Fold (k=5) | Default, balanced data | KFold(n_splits=5, shuffle=True) |
| Stratified K-Fold | Classification, imbalanced | StratifiedKFold(n_splits=5) |
| Time Series Split | Temporal data | TimeSeriesSplit(n_splits=5) |
| Group K-Fold | Grouped observations | GroupKFold(n_splits=5) |
| Leave-One-Out | Very small datasets | LeaveOneOut() |
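Why the time-series row matters can be shown in a few lines: `TimeSeriesSplit` only ever trains on the past and validates on the future, unlike shuffled K-Fold (tiny illustrative array):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx, test_idx)
    # Every training index precedes every test index: no future leakage.
    assert train_idx.max() < test_idx.min()
```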
Evaluation Metrics Decision Table
| Task | Primary Metric | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC | F1, Precision, Recall, AP |
| Multiclass | Macro F1 | Accuracy, Confusion Matrix |
| Regression | RMSE | MAE, R-squared, MAPE |
| Ranking | NDCG | MAP, MRR |
| Anomaly Detection | F1, AP | Precision@K, Recall@K |
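A quick sketch of why accuracy is not a primary metric under class imbalance (this mirrors the anti-patterns table later; labels are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5   # 5% positive class
y_pred = [0] * 100            # "always predict majority" model

print(accuracy_score(y_true, y_pred))                  # 0.95 looks great
print(f1_score(y_true, y_pred, zero_division=0))       # 0.0 tells the truth
```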
Hyperparameter Tuning Decision Table
| Method | Compute Budget | Search Space | Implementation |
|---|---|---|---|
| Grid Search | Low (< 100 combos) | Small, known ranges | GridSearchCV |
| Random Search | Medium | Large, uncertain | RandomizedSearchCV |
| Bayesian (Optuna) | Any | Large, expensive | optuna.create_study |
| Successive Halving | Large | Many candidates | HalvingRandomSearchCV |
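A hedged sketch of the random-search row. To stay dependency-free it uses scikit-learn's GradientBoostingClassifier with the parameter names it shares with XGBoost; with xgboost installed, `XGBClassifier` drops into the same slot. Dataset, space, and `n_iter` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_space = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 0.9],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=5,            # sample 5 random combinations from the space
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```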
Common Hyperparameters (XGBoost/LightGBM)
```python
param_space = {
    'n_estimators': [100, 300, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'min_child_weight': [1, 3, 5],
}
```
STOP — Do NOT proceed to Phase 4 until:
- At least 2 candidate models are evaluated
- Cross-validation is used (not just train/test split)
- Results beat the baseline from Phase 1
- Best model is selected with justification
- Overfitting is checked (train vs validation gap)
Phase 4: Deployment
Goal: Serialize, serve, and monitor the model in production.
Actions
- Serialize model and preprocessing pipeline
- Create prediction API or batch pipeline
- Set up monitoring for data drift and model degradation
- Document model card (inputs, outputs, limitations, biases)
STOP — Deployment complete when:
- Model is serialized with preprocessing pipeline
- Prediction API or batch pipeline works end-to-end
- Monitoring is configured for data drift
- Model card is documented
Experiment Tracking
MLflow Pattern
```python
import mlflow

mlflow.set_experiment("customer-churn-prediction")
with mlflow.start_run(run_name="xgboost-v2"):
    mlflow.log_params(params)
    mlflow.log_metrics({"auc": auc_score, "f1": f1_score})
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.sklearn.log_model(pipeline, "model")
    mlflow.set_tag("version", "2.1")
```
What to Track
| Category | Items |
|---|---|
| Parameters | All hyperparameters, random seed |
| Metrics | Train and validation metrics |
| Data | Data version/hash, feature list |
| Artifacts | Plots, reports, model files |
| Metadata | Training duration, model size |
Statistical Tests Decision Table
| Question | Test | Assumption |
|---|---|---|
| Two group means different? | t-test (independent) | Normal distribution |
| Two groups (non-normal)? | Mann-Whitney U | None |
| Paired measurements? | Paired t-test | Normal differences |
| 3+ group means? | ANOVA | Normal, equal variance |
| Categorical association? | Chi-squared | Expected freq > 5 |
| Distribution normal? | Shapiro-Wilk | n < 5000 |
| Two distributions different? | Kolmogorov-Smirnov | Continuous data |
P-Value Guidelines
- p < 0.05: statistically significant (conventional)
- Always report effect size alongside p-value
- Adjust for multiple comparisons (Bonferroni, FDR)
- Statistical significance is not practical significance
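The multiple-comparisons bullet in a few lines: Bonferroni simply tests each p-value against alpha / m instead of alpha (the p-values here are illustrative):

```python
import numpy as np

alpha = 0.05
pvalues = np.array([0.003, 0.012, 0.04, 0.20])  # m = 4 hypothetical tests
m = len(pvalues)

naive = pvalues < alpha           # 3 of 4 look "significant"
bonferroni = pvalues < alpha / m  # only those below 0.0125 survive

print(naive.tolist())
print(bonferroni.tolist())
```

FDR procedures such as Benjamini-Hochberg are less conservative and preferable when many tests are run.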
Visualization Decision Table
| Data Type | Plot | Library |
|---|---|---|
| Distribution | Histogram, KDE, Box plot | seaborn |
| Comparison | Bar chart, Grouped bar | matplotlib |
| Correlation | Scatter, Heatmap | seaborn |
| Trend | Line chart | matplotlib/plotly |
| Composition | Stacked bar, Pie (max 5 slices) | matplotlib |
| Interactive | Scatter, Line, Dashboard | plotly |
Visualization Rules
- Title every plot descriptively
- Label axes with units
- Use colorblind-safe palettes (seaborn: colorblind)
- Start y-axis at 0 for bar charts
- Annotate key findings directly on plots
Jupyter Notebook Structure
1. ## Setup (imports, configuration)
2. ## Data Loading
3. ## Exploratory Data Analysis
4. ## Data Preprocessing
5. ## Feature Engineering
6. ## Modeling
7. ## Evaluation
8. ## Conclusions
Notebook Best Practices
Notebook最佳实践
- Restart and run all before sharing
- Keep cells focused and sequential
- Use markdown cells for explanations
- Extract reusable code to .py modules
- Version control with nbstripout
- Pin all dependency versions
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Training on test data | Data leakage, inflated metrics | Strict train/test separation |
| Feature engineering before split | Leaks test information into features | Engineer on training data only |
| Reporting training metrics | Not generalizable | Report validation/test metrics |
| Accuracy on imbalanced data | Misleading (majority class wins) | Use F1, AUC-ROC, or AP |
| Tuning on test set | Overfitting to test data | Use validation set for tuning |
| No baseline comparison | Cannot measure improvement | Always establish baseline first |
| Cherry-picking evaluation examples | Selection bias | Report on full evaluation set |
| Deploying without drift monitoring | Silent model degradation | Monitor input distributions |
Integration Points
| Skill | Relationship |
|---|---|
| | Prompt evaluation uses statistical testing methods |
| | ML testing follows the evaluation methodology |
| | Model inference optimization follows measurement cycle |
| | Model performance thresholds become acceptance criteria |
| | Subjective output evaluation uses LLM-as-judge |
| | Notebook and pipeline code reviewed for quality |
Skill Type
FLEXIBLE — Adapt preprocessing, modeling, and evaluation approaches to the specific data characteristics, business requirements, and compute constraints. The four-phase process and experiment tracking are strongly recommended. Always establish a baseline before modeling.