senior-data-scientist
Senior Data Scientist
Overview
Build end-to-end data science workflows from data exploration through model deployment. This skill covers data preprocessing, feature engineering, model selection, hyperparameter tuning, cross-validation, experiment tracking with MLflow/W&B, statistical testing, visualization with matplotlib/seaborn/plotly, and Jupyter notebook best practices.
Announce at start: "I'm using the senior-data-scientist skill for data science workflow."
Phase 1: Data Understanding
Goal: Profile the dataset and establish a baseline before any modeling.
Actions
- Load and profile the dataset (shape, types, distributions)
- Identify missing values, outliers, and data quality issues
- Perform exploratory data analysis (EDA)
- Define the target variable and success metrics
- Establish baseline performance
Baseline Models (Always Start Here)
| Task | Baseline Model | Why |
|---|---|---|
| Classification | Majority class classifier | Lower bound for accuracy |
| Classification | Logistic regression | Simple, interpretable |
| Regression | Mean predictor | Lower bound for RMSE |
| Regression | Linear regression | Simple, interpretable |
| Time series | Naive forecast (previous value) | Lower bound for MAE |
| Time series | Seasonal naive | Captures basic seasonality |
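The classification baselines above can be sketched with scikit-learn's `DummyClassifier` (a hedged illustration on synthetic data; the dataset, sizes, and split are placeholders):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset: binary target driven by a linear rule.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Lower bound: always predict the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Simple, interpretable baseline to beat before reaching for complex models.
logreg = LogisticRegression().fit(X_tr, y_tr)

print(f"majority acc: {majority.score(X_te, y_te):.3f}")
print(f"logreg acc:   {logreg.score(X_te, y_te):.3f}")
```

Any candidate model in Phase 3 should clear both of these numbers before it is worth tuning.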
STOP — Do NOT proceed to Phase 2 until:
- Dataset is profiled (shape, types, distributions)
- Missing values and outliers are documented
- Target variable is defined
- Success metrics are chosen
- Baseline performance is established
Phase 2: Feature Engineering
Goal: Transform raw data into features that improve model performance.
Actions
- Handle missing values (imputation strategy)
- Encode categorical variables
- Scale/normalize numerical features
- Create derived features
- Feature selection (remove redundant/irrelevant)
Missing Value Strategy Decision Table
| Strategy | When to Use | Implementation |
|---|---|---|
| Drop rows | < 5% missing, MCAR | df.dropna() |
| Mean/Median | Numerical, no outliers | SimpleImputer(strategy="median") |
| Mode | Categorical | SimpleImputer(strategy="most_frequent") |
| KNN Imputer | Structured missing patterns | KNNImputer |
| Iterative | Complex relationships | IterativeImputer |
| Flag + Impute | Missingness is informative | Add indicator column (add_indicator=True), then impute |
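A minimal sketch of the "Flag + Impute" row using scikit-learn (the array is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])

# add_indicator=True appends one binary column per feature that had missing
# values during fit, so the model can learn from the missingness pattern.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed.shape)  # 2 imputed columns + 2 indicator columns -> (4, 4)
```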
Categorical Encoding Decision Table
| Method | When | Cardinality |
|---|---|---|
| One-Hot | Nominal, low cardinality | < 10 categories |
| Label/Ordinal | Ordinal features | Any |
| Target Encoding | High cardinality nominal | > 10 categories |
| Frequency Encoding | When frequency matters | Any |
| Binary Encoding | Very high cardinality | > 50 categories |
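Two of the encodings above can be sketched with plain pandas (the column and category names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# One-hot: fine at low cardinality, one column per category.
onehot = pd.get_dummies(df["city"], prefix="city")

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].map(df["city"].value_counts(normalize=True))

print(onehot.shape)      # (6, 3)
print(freq.iloc[0])      # NY appears in 3 of 6 rows -> 0.5
```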
Scaling Decision Table
| Scaler | When | Robust to Outliers? |
|---|---|---|
| StandardScaler | Default choice (mean=0, std=1) | No |
| RobustScaler | Outliers present (median/IQR) | Yes |
| MinMaxScaler | Neural networks, distance-based [0,1] | No |
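The outlier-robustness column can be seen directly in a small sketch (illustrative data with one injected outlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# The outlier inflates the mean/std, squashing the inliers under
# StandardScaler; RobustScaler (median/IQR) keeps their spread.
print(std[:4].ravel())
print(rob[:4].ravel())
```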
Feature Types and Engineering
| Feature Type | Techniques |
|---|---|
| Numerical | Log transform, polynomial, binning, interactions (A*B, A/B) |
| Temporal | Hour, day-of-week, is_weekend, time_since_event, cyclical (sin/cos), lags |
| Text | TF-IDF, word count, sentiment scores, named entities, embeddings |
| Categorical | Encoding (above), interaction with numerical features |
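The cyclical (sin/cos) technique from the temporal row, sketched for hour-of-day: it maps hours onto a circle so 23:00 and 00:00 end up close together instead of 23 units apart.

```python
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Euclidean distance between 23:00 and 00:00 in the encoded space is small.
d = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
print(round(d, 3))  # 0.261, versus |23 - 0| = 23 on the raw feature
```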
Feature Selection Decision Table
| Method | Type | Use When |
|---|---|---|
| Correlation matrix | Filter | Initial exploration |
| Mutual information | Filter | Non-linear relationships |
| Recursive Feature Elimination | Wrapper | Model-specific selection |
| L1 Regularization | Embedded | Linear models |
| Feature importance | Embedded | Tree-based models |
| Permutation importance | Model-agnostic | Final validation |
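A sketch of the mutual-information filter from the table: it flags an informative feature that linear correlation would miss (a quadratic relationship, on synthetic data):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 > 1.0).astype(int)  # depends on feature 0 non-linearly

mi = mutual_info_classif(X, y, random_state=0)
print(mi)  # feature 0 scores well above the irrelevant feature 1
```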
STOP — Do NOT proceed to Phase 3 until:
- Missing values are handled with justified strategy
- Categorical variables are encoded appropriately
- Numerical features are scaled
- Feature transforms (imputers, encoders, scalers) are fit on the training split only, after the train/test split
- Feature selection has reduced dimensionality if needed
Phase 3: Modeling
Goal: Select, train, and evaluate candidate models.
Actions
- Select candidate algorithms
- Set up cross-validation strategy
- Train and evaluate candidates
- Hyperparameter tuning
- Final model selection and evaluation
Algorithm Decision Table
| Data Characteristics | Try First | Also Consider |
|---|---|---|
| Tabular, < 10K rows | Random Forest, XGBoost | Logistic/Linear Regression |
| Tabular, > 10K rows | XGBoost, LightGBM | CatBoost, Neural Network |
| High dimensionality | Lasso/Ridge, SVM | Random Forest with selection |
| Time series | Prophet, ARIMA | LSTM, XGBoost with lag features |
| Text classification | Fine-tuned transformer | TF-IDF + Logistic Regression |
| Image classification | Pre-trained CNN (ResNet, EfficientNet) | Vision Transformer |
| Regression | XGBoost, Random Forest | Linear Regression, Neural Network |
| Anomaly detection | Isolation Forest | LOF, Autoencoder |
Cross-Validation Strategy Decision Table
| Strategy | When | Code |
|---|---|---|
| K-Fold (k=5) | Default, balanced data | KFold(n_splits=5, shuffle=True) |
| Stratified K-Fold | Classification, imbalanced | StratifiedKFold(n_splits=5) |
| Time Series Split | Temporal data | TimeSeriesSplit(n_splits=5) |
| Group K-Fold | Grouped observations | GroupKFold(n_splits=5) |
| Leave-One-Out | Very small datasets | LeaveOneOut() |
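Why the time-series row matters can be shown in a few lines: `TimeSeriesSplit` only ever trains on the past and validates on the future, unlike shuffled K-Fold (tiny illustrative array):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx, test_idx)
    # Every training index precedes every test index: no future leakage.
    assert train_idx.max() < test_idx.min()
```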
Evaluation Metrics Decision Table
| Task | Primary Metric | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC | F1, Precision, Recall, AP |
| Multiclass | Macro F1 | Accuracy, Confusion Matrix |
| Regression | RMSE | MAE, R-squared, MAPE |
| Ranking | NDCG | MAP, MRR |
| Anomaly Detection | F1, AP | Precision@K, Recall@K |
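A quick sketch of why accuracy is not a primary metric under class imbalance (this mirrors the anti-patterns table later; labels are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5   # 5% positive class
y_pred = [0] * 100            # "always predict majority" model

print(accuracy_score(y_true, y_pred))                  # 0.95 looks great
print(f1_score(y_true, y_pred, zero_division=0))       # 0.0 tells the truth
```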
Hyperparameter Tuning Decision Table
| Method | Compute Budget | Search Space | Implementation |
|---|---|---|---|
| Grid Search | Low (< 100 combos) | Small, known ranges | GridSearchCV |
| Random Search | Medium | Large, uncertain | RandomizedSearchCV |
| Bayesian (Optuna) | Any | Large, expensive | optuna.create_study |
| Successive Halving | Large | Many candidates | HalvingRandomSearchCV |
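A hedged sketch of the random-search row. To stay dependency-free it uses scikit-learn's GradientBoostingClassifier with the parameter names it shares with XGBoost; with xgboost installed, `XGBClassifier` drops into the same slot. Dataset, space, and `n_iter` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_space = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 0.9],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=5,            # sample 5 random combinations from the space
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```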
Common Hyperparameters (XGBoost/LightGBM)
```python
param_space = {
    'n_estimators': [100, 300, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'min_child_weight': [1, 3, 5],
}
```
STOP — Do NOT proceed to Phase 4 until:
- At least 2 candidate models are evaluated
- Cross-validation is used (not just train/test split)
- Results beat the baseline from Phase 1
- Best model is selected with justification
- Overfitting is checked (train vs validation gap)
Phase 4: Deployment
Goal: Serialize, serve, and monitor the model in production.
Actions
- Serialize model and preprocessing pipeline
- Create prediction API or batch pipeline
- Set up monitoring for data drift and model degradation
- Document model card (inputs, outputs, limitations, biases)
STOP — Deployment complete when:
- Model is serialized with preprocessing pipeline
- Prediction API or batch pipeline works end-to-end
- Monitoring is configured for data drift
- Model card is documented
Experiment Tracking
MLflow Pattern
```python
import mlflow

mlflow.set_experiment("customer-churn-prediction")
with mlflow.start_run(run_name="xgboost-v2"):
    mlflow.log_params(params)
    mlflow.log_metrics({"auc": auc_score, "f1": f1_score})
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.sklearn.log_model(pipeline, "model")
    mlflow.set_tag("version", "2.1")
```
What to Track
| Category | Items |
|---|---|
| Parameters | All hyperparameters, random seed |
| Metrics | Train and validation metrics |
| Data | Data version/hash, feature list |
| Artifacts | Plots, reports, model files |
| Metadata | Training duration, model size |
Statistical Tests Decision Table
| Question | Test | Assumption |
|---|---|---|
| Two group means different? | t-test (independent) | Normal distribution |
| Two groups (non-normal)? | Mann-Whitney U | None |
| Paired measurements? | Paired t-test | Normal differences |
| 3+ group means? | ANOVA | Normal, equal variance |
| Categorical association? | Chi-squared | Expected freq > 5 |
| Distribution normal? | Shapiro-Wilk | n < 5000 |
| Two distributions different? | Kolmogorov-Smirnov | Continuous data |
P-Value Guidelines
- p < 0.05: statistically significant (conventional)
- Always report effect size alongside p-value
- Adjust for multiple comparisons (Bonferroni, FDR)
- Statistical significance is not practical significance
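The multiple-comparisons bullet in a few lines: Bonferroni simply tests each p-value against alpha / m instead of alpha (the p-values here are illustrative):

```python
import numpy as np

alpha = 0.05
pvalues = np.array([0.003, 0.012, 0.04, 0.20])  # m = 4 hypothetical tests
m = len(pvalues)

naive = pvalues < alpha           # 3 of 4 look "significant"
bonferroni = pvalues < alpha / m  # only those below 0.0125 survive

print(naive.tolist())
print(bonferroni.tolist())
```

FDR procedures such as Benjamini-Hochberg are less conservative and preferable when many tests are run.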
Visualization Decision Table
| Data Type | Plot | Library |
|---|---|---|
| Distribution | Histogram, KDE, Box plot | seaborn |
| Comparison | Bar chart, Grouped bar | matplotlib |
| Correlation | Scatter, Heatmap | seaborn |
| Trend | Line chart | matplotlib/plotly |
| Composition | Stacked bar, Pie (max 5 slices) | matplotlib |
| Interactive | Scatter, Line, Dashboard | plotly |
Visualization Rules
- Title every plot descriptively
- Label axes with units
- Use colorblind-safe palettes (seaborn: colorblind)
- Start y-axis at 0 for bar charts
- Annotate key findings directly on plots
Jupyter Notebook Structure
1. ## Setup (imports, configuration)
2. ## Data Loading
3. ## Exploratory Data Analysis
4. ## Data Preprocessing
5. ## Feature Engineering
6. ## Modeling
7. ## Evaluation
8. ## Conclusions
Notebook Best Practices
Notebook最佳实践
- Restart and run all before sharing
- Keep cells focused and sequential
- Use markdown cells for explanations
- Extract reusable code to .py modules
- Version control with nbstripout
- Pin all dependency versions
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Training on test data | Data leakage, inflated metrics | Strict train/test separation |
| Feature engineering before split | Leaks test information into features | Engineer on training data only |
| Reporting training metrics | Not generalizable | Report validation/test metrics |
| Accuracy on imbalanced data | Misleading (majority class wins) | Use F1, AUC-ROC, or AP |
| Tuning on test set | Overfitting to test data | Use validation set for tuning |
| No baseline comparison | Cannot measure improvement | Always establish baseline first |
| Cherry-picking evaluation examples | Selection bias | Report on full evaluation set |
| Deploying without drift monitoring | Silent model degradation | Monitor input distributions |
Integration Points
| Skill | Relationship |
|---|---|
| | Prompt evaluation uses statistical testing methods |
| | ML testing follows the evaluation methodology |
| | Model inference optimization follows measurement cycle |
| | Model performance thresholds become acceptance criteria |
| | Subjective output evaluation uses LLM-as-judge |
| | Notebook and pipeline code reviewed for quality |
Skill Type
FLEXIBLE — Adapt preprocessing, modeling, and evaluation approaches to the specific data characteristics, business requirements, and compute constraints. The four-phase process and experiment tracking are strongly recommended. Always establish a baseline before modeling.