senior-data-scientist

Senior Data Scientist

Overview

Build end-to-end data science workflows from data exploration through model deployment. This skill covers data preprocessing, feature engineering, model selection, hyperparameter tuning, cross-validation, experiment tracking with MLflow/W&B, statistical testing, visualization with matplotlib/seaborn/plotly, and Jupyter notebook best practices.
Announce at start: "I'm using the senior-data-scientist skill for data science workflow."

Phase 1: Data Understanding

Goal: Profile the dataset and establish a baseline before any modeling.

Actions

  1. Load and profile the dataset (shape, types, distributions)
  2. Identify missing values, outliers, and data quality issues
  3. Perform exploratory data analysis (EDA)
  4. Define the target variable and success metrics
  5. Establish baseline performance

Baseline Models (Always Start Here)

| Task | Baseline Model | Why |
|---|---|---|
| Classification | Majority class classifier | Lower bound for accuracy |
| Classification | Logistic regression | Simple, interpretable |
| Regression | Mean predictor | Lower bound for RMSE |
| Regression | Linear regression | Simple, interpretable |
| Time series | Naive forecast (previous value) | Lower bound for MAE |
| Time series | Seasonal naive | Captures basic seasonality |
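A minimal sketch of the first two classification rows, using scikit-learn's `DummyClassifier` as the majority-class baseline. The dataset here is synthetic and purely illustrative; substitute your real features and target:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data stands in for the real dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Majority-class baseline: the floor any real model must beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Simple interpretable model as the second reference point
logreg = LogisticRegression().fit(X_train, y_train)
print(baseline.score(X_test, y_test), logreg.score(X_test, y_test))
```

`DummyRegressor(strategy="mean")` plays the same role for regression tasks.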

STOP — Do NOT proceed to Phase 2 until:

  • Dataset is profiled (shape, types, distributions)
  • Missing values and outliers are documented
  • Target variable is defined
  • Success metrics are chosen
  • Baseline performance is established

Phase 2: Feature Engineering

Goal: Transform raw data into features that improve model performance.

Actions

  1. Handle missing values (imputation strategy)
  2. Encode categorical variables
  3. Scale/normalize numerical features
  4. Create derived features
  5. Feature selection (remove redundant/irrelevant)

Missing Value Strategy Decision Table

| Strategy | When to Use | Implementation |
|---|---|---|
| Drop rows | < 5% missing, MCAR | `df.dropna()` |
| Mean/Median | Numerical, no outliers | `SimpleImputer(strategy='median')` |
| Mode | Categorical | `SimpleImputer(strategy='most_frequent')` |
| KNN Imputer | Structured missing patterns | `KNNImputer(n_neighbors=5)` |
| Iterative | Complex relationships | `IterativeImputer()` |
| Flag + Impute | Missingness is informative | Add `is_missing` column + impute |
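The flag-and-impute row can be sketched as follows (the `age`/`city` frame is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with holes in a numeric and a categorical column
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "city": ["NY", "SF", np.nan, "SF"]})

# Flag + impute: keep the missingness signal, then fill the hole
df["age_is_missing"] = df["age"].isna().astype(int)
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
```

Fit any imputer on the training split only and reuse it on validation/test data.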

Categorical Encoding Decision Table

| Method | When | Cardinality |
|---|---|---|
| One-Hot | Nominal, low cardinality | < 10 categories |
| Label/Ordinal | Ordinal features | Any |
| Target Encoding | High cardinality nominal | > 10 categories |
| Frequency Encoding | When frequency matters | Any |
| Binary Encoding | Very high cardinality | > 50 categories |
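A small sketch of the first two rows on hypothetical features, using `pd.get_dummies` for one-hot (scikit-learn's `OneHotEncoder` also works) and `OrdinalEncoder` with an explicit order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical features: one nominal, one truly ordinal
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "L", "M"]})

# One-hot for the nominal, low-cardinality column
dummies = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding with an explicit order so S < M < L survives
enc = OrdinalEncoder(categories=[["S", "M", "L"]])
size_ord = enc.fit_transform(df[["size"]]).ravel()
```

Passing `categories` explicitly matters: the default sorts alphabetically, which would scramble the ordinal relationship.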

Scaling Decision Table

| Scaler | When | Robust to Outliers? |
|---|---|---|
| StandardScaler | Default choice (mean=0, std=1) | No |
| RobustScaler | Outliers present (median/IQR) | Yes |
| MinMaxScaler | Neural networks, distance-based, range [0,1] | No |
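The robustness difference is easy to see on a toy column with one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # one extreme outlier

std = StandardScaler().fit_transform(x)   # mean/std: the outlier inflates both
rob = RobustScaler().fit_transform(x)     # median/IQR: the outlier barely shifts the inliers
```

With `RobustScaler` the three inliers stay close to their own center; with `StandardScaler` the single outlier drags the mean and standard deviation, distorting everything.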

Feature Types and Engineering

| Feature Type | Techniques |
|---|---|
| Numerical | Log transform, polynomial, binning, interactions (A*B, A/B) |
| Temporal | Hour, day-of-week, is_weekend, time_since_event, cyclical (sin/cos), lags |
| Text | TF-IDF, word count, sentiment scores, named entities, embeddings |
| Categorical | Encoding (above), interaction with numerical features |
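The cyclical temporal technique can be sketched like this on hypothetical hourly timestamps (2024-01-01 is a Monday):

```python
import numpy as np
import pandas as pd

ts = pd.DataFrame({"timestamp": pd.date_range("2024-01-01", periods=48, freq="h")})
hour = ts["timestamp"].dt.hour

# Cyclical encoding: hour 23 and hour 0 end up adjacent, not 23 apart
ts["hour_sin"] = np.sin(2 * np.pi * hour / 24)
ts["hour_cos"] = np.cos(2 * np.pi * hour / 24)
ts["is_weekend"] = (ts["timestamp"].dt.dayofweek >= 5).astype(int)
```

The sin/cos pair places every hour on the unit circle, so distance-based and linear models see midnight and 23:00 as neighbors.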

Feature Selection Decision Table

| Method | Type | Use When |
|---|---|---|
| Correlation matrix | Filter | Initial exploration |
| Mutual information | Filter | Non-linear relationships |
| Recursive Feature Elimination | Wrapper | Model-specific selection |
| L1 Regularization | Embedded | Linear models |
| Feature importance | Embedded | Tree-based models |
| Permutation importance | Model-agnostic | Final validation |
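The mutual-information row can be sketched on synthetic data where only one feature carries a (non-linear) signal:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] ** 2 > 1).astype(int)  # non-linear dependence on feature 0 only

mi = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(mi)[::-1]       # feature indices, most informative first
```

A plain correlation filter would miss this relationship (the dependence is on the square, not the value), which is exactly when mutual information earns its place.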

STOP — Do NOT proceed to Phase 3 until:

  • Missing values are handled with justified strategy
  • Categorical variables are encoded appropriately
  • Numerical features are scaled
  • Feature transformations are fit on the training data only, after the train/test split (no test-set leakage)
  • Feature selection has reduced dimensionality if needed

Phase 3: Modeling

Goal: Select, train, and evaluate candidate models.

Actions

  1. Select candidate algorithms
  2. Set up cross-validation strategy
  3. Train and evaluate candidates
  4. Hyperparameter tuning
  5. Final model selection and evaluation

Algorithm Decision Table

| Data Characteristics | Try First | Also Consider |
|---|---|---|
| Tabular, < 10K rows | Random Forest, XGBoost | Logistic/Linear Regression |
| Tabular, > 10K rows | XGBoost, LightGBM | CatBoost, Neural Network |
| High dimensionality | Lasso/Ridge, SVM | Random Forest with selection |
| Time series | Prophet, ARIMA | LSTM, XGBoost with lag features |
| Text classification | Fine-tuned transformer | TF-IDF + Logistic Regression |
| Image classification | Pre-trained CNN (ResNet, EfficientNet) | Vision Transformer |
| Regression | XGBoost, Random Forest | Linear Regression, Neural Network |
| Anomaly detection | Isolation Forest | LOF, Autoencoder |

Cross-Validation Strategy Decision Table

| Strategy | When | Code |
|---|---|---|
| K-Fold (k=5) | Default, balanced data | `KFold(n_splits=5)` |
| Stratified K-Fold | Classification, imbalanced | `StratifiedKFold(n_splits=5)` |
| Time Series Split | Temporal data | `TimeSeriesSplit(n_splits=5)` |
| Group K-Fold | Grouped observations | `GroupKFold(n_splits=5)` |
| Leave-One-Out | Very small datasets | `LeaveOneOut()` |
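The stratified row in action, sketched on a synthetic imbalanced problem (the 80/20 split and model choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy problem (roughly 80/20 classes)
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# Stratification preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting mean and spread across folds, not a single split, is the point of this phase.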

Evaluation Metrics Decision Table

| Task | Primary Metric | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC | F1, Precision, Recall, AP |
| Multiclass | Macro F1 | Accuracy, Confusion Matrix |
| Regression | RMSE | MAE, R-squared, MAPE |
| Ranking | NDCG | MAP, MRR |
| Anomaly Detection | F1, AP | Precision@K, Recall@K |
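A small sketch of the binary-classification row on made-up labels and scores, showing why AUC (threshold-free) and the thresholded metrics answer different questions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy ground truth and model scores (illustrative only)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.2, 0.8, 0.6, 0.7, 0.9, 0.05]
y_pred = [int(p >= 0.5) for p in y_prob]

auc = roc_auc_score(y_true, y_prob)      # ranking quality, no threshold needed
prec = precision_score(y_true, y_pred)   # of predicted positives, how many are real
rec = recall_score(y_true, y_pred)       # of real positives, how many were found
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
```

Here one negative scores 0.7: AUC drops slightly (one misranked pair), while at the 0.5 threshold that same example costs precision but not recall.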

Hyperparameter Tuning Decision Table

| Method | Compute Budget | Search Space | Implementation |
|---|---|---|---|
| Grid Search | Low (< 100 combos) | Small, known ranges | `GridSearchCV` |
| Random Search | Medium | Large, uncertain | `RandomizedSearchCV` |
| Bayesian (Optuna) | Any | Large, expensive | `optuna.create_study()` |
| Successive Halving | Large | Many candidates | `HalvingRandomSearchCV` |

Common Hyperparameters (XGBoost/LightGBM)

```python
param_space = {
    'n_estimators': [100, 300, 500, 1000],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'min_child_weight': [1, 3, 5],
}
```
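A space like this plugs straight into `RandomizedSearchCV`. The sketch below substitutes scikit-learn's `GradientBoostingClassifier` for XGBoost (which may not be installed) and keeps only the keys that estimator shares; the data and budget are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Reduced space: the subset of keys sklearn's gradient boosting understands
space = {"n_estimators": [50, 100], "max_depth": [3, 5],
         "learning_rate": [0.05, 0.1], "subsample": [0.8, 0.9]}

search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            space, n_iter=5, cv=3,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With a real XGBoost estimator the full `param_space` above works unchanged.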

STOP — Do NOT proceed to Phase 4 until:

  • At least 2 candidate models are evaluated
  • Cross-validation is used (not just train/test split)
  • Results beat the baseline from Phase 1
  • Best model is selected with justification
  • Overfitting is checked (train vs validation gap)

Phase 4: Deployment

Goal: Serialize, serve, and monitor the model in production.

Actions

  1. Serialize model and preprocessing pipeline
  2. Create prediction API or batch pipeline
  3. Set up monitoring for data drift and model degradation
  4. Document model card (inputs, outputs, limitations, biases)
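Step 1 can be sketched with `joblib`: serialize the preprocessing and model as one pipeline artifact (the toy data and temp path are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

# Bundle preprocessing + model into ONE artifact so serving
# can never apply different scaling than training did
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)
restored = joblib.load(path)
```

Pin the scikit-learn version alongside the artifact: joblib pickles are not guaranteed to load across library versions.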

STOP — Deployment complete when:

  • Model is serialized with preprocessing pipeline
  • Prediction API or batch pipeline works end-to-end
  • Monitoring is configured for data drift
  • Model card is documented

Experiment Tracking

MLflow Pattern

```python
import mlflow

mlflow.set_experiment("customer-churn-prediction")

with mlflow.start_run(run_name="xgboost-v2"):
    mlflow.log_params(params)
    mlflow.log_metrics({"auc": auc_score, "f1": f1_score})
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.sklearn.log_model(pipeline, "model")
    mlflow.set_tag("version", "2.1")
```

What to Track

| Category | Items |
|---|---|
| Parameters | All hyperparameters, random seed |
| Metrics | Train and validation metrics |
| Data | Data version/hash, feature list |
| Artifacts | Plots, reports, model files |
| Metadata | Training duration, model size |

Statistical Tests Decision Table

| Question | Test | Assumption |
|---|---|---|
| Two group means different? | t-test (independent) | Normal distribution |
| Two groups (non-normal)? | Mann-Whitney U | None |
| Paired measurements? | Paired t-test | Normal differences |
| 3+ group means? | ANOVA | Normal, equal variance |
| Categorical association? | Chi-squared | Expected freq > 5 |
| Distribution normal? | Shapiro-Wilk | n < 5000 |
| Two distributions different? | Kolmogorov-Smirnov | Continuous data |
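The first two rows can be sketched with `scipy.stats` on simulated groups (the 0.5-SD shift is illustrative), including the effect size that should always accompany the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.5, scale=1.0, size=200)

# Independent two-sample t-test (assumes normality)
t_stat, p_val = stats.ttest_ind(a, b)

# Non-parametric alternative when normality is doubtful
u_stat, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

# Effect size (Cohen's d) alongside the p-value
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd
```

With n=200 per group both tests detect the shift easily; the d value tells you whether the difference is big enough to matter in practice.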

P-Value Guidelines

  • p < 0.05: statistically significant (conventional)
  • Always report effect size alongside p-value
  • Adjust for multiple comparisons (Bonferroni, FDR)
  • Statistical significance is not practical significance

Visualization Decision Table

| Data Type | Plot | Library |
|---|---|---|
| Distribution | Histogram, KDE, Box plot | seaborn |
| Comparison | Bar chart, Grouped bar | matplotlib |
| Correlation | Scatter, Heatmap | seaborn |
| Trend | Line chart | matplotlib/plotly |
| Composition | Stacked bar, Pie (max 5 slices) | matplotlib |
| Interactive | Scatter, Line, Dashboard | plotly |

Visualization Rules

  • Title every plot descriptively
  • Label axes with units
  • Use colorblind-safe palettes (seaborn: `colorblind`)
  • Start y-axis at 0 for bar charts
  • Annotate key findings directly on plots

Jupyter Notebook Structure

1. ## Setup (imports, configuration)
2. ## Data Loading
3. ## Exploratory Data Analysis
4. ## Data Preprocessing
5. ## Feature Engineering
6. ## Modeling
7. ## Evaluation
8. ## Conclusions

Notebook Best Practices

  • Restart and run all before sharing
  • Keep cells focused and sequential
  • Use markdown cells for explanations
  • Extract reusable code to `.py` modules
  • Version control with `nbstripout`
  • Pin all dependency versions

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Training on test data | Data leakage, inflated metrics | Strict train/test separation |
| Feature engineering before split | Leaks test information into features | Engineer on training data only |
| Reporting training metrics | Not generalizable | Report validation/test metrics |
| Accuracy on imbalanced data | Misleading (majority class wins) | Use F1, AUC-ROC, or AP |
| Tuning on test set | Overfitting to test data | Use validation set for tuning |
| No baseline comparison | Cannot measure improvement | Always establish baseline first |
| Cherry-picking evaluation examples | Selection bias | Report on full evaluation set |
| Deploying without drift monitoring | Silent model degradation | Monitor input distributions |

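The simplest guard against the two leakage rows is to put every transformer inside a `Pipeline`, so cross-validation re-fits preprocessing on each training fold. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# The scaler inside the pipeline is re-fit on each training fold only,
# so test-fold statistics never leak into preprocessing
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```

Calling `scaler.fit_transform(X)` before the split, by contrast, bakes test-set means and variances into the features — exactly the second anti-pattern above.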
Integration Points

| Skill | Relationship |
|---|---|
| senior-prompt-engineer | Prompt evaluation uses statistical testing methods |
| testing-strategy | ML testing follows the evaluation methodology |
| performance-optimization | Model inference optimization follows the measurement cycle |
| acceptance-testing | Model performance thresholds become acceptance criteria |
| llm-as-judge | Subjective output evaluation uses LLM-as-judge |
| code-review | Notebook and pipeline code reviewed for quality |

Skill Type

FLEXIBLE — Adapt preprocessing, modeling, and evaluation approaches to the specific data characteristics, business requirements, and compute constraints. The four-phase process and experiment tracking are strongly recommended. Always establish a baseline before modeling.