ai-ml-data-science

Data Science Engineering Suite - Quick Reference

This skill turns raw data and questions into validated, documented models ready for production:
  • EDA workflows: Structured exploration with drift detection
  • Feature engineering: Reproducible feature pipelines with leakage prevention and train/serve parity
  • Model selection: Baselines first; strong tabular defaults; escalate complexity only when justified
  • Evaluation & reporting: Slice analysis, uncertainty, model cards, production metrics
  • SQL transformation: SQLMesh for staging/intermediate/marts layers
  • MLOps: CI/CD, CT (continuous training), CM (continuous monitoring)
  • Production patterns: Data contracts, lineage, feedback loops, streaming features
Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.

Quick Reference

| Task | Tool/Framework | Command | When to Use |
|------|----------------|---------|-------------|
| EDA & Profiling | Pandas, Great Expectations | df.describe(), ge.validate() | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | df.transform(), Feast materialization | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | lgb.train(), model.fit() | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | optuna.create_study(), tune.run() | Optimizing model parameters |
| SQL Transformation | SQLMesh | sqlmesh plan, sqlmesh run | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | mlflow.log_metric(), wandb.log() | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | metrics.roc_auc_score(), slice analysis | Validating model performance |

Data Lake & Lakehouse

For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:
  • Table formats: Apache Iceberg, Delta Lake, Apache Hudi
  • Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
  • Transformation: dbt (an alternative to SQLMesh)
  • Ingestion: dlt, Airbyte (connectors)
  • Streaming: Apache Kafka patterns
  • Orchestration: Dagster, Airflow
This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.

Related Skills

For adjacent topics, reference:
  • ai-mlops - APIs, batch jobs, monitoring, drift, data ingestion (dlt)
  • ai-llm - LLM prompting, fine-tuning, evaluation
  • ai-rag - RAG pipelines, chunking, retrieval
  • ai-llm-inference - LLM inference optimization, quantization
  • ai-ml-timeseries - Time series forecasting, backtesting
  • qa-testing-strategy - Test-driven development, coverage
  • data-sql-optimization - SQL optimization, index patterns (complements SQLMesh)
  • data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)

Decision Tree: Choosing Data Science Approach

User needs ML for: [Problem Type]
  - Tabular data?
    - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
    - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
    - High-dim sparse (text, counts)? -> Linear models, then shallow NN

  - Time series?
    - Seasonality? -> LightGBM, then see ai-ml-timeseries
    - Long-term dependencies? -> Transformers (see ai-ml-timeseries)

  - Text or mixed modalities?
    - LLMs/Transformers -> See ai-llm

  - SQL transformations?
    - SQLMesh (staging/intermediate/marts layers)
Rule of thumb: For tabular data, tree-based gradient boosting is a strong baseline, but must be validated against alternatives and constraints.

Core Concepts (Vendor-Agnostic)

  • Problem framing: define success metrics, baselines, and decision thresholds before modeling.
  • Leakage prevention: ensure all features are available at prediction time; split by time/group when appropriate.
  • Uncertainty: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics.
  • Reproducibility: version code/data/features, fix seeds, and record the environment.
  • Operational handoff: define monitoring, retraining triggers, and rollback criteria with MLOps.
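The leakage-prevention bullet (split by time when appropriate) can be illustrated with a minimal time-based split; column names are illustrative:

```python
# Time-based split: train strictly before the cutoff, evaluate strictly
# after it, so no feature or label from the "future" leaks into training.
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2025-01-01", periods=10, freq="D"),
    "y": range(10),
})
cutoff = pd.Timestamp("2025-01-08")
train = df[df["ts"] < cutoff]
test = df[df["ts"] >= cutoff]
print(len(train), len(test))  # 7 3
```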

Implementation Practices (Tooling Examples)

  • Track experiments and artifacts (run id, commit hash, data version).
  • Add data validation gates in pipelines (schema + distribution + freshness).
  • Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).
  • Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).
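A data validation gate along these lines can be sketched in plain pandas (column names and the freshness SLA are illustrative; Great Expectations or Pandera would replace this hand-rolled check in practice):

```python
# Minimal pipeline gate (sketch): schema + range + freshness checks run
# before training. Column names and the 7-day SLA are illustrative.
import pandas as pd

def validate(df: pd.DataFrame, max_age_days: int = 7) -> list[str]:
    errors = []
    # Schema: required columns must be present
    for col in ("user_id", "amount", "event_ts"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
    # Range: amounts must be non-negative
    if "amount" in df.columns and (df["amount"] < 0).any():
        errors.append("negative amount values")
    # Freshness: newest event must fall within the SLA window
    if "event_ts" in df.columns:
        age = pd.Timestamp.now() - df["event_ts"].max()
        if age > pd.Timedelta(days=max_age_days):
            errors.append(f"stale data: {age}")
    return errors

df = pd.DataFrame({
    "user_id": [1, 2],
    "amount": [10.0, -5.0],
    "event_ts": [pd.Timestamp.now()] * 2,
})
print(validate(df))  # ['negative amount values']
```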

Do / Avoid

Do
  • Do start with baselines and a simple model to expose leakage and data issues early.
  • Do run slice analysis and document failure modes before recommending deployment.
  • Do keep an immutable eval set; refresh training data without contaminating evaluation.
Avoid
  • Avoid random splits for temporal or user-correlated data.
  • Avoid "metric gaming" (optimizing the number without validating business impact).
  • Avoid training on labels created after the prediction timestamp (silent future leakage).

Core Patterns (Overview)

Pattern 1: End-to-End DS Project Lifecycle

Use when: Starting or restructuring any DS/ML project.
Stages:
  1. Problem framing - Business objective, success metrics, baseline
  2. Data & feasibility - Sources, coverage, granularity, label quality
  3. EDA & data quality - Schema, missingness, outliers, leakage checks
  4. Feature engineering - Per data type with feature store integration
  5. Modelling - Baselines first, then LightGBM, then complexity as needed
  6. Evaluation - Offline metrics, slice analysis, error analysis
  7. Reporting - Model evaluation report + model card
  8. MLOps - CI/CD, CT (continuous training), CM (continuous monitoring)
Detailed guide: EDA Best Practices

Pattern 2: Feature Engineering

Use when: Designing features before modelling or during model improvement.
By data type:
  • Numeric: Standardize, handle outliers, transform skew, scale
  • Categorical: One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)
    • Feature Store Integration: Store encoders, mappings, statistics centrally
  • Text: Cleaning, TF-IDF, embeddings, simple stats
  • Time: Calendar features, recency, rolling/lag features
Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.
Detailed guide: Feature Engineering Patterns
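The rolling/lag bullet under "Time" can be sketched in pandas (column names are illustrative); the per-group shift ensures each user's features only ever see that user's own past:

```python
# Lag and rolling features per entity (sketch). shift(1) before rolling
# keeps every feature strictly in the past relative to the current row.
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03",
                          "2025-01-01", "2025-01-02"]),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
}).sort_values(["user", "ts"])

# Lag: yesterday's amount, per user (today never sees its own value)
df["amount_lag1"] = df.groupby("user")["amount"].shift(1)
# Rolling mean of the previous 2 observations, per user, past only
df["amount_roll2"] = df.groupby("user")["amount"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)
print(df)
```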

Pattern 3: Data Contracts & Lineage

Use when: Building production ML systems with data quality requirements.
Components:
  • Contracts: Schema + ranges/nullability + freshness SLAs
  • Lineage: Track source -> feature store -> train -> serve
  • Feature store hygiene: Materialization cadence, backfill/replay, encoder versioning
  • Schema evolution: Backward/forward-compatible migrations with shadow runs
Detailed guide: Data Contracts & Lineage

Pattern 4: Model Selection & Training

Use when: Picking model families and starting experiments.
Decision guide (modern benchmarks):
  • Tabular: Start with a strong baseline (linear/logistic, then gradient boosting) and iterate based on error analysis
  • Baselines: Always implement simple baselines first (majority class, mean, naive forecast)
  • Train/val/test splits: Time-based (forecasting), group-based (user/item leakage), or random (IID)
  • Hyperparameter tuning: Start manual, then Bayesian optimization (Optuna, Ray Tune)
  • Overfitting control: Regularization, early stopping, cross-validation
Detailed guide: Modelling Patterns

Pattern 5: Evaluation & Reporting

Use when: Finalizing a model candidate or handing over to production.
Key components:
  • Metric selection: Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)
  • Threshold selection: ROC/PR curves, cost-sensitive, F1 maximization
  • Slice analysis: Performance by geography, user segments, product categories
  • Error analysis: Collect high-error examples, cluster by error type, identify systematic failures
  • Uncertainty: Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks
  • Evaluation report: 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)
  • Model card: Documentation for stakeholders (intended use, data, performance, ethics, operations)
Detailed guide: Evaluation Patterns
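Slice analysis and bootstrap uncertainty can be sketched together on synthetic predictions (the "segment" column and the per-slice error rates are illustrative assumptions):

```python
# Slice analysis plus a bootstrap CI for accuracy (sketch).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.choice(["US", "EU"], size=1000),
    "y_true": rng.integers(0, 2, size=1000),
})
# Synthetic model: flips 10% of labels on US rows, 30% on EU rows
flip = rng.random(1000) < np.where(df["segment"] == "US", 0.1, 0.3)
df["y_pred"] = np.where(flip, 1 - df["y_true"], df["y_true"])

# Per-slice accuracy exposes gaps a single global number hides
by_slice = (df["y_true"] == df["y_pred"]).groupby(df["segment"]).mean()
print(by_slice)

# Bootstrap 95% CI for overall accuracy instead of a point estimate
accs = []
for _ in range(500):
    s = df.sample(frac=1, replace=True)
    accs.append((s["y_true"] == s["y_pred"]).mean())
lo, hi = np.percentile(accs, [2.5, 97.5])
print(f"accuracy 95% CI: [{lo:.3f}, {hi:.3f}]")
```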

Pattern 6: Reproducibility & MLOps

Use when: Ensuring experiments are reproducible and production-ready.
Modern MLOps (CI/CD/CT/CM):
  • CI (Continuous Integration): Automated testing, data validation, code quality
  • CD (Continuous Delivery): Environment-specific promotion (dev -> staging -> prod), canary deployment
  • CT (Continuous Training): Drift-triggered and scheduled retraining
  • CM (Continuous Monitoring): Real-time data drift, performance, system health
Versioning:
  • Code (git commit), data (DVC, LakeFS), features (feature store), models (MLflow Registry)
  • Seeds (reproducibility), hyperparameters (experiment tracker)
Detailed guide: Reproducibility Checklist
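The versioning bullets can be condensed into a single run record, sketched here with the standard library only (the commit value is a placeholder; MLflow or W&B would persist this in practice):

```python
# Run-metadata record for reproducibility (sketch): seed, code version,
# data hash, and hyperparameters captured together as JSON.
import hashlib
import json
import random

SEED = 42
random.seed(SEED)  # fix seeds for every RNG the pipeline touches

data_bytes = b"col1,col2\n1,2\n3,4\n"  # stand-in for the training data file
run_record = {
    "seed": SEED,
    "git_commit": "<output of `git rev-parse HEAD`>",  # placeholder
    "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    "params": {"learning_rate": 0.05, "num_leaves": 31},
}
print(json.dumps(run_record, indent=2))
```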

Pattern 7: Feature Freshness & Streaming

Use when: Managing real-time features and streaming pipelines.
Components:
  • Freshness contracts: Define freshness SLAs per feature, monitor lag, alert on breaches
  • Batch + stream parity: Same feature logic across batch/stream, idempotent upserts
  • Schema evolution: Version schemas, add forward/backward-compatible parsers, backfill with rollback
  • Data quality gates: PII/format checks, range checks, distribution drift (KL, KS, PSI)
Detailed guide: Feature Freshness & Streaming
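The PSI drift check from the quality-gates bullet can be sketched in numpy (binning and the commonly cited PSI > 0.2 threshold are assumptions, not fixed standards):

```python
# Population Stability Index between a reference (training) distribution
# and a live feature distribution (sketch).
import numpy as np

def psi(ref, live, bins=10):
    edges = np.histogram_bin_edges(ref, bins=bins)
    p = np.histogram(ref, bins=edges)[0] / len(ref)
    # Clip live values into the reference range so outliers land in edge bins
    live_clipped = np.clip(live, edges[0], edges[-1])
    q = np.histogram(live_clipped, bins=edges)[0] / len(live)
    p = np.clip(p, 1e-6, None)  # avoid log(0) on empty bins
    q = np.clip(q, 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)
print(psi(ref, rng.normal(0, 1, 10_000)))  # same distribution -> near 0
print(psi(ref, rng.normal(1, 1, 10_000)))  # shifted mean -> large PSI
```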

Pattern 8: Production Feedback Loops

Use when: Capturing production signals and implementing continuous improvement.
Components:
  • Signal capture: Log predictions + user edits/acceptance/abandonment (scrub PII)
  • Labeling: Route failures/edge cases to human review, create balanced sets
  • Dataset refresh: Periodic refresh (weekly/monthly) with lineage, protect eval set
  • Online eval: Shadow/canary new models, track solve rate, calibration, cost, latency
Detailed guide: Production Feedback Loops
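The signal-capture bullet can be sketched as a logging helper that scrubs PII before anything is persisted (field names and the PII list are illustrative):

```python
# Prediction + feedback logging with PII scrubbing (sketch). In production
# the JSON line would go to a log stream rather than being returned.
import json

PII_FIELDS = {"email", "phone", "name"}  # illustrative deny-list

def log_prediction(features, prediction, feedback=None):
    scrubbed = {k: v for k, v in features.items() if k not in PII_FIELDS}
    record = {"features": scrubbed, "prediction": prediction,
              "feedback": feedback}
    return json.dumps(record)

line = log_prediction(
    {"email": "a@b.com", "amount": 42.0, "country": "DE"},
    prediction=0.87,
    feedback="accepted",
)
print(line)
```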

Resources (Detailed Guides)

For comprehensive operational patterns and checklists, see:
  • EDA Best Practices - Structured workflow for exploratory data analysis
  • Feature Engineering Patterns - Operational patterns by data type
  • Data Contracts & Lineage - Data quality, versioning, feature store ops
  • Modelling Patterns - Model selection, hyperparameter tuning, train/test splits
  • Evaluation Patterns - Metrics, slice analysis, evaluation reports, model cards
  • Reproducibility Checklist - Experiment tracking, MLOps (CI/CD/CT/CM)
  • Feature Freshness & Streaming - Real-time features, schema evolution
  • Production Feedback Loops - Online learning, labeling, canary deployment

Templates

Use these as copy-paste starting points:

Project & Workflow Templates

  • Standard DS project template:
    assets/project/template-standard.md
  • Quick DS experiment template:
    assets/project/template-quick.md

Feature Engineering & EDA

  • Feature engineering template:
    assets/features/template-feature-engineering.md
  • EDA checklist & notebook template:
    assets/eda/template-eda.md

Evaluation & Reporting

  • Model evaluation report:
    assets/evaluation/template-evaluation-report.md
  • Model card:
    assets/evaluation/template-model-card.md
  • ML experiment review:
    assets/review/experiment-review-template.md

SQL Transformation (SQLMesh)

For SQL-based data transformation and feature engineering:
  • SQLMesh project setup:
    ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md
  • SQLMesh model types:
    ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md
    (FULL, INCREMENTAL, VIEW)
  • Incremental models:
    ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md
  • DAG and dependencies:
    ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md
  • Testing and data quality:
    ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md
Use SQLMesh when:
  • Building SQL-based feature pipelines
  • Managing incremental data transformations
  • Creating staging/intermediate/marts layers
  • Testing SQL logic with unit tests and audits
For data ingestion (loading raw data), use:
  • ai-mlops skill (dlt templates for REST APIs, databases, warehouses)

Navigation

Resources
  • references/reproducibility-checklist.md
  • references/evaluation-patterns.md
  • references/feature-engineering-patterns.md
  • references/modelling-patterns.md
  • references/feature-freshness-streaming.md
  • references/eda-best-practices.md
  • references/data-contracts-lineage.md
  • references/production-feedback-loops.md
Templates
  • assets/project/template-standard.md
  • assets/project/template-quick.md
  • assets/features/template-feature-engineering.md
  • assets/eda/template-eda.md
  • assets/evaluation/template-evaluation-report.md
  • assets/evaluation/template-model-card.md
  • assets/review/experiment-review-template.md
  • template-sqlmesh-project.md
  • template-sqlmesh-model.md
  • template-sqlmesh-incremental.md
  • template-sqlmesh-dag.md
  • template-sqlmesh-testing.md
Data
  • data/sources.json - Curated external references

External Resources

See data/sources.json for curated foundational and implementation references:
  • Core ML/DL: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX
  • Data processing: pandas, NumPy, Polars, DuckDB, Spark, Dask
  • SQL transformation: SQLMesh, dbt (staging/marts/incremental patterns)
  • Feature stores: Feast, Tecton, Databricks Feature Store (centralized feature management)
  • Data validation: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)
  • Visualization: Matplotlib, Seaborn, Plotly, Streamlit, Dash
  • MLOps: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)
  • Hyperparameter tuning: Optuna, Ray Tune, Hyperopt
  • Model serving: BentoML, FastAPI, TorchServe, Seldon, Ray Serve
  • Orchestration: Kubeflow, Metaflow, Prefect, Airflow, ZenML
  • Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake
Use this skill to execute data science projects end-to-end: concrete checklists, patterns, and templates, not theory.