ai-ml-data-science
Data Science Engineering Suite - Quick Reference
This skill turns raw data and questions into validated, documented models ready for production:
- EDA workflows: Structured exploration with drift detection
- Feature engineering: Reproducible feature pipelines with leakage prevention and train/serve parity
- Model selection: Baselines first; strong tabular defaults; escalate complexity only when justified
- Evaluation & reporting: Slice analysis, uncertainty, model cards, production metrics
- SQL transformation: SQLMesh for staging/intermediate/marts layers
- MLOps: CI/CD, CT (continuous training), CM (continuous monitoring)
- Production patterns: Data contracts, lineage, feedback loops, streaming features
Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.
Quick Reference
快速参考
| Task | Tool/Framework | When to Use |
|---|---|---|
| EDA & Profiling | Pandas, Great Expectations | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | Optimizing model parameters |
| SQL Transformation | SQLMesh | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | Validating model performance |
Data Lake & Lakehouse
For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:
- Table formats: Apache Iceberg, Delta Lake, Apache Hudi
- Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
- Alternative transformation: dbt (alternative to SQLMesh)
- Ingestion: dlt, Airbyte (connectors)
- Streaming: Apache Kafka patterns
- Orchestration: Dagster, Airflow
This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.
Related Skills
For adjacent topics, reference:
- ai-mlops - APIs, batch jobs, monitoring, drift, data ingestion (dlt)
- ai-llm - LLM prompting, fine-tuning, evaluation
- ai-rag - RAG pipelines, chunking, retrieval
- ai-llm-inference - LLM inference optimization, quantization
- ai-ml-timeseries - Time series forecasting, backtesting
- qa-testing-strategy - Test-driven development, coverage
- data-sql-optimization - SQL optimization, index patterns (complements SQLMesh)
- data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)
Decision Tree: Choosing Data Science Approach
```text
User needs ML for: [Problem Type]
- Tabular data?
  - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
  - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
  - High-dim sparse (text, counts)? -> Linear models, then shallow NN
- Time series?
  - Seasonality? -> LightGBM, then see ai-ml-timeseries
  - Long-term dependencies? -> Transformers (see ai-ml-timeseries)
- Text or mixed modalities?
  - LLMs/Transformers -> See ai-llm
- SQL transformations?
  - SQLMesh (staging/intermediate/marts layers)
```

Rule of thumb: For tabular data, tree-based gradient boosting is a strong baseline, but must be validated against alternatives and constraints.
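The branching above can be encoded as a small helper, which is sometimes handy for documenting the team's defaults in code. The function name and thresholds below are illustrative rules of thumb, not part of any library:

```python
def choose_approach(data_type: str, n_rows: int = 0, sparse: bool = False,
                    long_term_deps: bool = False) -> str:
    """Illustrative encoding of the decision tree above; thresholds are rules of thumb."""
    if data_type == "tabular":
        if sparse:
            return "linear models, then shallow NN"
        if n_rows < 1_000_000:
            return "LightGBM"
        return "LightGBM first, then NN if needed"
    if data_type == "timeseries":
        if long_term_deps:
            return "Transformers (see ai-ml-timeseries)"
        return "LightGBM, then see ai-ml-timeseries"
    if data_type in ("text", "multimodal"):
        return "LLMs/Transformers (see ai-llm)"
    if data_type == "sql":
        return "SQLMesh (staging/intermediate/marts layers)"
    raise ValueError(f"unknown data type: {data_type}")
```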
Core Concepts (Vendor-Agnostic)
- Problem framing: define success metrics, baselines, and decision thresholds before modeling.
- Leakage prevention: ensure all features are available at prediction time; split by time/group when appropriate.
- Uncertainty: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics.
- Reproducibility: version code/data/features, fix seeds, and record the environment.
- Operational handoff: define monitoring, retraining triggers, and rollback criteria with MLOps.
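The split-by-time guard against leakage is simple enough to show directly. A minimal sketch in plain Python (real pipelines would use `sklearn.model_selection.TimeSeriesSplit` or a warehouse query; the record layout here is invented for illustration):

```python
from datetime import date

def time_based_split(rows, cutoff, ts_key="ts"):
    """Split records by timestamp instead of randomly, so the validation set
    only contains events strictly after everything in training: the basic
    guard against temporal leakage."""
    train = [r for r in rows if r[ts_key] <= cutoff]
    valid = [r for r in rows if r[ts_key] > cutoff]
    return train, valid

rows = [
    {"ts": date(2025, 1, 5), "y": 0},
    {"ts": date(2025, 2, 1), "y": 1},
    {"ts": date(2025, 3, 9), "y": 1},
]
train, valid = time_based_split(rows, cutoff=date(2025, 2, 1))
# train holds the first two events; valid holds only the later one
```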
Implementation Practices (Tooling Examples)
- Track experiments and artifacts (run id, commit hash, data version).
- Add data validation gates in pipelines (schema + distribution + freshness).
- Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).
- Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).
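A validation gate of the schema + distribution + freshness kind can be sketched in a few lines. This is a stand-in for what Great Expectations or Pandera would enforce; the schema format (`column -> (type, min, max)`) and the `loaded_at` field are invented for illustration:

```python
from datetime import datetime, timedelta

def validate_batch(records, schema, max_age: timedelta, now=None):
    """Minimal data-validation gate: type checks, range checks, and a
    freshness SLA. Returns a list of error strings; an empty list means
    the batch passes the gate."""
    now = now or datetime.now()
    errors = []
    for i, rec in enumerate(records):
        for col, (typ, lo, hi) in schema.items():
            if col not in rec:
                errors.append(f"row {i}: missing column {col}")
                continue
            val = rec[col]
            if not isinstance(val, typ):
                errors.append(f"row {i}: {col} has type {type(val).__name__}")
            elif lo is not None and not (lo <= val <= hi):
                errors.append(f"row {i}: {col}={val} outside [{lo}, {hi}]")
        if now - rec["loaded_at"] > max_age:
            errors.append(f"row {i}: stale (freshness SLA breached)")
    return errors
```

In a pipeline, a non-empty error list would fail the run before any training or serving step sees the data.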
Do / Avoid
Do
- Do start with baselines and a simple model to expose leakage and data issues early.
- Do run slice analysis and document failure modes before recommending deployment.
- Do keep an immutable eval set; refresh training data without contaminating evaluation.
Avoid
- Avoid random splits for temporal or user-correlated data.
- Avoid "metric gaming" (optimizing the number without validating business impact).
- Avoid training on labels created after the prediction timestamp (silent future leakage).
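For the user-correlated case, the fix is to assign whole groups to one side of the split. A deterministic sketch using hashing (an alternative to sklearn's `GroupShuffleSplit`; the record layout is illustrative):

```python
import hashlib

def group_split(rows, group_key="user_id", valid_frac=0.2):
    """Send every record of a group (e.g. all events of one user) to the
    same side of the split by hashing the group id, so no user leaks
    across the train/validation boundary."""
    train, valid = [], []
    for r in rows:
        h = int(hashlib.md5(str(r[group_key]).encode()).hexdigest(), 16)
        bucket = (h % 100) / 100.0  # stable pseudo-random value in [0, 1)
        (valid if bucket < valid_frac else train).append(r)
    return train, valid
```

Hash-based assignment is also stable across reruns and data refreshes, which random shuffling is not.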
Core Patterns (Overview)
Pattern 1: End-to-End DS Project Lifecycle
Use when: Starting or restructuring any DS/ML project.
Stages:
- Problem framing - Business objective, success metrics, baseline
- Data & feasibility - Sources, coverage, granularity, label quality
- EDA & data quality - Schema, missingness, outliers, leakage checks
- Feature engineering - Per data type with feature store integration
- Modelling - Baselines first, then LightGBM, then complexity as needed
- Evaluation - Offline metrics, slice analysis, error analysis
- Reporting - Model evaluation report + model card
- MLOps - CI/CD, CT (continuous training), CM (continuous monitoring)
Detailed guide: EDA Best Practices
Pattern 2: Feature Engineering
Use when: Designing features before modelling or during model improvement.
By data type:
- Numeric: Standardize, handle outliers, transform skew, scale
- Categorical: One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)
- Feature Store Integration: Store encoders, mappings, statistics centrally
- Text: Cleaning, TF-IDF, embeddings, simple stats
- Time: Calendar features, recency, rolling/lag features
Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.
Detailed guide: Feature Engineering Patterns
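For time features specifically, lag and rolling windows can be built from nothing more than an ordered series. The sketch below is illustrative (pandas `shift`/`rolling` or a feature store would do this at scale); note that each row only looks backward, which is the point-in-time correctness requirement:

```python
def add_lag_and_rolling(values, lags=(1, 7), window=3):
    """Build lag and rolling-mean features from an ordered series.
    Returns one feature dict per time step; None where history is
    insufficient (those rows are typically dropped or imputed)."""
    out = []
    for t, v in enumerate(values):
        feats = {"value": v}
        for lag in lags:
            # only past values are used: no future leakage
            feats[f"lag_{lag}"] = values[t - lag] if t >= lag else None
        if t + 1 >= window:
            recent = values[t + 1 - window: t + 1]
            feats[f"rolling_mean_{window}"] = sum(recent) / window
        else:
            feats[f"rolling_mean_{window}"] = None
        out.append(feats)
    return out
```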
Pattern 3: Data Contracts & Lineage
Use when: Building production ML systems with data quality requirements.
Components:
- Contracts: Schema + ranges/nullability + freshness SLAs
- Lineage: Track source -> feature store -> train -> serve
- Feature store hygiene: Materialization cadence, backfill/replay, encoder versioning
- Schema evolution: Backward/forward-compatible migrations with shadow runs
Detailed guide: Data Contracts & Lineage
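The lineage component can be as simple as a directed graph over named assets that answers "what is downstream of X?", which is the question to ask before any schema change. A toy sketch (the `Lineage` class and its API are invented for illustration; production systems persist this via OpenLineage or feature-store metadata):

```python
from collections import defaultdict

class Lineage:
    """Record edges (source -> feature -> model) and query impact."""
    def __init__(self):
        self.downstream = defaultdict(set)

    def add_edge(self, upstream: str, node: str):
        self.downstream[upstream].add(node)

    def impacted_by(self, node: str):
        """All transitive downstream assets of a node (depth-first walk)."""
        seen, stack = set(), [node]
        while stack:
            for nxt in self.downstream[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

lin = Lineage()
lin.add_edge("raw.orders", "features.order_totals")
lin.add_edge("features.order_totals", "model.churn_v3")
```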
Pattern 4: Model Selection & Training
Use when: Picking model families and starting experiments.
Decision guide (modern benchmarks):
- Tabular: Start with a strong baseline (linear/logistic, then gradient boosting) and iterate based on error analysis
- Baselines: Always implement simple baselines first (majority class, mean, naive forecast)
- Train/val/test splits: Time-based (forecasting), group-based (user/item leakage), or random (IID)
- Hyperparameter tuning: Start manual, then Bayesian optimization (Optuna, Ray Tune)
- Overfitting control: Regularization, early stopping, cross-validation
Detailed guide: Modelling Patterns
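The "baselines first" rule costs only a few lines and sets the floor any model must beat. Minimal sketches of two of the baselines named above (majority class and naive forecast):

```python
from collections import Counter

def majority_class_baseline(y_train, y_test):
    """Predict the most frequent training label for every test row and
    report accuracy: the floor any classifier must beat."""
    majority = Counter(y_train).most_common(1)[0][0]
    acc = sum(1 for y in y_test if y == majority) / len(y_test)
    return majority, acc

def naive_forecast_mae(series):
    """Naive forecast: predict y[t-1] for y[t]; the floor for forecasters."""
    errs = [abs(series[t] - series[t - 1]) for t in range(1, len(series))]
    return sum(errs) / len(errs)
```

If a tuned gradient-boosting model barely beats these numbers, that is usually a data or framing problem, not a modelling one.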
Pattern 5: Evaluation & Reporting
Use when: Finalizing a model candidate or handing over to production.
Key components:
- Metric selection: Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)
- Threshold selection: ROC/PR curves, cost-sensitive, F1 maximization
- Slice analysis: Performance by geography, user segments, product categories
- Error analysis: Collect high-error examples, cluster by error type, identify systematic failures
- Uncertainty: Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks
- Evaluation report: 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)
- Model card: Documentation for stakeholders (intended use, data, performance, ethics, operations)
Detailed guide: Evaluation Patterns
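The uncertainty item above, reporting an interval rather than a point estimate, is commonly done with a percentile bootstrap. A self-contained sketch that works with any `metric(y_true, y_pred)` callable:

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary metric:
    resample (true, pred) pairs with replacement, recompute the metric,
    and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def accuracy(yt, yp):
    return sum(a == b for a, b in zip(yt, yp)) / len(yt)
```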
Pattern 6: Reproducibility & MLOps
Use when: Ensuring experiments are reproducible and production-ready.
Modern MLOps (CI/CD/CT/CM):
- CI (Continuous Integration): Automated testing, data validation, code quality
- CD (Continuous Delivery): Environment-specific promotion (dev -> staging -> prod), canary deployment
- CT (Continuous Training): Drift-triggered and scheduled retraining
- CM (Continuous Monitoring): Real-time data drift, performance, system health
Versioning:
- Code (git commit), data (DVC, lakeFS), features (feature store), models (MLflow Registry)
- Seeds (reproducibility), hyperparameters (experiment tracker)
Detailed guide: Reproducibility Checklist
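The versioning bullets above boil down to one record per run. A minimal sketch of such a manifest; trackers like MLflow or W&B store the same fields, and writing it as JSON alongside the model artifact is the no-dependency fallback (field names here are illustrative):

```python
import hashlib
import json
import sys
from datetime import datetime, timezone

def run_manifest(params: dict, data_bytes: bytes, commit: str, seed: int) -> dict:
    """Build a reproducibility record: code version, data version, config,
    seed, and environment. The run id is derived deterministically from
    params + commit, so identical runs collide on purpose."""
    return {
        "run_id": hashlib.sha256(
            json.dumps(params, sort_keys=True).encode() + commit.encode()
        ).hexdigest()[:12],
        "commit": commit,
        "data_version": hashlib.sha256(data_bytes).hexdigest()[:12],
        "params": params,
        "seed": seed,
        "python": sys.version.split()[0],
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```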
Pattern 7: Feature Freshness & Streaming
Use when: Managing real-time features and streaming pipelines.
Components:
- Freshness contracts: Define freshness SLAs per feature, monitor lag, alert on breaches
- Batch + stream parity: Same feature logic across batch/stream, idempotent upserts
- Schema evolution: Version schemas, add forward/backward-compatible parsers, backfill with rollback
- Data quality gates: PII/format checks, range checks, distribution drift (KL, KS, PSI)
Detailed guide: Feature Freshness & Streaming
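Of the drift statistics listed above, PSI is the simplest to compute by hand. A self-contained sketch using quantile bins from the reference sample (libraries like Evidently implement this with more options; the common rule of thumb is < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant):

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference (e.g. training)
    sample and a live sample, with bins set at reference quantiles."""
    ref = sorted(expected)
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def frac(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # small epsilon avoids log(0) on empty bins
        return [(c + 1e-6) / (len(sample) + n_bins * 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```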
Pattern 8: Production Feedback Loops
Use when: Capturing production signals and implementing continuous improvement.
Components:
- Signal capture: Log predictions + user edits/acceptance/abandonment (scrub PII)
- Labeling: Route failures/edge cases to human review, create balanced sets
- Dataset refresh: Periodic refresh (weekly/monthly) with lineage, protect eval set
- Online eval: Shadow/canary new models, track solve rate, calibration, cost, latency
Detailed guide: Production Feedback Loops
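The signal-capture step, logging predictions with PII scrubbed, can be sketched in a few lines. The field list and log shape below are invented for illustration; real systems would derive the PII list from the data contract:

```python
import hashlib
import json

PII_FIELDS = {"email", "name", "phone"}  # illustrative; extend per data contract

def log_prediction(record: dict, prediction, outcome=None) -> str:
    """Build a JSON log line for the feedback loop: keep the prediction and
    the eventual user signal, but hash PII fields so raw identifiers never
    land in training data."""
    safe = {
        k: (hashlib.sha256(str(v).encode()).hexdigest()[:10]
            if k in PII_FIELDS else v)
        for k, v in record.items()
    }
    return json.dumps({"features": safe, "prediction": prediction,
                       "outcome": outcome}, sort_keys=True)
```

Hashing (rather than dropping) PII keeps the ability to join repeat users across log lines without storing identifiers.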
Resources (Detailed Guides)
For comprehensive operational patterns and checklists, see:
- EDA Best Practices - Structured workflow for exploratory data analysis
- Feature Engineering Patterns - Operational patterns by data type
- Data Contracts & Lineage - Data quality, versioning, feature store ops
- Modelling Patterns - Model selection, hyperparameter tuning, train/test splits
- Evaluation Patterns - Metrics, slice analysis, evaluation reports, model cards
- Reproducibility Checklist - Experiment tracking, MLOps (CI/CD/CT/CM)
- Feature Freshness & Streaming - Real-time features, schema evolution
- Production Feedback Loops - Online learning, labeling, canary deployment
Templates
Use these as copy-paste starting points:
Project & Workflow Templates
- Standard DS project template: assets/project/template-standard.md
- Quick DS experiment template: assets/project/template-quick.md
Feature Engineering & EDA
- Feature engineering template: assets/features/template-feature-engineering.md
- EDA checklist & notebook template: assets/eda/template-eda.md
Evaluation & Reporting
- Model evaluation report: assets/evaluation/template-evaluation-report.md
- Model card: assets/evaluation/template-model-card.md
- ML experiment review: assets/review/experiment-review-template.md
SQL Transformation (SQLMesh)
For SQL-based data transformation and feature engineering:
- SQLMesh project setup: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md
- SQLMesh model types (FULL, INCREMENTAL, VIEW): ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md
- Incremental models: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md
- DAG and dependencies: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md
- Testing and data quality: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md
Use SQLMesh when:
- Building SQL-based feature pipelines
- Managing incremental data transformations
- Creating staging/intermediate/marts layers
- Testing SQL logic with unit tests and audits
For data ingestion (loading raw data), use:
- ai-mlops skill (dlt templates for REST APIs, databases, warehouses)
Navigation
Resources
- references/reproducibility-checklist.md
- references/evaluation-patterns.md
- references/feature-engineering-patterns.md
- references/modelling-patterns.md
- references/feature-freshness-streaming.md
- references/eda-best-practices.md
- references/data-contracts-lineage.md
- references/production-feedback-loops.md
Templates
- assets/project/template-standard.md
- assets/project/template-quick.md
- assets/features/template-feature-engineering.md
- assets/eda/template-eda.md
- assets/evaluation/template-evaluation-report.md
- assets/evaluation/template-model-card.md
- assets/review/experiment-review-template.md
- template-sqlmesh-project.md
- template-sqlmesh-model.md
- template-sqlmesh-incremental.md
- template-sqlmesh-dag.md
- template-sqlmesh-testing.md
Data
- data/sources.json - Curated external references
External Resources
See data/sources.json for curated foundational and implementation references:
- Core ML/DL: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX
- Data processing: pandas, NumPy, Polars, DuckDB, Spark, Dask
- SQL transformation: SQLMesh, dbt (staging/marts/incremental patterns)
- Feature stores: Feast, Tecton, Databricks Feature Store (centralized feature management)
- Data validation: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)
- Visualization: Matplotlib, Seaborn, Plotly, Streamlit, Dash
- MLOps: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)
- Hyperparameter tuning: Optuna, Ray Tune, Hyperopt
- Model serving: BentoML, FastAPI, TorchServe, Seldon, Ray Serve
- Orchestration: Kubeflow, Metaflow, Prefect, Airflow, ZenML
- Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake
Use this skill to execute data science projects end-to-end: concrete checklists, patterns, and templates, not theory.