ai-ml-data-science
Data Science Engineering Suite - Quick Reference
This skill turns raw data and questions into validated, documented models ready for production:
- EDA workflows: Structured exploration with drift detection
- Feature engineering: Reproducible feature pipelines with leakage prevention and train/serve parity
- Model selection: Baselines first; strong tabular defaults; escalate complexity only when justified
- Evaluation & reporting: Slice analysis, uncertainty, model cards, production metrics
- SQL transformation: SQLMesh for staging/intermediate/marts layers
- MLOps: CI/CD, CT (continuous training), CM (continuous monitoring)
- Production patterns: Data contracts, lineage, feedback loops, streaming features
Modern emphasis (2026): Feature stores, automated retraining, drift monitoring (Evidently), train-serve parity, and agentic ML loops (plan -> execute -> evaluate -> improve). Tools: LightGBM, CatBoost, scikit-learn, PyTorch, Polars (lazy eval for larger-than-RAM datasets), lakeFS for data versioning.
Quick Reference
快速参考
| Task | Tool/Framework | When to Use |
|---|---|---|
| EDA & Profiling | Pandas, Great Expectations | Initial data exploration and quality checks |
| Feature Engineering | Pandas, Polars, Feature Stores | Creating lag, rolling, categorical features |
| Model Training | Gradient boosting, linear models, scikit-learn | Strong baselines for tabular ML |
| Hyperparameter Tuning | Optuna, Ray Tune | Optimizing model parameters |
| SQL Transformation | SQLMesh | Building staging/intermediate/marts layers |
| Experiment Tracking | MLflow, W&B | Versioning experiments and models |
| Model Evaluation | scikit-learn, custom metrics | Validating model performance |
Data Lake & Lakehouse
For comprehensive data lake/lakehouse patterns (beyond SQLMesh transformation), see data-lake-platform:
- Table formats: Apache Iceberg, Delta Lake, Apache Hudi
- Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
- Alternative transformation: dbt (alternative to SQLMesh)
- Ingestion: dlt, Airbyte (connectors)
- Streaming: Apache Kafka patterns
- Orchestration: Dagster, Airflow
This skill focuses on ML feature engineering and modeling. Use data-lake-platform for general-purpose data infrastructure.
Related Skills
For adjacent topics, reference:
- ai-mlops - APIs, batch jobs, monitoring, drift, data ingestion (dlt)
- ai-llm - LLM prompting, fine-tuning, evaluation
- ai-rag - RAG pipelines, chunking, retrieval
- ai-llm-inference - LLM inference optimization, quantization
- ai-ml-timeseries - Time series forecasting, backtesting
- qa-testing-strategy - Test-driven development, coverage
- data-sql-optimization - SQL optimization, index patterns (complements SQLMesh)
- data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)
Decision Tree: Choosing Data Science Approach
```text
User needs ML for: [Problem Type]
- Tabular data?
  - Small-medium (<1M rows)? -> LightGBM (fast, efficient)
  - Large and complex (>1M rows)? -> LightGBM first, then NN if needed
  - High-dim sparse (text, counts)? -> Linear models, then shallow NN
- Time series?
  - Seasonality? -> LightGBM, then see ai-ml-timeseries
  - Long-term dependencies? -> Transformers (see ai-ml-timeseries)
- Text or mixed modalities?
  - LLMs/Transformers -> See ai-llm
- SQL transformations?
  - SQLMesh (staging/intermediate/marts layers)
```

Rule of thumb: For tabular data, tree-based gradient boosting is a strong baseline, but must be validated against alternatives and constraints.
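The branching above can be encoded as a small helper, which is sometimes handy for documenting the team's defaults in code. The function name and thresholds below are illustrative rules of thumb, not part of any library:

```python
def choose_approach(data_type: str, n_rows: int = 0, sparse: bool = False,
                    long_term_deps: bool = False) -> str:
    """Illustrative encoding of the decision tree above; thresholds are rules of thumb."""
    if data_type == "tabular":
        if sparse:
            return "linear models, then shallow NN"
        if n_rows < 1_000_000:
            return "LightGBM"
        return "LightGBM first, then NN if needed"
    if data_type == "timeseries":
        if long_term_deps:
            return "Transformers (see ai-ml-timeseries)"
        return "LightGBM, then see ai-ml-timeseries"
    if data_type in ("text", "multimodal"):
        return "LLMs/Transformers (see ai-llm)"
    if data_type == "sql":
        return "SQLMesh (staging/intermediate/marts layers)"
    raise ValueError(f"unknown data type: {data_type}")
```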
Core Concepts (Vendor-Agnostic)
- Problem framing: define success metrics, baselines, and decision thresholds before modeling.
- Leakage prevention: ensure all features are available at prediction time; split by time/group when appropriate.
- Uncertainty: report confidence intervals and stability (fold variance, bootstrap) rather than single-point metrics.
- Reproducibility: version code/data/features, fix seeds, and record the environment.
- Operational handoff: define monitoring, retraining triggers, and rollback criteria with MLOps.
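The split-by-time guard against leakage is simple enough to show directly. A minimal sketch in plain Python (real pipelines would use `sklearn.model_selection.TimeSeriesSplit` or a warehouse query; the record layout here is invented for illustration):

```python
from datetime import date

def time_based_split(rows, cutoff, ts_key="ts"):
    """Split records by timestamp instead of randomly, so the validation set
    only contains events strictly after everything in training: the basic
    guard against temporal leakage."""
    train = [r for r in rows if r[ts_key] <= cutoff]
    valid = [r for r in rows if r[ts_key] > cutoff]
    return train, valid

rows = [
    {"ts": date(2025, 1, 5), "y": 0},
    {"ts": date(2025, 2, 1), "y": 1},
    {"ts": date(2025, 3, 9), "y": 1},
]
train, valid = time_based_split(rows, cutoff=date(2025, 2, 1))
# train holds the first two events; valid holds only the later one
```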
Implementation Practices (Tooling Examples)
- Track experiments and artifacts (run id, commit hash, data version).
- Add data validation gates in pipelines (schema + distribution + freshness).
- Prefer reproducible, testable feature code (shared transforms, point-in-time correctness).
- Use datasheets/model cards and eval reports as deployment prerequisites (Datasheets for Datasets: https://arxiv.org/abs/1803.09010; Model Cards: https://arxiv.org/abs/1810.03993).
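A validation gate of the schema + distribution + freshness kind can be sketched in a few lines. This is a stand-in for what Great Expectations or Pandera would enforce; the schema format (`column -> (type, min, max)`) and the `loaded_at` field are invented for illustration:

```python
from datetime import datetime, timedelta

def validate_batch(records, schema, max_age: timedelta, now=None):
    """Minimal data-validation gate: type checks, range checks, and a
    freshness SLA. Returns a list of error strings; an empty list means
    the batch passes the gate."""
    now = now or datetime.now()
    errors = []
    for i, rec in enumerate(records):
        for col, (typ, lo, hi) in schema.items():
            if col not in rec:
                errors.append(f"row {i}: missing column {col}")
                continue
            val = rec[col]
            if not isinstance(val, typ):
                errors.append(f"row {i}: {col} has type {type(val).__name__}")
            elif lo is not None and not (lo <= val <= hi):
                errors.append(f"row {i}: {col}={val} outside [{lo}, {hi}]")
        if now - rec["loaded_at"] > max_age:
            errors.append(f"row {i}: stale (freshness SLA breached)")
    return errors
```

In a pipeline, a non-empty error list would fail the run before any training or serving step sees the data.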
Do / Avoid
Do
- Do start with baselines and a simple model to expose leakage and data issues early.
- Do run slice analysis and document failure modes before recommending deployment.
- Do keep an immutable eval set; refresh training data without contaminating evaluation.
Avoid
- Avoid random splits for temporal or user-correlated data.
- Avoid "metric gaming" (optimizing the number without validating business impact).
- Avoid training on labels created after the prediction timestamp (silent future leakage).
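For the user-correlated case, the fix is to assign whole groups to one side of the split. A deterministic sketch using hashing (an alternative to sklearn's `GroupShuffleSplit`; the record layout is illustrative):

```python
import hashlib

def group_split(rows, group_key="user_id", valid_frac=0.2):
    """Send every record of a group (e.g. all events of one user) to the
    same side of the split by hashing the group id, so no user leaks
    across the train/validation boundary."""
    train, valid = [], []
    for r in rows:
        h = int(hashlib.md5(str(r[group_key]).encode()).hexdigest(), 16)
        bucket = (h % 100) / 100.0  # stable pseudo-random value in [0, 1)
        (valid if bucket < valid_frac else train).append(r)
    return train, valid
```

Hash-based assignment is also stable across reruns and data refreshes, which random shuffling is not.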
Core Patterns (Overview)
Pattern 1: End-to-End DS Project Lifecycle
Use when: Starting or restructuring any DS/ML project.
Stages:
- Problem framing - Business objective, success metrics, baseline
- Data & feasibility - Sources, coverage, granularity, label quality
- EDA & data quality - Schema, missingness, outliers, leakage checks
- Feature engineering - Per data type with feature store integration
- Modelling - Baselines first, then LightGBM, then complexity as needed
- Evaluation - Offline metrics, slice analysis, error analysis
- Reporting - Model evaluation report + model card
- MLOps - CI/CD, CT (continuous training), CM (continuous monitoring)
Detailed guide: EDA Best Practices
Pattern 2: Feature Engineering
Use when: Designing features before modelling or during model improvement.
By data type:
- Numeric: Standardize, handle outliers, transform skew, scale
- Categorical: One-hot/ordinal (low cardinality), target/frequency/hashing (high cardinality)
- Feature Store Integration: Store encoders, mappings, statistics centrally
- Text: Cleaning, TF-IDF, embeddings, simple stats
- Time: Calendar features, recency, rolling/lag features
Key Modern Practice: Use feature stores (Feast, Tecton, Databricks) for versioning, sharing, and train-serve parity.
Detailed guide: Feature Engineering Patterns
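For time features specifically, lag and rolling windows can be built from nothing more than an ordered series. The sketch below is illustrative (pandas `shift`/`rolling` or a feature store would do this at scale); note that each row only looks backward, which is the point-in-time correctness requirement:

```python
def add_lag_and_rolling(values, lags=(1, 7), window=3):
    """Build lag and rolling-mean features from an ordered series.
    Returns one feature dict per time step; None where history is
    insufficient (those rows are typically dropped or imputed)."""
    out = []
    for t, v in enumerate(values):
        feats = {"value": v}
        for lag in lags:
            # only past values are used: no future leakage
            feats[f"lag_{lag}"] = values[t - lag] if t >= lag else None
        if t + 1 >= window:
            recent = values[t + 1 - window: t + 1]
            feats[f"rolling_mean_{window}"] = sum(recent) / window
        else:
            feats[f"rolling_mean_{window}"] = None
        out.append(feats)
    return out
```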
Pattern 3: Data Contracts & Lineage
Use when: Building production ML systems with data quality requirements.
Components:
- Contracts: Schema + ranges/nullability + freshness SLAs
- Lineage: Track source -> feature store -> train -> serve
- Feature store hygiene: Materialization cadence, backfill/replay, encoder versioning
- Schema evolution: Backward/forward-compatible migrations with shadow runs
Detailed guide: Data Contracts & Lineage
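The lineage component can be as simple as a directed graph over named assets that answers "what is downstream of X?", which is the question to ask before any schema change. A toy sketch (the `Lineage` class and its API are invented for illustration; production systems persist this via OpenLineage or feature-store metadata):

```python
from collections import defaultdict

class Lineage:
    """Record edges (source -> feature -> model) and query impact."""
    def __init__(self):
        self.downstream = defaultdict(set)

    def add_edge(self, upstream: str, node: str):
        self.downstream[upstream].add(node)

    def impacted_by(self, node: str):
        """All transitive downstream assets of a node (depth-first walk)."""
        seen, stack = set(), [node]
        while stack:
            for nxt in self.downstream[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

lin = Lineage()
lin.add_edge("raw.orders", "features.order_totals")
lin.add_edge("features.order_totals", "model.churn_v3")
```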
Pattern 4: Model Selection & Training
Use when: Picking model families and starting experiments.
Decision guide (modern benchmarks):
- Tabular: Start with a strong baseline (linear/logistic, then gradient boosting) and iterate based on error analysis
- Baselines: Always implement simple baselines first (majority class, mean, naive forecast)
- Train/val/test splits: Time-based (forecasting), group-based (user/item leakage), or random (IID)
- Hyperparameter tuning: Start manual, then Bayesian optimization (Optuna, Ray Tune)
- Overfitting control: Regularization, early stopping, cross-validation
Detailed guide: Modelling Patterns
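The "baselines first" rule costs only a few lines and sets the floor any model must beat. Minimal sketches of two of the baselines named above (majority class and naive forecast):

```python
from collections import Counter

def majority_class_baseline(y_train, y_test):
    """Predict the most frequent training label for every test row and
    report accuracy: the floor any classifier must beat."""
    majority = Counter(y_train).most_common(1)[0][0]
    acc = sum(1 for y in y_test if y == majority) / len(y_test)
    return majority, acc

def naive_forecast_mae(series):
    """Naive forecast: predict y[t-1] for y[t]; the floor for forecasters."""
    errs = [abs(series[t] - series[t - 1]) for t in range(1, len(series))]
    return sum(errs) / len(errs)
```

If a tuned gradient-boosting model barely beats these numbers, that is usually a data or framing problem, not a modelling one.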
Pattern 5: Evaluation & Reporting
Use when: Finalizing a model candidate or handing over to production.
Key components:
- Metric selection: Primary (ROC-AUC, PR-AUC, RMSE) + guardrails (calibration, fairness)
- Threshold selection: ROC/PR curves, cost-sensitive, F1 maximization
- Slice analysis: Performance by geography, user segments, product categories
- Error analysis: Collect high-error examples, cluster by error type, identify systematic failures
- Uncertainty: Confidence intervals (bootstrap where appropriate), variance across folds, and stability checks
- Evaluation report: 8-section report (objective, data, features, models, metrics, slices, risks, recommendation)
- Model card: Documentation for stakeholders (intended use, data, performance, ethics, operations)
Detailed guide: Evaluation Patterns
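The uncertainty item above, reporting an interval rather than a point estimate, is commonly done with a percentile bootstrap. A self-contained sketch that works with any `metric(y_true, y_pred)` callable:

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary metric:
    resample (true, pred) pairs with replacement, recompute the metric,
    and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def accuracy(yt, yp):
    return sum(a == b for a, b in zip(yt, yp)) / len(yt)
```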
Pattern 6: Reproducibility & MLOps
Use when: Ensuring experiments are reproducible and production-ready.
Modern MLOps (CI/CD/CT/CM):
- CI (Continuous Integration): Automated testing, data validation, code quality
- CD (Continuous Delivery): Environment-specific promotion (dev -> staging -> prod), canary deployment
- CT (Continuous Training): Drift-triggered and scheduled retraining
- CM (Continuous Monitoring): Real-time data drift, performance, system health
Versioning:
- Code (git commit), data (DVC, lakeFS), features (feature store), models (MLflow Registry)
- Seeds (reproducibility), hyperparameters (experiment tracker)
Detailed guide: Reproducibility Checklist
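The versioning bullets above boil down to one record per run. A minimal sketch of such a manifest; trackers like MLflow or W&B store the same fields, and writing it as JSON alongside the model artifact is the no-dependency fallback (field names here are illustrative):

```python
import hashlib
import json
import sys
from datetime import datetime, timezone

def run_manifest(params: dict, data_bytes: bytes, commit: str, seed: int) -> dict:
    """Build a reproducibility record: code version, data version, config,
    seed, and environment. The run id is derived deterministically from
    params + commit, so identical runs collide on purpose."""
    return {
        "run_id": hashlib.sha256(
            json.dumps(params, sort_keys=True).encode() + commit.encode()
        ).hexdigest()[:12],
        "commit": commit,
        "data_version": hashlib.sha256(data_bytes).hexdigest()[:12],
        "params": params,
        "seed": seed,
        "python": sys.version.split()[0],
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```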
Pattern 7: Feature Freshness & Streaming
Use when: Managing real-time features and streaming pipelines.
Components:
- Freshness contracts: Define freshness SLAs per feature, monitor lag, alert on breaches
- Batch + stream parity: Same feature logic across batch/stream, idempotent upserts
- Schema evolution: Version schemas, add forward/backward-compatible parsers, backfill with rollback
- Data quality gates: PII/format checks, range checks, distribution drift (KL, KS, PSI)
Detailed guide: Feature Freshness & Streaming
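Of the drift statistics listed above, PSI is the simplest to compute by hand. A self-contained sketch using quantile bins from the reference sample (libraries like Evidently implement this with more options; the common rule of thumb is < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant):

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference (e.g. training)
    sample and a live sample, with bins set at reference quantiles."""
    ref = sorted(expected)
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def frac(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # small epsilon avoids log(0) on empty bins
        return [(c + 1e-6) / (len(sample) + n_bins * 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```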
Pattern 8: Production Feedback Loops
Use when: Capturing production signals and implementing continuous improvement.
Components:
- Signal capture: Log predictions + user edits/acceptance/abandonment (scrub PII)
- Labeling: Route failures/edge cases to human review, create balanced sets
- Dataset refresh: Periodic refresh (weekly/monthly) with lineage, protect eval set
- Online eval: Shadow/canary new models, track solve rate, calibration, cost, latency
Detailed guide: Production Feedback Loops
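The signal-capture step, logging predictions with PII scrubbed, can be sketched in a few lines. The field list and log shape below are invented for illustration; real systems would derive the PII list from the data contract:

```python
import hashlib
import json

PII_FIELDS = {"email", "name", "phone"}  # illustrative; extend per data contract

def log_prediction(record: dict, prediction, outcome=None) -> str:
    """Build a JSON log line for the feedback loop: keep the prediction and
    the eventual user signal, but hash PII fields so raw identifiers never
    land in training data."""
    safe = {
        k: (hashlib.sha256(str(v).encode()).hexdigest()[:10]
            if k in PII_FIELDS else v)
        for k, v in record.items()
    }
    return json.dumps({"features": safe, "prediction": prediction,
                       "outcome": outcome}, sort_keys=True)
```

Hashing (rather than dropping) PII keeps the ability to join repeat users across log lines without storing identifiers.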
Resources (Detailed Guides)
For comprehensive operational patterns and checklists, see:
- EDA Best Practices - Structured workflow for exploratory data analysis
- Feature Engineering Patterns - Operational patterns by data type
- Data Contracts & Lineage - Data quality, versioning, feature store ops
- Modelling Patterns - Model selection, hyperparameter tuning, train/test splits
- Evaluation Patterns - Metrics, slice analysis, evaluation reports, model cards
- Reproducibility Checklist - Experiment tracking, MLOps (CI/CD/CT/CM)
- Feature Freshness & Streaming - Real-time features, schema evolution
- Production Feedback Loops - Online learning, labeling, canary deployment
Templates
Use these as copy-paste starting points:
Project & Workflow Templates
- Standard DS project template: assets/project/template-standard.md
- Quick DS experiment template: assets/project/template-quick.md
Feature Engineering & EDA
- Feature engineering template: assets/features/template-feature-engineering.md
- EDA checklist & notebook template: assets/eda/template-eda.md
Evaluation & Reporting
- Model evaluation report: assets/evaluation/template-evaluation-report.md
- Model card: assets/evaluation/template-model-card.md
- ML experiment review: assets/review/experiment-review-template.md
SQL Transformation (SQLMesh)
For SQL-based data transformation and feature engineering:
- SQLMesh project setup: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-project.md
- SQLMesh model types (FULL, INCREMENTAL, VIEW): ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-model.md
- Incremental models: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-incremental.md
- DAG and dependencies: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-dag.md
- Testing and data quality: ../data-lake-platform/assets/transformation/sqlmesh/template-sqlmesh-testing.md
Use SQLMesh when:
- Building SQL-based feature pipelines
- Managing incremental data transformations
- Creating staging/intermediate/marts layers
- Testing SQL logic with unit tests and audits
For data ingestion (loading raw data), use:
- ai-mlops skill (dlt templates for REST APIs, databases, warehouses)
Navigation
Resources
- references/reproducibility-checklist.md
- references/evaluation-patterns.md
- references/feature-engineering-patterns.md
- references/modelling-patterns.md
- references/feature-freshness-streaming.md
- references/eda-best-practices.md
- references/data-contracts-lineage.md
- references/production-feedback-loops.md
Templates
- assets/project/template-standard.md
- assets/project/template-quick.md
- assets/features/template-feature-engineering.md
- assets/eda/template-eda.md
- assets/evaluation/template-evaluation-report.md
- assets/evaluation/template-model-card.md
- assets/review/experiment-review-template.md
- template-sqlmesh-project.md
- template-sqlmesh-model.md
- template-sqlmesh-incremental.md
- template-sqlmesh-dag.md
- template-sqlmesh-testing.md
Data
- data/sources.json - Curated external references
External Resources
See data/sources.json for curated foundational and implementation references:
- Core ML/DL: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, JAX
- Data processing: pandas, NumPy, Polars, DuckDB, Spark, Dask
- SQL transformation: SQLMesh, dbt (staging/marts/incremental patterns)
- Feature stores: Feast, Tecton, Databricks Feature Store (centralized feature management)
- Data validation: Pydantic, Great Expectations, Pandera, Evidently (quality + drift)
- Visualization: Matplotlib, Seaborn, Plotly, Streamlit, Dash
- MLOps: MLflow, W&B, DVC, Neptune (experiment tracking + model registry)
- Hyperparameter tuning: Optuna, Ray Tune, Hyperopt
- Model serving: BentoML, FastAPI, TorchServe, Seldon, Ray Serve
- Orchestration: Kubeflow, Metaflow, Prefect, Airflow, ZenML
- Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks, Snowflake
Use this skill to execute data science projects end-to-end: concrete checklists, patterns, and templates, not theory.