machine-learning

Machine Learning

Comprehensive machine learning skill covering the full ML lifecycle from experimentation to production deployment.

When to Use This Skill

  • Building machine learning pipelines
  • Feature engineering and data preprocessing
  • Model training, evaluation, and selection
  • Hyperparameter tuning and optimization
  • Model deployment and serving
  • ML experiment tracking and versioning
  • Production ML monitoring and maintenance

ML Development Lifecycle

1. Problem Definition

Problem Types:
  • Binary classification (spam/not spam)
  • Multi-class classification (image categories)
  • Multi-label classification (document tags)
  • Regression (price prediction)
  • Clustering (customer segmentation)
  • Ranking (search results)
  • Anomaly detection (fraud detection)
Success Metrics by Problem Type:
| Problem Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC, F1 | Precision, Recall, PR-AUC |
| Multi-class | Macro F1, Accuracy | Per-class metrics |
| Regression | RMSE, MAE | R², MAPE |
| Ranking | NDCG, MAP | MRR |
| Clustering | Silhouette, Calinski-Harabasz | Davies-Bouldin |
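
To make the classification and regression rows of the table concrete, here is a minimal pure-Python sketch of precision, recall, F1, RMSE, and MAE. The function names and sample values are illustrative assumptions, not part of the skill; in practice a metrics library would normally compute these.

```python
import math

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for 0/1 labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative labels and predictions (made up for this example).
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```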

2. Data Preparation

Data Quality Checks:
  • Missing value analysis and imputation strategies
  • Outlier detection and handling
  • Data type validation
  • Distribution analysis
  • Target leakage detection
Feature Engineering Patterns:
  • Numerical: scaling, binning, log transforms, polynomial features
  • Categorical: one-hot, target encoding, frequency encoding, embeddings
  • Temporal: lag features, rolling statistics, cyclical encoding
  • Text: TF-IDF, word embeddings, transformer embeddings
  • Geospatial: distance features, clustering, grid encoding
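
As a rough illustration of the temporal patterns listed above, the sketch below implements cyclical encoding, lag features, and a rolling mean in plain Python. The helper names are hypothetical, and the rolling window here includes the current value, which is only safe when that value is not the prediction target.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour of day, period=24) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def lag_features(series, lags):
    """For each position, return the values `k` steps earlier (None if unavailable)."""
    rows = []
    for i in range(len(series)):
        rows.append({f"lag_{k}": series[i - k] if i - k >= 0 else None for k in lags})
    return rows

def rolling_mean(series, window):
    """Mean over the last `window` values, including the current one."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out
```

With cyclical encoding, hour 23 and hour 0 end up close together, which a raw integer encoding would not capture.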
Train/Test Split Strategies:
  • Random split (standard)
  • Stratified split (imbalanced classes)
  • Time-based split (temporal data)
  • Group split (keeps related samples together to prevent data leakage)
  • K-fold cross-validation
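
The time-based and group strategies above can be sketched in a few lines of standard-library Python. This is a minimal illustration under assumed inputs (a list of dict rows with hypothetical `time_key` and `group_key` fields), not a replacement for a library splitter.

```python
import random

def time_based_split(rows, time_key, test_fraction=0.2):
    """Train on the earliest rows, test on the most recent ones."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]

def group_split(rows, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups to train or test so no group appears in both."""
    groups = sorted({r[group_key] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Illustrative data: 20 events from 5 hypothetical users.
rows = [{"t": i, "user": i % 5} for i in range(20)]
train_t, test_t = time_based_split(rows, "t")
train_g, test_g = group_split(rows, "user")
```

A random split of the same rows could place one user's events in both sets, which is the leakage the group split avoids.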

3. Model Selection

Algorithm Selection Guide:
| Data Size or Type | Problem | Recommended Models |
|---|---|---|
| Small (<10K) | Classification | Logistic Regression, SVM, Random Forest |
| Small (<10K) | Regression | Linear Regression, Ridge, SVR |
| Medium (10K-1M) | Classification | XGBoost, LightGBM, Neural Networks |
| Medium (10K-1M) | Regression | XGBoost, LightGBM, Neural Networks |
| Large (>1M) | Any | Deep Learning, Distributed training |
| Tabular | Any | Gradient Boosting (XGBoost, LightGBM, CatBoost) |
| Images | Classification | CNN, ResNet, EfficientNet, Vision Transformers |
| Text | NLP | Transformers (BERT, RoBERTa, GPT) |
| Sequential | Time Series | LSTM, Transformer, Prophet |

4. Model Training

Hyperparameter Tuning:
  • Grid Search: exhaustive, good for small spaces
  • Random Search: efficient, good for large spaces
  • Bayesian Optimization: smart exploration (Optuna, Hyperopt)
  • Early stopping: prevent overfitting
Common Hyperparameters:
| Model | Key Parameters |
|---|---|
| XGBoost | learning_rate, max_depth, n_estimators, subsample |
| LightGBM | num_leaves, learning_rate, n_estimators, feature_fraction |
| Random Forest | n_estimators, max_depth, min_samples_split |
| Neural Networks | learning_rate, batch_size, layers, dropout |
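
Random search is simple enough to sketch without any tuning library. The example below samples a few of the parameters named in the table; the ranges, the `random_search` helper, and `toy_objective` (a stand-in for a real cross-validated score) are all assumptions made for illustration.

```python
import math
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Sample `n_trials` configurations and keep the highest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space; learning_rate is sampled on a log scale.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),
    "max_depth": lambda rng: rng.randint(2, 10),
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
}

def toy_objective(params):
    """Stand-in for a validation score; peaks at learning_rate=0.1, max_depth=6."""
    lr_term = (math.log10(params["learning_rate"]) + 1) ** 2
    depth_term = 0.01 * (params["max_depth"] - 6) ** 2
    return -(lr_term + depth_term)

best_params, best_score = random_search(toy_objective, space, n_trials=50)
```

In a real pipeline the objective would train a model and return a validation metric, and a Bayesian optimizer could replace the uniform sampling.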

5. Model Evaluation

Evaluation Best Practices:
  • Always use a held-out test set for final evaluation
  • Use cross-validation during development
  • Check for overfitting (train vs validation gap)
  • Evaluate on multiple metrics
  • Analyze errors qualitatively
Handling Imbalanced Data:
  • Resampling: SMOTE, undersampling
  • Class weights: weighted loss functions
  • Threshold tuning: optimize decision threshold
  • Evaluation: prefer PR-AUC to ROC-AUC when the positive class is rare
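
Threshold tuning can be illustrated with a small sketch that scans candidate thresholds and keeps the one with the best F1. The helper names and the scores are made up for this example, and the threshold should be chosen on a validation set, not on the final test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores):
    """Try each distinct score as a threshold; ties go to the lowest threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))

# Illustrative imbalanced validation data (3 positives out of 8).
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
val_scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8]
threshold = best_threshold(y_val, val_scores)
```

On this data the default 0.5 cutoff misses one positive, while the tuned threshold recovers it at the cost of one false positive.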

6. Production Deployment

Model Serving Patterns:
  • REST API (Flask, FastAPI, TF Serving)
  • Batch inference (scheduled jobs)
  • Streaming (real-time predictions)
  • Edge deployment (mobile, IoT)
Production Considerations:
  • Latency requirements (p50, p95, p99)
  • Throughput (requests per second)
  • Model size and memory footprint
  • Fallback strategies
  • A/B testing framework
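
The latency percentiles mentioned above can be computed from recorded request timings with a nearest-rank percentile. This is a minimal sketch; the `latency_report` and `within_budget` helpers and the budget values are assumptions for illustration.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    # Ceiling of q * n / 100 using integer arithmetic to avoid float rounding.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]

def latency_report(latencies_ms):
    """Summarize request latencies as p50, p95 and p99."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}

def within_budget(report, budget):
    """True if every budgeted percentile is at or below its limit."""
    return all(report[name] <= limit for name, limit in budget.items())

# Illustrative latencies: 1 ms to 100 ms, one sample each.
report = latency_report(list(range(1, 101)))
ok = within_budget(report, {"p95": 100, "p99": 150})
```

Tail percentiles such as p99 usually matter more than the mean, because a small share of slow requests can dominate user-visible latency.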

7. Monitoring & Maintenance

What to Monitor:
  • Prediction latency
  • Input feature distributions (data drift)
  • Prediction distributions (prediction drift, which can signal concept drift)
  • Model performance metrics
  • Error rates and types
Retraining Triggers:
  • Performance degradation below threshold
  • Significant data drift detected
  • Scheduled retraining (daily, weekly)
  • New training data available
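
One common way to quantify data drift is the Population Stability Index (PSI), which compares binned distributions of a feature between a baseline sample and recent traffic. The sketch below is a simplified version; the 0.2 PSI threshold is a widely used rule of thumb rather than something this skill prescribes, and `should_retrain` with its default limits is hypothetical.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(psi_value, metric, psi_threshold=0.2, metric_floor=0.8):
    """Trigger on significant drift or on performance below the floor."""
    return psi_value > psi_threshold or metric < metric_floor
```

For example, `should_retrain(psi(baseline, recent), current_f1)` could run on a schedule, combining the drift and performance triggers listed above.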

MLOps Best Practices

Experiment Tracking

Track for every experiment:
  • Code version (git commit)
  • Data version (hash or version ID)
  • Hyperparameters
  • Metrics (train, validation, test)
  • Model artifacts
  • Environment (packages, versions)
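
The checklist above maps naturally onto a single record per run. The sketch below builds such a record with the standard library only; the field names, the `build_experiment_record` helper, and the idea of fingerprinting the data with SHA-256 are illustrative assumptions, and a dedicated tracking tool would typically store this instead.

```python
import hashlib
import json
import platform

def data_fingerprint(rows):
    """Stable SHA-256 hash of JSON-serializable training rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_experiment_record(git_commit, rows, hyperparameters, metrics,
                            artifact_path, packages):
    """Collect everything needed to reproduce and compare a run."""
    return {
        "code_version": git_commit,            # e.g. output of `git rev-parse HEAD`
        "data_version": data_fingerprint(rows),
        "hyperparameters": hyperparameters,
        "metrics": metrics,                    # keyed by train / validation / test
        "model_artifact": artifact_path,
        "environment": {
            "python": platform.python_version(),
            "packages": packages,
        },
    }
```

Because the record is plain JSON-serializable data, it can be written next to the model artifact or sent to whatever tracking backend is in use.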

Model Versioning

models/
├── model_v1.0.0/
│   ├── model.pkl
│   ├── metadata.json
│   ├── requirements.txt
│   └── metrics.json
├── model_v1.1.0/
└── model_v2.0.0/
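
A directory in the layout above could be written with a small helper like the following. This is a sketch under assumptions: `save_model_version` is a hypothetical name, the model is any picklable object, and existing versions are treated as immutable.

```python
import json
import pickle
from pathlib import Path

def save_model_version(root, version, model, metadata, metrics, requirements):
    """Write model.pkl, metadata.json, metrics.json and requirements.txt."""
    target = Path(root) / f"model_v{version}"
    # exist_ok=False: refuse to overwrite an already published version.
    target.mkdir(parents=True, exist_ok=False)
    with open(target / "model.pkl", "wb") as handle:
        pickle.dump(model, handle)
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    (target / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (target / "requirements.txt").write_text("\n".join(requirements) + "\n")
    return target
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.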

CI/CD for ML

  1. Continuous Integration:
    • Data validation tests
    • Model training tests
    • Performance regression tests
  2. Continuous Deployment:
    • Staging environment validation
    • Shadow mode testing
    • Gradual rollout (canary)
    • Automatic rollback
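
A performance regression test in CI can be as simple as comparing a candidate model's metrics with the current baseline and failing the build when any metric gets worse by more than a tolerance. The function name, tolerances, and metric values below are assumptions for illustration.

```python
def check_performance_regression(baseline, candidate, tolerances, higher_is_better):
    """Return the names of metrics that regressed beyond their tolerance."""
    failures = []
    for name, base_value in baseline.items():
        tolerance = tolerances.get(name, 0.0)
        new_value = candidate[name]
        if higher_is_better.get(name, True):
            regressed = new_value < base_value - tolerance
        else:
            regressed = new_value > base_value + tolerance
        if regressed:
            failures.append(name)
    return failures

# Illustrative numbers: F1 dips within tolerance, RMSE worsens beyond it.
failures = check_performance_regression(
    baseline={"f1": 0.80, "rmse": 2.0},
    candidate={"f1": 0.795, "rmse": 2.5},
    tolerances={"f1": 0.01, "rmse": 0.1},
    higher_is_better={"f1": True, "rmse": False},
)
```

A CI job could assert that `failures` is empty before promoting the candidate to staging, and the same check could drive an automatic rollback after a canary release.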

Reference Files

For detailed patterns and code examples, load reference files as needed:
  • references/preprocessing.md
    - Data preprocessing patterns and feature engineering techniques
  • references/model_patterns.md
    - Model architecture patterns and implementation examples
  • references/evaluation.md
    - Comprehensive evaluation strategies and metrics

Integration with Other Skills

  • performance - For optimizing inference latency
  • testing - For ML-specific testing patterns
  • database-optimization - For feature store queries
  • debugging - For model debugging and error analysis