machine-learning
Machine Learning
Comprehensive machine learning skill covering the full ML lifecycle from experimentation to production deployment.
When to Use This Skill
- Building machine learning pipelines
- Feature engineering and data preprocessing
- Model training, evaluation, and selection
- Hyperparameter tuning and optimization
- Model deployment and serving
- ML experiment tracking and versioning
- Production ML monitoring and maintenance
ML Development Lifecycle
1. Problem Definition
Problem Types:
- Binary classification (spam/not spam)
- Multi-class classification (image categories)
- Multi-label classification (document tags)
- Regression (price prediction)
- Clustering (customer segmentation)
- Ranking (search results)
- Anomaly detection (fraud detection)
Success Metrics by Problem Type:
| Problem Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary Classification | AUC-ROC, F1 | Precision, Recall, PR-AUC |
| Multi-class | Macro F1, Accuracy | Per-class metrics |
| Regression | RMSE, MAE | R², MAPE |
| Ranking | NDCG, MAP | MRR |
| Clustering | Silhouette, Calinski-Harabasz | Davies-Bouldin |
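As a rough illustration of the metrics in the table, the sketch below computes precision, recall, and F1 for binary labels and MAE and RMSE for regression using only the standard library. The function names are hypothetical; in practice an established metrics library would normally be used instead.

```python
import math


def binary_classification_metrics(y_true, y_pred):
    """Precision, recall and F1 for 0/1 labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """Mean absolute error and root mean squared error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


if __name__ == "__main__":
    print(binary_classification_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
    print(regression_metrics([3, 5, 2], [2, 5, 4]))
```

RMSE penalizes large errors more heavily than MAE, which is one reason the table lists both.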
2. Data Preparation
Data Quality Checks:
- Missing value analysis and imputation strategies
- Outlier detection and handling
- Data type validation
- Distribution analysis
- Target leakage detection
Feature Engineering Patterns:
- Numerical: scaling, binning, log transforms, polynomial features
- Categorical: one-hot, target encoding, frequency encoding, embeddings
- Temporal: lag features, rolling statistics, cyclical encoding
- Text: TF-IDF, word embeddings, transformer embeddings
- Geospatial: distance features, clustering, grid encoding
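A few of the patterns above can be sketched in plain Python. The helper names are hypothetical, and the frequency encoder is assumed to be fitted on training data only so that test-set statistics do not leak into the features.

```python
import math
from collections import Counter


def cyclical_encode(value, period):
    """Map a periodic value (hour of day, month, ...) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def log_transform(values):
    """log(1 + x) compresses heavy right tails; assumes non-negative inputs."""
    return [math.log1p(v) for v in values]


def fit_frequency_encoder(train_categories):
    """Learn category -> relative frequency from the training data only."""
    counts = Counter(train_categories)
    total = len(train_categories)
    return {category: count / total for category, count in counts.items()}


def apply_frequency_encoder(encoder, categories, default=0.0):
    """Encode categories; values unseen during fitting fall back to `default`."""
    return [encoder.get(category, default) for category in categories]


if __name__ == "__main__":
    print(cyclical_encode(23, 24), cyclical_encode(0, 24))
    encoder = fit_frequency_encoder(["a", "b", "a", "c"])
    print(apply_frequency_encoder(encoder, ["a", "c", "z"]))
```

With cyclical encoding, hour 23 and hour 0 end up close together, which a raw integer feature would not capture.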
Train/Test Split Strategies:
- Random split (standard)
- Stratified split (imbalanced classes)
- Time-based split (temporal data)
- Group split (prevent data leakage)
- K-fold cross-validation
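The group and time-based strategies are the ones most often implemented incorrectly, so a minimal sketch may help. Both functions are illustrative, return index lists rather than data, and use hypothetical names and defaults.

```python
import random


def group_split(groups, test_fraction=0.2, seed=0):
    """Split indices so that no group appears in both train and test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


def time_based_split(timestamps, test_fraction=0.2):
    """Train on the earliest samples and test on the most recent ones."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, round(len(order) * test_fraction))
    cut = len(order) - n_test
    return order[:cut], order[cut:]


if __name__ == "__main__":
    users = ["u1", "u1", "u2", "u3", "u3", "u4", "u5", "u5"]
    print(group_split(users, test_fraction=0.4))
    print(time_based_split([5, 1, 4, 2, 3]))
```

Splitting by group (for example by user ID) keeps correlated rows on one side of the split; splitting by time keeps the model from training on the future.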
3. Model Selection
Algorithm Selection Guide:
| Data Size | Problem | Recommended Models |
|---|---|---|
| Small (<10K) | Classification | Logistic Regression, SVM, Random Forest |
| Small (<10K) | Regression | Linear Regression, Ridge, SVR |
| Medium (10K-1M) | Classification | XGBoost, LightGBM, Neural Networks |
| Medium (10K-1M) | Regression | XGBoost, LightGBM, Neural Networks |
| Large (>1M) | Any | Deep Learning, Distributed training |
| Tabular | Any | Gradient Boosting (XGBoost, LightGBM, CatBoost) |
| Images | Classification | CNN, ResNet, EfficientNet, Vision Transformers |
| Text | NLP | Transformers (BERT, RoBERTa, GPT) |
| Sequential | Time Series | LSTM, Transformer, Prophet |
4. Model Training
Hyperparameter Tuning:
- Grid Search: exhaustive, good for small spaces
- Random Search: efficient, good for large spaces
- Bayesian Optimization: smart exploration (Optuna, Hyperopt)
- Early stopping: prevent overfitting
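Random search is simple enough to sketch directly. The search space, the `toy_objective` function, and all parameter ranges below are assumptions for illustration; in a real project the objective would be a cross-validated score, and a tool such as Optuna or Hyperopt would handle the Bayesian variant.

```python
import math
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample `n_trials` configurations and keep the highest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space: each entry draws one value from its distribution.
SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform
    "max_depth": lambda rng: rng.randint(2, 10),
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
}


def toy_objective(params):
    """Stand-in for a validation score; peaks at learning_rate=0.1, max_depth=6."""
    lr_term = (math.log10(params["learning_rate"]) + 1) ** 2
    depth_term = 0.01 * (params["max_depth"] - 6) ** 2
    return -(lr_term + depth_term)


if __name__ == "__main__":
    print(random_search(toy_objective, SPACE, n_trials=200))
```

Sampling the learning rate on a log scale is a common choice because useful values span several orders of magnitude.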
Common Hyperparameters:
| Model | Key Parameters |
|---|---|
| XGBoost | learning_rate, max_depth, n_estimators, subsample |
| LightGBM | num_leaves, learning_rate, n_estimators, feature_fraction |
| Random Forest | n_estimators, max_depth, min_samples_split |
| Neural Networks | learning_rate, batch_size, layers, dropout |
5. Model Evaluation
Evaluation Best Practices:
- Always use a held-out test set for final evaluation
- Use cross-validation during development
- Check for overfitting (train vs validation gap)
- Evaluate on multiple metrics
- Analyze errors qualitatively
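To make the cross-validation and overfitting checks concrete, here is a minimal sketch of K-fold index generation and a train-versus-validation gap check. The names and the `max_gap` default are hypothetical, and the folds are contiguous (no shuffling), which assumes the rows are already in random order.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def is_overfitting(train_score, val_score, max_gap=0.05):
    """Flag a model whose training score exceeds validation by more than max_gap."""
    return (train_score - val_score) > max_gap


if __name__ == "__main__":
    for train_idx, val_idx in k_fold_indices(10, k=3):
        print(len(train_idx), val_idx)
    print(is_overfitting(train_score=0.99, val_score=0.80))
```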
Handling Imbalanced Data:
- Resampling: SMOTE, undersampling
- Class weights: weighted loss functions
- Threshold tuning: optimize decision threshold
- Evaluation: prefer PR-AUC over ROC-AUC when the positive class is rare
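Threshold tuning can be sketched as a scan over candidate thresholds that keeps the one with the best F1. This is an illustrative version with hypothetical function names; the threshold should be chosen on validation data, not on the final test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is predicted as the positive class."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, scores):
    """Try each distinct score as a threshold and return the best one by F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
    scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.8]
    best = tune_threshold(labels, scores)
    print(best, f1_at_threshold(labels, scores, best))
```

On imbalanced data the default threshold of 0.5 is often too high for the minority class, which is why tuning it can recover recall without retraining.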
6. Production Deployment
Model Serving Patterns:
- REST API (Flask, FastAPI, TF Serving)
- Batch inference (scheduled jobs)
- Streaming (real-time predictions)
- Edge deployment (mobile, IoT)
Production Considerations:
- Latency requirements (p50, p95, p99)
- Throughput (requests per second)
- Model size and memory footprint
- Fallback strategies
- A/B testing framework
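Latency targets are usually stated as percentiles, so a small sketch of how p50, p95, and p99 might be computed from recorded request times follows. It uses the nearest-rank definition of a percentile; the function names and the example SLO limits are assumptions.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms):
    """Summarize request latencies at the percentiles commonly used for SLOs."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}


def meets_slo(latencies_ms, slo_ms):
    """Check each percentile named in `slo_ms` against its limit in milliseconds."""
    report = latency_report(latencies_ms)
    return all(report[name] <= limit for name, limit in slo_ms.items())


if __name__ == "__main__":
    observed = list(range(1, 101))  # pretend latencies of 1..100 ms
    print(latency_report(observed))
    print(meets_slo(observed, {"p95": 100, "p99": 120}))
```

Tail percentiles matter because a mean can look healthy while a small share of requests is very slow.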
7. Monitoring & Maintenance
What to Monitor:
- Prediction latency
- Input feature distributions (data drift)
- Prediction distributions (prediction drift, a possible sign of concept drift)
- Model performance metrics
- Error rates and types
Retraining Triggers:
- Performance degradation below threshold
- Significant data drift detected
- Scheduled retraining (daily, weekly)
- New training data available
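One common way to quantify data drift is the Population Stability Index (PSI), which compares binned feature distributions between a baseline and live traffic. The sketch below is a minimal version; the bin count, the 0.2 PSI threshold (a widely used rule of thumb), and the metric floor are assumptions to be tuned per project.

```python
import math


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline sample and a live sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(psi, metric, psi_threshold=0.2, metric_floor=0.8):
    """Trigger retraining on significant drift or on degraded performance."""
    return psi > psi_threshold or metric < metric_floor


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    shifted = [x + 0.5 for x in baseline]
    psi = population_stability_index(baseline, shifted)
    print(psi, should_retrain(psi, metric=0.9))
```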
MLOps Best Practices
Experiment Tracking
Track for every experiment:
- Code version (git commit)
- Data version (hash or version ID)
- Hyperparameters
- Metrics (train, validation, test)
- Model artifacts
- Environment (packages, versions)
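The items above can be captured in a single JSON-serializable record per run. This is a sketch under assumptions: the field names are hypothetical, the data version is a SHA-256 hash of the rows, and the git commit is passed in by the caller rather than looked up.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def data_fingerprint(rows):
    """Hash the dataset rows so a run can be tied to the exact data it used."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def build_experiment_record(code_version, rows, hyperparameters, metrics, artifact_path):
    """Collect everything needed to reproduce and compare one experiment."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,  # e.g. a git commit hash
        "data_version": data_fingerprint(rows),
        "hyperparameters": hyperparameters,
        "metrics": metrics,  # train / validation / test
        "artifact_path": artifact_path,
        "environment": {
            "python": platform.python_version(),
            "platform": sys.platform,
        },
    }


if __name__ == "__main__":
    record = build_experiment_record(
        code_version="abc1234",
        rows=[{"x": 1, "y": 0}, {"x": 2, "y": 1}],
        hyperparameters={"max_depth": 6, "learning_rate": 0.1},
        metrics={"validation": {"f1": 0.81}},
        artifact_path="models/model_v1.0.0/model.pkl",
    )
    print(json.dumps(record, indent=2))
```

Dedicated experiment trackers store the same information with a UI on top; the record above is only meant to show which fields matter.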
Model Versioning
models/
├── model_v1.0.0/
│ ├── model.pkl
│ ├── metadata.json
│ ├── requirements.txt
│ └── metrics.json
├── model_v1.1.0/
└── model_v2.0.0/
CI/CD for ML
Continuous Integration:
- Data validation tests
- Model training tests
- Performance regression tests
Continuous Deployment:
- Staging environment validation
- Shadow mode testing
- Gradual rollout (canary)
- Automatic rollback
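A performance regression test in CI can be as simple as comparing a candidate model's metrics against the current production baseline. The sketch below is illustrative: the function name, the tolerance, and the set of lower-is-better metrics are assumptions.

```python
def find_regressions(baseline, candidate, tolerance=0.01, lower_is_better=("rmse", "mae")):
    """Return the metrics on which the candidate is worse than baseline beyond tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in candidate:
            regressions[name] = "missing from candidate"
            continue
        delta = candidate[name] - base_value
        if name in lower_is_better:
            delta = -delta  # for error metrics, an increase is a regression
        if delta < -tolerance:
            regressions[name] = f"{base_value:.4f} -> {candidate[name]:.4f}"
    return regressions


if __name__ == "__main__":
    baseline = {"f1": 0.80, "rmse": 1.20}
    candidate = {"f1": 0.75, "rmse": 1.30}
    problems = find_regressions(baseline, candidate)
    # A CI job would fail the build (or block the rollout) when this is non-empty.
    print(problems)
```

The same check can drive automatic rollback if it is run against live metrics after a canary release.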
Reference Files
For detailed patterns and code examples, load reference files as needed:
- references/preprocessing.md - Data preprocessing patterns and feature engineering techniques
- references/model_patterns.md - Model architecture patterns and implementation examples
- references/evaluation.md - Comprehensive evaluation strategies and metrics
Integration with Other Skills
- performance - For optimizing inference latency
- testing - For ML-specific testing patterns
- database-optimization - For feature store queries
- debugging - For model debugging and error analysis