ML Pipeline Workflow
Complete end-to-end MLOps pipeline orchestration from data preparation through model deployment.
Overview
This skill provides comprehensive guidance for building production ML pipelines that handle the full lifecycle: data ingestion → preparation → training → validation → deployment → monitoring.
When to Use This Skill
- Building new ML pipelines from scratch
- Designing workflow orchestration for ML systems
- Implementing data → model → deployment automation
- Setting up reproducible training workflows
- Creating DAG-based ML orchestration
- Integrating ML components into production systems
What This Skill Provides
Core Capabilities
- Pipeline Architecture
  - End-to-end workflow design
  - DAG orchestration patterns (Airflow, Dagster, Kubeflow)
  - Component dependencies and data flow
  - Error handling and retry strategies
- Data Preparation
  - Data validation and quality checks
  - Feature engineering pipelines
  - Data versioning and lineage
  - Train/validation/test splitting strategies
- Model Training
  - Training job orchestration
  - Hyperparameter management
  - Experiment tracking integration
  - Distributed training patterns
- Model Validation
  - Validation frameworks and metrics
  - A/B testing infrastructure
  - Performance regression detection
  - Model comparison workflows
- Deployment Automation
  - Model serving patterns
  - Canary deployments
  - Blue-green deployment strategies
  - Rollback mechanisms
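The DAG orchestration and dependency patterns above can be sketched without committing to a particular orchestrator. Below is a minimal in-process scheduler using the standard library's topological sort; the stage names mirror this skill's workflow, while the runner map is a hypothetical stand-in for real tasks:

```python
from graphlib import TopologicalSorter

# Stage -> set of upstream stages (mirrors the lifecycle in this skill).
PIPELINE = {
    "data_ingestion": set(),
    "data_validation": {"data_ingestion"},
    "feature_engineering": {"data_validation"},
    "model_training": {"feature_engineering"},
    "model_validation": {"model_training"},
    "model_deployment": {"model_validation"},
}

def run_pipeline(dag, runners):
    """Execute stages in dependency order; an exception halts downstream stages."""
    for stage in TopologicalSorter(dag).static_order():
        print(f"running {stage}")
        runners[stage]()

# Placeholder runners; in a real pipeline each would do actual work.
runners = {name: (lambda n=name: None) for name in PIPELINE}
run_pipeline(PIPELINE, runners)
```

Real orchestrators add scheduling, retries, and distributed execution on top of exactly this dependency-resolution core.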
Reference Documentation
See the references/ directory for detailed guides:
- data-preparation.md - Data cleaning, validation, and feature engineering
- model-training.md - Training workflows and best practices
- model-validation.md - Validation strategies and metrics
- model-deployment.md - Deployment patterns and serving architectures
Assets and Templates
The assets/ directory contains:
- pipeline-dag.yaml.template - DAG template for workflow orchestration
- training-config.yaml - Training configuration template
- validation-checklist.md - Pre-deployment validation checklist
Usage Patterns
Basic Pipeline Setup
```python
# 1. Define pipeline stages
stages = [
    "data_ingestion",
    "data_validation",
    "feature_engineering",
    "model_training",
    "model_validation",
    "model_deployment",
]

# 2. Configure dependencies
# See assets/pipeline-dag.yaml.template for full example
```
Production Workflow
- Data Preparation Phase
  - Ingest raw data from sources
  - Run data quality checks
  - Apply feature transformations
  - Version processed datasets
- Training Phase
  - Load versioned training data
  - Execute training jobs
  - Track experiments and metrics
  - Save trained models
- Validation Phase
  - Run validation test suite
  - Compare against baseline
  - Generate performance reports
  - Approve for deployment
- Deployment Phase
  - Package model artifacts
  - Deploy to serving infrastructure
  - Configure monitoring
  - Validate production traffic
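The phases above can be sketched end to end. The functions below are purely illustrative stand-ins for real components: the content-hash versioning scheme and the majority-label "model" are assumptions chosen to keep the sketch self-contained, not recommendations:

```python
import hashlib
import json

def prepare_data(raw_records):
    """Data preparation phase: quality check, then version the clean dataset."""
    clean = [r for r in raw_records if r.get("label") is not None]  # drop unlabeled rows
    # Content-addressed dataset version for reproducibility (illustrative scheme).
    version = hashlib.sha256(json.dumps(clean, sort_keys=True).encode()).hexdigest()[:12]
    return clean, version

def train(dataset):
    """Training phase stub: a trivial 'model' that predicts the majority label."""
    labels = [r["label"] for r in dataset]
    return max(set(labels), key=labels.count)

def validate(model, baseline_accuracy, holdout):
    """Validation phase: compare against a baseline before approving deployment."""
    accuracy = sum(r["label"] == model for r in holdout) / len(holdout)
    return accuracy >= baseline_accuracy

raw = [{"label": 1}, {"label": 1}, {"label": 0}, {"label": None}]
dataset, version = prepare_data(raw)
model = train(dataset)
approved = validate(model, baseline_accuracy=0.5, holdout=dataset)
```

The key structural point is that each phase consumes the versioned output of the previous one, so any run can be traced back to the exact data that produced it.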
Best Practices
Pipeline Design
- Modularity: Each stage should be independently testable
- Idempotency: Re-running stages should be safe
- Observability: Log metrics at every stage
- Versioning: Track data, code, and model versions
- Failure Handling: Implement retry logic and alerting
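The idempotency and failure-handling guidance above can be combined in a single stage wrapper. This is a minimal sketch; the marker-file scheme and backoff policy are illustrative assumptions, not a prescribed design:

```python
import time
from pathlib import Path

def run_stage(name, fn, marker_dir=Path("/tmp/pipeline-markers"), retries=3, delay=1.0):
    """Run a pipeline stage idempotently: skip if its success marker exists,
    retry transient failures, and write the marker only on success."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker = marker_dir / f"{name}.done"
    if marker.exists():
        return "skipped"  # safe to re-run the whole pipeline
    for attempt in range(1, retries + 1):
        try:
            fn()
            marker.touch()
            return "ok"
        except Exception:
            if attempt == retries:
                raise  # surface the failure for alerting after the final attempt
            time.sleep(delay * attempt)  # linear backoff between retries
```

Because the marker is written only after `fn()` succeeds, a crash mid-stage leaves the stage eligible for re-execution on the next run.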
Data Management
- Use data validation libraries (Great Expectations, TFX)
- Version datasets with DVC or similar tools
- Document feature engineering transformations
- Maintain data lineage tracking
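To show the kind of check a validation library performs, here is a hand-rolled sketch; it is not the Great Expectations or TFX API, and the schema format is a made-up convention for this example only:

```python
def validate_records(records, schema):
    """Minimal data quality check: required fields present, types correct,
    numeric values inside expected ranges. Returns a list of issue strings."""
    issues = []
    for i, row in enumerate(records):
        for field, (ftype, lo, hi) in schema.items():
            if field not in row or row[field] is None:
                issues.append(f"row {i}: missing {field}")
                continue
            value = row[field]
            if not isinstance(value, ftype):
                issues.append(f"row {i}: {field} has type {type(value).__name__}")
            elif lo is not None and not (lo <= value <= hi):
                issues.append(f"row {i}: {field}={value} out of range [{lo}, {hi}]")
    return issues

# Hypothetical schema: field -> (type, min, max); None bounds skip the range check.
SCHEMA = {"age": (int, 0, 120), "name": (str, None, None)}
rows = [{"age": 34, "name": "a"}, {"age": 200, "name": "b"}, {"name": "c"}]
issues = validate_records(rows, SCHEMA)
```

Running such checks at every stage boundary turns silent data corruption into an explicit pipeline failure.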
Model Operations
- Separate training and serving infrastructure
- Use model registries (MLflow, Weights & Biases)
- Implement gradual rollouts for new models
- Monitor model performance drift
- Maintain rollback capabilities
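One common way to monitor drift is the Population Stability Index, which compares a production feature distribution against its training-time baseline. A minimal sketch follows; the binning scheme and the usual 0.1/0.25 thresholds are conventional rules of thumb, not universal constants:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between two numeric samples. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate drift, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = max(min(int((x - lo) / width), bins - 1), 0)  # clamp to bin range
            counts[idx] += 1
        # Small epsilon avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]       # training-time distribution
production = [i / 100 for i in range(100)]     # identical -> PSI near 0
shifted = [i / 100 + 0.5 for i in range(100)]  # shifted -> large PSI
```

A scheduled job computing PSI per feature is a simple first line of defense before investing in a full monitoring platform.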
Deployment Strategies
- Start with shadow deployments
- Use canary releases for validation
- Implement A/B testing infrastructure
- Set up automated rollback triggers
- Monitor latency and throughput
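An automated rollback trigger for a canary release can be as simple as a sliding-window error-rate check. The window size and thresholds below are illustrative assumptions, to be tuned per service:

```python
from collections import deque

class RollbackTrigger:
    """Sliding-window error-rate monitor for a canary release."""

    def __init__(self, window=100, max_error_rate=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window)  # True = request succeeded
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, ok: bool):
        self.outcomes.append(ok)

    def should_rollback(self) -> bool:
        if len(self.outcomes) < self.min_samples:
            return False  # not enough canary traffic to judge yet
        error_rate = self.outcomes.count(False) / len(self.outcomes)
        return error_rate > self.max_error_rate

trigger = RollbackTrigger(window=50, max_error_rate=0.1, min_samples=10)
for _ in range(40):
    trigger.record(True)
healthy = trigger.should_rollback()   # healthy canary
for _ in range(10):
    trigger.record(False)
degraded = trigger.should_rollback()  # error rate spiked
```

The `min_samples` guard matters: without it, a single early failure on a low-traffic canary would trigger a spurious rollback.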
Integration Points
Orchestration Tools
- Apache Airflow: DAG-based workflow orchestration
- Dagster: Asset-based pipeline orchestration
- Kubeflow Pipelines: Kubernetes-native ML workflows
- Prefect: Modern dataflow automation
Experiment Tracking
- MLflow for experiment tracking and model registry
- Weights & Biases for visualization and collaboration
- TensorBoard for training metrics
Deployment Platforms
- AWS SageMaker for managed ML infrastructure
- Google Vertex AI for GCP deployments
- Azure ML for Azure cloud
- Kubernetes + KServe for cloud-agnostic serving
Progressive Disclosure
Start with the basics and gradually add complexity:
- Level 1: Simple linear pipeline (data → train → deploy)
- Level 2: Add validation and monitoring stages
- Level 3: Implement hyperparameter tuning
- Level 4: Add A/B testing and gradual rollouts
- Level 5: Multi-model pipelines with ensemble strategies
Common Patterns
Batch Training Pipeline
```yaml
# See assets/pipeline-dag.yaml.template
stages:
  - name: data_preparation
    dependencies: []
  - name: model_training
    dependencies: [data_preparation]
  - name: model_evaluation
    dependencies: [model_training]
  - name: model_deployment
    dependencies: [model_evaluation]
```
Real-time Feature Pipeline
```python
# Stream processing for real-time features
# Combined with batch training
# See references/data-preparation.md
```
Continuous Training
```python
# Automated retraining on schedule
# Triggered by data drift detection
# See references/model-training.md
```
Troubleshooting
Common Issues
- Pipeline failures: Check dependencies and data availability
- Training instability: Review hyperparameters and data quality
- Deployment issues: Validate model artifacts and serving config
- Performance degradation: Monitor data drift and model metrics
Debugging Steps
- Check pipeline logs for each stage
- Validate input/output data at boundaries
- Test components in isolation
- Review experiment tracking metrics
- Inspect model artifacts and metadata
Next Steps
After setting up your pipeline:
- Explore hyperparameter-tuning skill for optimization
- Learn experiment-tracking-setup for MLflow/W&B
- Review model-deployment-patterns for serving strategies
- Implement monitoring with observability tools
Related Skills
- experiment-tracking-setup: MLflow and Weights & Biases integration
- hyperparameter-tuning: Automated hyperparameter optimization
- model-deployment-patterns: Advanced deployment strategies