ai-mlops
MLOps & ML Security - Complete Reference (Jan 2026)
Production ML lifecycle with modern security practices.
This skill covers:
- Production: Data ingestion, deployment, drift detection, monitoring, incident response
- Security: Prompt injection, jailbreak defense, RAG security, output filtering
- Governance: Privacy protection, supply chain security, safety evaluation
- Data ingestion (dlt): Load data from APIs, databases to warehouses
- Model deployment: Batch jobs, real-time APIs, hybrid systems, event-driven automation
- Operations: Real-time monitoring, drift detection, automated retraining, incident response
Modern Best Practices (Jan 2026):
- Version everything that can change: model artifacts, data snapshots, feature definitions, prompts/configs, and agent graphs; require reproducibility, rollbacks, and audit logs (NIST SSDF: https://csrc.nist.gov/pubs/sp/800/218/final).
- Gate changes with evals (offline + online) and safe rollout (shadow/canary/blue-green); treat regressions in quality, safety, latency, and cost as release blockers.
- Align controls and documentation to risk posture (EU AI Act: https://eur-lex.europa.eu/eli/reg/2024/1689/oj; NIST AI RMF + GenAI profile: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf).
- Operationalize security: threat model the full system (data, model, prompts, tools, RAG), harden the supply chain (SBOM/signing), and ship incident playbooks for both reliability and safety events.
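The eval-gated rollout in the second bullet can be sketched as a release gate that treats any regression in quality, safety, latency, or cost as a blocker. The metric names and thresholds below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    """Offline/online eval summary for one model or prompt version."""
    quality: float               # e.g. task accuracy; higher is better
    safety_pass_rate: float      # fraction of safety evals passed
    p95_latency_ms: float
    cost_per_1k_requests: float

def release_gate(candidate: EvalReport, baseline: EvalReport,
                 max_quality_drop: float = 0.01,
                 max_latency_ratio: float = 1.10,
                 max_cost_ratio: float = 1.10) -> list:
    """Return blocking reasons; an empty list means the rollout may proceed."""
    blockers = []
    if candidate.quality < baseline.quality - max_quality_drop:
        blockers.append("quality regression")
    if candidate.safety_pass_rate < baseline.safety_pass_rate:
        blockers.append("safety regression")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        blockers.append("latency regression")
    if candidate.cost_per_1k_requests > baseline.cost_per_1k_requests * max_cost_ratio:
        blockers.append("cost regression")
    return blockers

baseline = EvalReport(quality=0.91, safety_pass_rate=0.99,
                      p95_latency_ms=420, cost_per_1k_requests=1.8)
candidate = EvalReport(quality=0.92, safety_pass_rate=0.97,
                       p95_latency_ms=410, cost_per_1k_requests=1.7)
blockers = release_gate(candidate, baseline)
```

In a canary rollout, the same gate would run again on live canary metrics before widening traffic.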
It is execution-focused:
- Data ingestion patterns (REST APIs, database replication, incremental loading)
- Deployment patterns (batch, online, hybrid, streaming, event-driven)
- Automated monitoring with real-time drift detection
- Automated retraining pipelines (monitor → detect → trigger → validate → deploy)
- Incident handling with validated rollback and postmortems
- Links to copy-paste templates in assets/
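The monitor → detect → trigger → validate → deploy loop above can be sketched as one auditable cycle. The five collaborators here are illustrative callables, not a specific framework's API; real pipelines would wire them to monitoring queries, a training job, an eval suite, and a registry promotion:

```python
def run_retraining_cycle(metrics, drift_detector, trainer, validator, deployer):
    """One pass of the monitor -> detect -> trigger -> validate -> deploy loop."""
    if not drift_detector(metrics):       # monitor + detect
        return "no_action"
    candidate = trainer()                 # trigger: retrain on fresh data
    if not validator(candidate):          # validate against the champion model
        return "candidate_rejected"
    deployer(candidate)                   # deploy, ideally behind a canary
    return "deployed"

# Toy wiring: a PSI threshold as the drift signal (values are made up).
action = run_retraining_cycle(
    metrics={"psi": 0.31},
    drift_detector=lambda m: m["psi"] > 0.25,
    trainer=lambda: {"version": "v2"},
    validator=lambda candidate: True,
    deployer=lambda candidate: None,
)
```

Returning the action taken (rather than just side effects) keeps every cycle auditable.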
Quick Reference
| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| Data Ingestion | dlt (data load tool) | | Loading from APIs, databases to warehouses |
| Batch Deployment | Airflow, Dagster, Prefect | | Scheduled predictions on large datasets |
| API Deployment | FastAPI, Flask, TorchServe | | Real-time inference (<500ms latency) |
| LLM Serving | vLLM, TGI, BentoML | | High-throughput LLM inference |
| Model Registry | MLflow, W&B, ZenML | | Versioning and promoting models |
| Drift Detection | Statistical tests + monitors | PSI/KS, embedding drift, prediction drift | Detect data/process changes and trigger review |
| Monitoring | Prometheus, Grafana | | Metrics, alerts, SLO tracking |
| AgentOps | AgentOps, Langfuse, LangSmith | | AI agent observability, session replay |
| Incident Response | Runbooks, PagerDuty | Documented playbooks, alert routing | Handling failures and degradation |
Use This Skill When
Use this skill when the user asks for deployment, operations, monitoring, incident handling, or governance for ML/LLM/agent systems, e.g.:
- "How do I deploy this model to prod?"
- "Design a batch + online scoring architecture."
- "Add monitoring and drift detection to our model."
- "Write an incident runbook for this ML service."
- "Package this LLM/RAG pipeline as an API."
- "Plan our retraining and promotion workflow."
- "Load data from Stripe API to Snowflake."
- "Set up incremental database replication with dlt."
- "Build an ELT pipeline for warehouse loading."
If the user is asking only about EDA, modelling, or theory, prefer:
- ai-ml-data-science (EDA, features, modelling, SQL transformation with SQLMesh)
- ai-llm (prompting, fine-tuning, eval)
- ai-rag (retrieval pipeline design)
- ai-llm-inference (compression, spec decode, serving internals)
If the user is asking about SQL transformation (after data is loaded), prefer:
- ai-ml-data-science (SQLMesh templates for staging, intermediate, marts layers)
Decision Tree: Choosing Deployment Strategy
```text
User needs to deploy: [ML System]
├─ Data Ingestion?
│   ├─ From REST APIs? → dlt REST API templates
│   ├─ From databases? → dlt database sources (PostgreSQL, MySQL, MongoDB)
│   └─ Incremental loading? → dlt incremental patterns (timestamp, ID-based)
│
├─ Model Serving?
│   ├─ Latency <500ms? → FastAPI real-time API
│   ├─ Batch predictions? → Airflow/Dagster batch pipeline
│   └─ Mix of both? → Hybrid (batch features + online scoring)
│
├─ Monitoring & Ops?
│   ├─ Drift detection? → Evidently + automated retraining triggers
│   ├─ Performance tracking? → Prometheus + Grafana dashboards
│   └─ Incident response? → Runbooks + PagerDuty alerts
│
└─ LLM/RAG Production?
    ├─ Cost optimization? → Caching, prompt templates, token budgets
    └─ Safety? → See ai-mlops skill
```
Core Concepts (Vendor-Agnostic)
- Lifecycle loop: train → validate → deploy → monitor → respond → retrain/retire.
- Risk controls: access control, data minimization, logging, and change management (NIST AI RMF: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf).
- Observability planes: system metrics (latency/errors), data metrics (freshness/drift), quality metrics (model performance).
- Incident readiness: detection, containment, rollback, and root-cause analysis.
Do / Avoid
Do
- Do gate deployments with repeatable checks: evaluation pass, load test, security review, rollback plan.
- Do version everything: code, data, features, model artifact, prompt templates, configuration.
- Do define SLOs and budgets (latency/cost/error rate) before optimizing.
Avoid
- Avoid manual “clickops” deployments without audit trail.
- Avoid silent upgrades; require eval + canary for model/prompt changes.
- Avoid drift dashboards without actions; every alert needs an owner and runbook.
Core Patterns Overview
This skill provides production-ready patterns and guides organized into comprehensive references:
Data & Infrastructure Patterns
Pattern 0: Data Contracts, Ingestion & Lineage
→ See Data Ingestion Patterns
- Data contracts with SLAs and versioning
- Ingestion modes (CDC, batch, streaming)
- Lineage tracking and schema evolution
- Replay and backfill procedures
Pattern 1: Choose Deployment Mode
→ See Deployment Patterns
- Decision table (batch, online, hybrid, streaming)
- When to use each mode
- Deployment mode selection checklist
Pattern 2: Standard Deployment Lifecycle
→ See Deployment Lifecycle
- Pre-deploy, deploy, observe, operate, evolve phases
- Environment promotion (dev → staging → prod)
- Gradual rollout strategies (canary, blue-green)
Pattern 3: Packaging & Model Registry
→ See Model Registry Patterns
- Model registry structure and metadata
- Packaging strategies (Docker, ONNX, MLflow)
- Promotion flows (experimental → production)
- Versioning and governance
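The promotion flow in Pattern 3 can be sketched as a registry record with a one-stage-at-a-time state machine and an audit trail. The field names and stage list below are illustrative, not MLflow's (or any registry's) actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

STAGES = ["experimental", "staging", "production", "archived"]

@dataclass
class ModelVersion:
    """Minimal registry record for one immutable model version."""
    name: str
    version: int
    artifact_uri: str
    data_snapshot: str                 # which data the model was trained on
    stage: str = "experimental"
    history: list = field(default_factory=list)

    def promote(self, target: str, approved_by: str) -> None:
        """Advance exactly one stage, recording who approved and when."""
        if STAGES.index(target) != STAGES.index(self.stage) + 1:
            raise ValueError(f"cannot promote {self.stage} -> {target}")
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append((stamp, self.stage, target, approved_by))
        self.stage = target

mv = ModelVersion("churn-clf", 7, "s3://models/churn/7", "snapshot-2026-01-10")
mv.promote("staging", approved_by="mle-team")
mv.promote("production", approved_by="release-board")
```

Forbidding stage skips (experimental straight to production) is what makes the audit trail meaningful.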
Serving Patterns
Pattern 4: Batch Scoring Pipeline
→ See Deployment Patterns
- Orchestration with Airflow/Dagster
- Idempotent scoring jobs
- Validation and backfill procedures
Pattern 5: Real-Time API Scoring
→ See API Design Patterns
- Service design (HTTP/JSON, gRPC)
- Input/output schemas
- Rate limiting, timeouts, circuit breakers
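The reliability bullet above can be illustrated with a minimal circuit breaker for a scoring dependency; the failure threshold and half-open reset policy are illustrative defaults, not a prescribed configuration:

```python
import time

class CircuitBreaker:
    """Failure-counting circuit breaker: open after N failures, probe after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def allow(self) -> bool:
        """Should the next call be attempted?"""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            # Half-open: allow one probe; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

Callers that get `allow() == False` should return a cached or fallback prediction instead of waiting on a failing dependency.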
Pattern 6: Hybrid & Feature Store Integration
→ See Feature Store Patterns
- Batch vs online features
- Feature store architecture
- Training-serving consistency
- Point-in-time correctness
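Point-in-time correctness from Pattern 6 reduces to a simple rule: when building a training row labeled at time `t`, only use feature values recorded at or before `t`. A minimal lookup over a sorted feature history:

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Latest feature value recorded at or before `as_of`.

    `history` is a list of (event_time, value) pairs sorted by event_time.
    Restricting to event_time <= as_of prevents future feature state from
    leaking into training examples.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, as_of)
    if i == 0:
        return None            # no feature value existed yet at as_of
    return history[i - 1][1]

history = [(1, 10.0), (5, 12.5), (9, 13.0)]
value = point_in_time_value(history, 6)   # sees the update at t=5, not t=9
```

Feature stores generalize this lookup to whole entity keys and enforce it identically at training and serving time.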
Operations Patterns
Pattern 7: Monitoring & Alerting
→ See Monitoring Best Practices
- Data, performance, and technical metrics
- SLO definition and tracking
- Dashboard design and alerting strategies
Pattern 8: Drift Detection & Automated Retraining
→ See Drift Detection Guide
- Automated retraining triggers
- Event-driven retraining pipelines
Pattern 9: Incidents & Runbooks
→ See Incident Response Playbooks
- Common failure modes
- Detection, diagnosis, resolution
- Post-mortem procedures
Pattern 10: LLM / RAG in Production
→ See LLM & RAG Production Patterns
- Prompt and configuration management
- Safety and compliance (PII, jailbreaks)
- Cost optimization (token budgets, caching)
- Monitoring and fallbacks
Pattern 11: Cross-Region, Residency & Rollback
→ See Multi-Region Patterns
- Multi-region deployment architectures
- Data residency and tenant isolation
- Disaster recovery and failover
- Regional rollback procedures
Pattern 12: Online Evaluation & Feedback Loops
→ See Online Evaluation Patterns
- Feedback signal collection (implicit, explicit)
- Shadow and canary deployments
- A/B testing with statistical significance
- Human-in-the-loop labeling
- Automated retraining cadence
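For conversion-style metrics, the statistical-significance bullet in Pattern 12 is often a two-proportion z-test with pooled variance. A minimal sketch (the 1.96 cutoff corresponds to p < 0.05 two-sided, assuming large samples):

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Z statistic comparing conversion rates of variants A and B."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative counts: B converts 10.8% vs A's 9.6% over 5000 users each.
z = two_proportion_z(480, 5000, 540, 5000)
significant = abs(z) > 1.96
```

In practice you would also fix the sample size in advance (or use a sequential test) so that peeking at the dashboard does not inflate false positives.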
Pattern 13: AgentOps (AI Agent Operations)
→ See AgentOps Patterns
- Session tracing and replay for AI agents
- Cost and latency tracking across agent runs
- Multi-agent visualization and debugging
- Tool invocation monitoring
- Integration with CrewAI, LangGraph, OpenAI Agents SDK
Pattern 14: Edge MLOps & TinyML
→ See Edge MLOps Patterns
- Device-aware CI/CD pipelines
- OTA model updates with rollback
- Federated learning operations
- Edge drift detection
- Intermittent connectivity handling
Resources (Detailed Guides)
For comprehensive operational guides, see:
Core Infrastructure:
- Data Ingestion Patterns - Data contracts, CDC, batch/streaming ingestion, lineage, schema evolution
- Deployment Lifecycle - Pre-deploy validation, environment promotion, gradual rollout, rollback
- Model Registry Patterns - Versioning, packaging, promotion workflows, governance
- Feature Store Patterns - Batch/online features, hybrid architectures, consistency, latency optimization
Serving & APIs:
- Deployment Patterns - Batch, online, hybrid, streaming deployment strategies and architectures
- API Design Patterns - ML/LLM/RAG API patterns, input/output schemas, reliability patterns, versioning
Operations & Reliability:
- Monitoring Best Practices - Metrics collection, alerting strategies, SLO definition, dashboard design
- Drift Detection Guide - Statistical tests, automated detection, retraining triggers, recovery strategies
- Incident Response Playbooks - Runbooks for common failure modes, diagnostics, resolution steps
Security & Governance:
- Threat Models - Trust boundaries, attack surface, control mapping
- Prompt Injection Mitigation - Input hardening, tool/RAG containment, least privilege
- Jailbreak Defense - Robust refusal behavior, safe completion patterns
- RAG Security - Retrieval poisoning, context injection, sensitive data leakage
- Output Filtering - Layered filters (PII/toxicity/policy), block/rewrite strategies
- Privacy Protection - PII handling, data minimization, retention, consent
- Supply Chain Security - SBOM, dependency pinning, artifact signing
- Safety Evaluation - Red teaming, eval sets, incident readiness
Advanced Patterns:
- LLM & RAG Production Patterns - Prompt management, safety, cost optimization, caching, monitoring
- Multi-Region Patterns - Multi-region deployment, data residency, disaster recovery, rollback
- Online Evaluation Patterns - A/B testing, shadow deployments, feedback loops, automated retraining
- AgentOps Patterns - AI agent observability, session replay, cost tracking, multi-agent debugging
- Edge MLOps Patterns - TinyML, federated learning, OTA updates, device-aware CI/CD
Templates
Use these as copy-paste starting points for production artifacts:
Data Ingestion (dlt)
For loading data into warehouses and pipelines:
- dlt basic pipeline setup - Install, configure, run basic extraction and loading
- dlt REST API sources - Extract from REST APIs with pagination, authentication, rate limiting
- dlt database sources - Replicate from PostgreSQL, MySQL, MongoDB, SQL Server
- dlt incremental loading - Timestamp-based, ID-based, merge/upsert patterns, lookback windows
- dlt warehouse loading - Load to Snowflake, BigQuery, Redshift, Postgres, DuckDB
Use dlt when:
- Loading data from APIs (Stripe, HubSpot, Shopify, custom APIs)
- Replicating databases to warehouses
- Building ELT pipelines with incremental loading
- Managing data ingestion with Python
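The timestamp-based pattern behind the incremental-loading template can be shown in plain Python; dlt's `dlt.sources.incremental` implements the same idea with the watermark persisted as pipeline state. The row shape and field name below are illustrative:

```python
def incremental_load(rows, last_watermark):
    """Select rows newer than the stored watermark and advance it.

    Rows carry an ISO-8601 `updated_at`; string comparison works because
    ISO timestamps sort lexicographically. Pairing this with a merge/upsert
    on the primary key downstream keeps re-delivered rows idempotent.
    Returns (new_rows, next_watermark).
    """
    new_rows = [r for r in rows
                if last_watermark is None or r["updated_at"] > last_watermark]
    next_watermark = max((r["updated_at"] for r in new_rows),
                         default=last_watermark)
    return new_rows, next_watermark

batch = [
    {"id": 1, "updated_at": "2026-01-10T08:00:00Z"},
    {"id": 2, "updated_at": "2026-01-11T09:30:00Z"},
]
rows, wm = incremental_load(batch, "2026-01-10T12:00:00Z")
```

A lookback window (subtracting a fixed interval from the stored watermark before filtering) trades some re-processing for robustness against late-arriving rows.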
For SQL transformation (after ingestion), use:
→ ai-ml-data-science skill (SQLMesh templates for staging/intermediate/marts layers)
Deployment & Packaging
- Deployment & MLOps template - Complete MLOps lifecycle, model registry, promotion workflows
- Deployment readiness checklist - Go/No-Go gate, monitoring, and rollback plan
- API service template - Real-time REST/gRPC API with FastAPI, input validation, rate limiting
- Batch scoring pipeline template - Orchestrated batch inference with Airflow/Dagster, validation, backfill
Monitoring & Operations
- Monitoring & alerting template - Data/performance/technical metrics, dashboards, SLO definition
- Drift detection & retraining template - Automated drift detection, retraining triggers, promotion pipelines
- Incident runbook template - Failure mode playbooks, diagnosis steps, resolution procedures
Navigation
Resources
- references/drift-detection-guide.md
- references/model-registry-patterns.md
- references/online-evaluation-patterns.md
- references/monitoring-best-practices.md
- references/llm-rag-production-patterns.md
- references/api-design-patterns.md
- references/incident-response-playbooks.md
- references/deployment-patterns.md
- references/data-ingestion-patterns.md
- references/deployment-lifecycle.md
- references/feature-store-patterns.md
- references/multi-region-patterns.md
- references/agentops-patterns.md
- references/edge-mlops-patterns.md
Templates
- template-dlt-pipeline.md
- template-dlt-rest-api.md
- template-dlt-database-source.md
- template-dlt-incremental.md
- template-dlt-warehouse-loading.md
- assets/deployment/template-deployment-mlops.md
- assets/deployment/deployment-readiness-checklist.md
- assets/deployment/template-api-service.md
- assets/deployment/template-batch-pipeline.md
- assets/ops/template-incident-runbook.md
- assets/monitoring/template-drift-retraining.md
- assets/monitoring/template-monitoring-plan.md
Data
- data/sources.json - Curated external references
External Resources
See data/sources.json for curated references on:
- Serving frameworks (FastAPI, Flask, gRPC, TorchServe, KServe, Ray Serve)
- Orchestration (Airflow, Dagster, Prefect)
- Model registries and MLOps (MLflow, W&B, Vertex AI, Sagemaker)
- Monitoring and observability (Prometheus, Grafana, OpenTelemetry, Evidently)
- Feature stores (Feast, Tecton, Vertex, Databricks)
- Streaming & messaging (Kafka, Pulsar, Kinesis)
- LLMOps & RAG infra (vector DBs, LLM gateways, safety tools)
Data Lake & Lakehouse
For comprehensive data lake/lakehouse patterns (beyond dlt ingestion), see data-lake-platform:
- Table formats: Apache Iceberg, Delta Lake, Apache Hudi
- Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
- Alternative ingestion: Airbyte (GUI-based connectors)
- Transformation: dbt (alternative to SQLMesh)
- Streaming: Apache Kafka patterns
- Orchestration: Dagster, Airflow
This skill focuses on ML-specific deployment, monitoring, and security. Use data-lake-platform for general-purpose data infrastructure.
Recency Protocol (Tooling Recommendations)
When users ask recommendation questions about MLOps tooling, verify recency before answering.
Trigger Conditions
- "What's the best MLOps platform for [use case]?"
- "What should I use for [deployment/monitoring/drift detection]?"
- "What's the latest in MLOps?"
- "Current best practices for [model registry/feature store/observability]?"
- "Is [MLflow/Kubeflow/Vertex AI] still relevant in 2026?"
- "[MLOps tool A] vs [MLOps tool B]?"
- "Best way to deploy [LLM/ML model] to production?"
- "What feature store should I use?"
Minimal Recency Check
- Start from data/sources.json and prefer sources with add_as_web_search: true.
- If web search or browsing is available, confirm at least: (a) the tool’s latest release/docs date, (b) active maintenance signals, (c) a recent comparison/alternatives post.
- If live search is not available, state that you are relying on static knowledge + data/sources.json, and recommend validation steps (POC + evals + rollout plan).
What to Report
After searching, provide:
- Current landscape: What MLOps tools/platforms are popular NOW
- Emerging trends: New approaches gaining traction (LLMOps, GenAI ops)
- Deprecated/declining: Tools or approaches losing relevance
- Recommendation: Based on fresh data, not just static knowledge
Related Skills
For adjacent topics, reference these skills:
- ai-ml-data-science - EDA, feature engineering, modelling, evaluation, SQLMesh transformations
- ai-llm - Prompting, fine-tuning, evaluation for LLMs
- ai-agents - Agentic workflows, multi-agent systems, LLMOps
- ai-rag - RAG pipeline design, chunking, retrieval, evaluation
- ai-llm-inference - Model serving optimization, quantization, batching
- ai-prompt-engineering - Prompt design patterns and best practices
- data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)
Use this skill to turn trained models into reliable services, not to derive the model itself.