ai-mlops
MLOps & ML Security - Complete Reference (Jan 2026)
Production ML lifecycle with modern security practices.
This skill covers:
- Production: Data ingestion, deployment, drift detection, monitoring, incident response
- Security: Prompt injection, jailbreak defense, RAG security, output filtering
- Governance: Privacy protection, supply chain security, safety evaluation
- Data ingestion (dlt): Load data from APIs, databases to warehouses
- Model deployment: Batch jobs, real-time APIs, hybrid systems, event-driven automation
- Operations: Real-time monitoring, drift detection, automated retraining, incident response
Modern Best Practices (Jan 2026):
- Version everything that can change: model artifacts, data snapshots, feature definitions, prompts/configs, and agent graphs; require reproducibility, rollbacks, and audit logs (NIST SSDF: https://csrc.nist.gov/pubs/sp/800/218/final).
- Gate changes with evals (offline + online) and safe rollout (shadow/canary/blue-green); treat regressions in quality, safety, latency, and cost as release blockers.
- Align controls and documentation to risk posture (EU AI Act: https://eur-lex.europa.eu/eli/reg/2024/1689/oj; NIST AI RMF + GenAI profile: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf).
- Operationalize security: threat model the full system (data, model, prompts, tools, RAG), harden the supply chain (SBOM/signing), and ship incident playbooks for both reliability and safety events.
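The eval-gated rollout in the second bullet can be sketched as a release gate that treats any regression in quality, safety, latency, or cost as a blocker. The metric names and thresholds below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    """Offline/online eval summary for one model or prompt version."""
    quality: float               # e.g. task accuracy; higher is better
    safety_pass_rate: float      # fraction of safety evals passed
    p95_latency_ms: float
    cost_per_1k_requests: float

def release_gate(candidate: EvalReport, baseline: EvalReport,
                 max_quality_drop: float = 0.01,
                 max_latency_ratio: float = 1.10,
                 max_cost_ratio: float = 1.10) -> list:
    """Return blocking reasons; an empty list means the rollout may proceed."""
    blockers = []
    if candidate.quality < baseline.quality - max_quality_drop:
        blockers.append("quality regression")
    if candidate.safety_pass_rate < baseline.safety_pass_rate:
        blockers.append("safety regression")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        blockers.append("latency regression")
    if candidate.cost_per_1k_requests > baseline.cost_per_1k_requests * max_cost_ratio:
        blockers.append("cost regression")
    return blockers

baseline = EvalReport(quality=0.91, safety_pass_rate=0.99,
                      p95_latency_ms=420, cost_per_1k_requests=1.8)
candidate = EvalReport(quality=0.92, safety_pass_rate=0.97,
                       p95_latency_ms=410, cost_per_1k_requests=1.7)
blockers = release_gate(candidate, baseline)
```

In a canary rollout, the same gate would run again on live canary metrics before widening traffic.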
It is execution-focused:
- Data ingestion patterns (REST APIs, database replication, incremental loading)
- Deployment patterns (batch, online, hybrid, streaming, event-driven)
- Automated monitoring with real-time drift detection
- Automated retraining pipelines (monitor → detect → trigger → validate → deploy)
- Incident handling with validated rollback and postmortems
- Links to copy-paste templates in assets/
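The monitor → detect → trigger → validate → deploy loop above can be sketched as one auditable cycle. The five collaborators here are illustrative callables, not a specific framework's API; real pipelines would wire them to monitoring queries, a training job, an eval suite, and a registry promotion:

```python
def run_retraining_cycle(metrics, drift_detector, trainer, validator, deployer):
    """One pass of the monitor -> detect -> trigger -> validate -> deploy loop."""
    if not drift_detector(metrics):       # monitor + detect
        return "no_action"
    candidate = trainer()                 # trigger: retrain on fresh data
    if not validator(candidate):          # validate against the champion model
        return "candidate_rejected"
    deployer(candidate)                   # deploy, ideally behind a canary
    return "deployed"

# Toy wiring: a PSI threshold as the drift signal (values are made up).
action = run_retraining_cycle(
    metrics={"psi": 0.31},
    drift_detector=lambda m: m["psi"] > 0.25,
    trainer=lambda: {"version": "v2"},
    validator=lambda candidate: True,
    deployer=lambda candidate: None,
)
```

Returning the action taken (rather than just side effects) keeps every cycle auditable.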
Quick Reference
| Task | Tool/Framework | Command | When to Use |
|---|---|---|---|
| Data Ingestion | dlt (data load tool) | | Loading from APIs, databases to warehouses |
| Batch Deployment | Airflow, Dagster, Prefect | | Scheduled predictions on large datasets |
| API Deployment | FastAPI, Flask, TorchServe | | Real-time inference (<500ms latency) |
| LLM Serving | vLLM, TGI, BentoML | | High-throughput LLM inference |
| Model Registry | MLflow, W&B, ZenML | | Versioning and promoting models |
| Drift Detection | Statistical tests + monitors | PSI/KS, embedding drift, prediction drift | Detect data/process changes and trigger review |
| Monitoring | Prometheus, Grafana | | Metrics, alerts, SLO tracking |
| AgentOps | AgentOps, Langfuse, LangSmith | | AI agent observability, session replay |
| Incident Response | Runbooks, PagerDuty | Documented playbooks, alert routing | Handling failures and degradation |
Use This Skill When
Use this skill when the user asks for deployment, operations, monitoring, incident handling, or governance for ML/LLM/agent systems, e.g.:
- "How do I deploy this model to prod?"
- "Design a batch + online scoring architecture."
- "Add monitoring and drift detection to our model."
- "Write an incident runbook for this ML service."
- "Package this LLM/RAG pipeline as an API."
- "Plan our retraining and promotion workflow."
- "Load data from Stripe API to Snowflake."
- "Set up incremental database replication with dlt."
- "Build an ELT pipeline for warehouse loading."
If the user is asking only about EDA, modelling, or theory, prefer:
- ai-ml-data-science (EDA, features, modelling, SQL transformation with SQLMesh)
- ai-llm (prompting, fine-tuning, eval)
- ai-rag (retrieval pipeline design)
- ai-llm-inference (compression, spec decode, serving internals)
If the user is asking about SQL transformation (after data is loaded), prefer:
- ai-ml-data-science (SQLMesh templates for staging, intermediate, marts layers)
Decision Tree: Choosing Deployment Strategy
```text
User needs to deploy: [ML System]
├─ Data Ingestion?
│   ├─ From REST APIs? → dlt REST API templates
│   ├─ From databases? → dlt database sources (PostgreSQL, MySQL, MongoDB)
│   └─ Incremental loading? → dlt incremental patterns (timestamp, ID-based)
│
├─ Model Serving?
│   ├─ Latency <500ms? → FastAPI real-time API
│   ├─ Batch predictions? → Airflow/Dagster batch pipeline
│   └─ Mix of both? → Hybrid (batch features + online scoring)
│
├─ Monitoring & Ops?
│   ├─ Drift detection? → Evidently + automated retraining triggers
│   ├─ Performance tracking? → Prometheus + Grafana dashboards
│   └─ Incident response? → Runbooks + PagerDuty alerts
│
└─ LLM/RAG Production?
    ├─ Cost optimization? → Caching, prompt templates, token budgets
    └─ Safety? → See ai-mlops skill
```
Core Concepts (Vendor-Agnostic)
- Lifecycle loop: train → validate → deploy → monitor → respond → retrain/retire.
- Risk controls: access control, data minimization, logging, and change management (NIST AI RMF: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf).
- Observability planes: system metrics (latency/errors), data metrics (freshness/drift), quality metrics (model performance).
- Incident readiness: detection, containment, rollback, and root-cause analysis.
Do / Avoid
Do
- Do gate deployments with repeatable checks: evaluation pass, load test, security review, rollback plan.
- Do version everything: code, data, features, model artifact, prompt templates, configuration.
- Do define SLOs and budgets (latency/cost/error rate) before optimizing.
Avoid
- Avoid manual “clickops” deployments without audit trail.
- Avoid silent upgrades; require eval + canary for model/prompt changes.
- Avoid drift dashboards without actions; every alert needs an owner and runbook.
Core Patterns Overview
This skill provides production-ready patterns and guides organized into comprehensive references:
Data & Infrastructure Patterns
Pattern 0: Data Contracts, Ingestion & Lineage
→ See Data Ingestion Patterns
- Data contracts with SLAs and versioning
- Ingestion modes (CDC, batch, streaming)
- Lineage tracking and schema evolution
- Replay and backfill procedures
Pattern 1: Choose Deployment Mode
→ See Deployment Patterns
- Decision table (batch, online, hybrid, streaming)
- When to use each mode
- Deployment mode selection checklist
Pattern 2: Standard Deployment Lifecycle
→ See Deployment Lifecycle
- Pre-deploy, deploy, observe, operate, evolve phases
- Environment promotion (dev → staging → prod)
- Gradual rollout strategies (canary, blue-green)
Pattern 3: Packaging & Model Registry
→ See Model Registry Patterns
- Model registry structure and metadata
- Packaging strategies (Docker, ONNX, MLflow)
- Promotion flows (experimental → production)
- Versioning and governance
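The promotion flow in Pattern 3 can be sketched as a registry record with a one-stage-at-a-time state machine and an audit trail. The field names and stage list below are illustrative, not MLflow's (or any registry's) actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

STAGES = ["experimental", "staging", "production", "archived"]

@dataclass
class ModelVersion:
    """Minimal registry record for one immutable model version."""
    name: str
    version: int
    artifact_uri: str
    data_snapshot: str                 # which data the model was trained on
    stage: str = "experimental"
    history: list = field(default_factory=list)

    def promote(self, target: str, approved_by: str) -> None:
        """Advance exactly one stage, recording who approved and when."""
        if STAGES.index(target) != STAGES.index(self.stage) + 1:
            raise ValueError(f"cannot promote {self.stage} -> {target}")
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append((stamp, self.stage, target, approved_by))
        self.stage = target

mv = ModelVersion("churn-clf", 7, "s3://models/churn/7", "snapshot-2026-01-10")
mv.promote("staging", approved_by="mle-team")
mv.promote("production", approved_by="release-board")
```

Forbidding stage skips (experimental straight to production) is what makes the audit trail meaningful.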
Serving Patterns
Pattern 4: Batch Scoring Pipeline
→ See Deployment Patterns
- Orchestration with Airflow/Dagster
- Idempotent scoring jobs
- Validation and backfill procedures
Pattern 5: Real-Time API Scoring
→ See API Design Patterns
- Service design (HTTP/JSON, gRPC)
- Input/output schemas
- Rate limiting, timeouts, circuit breakers
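The reliability bullet above can be illustrated with a minimal circuit breaker for a scoring dependency; the failure threshold and half-open reset policy are illustrative defaults, not a prescribed configuration:

```python
import time

class CircuitBreaker:
    """Failure-counting circuit breaker: open after N failures, probe after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def allow(self) -> bool:
        """Should the next call be attempted?"""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            # Half-open: allow one probe; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

Callers that get `allow() == False` should return a cached or fallback prediction instead of waiting on a failing dependency.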
Pattern 6: Hybrid & Feature Store Integration
→ See Feature Store Patterns
- Batch vs online features
- Feature store architecture
- Training-serving consistency
- Point-in-time correctness
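Point-in-time correctness from Pattern 6 reduces to a simple rule: when building a training row labeled at time `t`, only use feature values recorded at or before `t`. A minimal lookup over a sorted feature history:

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Latest feature value recorded at or before `as_of`.

    `history` is a list of (event_time, value) pairs sorted by event_time.
    Restricting to event_time <= as_of prevents future feature state from
    leaking into training examples.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, as_of)
    if i == 0:
        return None            # no feature value existed yet at as_of
    return history[i - 1][1]

history = [(1, 10.0), (5, 12.5), (9, 13.0)]
value = point_in_time_value(history, 6)   # sees the update at t=5, not t=9
```

Feature stores generalize this lookup to whole entity keys and enforce it identically at training and serving time.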
Operations Patterns
Pattern 7: Monitoring & Alerting
→ See Monitoring Best Practices
- Data, performance, and technical metrics
- SLO definition and tracking
- Dashboard design and alerting strategies
Pattern 8: Drift Detection & Automated Retraining
→ See Drift Detection Guide
- Automated retraining triggers
- Event-driven retraining pipelines
Pattern 9: Incidents & Runbooks
→ See Incident Response Playbooks
- Common failure modes
- Detection, diagnosis, resolution
- Post-mortem procedures
Pattern 10: LLM / RAG in Production
→ See LLM & RAG Production Patterns
- Prompt and configuration management
- Safety and compliance (PII, jailbreaks)
- Cost optimization (token budgets, caching)
- Monitoring and fallbacks
Pattern 11: Cross-Region, Residency & Rollback
→ See Multi-Region Patterns
- Multi-region deployment architectures
- Data residency and tenant isolation
- Disaster recovery and failover
- Regional rollback procedures
Pattern 12: Online Evaluation & Feedback Loops
→ See Online Evaluation Patterns
- Feedback signal collection (implicit, explicit)
- Shadow and canary deployments
- A/B testing with statistical significance
- Human-in-the-loop labeling
- Automated retraining cadence
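For conversion-style metrics, the statistical-significance bullet in Pattern 12 is often a two-proportion z-test with pooled variance. A minimal sketch (the 1.96 cutoff corresponds to p < 0.05 two-sided, assuming large samples):

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Z statistic comparing conversion rates of variants A and B."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative counts: B converts 10.8% vs A's 9.6% over 5000 users each.
z = two_proportion_z(480, 5000, 540, 5000)
significant = abs(z) > 1.96
```

In practice you would also fix the sample size in advance (or use a sequential test) so that peeking at the dashboard does not inflate false positives.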
Pattern 13: AgentOps (AI Agent Operations)
→ See AgentOps Patterns
- Session tracing and replay for AI agents
- Cost and latency tracking across agent runs
- Multi-agent visualization and debugging
- Tool invocation monitoring
- Integration with CrewAI, LangGraph, OpenAI Agents SDK
Pattern 14: Edge MLOps & TinyML
→ See Edge MLOps Patterns
- Device-aware CI/CD pipelines
- OTA model updates with rollback
- Federated learning operations
- Edge drift detection
- Intermittent connectivity handling
Resources (Detailed Guides)
For comprehensive operational guides, see:
Core Infrastructure:
- Data Ingestion Patterns - Data contracts, CDC, batch/streaming ingestion, lineage, schema evolution
- Deployment Lifecycle - Pre-deploy validation, environment promotion, gradual rollout, rollback
- Model Registry Patterns - Versioning, packaging, promotion workflows, governance
- Feature Store Patterns - Batch/online features, hybrid architectures, consistency, latency optimization
Serving & APIs:
- Deployment Patterns - Batch, online, hybrid, streaming deployment strategies and architectures
- API Design Patterns - ML/LLM/RAG API patterns, input/output schemas, reliability patterns, versioning
Operations & Reliability:
- Monitoring Best Practices - Metrics collection, alerting strategies, SLO definition, dashboard design
- Drift Detection Guide - Statistical tests, automated detection, retraining triggers, recovery strategies
- Incident Response Playbooks - Runbooks for common failure modes, diagnostics, resolution steps
Security & Governance:
- Threat Models - Trust boundaries, attack surface, control mapping
- Prompt Injection Mitigation - Input hardening, tool/RAG containment, least privilege
- Jailbreak Defense - Robust refusal behavior, safe completion patterns
- RAG Security - Retrieval poisoning, context injection, sensitive data leakage
- Output Filtering - Layered filters (PII/toxicity/policy), block/rewrite strategies
- Privacy Protection - PII handling, data minimization, retention, consent
- Supply Chain Security - SBOM, dependency pinning, artifact signing
- Safety Evaluation - Red teaming, eval sets, incident readiness
Advanced Patterns:
- LLM & RAG Production Patterns - Prompt management, safety, cost optimization, caching, monitoring
- Multi-Region Patterns - Multi-region deployment, data residency, disaster recovery, rollback
- Online Evaluation Patterns - A/B testing, shadow deployments, feedback loops, automated retraining
- AgentOps Patterns - AI agent observability, session replay, cost tracking, multi-agent debugging
- Edge MLOps Patterns - TinyML, federated learning, OTA updates, device-aware CI/CD
Templates
Use these as copy-paste starting points for production artifacts:
Data Ingestion (dlt)
For loading data into warehouses and pipelines:
- dlt basic pipeline setup - Install, configure, run basic extraction and loading
- dlt REST API sources - Extract from REST APIs with pagination, authentication, rate limiting
- dlt database sources - Replicate from PostgreSQL, MySQL, MongoDB, SQL Server
- dlt incremental loading - Timestamp-based, ID-based, merge/upsert patterns, lookback windows
- dlt warehouse loading - Load to Snowflake, BigQuery, Redshift, Postgres, DuckDB
Use dlt when:
- Loading data from APIs (Stripe, HubSpot, Shopify, custom APIs)
- Replicating databases to warehouses
- Building ELT pipelines with incremental loading
- Managing data ingestion with Python
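The timestamp-based pattern behind the incremental-loading template can be shown in plain Python; dlt's `dlt.sources.incremental` implements the same idea with the watermark persisted as pipeline state. The row shape and field name below are illustrative:

```python
def incremental_load(rows, last_watermark):
    """Select rows newer than the stored watermark and advance it.

    Rows carry an ISO-8601 `updated_at`; string comparison works because
    ISO timestamps sort lexicographically. Pairing this with a merge/upsert
    on the primary key downstream keeps re-delivered rows idempotent.
    Returns (new_rows, next_watermark).
    """
    new_rows = [r for r in rows
                if last_watermark is None or r["updated_at"] > last_watermark]
    next_watermark = max((r["updated_at"] for r in new_rows),
                         default=last_watermark)
    return new_rows, next_watermark

batch = [
    {"id": 1, "updated_at": "2026-01-10T08:00:00Z"},
    {"id": 2, "updated_at": "2026-01-11T09:30:00Z"},
]
rows, wm = incremental_load(batch, "2026-01-10T12:00:00Z")
```

A lookback window (subtracting a fixed interval from the stored watermark before filtering) trades some re-processing for robustness against late-arriving rows.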
For SQL transformation (after ingestion), use:
→ ai-ml-data-science skill (SQLMesh templates for staging/intermediate/marts layers)
Deployment & Packaging
- Deployment & MLOps template - Complete MLOps lifecycle, model registry, promotion workflows
- Deployment readiness checklist - Go/No-Go gate, monitoring, and rollback plan
- API service template - Real-time REST/gRPC API with FastAPI, input validation, rate limiting
- Batch scoring pipeline template - Orchestrated batch inference with Airflow/Dagster, validation, backfill
Monitoring & Operations
- Monitoring & alerting template - Data/performance/technical metrics, dashboards, SLO definition
- Drift detection & retraining template - Automated drift detection, retraining triggers, promotion pipelines
- Incident runbook template - Failure mode playbooks, diagnosis steps, resolution procedures
Navigation
Resources
- references/drift-detection-guide.md
- references/model-registry-patterns.md
- references/online-evaluation-patterns.md
- references/monitoring-best-practices.md
- references/llm-rag-production-patterns.md
- references/api-design-patterns.md
- references/incident-response-playbooks.md
- references/deployment-patterns.md
- references/data-ingestion-patterns.md
- references/deployment-lifecycle.md
- references/feature-store-patterns.md
- references/multi-region-patterns.md
- references/agentops-patterns.md
- references/edge-mlops-patterns.md
Templates
- template-dlt-pipeline.md
- template-dlt-rest-api.md
- template-dlt-database-source.md
- template-dlt-incremental.md
- template-dlt-warehouse-loading.md
- assets/deployment/template-deployment-mlops.md
- assets/deployment/deployment-readiness-checklist.md
- assets/deployment/template-api-service.md
- assets/deployment/template-batch-pipeline.md
- assets/ops/template-incident-runbook.md
- assets/monitoring/template-drift-retraining.md
- assets/monitoring/template-monitoring-plan.md
Data
- data/sources.json - Curated external references
External Resources
See data/sources.json for curated references on:
- Serving frameworks (FastAPI, Flask, gRPC, TorchServe, KServe, Ray Serve)
- Orchestration (Airflow, Dagster, Prefect)
- Model registries and MLOps (MLflow, W&B, Vertex AI, Sagemaker)
- Monitoring and observability (Prometheus, Grafana, OpenTelemetry, Evidently)
- Feature stores (Feast, Tecton, Vertex, Databricks)
- Streaming & messaging (Kafka, Pulsar, Kinesis)
- LLMOps & RAG infra (vector DBs, LLM gateways, safety tools)
Data Lake & Lakehouse
For comprehensive data lake/lakehouse patterns (beyond dlt ingestion), see data-lake-platform:
- Table formats: Apache Iceberg, Delta Lake, Apache Hudi
- Query engines: ClickHouse, DuckDB, Apache Doris, StarRocks
- Alternative ingestion: Airbyte (GUI-based connectors)
- Transformation: dbt (alternative to SQLMesh)
- Streaming: Apache Kafka patterns
- Orchestration: Dagster, Airflow
This skill focuses on ML-specific deployment, monitoring, and security. Use data-lake-platform for general-purpose data infrastructure.
Recency Protocol (Tooling Recommendations)
When users ask recommendation questions about MLOps tooling, verify recency before answering.
Trigger Conditions
- "What's the best MLOps platform for [use case]?"
- "What should I use for [deployment/monitoring/drift detection]?"
- "What's the latest in MLOps?"
- "Current best practices for [model registry/feature store/observability]?"
- "Is [MLflow/Kubeflow/Vertex AI] still relevant in 2026?"
- "[MLOps tool A] vs [MLOps tool B]?"
- "Best way to deploy [LLM/ML model] to production?"
- "What feature store should I use?"
Minimal Recency Check
- Start from data/sources.json and prefer sources with add_as_web_search: true.
- If web search or browsing is available, confirm at least: (a) the tool’s latest release/docs date, (b) active maintenance signals, (c) a recent comparison/alternatives post.
- If live search is not available, state that you are relying on static knowledge + data/sources.json, and recommend validation steps (POC + evals + rollout plan).
What to Report
After searching, provide:
- Current landscape: What MLOps tools/platforms are popular NOW
- Emerging trends: New approaches gaining traction (LLMOps, GenAI ops)
- Deprecated/declining: Tools or approaches losing relevance
- Recommendation: Based on fresh data, not just static knowledge
Related Skills
For adjacent topics, reference these skills:
- ai-ml-data-science - EDA, feature engineering, modelling, evaluation, SQLMesh transformations
- ai-llm - Prompting, fine-tuning, evaluation for LLMs
- ai-agents - Agentic workflows, multi-agent systems, LLMOps
- ai-rag - RAG pipeline design, chunking, retrieval, evaluation
- ai-llm-inference - Model serving optimization, quantization, batching
- ai-prompt-engineering - Prompt design patterns and best practices
- data-lake-platform - Data lake/lakehouse infrastructure (ClickHouse, Iceberg, Kafka)
Use this skill to turn trained models into reliable services, not to derive the model itself.