senior-ml-engineer
Senior ML Engineer
Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.
Model Deployment Workflow
Deploy a trained model to production with monitoring:
- Export model to standardized format (ONNX, TorchScript, SavedModel)
- Package model with dependencies in Docker container
- Deploy to staging environment
- Run integration tests against staging
- Deploy canary (5% traffic) to production
- Monitor latency and error rates for 1 hour
- Promote to full production if metrics pass
- Validation: p95 latency < 100ms, error rate < 0.1%
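The canary gate in the steps above can be sketched as a small check. This is a minimal sketch, assuming latency samples and error counts are collected from the canary; the function names are illustrative, while the thresholds follow the validation line:

```python
def p95(samples):
    """Return the 95th-percentile value from a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

def should_promote(latencies_ms, error_count, request_count,
                   p95_budget_ms=100.0, error_budget=0.001):
    """Gate full-production rollout on the canary's observed metrics."""
    error_rate = error_count / request_count
    return p95(latencies_ms) < p95_budget_ms and error_rate < error_budget
```

In practice the samples would come from the monitoring backend after the 1-hour canary window, and the gate would run in the deployment pipeline before the promote step.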
Container Template
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model/ /app/model/
COPY src/ /app/src/
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]
```
Serving Options
| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |
MLOps Pipeline Setup
Establish automated training and deployment:
- Configure feature store (Feast, Tecton) for training data
- Set up experiment tracking (MLflow, Weights & Biases)
- Create training pipeline with hyperparameter logging
- Register model in model registry with version metadata
- Configure staging deployment triggered by registry events
- Set up A/B testing infrastructure for model comparison
- Enable drift monitoring with alerting
- Validation: New models automatically evaluated against baseline
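The final validation step, automatic evaluation against a baseline, can be sketched as a pure comparison function. This is a sketch under simplifying assumptions: metrics are higher-is-better and passed in as dicts, whereas a real pipeline would pull them from the model registry; the function name and tolerance are illustrative:

```python
def beats_baseline(candidate, baseline, tolerance=0.005):
    """Accept a candidate model only if no tracked metric regresses beyond tolerance.

    Both arguments are dicts of higher-is-better metrics,
    e.g. {"accuracy": 0.91, "f1": 0.88}.
    """
    return all(
        candidate.get(metric, 0.0) >= value - tolerance
        for metric, value in baseline.items()
    )
```

Hooking this into the registry event that triggers staging deployment makes the promotion decision reproducible and auditable.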
Feature Store Pattern
```python
from datetime import timedelta

# Feast 0.x API shown here; newer Feast releases replace Feature/features
# with Field/schema.
from feast import Entity, Feature, FeatureView, FileSource, ValueType

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)
```
Retraining Triggers
| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |
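The PSI statistic used in the data-drift row can be computed as follows. This is one common formulation (quantile bins taken from the reference sample); the bin count is a convention and the 0.2 cutoff follows the table:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    # Bin edges come from the reference distribution's quantiles.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_frac = np.histogram(expected, edges)[0] / len(expected)
    actual_frac = np.histogram(actual, edges)[0] / len(actual)
    # Clip to avoid log(0) on empty bins.
    expected_frac = np.clip(expected_frac, 1e-6, None)
    actual_frac = np.clip(actual_frac, 1e-6, None)
    return float(np.sum((actual_frac - expected_frac)
                        * np.log(actual_frac / expected_frac)))
```

A shifted feature distribution pushes mass out of the reference quantile bins, so the statistic grows with the size of the shift.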
LLM Integration Workflow
Integrate LLM APIs into production applications:
- Create provider abstraction layer for vendor flexibility
- Implement retry logic with exponential backoff
- Configure fallback to secondary provider
- Set up token counting and context truncation
- Add response caching for repeated queries
- Implement cost tracking per request
- Add structured output validation with Pydantic
- Validation: Response parses correctly, cost within budget
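The per-request cost-tracking step can be sketched with a simple rate table. The rates below mirror this document's pricing table and will drift over time, so verify against each provider's current price sheet; `PRICE_PER_1K` and `request_cost` are illustrative names:

```python
# USD per 1K tokens as (input, output); illustrative rates -- check current pricing.
PRICE_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5": (0.0005, 0.0015),
    "claude-3-opus": (0.015, 0.075),
    "claude-3-haiku": (0.00025, 0.00125),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single LLM call."""
    input_rate, output_rate = PRICE_PER_1K[model]
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
```

Summing these per request (tagged by caller or feature) is what makes the "cost within budget" validation enforceable.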
Provider Abstraction
```python
from abc import ABC, abstractmethod

from tenacity import retry, stop_after_attempt, wait_exponential

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)
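The fallback-to-secondary-provider step can build on the same abstraction. A self-contained sketch: `FallbackProvider` is a hypothetical name, and the bare `except Exception` stands in for provider-specific error types:

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str: ...

class FallbackProvider(LLMProvider):
    """Tries each configured provider in order; the first success wins."""

    def __init__(self, providers):
        self.providers = providers

    def complete(self, prompt: str, **kwargs) -> str:
        errors = []
        for provider in self.providers:
            try:
                return provider.complete(prompt, **kwargs)
            except Exception as exc:  # narrow to provider-specific errors in production
                errors.append(exc)
        raise RuntimeError(f"all providers failed: {errors}")
```

Because it implements the same interface, the fallback chain can be passed anywhere a single provider is expected, including the retry wrapper above.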
Cost Management
| Provider | Input Cost | Output Cost |
|---|---|---|
| GPT-4 | $0.03/1K | $0.06/1K |
| GPT-3.5 | $0.0005/1K | $0.0015/1K |
| Claude 3 Opus | $0.015/1K | $0.075/1K |
| Claude 3 Haiku | $0.00025/1K | $0.00125/1K |
RAG System Implementation
Build retrieval-augmented generation pipeline:
- Choose vector database (Pinecone, Qdrant, Weaviate)
- Select embedding model based on quality/cost tradeoff
- Implement document chunking strategy
- Create ingestion pipeline with metadata extraction
- Build retrieval with query embedding
- Add reranking for relevance improvement
- Format context and send to LLM
- Validation: Response references retrieved context, no hallucinations
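The retrieval step can be sketched with plain cosine similarity over precomputed embeddings. This is a stand-in for a real vector-database query, assuming embeddings are already computed; `top_k` is an illustrative helper:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k documents most cosine-similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of each document to the query
    return np.argsort(scores)[::-1][:k]
```

A production pipeline delegates this to the vector database's ANN index and then passes the candidates to the reranker.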
Vector Database Selection
| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | Medium | Existing Postgres |
Chunking Strategies
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |
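The fixed strategy in the table above can be sketched as follows. Token counting is approximated by whitespace words here; a real pipeline would count with the embedding model's tokenizer, and `fixed_chunks` is an illustrative name:

```python
def fixed_chunks(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

The overlap preserves context that straddles a chunk boundary, at the cost of some duplicated storage and embedding compute.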
Model Monitoring
Monitor production models for drift and degradation:
- Set up latency tracking (p50, p95, p99)
- Configure error rate alerting
- Implement input data drift detection
- Track prediction distribution shifts
- Log ground truth when available
- Compare model versions with A/B metrics
- Set up automated retraining triggers
- Validation: Alerts fire before user-visible degradation
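The latency-tracking step (p50, p95, p99) can be sketched with a sliding window of recent samples. A minimal in-process sketch; real deployments usually export histograms to a metrics backend instead, and `LatencyTracker` is an illustrative name:

```python
from collections import deque

class LatencyTracker:
    """Keeps the last `window` latency samples and reports percentiles."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, q):
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, int(q / 100 * len(ordered)))
        return ordered[index]
```

The bounded deque keeps memory constant while still reflecting recent behavior, which is what the alert thresholds below are evaluated against.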
Drift Detection
```python
from scipy.stats import ks_2samp

def detect_drift(reference, current, threshold=0.05):
    """Two-sample Kolmogorov-Smirnov test between reference and current data."""
    statistic, p_value = ks_2samp(reference, current)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value,
    }
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| p95 latency | > 100ms | > 200ms |
| Error rate | > 0.1% | > 1% |
| PSI (drift) | > 0.1 | > 0.2 |
| Accuracy drop | > 2% | > 5% |
Reference Documentation
MLOps Production Patterns
references/mlops_production_patterns.md
- Model deployment pipeline with Kubernetes manifests
- Feature store architecture with Feast examples
- Model monitoring with drift detection code
- A/B testing infrastructure with traffic splitting
- Automated retraining pipeline with MLflow
LLM Integration Guide
references/llm_integration_guide.md
- Provider abstraction layer pattern
- Retry and fallback strategies with tenacity
- Prompt engineering templates (few-shot, CoT)
- Token optimization with tiktoken
- Cost calculation and tracking
RAG System Architecture
references/rag_system_architecture.md
- RAG pipeline implementation with code
- Vector database comparison and integration
- Chunking strategies (fixed, semantic, recursive)
- Embedding model selection guide
- Hybrid search and reranking patterns
Tools
Model Deployment Pipeline
```bash
python scripts/model_deployment_pipeline.py --model model.pkl --target staging
```
Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.
RAG System Builder
```bash
python scripts/rag_system_builder.py --config rag_config.yaml --analyze
```
Scaffolds RAG pipeline with vector store integration and retrieval logic.
ML Monitoring Suite
```bash
python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy
```
Sets up drift detection, alerting, and performance dashboards.
Tech Stack
| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| MLOps | MLflow, Weights & Biases, Kubeflow |
| Data | Spark, Airflow, dbt, Kafka |
| Deployment | Docker, Kubernetes, Triton |
| Databases | PostgreSQL, BigQuery, Pinecone, Redis |