senior-ml-engineer


Senior ML Engineer


Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.



Model Deployment Workflow


Deploy a trained model to production with monitoring:
  1. Export model to standardized format (ONNX, TorchScript, SavedModel)
  2. Package model with dependencies in Docker container
  3. Deploy to staging environment
  4. Run integration tests against staging
  5. Deploy canary (5% traffic) to production
  6. Monitor latency and error rates for 1 hour
  7. Promote to full production if metrics pass
  8. Validation: p95 latency < 100ms, error rate < 0.1%
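The canary gate in steps 6–8 can be sketched as a simple metrics check. A minimal sketch, assuming latency samples in milliseconds and request/error counters; all names are illustrative:

```python
from statistics import quantiles

def passes_promotion_gate(latencies_ms, error_count, request_count,
                          p95_budget_ms=100.0, error_budget=0.001):
    """Return True when canary metrics meet the promotion criteria."""
    # quantiles with n=100 yields the 1st..99th percentiles; index 94 is p95
    p95 = quantiles(latencies_ms, n=100)[94]
    error_rate = error_count / request_count
    return p95 < p95_budget_ms and error_rate < error_budget

# Example: 1000 requests, mostly fast, a few slow tails, no errors
latencies = [20.0] * 990 + [150.0] * 10
print(passes_promotion_gate(latencies, error_count=0, request_count=1000))
```

In practice the latency and error counts would come from your metrics backend (Prometheus, CloudWatch, etc.) over the one-hour canary window.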

Container Template


```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ model/
COPY src/ src/

# python:3.11-slim ships without curl, so probe the health endpoint with the stdlib
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]
```

Serving Options


| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |


MLOps Pipeline Setup


Establish automated training and deployment:
  1. Configure feature store (Feast, Tecton) for training data
  2. Set up experiment tracking (MLflow, Weights & Biases)
  3. Create training pipeline with hyperparameter logging
  4. Register model in model registry with version metadata
  5. Configure staging deployment triggered by registry events
  6. Set up A/B testing infrastructure for model comparison
  7. Enable drift monitoring with alerting
  8. Validation: New models automatically evaluated against baseline
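As a toy illustration of steps 4 and 8, here is an in-memory registry holding version metadata with a baseline comparison; a real setup would use MLflow's model registry or equivalent, and every name below is illustrative:

```python
class ModelRegistry:
    """Toy in-memory model registry; stands in for MLflow's registry in this sketch."""

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, artifact_uri, metrics, tags=None):
        """Step 4: record a new version with its metadata; returns the version number."""
        versions = self._versions.setdefault(name, [])
        versions.append({
            "version": len(versions) + 1,
            "artifact_uri": artifact_uri,
            "metrics": metrics,
            "tags": tags or {},
        })
        return versions[-1]["version"]

    def beats_baseline(self, name, version, metric="accuracy"):
        """Step 8: compare a candidate against the best earlier version (the baseline)."""
        versions = self._versions[name]
        candidate = versions[version - 1]
        baseline = max(versions[:version - 1], default=None,
                       key=lambda r: r["metrics"][metric])
        return baseline is None or candidate["metrics"][metric] > baseline["metrics"][metric]

registry = ModelRegistry()
registry.register("churn", "s3://models/churn/1", {"accuracy": 0.91})
v2 = registry.register("churn", "s3://models/churn/2", {"accuracy": 0.93})
print(registry.beats_baseline("churn", v2))  # candidate improves on the baseline
```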

Feature Store Pattern


```python
from datetime import timedelta

# Note: this is Feast's older API; recent releases replace Feature with Field
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Entity keys link feature rows to the online store
user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)
```

Retraining Triggers


| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |

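The data-drift trigger uses the Population Stability Index. A minimal stdlib sketch, assuming equal-width bins over the combined range; the epsilon guards empty bins:

```python
import math

def population_stability_index(expected, actual, bins=10, eps=1e-4):
    """PSI between a reference sample and a current sample, equal-width bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against constant samples

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # floor each bin at eps so the log term stays defined
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical samples give a PSI of 0; a shifted sample trips the 0.2 trigger
ref = [i / 100 for i in range(100)]
print(population_stability_index(ref, ref))
print(population_stability_index(ref, [v + 0.5 for v in ref]) > 0.2)
```

Production PSI implementations usually fix the bin edges from the reference (training) distribution rather than the combined range; this sketch keeps the binning self-contained.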

LLM Integration Workflow


Integrate LLM APIs into production applications:
  1. Create provider abstraction layer for vendor flexibility
  2. Implement retry logic with exponential backoff
  3. Configure fallback to secondary provider
  4. Set up token counting and context truncation
  5. Add response caching for repeated queries
  6. Implement cost tracking per request
  7. Add structured output validation with Pydantic
  8. Validation: Response parses correctly, cost within budget
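Steps 5 and 6 can be sketched with an in-memory cache keyed on a prompt hash plus a rough character-based token estimate (swap in tiktoken for real counts). The wrapper and all names are illustrative:

```python
import hashlib

class CachingLLMClient:
    """Wraps any completion callable with response caching and per-request cost tracking."""

    def __init__(self, complete_fn, input_cost_per_1k, output_cost_per_1k):
        self.complete_fn = complete_fn
        self.input_cost = input_cost_per_1k / 1000
        self.output_cost = output_cost_per_1k / 1000
        self.cache = {}
        self.total_cost = 0.0

    def complete(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:  # repeated query: no API call, no added cost
            return self.cache[key]
        response = self.complete_fn(prompt)
        # crude ~4 chars/token estimate; use tiktoken for accurate accounting
        self.total_cost += ((len(prompt) / 4) * self.input_cost
                            + (len(response) / 4) * self.output_cost)
        self.cache[key] = response
        return response

# Stub "provider" for demonstration; a real one would call the vendor API
client = CachingLLMClient(lambda p: p.upper(), input_cost_per_1k=0.03, output_cost_per_1k=0.06)
client.complete("hello")
client.complete("hello")  # served from cache; total_cost unchanged
```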

Provider Abstraction


```python
from abc import ABC, abstractmethod
from tenacity import retry, stop_after_attempt, wait_exponential

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)
```

Cost Management


| Provider | Input Cost | Output Cost |
|---|---|---|
| GPT-4 | $0.03/1K tokens | $0.06/1K tokens |
| GPT-3.5 | $0.0005/1K tokens | $0.0015/1K tokens |
| Claude 3 Opus | $0.015/1K tokens | $0.075/1K tokens |
| Claude 3 Haiku | $0.00025/1K tokens | $0.00125/1K tokens |

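Per-request cost (step 6 of the integration workflow) follows directly from the table. The prices below are a snapshot of the table above; providers change pricing, so verify against current rate cards:

```python
# USD per 1K tokens (input, output), copied from the table above
PRICING = {
    "gpt-4":          (0.03,    0.06),
    "gpt-3.5":        (0.0005,  0.0015),
    "claude-3-opus":  (0.015,   0.075),
    "claude-3-haiku": (0.00025, 0.00125),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request, given exact token counts."""
    input_rate, output_rate = PRICING[model]
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

print(f"{request_cost('gpt-4', 1000, 500):.3f}")  # 0.060
```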

RAG System Implementation


Build retrieval-augmented generation pipeline:
  1. Choose vector database (Pinecone, Qdrant, Weaviate)
  2. Select embedding model based on quality/cost tradeoff
  3. Implement document chunking strategy
  4. Create ingestion pipeline with metadata extraction
  5. Build retrieval with query embedding
  6. Add reranking for relevance improvement
  7. Format context and send to LLM
  8. Validation: Response references retrieved context, no hallucinations
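Step 5 is, at its core, nearest-neighbour search over embeddings; the vector databases below do this at scale with indexes, but a toy version is just cosine similarity (the corpus and vectors here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_embedding, corpus, k=3):
    """corpus: list of (chunk_text, embedding); returns top-k chunks by similarity."""
    scored = sorted(corpus, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Toy 3-dimensional "embeddings"; real ones come from an embedding model
corpus = [
    ("chunk about deployment", [1.0, 0.0, 0.1]),
    ("chunk about pricing",    [0.0, 1.0, 0.0]),
    ("chunk about monitoring", [0.9, 0.1, 0.2]),
]
print(retrieve([1.0, 0.0, 0.0], corpus, k=2))
# → ['chunk about deployment', 'chunk about monitoring']
```

The reranking in step 6 would reorder these top-k candidates with a cross-encoder before the context is formatted for the LLM.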

Vector Database Selection


| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | Medium | Existing Postgres |

Chunking Strategies


| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 tokens | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |

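The fixed strategy from the table can be sketched over a pre-tokenized input; consecutive chunks share `overlap` tokens so sentences straddling a boundary appear whole in at least one chunk:

```python
def fixed_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks with a shared overlap region."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the input
    return chunks

tokens = list(range(1200))  # stand-in for real token ids
chunks = fixed_chunks(tokens, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```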

Model Monitoring


Monitor production models for drift and degradation:
  1. Set up latency tracking (p50, p95, p99)
  2. Configure error rate alerting
  3. Implement input data drift detection
  4. Track prediction distribution shifts
  5. Log ground truth when available
  6. Compare model versions with A/B metrics
  7. Set up automated retraining triggers
  8. Validation: Alerts fire before user-visible degradation

Drift Detection


```python
from scipy.stats import ks_2samp

def detect_drift(reference, current, threshold=0.05):
    statistic, p_value = ks_2samp(reference, current)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value
    }
```

Alert Thresholds


| Metric | Warning | Critical |
|---|---|---|
| p95 latency | > 100ms | > 200ms |
| Error rate | > 0.1% | > 1% |
| PSI (drift) | > 0.1 | > 0.2 |
| Accuracy drop | > 2% | > 5% |

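The thresholds above can be encoded as a small lookup for the alerting layer; the values mirror the table and the metric keys are illustrative:

```python
# (warning, critical) thresholds from the table above; higher is worse
THRESHOLDS = {
    "p95_latency_ms": (100,   200),
    "error_rate":     (0.001, 0.01),
    "psi":            (0.1,   0.2),
    "accuracy_drop":  (0.02,  0.05),
}

def alert_level(metric, value):
    """Classify a metric reading as ok, warning, or critical."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

print(alert_level("p95_latency_ms", 150))  # warning
print(alert_level("psi", 0.25))            # critical
```

In a real deployment these rules would live in the alerting system (Prometheus alert rules, Grafana alerts) rather than application code, so on-call thresholds stay auditable in one place.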

Reference Documentation


MLOps Production Patterns


references/mlops_production_patterns.md
contains:
  • Model deployment pipeline with Kubernetes manifests
  • Feature store architecture with Feast examples
  • Model monitoring with drift detection code
  • A/B testing infrastructure with traffic splitting
  • Automated retraining pipeline with MLflow

LLM Integration Guide


references/llm_integration_guide.md
contains:
  • Provider abstraction layer pattern
  • Retry and fallback strategies with tenacity
  • Prompt engineering templates (few-shot, CoT)
  • Token optimization with tiktoken
  • Cost calculation and tracking

RAG System Architecture


references/rag_system_architecture.md
contains:
  • RAG pipeline implementation with code
  • Vector database comparison and integration
  • Chunking strategies (fixed, semantic, recursive)
  • Embedding model selection guide
  • Hybrid search and reranking patterns


Tools


Model Deployment Pipeline


```bash
python scripts/model_deployment_pipeline.py --model model.pkl --target staging
```

Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.

RAG System Builder


```bash
python scripts/rag_system_builder.py --config rag_config.yaml --analyze
```

Scaffolds RAG pipeline with vector store integration and retrieval logic.

ML Monitoring Suite


```bash
python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy
```

Sets up drift detection, alerting, and performance dashboards.


Tech Stack


| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| MLOps | MLflow, Weights & Biases, Kubeflow |
| Data | Spark, Airflow, dbt, Kafka |
| Deployment | Docker, Kubernetes, Triton |
| Databases | PostgreSQL, BigQuery, Pinecone, Redis |