ml-api-endpoint
ML API Endpoint Expert
Expert in designing and deploying machine learning API endpoints.
Core Principles
API Design
- Stateless Design: Each request carries all the information needed to serve it, so any replica can handle any request
- Consistent Response Format: Standardize success and error structures across all endpoints
- Versioning Strategy: Plan for model updates so old clients keep working when a new model ships
- Input Validation: Validate rigorously before inference; never pass unchecked data to the model
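The consistent-response principle can be sketched as a pair of helpers that build a shared envelope. This is a minimal illustration: the field names (`status`, `data`, `error`) are an assumed convention, not part of FastAPI or any standard.

```python
# Illustrative response envelope; field names are an assumed convention.
def success_response(data: dict, model_version: str) -> dict:
    return {"status": "ok", "data": data, "model_version": model_version, "error": None}

def error_response(message: str, code: str) -> dict:
    return {"status": "error", "data": None, "error": {"code": code, "message": message}}

resp = success_response({"prediction": 0.87}, model_version="v1")
err = error_response("Expected 10 features", code="INVALID_INPUT")
```

Because every endpoint returns the same shape, clients can branch on `status` alone instead of parsing each endpoint differently.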
FastAPI Implementation
Basic ML Endpoint
```python
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator
import joblib
import numpy as np

app = FastAPI(title="ML Model API", version="1.0.0")
model = None


@app.on_event("startup")
async def load_model():
    global model
    model = joblib.load("model.pkl")


def generate_request_id() -> str:
    """Generate a unique ID for tracing each prediction request."""
    return uuid4().hex


class PredictionInput(BaseModel):
    features: list[float]

    @validator('features')
    def validate_features(cls, v):
        if len(v) != 10:
            raise ValueError('Expected 10 features')
        return v


class PredictionResponse(BaseModel):
    prediction: float
    confidence: float | None = None
    model_version: str
    request_id: str


@app.post("/predict", response_model=PredictionResponse)
async def predict(input_data: PredictionInput):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    features = np.array([input_data.features])
    prediction = model.predict(features)[0]
    return PredictionResponse(
        prediction=float(prediction),
        model_version="v1",
        request_id=generate_request_id(),
    )
```
Batch Prediction
```python
class BatchInput(BaseModel):
    instances: list[list[float]]

    @validator('instances')
    def validate_batch_size(cls, v):
        if len(v) > 100:
            raise ValueError('Batch size cannot exceed 100')
        return v


@app.post("/predict/batch")
async def batch_predict(input_data: BatchInput):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    features = np.array(input_data.instances)
    predictions = model.predict(features)
    return {
        "predictions": predictions.tolist(),
        "count": len(predictions),
    }
```
Performance Optimization
Model Caching
```python
import hashlib
import time


class ModelCache:
    """In-memory TTL cache keyed by a hash of the input features."""

    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, features):
        key = hashlib.md5(str(features).encode()).hexdigest()
        if key in self.cache:
            result, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return result
        return None

    def set(self, features, prediction):
        key = hashlib.md5(str(features).encode()).hexdigest()
        self.cache[key] = (prediction, time.time())
```
Health Checks
```python
# Module-level counters; in production these would be updated by request
# middleware or replaced by a metrics library such as prometheus-client.
request_counter = 0
avg_latency = 0.0
error_rate = 0.0


@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "model_loaded": model is not None,
    }


@app.get("/metrics")
async def get_metrics():
    return {
        "requests_total": request_counter,
        "prediction_latency_avg": avg_latency,
        "error_rate": error_rate,
    }
```
Docker Deployment
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
Best Practices
- Use async/await for I/O-bound operations such as loading artifacts or calling feature stores
- Validate data types, ranges, and business rules before inference
- Cache predictions for deterministic models
- Handle model failures with fallback responses instead of failing the request
- Log predictions, latencies, and errors for monitoring and debugging
- Support multiple model versions side by side
- Set memory and CPU limits on the serving containers
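The fallback principle above can be sketched as a wrapper that catches inference failures and degrades to a safe default. The `safe_predict` name and the default value are illustrative; a real service would also log the exception and pick a domain-appropriate fallback.

```python
def safe_predict(predict_fn, features, fallback=0.0):
    """Call the model, returning a default prediction if inference fails."""
    try:
        return {"prediction": predict_fn(features), "fallback_used": False}
    except Exception:
        # Degrade gracefully instead of failing the whole request.
        return {"prediction": fallback, "fallback_used": True}

def broken_model(features):
    raise RuntimeError("model unavailable")

result = safe_predict(broken_model, [1.0, 2.0])  # falls back instead of raising
```

Surfacing `fallback_used` in the response lets clients and monitoring distinguish real predictions from degraded ones.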