backend-principle-eng-python-ml-pro-max

Backend Principle Eng Python ML Pro Max

Principal-level guidance for Python AI/ML backends, training pipelines, and inference services. Emphasizes data integrity, reproducibility, and production reliability.

When to Apply

  • Designing or refactoring ML training or inference systems
  • Reviewing ML code for data leakage, evaluation quality, and reliability
  • Building feature pipelines, batch scoring, or real-time serving
  • Incident response for model regressions or data drift

Priority Model (highest to lowest)

| Priority | Category | Goal | Signals |
|---|---|---|---|
| 1 | Data Quality & Leakage | Trust the data | Clean splits, lineage, leakage checks |
| 2 | Correctness & Reproducibility | Same inputs, same outputs | Versioned data, pinned deps, deterministic runs |
| 3 | Reliability & Resilience | Stable training and serving | Timeouts, retries, graceful degradation |
| 4 | Model Evaluation & Safety | Real-world performance | Offline + online eval, bias checks |
| 5 | Performance & Cost | Efficient training/inference | GPU utilization, batching, cost budgets |
| 6 | Observability & Monitoring | Fast detection | Drift, latency, error budgets |
| 7 | Security & Privacy | Protect sensitive data | Access controls, data minimization |
| 8 | Operability & MLOps | Sustainable delivery | CI/CD, model registry, rollback |

Quick Reference (Rules)

1. Data Quality & Leakage (CRITICAL)

  • lineage
    - Track dataset provenance and transformations
  • leakage
    - Strict train/val/test separation with time-based splits when needed
  • features
    - Feature definitions are versioned and documented
  • validation
    - Schema and distribution checks on every data ingest
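
The leakage and validation rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the row layout and the names `time_based_split`, `assert_no_overlap`, and `validate_schema` are hypothetical, and only schema checks are shown (distribution checks are left out).

```python
from datetime import datetime

def time_based_split(rows, ts_key, train_end, val_end):
    """Split rows chronologically: train < train_end <= val < val_end <= test."""
    train, val, test = [], [], []
    for row in rows:
        ts = row[ts_key]
        if ts < train_end:
            train.append(row)
        elif ts < val_end:
            val.append(row)
        else:
            test.append(row)
    return train, val, test

def assert_no_overlap(id_key, *splits):
    """Raise if any entity id appears in more than one split."""
    seen = {}
    for index, split in enumerate(splits):
        for row in split:
            owner = seen.setdefault(row[id_key], index)
            if owner != index:
                raise ValueError(
                    f"id {row[id_key]!r} appears in splits {owner} and {index}"
                )

def validate_schema(rows, schema):
    """Return (row_index, field, problem) tuples for rows that violate the schema."""
    problems = []
    for i, row in enumerate(rows):
        for field, expected_type in schema.items():
            if field not in row:
                problems.append((i, field, "missing"))
            elif not isinstance(row[field], expected_type):
                problems.append((i, field, "wrong type"))
    return problems

# Hypothetical sample data, for illustration only.
rows = [
    {"user_id": 1, "ts": datetime(2024, 1, 5), "amount": 10.0},
    {"user_id": 2, "ts": datetime(2024, 2, 5), "amount": 12.5},
    {"user_id": 3, "ts": datetime(2024, 3, 5), "amount": 7.0},
]
SCHEMA = {"user_id": int, "ts": datetime, "amount": float}

schema_problems = validate_schema(rows, SCHEMA)
train, val, test = time_based_split(
    rows, "ts", datetime(2024, 2, 1), datetime(2024, 3, 1)
)
assert_no_overlap("user_id", train, val, test)
```

The overlap check only makes sense when an entity must not span splits; if the same user legitimately appears in several time periods, a different key (or no id check at all) would be the better fit.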

2. Correctness & Reproducibility (CRITICAL)

  • versioning
    - Data, code, and model versions are pinned
  • determinism
    - Fixed seeds and deterministic ops where possible
  • config
    - Single source of truth for hyperparameters
  • artifact
    - Immutable model artifacts and metadata
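
A rough sketch of the determinism and config rules, using only the standard library. The config keys (`seed`, `data_version`, and so on) and the helper names are assumptions made for illustration; frameworks such as NumPy or PyTorch keep their own random state, which would need to be seeded separately.

```python
import hashlib
import json
import random

def config_fingerprint(config):
    """Stable SHA-256 over a JSON-serializable config; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def seeded_rng(seed):
    """Return a private RNG so unrelated code cannot disturb the random stream."""
    return random.Random(seed)

# Hypothetical single source of truth for hyperparameters.
CONFIG = {"seed": 1234, "learning_rate": 0.01, "data_version": "v3", "epochs": 5}

def run_once(config):
    rng = seeded_rng(config["seed"])
    # Stand-in for training: draw a few "initial weights" from the seeded RNG.
    weights = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # Store the fingerprint next to the artifact so the run can be traced back.
    return {"config_sha256": config_fingerprint(config), "weights": weights}

first = run_once(CONFIG)
second = run_once(CONFIG)
```

Two runs with the same config produce identical results here, and the fingerprint changes whenever any hyperparameter changes, which makes it a convenient piece of artifact metadata.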

3. Reliability & Resilience (CRITICAL)

  • timeouts
    - Explicit timeouts for all external calls
  • retries
    - Bounded retries with jitter
  • fallbacks
    - Safe fallback models or rules when inference fails
  • idempotency
    - Safe retries for batch scoring
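
The retry and fallback rules might look like the following sketch. It assumes the wrapped call enforces its own timeout and raises `TimeoutError` or `ConnectionError` on transient failure; the function names and default delays are hypothetical.

```python
import random
import time

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)

def call_with_retries(fn, *, attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call fn(); retry transient errors with capped exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # retries are bounded: give up after the last attempt
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0.0, cap))  # jitter spreads out retry storms

def predict_with_fallback(primary, fallback, features, **retry_options):
    """Try the primary model; use a simpler fallback if it keeps failing."""
    try:
        result = call_with_retries(lambda: primary(features), **retry_options)
        return result, "primary"
    except TRANSIENT_ERRORS:
        return fallback(features), "fallback"
```

Returning the source label alongside the prediction lets callers and dashboards count how often the fallback path is taken. Passing `sleep` as a parameter keeps the backoff testable without real delays.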

4. Model Evaluation & Safety (HIGH)

  • offline-eval
    - Metrics aligned to product goals
  • online-eval
    - Shadow or canary before full rollout
  • bias
    - Bias and fairness checks for sensitive domains
  • calibration
    - Calibrate probabilities for decision thresholds
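
Before calibrating, it helps to measure how far off the probabilities are. The sketch below computes expected calibration error (ECE) for binary predictions with equal-width bins; it only measures miscalibration and does not perform the calibration itself (methods such as Platt scaling or isotonic regression would do that). The function name and bin count are illustrative choices.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted mean of |observed positive rate - mean predicted prob| per bin."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(observed - mean_prob)
    return ece
```

A model that predicts 0.9 for four cases of which three are positive has an observed rate of 0.75 in that bin, so its ECE is about 0.15. A decision threshold tuned on such scores would be applied to probabilities that overstate confidence.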

5. Performance & Cost (HIGH)

  • batching
    - Batch inference to improve throughput
  • caching
    - Cache features and embeddings when safe
  • profiling
    - Profile training and inference hot spots
  • cost-budgets
    - Define and enforce cost ceilings
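
Batching and caching can be illustrated with a small standard-library sketch. `model_fn` is assumed to accept a list of inputs and return one score per input; `embed` is a hypothetical stand-in for an expensive, pure embedding lookup.

```python
from functools import lru_cache

def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def score_in_batches(model_fn, inputs, batch_size=32):
    """Call model_fn once per batch instead of once per input."""
    scores = []
    for batch in batched(inputs, batch_size):
        scores.extend(model_fn(batch))
    return scores

@lru_cache(maxsize=10_000)
def embed(text):
    """Hypothetical embedding; safe to cache because it depends only on its input."""
    return tuple(float(ord(ch)) for ch in text[:4])
```

Caching is only "safe" in the sense of the rule above when the cached value cannot go stale: a feature that depends on time, user state, or model version needs that dependency in the cache key or should not be cached at all.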

6. Observability & Monitoring (HIGH)

  • drift
    - Monitor data and concept drift
  • latency
    - Track P95/P99 for inference
  • quality
    - Monitor model quality against ground truth
  • alerts
    - SLO-based alerts with runbooks
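
One possible way to put numbers behind the drift and latency rules is the Population Stability Index (PSI) for a single numeric feature and a nearest-rank percentile for latency. PSI is one drift measure among several and is swapped in here as an example; the bin count and the `eps` floor are assumptions, and alert thresholds are left to the team.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            index = int((v - lo) / width)
            counts[max(0, min(index, n_bins - 1))] += 1  # clamp out-of-range values
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Identical samples give a PSI of zero, and the value grows as the live distribution moves away from the reference. For latency, `percentile(latencies_ms, 95)` and `percentile(latencies_ms, 99)` give the P95 and P99 figures the rule refers to.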

7. Security & Privacy (HIGH)

  • access
    - Least privilege for data and model artifacts
  • pii
    - Redact or tokenize sensitive fields
  • secrets
    - Use vault/KMS; never in code or logs
  • compliance
    - Retention and deletion policies
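
The `pii` rule can be sketched with keyed hashing for tokenization and a regular expression for redaction. The helper names are hypothetical, the email pattern is deliberately simple and will miss some valid addresses, and the key is assumed to be fetched from a vault or KMS at runtime rather than written in code.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def tokenize(value, key):
    """Keyed token: the same value and key always map to the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation keeps tokens short at the cost of a higher collision chance.
    return f"tok_{digest[:16]}"

def redact_emails(text):
    """Replace anything that looks like an email address, e.g. before logging."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def scrub_record(record, sensitive_fields, key):
    """Return a copy of record with the sensitive fields tokenized."""
    return {
        name: tokenize(str(value), key) if name in sensitive_fields else value
        for name, value in record.items()
    }
```

Because the tokens are deterministic, joins and group-bys on a tokenized column still work, while the raw value is not recoverable without the key. Redaction is the better choice when the field is never needed downstream.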

8. Operability & MLOps (MEDIUM)

  • registry
    - Model registry with lineage and approvals
  • rollout
    - Canary, blue/green, or shadow deployments
  • rollback
    - Fast revert on regression
  • ci-cd
    - Automated tests for data, training, and serving
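
To make the registry and rollback rules concrete, here is a toy in-memory registry. It is a sketch of the idea only: the class and method names are hypothetical, and a real registry would persist state and record who approved what.

```python
class ModelRegistry:
    """In-memory stand-in for a model registry with approvals and rollback."""

    def __init__(self):
        self._versions = {}  # version -> metadata (including approval flag)
        self._history = []   # versions promoted to production, oldest first

    def register(self, version, metadata):
        if version in self._versions:
            raise ValueError(f"version {version!r} already registered")
        self._versions[version] = dict(metadata, approved=False)

    def approve(self, version):
        self._versions[version]["approved"] = True

    def promote(self, version):
        if not self._versions[version]["approved"]:
            raise PermissionError(f"version {version!r} has not been approved")
        self._history.append(version)

    def rollback(self):
        """Drop the current production version and return the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def production(self):
        return self._history[-1] if self._history else None
```

Refusing to re-register an existing version keeps artifacts immutable, and keeping the promotion history makes a fast revert a single call rather than a redeploy from scratch.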

Execution Workflow

  1. Define product goals, metrics, and safety constraints
  2. Validate data sources and prevent leakage
  3. Define features and versioned pipelines
  4. Train with reproducible configs and tracked artifacts
  5. Evaluate offline, then validate online via shadow or canary
  6. Deploy with monitoring for drift, latency, and quality
  7. Establish rollback and retraining triggers
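
Steps 5 to 7 imply an explicit promote-or-rollback decision after a canary run. The sketch below shows one way such a gate could be written; the metric names (`quality`, `p95_ms`) and the default tolerances are assumptions for illustration, not recommended values.

```python
def rollout_decision(baseline, canary, *, max_quality_drop=0.01,
                     max_latency_ratio=1.2):
    """Compare canary metrics to the baseline and return "promote" or "rollback"."""
    # Quality may dip slightly, but not by more than the allowed drop.
    quality_ok = canary["quality"] >= baseline["quality"] - max_quality_drop
    # P95 latency may grow, but only up to the allowed ratio of the baseline.
    latency_ok = canary["p95_ms"] <= baseline["p95_ms"] * max_latency_ratio
    return "promote" if quality_ok and latency_ok else "rollback"
```

Writing the gate as a pure function makes the trigger itself testable in CI, separately from the deployment tooling that acts on its result.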

Language-Specific Guidance

See references/python-ml-core.md for stack defaults, MLOps patterns, and tooling.