backend-principle-eng-python-ml-pro-max
Backend Principle Eng Python ML Pro Max
Principal-level guidance for Python AI/ML backends, training pipelines, and inference services. Emphasizes data integrity, reproducibility, and production reliability.
When to Apply
- Designing or refactoring ML training or inference systems
- Reviewing ML code for data leakage, evaluation quality, and reliability
- Building feature pipelines, batch scoring, or real-time serving
- Incident response for model regressions or data drift
Priority Model (highest to lowest)
| Priority | Category | Goal | Signals |
|---|---|---|---|
| 1 | Data Quality & Leakage | Trust the data | Clean splits, lineage, leakage checks |
| 2 | Correctness & Reproducibility | Same inputs, same outputs | Versioned data, pinned deps, deterministic runs |
| 3 | Reliability & Resilience | Stable training and serving | Timeouts, retries, graceful degradation |
| 4 | Model Evaluation & Safety | Real-world performance | Offline + online eval, bias checks |
| 5 | Performance & Cost | Efficient training/inference | GPU utilization, batching, cost budgets |
| 6 | Observability & Monitoring | Fast detection | Drift, latency, error budgets |
| 7 | Security & Privacy | Protect sensitive data | Access controls, data minimization |
| 8 | Operability & MLOps | Sustainable delivery | CI/CD, model registry, rollback |
Quick Reference (Rules)
1. Data Quality & Leakage (CRITICAL)
- lineage: Track dataset provenance and transformations
- leakage: Strict train/val/test separation, with time-based splits when needed
- features: Feature definitions are versioned and documented
- validation: Schema and distribution checks on every data ingest
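The split discipline above can be sketched as a small helper. This is a minimal illustration, not a prescribed API: the `time_based_split` name and `event_date` key are assumptions, and a real pipeline would typically operate on DataFrames rather than dicts.

```python
from datetime import date

def time_based_split(rows, train_end, val_end, key="event_date"):
    """Split chronologically so no future data leaks into training.

    Rows dated on or before train_end go to train, rows up to val_end
    go to validation, and everything later goes to test.
    """
    train = [r for r in rows if r[key] <= train_end]
    val = [r for r in rows if train_end < r[key] <= val_end]
    test = [r for r in rows if r[key] > val_end]
    # Leakage check: splits must be disjoint and cover every row.
    assert len(train) + len(val) + len(test) == len(rows)
    return train, val, test
```

Splitting on a timestamp rather than a random shuffle is what prevents the most common leakage failure: evaluating a model on events that happened before some of its training data.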
2. Correctness & Reproducibility (CRITICAL)
- versioning: Data, code, and model versions are pinned
- determinism: Fixed seeds and deterministic ops where possible
- config: Single source of truth for hyperparameters
- artifact: Immutable model artifacts and metadata
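A stdlib-only sketch of the determinism and config rules, under the assumption of a minimal stack: a real project would also seed numpy/torch, enable deterministic kernels where the framework supports them, and pin dependencies in a lockfile. The helper names here are illustrative.

```python
import hashlib
import json
import os
import random

def seed_everything(seed: int) -> None:
    """Fix the seeds we control so reruns are repeatable."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

def config_fingerprint(config: dict) -> str:
    """Hash the full hyperparameter config so a run ties to an
    immutable artifact; identical configs yield identical digests."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Hashing a canonical JSON form (sorted keys) makes the fingerprint independent of dict insertion order, which is what lets it serve as a single source of truth for "which config produced this model".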
3. Reliability & Resilience (CRITICAL)
- timeouts: Explicit timeouts for all external calls
- retries: Bounded retries with jitter
- fallbacks: Safe fallback models or rules when inference fails
- idempotency: Safe retries for batch scoring
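The retry rule might look like this in plain Python. `call_with_retries` is an illustrative helper; the timeouts themselves belong on the client call (for example, an HTTP client's timeout parameter), and the jitter strategy shown here is "full jitter".

```python
import random
import time

def call_with_retries(fn, *, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Bounded retries with exponential backoff and full jitter.

    Re-raises the last error once the attempt budget is exhausted,
    so callers can fall back to a simpler model or a rule.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the backoff ceiling.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

Jitter matters because synchronized retries from many replicas can turn a brief dependency blip into a thundering herd; randomizing the delay spreads the retry load out.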
4. Model Evaluation & Safety (HIGH)
- offline-eval: Metrics aligned to product goals
- online-eval: Shadow or canary before full rollout
- bias: Bias and fairness checks for sensitive domains
- calibration: Calibrate probabilities for decision thresholds
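Calibration can be checked with an expected-calibration-error sketch like the one below; the equal-width binning scheme and the `n_bins` default are assumptions, and libraries such as scikit-learn offer `calibration_curve` for the same diagnostic.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence and compare each bin's mean
    predicted probability with its observed positive rate.

    A large gap means the raw scores are not safe to use directly
    as decision thresholds.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(avg_p - frac_pos)
    return ece
```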
5. Performance & Cost (HIGH)
- batching: Batch inference to improve throughput
- caching: Cache features and embeddings when safe
- profiling: Profile training and inference hot spots
- cost-budgets: Define and enforce cost ceilings
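Micro-batching can be as small as a generator that groups requests so the model sees one array call instead of many single-row calls. This `batched` helper is illustrative (Python 3.12's `itertools.batched` provides similar semantics); a serving-side batcher would also flush on a latency deadline, not only on batch size.

```python
def batched(items, batch_size):
    """Group an iterable of requests into fixed-size micro-batches."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```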
6. Observability & Monitoring (HIGH)
- drift: Monitor data and concept drift
- latency: Track P95/P99 for inference
- quality: Monitor model quality against ground truth
- alerts: SLO-based alerts with runbooks
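One common drift signal is the population stability index (PSI), sketched here with stdlib only. The bin edges and the customary thresholds (roughly: below 0.1 stable, 0.1 to 0.25 worth reviewing, above 0.25 likely real drift) are conventions to tune per feature, not hard rules.

```python
import math

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """Compare a serving-time feature distribution against the
    training baseline using equal-width bins over the joint range."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / n_bins or 1.0

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the log term stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```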
7. Security & Privacy (HIGH)
- access: Least privilege for data and model artifacts
- pii: Redact or tokenize sensitive fields
- secrets: Use vault/KMS; never in code or logs
- compliance: Retention and deletion policies
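A toy redaction pass over free text before it reaches logs or training data. The regexes below are deliberately simplistic placeholders for illustration; a production system would use a vetted PII detection library and cover many more formats and locales.

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Replace sensitive substrings with typed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)
```

Redacting at the logging boundary (rather than trusting every caller) is the data-minimization move: sensitive fields never persist, so retention and deletion policies have less to clean up.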
8. Operability & MLOps (MEDIUM)
- registry: Model registry with lineage and approvals
- rollout: Canary, blue/green, or shadow deployments
- rollback: Fast revert on regression
- ci-cd: Automated tests for data, training, and serving
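The registry, approval, and rollback rules, sketched as an in-memory class. This is a teaching model only: a real registry (MLflow's, for example) persists state, tracks much richer lineage, and gates promotion through an actual review workflow.

```python
class ModelRegistry:
    """Track versions with lineage metadata, gate promotion on
    approval, and keep history so rollback is a single step."""

    def __init__(self):
        self._versions = {}   # version -> lineage + approval metadata
        self._history = []    # promotion order, newest last

    def register(self, version, *, data_hash, config_hash, approved=False):
        self._versions[version] = {
            "data_hash": data_hash,
            "config_hash": config_hash,
            "approved": approved,
        }

    def promote(self, version):
        if not self._versions[version]["approved"]:
            raise PermissionError(f"{version} is not approved for serving")
        self._history.append(version)

    def rollback(self):
        """Revert to the previously promoted version."""
        self._history.pop()
        return self.current()

    def current(self):
        return self._history[-1] if self._history else None
```

Storing the data and config hashes alongside each version is what makes "fast revert on regression" safe: the previous version is not just a weights file but a fully reproducible lineage record.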
Execution Workflow
- Define product goals, metrics, and safety constraints
- Validate data sources and prevent leakage
- Define features and versioned pipelines
- Train with reproducible configs and tracked artifacts
- Evaluate offline, then validate online via shadow or canary
- Deploy with monitoring for drift, latency, and quality
- Establish rollback and retraining triggers
Language-Specific Guidance
See references/python-ml-core.md for stack defaults, MLOps patterns, and tooling.