ML System Design Interview
End-to-end ML pipeline design coaching for staff+ engineers. Covers the full arc from problem definition through production monitoring -- the scope expected at L6+ interviews at top-tier ML organizations.
This skill assumes 15+ years of ML/CV/AI/NLP experience. It does not teach fundamentals. It structures the knowledge you already have into the format interviewers reward.
When to Use
Use for:
- Practicing 45-minute ML system design rounds
- Structuring whiteboard presentations for recommendation, ranking, RAG, fraud, perception systems
- Analyzing serving architecture tradeoffs (batch vs online vs streaming)
- Identifying L6+ differentiation signals (problem ownership, org constraints, data flywheels)
- Reviewing and critiquing ML system design answers
NOT for:
- Coding interviews (use senior-coding-interview)
- Behavioral / leadership questions (use interview-loop-strategist)
- ML theory or math derivations
- Implementing models or writing training code
- Paper reading or research review
The 7-Stage Design Framework
Every ML system design answer follows this arc. The stages are sequential but you will loop back as constraints emerge. The Mermaid diagram below is your whiteboard skeleton.
```mermaid
flowchart TD
R[1. Requirements\n- Business goal\n- Users and scale\n- Latency/throughput SLA\n- Constraints] --> M[2. Metrics\n- Offline: precision, recall, NDCG\n- Online: CTR, conversion, revenue\n- Guardrails: latency p99, fairness]
M --> D[3. Data\n- Sources and collection\n- Labeling strategy\n- Pipeline: ETL, validation\n- Freshness and staleness]
D --> F[4. Features\n- Engineering and transforms\n- Feature store architecture\n- Online vs offline features\n- Freshness requirements]
F --> Mo[5. Model\n- Architecture selection\n- Training pipeline\n- Iteration strategy\n- Baseline and ablation]
Mo --> S[6. Serving\n- Batch vs online vs streaming\n- Caching and precomputation\n- Scaling and cost\n- Canary and shadow mode]
S --> Mon[7. Monitoring\n- Data drift detection\n- Model degradation alerts\n- A/B testing framework\n- Rollback strategy\n- Feedback loops]
Mon -.->|Feedback loop| D
Mon -.->|Retrain trigger| Mo
```
Stage Details
Stage 1 -- Requirements (5 minutes)
Ask clarifying questions before designing anything. Establish: Who is the user? What is the business metric? What is the latency SLA? What scale (QPS, data volume)? What are hard constraints (cost, privacy, regulation)? An L6+ candidate owns the problem definition -- do not wait for the interviewer to hand you requirements.
Stage 2 -- Metrics (3 minutes)
Define offline metrics that you can measure before deployment AND online metrics that matter to the business. Explain the gap: "NDCG improvement offline does not always translate to CTR lift online because of position bias and novelty effects." Define guardrail metrics: latency p99, fairness across user segments, cost per prediction.
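Stage 2 often draws a follow-up on how the offline metric is actually computed. Here is a minimal NDCG sketch you could reproduce on a whiteboard -- it uses the common exponential-gain formulation; some teams use linear gain instead:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance, log-discounted by rank."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(ranked, k=None):
    """NDCG@k: DCG of the served ranking, normalized by the ideal ranking's DCG."""
    k = k or len(ranked)
    ideal_dcg = dcg(sorted(ranked, reverse=True)[:k])
    return dcg(ranked[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered list scores 1.0; the normalization is what makes NDCG comparable across queries with different relevance distributions -- and it is exactly this query-averaged offline number that can diverge from online CTR.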
Stage 3 -- Data (7 minutes)
Where does training data come from? How is it labeled (human, weak supervision, implicit signals)? What is the class balance? How fresh does data need to be? What is the data pipeline (batch ETL vs streaming)? What data quality checks exist? This stage separates L6+ candidates from L5 -- junior candidates assume clean labeled data.
Stage 4 -- Features (5 minutes)
What features does the model need? Which are precomputed (offline) vs computed at request time (online)? Feature store architecture: online store (low-latency lookups) vs offline store (batch training). Feature freshness: user features update daily, item features update hourly, contextual features are real-time.
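To make the online/offline split concrete, here is a hedged sketch of request-time feature assembly. The store contents, feature names, and staleness threshold are illustrative stand-ins (plain dicts in place of a Redis-backed online store), not any specific feature-store API:

```python
import time

# Hypothetical online-store contents, materialized by batch jobs.
# In production these lookups would hit a low-latency key-value store.
ONLINE_USER_FEATURES = {"u42": {"ctr_7d": 0.031, "updated_at": time.time() - 3600}}
ONLINE_ITEM_FEATURES = {"i7": {"popularity": 0.87, "updated_at": time.time() - 600}}

def assemble_features(user_id, item_id, context, max_staleness_s=86_400):
    """Join precomputed (offline-materialized) features with request-time context."""
    user = ONLINE_USER_FEATURES.get(user_id, {})
    item = ONLINE_ITEM_FEATURES.get(item_id, {})
    # Guard against stale user entries: fall back to defaults
    # rather than silently serving day-old values past their SLA.
    if time.time() - user.get("updated_at", 0) > max_staleness_s:
        user = {}
    return {
        "user_ctr_7d": user.get("ctr_7d", 0.0),         # daily batch job
        "item_popularity": item.get("popularity", 0.0),  # hourly batch job
        "hour_of_day": context["hour_of_day"],           # real-time context
    }
```

The point to make out loud: the same feature definitions must produce identical values in the offline training path, or you get train/serve skew.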
Stage 5 -- Model (8 minutes)
Start with a simple baseline (logistic regression, XGBoost) and explain why. Then propose the production architecture (two-tower, transformer, etc.) and justify the upgrade. Discuss training pipeline: how often, how much data, how to handle distribution shift. Iteration strategy: what experiments to run first.
Stage 6 -- Serving (8 minutes)
This is where system design and ML intersect. Discuss: inference latency requirements, batch precomputation vs online inference, GPU/CPU tradeoffs, model serving framework, caching strategy, cost optimization (quantization, distillation, spot instances). Draw the serving architecture.
Stage 7 -- Monitoring (5 minutes)
What happens after deployment? Data drift detection (PSI, KL divergence). Model degradation alerts (metric decay over time). A/B testing framework (sample size, duration, novelty effects). Rollback strategy (shadow mode, canary percentage). Feedback loops that improve the model over time.
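The PSI mentioned above is simple enough to sketch from memory. This assumes both distributions are already binned into matching proportions; the alert thresholds in the docstring are a common rule of thumb, not a standard:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.
    `expected` and `actual` are same-length lists of bin proportions (~sum to 1).
    Common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 alert."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # clamp empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total
```

Run per feature against the training-time distribution on a schedule; a spike on one feature usually localizes the upstream pipeline change faster than a drop in the end metric does.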
45-Minute Time Budget
45分钟时间预算
| Phase | Minutes | What to Cover |
|---|---|---|
| Requirements + Clarification | 5 | Business goal, users, scale, SLA, constraints |
| Metrics | 3 | Offline, online, guardrails, metric alignment |
| Data | 7 | Sources, labeling, pipeline, quality, freshness |
| Features | 5 | Engineering, store architecture, online/offline split |
| Model | 8 | Baseline, production arch, training, iteration |
| Serving | 8 | Latency, architecture, cost, deployment strategy |
| Monitoring | 5 | Drift, alerts, A/B testing, rollback, feedback |
| Q&A Buffer | 4 | Interviewer deep-dives, defend tradeoffs |
If the interviewer cuts in with questions, adapt -- but cover all 7 stages even briefly. Skipping monitoring is the most common L5 mistake.
Canonical Problem Set
| Problem | Key Challenges | Must-Discuss |
|---|---|---|
| Recommendation System | Cold start, position bias, multi-objective optimization | Two-tower retrieval + reranking, exploration-exploitation |
| Search Ranking | Query intent classification, relevance vs engagement, latency at scale | Inverted index + embedding retrieval, L1/L2 ranking cascade |
| Content Moderation | Multi-modal (text+image+video), adversarial evasion, precision-recall tradeoff | Human-in-the-loop, escalation tiers, appeal workflow |
| RAG Pipeline | Retrieval quality, chunk strategy, hallucination detection, evaluation | Embedding model selection, hybrid search, reranking, citation |
| Fraud Detection | Extreme class imbalance, adversarial adaptation, real-time requirement | Feature velocity, graph features, ensemble + rules, feedback delay |
| Autonomous Driving Perception | Sensor fusion, safety-critical latency, long-tail distribution | Multi-task architecture, simulation, OTA updates, regulatory |
Serving Architecture Comparison
| Pattern | Latency | Freshness | Cost | Best For |
|---|---|---|---|---|
| Batch prediction | N/A (precomputed) | Hours-stale | Low compute, high storage | Email recommendations, daily reports |
| Online inference | 10-500ms | Real-time | High compute (GPU) | Search ranking, fraud detection |
| Near-real-time | 1-60s | Minutes-fresh | Medium | Feed ranking, content moderation |
| Streaming | Sub-second | Continuous | High (always-on) | Fraud, anomaly detection, bidding |
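The caching lever that moves a system along this spectrum can be illustrated with a minimal TTL cache around inference -- a toy stand-in for what would be Redis or a serving-layer cache in production:

```python
import time

class TTLCache:
    """Minimal TTL cache around model inference: serve a cached score while
    it is fresh, recompute on expiry. A toy stand-in for a production cache;
    no eviction or size bound is implemented here."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._store = {}  # key -> (value, inserted_at)

    def get_or_compute(self, key, compute_fn):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]            # fresh hit: skip model inference
        value = compute_fn()           # miss or stale: run the model
        self._store[key] = (value, now)
        return value
```

The TTL directly trades freshness for cost: a long TTL pushes hot entities toward batch-precomputed behavior, a short TTL toward pure online inference.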
Detailed serving tradeoffs, framework comparisons, and cost optimization strategies are in references/serving-tradeoffs.md.
L6+ Differentiation Signals
What separates a staff+ answer from a senior answer:
1. Own the Problem Definition
Do not accept the problem as stated. Ask: "What business metric are we optimizing? Is this a revenue problem or an engagement problem? What is the current solution and why is it insufficient?" L5 candidates accept "build a recommendation system." L6+ candidates ask "what are we recommending, to whom, and what does success look like?"
2. Discuss Organizational Constraints
Real systems live inside organizations. Address: team size (can we maintain a custom model or should we use a managed service?), on-call burden, cross-team data dependencies, compliance requirements, migration path from legacy system.
3. Data Flywheel Strategy
Show that you think about the virtuous cycle: better model -> more engagement -> more data -> better model. Discuss how to accelerate it: active learning, implicit feedback loops, exploration strategies, cold-start bootstrapping.
4. Build vs Buy Decisions
Not everything should be custom. Argue for managed services where appropriate (embedding APIs, feature stores, serving platforms) and custom solutions where competitive advantage demands it. Show you understand the total cost of ownership.
5. Multi-Objective Thinking
Real systems optimize multiple objectives simultaneously: relevance AND diversity, accuracy AND fairness, quality AND latency. Discuss how to handle conflicts: Pareto optimization, constrained optimization, multi-task learning, business-rule post-processing.
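One simple way to show multi-objective handling on the whiteboard is a weighted-sum blend plus hard business rules. This is a sketch under assumed score names and weights -- not a full Pareto or constrained formulation, which is the upgrade path to name if pressed:

```python
def blend_score(candidates, weights, hard_rules=()):
    """Weighted-sum blend of per-objective scores, with business rules applied
    as hard filters before scoring. Objective names and weight values are
    illustrative; real systems tune weights via sweeps and online experiments."""
    survivors = [dict(c) for c in candidates
                 if all(rule(c) for rule in hard_rules)]
    for c in survivors:
        c["score"] = sum(w * c[obj] for obj, w in weights.items())
    return sorted(survivors, key=lambda c: c["score"], reverse=True)
```

The design point worth stating: hard rules (compliance, safety) must never be expressible as a weight, because any finite weight can be outvoted by the other objectives.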
Whiteboard Strategy
What to draw and when:
| Time | Draw This | Purpose |
|---|---|---|
| 0-5 min | Requirements box with bullet points | Anchor the discussion, show structured thinking |
| 5-8 min | Metric table (offline vs online) | Demonstrate you think beyond model accuracy |
| 8-15 min | Data pipeline diagram (sources -> ETL -> store) | Show you understand data engineering |
| 15-20 min | Feature architecture (offline store + online store) | Demonstrate feature store knowledge |
| 20-28 min | Model architecture + serving diagram | The core system design artifact |
| 28-36 min | Full system diagram with latency annotations | Connect everything, show you can ship |
| 36-41 min | Monitoring dashboard sketch + feedback arrows | Close the loop, show production thinking |
Use boxes for components, arrows for data flow, and annotate with latency/throughput numbers. The diagram should be readable by someone who walks in at minute 30.
Anti-Patterns
Model-First Thinking
Novice: Jumps to "I would use a transformer" or "Let me describe the attention mechanism" in the first 2 minutes, before understanding the problem, defining metrics, or discussing data. Spends 70% of time on model architecture and 0% on serving.
Expert: Spends the first 10 minutes on requirements, metrics, and data before mentioning any model. Names a simple baseline first (logistic regression on handcrafted features), then argues for complexity only when the baseline's limitations are clear. Allocates equal time to serving and monitoring.
Detection: Architecture diagram has a detailed model box but no data pipeline, no feature store, no serving layer, and no monitoring component. Mentions model architecture in the first sentence.
Ignoring the Data
Novice: Assumes clean, labeled data exists at scale. Says "we would train on millions of labeled examples" without discussing where labels come from, how much they cost, what the class distribution looks like, or how stale the data gets.
Expert: Asks about data sources, labeling strategy (human vs weak supervision vs implicit signals), class imbalance handling, data freshness SLA, and data quality monitoring. Discusses the cost of labeling and proposes strategies to reduce it (active learning, semi-supervised methods, synthetic data).
Detection: No discussion of data collection, labeling costs, class imbalance, data quality checks, or data freshness anywhere in the answer. The word "label" does not appear.
No Monitoring Story
Novice: Design ends at the serving layer. No mention of what happens after the model is deployed. Does not discuss how to detect degradation, how to roll back, or how to improve the model over time.
Expert: Discusses data drift detection (population stability index, feature distribution monitoring), model performance decay alerts, A/B testing framework with proper statistical rigor, canary deployment strategy, shadow mode for safe rollouts, and explicit feedback loops that flow data back into retraining.
Detection: Architecture diagram has no monitoring component. No feedback arrows from production back to training. No mention of A/B testing, canary deployment, or rollback.
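Interviewers probing "proper statistical rigor" sometimes ask for the back-of-envelope experiment sizing. A sketch using the standard normal approximation for a two-proportion test -- the defaults (alpha=0.05, 80% power) are conventions, not requirements:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, mde_abs, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided test on a proportion metric
    (e.g. CTR), via the textbook approximation
    n = 2 * (z_{a/2} + z_b)^2 * p(1-p) / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_mid = p_base + mde_abs / 2                   # variance at the midpoint rate
    return math.ceil(2 * (z_alpha + z_beta) ** 2
                     * p_mid * (1 - p_mid) / mde_abs ** 2)
```

Detecting a 0.5-point absolute lift on a 5% CTR needs roughly 31k users per arm; halving the detectable effect quadruples the sample, which is why small guardrail regressions take far longer to rule out than headline wins take to confirm.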
Reference Files
Consult these for deep dives -- they are NOT loaded by default:
| File | Consult When |
|---|---|
| | Working through a specific problem (recommendation, search, RAG, fraud, content mod, perception). Contains 6 fully worked designs with Mermaid diagrams. |
| references/serving-tradeoffs.md | Deep-diving on serving architecture, framework selection, caching, cost optimization, deployment strategies. Contains framework comparisons and latency targets by use case. |
| | Choosing metrics, understanding metric alignment, designing A/B tests, evaluating generative AI. Contains metric decision trees and formulas. |