rag-architecture
RAG Architecture
When to Use This Skill
Use this skill when:
- Designing RAG pipelines for LLM applications
- Choosing chunking and embedding strategies
- Optimizing retrieval quality and relevance
- Building knowledge-grounded AI systems
- Implementing hybrid search (dense + sparse)
- Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
RAG Architecture Overview
```text
┌─────────────────────────────────────────────────────────────────────┐
│                           RAG Pipeline                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │  Ingestion   │    │   Indexing   │    │     Vector Store     │   │
│  │   Pipeline   │───▶│   Pipeline   │───▶│     (Embeddings)     │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│        │                   │                       │                │
│    Documents           Chunks +                 Indexed             │
│                        Embeddings               Vectors             │
│                                                    │                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │    Query     │    │  Retrieval   │    │  Context Assembly    │   │
│  │  Processing  │───▶│    Engine    │───▶│    + Generation      │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│        │                   │                       │                │
│    User Query         Top-K Chunks            LLM Response          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
Document Ingestion Pipeline
Document Processing Steps
```text
Raw Documents
      │
      ▼
┌─────────────┐
│   Extract   │ ← PDF, HTML, DOCX, Markdown
│   Content   │
└─────────────┘
      │
      ▼
┌─────────────┐
│   Clean &   │ ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
      │
      ▼
┌─────────────┐
│    Chunk    │ ← Split into retrievable units
│  Documents  │
└─────────────┘
      │
      ▼
┌─────────────┐
│   Generate  │ ← Create vector representations
│  Embeddings │
└─────────────┘
      │
      ▼
┌─────────────┐
│    Store    │ ← Persist vectors + metadata
│   in Index  │
└─────────────┘
```
Chunking Strategies
Strategy Comparison
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
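The "Recursive" strategy in the table can be sketched in a few lines. This is a hypothetical helper, not taken from any specific library: it tries the coarsest separator first (paragraphs), falls back to finer ones (sentences, words), and hard-splits only as a last resort; `max_chars` and the separator list are illustrative defaults.

```python
# Minimal recursive-chunking sketch (illustrative, not a library API).
def recursive_chunk(text, max_chars=512, seps=("\n\n", ". ", " ")):
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                piece = part + sep
                # Flush the buffer before it exceeds the size limit.
                if current and len(current) + len(piece) > max_chars:
                    chunks.extend(recursive_chunk(current.strip(), max_chars, seps))
                    current = ""
                current += piece
            if current.strip():
                chunks.extend(recursive_chunk(current.strip(), max_chars, seps))
            return chunks
    # No separator present at all: hard split by character count.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A real implementation would also preserve metadata (source document, offset) with each chunk so retrieval results can be cited.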
Chunking Decision Tree
```text
What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers
```
Chunk Overlap
```text
Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                           ↑
              Information lost at boundary

With Overlap (20%):
[Chunk 1: "The quick brown fox"]
            [Chunk 2: "brown fox jumps over"]
                           ↑
           Context preserved across boundaries
```
Recommended overlap: 10-20% of chunk size
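The overlap idea above can be sketched with whitespace "tokens" standing in for real tokenizer output (a production pipeline would tokenize with the embedding model's own tokenizer; the function name is illustrative):

```python
# Fixed-size chunking with overlap: each chunk repeats the last `overlap`
# tokens of its predecessor, so boundary context appears in both chunks.
def chunk_with_overlap(tokens, chunk_size=256, overlap=32):
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
# With chunk_size=4 and overlap=1, each chunk starts with the last
# token of the previous one.
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
```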
Chunk Size Trade-offs
```text
Smaller Chunks (128-256 tokens)      Larger Chunks (512-1024 tokens)
├── More precise retrieval           ├── More context per chunk
├── Less context per chunk           ├── May include irrelevant content
├── More chunks to search            ├── Fewer chunks to search
├── Better for factoid Q&A           ├── Better for summarization
└── Higher retrieval recall          └── Higher retrieval precision
```
Embedding Models
Model Comparison
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
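All of the models above produce vectors that are compared with a similarity function, most commonly cosine. A dependency-free sketch (tiny 3-dimensional vectors stand in for real 768-3072-dimensional embeddings):

```python
import math

# Cosine similarity: dot product of the vectors divided by the product
# of their magnitudes. Ranges from -1 (opposite) to 1 (identical direction).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

cosine([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])  # same direction → 1.0
cosine([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])  # orthogonal → 0.0
```

When embeddings are pre-normalized to unit length, cosine similarity reduces to a plain dot product, which is why many vector stores offer both metrics.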
Embedding Selection
```text
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small
```
Embedding Optimization
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
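The Matryoshka row in the table amounts to keeping only the leading dimensions and re-normalizing. This sketch assumes the model was trained with a matryoshka objective (so leading dimensions carry most of the signal); the helper name is illustrative:

```python
import math

# Truncate an embedding to its first `dims` dimensions and re-normalize
# to unit length so cosine/dot-product comparisons remain meaningful.
def truncate_embedding(vec, dims):
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Halving dimensions halves index memory and roughly halves distance-computation cost, at a modest recall penalty when the model supports truncation.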
Retrieval Strategies
Dense Retrieval (Semantic Search)
```text
Query: "How to deploy containers"
          │
          ▼
     ┌─────────┐
     │  Embed  │
     │  Query  │
     └─────────┘
          │
          ▼
┌─────────────────────────────────┐
│    Vector Similarity Search     │
│   (Cosine, Dot Product, L2)     │
└─────────────────────────────────┘
          │
          ▼
Top-K semantically similar chunks
```
Sparse Retrieval (BM25/TF-IDF)
```text
Query: "Kubernetes pod deployment YAML"
          │
          ▼
     ┌─────────┐
     │Tokenize │
     │ + Score │
     └─────────┘
          │
          ▼
┌─────────────────────────────────┐
│          BM25 Ranking           │
│     (Term frequency × IDF)      │
└─────────────────────────────────┘
          │
          ▼
Top-K lexically matching chunks
```
Hybrid Search (Best of Both)
```text
Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │
        └──▶ Sparse Search ─┘

Fusion Methods:
• RRF (Reciprocal Rank Fusion)
• Linear combination
• Learned reranking
```
Reciprocal Rank Fusion (RRF)
```text
RRF Score = Σ 1 / (k + rank_i)

Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result

Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323

Result: Doc B ranks higher (better combined relevance)
```
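The formula above fits in a few lines. The rankings reproduce the worked example (Doc A: dense rank 1, sparse rank 5; Doc B: dense rank 3, sparse rank 1); docs C, D, and E are filler added here purely to make the ranks line up.

```python
# Reciprocal Rank Fusion: sum 1/(k + rank) across each ranked list,
# with k=60 as in the example above.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["A", "C", "B", "D"]        # A at dense rank 1, B at rank 3
sparse = ["B", "C", "D", "E", "A"]   # B at sparse rank 1, A at rank 5
fused = rrf_fuse([dense, sparse])    # "B" outranks "A", as computed above
```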
Multi-Stage Retrieval
Two-Stage Pipeline
```text
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall)                     │
│ • ANN search (HNSW, IVF)                                │
│ • Retrieve top-100 candidates                           │
│ • Latency: 10-50ms                                      │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision)                  │
│ • Cross-encoder or LLM reranking                        │
│ • Score top-100 → return top-10                         │
│ • Latency: 100-500ms                                    │
└─────────────────────────────────────────────────────────┘
```
Reranking Options
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
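A shape-only sketch of the two-stage pipeline above: `cheap_score` stands in for ANN similarity and `expensive_score` for a cross-encoder; both scorers are hypothetical placeholders you would replace with real models.

```python
# Two-stage retrieval: broad, cheap recall followed by narrow,
# expensive reranking.
def two_stage_retrieve(query, docs, cheap_score, expensive_score,
                       recall_k=100, final_k=10):
    # Stage 1: fast, high-recall candidate selection over the whole corpus.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    # Stage 2: slow, high-precision rerank over the candidates only.
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:final_k]
```

The key property: the expensive scorer runs on `recall_k` items, not the whole corpus, which is what keeps total latency within budget.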
Context Assembly
Context Window Management
```text
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)

Strategy: Maximize retrieved context quality within budget
```
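The budget above can be enforced with a greedy assembler. In this sketch `len(chunk.split())` is a crude stand-in for a real tokenizer count, and the function name is illustrative:

```python
# Greedy context assembly: take ranked chunks (most relevant first)
# until the token budget is exhausted, skipping any that would exceed it.
def assemble_context(ranked_chunks, budget_tokens=8000):
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())        # crude token estimate
        if used + cost > budget_tokens:
            continue                     # too big to fit; try smaller chunks
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)
```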
Context Assembly Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
Lost-in-the-Middle Problem
```text
LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│ Beginning           Middle              End             │
│ ████                ░░░░                ████            │
│ High attention      Low attention       High attention  │
└─────────────────────────────────────────────────────────┘

Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
```
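Mitigation 1 can be implemented by interleaving the relevance-ranked chunks so the best ones land at both ends and the weakest sink to the middle (the helper name is illustrative):

```python
# Reorder relevance-ranked chunks for the U-shaped attention pattern:
# odd-indexed chunks are reversed onto the tail, so rank 1 opens the
# context and rank 2 closes it.
def reorder_for_attention(ranked_chunks):
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

reorder_for_attention([1, 2, 3, 4, 5])  # → [1, 3, 5, 4, 2]
```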
Advanced RAG Patterns
Query Transformation
```text
Original Query: "Tell me about the project"
                     │
   ┌─────────────────┼─────────────────┐
   ▼                 ▼                 ▼
┌─────────┐     ┌──────────┐     ┌──────────┐
│  HyDE   │     │  Query   │     │ Sub-query│
│  (Hypo  │     │ Expansion│     │  Decomp. │
│   Doc)  │     │          │     │          │
└─────────┘     └──────────┘     └──────────┘
     │               │                │
     ▼               ▼                ▼
Hypothetical    "project,       "What is the
answer to       goals,           project scope?"
embed           timeline,       "What are the
                deliverables"    deliverables?"
```
HyDE (Hypothetical Document Embeddings)
```text
Query: "How does photosynthesis work?"
         │
         ▼
┌───────────────┐
│ LLM generates │
│ hypothetical  │
│ answer        │
└───────────────┘
         │
         ▼
"Photosynthesis is the process by which
 plants convert sunlight into energy..."
         │
         ▼
┌───────────────┐
│ Embed hypo    │
│ document      │
└───────────────┘
         │
         ▼
Search with hypothetical embedding
(Better matches actual documents)
```
Self-RAG (Retrieval-Augmented LM with Self-Reflection)
```text
┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response                            │
│ 2. Decide: Need more retrieval? (critique token)        │
│    ├── Yes → Retrieve more, regenerate                  │
│    └── No → Check factuality (isRel, isSup tokens)      │
│ 3. Verify claims against sources                        │
│ 4. Regenerate if needed                                 │
│ 5. Return verified response                             │
└─────────────────────────────────────────────────────────┘
```
Agentic RAG
```text
Query: "Compare Q3 revenue across regions"
              │
              ▼
      ┌───────────────┐
      │  Query Agent  │
      │ (Plan steps)  │
      └───────────────┘
              │
    ┌─────────┼─────────┐
    ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA  │ │ APAC  │ │ AMER  │
│ docs  │ │ docs  │ │ docs  │
└───────┘ └───────┘ └───────┘
    │         │         │
    └─────────┼─────────┘
              ▼
      ┌───────────────┐
      │  Synthesize   │
      │  Comparison   │
      └───────────────┘
```
Evaluation Metrics
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
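Two of the table's metrics, sketched for a single query (averaging reciprocal rank over a query set gives MRR):

```python
# Recall@K: fraction of all relevant docs that appear in the top-K results.
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Reciprocal rank: 1/rank of the first relevant doc, 0 if none retrieved.
def reciprocal_rank(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```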
End-to-End Metrics
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
Evaluation Framework
```text
┌─────────────────────────────────────────────────────────┐
│ RAG Evaluation Pipeline                                 │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions                  │
│ 2. Ground Truth: Expected answers + source docs         │
│ 3. Metrics:                                             │
│    • Retrieval: Recall@K, MRR, NDCG                     │
│    • Generation: Correctness, Faithfulness              │
│ 4. A/B Testing: Compare configurations                  │
│ 5. Error Analysis: Identify failure patterns            │
└─────────────────────────────────────────────────────────┘
```
Common Failure Modes
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
Scaling Considerations
Index Scaling
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
Latency Budget
```text
Typical RAG Pipeline Latency:
Query embedding:    10-50ms
Vector search:      20-100ms
Reranking:         100-300ms
LLM generation:    500-2000ms
────────────────────────────
Total:             630-2450ms
```
Target p95: <3 seconds for interactive use
Related Skills
- llm-serving-patterns: LLM inference infrastructure
- vector-databases: Vector store selection and optimization
- ml-system-design: End-to-end ML pipeline design
- estimation-techniques: Capacity planning for RAG systems
Version History
- v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design
Last Updated
Date: 2025-12-26