RAG Architecture


When to Use This Skill


Use this skill when:
  • Designing RAG pipelines for LLM applications
  • Choosing chunking and embedding strategies
  • Optimizing retrieval quality and relevance
  • Building knowledge-grounded AI systems
  • Implementing hybrid search (dense + sparse)
  • Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval

RAG Architecture Overview


┌─────────────────────────────────────────────────────────────────────┐
│                       RAG Pipeline                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │   Ingestion  │    │   Indexing   │    │    Vector Store      │  │
│  │   Pipeline   │───▶│   Pipeline   │───▶│    (Embeddings)      │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │               │
│    Documents           Chunks +                 Indexed             │
│                       Embeddings               Vectors              │
│                                                     │               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │    Query     │    │  Retrieval   │    │   Context Assembly   │  │
│  │  Processing  │───▶│   Engine     │───▶│   + Generation       │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │               │
│    User Query          Top-K Chunks            LLM Response         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Document Ingestion Pipeline


Document Processing Steps


Raw Documents
┌─────────────┐
│   Extract   │ ← PDF, HTML, DOCX, Markdown
│   Content   │
└─────────────┘
┌─────────────┐
│   Clean &   │ ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
┌─────────────┐
│   Chunk     │ ← Split into retrievable units
│  Documents  │
└─────────────┘
┌─────────────┐
│  Generate   │ ← Create vector representations
│ Embeddings  │
└─────────────┘
┌─────────────┐
│   Store     │ ← Persist vectors + metadata
│  in Index   │
└─────────────┘

Chunking Strategies


Strategy Comparison


| Strategy | Description | Best For | Chunk Size |
|----------|-------------|----------|------------|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |

Chunking Decision Tree


What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers

Chunk Overlap


Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
               Information lost at boundary

With Overlap (20%):
[Chunk 1: "The quick brown fox"]
                    [Chunk 2: "brown fox jumps over"]
              Context preserved across boundaries
Recommended overlap: 10-20% of chunk size
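A minimal sliding-window chunker makes the overlap idea concrete. This is a sketch over pre-tokenized input; `chunk_with_overlap` and its word-level tokens are illustrative stand-ins for a real tokenizer-aware splitter.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token sequence into fixed-size chunks whose boundaries
    overlap, so text spanning a cut appears whole in at least one chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk already reaches the end
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
# Each chunk repeats the final token of its predecessor.
```

With 20% overlap, `step` becomes 80% of `chunk_size`, matching the recommendation above.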

Chunk Size Trade-offs


Smaller Chunks (128-256 tokens)        Larger Chunks (512-1024 tokens)
├── More precise retrieval             ├── More context per chunk
├── Less context per chunk             ├── May include irrelevant content
├── More chunks to search              ├── Fewer chunks to search
├── Better for factoid Q&A             ├── Better for summarization
└── Higher retrieval precision        └── Higher retrieval recall

Embedding Models


Model Comparison


| Model | Dimensions | Context | Strengths |
|-------|------------|---------|-----------|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |

Embedding Selection


Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small

Embedding Optimization


| Technique | Description | When to Use |
|-----------|-------------|-------------|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
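As a sketch of the Matryoshka technique: keep only the leading dimensions and re-normalize, trading a little quality for memory. This assumes the model was trained with Matryoshka representation learning; plain embeddings degrade badly when truncated this way.

```python
import numpy as np

def truncate_matryoshka(vecs, dim):
    """Keep the first `dim` dimensions of each embedding and re-normalize
    so cosine similarity stays well-scaled after truncation."""
    cut = np.asarray(vecs, dtype=float)[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# A 3-D toy embedding truncated to 2-D and re-normalized to unit length.
small = truncate_matryoshka([[3.0, 4.0, 12.0]], dim=2)
```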

Retrieval Strategies


Dense Retrieval (Semantic Search)


Query: "How to deploy containers"
    ┌─────────┐
    │ Embed   │
    │ Query   │
    └─────────┘
    ┌─────────────────────────────────┐
    │ Vector Similarity Search        │
    │ (Cosine, Dot Product, L2)       │
    └─────────────────────────────────┘
    Top-K semantically similar chunks
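The vector-similarity step above can be sketched in a few lines of NumPy; real systems replace the brute-force scan with an ANN index, but the ranking logic is the same.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k):
    """Rank documents by cosine similarity to the query embedding and
    return the indices of the k best matches."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q                   # cosine similarity per document
    return np.argsort(-scores)[:k].tolist()  # highest similarity first

# Toy 2-D embeddings; index 0 points the same way as the query.
ranked = top_k_cosine([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0], [0.9, 0.1]], k=3)
```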

Sparse Retrieval (BM25/TF-IDF)


Query: "Kubernetes pod deployment YAML"
    ┌─────────┐
    │Tokenize │
    │ + Score │
    └─────────┘
    ┌─────────────────────────────────┐
    │ BM25 Ranking                    │
    │ (Term frequency × IDF)          │
    └─────────────────────────────────┘
    Top-K lexically matching chunks
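A from-scratch scorer makes the "term frequency × IDF" ranking concrete. This is the Okapi BM25 formula with common defaults (k1=1.5, b=0.75); a production system would score via an inverted index rather than looping over every document.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["kubernetes", "pod", "deployment"],
        ["docker", "container", "image"],
        ["kubernetes", "deployment", "yaml", "pod"]]
scores = bm25_scores(["kubernetes", "yaml"], docs)
```

The document matching both query terms outscores the partial match; a document with no overlap scores zero.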

Hybrid Search (Best of Both)


Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │      │
        └──▶ Sparse Search ─┘      │
        Fusion Methods:            ▼
        • RRF (Reciprocal Rank Fusion)
        • Linear combination
        • Learned reranking

Reciprocal Rank Fusion (RRF)


RRF Score = Σ 1 / (k + rank_i)

Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result

Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323

Result: Doc B ranks higher (better combined relevance)
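The formula above takes only a few lines to implement; this sketch reproduces the worked example, with each ranking given as a dict of doc id to 1-based rank.

```python
def rrf_fuse(rankings, k=60):
    """Combine ranked lists with Reciprocal Rank Fusion.

    rankings: iterable of dicts mapping doc id -> 1-based rank.
    Returns (doc, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for doc, rank in ranking.items():
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The worked example above: dense and sparse ranks for docs A and B.
fused = rrf_fuse([{"A": 1, "B": 3}, {"A": 5, "B": 1}])
```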

Multi-Stage Retrieval


Two-Stage Pipeline


┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall)                     │
│ • ANN search (HNSW, IVF)                                │
│ • Retrieve top-100 candidates                           │
│ • Latency: 10-50ms                                      │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision)                  │
│ • Cross-encoder or LLM reranking                        │
│ • Score top-100 → return top-10                         │
│ • Latency: 100-500ms                                    │
└─────────────────────────────────────────────────────────┘
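The two stages reduce to a recall call followed by a sort on an expensive score. In this sketch, `ann_search` and `rerank_score` are caller-supplied stand-ins for a real ANN index and cross-encoder; the toy reranker below just counts word overlap.

```python
def two_stage_retrieve(query, ann_search, rerank_score,
                       recall_k=100, final_k=10):
    """Stage 1: cheap ANN recall of recall_k candidates.
    Stage 2: expensive scoring of only those candidates (e.g. a
    cross-encoder), returning the final_k best."""
    candidates = ann_search(query, recall_k)
    reranked = sorted(candidates,
                      key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:final_k]

# Toy stand-ins: fixed candidate list, word-overlap "reranker".
corpus = ["deploy containers with kubernetes", "cooking pasta", "deploy scripts"]
result = two_stage_retrieve(
    "how to deploy containers",
    ann_search=lambda q, k: corpus[:k],
    rerank_score=lambda q, c: len(set(q.split()) & set(c.split())),
    final_k=2,
)
```

The key property is cost asymmetry: `rerank_score` is only ever called on the small candidate set, never on the full corpus.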

Reranking Options


| Reranker | Latency | Quality | Cost |
|----------|---------|---------|------|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |

Context Assembly


Context Window Management


Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)

Strategy: Maximize retrieved context quality within budget
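A simple way to act on that budget is greedy packing: walk the chunks in relevance order and take each one that still fits. The word-count tokenizer here is a stand-in; swap in a real one.

```python
def pack_context(ranked_chunks, budget,
                 count_tokens=lambda s: len(s.split())):
    """Greedily fill a token budget with chunks in relevance order,
    skipping any chunk that would overflow it."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= budget:
            packed.append(chunk)
            used += cost
    return packed, used

packed, used = pack_context(
    ["alpha beta gamma delta", "epsilon zeta", "eta theta iota"],
    budget=6)
```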

Context Assembly Strategies


| Strategy | Description | When to Use |
|----------|-------------|-------------|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |

Lost-in-the-Middle Problem


LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│ Beginning           Middle            End               │
│    ████              ░░░░             ████              │
│  High attention   Low attention   High attention        │
└─────────────────────────────────────────────────────────┘

Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
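Mitigation 1 can be sketched as a reordering: alternate relevance-ranked chunks between the front and back of the context so the weakest material lands in the low-attention middle. This interleaving is one reasonable reading of "beginning AND end", not a canonical algorithm.

```python
def order_for_attention(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context,
    pushing the least relevant into the low-attention middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# c1 is most relevant; result keeps c1 first and c2 last.
ordered = order_for_attention(["c1", "c2", "c3", "c4", "c5"])
```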

Advanced RAG Patterns


Query Transformation


Original Query: "Tell me about the project"
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
    ┌─────────┐      ┌──────────┐     ┌──────────┐
    │ HyDE    │      │ Query    │     │ Sub-query│
    │ (Hypo   │      │ Expansion│     │ Decomp.  │
    │ Doc)    │      │          │     │          │
    └─────────┘      └──────────┘     └──────────┘
         │                 │                 │
         ▼                 ▼                 ▼
    Hypothetical      "project,        "What is the
    answer to         goals,           project scope?"
    embed             timeline,        "What are the
                      deliverables"    deliverables?"

HyDE (Hypothetical Document Embeddings)


Query: "How does photosynthesis work?"
        ┌───────────────┐
        │ LLM generates │
        │ hypothetical  │
        │ answer        │
        └───────────────┘
"Photosynthesis is the process by which
plants convert sunlight into energy..."
        ┌───────────────┐
        │ Embed hypo    │
        │ document      │
        └───────────────┘
    Search with hypothetical embedding
    (Better matches actual documents)
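The whole pattern is three calls: generate a hypothetical answer, embed it, search with that embedding. In this sketch `generate`, `embed`, and `search` are caller-supplied stand-ins for a real LLM call, embedding model, and vector index, so any model pairing works; the prompt wording is illustrative.

```python
def hyde_search(query, generate, embed, search):
    """HyDE: embed an LLM-written hypothetical answer instead of the raw
    query, then run vector search with that embedding."""
    hypothetical = generate(
        f"Write a short passage that directly answers: {query}")
    return search(embed(hypothetical))

# Fake components just to show the data flow.
hits = hyde_search(
    "How does photosynthesis work?",
    generate=lambda prompt: "plants convert sunlight into energy",
    embed=lambda text: set(text.split()),
    search=lambda vec: sorted(vec & {"plants", "sunlight"}),
)
```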

Self-RAG (Retrieval-Augmented LM with Self-Reflection)


┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response                            │
│ 2. Decide: Need more retrieval? (critique token)        │
│    ├── Yes → Retrieve more, regenerate                  │
│    └── No → Check factuality (isRel, isSup tokens)      │
│ 3. Verify claims against sources                        │
│ 4. Regenerate if needed                                 │
│ 5. Return verified response                             │
└─────────────────────────────────────────────────────────┘

Agentic RAG


Query: "Compare Q3 revenue across regions"
        ┌───────────────┐
        │ Query Agent   │
        │ (Plan steps)  │
        └───────────────┘
    ┌───────────┼───────────┐
    ▼           ▼           ▼
┌───────┐   ┌───────┐   ┌───────┐
│Search │   │Search │   │Search │
│ EMEA  │   │ APAC  │   │ AMER  │
│ docs  │   │ docs  │   │ docs  │
└───────┘   └───────┘   └───────┘
    │           │           │
    └───────────┼───────────┘
        ┌───────────────┐
        │  Synthesize   │
        │  Comparison   │
        └───────────────┘

Evaluation Metrics


Retrieval Metrics


| Metric | Description | Target |
|--------|-------------|--------|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
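Recall@K and MRR are short enough to compute directly from retrieved lists and relevance judgments:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant doc over (retrieved,
    relevant) query pairs; contributes 0 when nothing relevant is found."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Two of three relevant docs retrieved -> recall 2/3.
r = recall_at_k(["d1", "d2", "d3"], {"d1", "d3", "d9"}, k=3)
# First hits at ranks 2 and 1 -> MRR (0.5 + 1.0) / 2.
m = mean_reciprocal_rank([(["x", "d1"], {"d1"}), (["d2"], {"d2"})])
```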

End-to-End Metrics


| Metric | Description | Target |
|--------|-------------|--------|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |

Evaluation Framework


┌─────────────────────────────────────────────────────────┐
│                RAG Evaluation Pipeline                  │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions                  │
│ 2. Ground Truth: Expected answers + source docs         │
│ 3. Metrics:                                             │
│    • Retrieval: Recall@K, MRR, NDCG                     │
│    • Generation: Correctness, Faithfulness              │
│ 4. A/B Testing: Compare configurations                  │
│ 5. Error Analysis: Identify failure patterns            │
└─────────────────────────────────────────────────────────┘

Common Failure Modes


| Failure Mode | Cause | Mitigation |
|--------------|-------|------------|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical summarization |
| Stale data | Outdated index | Incremental updates, TTL |

Scaling Considerations


Index Scaling


| Scale | Approach |
|-------|----------|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |

Latency Budget


Typical RAG Pipeline Latency:

Query embedding:     10-50ms
Vector search:       20-100ms
Reranking:          100-300ms
LLM generation:     500-2000ms
────────────────────────────
Total:              630-2450ms

Target p95: <3 seconds for interactive use

Related Skills


  • llm-serving-patterns
    - LLM inference infrastructure
  • vector-databases
    - Vector store selection and optimization
  • ml-system-design
    - End-to-end ML pipeline design
  • estimation-techniques
    - Capacity planning for RAG systems

Version History


  • v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design


Last Updated


Date: 2025-12-26