RAG Architecture


When to Use This Skill


Use this skill when:
  • Designing RAG pipelines for LLM applications
  • Choosing chunking and embedding strategies
  • Optimizing retrieval quality and relevance
  • Building knowledge-grounded AI systems
  • Implementing hybrid search (dense + sparse)
  • Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval

RAG Architecture Overview


┌─────────────────────────────────────────────────────────────────────┐
│                       RAG Pipeline                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │   Ingestion  │    │   Indexing   │    │    Vector Store      │  │
│  │   Pipeline   │───▶│   Pipeline   │───▶│    (Embeddings)      │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │               │
│    Documents           Chunks +                 Indexed             │
│                       Embeddings               Vectors              │
│                                                     │               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │    Query     │    │  Retrieval   │    │   Context Assembly   │  │
│  │  Processing  │───▶│   Engine     │───▶│   + Generation       │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │               │
│    User Query          Top-K Chunks            LLM Response         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Document Ingestion Pipeline


Document Processing Steps


Raw Documents
┌─────────────┐
│   Extract   │ ← PDF, HTML, DOCX, Markdown
│   Content   │
└─────────────┘
┌─────────────┐
│   Clean &   │ ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
┌─────────────┐
│   Chunk     │ ← Split into retrievable units
│  Documents  │
└─────────────┘
┌─────────────┐
│  Generate   │ ← Create vector representations
│ Embeddings  │
└─────────────┘
┌─────────────┐
│   Store     │ ← Persist vectors + metadata
│  in Index   │
└─────────────┘

Chunking Strategies


Strategy Comparison


| Strategy | Description | Best For | Chunk Size |
|----------|-------------|----------|------------|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |

Chunking Decision Tree


What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers

Chunk Overlap


Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
               Information lost at boundary

With Overlap (20%):
[Chunk 1: "The quick brown fox"]
                    [Chunk 2: "brown fox jumps over"]
              Context preserved across boundaries
Recommended overlap: 10-20% of chunk size
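A minimal sliding-window chunker makes the overlap idea concrete. This is a sketch over pre-tokenized input; `chunk_with_overlap` and its word-level tokens are illustrative stand-ins for a real tokenizer-aware splitter.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token sequence into fixed-size chunks whose boundaries
    overlap, so text spanning a cut appears whole in at least one chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk already reaches the end
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
# Each chunk repeats the final token of its predecessor.
```

With 20% overlap, `step` becomes 80% of `chunk_size`, matching the recommendation above.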

Chunk Size Trade-offs


Smaller Chunks (128-256 tokens)        Larger Chunks (512-1024 tokens)
├── More precise retrieval             ├── More context per chunk
├── Less context per chunk             ├── May include irrelevant content
├── More chunks to search              ├── Fewer chunks to search
├── Better for factoid Q&A             ├── Better for summarization
└── Higher retrieval precision        └── Higher retrieval recall

Embedding Models


Model Comparison


| Model | Dimensions | Context | Strengths |
|-------|------------|---------|-----------|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |

Embedding Selection


Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small

Embedding Optimization


| Technique | Description | When to Use |
|-----------|-------------|-------------|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
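As a sketch of the Matryoshka technique: keep only the leading dimensions and re-normalize, trading a little quality for memory. This assumes the model was trained with Matryoshka representation learning; plain embeddings degrade badly when truncated this way.

```python
import numpy as np

def truncate_matryoshka(vecs, dim):
    """Keep the first `dim` dimensions of each embedding and re-normalize
    so cosine similarity stays well-scaled after truncation."""
    cut = np.asarray(vecs, dtype=float)[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# A 3-D toy embedding truncated to 2-D and re-normalized to unit length.
small = truncate_matryoshka([[3.0, 4.0, 12.0]], dim=2)
```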

Retrieval Strategies


Dense Retrieval (Semantic Search)


Query: "How to deploy containers"
    ┌─────────┐
    │ Embed   │
    │ Query   │
    └─────────┘
    ┌─────────────────────────────────┐
    │ Vector Similarity Search        │
    │ (Cosine, Dot Product, L2)       │
    └─────────────────────────────────┘
    Top-K semantically similar chunks
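The vector-similarity step above can be sketched in a few lines of NumPy; real systems replace the brute-force scan with an ANN index, but the ranking logic is the same.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k):
    """Rank documents by cosine similarity to the query embedding and
    return the indices of the k best matches."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q                   # cosine similarity per document
    return np.argsort(-scores)[:k].tolist()  # highest similarity first

# Toy 2-D embeddings; index 0 points the same way as the query.
ranked = top_k_cosine([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0], [0.9, 0.1]], k=3)
```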

Sparse Retrieval (BM25/TF-IDF)


Query: "Kubernetes pod deployment YAML"
    ┌─────────┐
    │Tokenize │
    │ + Score │
    └─────────┘
    ┌─────────────────────────────────┐
    │ BM25 Ranking                    │
    │ (Term frequency × IDF)          │
    └─────────────────────────────────┘
    Top-K lexically matching chunks
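A from-scratch scorer makes the "term frequency × IDF" ranking concrete. This is the Okapi BM25 formula with common defaults (k1=1.5, b=0.75); a production system would score via an inverted index rather than looping over every document.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["kubernetes", "pod", "deployment"],
        ["docker", "container", "image"],
        ["kubernetes", "deployment", "yaml", "pod"]]
scores = bm25_scores(["kubernetes", "yaml"], docs)
```

The document matching both query terms outscores the partial match; a document with no overlap scores zero.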

Hybrid Search (Best of Both)


Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking
        │                   │      │
        └──▶ Sparse Search ─┘      │
        Fusion Methods:            ▼
        • RRF (Reciprocal Rank Fusion)
        • Linear combination
        • Learned reranking

Reciprocal Rank Fusion (RRF)


RRF Score = Σ 1 / (k + rank_i)

Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result

Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323

Result: Doc B ranks higher (better combined relevance)
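The formula above takes only a few lines to implement; this sketch reproduces the worked example, with each ranking given as a dict of doc id to 1-based rank.

```python
def rrf_fuse(rankings, k=60):
    """Combine ranked lists with Reciprocal Rank Fusion.

    rankings: iterable of dicts mapping doc id -> 1-based rank.
    Returns (doc, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for doc, rank in ranking.items():
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# The worked example above: dense and sparse ranks for docs A and B.
fused = rrf_fuse([{"A": 1, "B": 3}, {"A": 5, "B": 1}])
```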

Multi-Stage Retrieval


Two-Stage Pipeline


┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall)                     │
│ • ANN search (HNSW, IVF)                                │
│ • Retrieve top-100 candidates                           │
│ • Latency: 10-50ms                                      │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision)                  │
│ • Cross-encoder or LLM reranking                        │
│ • Score top-100 → return top-10                         │
│ • Latency: 100-500ms                                    │
└─────────────────────────────────────────────────────────┘
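The two stages reduce to a recall call followed by a sort on an expensive score. In this sketch, `ann_search` and `rerank_score` are caller-supplied stand-ins for a real ANN index and cross-encoder; the toy reranker below just counts word overlap.

```python
def two_stage_retrieve(query, ann_search, rerank_score,
                       recall_k=100, final_k=10):
    """Stage 1: cheap ANN recall of recall_k candidates.
    Stage 2: expensive scoring of only those candidates (e.g. a
    cross-encoder), returning the final_k best."""
    candidates = ann_search(query, recall_k)
    reranked = sorted(candidates,
                      key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:final_k]

# Toy stand-ins: fixed candidate list, word-overlap "reranker".
corpus = ["deploy containers with kubernetes", "cooking pasta", "deploy scripts"]
result = two_stage_retrieve(
    "how to deploy containers",
    ann_search=lambda q, k: corpus[:k],
    rerank_score=lambda q, c: len(set(q.split()) & set(c.split())),
    final_k=2,
)
```

The key property is cost asymmetry: `rerank_score` is only ever called on the small candidate set, never on the full corpus.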

Reranking Options


| Reranker | Latency | Quality | Cost |
|----------|---------|---------|------|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |

Context Assembly


Context Window Management


Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)

Strategy: Maximize retrieved context quality within budget
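A simple way to act on that budget is greedy packing: walk the chunks in relevance order and take each one that still fits. The word-count tokenizer here is a stand-in; swap in a real one.

```python
def pack_context(ranked_chunks, budget,
                 count_tokens=lambda s: len(s.split())):
    """Greedily fill a token budget with chunks in relevance order,
    skipping any chunk that would overflow it."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= budget:
            packed.append(chunk)
            used += cost
    return packed, used

packed, used = pack_context(
    ["alpha beta gamma delta", "epsilon zeta", "eta theta iota"],
    budget=6)
```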

Context Assembly Strategies


| Strategy | Description | When to Use |
|----------|-------------|-------------|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |

Lost-in-the-Middle Problem


LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│ Beginning           Middle            End               │
│    ████              ░░░░             ████              │
│  High attention   Low attention   High attention        │
└─────────────────────────────────────────────────────────┘

Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
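Mitigation 1 can be sketched as a reordering: alternate relevance-ranked chunks between the front and back of the context so the weakest material lands in the low-attention middle. This interleaving is one reasonable reading of "beginning AND end", not a canonical algorithm.

```python
def order_for_attention(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context,
    pushing the least relevant into the low-attention middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# c1 is most relevant; result keeps c1 first and c2 last.
ordered = order_for_attention(["c1", "c2", "c3", "c4", "c5"])
```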

Advanced RAG Patterns


Query Transformation


Original Query: "Tell me about the project"
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
    ┌─────────┐      ┌──────────┐     ┌──────────┐
    │ HyDE    │      │ Query    │     │ Sub-query│
    │ (Hypo   │      │ Expansion│     │ Decomp.  │
    │ Doc)    │      │          │     │          │
    └─────────┘      └──────────┘     └──────────┘
         │                 │                 │
         ▼                 ▼                 ▼
    Hypothetical      "project,        "What is the
    answer to         goals,           project scope?"
    embed             timeline,        "What are the
                      deliverables"    deliverables?"

HyDE (Hypothetical Document Embeddings)


Query: "How does photosynthesis work?"
        ┌───────────────┐
        │ LLM generates │
        │ hypothetical  │
        │ answer        │
        └───────────────┘
"Photosynthesis is the process by which
plants convert sunlight into energy..."
        ┌───────────────┐
        │ Embed hypo    │
        │ document      │
        └───────────────┘
    Search with hypothetical embedding
    (Better matches actual documents)
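The whole pattern is three calls: generate a hypothetical answer, embed it, search with that embedding. In this sketch `generate`, `embed`, and `search` are caller-supplied stand-ins for a real LLM call, embedding model, and vector index, so any model pairing works; the prompt wording is illustrative.

```python
def hyde_search(query, generate, embed, search):
    """HyDE: embed an LLM-written hypothetical answer instead of the raw
    query, then run vector search with that embedding."""
    hypothetical = generate(
        f"Write a short passage that directly answers: {query}")
    return search(embed(hypothetical))

# Fake components just to show the data flow.
hits = hyde_search(
    "How does photosynthesis work?",
    generate=lambda prompt: "plants convert sunlight into energy",
    embed=lambda text: set(text.split()),
    search=lambda vec: sorted(vec & {"plants", "sunlight"}),
)
```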

Self-RAG (Retrieval-Augmented LM with Self-Reflection)


┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response                            │
│ 2. Decide: Need more retrieval? (critique token)        │
│    ├── Yes → Retrieve more, regenerate                  │
│    └── No → Check factuality (isRel, isSup tokens)      │
│ 3. Verify claims against sources                        │
│ 4. Regenerate if needed                                 │
│ 5. Return verified response                             │
└─────────────────────────────────────────────────────────┘

Agentic RAG


Query: "Compare Q3 revenue across regions"
        ┌───────────────┐
        │ Query Agent   │
        │ (Plan steps)  │
        └───────────────┘
    ┌───────────┼───────────┐
    ▼           ▼           ▼
┌───────┐   ┌───────┐   ┌───────┐
│Search │   │Search │   │Search │
│ EMEA  │   │ APAC  │   │ AMER  │
│ docs  │   │ docs  │   │ docs  │
└───────┘   └───────┘   └───────┘
    │           │           │
    └───────────┼───────────┘
        ┌───────────────┐
        │  Synthesize   │
        │  Comparison   │
        └───────────────┘

Evaluation Metrics


Retrieval Metrics


| Metric | Description | Target |
|--------|-------------|--------|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
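Recall@K and MRR are short enough to compute directly from retrieved lists and relevance judgments:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant doc over (retrieved,
    relevant) query pairs; contributes 0 when nothing relevant is found."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Two of three relevant docs retrieved -> recall 2/3.
r = recall_at_k(["d1", "d2", "d3"], {"d1", "d3", "d9"}, k=3)
# First hits at ranks 2 and 1 -> MRR (0.5 + 1.0) / 2.
m = mean_reciprocal_rank([(["x", "d1"], {"d1"}), (["d2"], {"d2"})])
```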

End-to-End Metrics


| Metric | Description | Target |
|--------|-------------|--------|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |

Evaluation Framework


┌─────────────────────────────────────────────────────────┐
│                RAG Evaluation Pipeline                  │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions                  │
│ 2. Ground Truth: Expected answers + source docs         │
│ 3. Metrics:                                             │
│    • Retrieval: Recall@K, MRR, NDCG                     │
│    • Generation: Correctness, Faithfulness              │
│ 4. A/B Testing: Compare configurations                  │
│ 5. Error Analysis: Identify failure patterns            │
└─────────────────────────────────────────────────────────┘

Common Failure Modes


| Failure Mode | Cause | Mitigation |
|--------------|-------|------------|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical summarization |
| Stale data | Outdated index | Incremental updates, TTL |

Scaling Considerations


Index Scaling


| Scale | Approach |
|-------|----------|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |

Latency Budget


Typical RAG Pipeline Latency:

Query embedding:     10-50ms
Vector search:       20-100ms
Reranking:          100-300ms
LLM generation:     500-2000ms
────────────────────────────
Total:              630-2450ms

Target p95: <3 seconds for interactive use

Related Skills


  • llm-serving-patterns
    - LLM inference infrastructure
  • vector-databases
    - Vector store selection and optimization
  • ml-system-design
    - End-to-end ML pipeline design
  • estimation-techniques
    - Capacity planning for RAG systems

Version History


  • v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design


Last Updated


Date: 2025-12-26