rag-agent-builder


RAG Agent Builder


Build powerful Retrieval-Augmented Generation (RAG) applications that enhance LLM capabilities with external knowledge sources, enabling accurate, contextualized AI responses.

Quick Start


Get started with RAG implementations in the examples and utilities:
  • Examples: See the `examples/` directory for complete implementations:
    • `basic_rag.py` - Simple chunk-embed-retrieve-generate pipeline
    • `retrieval_strategies.py` - Hybrid search, reranking, and filtering
    • `agentic_rag.py` - Agent-controlled retrieval with iterative refinement
  • Utilities: See the `scripts/` directory for helper modules:
    • `embedding_management.py` - Embedding generation, normalization, and caching
    • `vector_db_manager.py` - Vector database abstraction and factory
    • `rag_evaluation.py` - Retrieval and answer quality metrics

Overview


RAG systems combine three key components:
  1. Document Retrieval - Find relevant information from knowledge bases
  2. Context Integration - Pass retrieved context to the LLM
  3. Response Generation - Generate answers grounded in the retrieved information
This skill covers building production-ready RAG applications with various frameworks and approaches.

Core Concepts


What is RAG?


RAG augments LLM knowledge with external data:
  • Without RAG: LLM relies on training data (may be outdated or limited)
  • With RAG: LLM uses real-time, custom knowledge + training knowledge

When to Use RAG


  • Document Q&A: Answer questions about PDFs, books, reports
  • Knowledge Base Search: Query internal documentation, wikis
  • Enterprise Search: Search proprietary company data
  • Context-Specific Assistants: Customer support, HR assistants
  • Fact-Heavy Applications: Legal docs, medical records, financial data

When RAG Might Not Be Needed


  • General knowledge questions (ChatGPT-like)
  • Real-time data that changes constantly (use tools instead)
  • Very simple lookup tasks (use database queries)

Architecture Patterns


Basic RAG Pipeline


Documents → Chunks → Embeddings → Vector DB
User Question → Embedding → Retrieval → LLM → Answer
                              ↑         ↓
                         Vector DB    Context

Advanced RAG Patterns


1. Agentic RAG


  • Agent decides what to retrieve and when
  • Can refine queries iteratively
  • Better for complex reasoning

2. Hierarchical RAG


  • Multi-level document structure
  • Search at different levels of detail
  • More flexible organization

3. Hybrid Search RAG


  • Combines keyword search (BM25) + semantic search (embeddings)
  • Captures both exact matches and meaning
  • Better for mixed query types

4. Corrective RAG (CRAG)


  • Evaluates retrieved documents for relevance
  • Retrieves additional sources if needed
  • Ensures high-quality context

Implementation Components


1. Document Processing


Chunking Strategies:

```python
# Simple fixed-size chunks
chunks = split_text(doc, chunk_size=1000, overlap=100)

# Semantic chunks (group by meaning)
chunks = semantic_chunking(doc, max_tokens=512)

# Hierarchical chunks (different levels)
chapters = split_by_heading(doc)
chunks = split_each_chapter(chapters, size=1000)
```

**Key Considerations**:
- Chunk size affects retrieval quality and cost
- Overlap helps maintain context between chunks
- Semantic chunking preserves meaning better
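A minimal, character-based implementation of the `split_text` helper used in the examples above (a sketch; production splitters also respect sentence and paragraph boundaries):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks whose ends overlap by `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```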

2. Embedding Generation


Popular Embedding Models:
  • OpenAI: `text-embedding-3-small`, `text-embedding-3-large`
  • Open Source: `all-MiniLM-L6-v2`, `all-mpnet-base-v2`
  • Domain-Specific: Domain-trained embeddings for specialized knowledge
Best Practices:
  • Use the same embedding model for indexing and queries
  • Store embeddings as normalized vectors
  • Update embeddings when documents change
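The normalization practice above can be sketched in plain Python: after scaling vectors to unit length, a plain dot product equals cosine similarity (function names here are illustrative):

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; on unit vectors, dot product == cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))
```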

3. Vector Databases


Popular Options:
  • Pinecone: Managed, serverless, easy to scale
  • Weaviate: Open-source, self-hosted, flexible
  • Milvus: Open-source, high performance
  • Chroma: Lightweight, good for prototypes
  • Qdrant: Production-grade, high-performance
Selection Criteria:
  • Scale requirements (data volume, queries per second)
  • Latency needs (real-time vs batch)
  • Cost considerations
  • Deployment preferences (managed vs self-hosted)
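To make the database abstraction concrete, here is a toy in-memory store with the same `query(vector, k)` shape used in this guide's snippets — a prototyping stand-in, not a substitute for Chroma, Qdrant, or the other options above:

```python
import math

class InMemoryVectorDB:
    """Exact cosine search over an in-memory list; fine for prototypes/tests."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector, payload)

    def add(self, doc_id, vector, payload=None):
        self._items.append((doc_id, vector, payload))

    def query(self, vector, k=5):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        scored = [(cos(vector, v), doc_id, payload) for doc_id, v, payload in self._items]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:k]
```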

4. Retrieval Strategies


Retrieval Methods:

```python
# Similarity search (most common)
results = vector_db.query(question_embedding, k=5)

# Hybrid search (keyword + semantic)
keyword_results = bm25.search(question, k=3)
semantic_results = vector_db.query(embedding, k=3)
results = combine_and_rank(keyword_results, semantic_results)

# Reranking (improve relevance)
retrieved = initial_retrieval(query)
reranked = rerank_by_relevance(retrieved, query)
```

**Retrieval Parameters**:
- **k** (number of results): Balance between context and relevance
- **Similarity threshold**: Filter out low-relevance results
- **Diversity**: Return varied results vs best matches
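One common way to implement the `combine_and_rank` step is Reciprocal Rank Fusion (RRF), which merges ranked lists using only ranks, so keyword and semantic scores need no calibration. A sketch (the `k=60` damping constant is the conventional RRF default):

```python
def combine_and_rank(keyword_results, semantic_results, k=60):
    """Merge two ranked lists of doc ids with Reciprocal Rank Fusion (RRF)."""
    scores = {}
    for results in (keyword_results, semantic_results):
        for rank, doc_id in enumerate(results):
            # Each list contributes 1/(k + rank); docs in both lists accumulate.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```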

5. Context Integration


Context Window Management:

```python
# Fit retrieved documents into context window
def prepare_context(retrieved_docs, max_tokens=3000):
    context = ""
    for doc in retrieved_docs:
        if len(tokenize(context + doc)) <= max_tokens:
            context += doc
        else:
            break
    return context
```

**Prompt Design**:

```
You are a helpful assistant. Answer the question based on the provided context.

Context: {retrieved_documents}

Question: {user_question}

Answer:
```
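The `prepare_context` sketch leaves `tokenize` unspecified; a self-contained variant using whitespace splitting as a stand-in for a real tokenizer (e.g. tiktoken):

```python
def tokenize(text: str) -> list[str]:
    # Whitespace split as a stand-in for a real tokenizer.
    return text.split()

def prepare_context(retrieved_docs, max_tokens=3000):
    """Greedily pack docs in retrieval order until the token budget is reached."""
    context = ""
    for doc in retrieved_docs:
        candidate = f"{context}\n\n{doc}" if context else doc
        if len(tokenize(candidate)) <= max_tokens:
            context = candidate
        else:
            break
    return context
```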

6. Response Generation


Generation Strategies:
  • Direct Generation: LLM answers from context
  • Summarization: Summarize multiple retrieved docs first
  • Fact-Grounding: Ensure answer cites sources
  • Iterative Refinement: Refine based on user feedback

Implementation Patterns


Pattern 1: Basic RAG


Simplest RAG implementation:
  1. Split documents into chunks
  2. Generate embeddings for each chunk
  3. Store in vector database
  4. Retrieve top-k similar chunks for query
  5. Pass to LLM with context
Pros: Simple, fast, works well for straightforward QA
Cons: May miss relevant context, no refinement
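The five steps above can be condensed into a toy, dependency-free pipeline. Bag-of-words counts stand in for real embeddings here, and the final LLM call is stubbed out; every name in this sketch is illustrative:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a model such as
    # text-embedding-3-small or all-MiniLM-L6-v2.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(chunks, question, k=2):
    # Rank chunks by similarity to the question and keep the top k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Our headquarters are located in Berlin.",
]
top = retrieve(chunks, "How long does the warranty last?", k=1)
# top would then be passed to the LLM as grounding context.
```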

Pattern 2: Agentic RAG


Agent controls retrieval:
  1. Agent receives user question
  2. Decides whether to retrieve documents
  3. Formulates retrieval query (may differ from original)
  4. Retrieves relevant documents
  5. Can iterate or use tools
  6. Generates final answer
Pros: Better for complex questions, iterative improvement
Cons: More complex, higher costs
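A minimal sketch of the agentic control loop, with `retrieve`, `grade`, and `generate` injected as stand-ins for real search and LLM calls (the query rewrite is hard-coded here; an agent would use the LLM to reformulate):

```python
def agentic_answer(question, retrieve, grade, generate, max_steps=3):
    """Agent-controlled retrieval: re-query until retrieval looks good enough."""
    query = question
    docs = []
    for _ in range(max_steps):
        docs = retrieve(query)
        if grade(question, docs):  # agent judges retrieval quality
            break
        query = f"{question} (rephrased attempt)"  # stand-in for an LLM rewrite
    return generate(question, docs)
```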

Pattern 3: Corrective RAG (CRAG)


Validates retrieved documents:
  1. Retrieve documents for question
  2. Grade each document for relevance
  3. If poor relevance:
    • Try different retrieval strategy
    • Expand search scope
    • Retrieve from different sources
  4. Generate answer from validated context
Pros: Higher quality answers, adapts to failures
Cons: More API calls, slower
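The CRAG fallback logic can be sketched as a loop over retrieval strategies; all function arguments here are hypothetical stand-ins (in practice `grade_doc` is an LLM or cross-encoder relevance check):

```python
def corrective_rag(question, retrievers, grade_doc, generate):
    """Grade retrieved docs; fall back to the next strategy when none pass."""
    for retrieve in retrievers:  # e.g. [vector_search, hybrid_search, web_search]
        docs = retrieve(question)
        relevant = [d for d in docs if grade_doc(question, d)]
        if relevant:
            return generate(question, relevant)
    return generate(question, [])  # last resort: answer without context
```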

Popular Frameworks


LangChain


```python
from langchain.document_loaders import PDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Load documents
loader = PDFLoader("document.pdf")
docs = loader.load()

# Create RAG chain
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(docs, embeddings)
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

answer = qa.run("What is the document about?")
```

LlamaIndex


```python
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = GPTVectorStoreIndex.from_documents(documents)

# Query
response = index.as_query_engine().query("What is the main topic?")
```

CrewAI with RAG


```python
from crewai import Agent, Task, Crew
from tools import retrieval_tool

researcher = Agent(
    role="Research Assistant",
    goal="Research topics using knowledge base",
    tools=[retrieval_tool]
)

research_task = Task(
    description="Research the topic: {topic}",
    agent=researcher
)
```

Best Practices


Document Preparation


  • ✓ Clean and normalize text (remove headers, footers)
  • ✓ Preserve document structure when possible
  • ✓ Add metadata (source, date, category)
  • ✓ Handle PDFs with OCR if scanned
  • ✓ Test chunk sizes for your domain

Embedding Strategy


  • ✓ Use same embedding model for indexing and queries
  • ✓ Fine-tune embeddings for domain-specific needs
  • ✓ Normalize embeddings for consistency
  • ✓ Monitor embedding quality metrics

Retrieval Optimization


  • ✓ Tune k (number of results) for your use case
  • ✓ Use reranking for quality improvement
  • ✓ Implement relevance filtering
  • ✓ Monitor retrieval precision and recall
  • ✓ Cache frequently retrieved documents

Generation Quality


  • ✓ Include source citations in answers
  • ✓ Prompt LLM to indicate confidence
  • ✓ Ask to cite specific documents
  • ✓ Generate summaries for long contexts
  • ✓ Validate answers against context

Monitoring & Evaluation


  • ✓ Track retrieval metrics (precision, recall, MRR)
  • ✓ Monitor answer quality and relevance
  • ✓ Log failed retrievals for improvement
  • ✓ Collect user feedback
  • ✓ Iterate based on failures

Common Challenges & Solutions


Challenge: Irrelevant Retrieval


Solutions:
  • Improve chunking strategy
  • Better embedding model
  • Add document metadata to queries
  • Implement reranking
  • Use hybrid search

Challenge: Context Too Large


Solutions:
  • Reduce chunk size
  • Retrieve fewer results (smaller k)
  • Summarize retrieved context
  • Use hierarchical retrieval
  • Filter by relevance score

Challenge: Missing Information


Solutions:
  • Increase k (retrieve more)
  • Improve embedding model
  • Better preprocessing
  • Use multiple search strategies
  • Add document hierarchy

Challenge: Slow Performance


Solutions:
  • Use managed vector database
  • Cache embeddings
  • Batch process documents
  • Optimize chunk size
  • Use smaller embedding model for speed

Evaluation Metrics


Retrieval Metrics:
  • Precision: % of retrieved docs that are relevant
  • Recall: % of relevant docs that are retrieved
  • MRR (Mean Reciprocal Rank): Rank of first relevant result
  • NDCG (Normalized DCG): Quality of ranking
Answer Quality Metrics:
  • Relevance: Does answer address the question?
  • Correctness: Is the answer factually accurate?
  • Grounding: Is answer supported by context?
  • User Satisfaction: Would user find answer helpful?
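The retrieval metrics above have straightforward single-query implementations; averaging the reciprocal-rank values over a query set gives the "mean" in MRR:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def reciprocal_rank(ranked_results, relevant):
    """1 / rank of the first relevant result; 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_results, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```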

Advanced Techniques


1. Query Expansion


```python
# Expand query with related terms
expanded_query = query + " " + synonym_expansion(query)
results = retrieve(expanded_query)
```
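A self-contained sketch of `synonym_expansion` with a toy lookup table (`SYNONYMS` and its entries are purely illustrative; production systems use a thesaurus or an LLM rewriter):

```python
SYNONYMS = {"car": ["automobile", "vehicle"], "price": ["cost"]}  # toy lexicon

def synonym_expansion(query: str) -> str:
    """Return known synonyms for each query term, joined by spaces."""
    extras = []
    for term in query.lower().split():
        extras.extend(SYNONYMS.get(term, []))
    return " ".join(extras)

expanded_query = "car price" + " " + synonym_expansion("car price")
```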

2. Document Compression


```python
# Compress retrieved docs before passing to LLM
compressed = compress_documents(retrieved_docs, query)
context = format_context(compressed)
```

3. Active Retrieval


```python
# Iteratively refine retrieval based on LLM output
query = user_question
for _ in range(max_iterations):
    results = retrieve(query)
    answer = generate_with_context(results)
    if answer_complete(answer):
        break
    query = refine_query(answer)
```

4. Multi-Modal RAG


```python
# Retrieve both text and images
text_results = text_retriever.query(question)
image_results = image_retriever.query(question)
context = combine_multimodal(text_results, image_results)
```

Resources & References


Key Papers


  • "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.)
  • "REALM: Retrieval-Augmented Language Model Pre-Training" (Guu et al.)
  • "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.)
  • "REALM: Retrieval-Augmented Language Model Pre-Training" (Guu et al.)

Frameworks


Vector Databases


Embedding Models


Next Steps


  1. Choose your stack: Decide on framework (LangChain, LlamaIndex, etc.)
  2. Prepare documents: Process and chunk your knowledge base
  3. Select embeddings: Choose embedding model for your domain
  4. Pick vector DB: Select storage solution for scale
  5. Build pipeline: Implement retrieval and generation
  6. Evaluate: Test on sample questions and iterate
  7. Monitor: Track quality metrics in production