rag-agent-builder

RAG Agent Builder
Build powerful Retrieval-Augmented Generation (RAG) applications that enhance LLM capabilities with external knowledge sources, enabling accurate, contextualized AI responses.
Quick Start
快速开始
Get started with RAG implementations in the examples and utilities:
- Examples: See the `examples/` directory for complete implementations:
  - `basic_rag.py` - Simple chunk-embed-retrieve-generate pipeline
  - `retrieval_strategies.py` - Hybrid search, reranking, and filtering
  - `agentic_rag.py` - Agent-controlled retrieval with iterative refinement
- Utilities: See the `scripts/` directory for helper modules:
  - `embedding_management.py` - Embedding generation, normalization, and caching
  - `vector_db_manager.py` - Vector database abstraction and factory
  - `rag_evaluation.py` - Retrieval and answer quality metrics
Overview
概述
RAG systems combine three key components:
- Document Retrieval - Find relevant information from knowledge bases
- Context Integration - Pass retrieved context to the LLM
- Response Generation - Generate answers grounded in the retrieved information
This skill covers building production-ready RAG applications with various frameworks and approaches.
Core Concepts
核心概念
What is RAG?
什么是RAG?
RAG augments LLM knowledge with external data:
- Without RAG: LLM relies on training data (may be outdated or limited)
- With RAG: LLM uses real-time, custom knowledge + training knowledge
When to Use RAG
何时使用RAG
- Document Q&A: Answer questions about PDFs, books, reports
- Knowledge Base Search: Query internal documentation, wikis
- Enterprise Search: Search proprietary company data
- Context-Specific Assistants: Customer support, HR assistants
- Fact-Heavy Applications: Legal docs, medical records, financial data
When RAG Might Not Be Needed
何时无需使用RAG
- General knowledge questions (ChatGPT-like)
- Real-time data that changes constantly (use tools instead)
- Very simple lookup tasks (use database queries)
Architecture Patterns
Basic RAG Pipeline
```
Documents → Chunks → Embeddings → Vector DB
                                      ↓
User Question → Embedding → Retrieval → LLM → Answer
                                ↑          ↓
                            Vector DB   Context
```

Advanced RAG Patterns
1. Agentic RAG
- Agent decides what to retrieve and when
- Can refine queries iteratively
- Better for complex reasoning
2. Hierarchical RAG
- Multi-level document structure
- Search at different levels of detail
- More flexible organization
3. Hybrid Search RAG
- Combines keyword search (BM25) + semantic search (embeddings)
- Captures both exact matches and meaning
- Better for mixed query types
4. Corrective RAG (CRAG)
- Evaluates retrieved documents for relevance
- Retrieves additional sources if needed
- Ensures high-quality context
Implementation Components
1. Document Processing
Chunking Strategies:

```python
# Simple fixed-size chunks
chunks = split_text(doc, chunk_size=1000, overlap=100)

# Semantic chunks (group by meaning)
chunks = semantic_chunking(doc, max_tokens=512)

# Hierarchical chunks (different levels)
chapters = split_by_heading(doc)
chunks = split_each_chapter(chapters, size=1000)
```

**Key Considerations**:
- Chunk size affects retrieval quality and cost
- Overlap helps maintain context between chunks
- Semantic chunking preserves meaning better
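A concrete version of the simple fixed-size strategy can be sketched in a few lines; `split_text` here is a self-contained illustration, not any particular library's API:

```python
def split_text(doc: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks, with `overlap` characters shared
    between consecutive chunks to preserve context across boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(doc), step):
        chunks.append(doc[start:start + chunk_size])
        if start + chunk_size >= len(doc):
            break
    return chunks

chunks = split_text("0123456789" * 250, chunk_size=1000, overlap=100)
# 3 chunks; each consecutive pair shares 100 characters
```

Production systems usually chunk on token counts rather than characters, but the overlap logic is the same.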
2. Embedding Generation
Popular Embedding Models:
- OpenAI: `text-embedding-3-small`, `text-embedding-3-large`
- Open Source: `all-MiniLM-L6-v2`, `all-mpnet-base-v2`
- Domain-Specific: Domain-trained embeddings for specialized knowledge

Best Practices:
- Use the same embedding model for indexing and queries
- Store embeddings as normalized vectors
- Update embeddings when documents change
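The normalization practice can be illustrated with a minimal standard-library sketch (real pipelines typically use numpy or the embedding client's own normalization):

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length, so a plain dot product between two
    normalized vectors equals their cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]

unit = normalize([3.0, 4.0])  # → [0.6, 0.8]
```

With normalized vectors stored in the database, similarity search can use the cheaper dot product instead of recomputing full cosine similarity per query.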
3. Vector Databases
Popular Options:
- Pinecone: Managed, serverless, easy to scale
- Weaviate: Open-source, self-hosted, flexible
- Milvus: Open-source, high performance
- Chroma: Lightweight, good for prototypes
- Qdrant: Production-grade, high-performance
Selection Criteria:
- Scale requirements (data volume, queries per second)
- Latency needs (real-time vs batch)
- Cost considerations
- Deployment preferences (managed vs self-hosted)
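To make the abstraction concrete before committing to one of the options above, here is a toy in-memory store with an `add`/`query` shape similar to real vector databases (the interface is illustrative, not any product's actual API):

```python
import math

class InMemoryVectorDB:
    """Toy stand-in sharing the add/query shape of real vector databases."""

    def __init__(self):
        self._items = []  # (doc_id, embedding, document)

    def add(self, doc_id, embedding, document):
        self._items.append((doc_id, embedding, document))

    def query(self, embedding, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        # rank all stored items by similarity to the query embedding
        ranked = sorted(self._items, key=lambda item: cosine(item[1], embedding),
                        reverse=True)
        return [(doc_id, doc) for doc_id, _, doc in ranked[:k]]

db = InMemoryVectorDB()
db.add("a", [1.0, 0.0], "cats")
db.add("b", [0.0, 1.0], "dogs")
db.query([0.9, 0.1], k=1)  # → [("a", "cats")]
```

Coding against this narrow interface makes it easy to swap in Chroma for prototyping and a managed option later, which is what a factory abstraction like `vector_db_manager.py` is for.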
4. Retrieval Strategies
Retrieval Methods:

```python
# Similarity search (most common)
results = vector_db.query(question_embedding, k=5)

# Hybrid search (keyword + semantic)
keyword_results = bm25.search(question, k=3)
semantic_results = vector_db.query(embedding, k=3)
results = combine_and_rank(keyword_results, semantic_results)

# Reranking (improve relevance)
retrieved = initial_retrieval(query)
reranked = rerank_by_relevance(retrieved, query)
```

**Retrieval Parameters**:
- **k** (number of results): Balance between context and relevance
- **Similarity threshold**: Filter out low-relevance results
- **Diversity**: Return varied results vs best matches
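`combine_and_rank` in the hybrid example is left abstract; one common concrete choice is reciprocal rank fusion (RRF), sketched here over ranked lists of document ids (the constant 60 is the conventional RRF damping value):

```python
def combine_and_rank(keyword_results, semantic_results, k=60):
    """Merge two ranked lists of doc ids with reciprocal rank fusion:
    each doc scores 1/(k + rank) per list it appears in."""
    scores = {}
    for results in (keyword_results, semantic_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

combine_and_rank(["d1", "d2", "d3"], ["d2", "d4"])
# → ["d2", "d1", "d4", "d3"]  ("d2" appears in both lists, so it wins)
```

RRF needs no score calibration between the keyword and semantic retrievers, which is why it is a popular default for hybrid search.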
5. Context Integration
Context Window Management:

```python
# Fit retrieved documents into context window
def prepare_context(retrieved_docs, max_tokens=3000):
    context = ""
    for doc in retrieved_docs:
        if len(tokenize(context + doc)) <= max_tokens:
            context += doc
        else:
            break
    return context
```

**Prompt Design**:

```
You are a helpful assistant. Answer the question based on the provided context.

Context:
{retrieved_documents}

Question: {user_question}

Answer:
```
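A runnable variant of `prepare_context`, using whitespace word count as a stand-in for `tokenize` (a real system would count tokens with the model's tokenizer, e.g. tiktoken):

```python
def prepare_context(retrieved_docs, max_tokens=3000):
    """Greedily pack docs until the approximate token budget is reached."""
    def count_tokens(text):
        return len(text.split())  # crude stand-in for a real tokenizer
    context = ""
    for doc in retrieved_docs:
        candidate = context + doc + "\n"
        if count_tokens(candidate) <= max_tokens:
            context = candidate
        else:
            break
    return context

prepare_context(["one two three", "four five"], max_tokens=4)
# keeps only the first doc: adding the second would exceed the 4-word budget
```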
6. Response Generation
Generation Strategies:
- Direct Generation: LLM answers from context
- Summarization: Summarize multiple retrieved docs first
- Fact-Grounding: Ensure answer cites sources
- Iterative Refinement: Refine based on user feedback
Implementation Patterns
实现模式
Pattern 1: Basic RAG
模式1: 基础RAG
Simplest RAG implementation:
- Split documents into chunks
- Generate embeddings for each chunk
- Store in vector database
- Retrieve top-k similar chunks for query
- Pass to LLM with context
Pros: Simple, fast, works well for straightforward QA
Cons: May miss relevant context, no refinement
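The five steps of the basic pattern can be tied together in one self-contained sketch. Everything here is a toy stand-in (letter-frequency "embeddings", an echo "LLM") chosen only so the control flow is runnable end to end:

```python
import math

def toy_embed(text):
    """Toy 'embedding': letter-frequency vector (stand-in for a real model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def basic_rag(question, chunks, llm, k=2):
    """Embed, store, retrieve top-k, then generate with the context."""
    index = [(toy_embed(c), c) for c in chunks]   # embed + "store"
    q = toy_embed(question)                       # query embedding
    top = sorted(index, key=lambda e: cosine(e[0], q), reverse=True)[:k]
    context = "\n".join(c for _, c in top)        # retrieved context
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

answer = basic_rag(
    "refund policy",
    ["Refunds are issued within 30 days.", "Shipping takes one week."],
    llm=lambda prompt: prompt,  # echo stand-in so the sketch stays offline
    k=1,
)
# the retrieved context contains the refunds chunk, not the shipping one
```

Swapping `toy_embed` for a real embedding model, the index for a vector database, and the echo lambda for an LLM call yields the production shape of this pipeline.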
Pattern 2: Agentic RAG
Agent controls retrieval:
- Agent receives user question
- Decides whether to retrieve documents
- Formulates retrieval query (may differ from original)
- Retrieves relevant documents
- Can iterate or use tools
- Generates final answer
Pros: Better for complex questions, iterative improvement
Cons: More complex, higher costs
Pattern 3: Corrective RAG (CRAG)
Validates retrieved documents:
- Retrieve documents for question
- Grade each document for relevance
- If poor relevance:
- Try different retrieval strategy
- Expand search scope
- Retrieve from different sources
- Generate answer from validated context
Pros: Higher quality answers, adapts to failures
Cons: More API calls, slower
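The validate-then-retry control flow of CRAG can be sketched as follows; all callables (`retrieve`, `grade`, `fallback_retrieve`, `generate`) are injected stand-ins for real components, not a specific library's API:

```python
def corrective_rag(question, retrieve, grade, fallback_retrieve, generate,
                   min_relevant=2):
    """Grade retrieved docs and widen the search when too few pass."""
    docs = retrieve(question)
    relevant = [d for d in docs if grade(question, d)]
    if len(relevant) < min_relevant:
        # poor relevance: try a broader retrieval strategy before answering
        extra = [d for d in fallback_retrieve(question)
                 if grade(question, d) and d not in relevant]
        relevant += extra
    return generate(question, relevant)

corrective_rag(
    "q",
    retrieve=lambda q: ["good-1", "bad-1"],
    grade=lambda q, d: d.startswith("good"),
    fallback_retrieve=lambda q: ["good-2", "good-1"],
    generate=lambda q, docs: docs,
)
# → ["good-1", "good-2"]
```

In practice the grader is usually a small LLM call returning a relevance label, which is where the extra API cost mentioned above comes from.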
Popular Frameworks
LangChain
```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

# Load documents
loader = PyPDFLoader("document.pdf")
docs = loader.load()

# Create RAG chain
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(docs, embeddings)
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
answer = qa.run("What is the document about?")
```
LlamaIndex
```python
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = GPTVectorStoreIndex.from_documents(documents)

# Query
response = index.as_query_engine().query("What is the main topic?")
```
undefinedCrewAI with RAG
```python
from crewai import Agent, Task, Crew
from tools import retrieval_tool

researcher = Agent(
    role="Research Assistant",
    goal="Research topics using knowledge base",
    tools=[retrieval_tool]
)

research_task = Task(
    description="Research the topic: {topic}",
    agent=researcher
)
```

Best Practices
Document Preparation
- ✓ Clean and normalize text (remove headers, footers)
- ✓ Preserve document structure when possible
- ✓ Add metadata (source, date, category)
- ✓ Handle PDFs with OCR if scanned
- ✓ Test chunk sizes for your domain
Embedding Strategy
- ✓ Use same embedding model for indexing and queries
- ✓ Fine-tune embeddings for domain-specific needs
- ✓ Normalize embeddings for consistency
- ✓ Monitor embedding quality metrics
Retrieval Optimization
- ✓ Tune k (number of results) for your use case
- ✓ Use reranking for quality improvement
- ✓ Implement relevance filtering
- ✓ Monitor retrieval precision and recall
- ✓ Cache frequently retrieved documents
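For the caching item above, `functools.lru_cache` is often enough as a first pass; `expensive_retrieve` below is a hypothetical stand-in for the embed-and-search call:

```python
from functools import lru_cache

calls = {"n": 0}

def expensive_retrieve(query):
    calls["n"] += 1  # stands in for embedding + vector search
    return [f"doc-for-{query}"]

@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> tuple:
    # lru_cache requires hashable return values, hence the tuple
    return tuple(expensive_retrieve(query))

cached_retrieve("refund policy")
cached_retrieve("refund policy")  # second call served from cache
```

The main caveat is invalidation: call `cached_retrieve.cache_clear()` whenever the index is updated, or the cache will serve stale results.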
Generation Quality
- ✓ Include source citations in answers
- ✓ Prompt LLM to indicate confidence
- ✓ Ask to cite specific documents
- ✓ Generate summaries for long contexts
- ✓ Validate answers against context
Monitoring & Evaluation
- ✓ Track retrieval metrics (precision, recall, MRR)
- ✓ Monitor answer quality and relevance
- ✓ Log failed retrievals for improvement
- ✓ Collect user feedback
- ✓ Iterate based on failures
Common Challenges & Solutions
Challenge: Irrelevant Retrieval
Solutions:
- Improve chunking strategy
- Better embedding model
- Add document metadata to queries
- Implement reranking
- Use hybrid search
Challenge: Context Too Large
Solutions:
- Reduce chunk size
- Retrieve fewer results (smaller k)
- Summarize retrieved context
- Use hierarchical retrieval
- Filter by relevance score
Challenge: Missing Information
Solutions:
- Increase k (retrieve more)
- Improve embedding model
- Better preprocessing
- Use multiple search strategies
- Add document hierarchy
Challenge: Slow Performance
Solutions:
- Use managed vector database
- Cache embeddings
- Batch process documents
- Optimize chunk size
- Use smaller embedding model for speed
Evaluation Metrics
Retrieval Metrics:
- Precision: % of retrieved docs that are relevant
- Recall: % of relevant docs that are retrieved
- MRR (Mean Reciprocal Rank): Rank of first relevant result
- NDCG (Normalized DCG): Quality of ranking
Answer Quality Metrics:
- Relevance: Does answer address the question?
- Correctness: Is the answer factually accurate?
- Grounding: Is answer supported by context?
- User Satisfaction: Would user find answer helpful?
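The retrieval metrics above are straightforward to compute; a minimal sketch over document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against judged-relevant docs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

precision_recall(["d1", "d2", "d3"], ["d1", "d4"])  # → (1/3, 1/2)
mrr([["d2", "d1"], ["d5", "d6", "d4"]], [{"d1"}, {"d4"}])  # hits at ranks 2 and 3
```

Answer-quality metrics (relevance, grounding) usually require an LLM judge or human labels rather than set arithmetic; see `rag_evaluation.py` in the utilities.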
Advanced Techniques
1. Query Expansion
```python
# Expand query with related terms
expanded_query = query + " " + synonym_expansion(query)
results = retrieve(expanded_query)
```
2. Document Compression
```python
# Compress retrieved docs before passing to LLM
compressed = compress_documents(retrieved_docs, query)
context = format_context(compressed)
```
3. Active Retrieval
```python
# Iteratively refine retrieval based on LLM output
query = user_question
iterations = 0
while iterations < max_iterations:
    results = retrieve(query)
    answer = generate_with_context(results)
    if answer_complete(answer):
        break
    query = refine_query(answer)
    iterations += 1
```
4. Multi-Modal RAG
```python
# Retrieve both text and images
text_results = text_retriever.query(question)
image_results = image_retriever.query(question)
context = combine_multimodal(text_results, image_results)
```
Resources & References
Key Papers
- "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.)
- "REALM: Retrieval-Augmented Language Model Pre-Training" (Guu et al.)
Frameworks
- LangChain: https://python.langchain.com/
- LlamaIndex: https://www.llamaindex.ai/
- Haystack: https://haystack.deepset.ai/
Vector Databases
- Pinecone: https://www.pinecone.io/
- Weaviate: https://weaviate.io/
- Qdrant: https://qdrant.tech/
Embedding Models
Next Steps
- Choose your stack: Decide on framework (LangChain, LlamaIndex, etc.)
- Prepare documents: Process and chunk your knowledge base
- Select embeddings: Choose embedding model for your domain
- Pick vector DB: Select storage solution for scale
- Build pipeline: Implement retrieval and generation
- Evaluate: Test on sample questions and iterate
- Monitor: Track quality metrics in production