rag-retrieval


RAG Retrieval

Combine vector search with LLM generation for accurate, grounded responses.

Basic RAG Pattern

python
async def rag_query(question: str, top_k: int = 5) -> str:
    """Basic RAG: retrieve then generate."""
    # 1. Retrieve relevant documents
    docs = await vector_db.search(question, limit=top_k)

    # 2. Construct context
    context = "\n\n".join([
        f"[{i+1}] {doc.text}"
        for i, doc in enumerate(docs)
    ])

    # 3. Generate with context
    response = await llm.chat([
        {"role": "system", "content":
            "Answer using ONLY the provided context. "
            "If not in context, say 'I don't have that information.'"},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return response.content
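
The numbered-context format built in step 2 can be exercised on its own. This is a minimal standalone sketch (no external dependencies; `Doc` is a stand-in for whatever record type your vector store returns) showing the exact string that reaches the prompt:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str  # stand-in for a vector-store result record

def build_context(docs: list[Doc]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i+1}] {doc.text}" for i, doc in enumerate(docs))

docs = [Doc("Paris is the capital of France."), Doc("France is in Europe.")]
print(build_context(docs))
# [1] Paris is the capital of France.
#
# [2] France is in Europe.
```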

RAG with Citations

python
async def rag_with_citations(question: str) -> dict:
    """RAG with inline citations [1], [2], etc."""
    docs = await vector_db.search(question, limit=5)

    context = "\n\n".join([
        f"[{i+1}] {doc.text}\nSource: {doc.metadata['source']}"
        for i, doc in enumerate(docs)
    ])

    response = await llm.chat([
        {"role": "system", "content":
            "Answer with inline citations like [1], [2]. "
            "End with a Sources section."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return {
        "answer": response.content,
        "sources": [doc.metadata['source'] for doc in docs]
    }
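
To confirm that the model's inline markers actually map to retrieved sources, a post-processing check helps. This helper is a hypothetical addition, not part of the pattern above: it pulls `[n]` markers out of the answer with a regex and resolves them against the returned `sources` list, dropping any marker the model invented.

```python
import re

def resolve_citations(answer: str, sources: list[str]) -> dict[int, str]:
    """Map inline [n] markers in the answer to their source strings.

    Markers pointing past the end of the sources list are dropped,
    which catches the common failure of the model inventing a citation.
    """
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {n: sources[n - 1] for n in sorted(cited) if 1 <= n <= len(sources)}

answer = "The capital is Paris [1], the seat of government since 987 [2][5]."
sources = ["geo.md", "history.md"]
print(resolve_citations(answer, sources))
# {1: 'geo.md', 2: 'history.md'}  -- the invented [5] is dropped
```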

Hybrid Search (Semantic + Keyword)

python
def reciprocal_rank_fusion(
    semantic_results: list,
    keyword_results: list,
    k: int = 60
) -> list:
    """Combine semantic and keyword search with RRF."""
    scores = {}

    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)

    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)

    # Sort by combined score
    ranked_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_doc(id) for id in ranked_ids]
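
The function above assumes a `get_doc` lookup; stripped down to bare IDs, the fusion step is easy to sanity-check. With the default k = 60, a document that places well in both result lists should outrank one that appears in only a single list:

```python
def rrf_scores(result_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion over any number of ranked ID lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return scores

semantic = ["a", "b", "c"]
keyword = ["b", "c", "d"]
scores = rrf_scores([semantic, keyword])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['b', 'c', 'a', 'd'] -- "b" appears high in both lists
```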

Context Window Management

python
def fit_context(docs: list, max_tokens: int = 6000) -> list:
    """Truncate context to fit token budget."""
    total_tokens = 0
    selected = []

    for doc in docs:
        doc_tokens = count_tokens(doc.text)
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens

    return selected
Guidelines:
  • Keep context under 75% of model limit
  • Reserve tokens for system prompt + response
  • Prioritize highest-relevance documents
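
`count_tokens` above is left to your tokenizer. For a rough self-contained sketch, a whitespace split works as a stand-in (a real budget should use the model's own tokenizer); with that assumption, the greedy cutoff behaves like this:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str

def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace words. Swap in the model's real tokenizer."""
    return len(text.split())

def fit_context(docs: list[Doc], max_tokens: int = 6000) -> list[Doc]:
    """Greedily keep docs (already relevance-ordered) until the budget is hit."""
    total, selected = 0, []
    for doc in docs:
        doc_tokens = count_tokens(doc.text)
        if total + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total += doc_tokens
    return selected

docs = [Doc("one two three"), Doc("four five"), Doc("six seven eight")]
kept = fit_context(docs, max_tokens=5)
print(len(kept))  # 2 -- the third doc (3 tokens) would exceed the 5-token budget
```

Because the loop breaks on the first document that overflows, ordering the input by relevance (as the retriever already does) means the budget is always spent on the best matches first.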

Context Sufficiency Check (2026 Best Practice)

python
from pydantic import BaseModel

class SufficiencyCheck(BaseModel):
    """Pre-generation context validation."""
    is_sufficient: bool
    confidence: float  # 0.0-1.0
    missing_info: str | None = None

async def rag_with_sufficiency(question: str, top_k: int = 5) -> str:
    """RAG with hallucination prevention via sufficiency check.

    Based on Google Research ICLR 2025: Adding a sufficiency check
    before generation reduces hallucinations from insufficient context.
    """
    docs = await vector_db.search(question, limit=top_k)
    context = "\n\n".join([f"[{i+1}] {doc.text}" for i, doc in enumerate(docs)])

    # Pre-generation sufficiency check (prevents hallucination)
    check = await llm.with_structured_output(SufficiencyCheck).ainvoke(
        f"""Does this context contain sufficient information to answer the question?

Question: {question}

Context:
{context}

Evaluate:
- is_sufficient: Can the question be fully answered from context?
- confidence: How confident are you? (0.0-1.0)
- missing_info: What's missing if not sufficient?"""
    )

    # Abstain if context insufficient (high-confidence)
    if not check.is_sufficient and check.confidence > 0.7:
        return f"I don't have enough information to answer this question. Missing: {check.missing_info}"

    # Low confidence → retrieve more context
    if not check.is_sufficient and check.confidence <= 0.7:
        more_docs = await vector_db.search(question, limit=top_k * 2)
        context = "\n\n".join([f"[{i+1}] {doc.text}" for i, doc in enumerate(more_docs)])

    # Generate only with sufficient context
    response = await llm.chat([
        {"role": "system", "content":
            "Answer using ONLY the provided context. "
            "If information is missing, say so rather than guessing."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return response.content
Why this matters (Google Research 2025):
  • RAG paradoxically increases hallucinations when context is insufficient
  • Additional context increases model confidence → more likely to hallucinate
  • Sufficiency check allows abstention when information is missing

Key Decisions

Decision         Recommendation
Top-k            3-10 documents
Temperature      0.1-0.3 (factual)
Context budget   4K-8K tokens
Hybrid ratio     50/50 semantic/keyword

Common Mistakes

  • No citation tracking (unverifiable answers)
  • Context too large (dilutes relevance)
  • Temperature too high (hallucinations)
  • Single retrieval method (misses keyword matches)

Advanced Patterns

See references/advanced-rag.md for:
  • HyDE Integration: Hypothetical document embeddings for vocabulary mismatch
  • Agentic RAG: Multi-step retrieval with tool use
  • Self-RAG: LLM decides when to retrieve and validates outputs
  • Corrective RAG: Evaluate retrieval quality and correct if needed
  • Pipeline Composition: Combine HyDE + Hybrid + Rerank

Related Skills

  • embeddings
    - Creating vectors for retrieval
  • hyde-retrieval
    - Hypothetical document embeddings
  • query-decomposition
    - Multi-concept query handling
  • reranking-patterns
    - Cross-encoder and LLM reranking
  • contextual-retrieval
    - Anthropic's context-prepending technique
  • langgraph-functional
    - Building agentic RAG workflows

Capability Details

retrieval-patterns

Keywords: retrieval, context, chunks, relevance
Solves:
  • Retrieve relevant context for LLM
  • Implement RAG pipeline
  • Optimize retrieval quality

hybrid-search

Keywords: hybrid, bm25, vector, fusion
Solves:
  • Combine keyword and semantic search
  • Implement reciprocal rank fusion
  • Balance precision and recall

chatbot-example

Keywords: chatbot, rag, example, typescript Solves:
  • Build RAG chatbot example
  • TypeScript implementation
  • End-to-end RAG pipeline

pipeline-template

Keywords: pipeline, template, implementation, starter
Solves:
  • RAG pipeline starter template
  • Production-ready code
  • Copy-paste implementation