rag-retrieval


RAG Retrieval

Combine vector search with LLM generation for accurate, grounded responses.

Basic RAG Pattern

python
async def rag_query(question: str, top_k: int = 5) -> str:
    """Basic RAG: retrieve then generate."""
    # 1. Retrieve relevant documents
    docs = await vector_db.search(question, limit=top_k)

    # 2. Construct context
    context = "\n\n".join([
        f"[{i+1}] {doc.text}"
        for i, doc in enumerate(docs)
    ])

    # 3. Generate with context
    response = await llm.chat([
        {"role": "system", "content":
            "Answer using ONLY the provided context. "
            "If not in context, say 'I don't have that information.'"},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return response.content
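
The numbered-context format built in step 2 can be exercised on its own. This is a minimal standalone sketch (no external dependencies; `Doc` is a stand-in for whatever record type your vector store returns) showing the exact string that reaches the prompt:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str  # stand-in for a vector-store result record

def build_context(docs: list[Doc]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i+1}] {doc.text}" for i, doc in enumerate(docs))

docs = [Doc("Paris is the capital of France."), Doc("France is in Europe.")]
print(build_context(docs))
# [1] Paris is the capital of France.
#
# [2] France is in Europe.
```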

RAG with Citations

python
async def rag_with_citations(question: str) -> dict:
    """RAG with inline citations [1], [2], etc."""
    docs = await vector_db.search(question, limit=5)

    context = "\n\n".join([
        f"[{i+1}] {doc.text}\nSource: {doc.metadata['source']}"
        for i, doc in enumerate(docs)
    ])

    response = await llm.chat([
        {"role": "system", "content":
            "Answer with inline citations like [1], [2]. "
            "End with a Sources section."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return {
        "answer": response.content,
        "sources": [doc.metadata['source'] for doc in docs]
    }
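
To confirm that the model's inline markers actually map to retrieved sources, a post-processing check helps. This helper is a hypothetical addition, not part of the pattern above: it pulls `[n]` markers out of the answer with a regex and resolves them against the returned `sources` list, dropping any marker the model invented.

```python
import re

def resolve_citations(answer: str, sources: list[str]) -> dict[int, str]:
    """Map inline [n] markers in the answer to their source strings.

    Markers pointing past the end of the sources list are dropped,
    which catches the common failure of the model inventing a citation.
    """
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {n: sources[n - 1] for n in sorted(cited) if 1 <= n <= len(sources)}

answer = "The capital is Paris [1], the seat of government since 987 [2][5]."
sources = ["geo.md", "history.md"]
print(resolve_citations(answer, sources))
# {1: 'geo.md', 2: 'history.md'}  -- the invented [5] is dropped
```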

Hybrid Search (Semantic + Keyword)

python
def reciprocal_rank_fusion(
    semantic_results: list,
    keyword_results: list,
    k: int = 60
) -> list:
    """Combine semantic and keyword search with RRF."""
    scores = {}

    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)

    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)

    # Sort by combined score
    ranked_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_doc(id) for id in ranked_ids]
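
The function above assumes a `get_doc` lookup; stripped down to bare IDs, the fusion step is easy to sanity-check. With the default k = 60, a document that places well in both result lists should outrank one that appears in only a single list:

```python
def rrf_scores(result_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion over any number of ranked ID lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return scores

semantic = ["a", "b", "c"]
keyword = ["b", "c", "d"]
scores = rrf_scores([semantic, keyword])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['b', 'c', 'a', 'd'] -- "b" appears high in both lists
```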

Context Window Management

python
def fit_context(docs: list, max_tokens: int = 6000) -> list:
    """Truncate context to fit token budget."""
    total_tokens = 0
    selected = []

    for doc in docs:
        doc_tokens = count_tokens(doc.text)
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens

    return selected
Guidelines:
  • Keep context under 75% of model limit
  • Reserve tokens for system prompt + response
  • Prioritize highest-relevance documents
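
`count_tokens` above is left to your tokenizer. For a rough self-contained sketch, a whitespace split works as a stand-in (a real budget should use the model's own tokenizer); with that assumption, the greedy cutoff behaves like this:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str

def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace words. Swap in the model's real tokenizer."""
    return len(text.split())

def fit_context(docs: list[Doc], max_tokens: int = 6000) -> list[Doc]:
    """Greedily keep docs (already relevance-ordered) until the budget is hit."""
    total, selected = 0, []
    for doc in docs:
        doc_tokens = count_tokens(doc.text)
        if total + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total += doc_tokens
    return selected

docs = [Doc("one two three"), Doc("four five"), Doc("six seven eight")]
kept = fit_context(docs, max_tokens=5)
print(len(kept))  # 2 -- the third doc (3 tokens) would exceed the 5-token budget
```

Because the loop breaks on the first document that overflows, ordering the input by relevance (as the retriever already does) means the budget is always spent on the best matches first.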

Context Sufficiency Check (2026 Best Practice)

python
from pydantic import BaseModel

class SufficiencyCheck(BaseModel):
    """Pre-generation context validation."""
    is_sufficient: bool
    confidence: float  # 0.0-1.0
    missing_info: str | None = None

async def rag_with_sufficiency(question: str, top_k: int = 5) -> str:
    """RAG with hallucination prevention via sufficiency check.

    Based on Google Research ICLR 2025: Adding a sufficiency check
    before generation reduces hallucinations from insufficient context.
    """
    docs = await vector_db.search(question, limit=top_k)
    context = "\n\n".join([f"[{i+1}] {doc.text}" for i, doc in enumerate(docs)])

    # Pre-generation sufficiency check (prevents hallucination)
    check = await llm.with_structured_output(SufficiencyCheck).ainvoke(
        f"""Does this context contain sufficient information to answer the question?

Question: {question}

Context:
{context}

Evaluate:
- is_sufficient: Can the question be fully answered from context?
- confidence: How confident are you? (0.0-1.0)
- missing_info: What's missing if not sufficient?"""
    )

    # Abstain if context insufficient (high-confidence)
    if not check.is_sufficient and check.confidence > 0.7:
        return f"I don't have enough information to answer this question. Missing: {check.missing_info}"

    # Low confidence → retrieve more context
    if not check.is_sufficient and check.confidence <= 0.7:
        more_docs = await vector_db.search(question, limit=top_k * 2)
        context = "\n\n".join([f"[{i+1}] {doc.text}" for i, doc in enumerate(more_docs)])

    # Generate only with sufficient context
    response = await llm.chat([
        {"role": "system", "content":
            "Answer using ONLY the provided context. "
            "If information is missing, say so rather than guessing."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return response.content
Why this matters (Google Research 2025):
  • RAG paradoxically increases hallucinations when context is insufficient
  • Additional context increases model confidence → more likely to hallucinate
  • Sufficiency check allows abstention when information is missing

Key Decisions

Decision         Recommendation
Top-k            3-10 documents
Temperature      0.1-0.3 (factual)
Context budget   4K-8K tokens
Hybrid ratio     50/50 semantic/keyword

Common Mistakes

  • No citation tracking (unverifiable answers)
  • Context too large (dilutes relevance)
  • Temperature too high (hallucinations)
  • Single retrieval method (misses keyword matches)

Advanced Patterns

See references/advanced-rag.md for:
  • HyDE Integration: Hypothetical document embeddings for vocabulary mismatch
  • Agentic RAG: Multi-step retrieval with tool use
  • Self-RAG: LLM decides when to retrieve and validates outputs
  • Corrective RAG: Evaluate retrieval quality and correct if needed
  • Pipeline Composition: Combine HyDE + Hybrid + Rerank

Related Skills

  • embeddings
    - Creating vectors for retrieval
  • hyde-retrieval
    - Hypothetical document embeddings
  • query-decomposition
    - Multi-concept query handling
  • reranking-patterns
    - Cross-encoder and LLM reranking
  • contextual-retrieval
    - Anthropic's context-prepending technique
  • langgraph-functional
    - Building agentic RAG workflows

Capability Details

retrieval-patterns

Keywords: retrieval, context, chunks, relevance
Solves:
  • Retrieve relevant context for LLM
  • Implement RAG pipeline
  • Optimize retrieval quality

hybrid-search

Keywords: hybrid, bm25, vector, fusion
Solves:
  • Combine keyword and semantic search
  • Implement reciprocal rank fusion
  • Balance precision and recall

chatbot-example

Keywords: chatbot, rag, example, typescript Solves:
  • Build RAG chatbot example
  • TypeScript implementation
  • End-to-end RAG pipeline

pipeline-template

Keywords: pipeline, template, implementation, starter
Solves:
  • RAG pipeline starter template
  • Production-ready code
  • Copy-paste implementation