
Evaluate RAG

Overview

  1. Do error analysis on end-to-end traces first. Determine whether failures come from retrieval, generation, or both.
  2. Build a retrieval evaluation dataset: queries paired with relevant document chunks.
  3. Measure retrieval quality with Recall@k (most important for first-pass retrieval).
  4. Evaluate generation separately: faithfulness (grounded in context?) and relevance (answers the query?).
  5. If retrieval is the bottleneck, optimize chunking via grid search before tuning generation.

Prerequisites

Complete error analysis on RAG pipeline traces before selecting metrics. Inspect what was retrieved vs. what the model needed. Determine whether the problem is retrieval, generation, or both. Fix retrieval first.

Core Instructions

Evaluate Retrieval and Generation Separately

Measure each component independently. Use the appropriate metric for each retrieval stage:
  • First-pass retrieval: Optimize for Recall@k. Include all relevant documents, even at the cost of noise.
  • Reranking: Optimize for Precision@k, MRR, or NDCG@k. Rank the most relevant documents first.

Building a Retrieval Evaluation Dataset

You need queries paired with ground-truth relevant document chunks.
Manual curation (highest quality): Write realistic questions and map each to the exact chunk(s) containing the answer.
Synthetic QA generation (scalable): For each document chunk, prompt an LLM to extract a fact and generate a question answerable only from that fact.
Synthetic QA prompt template:
Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.

Return output in JSON format:
{ "fact": "...", "question": "..." }

Chunk: "{text_chunk}"
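A minimal sketch of this generation loop, assuming a `call_llm` callable (any function that takes a prompt string and returns the model's raw text; it is a placeholder, not part of the original):

```python
import json

QA_PROMPT = """Given a chunk of text, extract a specific, self-contained fact from it.
Then write a question that is directly and unambiguously answered
by that fact alone.

Return output in JSON format:
{{ "fact": "...", "question": "..." }}

Chunk: "{text_chunk}"
"""

def generate_synthetic_qa(chunks, call_llm):
    """Build (question, fact, relevant_chunk) records from document chunks.

    `chunks` maps chunk id -> text; `call_llm` takes a prompt string and
    returns the model's raw text response.
    """
    dataset = []
    for chunk_id, text in chunks.items():
        response = call_llm(QA_PROMPT.format(text_chunk=text))
        try:
            qa = json.loads(response)
        except json.JSONDecodeError:
            continue  # skip malformed generations
        if "question" in qa and "fact" in qa:
            dataset.append({"question": qa["question"],
                            "fact": qa["fact"],
                            "relevant_chunk": chunk_id})
    return dataset
```

Each record keeps the source chunk id, which becomes the ground-truth label for the retrieval metrics below.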
Adversarial question generation: Create harder queries that resemble content in multiple chunks but are only answered by one.
Process:
  1. Select target chunk A containing a clear fact.
  2. Find similar chunks B, C using embedding search (chunks that share terminology but lack the answer).
  3. Prompt the LLM to write a question using terminology from B and C that only chunk A answers.
Example:
  • Chunk A: "In April 2020, the company reported a 17% drop in quarterly revenue, its largest decline since 2008."
  • Chunk B: "The company experienced significant losses in 2008 during the financial crisis."
  • Generated question: "When did the company experience its largest revenue decline since the 2008 financial crisis?"
Only chunk A contains the answer. Chunk B is a plausible distractor.
Filtering synthetic questions: Rate synthetic queries for realism using few-shot LLM scoring. Keep only those rated realistic (4-5 on a 1-5 scale). Likert scoring is appropriate here, since the goal is fuzzy ranking for dataset curation, not measuring failure rates.

Retrieval Metrics

Recall@k: Fraction of relevant documents found in the top k results.
Recall@k = (relevant docs in top k) / (total relevant docs for query)
Prioritize recall for first-pass retrieval. LLMs can ignore irrelevant content but cannot generate from missing content.
Precision@k: Fraction of top k results that are relevant.
Precision@k = (relevant docs in top k) / k
Use for reranking evaluation.
Mean Reciprocal Rank (MRR): How early the first relevant document appears.
MRR = (1/N) * sum(1/rank_of_first_relevant_doc)
Best for single-fact lookups where only one key chunk is needed.
NDCG@k (Normalized Discounted Cumulative Gain): For graded relevance where documents have varying utility. Rewards placing more relevant items higher.
DCG@k  = sum over i=1..k of: rel_i / log2(i+1)
IDCG@k = DCG@k with documents sorted by decreasing relevance
NDCG@k = DCG@k / IDCG@k
Caveat: Optimal ranking of weakly relevant documents can outscore a highly relevant document ranked lower. Supplement with Recall@k.
Choosing k: k varies by query type. A factual lookup uses k=1-2. A synthesis query ("summarize market trends") uses k=5-10.
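The four metrics above can be implemented directly from their definitions; a minimal sketch, treating doc ids as hashable values and graded relevance as a dict of scores:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    rel = set(relevant)
    return sum(1 for d in retrieved[:k] if d in rel) / k

def mrr(queries):
    """Mean reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant),
                    None)
        if rank is not None:
            total += 1.0 / rank
    return total / len(queries)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k with graded relevance: `relevance` maps doc id -> score."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Note `log2(i + 2)` rather than `log2(i + 1)`: Python enumerates from 0, while the formula's rank i starts at 1.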

Metric Selection

Query Type             | Primary Metric
Single-fact lookups    | MRR
Broad coverage needed  | Recall@k
Ranked quality matters | NDCG@k or Precision@k
Multi-hop reasoning    | Two-hop Recall@k

Evaluating and Optimizing Chunking

Treat chunking as a tunable hyperparameter. Even with the same retriever, metrics vary based on chunking alone.
Grid search for fixed-size chunking: Test combinations of chunk size and overlap. Re-index the corpus for each configuration. Measure retrieval metrics on your evaluation dataset.
Example search grid:
Chunk size | Overlap | Recall@5 | NDCG@5
128 tokens | 0       | 0.82     | 0.69
128 tokens | 64      | 0.88     | 0.75
256 tokens | 0       | 0.86     | 0.74
256 tokens | 128     | 0.89     | 0.77
512 tokens | 0       | 0.80     | 0.72
512 tokens | 256     | 0.83     | 0.74
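The grid search can be sketched as a loop over configurations; `build_index` and `retrieve` are placeholders for your own indexing and retrieval stack (assumptions, not part of the original). Ground-truth chunks must be re-derived per configuration, since chunk ids change every time the corpus is re-chunked; here each query's gold fact is located by substring match:

```python
def grid_search_chunking(corpus, eval_set, build_index, retrieve, k=5):
    """Sweep chunk size and overlap, reporting Recall@k per configuration.

    `build_index(corpus, size, overlap)` re-chunks and indexes the corpus,
    returning (index, chunks) where `chunks` maps chunk id -> text;
    `retrieve(index, query, k)` returns ranked chunk ids.
    `eval_set` is a list of (query, gold_fact) pairs.
    """
    results = {}
    for size, overlap in [(128, 0), (128, 64), (256, 0),
                          (256, 128), (512, 0), (512, 256)]:
        index, chunks = build_index(corpus, size, overlap)
        recalls = []
        for query, gold_fact in eval_set:
            # Re-derive ground truth for this chunking configuration.
            relevant = {cid for cid, text in chunks.items()
                        if gold_fact in text}
            if not relevant:
                continue  # fact was split across chunks at this size; skip
            top_k = set(retrieve(index, query, k))
            recalls.append(len(top_k & relevant) / len(relevant))
        results[(size, overlap)] = sum(recalls) / len(recalls) if recalls else 0.0
    return results
```

The best configuration is then simply `max(results, key=results.get)`.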
Content-aware chunking: When fixed-size chunks split related information:
  • Use natural document boundaries (sections, paragraphs, steps).
  • Augment chunks with context: prepend document title and section headings to each chunk before embedding.
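The augmentation step can be as simple as string concatenation before embedding; a sketch (the `>` separator is an arbitrary choice):

```python
def contextualize_chunk(doc_title, section_heading, chunk_text):
    """Prepend document title and section heading so the embedding
    carries context the raw chunk text lacks."""
    return f"{doc_title} > {section_heading}\n\n{chunk_text}"
```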

Evaluating Generation Quality

After confirming retrieval works, evaluate what the LLM does with the retrieved context along two dimensions:
Answer faithfulness: Does the output accurately reflect the retrieved context? Check for:
  • Hallucinations: Information absent from source documents. In RAG, even correct facts from the LLM's own knowledge count as hallucinations.
  • Omissions: Relevant information from the context ignored in the output.
  • Misinterpretations: Context information represented inaccurately.
Answer relevance: Does the output address the original query? An answer can be faithful to the context but fail to answer what the user asked.
Use error analysis to discover specific manifestations in your pipeline. Identify what kind of information gets hallucinated and which constraints get omitted.
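A binary faithfulness evaluator in the style the anti-patterns section recommends might look like the sketch below; the prompt wording and the `call_llm` callable are illustrative assumptions, and the check should be specialized to the failure modes your error analysis surfaces:

```python
FAITHFULNESS_PROMPT = """You are grading a RAG answer.

Context:
{context}

Answer:
{answer}

Does the answer contain any claim not supported by the context?
Reply with exactly one word: PASS if every claim is grounded in the
context, FAIL otherwise."""

def judge_faithfulness(context, answer, call_llm):
    """Binary faithfulness check. `call_llm` takes a prompt string and
    returns the model's raw text response."""
    verdict = call_llm(FAITHFULNESS_PROMPT.format(context=context,
                                                  answer=answer))
    return verdict.strip().upper().startswith("PASS")
```

An analogous PASS/FAIL judge for answer relevance would swap the context for the original query.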

Diagnosing Failures by Metric Pattern

Context Relevance | Faithfulness | Answer Relevance | Diagnosis
High              | High         | Low              | Generator attended to wrong section of a correct document
High              | Low          | --               | Hallucination or misinterpretation of retrieved content
Low               | --           | --               | Retrieval problem. Fix chunking, embeddings, or query preprocessing

Multi-Hop Retrieval Evaluation

For queries requiring information from multiple chunks:
Two-hop Recall@k: Fraction of 2-hop queries where both ground-truth chunks appear in the top k results.
TwoHopRecall@k = (1/N) * sum over queries of [1 if {Chunk1, Chunk2} ⊆ top_k_results else 0]
Diagnose failures by classifying: hop 1 miss, hop 2 miss, or rank-out-of-top-k.
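Both the metric and the failure classification reduce to a few lines; `queries` pairs each ranked result list with the query's two gold chunk ids:

```python
def two_hop_recall_at_k(queries, k):
    """Fraction of 2-hop queries whose gold chunks BOTH appear in the
    top-k results. `queries` is a list of (retrieved_ids, (gold1, gold2))."""
    hits = sum(1 for retrieved, gold in queries
               if set(gold) <= set(retrieved[:k]))
    return hits / len(queries)

def classify_two_hop_failure(full_ranking, gold, k):
    """Label a query as 'hit', 'hop 1 miss', 'hop 2 miss', or
    'rank-out-of-top-k' (both chunks retrieved, but not both in top k)."""
    if set(gold) <= set(full_ranking[:k]):
        return "hit"
    for hop, chunk in enumerate(gold, start=1):
        if chunk not in full_ranking:
            return f"hop {hop} miss"
    return "rank-out-of-top-k"
```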

Anti-Patterns

  • Using a single end-to-end correctness metric without separating retrieval and generation measurement.
  • Jumping directly to metrics without reading traces first.
  • Overfitting to synthetic evaluation data. Validate against real user queries regularly.
  • Using similarity metrics (ROUGE, BERTScore, cosine similarity) as primary generation evaluation. Use binary evaluators driven by error analysis.
  • Evaluating generation without checking context grounding.