rag-auditor
RAG Auditor
Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs
evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates
generation quality (groundedness, completeness, hallucination rate), diagnoses component-level
failures, and recommends targeted improvements.
Reference Files
| Contents | Load When |
|---|---|
| Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |
Prerequisites
- Access to the RAG pipeline (or its outputs for post-hoc evaluation)
- A set of test queries with known-correct answers
- Understanding of the pipeline components (embedding model, retriever, generator)
Workflow
Phase 1: Pipeline Inventory
Document the RAG pipeline configuration:
- Document source — What documents are indexed? Format, count, size.
- Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
- Embedding — Model name and version, dimensionality.
- Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
- Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
- Generation — Model, prompt template, context window usage.
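The inventory above can be kept as a structured record so that later phases can reference the configuration programmatically. A minimal sketch; every field value here is a hypothetical placeholder, not a recommended setting:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    # Phase 1 inventory fields; extend as the pipeline under audit requires.
    doc_count: int
    doc_format: str
    chunk_strategy: str    # "fixed-size", "semantic", or "paragraph"
    chunk_size: int        # tokens per chunk
    chunk_overlap: float   # fraction of overlap between adjacent chunks
    embedding_model: str
    embedding_dim: int
    vector_store: str      # e.g. "FAISS", "Chroma", "pgvector"
    retrieval_method: str  # "similarity", "hybrid", or "reranking"
    top_k: int
    generator_model: str

# Illustrative values only.
config = PipelineConfig(
    doc_count=1200, doc_format="PDF",
    chunk_strategy="fixed-size", chunk_size=512, chunk_overlap=0.1,
    embedding_model="example-embed-v1", embedding_dim=768,
    vector_store="FAISS", retrieval_method="similarity", top_k=5,
    generator_model="example-llm",
)
```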
Phase 2: Design Evaluation Queries
Create a diverse set of test queries:
| Query Type | Purpose | Count |
|---|---|---|
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |
For each query, document the expected answer and the source chunk(s).
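The query set can live as plain structured data, one entry per test query with its type, expected answer, and source chunks. The queries, answers, and chunk IDs below are invented for illustration:

```python
# "expected" is None for unanswerable queries, which should trigger abstention.
eval_queries = [
    {"type": "known-answer",
     "query": "What year was the warranty policy last updated?",
     "expected": "2023",
     "source_chunks": ["policy.pdf#chunk-12"]},
    {"type": "multi-hop",
     "query": "Which region led sales in the quarter the new policy took effect?",
     "expected": "EMEA",
     "source_chunks": ["policy.pdf#chunk-12", "sales.csv#chunk-3"]},
    {"type": "unanswerable",
     "query": "What is the CEO's favorite color?",
     "expected": None,
     "source_chunks": []},
]

# Tally queries per type to check coverage against the targets in the table above.
counts = {}
for q in eval_queries:
    counts[q["type"]] = counts.get(q["type"], 0) + 1
```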
Phase 3: Evaluate Retrieval
For each test query, measure:
- Precision@K — Of the K retrieved chunks, how many are relevant?
- Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
- MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
- Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
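All three metrics can be computed directly from the ranked list of retrieved chunk IDs and the per-query set of relevant chunk IDs. A self-contained sketch (chunk IDs are made up):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in relevant if c in top_k) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank over a set of queries (ranks are 1-based)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk in enumerate(retrieved, start=1):
            if chunk in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Example: one query where the first relevant chunk is ranked 2nd.
retrieved = ["c7", "c3", "c9", "c1", "c4"]
relevant = {"c3", "c1"}
p5 = precision_at_k(retrieved, relevant, 5)  # 2/5 = 0.4
r5 = recall_at_k(retrieved, relevant, 5)     # 2/2 = 1.0
rr = mrr([retrieved], [relevant])            # 1/2 = 0.5
```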
Phase 4: Evaluate Generation
For each test query with retrieved context:
- Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
- Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
- Hallucination detection — Identify specific claims not supported by context.
- Abstention — For unanswerable queries, does the model correctly say "I don't know"?
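In practice groundedness is scored by an LLM judge or an NLI model; the sketch below uses a crude lexical-overlap proxy purely to show the shape of the scoring (supported-claim ratio plus a list of flagged claims). The threshold and the example text are arbitrary:

```python
import re

def _content_words(text):
    # Lowercased alphabetic tokens longer than 3 characters.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3]

def naive_groundedness(claims, context, threshold=0.6):
    """Count a claim as supported if at least `threshold` of its content
    words appear in the retrieved context. A lexical stand-in for a real
    judge model, for illustration only."""
    context_words = set(_content_words(context))
    supported, unsupported = 0, []
    for claim in claims:
        words = _content_words(claim)
        if not words:
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap >= threshold:
            supported += 1
        else:
            unsupported.append(claim)
    score = supported / len(claims) if claims else 0.0
    return score, unsupported

context = "The warranty covers parts and labor for two years from purchase."
claims = ["The warranty covers parts and labor.",
          "Coverage lasts five years from purchase."]
score, flagged = naive_groundedness(claims, context)
# score = 0.5; the second claim is flagged as a potential hallucination.
```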
Phase 5: Diagnose Failures
For every incorrect or low-quality response, classify the root cause:
| Failure Type | Diagnosis | Indicator |
|---|---|---|
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
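The indicator column lends itself to a rule-of-thumb triage function for a single failed query. The 0.5 thresholds are illustrative, not canonical, and chunk-boundary and embedding-mismatch diagnoses still need manual inspection of the retrieved chunks:

```python
def classify_failure(recall, mrr_score, groundedness, answered, context_sufficient):
    """Map per-query metrics to a first-pass failure label, following the
    diagnosis table. Returns one label; real failures can overlap."""
    if not answered and context_sufficient:
        return "over-abstention"
    if recall < 0.5:
        return "retrieval failure"      # relevant chunks never retrieved
    if mrr_score < 0.5:
        return "ranking failure"        # retrieved, but buried low in the list
    if groundedness < 0.5:
        return "hallucination"          # claims not traceable to context
    return "generation failure"         # context fine, answer still wrong
```

For example, a query with high recall but a low reciprocal rank triages as a ranking failure, pointing at the reranking stage rather than the index.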
Phase 6: Recommendations
Based on failure analysis, recommend specific improvements:
| Failure Pattern | Recommendation |
|---|---|
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
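Once failures are labeled, the table above can drive the priority ordering for the final report: rank each recommendation by how many observed failures it addresses. A sketch using hypothetical labels:

```python
from collections import Counter

RECOMMENDATIONS = {
    "chunk boundary": "Increase overlap, try semantic chunking",
    "low precision": "Reduce K, add reranking stage",
    "low recall": "Increase K, try hybrid search",
    "embedding mismatch": "Try different embedding model, add query expansion",
    "hallucination": "Strengthen grounding instruction in prompt, reduce temperature",
    "over-abstention": "Soften abstention criteria in prompt",
}

def prioritize(failure_labels):
    """Return (recommendation, failure_count) pairs, most impactful first."""
    counts = Counter(failure_labels)
    return [(RECOMMENDATIONS[label], n)
            for label, n in counts.most_common()
            if label in RECOMMENDATIONS]

priorities = prioritize(["low recall", "low recall", "hallucination"])
```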
Output Format
RAG Audit Report
Pipeline Configuration
| Component | Value |
|---|---|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |
Evaluation Dataset
- Total queries: {N}
- Known-answer: {N}
- Multi-hop: {N}
- Unanswerable: {N}
Retrieval Quality
| Metric | Score | Target | Status |
|---|---|---|---|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |
Generation Quality
| Metric | Score | Target | Status |
|---|---|---|---|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |
Failure Analysis
| # | Query | Failure Type | Root Cause | Recommendation |
|---|---|---|---|---|
| 1 | {query} | {type} | {cause} | {fix} |
Recommendations (Priority Order)
- {Recommendation} — addresses {N} failures, expected impact: {description}
- {Recommendation} — addresses {N} failures, expected impact: {description}
Sample Failures
Query: "{query}"
- Expected: {answer}
- Retrieved chunks: {chunk summaries with relevance scores}
- Generated: {response}
- Issue: {diagnosis}
Calibration Rules
- Component isolation. Evaluate retrieval and generation independently. A great retriever paired with a bad generator looks like a retrieval failure if you only check the end output.
- Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
- Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
- Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.
Error Handling
| Problem | Resolution |
|---|---|
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |
When NOT to Audit
Push back if:
- The pipeline hasn't been built yet — design it first, audit after
- The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
- The user wants to compare embedding models — that's a benchmark task, not an audit