rag-auditor


RAG Auditor


Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.

Reference Files


| File | Contents | Load When |
| --- | --- | --- |
| references/retrieval-metrics.md | Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| references/generation-metrics.md | Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| references/failure-taxonomy.md | RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| references/diagnostic-queries.md | Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |

Prerequisites


  • Access to the RAG pipeline (or its outputs for post-hoc evaluation)
  • A set of test queries with known-correct answers
  • Understanding of the pipeline components (embedding model, retriever, generator)

Workflow


Phase 1: Pipeline Inventory


Document the RAG pipeline configuration:
  1. Document source — What documents are indexed? Format, count, size.
  2. Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
  3. Embedding — Model name and version, dimensionality.
  4. Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
  5. Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
  6. Generation — Model, prompt template, context window usage.
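The six inventory items above can be captured as a structured record so they travel with the audit report. A minimal sketch, assuming free-text fields are acceptable; all values shown are illustrative, not from a real pipeline:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    documents: str     # what is indexed: format, count, size
    chunking: str      # strategy, chunk size, overlap
    embedding: str     # model name/version, dimensionality
    vector_store: str  # FAISS, Pinecone, Chroma, pgvector, ...
    retrieval: str     # similarity / hybrid / reranking, top-K
    generation: str    # model, prompt template, context usage

# Example inventory (hypothetical values for illustration)
config = PipelineConfig(
    documents="350 Markdown files, ~2 KB each",
    chunking="fixed-size, 512 tokens, 15% overlap",
    embedding="example-embed-v2 (768d)",
    vector_store="FAISS (flat index)",
    retrieval="similarity, K=5",
    generation="example-llm, temperature=0.2",
)
```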

Phase 2: Design Evaluation Queries


Create a diverse set of test queries:
| Query Type | Purpose | Count |
| --- | --- | --- |
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |
For each query, document the expected answer and the source chunk(s).
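One possible shape for the evaluation set: each entry records the query, its type, the expected answer, and the source chunk IDs a perfect retriever should surface. The queries, chunk IDs, and answers below are invented examples, not prescribed names:

```python
# Hypothetical evaluation set illustrating the Phase 2 record shape.
eval_queries = [
    {
        "query": "What year was the warranty policy last updated?",
        "type": "known-answer",
        "expected": "2023",
        "source_chunks": ["policies.md#chunk-12"],
    },
    {
        "query": "How does the refund window differ for EU and US customers?",
        "type": "multi-hop",  # answer spans two chunks
        "expected": "EU: 30 days; US: 14 days",
        "source_chunks": ["refunds-eu.md#chunk-3", "refunds-us.md#chunk-7"],
    },
    {
        "query": "What is the CEO's favorite color?",
        "type": "unanswerable",
        "expected": None,  # the model should abstain
        "source_chunks": [],
    },
]
```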

Phase 3: Evaluate Retrieval


For each test query, measure:
  1. Precision@K — Of the K retrieved chunks, how many are relevant?
  2. Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
  3. MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
  4. Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
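The first three metrics can be computed directly once each query's relevant chunk IDs are known. A minimal sketch, assuming `retrieved` is the ranked list of chunk IDs returned by the pipeline and `relevant` is the set of ground-truth chunk IDs (names are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-K retrieved chunks, what fraction is relevant?"""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Of all relevant chunks, what fraction appears in the top K?"""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant chunk; 0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```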

Phase 4: Evaluate Generation


For each test query with retrieved context:
  1. Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
  2. Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
  3. Hallucination detection — Identify specific claims not supported by context.
  4. Abstention — For unanswerable queries, does the model correctly say "I don't know"?
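Groundedness scoring presupposes that each response has been decomposed into atomic claims and each claim judged against the retrieved context (the judging itself is typically done by a human or an LLM judge). A sketch of the aggregation step only, with `ClaimJudgment` as an illustrative name:

```python
from dataclasses import dataclass

@dataclass
class ClaimJudgment:
    claim: str
    supported: bool  # True if the retrieved context backs this claim

def groundedness(judgments):
    """Fraction of claims supported by context: 0 (hallucinated) to 1."""
    if not judgments:
        return 1.0  # an empty response makes no unsupported claims
    return sum(j.supported for j in judgments) / len(judgments)

def hallucinated_claims(judgments):
    """The specific claims not supported by the retrieved context."""
    return [j.claim for j in judgments if not j.supported]
```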

Phase 5: Diagnose Failures


For every incorrect or low-quality response, classify the root cause:
| Failure Type | Diagnosis | Indicator |
| --- | --- | --- |
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
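The metric signatures in the taxonomy lend themselves to a first-pass rule-based triage before manual review. A minimal sketch; the 0.5 thresholds are illustrative placeholders, not recommended values, and real triage should still sample individual failures:

```python
def classify_failure(recall_at_k, mrr, groundedness,
                     abstained, context_sufficient):
    """First-pass failure triage for one failed query."""
    if abstained and context_sufficient:
        return "over-abstention"
    if recall_at_k < 0.5:
        return "retrieval failure"   # relevant chunks never surfaced
    if mrr < 0.5:
        return "ranking failure"     # found, but buried low in the list
    if groundedness < 0.5:
        return "generation failure or hallucination"
    return "needs manual review"     # metrics alone are inconclusive
```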

Phase 6: Recommendations


Based on failure analysis, recommend specific improvements:
| Failure Pattern | Recommendation |
| --- | --- |
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
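Priorities can be derived mechanically by counting diagnosed failures per pattern and emitting the matching fix, most frequent pattern first. A sketch, assuming Phase 5 produced one pattern label per failed query; the pattern-to-fix map mirrors the table above:

```python
from collections import Counter

FIXES = {
    "chunk boundary issue": "increase overlap, try semantic chunking",
    "low precision@k": "reduce K, add reranking stage",
    "low recall@k": "increase K, try hybrid search",
    "embedding mismatch": "try a different embedding model, add query expansion",
    "hallucination": "strengthen grounding instruction, reduce temperature",
    "over-abstention": "soften abstention criteria in prompt",
}

def prioritize(diagnosed_failures):
    """Return (pattern, count, fix) tuples, most frequent pattern first."""
    counts = Counter(diagnosed_failures)
    return [(pattern, n, FIXES.get(pattern, "manual review"))
            for pattern, n in counts.most_common()]
```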

Output Format



RAG Audit Report


Pipeline Configuration


| Component | Value |
| --- | --- |
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |

Evaluation Dataset


  • Total queries: {N}
  • Known-answer: {N}
  • Multi-hop: {N}
  • Unanswerable: {N}

Retrieval Quality


| Metric | Score | Target | Status |
| --- | --- | --- | --- |
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |

Generation Quality


| Metric | Score | Target | Status |
| --- | --- | --- | --- |
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |

Failure Analysis


| # | Query | Failure Type | Root Cause | Recommendation |
| --- | --- | --- | --- | --- |
| 1 | {query} | {type} | {cause} | {fix} |

Recommendations (Priority Order)


  1. {Recommendation} — addresses {N} failures, expected impact: {description}
  2. {Recommendation} — addresses {N} failures, expected impact: {description}

Sample Failures


Query: "{query}"


  • Expected: {answer}
  • Retrieved chunks: {chunk summaries with relevance scores}
  • Generated: {response}
  • Issue: {diagnosis}

Calibration Rules


  1. Component isolation. Evaluate retrieval and generation independently. A great retriever with a bad generator looks like retrieval failure if you only check end output.
  2. Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
  3. Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
  4. Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.

Error Handling


| Problem | Resolution |
| --- | --- |
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |

When NOT to Audit


Push back if:
  • The pipeline hasn't been built yet — design it first, audit after
  • The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
  • The user wants to compare embedding models — that's a benchmark task, not an audit