rag-auditor
RAG Auditor
Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs
evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates
generation quality (groundedness, completeness, hallucination rate), diagnoses component-level
failures, and recommends targeted improvements.
Reference Files
| Contents | Load When |
|---|---|
| Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |
Prerequisites
- Access to the RAG pipeline (or its outputs for post-hoc evaluation)
- A set of test queries with known-correct answers
- Understanding of the pipeline components (embedding model, retriever, generator)
Workflow
Phase 1: Pipeline Inventory
Document the RAG pipeline configuration:
- Document source — What documents are indexed? Format, count, size.
- Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
- Embedding — Model name and version, dimensionality.
- Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
- Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
- Generation — Model, prompt template, context window usage.
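The inventory above can be kept as a structured record so that later phases can reference the configuration programmatically. A minimal sketch; every field value here is a hypothetical placeholder, not a recommended setting:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    # Phase 1 inventory fields; extend as the pipeline under audit requires.
    doc_count: int
    doc_format: str
    chunk_strategy: str    # "fixed-size", "semantic", or "paragraph"
    chunk_size: int        # tokens per chunk
    chunk_overlap: float   # fraction of overlap between adjacent chunks
    embedding_model: str
    embedding_dim: int
    vector_store: str      # e.g. "FAISS", "Chroma", "pgvector"
    retrieval_method: str  # "similarity", "hybrid", or "reranking"
    top_k: int
    generator_model: str

# Illustrative values only.
config = PipelineConfig(
    doc_count=1200, doc_format="PDF",
    chunk_strategy="fixed-size", chunk_size=512, chunk_overlap=0.1,
    embedding_model="example-embed-v1", embedding_dim=768,
    vector_store="FAISS", retrieval_method="similarity", top_k=5,
    generator_model="example-llm",
)
```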
Phase 2: Design Evaluation Queries
Create a diverse set of test queries:
| Query Type | Purpose | Count |
|---|---|---|
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |
For each query, document the expected answer and the source chunk(s).
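The query set can live as plain structured data, one entry per test query with its type, expected answer, and source chunks. The queries, answers, and chunk IDs below are invented for illustration:

```python
# "expected" is None for unanswerable queries, which should trigger abstention.
eval_queries = [
    {"type": "known-answer",
     "query": "What year was the warranty policy last updated?",
     "expected": "2023",
     "source_chunks": ["policy.pdf#chunk-12"]},
    {"type": "multi-hop",
     "query": "Which region led sales in the quarter the new policy took effect?",
     "expected": "EMEA",
     "source_chunks": ["policy.pdf#chunk-12", "sales.csv#chunk-3"]},
    {"type": "unanswerable",
     "query": "What is the CEO's favorite color?",
     "expected": None,
     "source_chunks": []},
]

# Tally queries per type to check coverage against the targets in the table above.
counts = {}
for q in eval_queries:
    counts[q["type"]] = counts.get(q["type"], 0) + 1
```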
Phase 3: Evaluate Retrieval
For each test query, measure:
- Precision@K — Of the K retrieved chunks, how many are relevant?
- Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
- MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
- Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
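All three metrics can be computed directly from the ranked list of retrieved chunk IDs and the per-query set of relevant chunk IDs. A self-contained sketch (chunk IDs are made up):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in relevant if c in top_k) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank over a set of queries (ranks are 1-based)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk in enumerate(retrieved, start=1):
            if chunk in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Example: one query where the first relevant chunk is ranked 2nd.
retrieved = ["c7", "c3", "c9", "c1", "c4"]
relevant = {"c3", "c1"}
p5 = precision_at_k(retrieved, relevant, 5)  # 2/5 = 0.4
r5 = recall_at_k(retrieved, relevant, 5)     # 2/2 = 1.0
rr = mrr([retrieved], [relevant])            # 1/2 = 0.5
```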
Phase 4: Evaluate Generation
For each test query with retrieved context:
- Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
- Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
- Hallucination detection — Identify specific claims not supported by context.
- Abstention — For unanswerable queries, does the model correctly say "I don't know"?
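In practice groundedness is scored by an LLM judge or an NLI model; the sketch below uses a crude lexical-overlap proxy purely to show the shape of the scoring (supported-claim ratio plus a list of flagged claims). The threshold and the example text are arbitrary:

```python
import re

def _content_words(text):
    # Lowercased alphabetic tokens longer than 3 characters.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3]

def naive_groundedness(claims, context, threshold=0.6):
    """Count a claim as supported if at least `threshold` of its content
    words appear in the retrieved context. A lexical stand-in for a real
    judge model, for illustration only."""
    context_words = set(_content_words(context))
    supported, unsupported = 0, []
    for claim in claims:
        words = _content_words(claim)
        if not words:
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap >= threshold:
            supported += 1
        else:
            unsupported.append(claim)
    score = supported / len(claims) if claims else 0.0
    return score, unsupported

context = "The warranty covers parts and labor for two years from purchase."
claims = ["The warranty covers parts and labor.",
          "Coverage lasts five years from purchase."]
score, flagged = naive_groundedness(claims, context)
# score = 0.5; the second claim is flagged as a potential hallucination.
```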
Phase 5: Diagnose Failures
For every incorrect or low-quality response, classify the root cause:
| Failure Type | Diagnosis | Indicator |
|---|---|---|
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
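The indicator column lends itself to a rule-of-thumb triage function for a single failed query. The 0.5 thresholds are illustrative, not canonical, and chunk-boundary and embedding-mismatch diagnoses still need manual inspection of the retrieved chunks:

```python
def classify_failure(recall, mrr_score, groundedness, answered, context_sufficient):
    """Map per-query metrics to a first-pass failure label, following the
    diagnosis table. Returns one label; real failures can overlap."""
    if not answered and context_sufficient:
        return "over-abstention"
    if recall < 0.5:
        return "retrieval failure"      # relevant chunks never retrieved
    if mrr_score < 0.5:
        return "ranking failure"        # retrieved, but buried low in the list
    if groundedness < 0.5:
        return "hallucination"          # claims not traceable to context
    return "generation failure"         # context fine, answer still wrong
```

For example, a query with high recall but a low reciprocal rank triages as a ranking failure, pointing at the reranking stage rather than the index.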
Phase 6: Recommendations
Based on failure analysis, recommend specific improvements:
| Failure Pattern | Recommendation |
|---|---|
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
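Once failures are labeled, the table above can drive the priority ordering for the final report: rank each recommendation by how many observed failures it addresses. A sketch using hypothetical labels:

```python
from collections import Counter

RECOMMENDATIONS = {
    "chunk boundary": "Increase overlap, try semantic chunking",
    "low precision": "Reduce K, add reranking stage",
    "low recall": "Increase K, try hybrid search",
    "embedding mismatch": "Try different embedding model, add query expansion",
    "hallucination": "Strengthen grounding instruction in prompt, reduce temperature",
    "over-abstention": "Soften abstention criteria in prompt",
}

def prioritize(failure_labels):
    """Return (recommendation, failure_count) pairs, most impactful first."""
    counts = Counter(failure_labels)
    return [(RECOMMENDATIONS[label], n)
            for label, n in counts.most_common()
            if label in RECOMMENDATIONS]

priorities = prioritize(["low recall", "low recall", "hallucination"])
```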
Output Format
RAG Audit Report
Pipeline Configuration
| Component | Value |
|---|---|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |
Evaluation Dataset
- Total queries: {N}
- Known-answer: {N}
- Multi-hop: {N}
- Unanswerable: {N}
Retrieval Quality
| Metric | Score | Target | Status |
|---|---|---|---|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |
Generation Quality
| Metric | Score | Target | Status |
|---|---|---|---|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |
Failure Analysis
| # | Query | Failure Type | Root Cause | Recommendation |
|---|---|---|---|---|
| 1 | {query} | {type} | {cause} | {fix} |
Recommendations (Priority Order)
- {Recommendation} — addresses {N} failures, expected impact: {description}
- {Recommendation} — addresses {N} failures, expected impact: {description}
Sample Failures
Query: "{query}"
- Expected: {answer}
- Retrieved chunks: {chunk summaries with relevance scores}
- Generated: {response}
- Issue: {diagnosis}
Calibration Rules
- Component isolation. Evaluate retrieval and generation independently. A great retriever paired with a bad generator looks like a retrieval failure if you only check the end output.
- Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
- Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
- Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.
Error Handling
| Problem | Resolution |
|---|---|
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |
When NOT to Audit
Push back if:
- The pipeline hasn't been built yet — design it first, audit after
- The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
- The user wants to compare embedding models — that's a benchmark task, not an audit