rag-auditor
Evaluates RAG (Retrieval-Augmented Generation) pipeline quality across retrieval and generation stages. Measures precision, recall, MRR for retrieval; groundedness, completeness, and hallucination rate for generation. Diagnoses failure root causes and recommends chunk, retrieval, and prompt improvements. Triggers on: "audit RAG", "RAG quality", "evaluate retrieval", "hallucination detection", "retrieval precision", "why is RAG failing", "RAG diagnosis", "retrieval quality", "RAG evaluation", "chunk quality", "RAG pipeline review", "grounding check". Use this skill when diagnosing or evaluating a RAG pipeline's quality.
NPX Install
npx skill4agent add mathews-tom/praxis-skills rag-auditor
RAG Auditor
Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs
evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates
generation quality (groundedness, completeness, hallucination rate), diagnoses component-level
failures, and recommends targeted improvements.
Reference Files
| Contents | Load When |
|---|---|
| Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |
Prerequisites
- Access to the RAG pipeline (or its outputs for post-hoc evaluation)
- A set of test queries with known-correct answers
- Understanding of the pipeline components (embedding model, retriever, generator)
Workflow
Phase 1: Pipeline Inventory
Document the RAG pipeline configuration:
- Document source — What documents are indexed? Format, count, size.
- Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
- Embedding — Model name and version, dimensionality.
- Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
- Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
- Generation — Model, prompt template, context window usage.
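Keeping the inventory in a small, machine-readable record makes the later phases easier to script. The sketch below is illustrative only; the field names and sample values are assumptions, not part of any particular pipeline.

```python
# A minimal, illustrative pipeline-inventory record (field names are assumptions).
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    document_count: int      # number of indexed documents
    document_format: str     # e.g. "PDF", "Markdown"
    chunk_strategy: str      # "fixed", "semantic", "paragraph"
    chunk_size: int          # tokens per chunk
    chunk_overlap: float     # fractional overlap between adjacent chunks
    embedding_model: str     # model name and version
    embedding_dim: int       # vector dimensionality
    vector_store: str        # "FAISS", "Pinecone", "Chroma", "pgvector"
    retrieval_method: str    # "similarity", "hybrid", "rerank"
    top_k: int               # chunks retrieved per query
    generator_model: str     # LLM used for answer generation
    temperature: float       # generation temperature

# Example values are placeholders for a hypothetical pipeline.
config = PipelineConfig(
    document_count=1200, document_format="PDF",
    chunk_strategy="fixed", chunk_size=512, chunk_overlap=0.15,
    embedding_model="example-embed-v1", embedding_dim=768,
    vector_store="FAISS", retrieval_method="similarity", top_k=5,
    generator_model="example-llm", temperature=0.2,
)
```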
Phase 2: Design Evaluation Queries
Create a diverse set of test queries:
| Query Type | Purpose | Count |
|---|---|---|
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |
For each query, document the expected answer and the source chunk(s).
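One lightweight way to keep the query set consistent is a list of records pairing each query with its type, expected answer, and gold chunk IDs. The structure below is a sketch; the keys and example values are assumptions.

```python
# Illustrative evaluation query set; keys and IDs are assumptions.
eval_queries = [
    {
        "query": "What is the default chunk size?",
        "type": "known-answer",                # known-answer | multi-hop | unanswerable | ambiguous | recent
        "expected_answer": "512 tokens",
        "gold_chunk_ids": ["doc-07#chunk-3"],  # chunks that should be retrieved
    },
    {
        "query": "Who maintains the billing service?",
        "type": "unanswerable",                # not covered by the corpus
        "expected_answer": None,               # correct behaviour is to abstain
        "gold_chunk_ids": [],
    },
]
```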
Phase 3: Evaluate Retrieval
For each test query, measure:
- Precision@K — Of the K retrieved chunks, how many are relevant?
- Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
- MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
- Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
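The three retrieval metrics above can be computed directly from each query's ranked chunk IDs and its gold (relevant) chunk set. A minimal sketch, assuming chunk IDs are plain strings:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)

def mrr(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant chunk across all queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Example: one query, gold chunk "c2", retrieved in the order c5, c2, c9.
print(precision_at_k(["c5", "c2", "c9"], {"c2"}, k=3))  # 0.33...
print(recall_at_k(["c5", "c2", "c9"], {"c2"}, k=3))     # 1.0
print(mrr([["c5", "c2", "c9"]], [{"c2"}]))              # 0.5
```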
Phase 4: Evaluate Generation
For each test query with retrieved context:
- Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
- Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
- Hallucination detection — Identify specific claims not supported by context.
- Abstention — For unanswerable queries, does the model correctly say "I don't know"?
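Groundedness and hallucination detection are usually scored with an entailment model or an LLM judge. The sketch below substitutes a naive lexical-overlap check purely to illustrate the claim-level scoring loop; treat the heuristic and its threshold as placeholders, not a recommended method.

```python
def claim_is_supported(claim: str, context: str, threshold: float = 0.6) -> bool:
    """Naive lexical stand-in for an entailment/LLM judge: a claim counts as
    supported if most of its content words appear in the retrieved context."""
    claim_words = {w.lower() for w in claim.split() if len(w) > 3}
    if not claim_words:
        return True
    context_words = {w.lower() for w in context.split()}
    return len(claim_words & context_words) / len(claim_words) >= threshold

def groundedness(claims: list[str], context: str) -> float:
    """Fraction of claims in the response supported by the retrieved context."""
    if not claims:
        return 1.0
    return sum(claim_is_supported(c, context) for c in claims) / len(claims)

def hallucinated_claims(claims: list[str], context: str) -> list[str]:
    """Claims that cannot be traced back to the retrieved context."""
    return [c for c in claims if not claim_is_supported(c, context)]

context = "The service retries failed requests three times with exponential backoff."
claims = [
    "The service retries failed requests three times.",
    "Retries use exponential backoff.",
    "The service was launched in 2019.",   # not in the context above
]
print(groundedness(claims, context))         # ~0.67
print(hallucinated_claims(claims, context))  # ["The service was launched in 2019."]
```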
Phase 5: Diagnose Failures
For every incorrect or low-quality response, classify the root cause:
| Failure Type | Diagnosis | Indicator |
|---|---|---|
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
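Once per-query metrics are in hand, the table can be turned into rough decision rules. The thresholds below are illustrative assumptions and should be tuned against the pipeline's own targets.

```python
def classify_failure(recall_k: float, mrr_score: float,
                     groundedness_score: float, abstained: bool,
                     context_sufficient: bool) -> str:
    """Rough decision rules mirroring the failure table; thresholds are illustrative."""
    if abstained and context_sufficient:
        return "over-abstention"
    if recall_k < 0.5:
        return "retrieval failure"        # relevant chunks never retrieved
    if mrr_score < 0.3:
        return "ranking failure"          # retrieved, but buried low in the list
    if groundedness_score < 0.5:
        return "hallucination"            # claims not traceable to the context
    return "generation failure"           # good context, wrong or incomplete answer

# Example: relevant chunk retrieved but ranked low.
print(classify_failure(recall_k=0.8, mrr_score=0.2,
                       groundedness_score=0.9, abstained=False,
                       context_sufficient=True))  # "ranking failure"
```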
Phase 6: Recommendations
Based on failure analysis, recommend specific improvements:
| Failure Pattern | Recommendation |
|---|---|
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
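A simple way to produce the priority-ordered recommendation list is to count how many observed failures each pattern accounts for and map each pattern to a fix. The mapping below mirrors the table above; the label strings are assumptions that match the Phase 5 classifier sketch.

```python
from collections import Counter

# Failure-pattern-to-fix mapping; labels are assumptions matching the Phase 5 sketch.
RECOMMENDATIONS = {
    "chunk boundary issue": "Increase overlap or switch to semantic chunking",
    "retrieval failure":    "Increase K or add hybrid (keyword + vector) search",
    "ranking failure":      "Add a reranking stage after retrieval",
    "embedding mismatch":   "Try a different embedding model or add query expansion",
    "hallucination":        "Strengthen grounding instructions in the prompt, reduce temperature",
    "over-abstention":      "Soften abstention criteria in the prompt",
}

def prioritize(failure_labels: list[str]) -> list[tuple[str, int, str]]:
    """Order recommendations by how many observed failures each pattern accounts for."""
    counts = Counter(failure_labels)
    return [(label, n, RECOMMENDATIONS.get(label, "Investigate manually"))
            for label, n in counts.most_common()]

print(prioritize(["retrieval failure", "retrieval failure", "hallucination"]))
```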
Output Format
## RAG Audit Report
### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |
### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}
### Retrieval Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |
### Generation Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |
### Failure Analysis
| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |
### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}
### Sample Failures
#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}
Calibration Rules
- Component isolation. Evaluate retrieval and generation independently. A great retriever with a bad generator looks like retrieval failure if you only check end output.
- Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
- Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
- Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.
Error Handling
| Problem | Resolution |
|---|---|
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |
When NOT to Audit
Push back if:
- The pipeline hasn't been built yet — design it first, audit after
- The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
- The user wants to compare embedding models — that's a benchmark task, not an audit