rag-eval
Purpose
Replace the "tweak → squint → swap model → burn credits" loop with a single command that runs a grid of eval variants on the user's gold-set, ranks them by a cost-aware score, and returns structured feedback on architecture, stack, and likely issues. It draws on evidence-based RAG practices and learns from the user's past runs.
When to use
Trigger on: "help me test a RAG", "tune my RAG", "my RAG is bad", "compare retrieval prompts", "how do I eval this", "what's the best embedding model for X", "my RAG eval is expensive". Also trigger when the user reports burning OpenRouter / OpenAI credits with no clear signal of improvement.
Prerequisites: gather before running
Collect these from the user before the first sweep. Many are optional with sensible defaults; always confirm the ones that gate cost.
- RAG codebase root: path to the repo/module under test.
- Gold-set: at least 10 Q&A pairs. If missing, offer to generate a starter gold-set from the user's dataset (LLM-synthesized, human-reviewed). See `references/best-practices.md`.
- Dataset: the corpus the RAG retrieves over.
- Budget cap: hard dollar limit per run (default: $2 if the user doesn't specify). Always confirm before any sweep.
- Provider keys: `OPENROUTER_API_KEY` or `OPENAI_API_KEY` (read from env).
- Vector-store config: collection name, embedding model, chunk size (read from repo; confirm if ambiguous).
- Eval history path (optional): defaults to `.rag-eval/history.jsonl` in the repo root.
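The prerequisites can be turned into a small pre-flight check. The sketch below is only an illustration and is not one of the skill's scripts: the function name `preflight`, the `question`/`answer` field names, and the shape of the returned dictionary are assumptions. Only the 10-pair minimum, the $2 default cap, and the two environment variable names come from the prerequisites themselves.

```python
import json
import os

MIN_GOLD_PAIRS = 10        # minimum stated in the prerequisites
DEFAULT_BUDGET_USD = 2.0   # default cap when the user gives none


def preflight(gold_lines, budget_usd=None, env=None):
    """Return (config, problems) for a planned run.

    gold_lines: iterable of JSONL strings, one Q&A pair per line.
    The "question"/"answer" keys are hypothetical field names.
    """
    env = os.environ if env is None else env
    problems = []

    pairs = [json.loads(line) for line in gold_lines if line.strip()]
    valid = [p for p in pairs if p.get("question") and p.get("answer")]
    if len(valid) < MIN_GOLD_PAIRS:
        problems.append(
            f"gold-set has {len(valid)} usable pairs, need {MIN_GOLD_PAIRS}"
        )

    # Either provider key is enough; take the first one that is set.
    key_name = next(
        (k for k in ("OPENROUTER_API_KEY", "OPENAI_API_KEY") if env.get(k)),
        None,
    )
    if key_name is None:
        problems.append("no provider key found in environment")

    config = {
        "gold_pairs": len(valid),
        "budget_usd": DEFAULT_BUDGET_USD if budget_usd is None else budget_usd,
        "budget_confirmed": False,  # stays False until the user confirms
        "provider_key": key_name,
    }
    return config, problems
```

An empty `problems` list means the inputs are present; it does not replace the explicit budget confirmation with the user.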
Workflow
Follow this order. Refer to `references/best-practices.md` for the canonical checklist and `references/evidence-base.md` for the research-backed defaults.
Step 0: (Optional) Ingest a prior iteration session
When the user provides a session ID (Claude Code transcript, skill-studio session, or a Fathom meeting), run the deterministic ingest first, with no LLM calls. This extracts only the useful signals (models tried, prompt variants, cost events, eval results) as compact JSON, so the rest of the skill works off a tiny structured bundle instead of a long raw transcript.
```bash
python scripts/session_ingest.py <session_id> > /tmp/rag-eval-bundle.json
```
or with a direct path:
```bash
python scripts/session_ingest.py --path /path/to/transcript.jsonl > /tmp/rag-eval-bundle.json
```
The bundle includes: `models_tried`, `prompts_tried` (hashes only), `iterations`, `total_cost_usd`, `summary_stats`. Feed this into Step 1; do not paste the raw transcript.
**Why this matters:** transcripts can be 100k+ tokens of noise. The ingest script does regex extraction only, keeping the LLM budget for the actual audit + sweep planning. This is a hard requirement, not an optimization.
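For illustration, here is one way the ingest bundle could be checked and condensed before it is handed to Step 1. The top-level keys are the ones the bundle is said to include; the value types (lists for `models_tried`, `prompts_tried`, and `iterations`, a number for `total_cost_usd`) and the helper name `summarize_bundle` are assumptions, since the output schema of `session_ingest.py` is not spelled out here.

```python
import json

EXPECTED_KEYS = (
    "models_tried",
    "prompts_tried",
    "iterations",
    "total_cost_usd",
    "summary_stats",
)


def summarize_bundle(raw_json):
    """Parse the ingest bundle and return a short, prompt-friendly summary.

    Assumes models_tried / prompts_tried / iterations are lists and
    total_cost_usd is a number; adjust to the real schema.
    """
    bundle = json.loads(raw_json)
    missing = [k for k in EXPECTED_KEYS if k not in bundle]
    if missing:
        raise ValueError(f"bundle is missing keys: {missing}")

    return {
        "n_models": len(bundle["models_tried"]),
        "n_prompts": len(bundle["prompts_tried"]),
        "n_iterations": len(bundle["iterations"]),
        "total_cost_usd": float(bundle["total_cost_usd"]),
    }
```

Failing loudly on a missing key keeps a malformed bundle from silently turning into an empty audit input.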
Step 1: Audit the stack
Read `references/best-practices.md` and inspect the user's repo + vector-store config. Produce a structured report covering:
- Architecture (retrieval type: dense / hybrid / rerank; chunking strategy; prompt structure)
- Tech stack (embedding model, LLM, vector store)
- Resources (dataset size, gold-set size, prior eval runs)
- Risks (known anti-patterns, missing pieces)
Present the report to the user and ask which issues to address first.
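One possible shape for the audit report is a small data class with one field per section. This is a sketch under assumptions: the class name `AuditReport`, its field names, and the plain-text rendering are illustrative and are not defined by the skill.

```python
from dataclasses import dataclass, field


@dataclass
class AuditReport:
    # One field per report section; all names here are illustrative.
    architecture: dict = field(default_factory=dict)
    tech_stack: dict = field(default_factory=dict)
    resources: dict = field(default_factory=dict)
    risks: list = field(default_factory=list)

    def to_markdown(self):
        """Render each section as a title followed by bullet lines."""
        lines = []
        for title, section in (
            ("Architecture", self.architecture),
            ("Tech stack", self.tech_stack),
            ("Resources", self.resources),
        ):
            lines.append(title)
            lines.extend(f"- {key}: {value}" for key, value in section.items())
        lines.append("Risks")
        lines.extend(f"- {risk}" for risk in self.risks)
        return "\n".join(lines)
```

Keeping risks as a flat list makes it easy to ask the user which one to address first.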
Step 2: Propose a sweep plan
Based on the audit, propose 3–8 variants to test. Keep the grid small on the first run (default: 2 prompts × 2 models × 1 retrieval variant = 4 cells). Estimate cost as gold-set size × variants × average tokens per query × provider price per token. Present the cost estimate and wait for user confirmation before running.
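The grid and the cost estimate can be sketched in a few lines. Everything named below is hypothetical: the helper names, the variant labels, the token count, and the per-1,000-token prices are placeholders chosen for the example, not real provider pricing.

```python
from itertools import product


def build_grid(prompts, models, retrievers):
    """Cartesian product of the sweep axes, one dict per cell."""
    return [
        {"prompt": p, "model": m, "retriever": r}
        for p, m, r in product(prompts, models, retrievers)
    ]


def estimate_cost_usd(grid, gold_size, avg_tokens_per_query, usd_per_1k_tokens):
    """Sum gold-set size x avg tokens x price over every cell in the grid.

    usd_per_1k_tokens maps model name -> blended price per 1,000 tokens.
    """
    total = 0.0
    for cell in grid:
        tokens = gold_size * avg_tokens_per_query
        total += tokens / 1000 * usd_per_1k_tokens[cell["model"]]
    return total


# Default first-run grid: 2 prompts x 2 models x 1 retrieval variant.
grid = build_grid(["prompt_a", "prompt_b"], ["model_x", "model_y"], ["dense"])
estimate = estimate_cost_usd(
    grid,
    gold_size=10,
    avg_tokens_per_query=2000,
    usd_per_1k_tokens={"model_x": 0.001, "model_y": 0.01},  # placeholder prices
)
```

Pricing per model rather than per sweep matters because one expensive model can dominate the total even in a four-cell grid.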
Step 3: Run the sweep
Use `scripts/eval_sweep.py` (see the script header for invocation). It reads a config YAML, runs each variant against the gold-set, records per-variant cost and answer quality, and appends to `history.jsonl`.
Guardrails:
- Never exceed the budget cap; halt mid-sweep if it is reached.
- Never mutate the user's repo. Write all artifacts under `.rag-eval/` (gitignore it).
- Confirm before any sweep estimated to exceed the user's cap.
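The budget guardrail can be sketched as a loop that checks the cap before each cell. This is not the logic of `scripts/eval_sweep.py`, which is not shown here. The `run_cell` callable, the per-cell estimate, and the result keys are assumptions made for the example.

```python
def run_sweep(cells, run_cell, budget_usd, est_cost_per_cell):
    """Run cells in order; halt before a cell that could breach the cap.

    run_cell(cell) -> (quality, cost_usd) is a hypothetical callable that
    evaluates one variant against the gold-set.
    Returns (results, halted).
    """
    spent = 0.0
    results = []
    for cell in cells:
        # Check the cap before spending, so the sweep stops early rather
        # than discovering the overrun after the fact.
        if spent + est_cost_per_cell > budget_usd:
            return results, True
        quality, cost = run_cell(cell)
        spent += cost
        results.append({"cell": cell, "quality": quality, "cost_usd": cost})
    return results, False
```

Checking before each cell trades a possibly unused remainder of the budget for the guarantee that a well-estimated cell never pushes the run past the cap.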
Step 4: Rank and report
After the sweep, rank variants by a cost-aware score: `quality × (1 / log(1 + cost))`. Present:
- Top 3 variants with quality metrics and cost
- What changed vs the previous best
- Concrete next experiment to try
Write the full report to `.rag-eval/reports/<timestamp>.md`.
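The cost-aware score can be sketched as follows. Two choices here are assumptions, because the formula does not settle them: the logarithm is taken as the natural log, and a run with zero cost (where the formula is undefined) is given an infinite score. The result keys `name`, `quality`, and `cost_usd` are also hypothetical.

```python
import math


def cost_aware_score(quality, cost_usd):
    """quality x (1 / log(1 + cost)), using the natural log (an assumption)."""
    if cost_usd <= 0:
        # log(1 + 0) == 0, so the formula divides by zero; treat free runs
        # as unbeatable on cost. This handling is an assumption.
        return math.inf
    return quality / math.log1p(cost_usd)


def rank(results, top_n=3):
    """Attach a score to each result dict and return the best top_n."""
    scored = [
        {**r, "score": cost_aware_score(r["quality"], r["cost_usd"])}
        for r in results
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_n]
```

For small costs, log(1 + cost) is close to cost itself, so the score behaves roughly like quality per dollar and can favor very cheap variants over noticeably better ones. It is worth showing the raw quality numbers next to the score for that reason.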
Step 5: Self-improve
Before each subsequent run, read `history.jsonl` and factor in what the user has already tried. Avoid re-testing rejected variants. Surface patterns ("models A, B, C all underperformed on multi-hop queries; next try a reranker").
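Skipping variants that were already tried can be sketched as a filter over the history records. The record layout is an assumption: the keys `prompt`, `model`, and `retriever` are hypothetical, and the real `history.jsonl` schema may differ.

```python
import json


def load_history(lines):
    """Parse history.jsonl content, one JSON record per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def untested(candidates, history):
    """Drop candidates whose (prompt, model, retriever) key is in history."""
    def key(variant):
        return (variant["prompt"], variant["model"], variant["retriever"])

    seen = {key(record) for record in history}
    return [c for c in candidates if key(c) not in seen]
```

Typical use would be to pass the open file object for the history path to `load_history` and run the proposed grid through `untested` before estimating cost.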
Reusable resources
- `scripts/eval_sweep.py`: grid-search runner. Reads `eval_config.yaml`, writes results to `history.jsonl`.
- `references/best-practices.md`: evidence-based RAG checklist the agent uses as an anchor.
- `references/evidence-base.md`: pointers to recent RAG research and when each technique helps.
- `assets/eval_config.template.yaml`: starter config to copy into the user's repo.
- `assets/gold_set.template.jsonl`: 3 example Q&A pairs to show the gold-set format.
Notes
- **Cost is the main failure mode.** Never run without a confirmed budget. Err on the side of smaller sweeps; users can always run again.
- **No repo mutation.** All outputs go under `.rag-eval/` in the target repo.
- **When uncertain about best practices, do web research.** Use `tavily-search` or `firecrawl-research` to pull current evidence, then synthesize into the audit report.
- **Defer to the user.** Before changing any file in the target repo, always confirm.