rag-eval

Purpose

Replace the "tweak → squint → swap model → burn credits" loop with a single command that runs a grid of eval variants on the user's gold-set, ranks them by a cost-aware score, and returns structured feedback on architecture, stack, and likely issues. It draws on evidence-based RAG practices and learns from the user's past runs.

When to use

Trigger on: "help me test a RAG", "tune my RAG", "my RAG is bad", "compare retrieval prompts", "how do I eval this", "what's the best embedding model for X", "my RAG eval is expensive". Also trigger when the user reports burning OpenRouter / OpenAI credits with no clear signal of improvement.

Prerequisites — gather before running

Collect these from the user before the first sweep. Many are optional with sensible defaults; always confirm the ones that gate cost.

RAG codebase root: path to the repo/module under test.
Gold-set: at least 10 Q&A pairs. If missing, offer to generate a starter gold-set from the user's dataset (LLM-synthesized, human-reviewed). See `references/best-practices.md`.
Dataset: the corpus the RAG retrieves over.
Budget cap: hard dollar limit per run (default: $2 if the user doesn't specify). Always confirm before any sweep.
Provider keys: `OPENROUTER_API_KEY` or `OPENAI_API_KEY` (read from env).
Vector-store config: collection name, embedding model, chunk size (read from repo; confirm if ambiguous).
Eval history path (optional): defaults to `.rag-eval/history.jsonl` in the repo root.
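
The prerequisites above could be checked up front with a small preflight helper. This is only a sketch, not part of the skill's scripts: the function name `preflight`, its parameters, and the one-JSON-object-per-line gold-set layout are assumptions made for illustration. The 10-pair minimum, the $2 default, and the two environment variable names come from the list above.

```python
import json
import os
from pathlib import Path

DEFAULT_BUDGET_USD = 2.0  # default cap from the prerequisites list
MIN_GOLD_PAIRS = 10       # minimum gold-set size from the prerequisites list


def preflight(gold_set_path, budget_usd=None, env=None):
    """Return (budget, problems): the effective cap and a list of readable problems."""
    env = os.environ if env is None else env
    problems = []

    # Assumed layout: one JSON object per non-empty line (JSONL).
    path = Path(gold_set_path)
    if not path.is_file():
        problems.append(f"gold-set not found: {path}")
    else:
        with path.open(encoding="utf-8") as fh:
            pairs = [json.loads(line) for line in fh if line.strip()]
        if len(pairs) < MIN_GOLD_PAIRS:
            problems.append(
                f"gold-set has {len(pairs)} pairs, need at least {MIN_GOLD_PAIRS}"
            )

    # Either provider key is enough.
    if not (env.get("OPENROUTER_API_KEY") or env.get("OPENAI_API_KEY")):
        problems.append("set OPENROUTER_API_KEY or OPENAI_API_KEY")

    budget = DEFAULT_BUDGET_USD if budget_usd is None else float(budget_usd)
    if budget <= 0:
        problems.append("budget cap must be positive")
    return budget, problems
```

An empty `problems` list means the inputs look complete; the budget still needs explicit confirmation from the user before any sweep.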

Workflow

Follow this order. Refer to `references/best-practices.md` for the canonical checklist and `references/evidence-base.md` for the research-backed defaults.

Step 0 — (Optional) Ingest a prior iteration session

When the user provides a session ID (Claude Code transcript, skill-studio session, or a Fathom meeting), run the deterministic ingest first — no LLM calls. This extracts only the useful signals (models tried, prompt variants, cost events, eval results) as compact JSON, so the rest of the skill works off a tiny structured bundle instead of a long raw transcript.

```bash
python scripts/session_ingest.py <session_id> > /tmp/rag-eval-bundle.json
```

or with a direct path:

```bash
python scripts/session_ingest.py --path /path/to/transcript.jsonl > /tmp/rag-eval-bundle.json
```


The bundle includes: `models_tried`, `prompts_tried` (hashes only), `iterations`, `total_cost_usd`, `summary_stats`. Feed this into Step 1 — do not paste the raw transcript.

**Why this matters:** transcripts can be 100k+ tokens of noise. The ingest script does regex extraction only, keeping the LLM budget for the actual audit + sweep planning. This is a hard requirement, not an optimization.
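
As an illustration of working from the bundle instead of the transcript, the sketch below parses the JSON and checks for the five keys listed above. The helper names and the value types (a list of model names, a numeric cost) are assumptions; the source names the keys but does not specify their shapes.

```python
import json

# Keys the bundle is documented to contain.
EXPECTED_KEYS = (
    "models_tried",
    "prompts_tried",
    "iterations",
    "total_cost_usd",
    "summary_stats",
)


def load_bundle(text):
    """Parse bundle JSON and fail loudly if a documented key is missing."""
    bundle = json.loads(text)
    missing = [key for key in EXPECTED_KEYS if key not in bundle]
    if missing:
        raise ValueError(f"bundle is missing keys: {missing}")
    return bundle


def one_line_summary(bundle):
    # Assumed types: models_tried is a list of strings, total_cost_usd is a number.
    models = ", ".join(bundle["models_tried"]) or "none"
    return f"models: {models}; spent ${float(bundle['total_cost_usd']):.2f}"
```

In practice the text would come from `/tmp/rag-eval-bundle.json`, and the summary gives Step 1 a compact starting point.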

Step 1 — Audit the stack

Read `references/best-practices.md` and inspect the user's repo + vector-store config. Produce a structured report covering:

Architecture (retrieval type: dense / hybrid / rerank; chunking strategy; prompt structure)
Tech stack (embedding model, LLM, vector store)
Resources (dataset size, gold-set size, prior eval runs)
Risks (known anti-patterns, missing pieces)

Present the report to the user and ask which issues to address first.

Step 2 — Propose a sweep plan

Based on the audit, propose 3–8 variants to test. Keep the grid small on the first run (default: 2 prompts × 2 models × 1 retrieval variant = 4 cells). Estimate cost using gold-set size × variants × avg tokens × provider pricing. Present the cost estimate and wait for user confirmation before running.
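
The cost estimate described above is a straight product, which a few lines make concrete. The function name, the per-1k-token pricing unit, and all the numbers in the example are placeholders rather than real provider prices.

```python
def estimate_sweep_cost(gold_set_size, n_variants, avg_tokens_per_call, usd_per_1k_tokens):
    """gold-set size x variants x avg tokens x provider pricing, in USD."""
    calls = gold_set_size * n_variants
    return calls * avg_tokens_per_call * usd_per_1k_tokens / 1000.0


# Default first-run grid: 2 prompts x 2 models x 1 retrieval variant = 4 cells.
n_variants = 2 * 2 * 1

# Placeholder inputs: 10 gold pairs, 1500 tokens per call, $0.002 per 1k tokens.
estimate = estimate_sweep_cost(10, n_variants, 1500, 0.002)
print(f"estimated sweep cost: ${estimate:.2f}")
```

A single average price is a simplification: models in the same grid usually have different prices, so a per-model estimate summed over cells would be tighter.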

Step 3 — Run the sweep

Use `scripts/eval_sweep.py` (see the script header for invocation). It reads a config YAML, runs each variant against the gold-set, records per-variant cost and answer quality, and appends to `history.jsonl`.

Guardrails:

Never exceed the budget cap — halt mid-sweep if reached.
Never mutate the user's repo. Write all artifacts under `.rag-eval/` (gitignore it).
Confirm before any sweep estimated to exceed the user's cap.
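
One way the budget guardrail could work is to check the estimated cost of each cell before running it and stop at the first cell that does not fit. The sketch below is not the logic of `scripts/eval_sweep.py`, which is not shown here; `BudgetGuard`, `run_sweep`, and the `run_cell` / `estimate_cell` callables are hypothetical names.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push spend past the cap."""


class BudgetGuard:
    def __init__(self, cap_usd):
        if cap_usd <= 0:
            raise ValueError("budget cap must be positive")
        self.cap_usd = float(cap_usd)
        self.spent_usd = 0.0

    def reserve(self, estimated_usd):
        # Check before spending, so a known estimate never crosses the cap.
        if self.spent_usd + estimated_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}, next ~${estimated_usd:.2f}, "
                f"cap ${self.cap_usd:.2f}"
            )

    def record(self, actual_usd):
        self.spent_usd += actual_usd


def run_sweep(cells, guard, run_cell, estimate_cell):
    """Run cells in order; stop at the first one that does not fit the budget."""
    results = []
    for cell in cells:
        try:
            guard.reserve(estimate_cell(cell))
        except BudgetExceeded:
            break  # halt mid-sweep and keep the results gathered so far
        result = run_cell(cell)  # hypothetical: returns a dict with "cost_usd"
        guard.record(result["cost_usd"])
        results.append(result)
    return results
```

Because the check uses estimates, actual spend can still overshoot if a call costs more than predicted; padding the estimate or lowering the effective cap leaves headroom.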

Step 4 — Rank and report

After the sweep, rank variants by a cost-aware score: `quality × (1 / log(1 + cost))`. Present:

Top 3 variants with quality metrics and cost
What changed vs the previous best
Concrete next experiment to try

Write the full report to `.rag-eval/reports/<timestamp>.md`.
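
The ranking score can be sketched in a few lines. Two details are assumptions, since the formula above leaves them open: the logarithm is taken as natural, and a small cost floor is applied because `log(1 + 0)` is zero and would divide by zero for a zero-cost variant. The record fields `name`, `quality`, and `cost_usd` are hypothetical.

```python
import math

# Assumption: floor the cost so zero-cost variants do not divide by zero.
MIN_COST_USD = 1e-6


def cost_aware_score(quality, cost_usd):
    """quality x (1 / log(1 + cost)), using the natural log (assumed base)."""
    return quality / math.log1p(max(cost_usd, MIN_COST_USD))


def top_variants(results, k=3):
    """Return the k highest-scoring result records, best first."""
    return sorted(
        results,
        key=lambda r: cost_aware_score(r["quality"], r["cost_usd"]),
        reverse=True,
    )[:k]


results = [
    {"name": "a", "quality": 0.8, "cost_usd": 0.40},
    {"name": "b", "quality": 0.7, "cost_usd": 0.10},
    {"name": "c", "quality": 0.9, "cost_usd": 1.50},
    {"name": "d", "quality": 0.5, "cost_usd": 0.05},
]
print([r["name"] for r in top_variants(results)])
```

With these made-up numbers the cheapest variant ranks first even though its quality is lowest. For small costs `log(1 + cost)` is close to `cost`, so the score behaves roughly like quality per dollar, which is worth keeping in mind when reading the ranking alongside the raw quality metrics.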

Step 5 — Self-improve

Before each subsequent run, read `history.jsonl` and factor in what the user has already tried. Avoid re-testing rejected variants. Surface patterns ("models A, B, C all underperformed on multi-hop queries; next, try a reranker").
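
Skipping already-tested variants amounts to a set lookup against the history file. The sketch below assumes one JSON object per line and hypothetical record fields `prompt_id`, `model`, and `retrieval`; the actual schema written by the sweep script is not documented here. For simplicity it treats every recorded variant as already tried, rather than only the rejected ones.

```python
import json
from pathlib import Path


def load_history(path=".rag-eval/history.jsonl"):
    """Read past run records; a missing file means no history yet."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def variant_key(record):
    # Hypothetical fields identifying one grid cell.
    return (record["prompt_id"], record["model"], record["retrieval"])


def untested(candidates, history):
    """Keep only candidate variants that do not appear in the history."""
    seen = {variant_key(r) for r in history}
    return [c for c in candidates if variant_key(c) not in seen]
```

Pattern surfacing, such as noticing that several models fail on the same query type, would need per-question results in the history and is left out of this sketch.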

Reusable resources

`scripts/eval_sweep.py`: grid-search runner. Reads `eval_config.yaml`, writes results to `history.jsonl`.
`references/best-practices.md`: evidence-based RAG checklist the agent uses as an anchor.
`references/evidence-base.md`: pointers to recent RAG research and when each technique helps.
`assets/eval_config.template.yaml`: starter config to copy into the user's repo.
`assets/gold_set.template.jsonl`: 3 example Q&A pairs to show the gold-set format.

Notes

Cost is the main failure mode. Never run without a confirmed budget. Err on the side of smaller sweeps; users can always run again.
No repo mutation. All outputs go under `.rag-eval/` in the target repo.
When uncertain about best practices, do web research. Use `tavily-search` or `firecrawl-research` to pull current evidence, then synthesize into the audit report.
Defer to the user. Before changing any file in the target repo, always confirm.
