rag-eval

Purpose

Replace the "tweak → squint → swap model → burn credits" loop with a single command that runs a grid of eval variants on the user's gold-set, ranks them by a cost-aware score, and returns structured feedback on architecture, stack, and likely issues. The skill draws on evidence-based RAG practices and learns from the user's past runs.

When to use

Trigger on: "help me test a RAG", "tune my RAG", "my RAG is bad", "compare retrieval prompts", "how do I eval this", "what's the best embedding model for X", "my RAG eval is expensive". Also trigger when the user reports burning OpenRouter / OpenAI credits with no clear signal of improvement.

Prerequisites — gather before running

Collect these from the user before the first sweep. Many are optional with sensible defaults; always confirm the ones that gate cost.
  1. RAG codebase root: path to the repo/module under test.
  2. Gold-set: at least 10 Q&A pairs. If missing, offer to generate a starter gold-set from the user's dataset (LLM-synthesized, human-reviewed). See `references/best-practices.md`.
  3. Dataset: the corpus the RAG retrieves over.
  4. Budget cap: hard dollar limit per run (default: $2 if the user doesn't specify). Always confirm before any sweep.
  5. Provider keys: `OPENROUTER_API_KEY` or `OPENAI_API_KEY` (read from env).
  6. Vector-store config: collection name, embedding model, chunk size (read from repo; confirm if ambiguous).
  7. Eval history path (optional): defaults to `.rag-eval/history.jsonl` in the repo root.
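
The gold-set requirement above can be checked before any money is spent. The following is a minimal sketch, assuming the gold-set is JSONL with one Q&A pair per line; the `question` and `answer` keys and the `load_gold_set` helper are hypothetical, since the real schema lives in `assets/gold_set.template.jsonl`.

```python
import json

MIN_PAIRS = 10  # minimum gold-set size stated in the prerequisites


def load_gold_set(lines):
    """Parse JSONL lines into Q&A pairs and enforce the minimum size.

    The `question` / `answer` keys are an assumed schema, not the
    documented one; adjust them to match the real template.
    """
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"question", "answer"} - set(record)
        if missing:
            raise ValueError(f"line {number}: missing {sorted(missing)}")
        pairs.append(record)
    if len(pairs) < MIN_PAIRS:
        raise ValueError(
            f"gold-set has {len(pairs)} pairs, need at least {MIN_PAIRS}"
        )
    return pairs
```

A caller would pass an open file handle, for example `load_gold_set(open(path))`, and offer to generate a starter gold-set when the size check fails.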

Workflow

Follow this order. Refer to `references/best-practices.md` for the canonical checklist and `references/evidence-base.md` for the research-backed defaults.

Step 0 — (Optional) Ingest a prior iteration session

When the user provides a session ID (Claude Code transcript, skill-studio session, or a Fathom meeting), run the deterministic ingest first — no LLM calls. This extracts only the useful signals (models tried, prompt variants, cost events, eval results) as compact JSON, so the rest of the skill works off a tiny structured bundle instead of a long raw transcript.
```bash
python scripts/session_ingest.py <session_id> > /tmp/rag-eval-bundle.json
```

Or, with a direct path:

```bash
python scripts/session_ingest.py --path /path/to/transcript.jsonl > /tmp/rag-eval-bundle.json
```

The bundle includes: `models_tried`, `prompts_tried` (hashes only), `iterations`, `total_cost_usd`, `summary_stats`. Feed this into Step 1 — do not paste the raw transcript.

**Why this matters:** transcripts can be 100k+ tokens of noise. The ingest script does regex extraction only, keeping the LLM budget for the actual audit + sweep planning. This is a hard requirement, not an optimization.
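
To show how the bundle can feed Step 1 without the raw transcript, here is a small sketch that condenses it into one line for the audit. The field names come from the list above, but their types are assumptions (`models_tried` and `prompts_tried` as lists, `iterations` as a list or a count, `total_cost_usd` as a number), and `summarize_bundle` is a hypothetical helper, not part of `session_ingest.py`.

```python
def summarize_bundle(bundle):
    """Return a one-line summary of an ingest bundle for the Step 1 audit.

    Field types are guesses: lists for `models_tried` / `prompts_tried`,
    a list or an integer count for `iterations`, a number for the cost.
    """
    models = bundle.get("models_tried", [])
    prompts = bundle.get("prompts_tried", [])
    iterations = bundle.get("iterations", 0)
    # Accept either a count or a list of iteration records.
    n_iterations = iterations if isinstance(iterations, int) else len(iterations)
    cost = float(bundle.get("total_cost_usd", 0.0))
    return (
        f"{len(models)} models, {len(prompts)} prompt variants, "
        f"{n_iterations} iterations, ${cost:.2f} spent"
    )
```

In practice the bundle would be read with `json.load` from `/tmp/rag-eval-bundle.json` and the summary placed at the top of the audit report.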

Step 1 — Audit the stack

Read `references/best-practices.md` and inspect the user's repo and vector-store config. Produce a structured report covering:
  • Architecture (retrieval type: dense / hybrid / rerank; chunking strategy; prompt structure)
  • Tech stack (embedding model, LLM, vector store)
  • Resources (dataset size, gold-set size, prior eval runs)
  • Risks (known anti-patterns, missing pieces)
Present the report to the user and ask which issues to address first.
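
One possible shape for the structured report is sketched below. The four sections mirror the bullets above; the `AuditReport` class, its field names, and the example values are all illustrative assumptions rather than a format the skill prescribes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditReport:
    """Four sections mirroring the audit checklist; inner keys are illustrative."""

    architecture: dict = field(default_factory=dict)  # retrieval type, chunking, prompt structure
    tech_stack: dict = field(default_factory=dict)    # embedding model, LLM, vector store
    resources: dict = field(default_factory=dict)     # dataset size, gold-set size, prior runs
    risks: list = field(default_factory=list)         # known anti-patterns, missing pieces

    def to_json(self):
        """Serialize the report so it can be stored under .rag-eval/."""
        return json.dumps(asdict(self), indent=2)


# Example values are made up for illustration.
report = AuditReport(
    architecture={"retrieval": "dense", "chunking": "fixed-size"},
    resources={"gold_set_size": 10},
    risks=["no reranker"],
)
print(report.to_json())
```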

Step 2 — Propose a sweep plan

Based on the audit, propose 3–8 variants to test. Keep the grid small on the first run (default: 2 prompts × 2 models × 1 retrieval variant = 4 cells). Estimate cost using gold-set size × variants × avg tokens × provider pricing. Present the cost estimate and wait for user confirmation before running.
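
The cost estimate described above is a straight product, which the sketch below spells out. It assumes a single blended price per 1,000 tokens for every variant, which is a simplification (real pricing usually differs per model and between input and output tokens); the function name and all numbers are placeholders, not real provider quotes.

```python
def estimate_sweep_cost(gold_set_size, n_variants, avg_tokens_per_call, usd_per_1k_tokens):
    """Gold-set size x variants x average tokens x price per token."""
    total_tokens = gold_set_size * n_variants * avg_tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# Default first-run grid: 2 prompts x 2 models x 1 retrieval variant = 4 cells.
n_variants = 2 * 2 * 1
estimate = estimate_sweep_cost(
    gold_set_size=10,
    n_variants=n_variants,
    avg_tokens_per_call=2_000,  # illustrative guess
    usd_per_1k_tokens=0.01,     # placeholder price, not a real quote
)
print(f"Estimated cost: ${estimate:.2f}")
```

With these placeholder inputs the sweep uses 80,000 tokens, so the estimate stays under the default $2 cap; the user should still see the figure and confirm before anything runs.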

Step 3 — Run the sweep

Use `scripts/eval_sweep.py` (see the script header for invocation). It reads a config YAML, runs each variant against the gold-set, records per-variant cost and answer quality, and appends to `history.jsonl`.
Guardrails:
  • Never exceed the budget cap; halt mid-sweep if it is reached.
  • Never mutate the user's repo. Write all artifacts under `.rag-eval/` (gitignore it).
  • Confirm before any sweep estimated to exceed the user's cap.
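
The budget guardrail can be pictured as a check before each cell of the grid. This is a sketch of the idea only, not the logic inside `scripts/eval_sweep.py`: the `run_sweep` function, the `(name, estimated_cost_usd)` cell tuples, and the `run_cell` callback are hypothetical interfaces.

```python
def run_sweep(cells, run_cell, budget_cap_usd):
    """Run cells in order, stopping before spend would exceed the cap.

    `cells` is a list of (name, estimated_cost_usd) tuples and
    `run_cell(name)` returns the actual cost in USD; both are
    assumed interfaces for illustration.
    """
    spent = 0.0
    completed = []
    for name, estimated_cost in cells:
        if spent + estimated_cost > budget_cap_usd:
            break  # halt mid-sweep rather than overshoot the cap
        spent += run_cell(name)
        completed.append(name)
    return completed, spent
```

Because the check uses estimates, a cell whose actual cost overshoots its estimate could still push spend past the cap; a stricter version would also track spend inside `run_cell`.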

Step 4 — Rank and report

After the sweep, rank variants by a cost-aware score: `quality × (1 / log(1 + cost))`. Present:
  • Top 3 variants with quality metrics and cost
  • What changed vs the previous best
  • Concrete next experiment to try
Write the full report to `.rag-eval/reports/<timestamp>.md`.
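
The ranking formula can be sketched as follows. The natural logarithm is an assumption (the formula does not name a base), and the result-dict keys `name`, `quality`, and `cost_usd` are hypothetical; the example numbers are invented.

```python
import math


def cost_aware_score(quality, cost_usd):
    """quality x (1 / log(1 + cost)), using the natural log (an assumption)."""
    if cost_usd <= 0:
        # log(1 + 0) is 0, so the score is undefined at zero cost.
        raise ValueError("cost must be positive")
    return quality / math.log1p(cost_usd)


def top_variants(results, k=3):
    """Sort result dicts (assumed keys: name, quality, cost_usd) by score, best first."""
    ranked = sorted(
        results,
        key=lambda r: cost_aware_score(r["quality"], r["cost_usd"]),
        reverse=True,
    )
    return ranked[:k]


# Invented results for illustration.
results = [
    {"name": "a", "quality": 0.8, "cost_usd": 0.5},
    {"name": "b", "quality": 0.9, "cost_usd": 2.0},
    {"name": "c", "quality": 0.6, "cost_usd": 0.1},
]
print([r["name"] for r in top_variants(results)])
```

Here the cheapest variant ranks first despite the lowest quality, because the score grows without bound as cost approaches zero. Whether to floor the cost or require a minimum quality before ranking is a choice the formula leaves open.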

Step 5 — Self-improve

Before each subsequent run, read `history.jsonl` and factor in what the user has already tried. Avoid re-testing rejected variants. Surface patterns ("models A, B, C all underperformed on multi-hop queries; next try a reranker").
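
Skipping already-tried variants can be as simple as a set lookup over the history file. The sketch below assumes each line of `history.jsonl` is a JSON object with a `variant_id` key, which is a hypothetical schema; it also filters out every previously tested variant, a simplification of "rejected variants" that would need a status field to refine.

```python
import json


def already_tested(history_lines):
    """Collect variant IDs recorded in history (assumed key: `variant_id`)."""
    seen = set()
    for line in history_lines:
        line = line.strip()
        if line:
            seen.add(json.loads(line)["variant_id"])
    return seen


def filter_new_variants(candidates, history_lines):
    """Drop candidate variant IDs that already appear in history, keeping order."""
    seen = already_tested(history_lines)
    return [variant for variant in candidates if variant not in seen]
```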

Reusable resources

  • `scripts/eval_sweep.py`: grid-search runner. Reads `eval_config.yaml`, writes results to `history.jsonl`.
  • `references/best-practices.md`: evidence-based RAG checklist the agent uses as an anchor.
  • `references/evidence-base.md`: pointers to recent RAG research and when each technique helps.
  • `assets/eval_config.template.yaml`: starter config to copy into the user's repo.
  • `assets/gold_set.template.jsonl`: 3 example Q&A pairs to show the gold-set format.

Notes

  • Cost is the main failure mode. Never run without a confirmed budget. Err on the side of smaller sweeps; users can always run again.
  • No repo mutation. All outputs go under `.rag-eval/` in the target repo.
  • When uncertain about best practices, do web research. Use `tavily-search` or `firecrawl-research` to pull current evidence, then synthesize into the audit report.
  • Defer to the user. Before changing any file in the target repo, always confirm.