nemo-retriever



The `retriever` CLI indexes a folder of PDFs into LanceDB (`retriever ingest`) and serves vector search over it (`retriever query`). For any task about searching or answering questions across a folder of PDFs, use this CLI; do not write a custom RAG.

Beyond PDFs and beyond semantic search. `retriever ingest` also handles images, Office, HTML, TXT, audio, and video: see `references/setup.md` for the per-format recipe and `references/install.md` for the install extras (`[multimedia]`, libreoffice, ffmpeg). For non-semantic operations (page filter, verbatim quote with citation, corpus-level aggregate, chart/image caption hits), see `references/query.md`. Don't fall back to native Read/Grep/Python on non-PDF inputs.

Install (if `retriever` is missing)


If `command -v retriever` returns nothing, follow `references/install.md` to install the NeMo Retriever Library before proceeding. The install prints `RETRIEVER_VENV=<path>`; substitute that path for `<RETRIEVER_VENV>` in every example in this skill (setup, query, troubleshooting, and the CLI references).
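
The check above can be sketched as a small shell guard. Only `command -v retriever` comes from the text; the helper name `retriever_status` is a hypothetical chosen for illustration.

```shell
# Report whether the retriever CLI is reachable on PATH.
# `retriever_status` is an illustrative helper name, not part of the CLI.
retriever_status() {
    if command -v retriever >/dev/null 2>&1; then
        echo "present"
    else
        # Nothing found: install via references/install.md before proceeding.
        echo "missing"
    fi
}

retriever_status
```

If this prints `missing`, run the install steps first and note the `RETRIEVER_VENV=<path>` line they print.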

Workflow — read the reference for the current phase, then execute


| Turn type | Read this once | Then execute |
| --- | --- | --- |
| Setup turn (first turn; `./lancedb/nv-ingest.lance` doesn't exist) | `references/setup.md` | Build the index |
| Query turn (every subsequent turn; user asks a question) | `references/query.md` | One `retriever query` call |
| Anything errored or returned empty | `references/troubleshooting.md` | Apply the named recovery; do not improvise |
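
The choice between the first two rows can be sketched as a single existence test on the index path. This is a minimal sketch; the helper name `turn_type` is hypothetical, and only the path `./lancedb/nv-ingest.lance` comes from the text.

```shell
# Decide which phase applies, based on whether the index path exists.
# `turn_type` is an illustrative helper name, not part of the CLI.
turn_type() {
    # $1: working directory that would contain ./lancedb (defaults to ".")
    if [ -e "${1:-.}/lancedb/nv-ingest.lance" ]; then
        echo "query"   # index exists: read references/query.md
    else
        echo "setup"   # no index yet: read references/setup.md
    fi
}

turn_type .
```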
For the full `retriever ingest` / `retriever query` CLI specs, see `references/cli/ingest.md` and `references/cli/query.md`. You do not need these for routine turns; `<RETRIEVER_VENV>/bin/retriever <subcommand> --help` is faster.
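
As a convenience, the help lookup can be wrapped in a small function. This is a sketch that assumes the `RETRIEVER_VENV` variable has been exported with the path printed by the install step; the helper name `retriever_help` is hypothetical.

```shell
# Print the built-in help for one subcommand.
# RETRIEVER_VENV must hold the path printed by the install step;
# `retriever_help` is an illustrative helper name, not part of the CLI.
retriever_help() {
    "${RETRIEVER_VENV:?set RETRIEVER_VENV to the install path}/bin/retriever" "$1" --help
}

# Usage:
#   retriever_help ingest
#   retriever_help query
```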
Before ingesting a mixed folder, inventory extensions (`find <dir> -name '*.*' | sed 's/.*\.//' | sort -u`): `--input-type=auto` silently drops anything outside the supported set. See "Unsupported file types" in `references/troubleshooting.md`.
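
A slightly stricter variant of that inventory is sketched below. It assumes you want regular files only and case-insensitive extensions; the helper name `list_extensions` is hypothetical, and the set of supported extensions is deliberately not encoded here because the text does not list it.

```shell
# List the distinct file extensions under a directory, lowercased.
# `list_extensions` is an illustrative helper name. Compared with the
# one-liner in the text it adds -type f, so directories with dots in
# their names are not counted, and folds case so PDF and pdf merge.
list_extensions() {
    find "$1" -type f -name '*.*' \
        | sed 's/.*\.//' \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u
}

# Usage:
#   list_extensions ./docs
```

Compare the result against the formats described in `references/setup.md` before running the ingest.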

Hard limits (apply to every turn)


  • Setup turn: build the index in one shell command (see `references/setup.md`). STOP after the index lands.
  • Query turn: at most 2 Bash calls (1 `retriever query`, plus 1 optional targeted text-extract per `references/query.md`). Reply and then STOP.
  • No narration between tool calls. Tokens you emit between calls become input + cached input for every later turn, so cost grows quadratically. Go straight from reading the summary to writing the JSON file.
  • Banned: `TodoWrite`, Glob, Grep, `Read` of whole PDFs, re-running setup, spawning subagents, speculative "confirmation" calls.
Long query turns (5+ tool calls, 1M+ cache-read tokens) cost ~5× a disciplined turn and almost always still produce the wrong answer. Answering partially beats timing out.
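
The quadratic growth mentioned in the narration rule can be illustrated with a rough model. It assumes each of n tool calls is preceded by the same t narration tokens, and that narration emitted before call i is re-read by each of the n − i later calls, giving t · n(n − 1)/2 re-read tokens in total. The figures and the helper name `reread_tokens` are illustrative assumptions, not measurements.

```shell
# Rough model of narration re-read cost (an assumption, not a measurement):
# total re-read tokens = t * n * (n - 1) / 2
# `reread_tokens` is an illustrative helper name.
reread_tokens() {
    # $1: narration tokens emitted before each call (t)
    # $2: number of tool calls in the turn (n)
    echo $(( $1 * $2 * ($2 - 1) / 2 ))
}

reread_tokens 200 2   # prints 200
reread_tokens 200 5   # prints 2000
```

Under this model, going from 2 calls to 5 multiplies the re-read narration by ten, which is why the limits above cap both narration and call count.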