nemo-retriever

The `retriever` CLI indexes a folder of PDFs into LanceDB (`retriever ingest`) and serves vector search over it (`retriever query`). For any task about searching or answering questions across a folder of PDFs, use this CLI; do not write a custom RAG.

Beyond PDFs and beyond semantic search. `retriever ingest` also handles images, Office, HTML, TXT, audio, and video: see `references/setup.md` for the per-format recipe and `references/install.md` for the install extras (`[multimedia]`, libreoffice, ffmpeg). For non-semantic operations (page filter, verbatim quote with citation, corpus-level aggregate, chart/image caption hits), see `references/query.md`. Don't fall back to native Read/Grep/Python on non-PDF inputs.

Install (if `retriever` is missing)

If `command -v retriever` returns nothing, follow `references/install.md` to install the NeMo Retriever Library before proceeding. The install prints `RETRIEVER_VENV=<path>`; substitute that path for `<RETRIEVER_VENV>` in every example in this skill (setup, query, troubleshooting, and the CLI references).
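
As a minimal sketch of the `command -v retriever` check, the function below reports whether the executable is on `PATH`. The function name `check_retriever` and the message wording are illustrative assumptions, not part of the skill.

```shell
# Report whether a `retriever` executable is on PATH.
# check_retriever is a hypothetical helper name used only for this sketch.
check_retriever() {
  if command -v retriever >/dev/null 2>&1; then
    # command -v prints the resolved path of the executable
    echo "found: $(command -v retriever)"
  else
    echo "missing: follow references/install.md and note the RETRIEVER_VENV=<path> line it prints"
    return 1
  fi
}

# The "|| true" keeps a missing binary from aborting a script run with "set -e".
check_retriever || true
```

A non-zero return status lets a calling script branch to the install step instead of parsing the message.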

Workflow: read the reference for the current phase, then execute

| Turn type | Read this once | Then execute |
|---|---|---|
| Setup turn (first turn) | `references/setup.md` | Build the index |
| Query turn (every subsequent turn: user asks a question) | `references/query.md` | One `retriever query` |
| Anything errored or returned empty | `references/troubleshooting.md` | Apply the named recovery; do not improvise |

For the full `retriever ingest` / `retriever query` CLI specs, see `references/cli/ingest.md` and `references/cli/query.md`. You do not need these for routine turns; `<RETRIEVER_VENV>/bin/retriever <subcommand> --help` is faster.

Before ingesting a mixed folder, inventory extensions (`find <dir> -name '*.*' | sed 's/.*\.//' | sort -u`); `--input-type=auto` silently drops anything outside the supported set. See "Unsupported file types" in `references/troubleshooting.md`.
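
One way to make the extension inventory easier to scan is to count files per extension. This sketch builds on the same `find`/`sed` pipeline; the function name `inventory_extensions`, the lowercasing step, and the example path in the comment are assumptions added for illustration.

```shell
# Count files per extension under a directory, most common first.
# inventory_extensions is a hypothetical helper name used only for this sketch.
inventory_extensions() {
  find "$1" -type f -name '*.*' |   # regular files whose name contains a dot
    sed 's/.*\.//' |                # keep only the text after the last dot
    tr '[:upper:]' '[:lower:]' |    # fold case so PDF and pdf are counted together
    sort | uniq -c | sort -rn       # count each extension, largest count first
}

# Example call (the path is a placeholder):
#   inventory_extensions ./my-folder
```

Any extension in the output that is not covered by the supported set is a candidate for the silent drop described above, so it is worth checking before ingesting.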

Hard limits (apply to every turn)

- Setup turn: build the index in one shell command (see `references/setup.md`). STOP after the index lands.
- Query turn: at most 2 Bash calls: 1 `retriever query`, +1 optional targeted text-extract per `references/query.md`. Reply and then STOP.
- No narration between tool calls. Tokens you emit between calls become input + cached input for every later turn, so the cost grows quadratically. Go straight from reading the summary to writing the JSON file.
- Banned: `TodoWrite`, Glob, Grep, `Read` of whole PDFs, re-running setup, spawning subagents, speculative "confirmation" calls.

Long query turns (5+ tool calls, 1M+ cache-read tokens) cost ~5× a disciplined turn and almost always still produce the wrong answer. Answering partially beats timing out.
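
The quadratic-cost point can be sanity-checked with a toy model. Assume, hypothetically, that a turn makes `n` tool calls, emits `k` narration tokens before each one, and that every later call re-reads all narration emitted so far. The function name and the numbers below are made up for illustration; they are not measurements.

```shell
# Toy model: total narration tokens re-read as input across one turn.
# Narration emitted before call i is re-read by each of the (n - i) later calls.
reread_tokens() {
  k=$1        # narration tokens emitted before each call (assumed constant)
  n=$2        # number of tool calls in the turn
  total=0
  i=1
  while [ "$i" -le "$n" ]; do
    total=$(( total + k * (n - i) ))
    i=$(( i + 1 ))
  done
  echo "$total"
}

reread_tokens 200 2    # prints 200
reread_tokens 200 10   # prints 9000
```

In this model, going from 2 to 10 calls multiplies the call count by 5 but the re-read narration by 45, which is why the limits above cap calls and ban narration.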