rag-blueprint

NVIDIA RAG Blueprint

Autonomy Principles

Auto-detect everything: GPU, VRAM, drivers, Docker, CUDA, disk, OS, ports, existing services, NGC key, repo state.
If it can be checked with a command, check it — don't ask the user.
Ask only when user action is required: providing an API key, confirming data deletion, or choosing between equally valid options.
Once analysis is done, route to the correct workflow and execute.

Intent Detection

Determine what the user wants and route immediately:

| User Intent | Action |
|---|---|
| Deploy, install, set up, start RAG | Read and follow `references/deploy.md` |
| Configure, enable, change, toggle a feature | Use the Configure section below |
| Troubleshoot, debug, fix, error, unhealthy | Read and follow `references/troubleshoot.md` |
| Stop, shutdown, tear down, clean up | Read and follow `references/shutdown.md` |

If the intent is ambiguous, infer from context (e.g., "RAG isn't working" → troubleshoot; "get RAG running" → deploy). Only ask if genuinely unclear.
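
The routing above can be approximated with a keyword match. This is only a minimal sketch: the helper name `route_intent` and the return labels `configure-section` and `ask-user` are hypothetical, and plain substring matching is an assumption that will misroute some requests that real context would resolve.

```shell
# Hypothetical helper: map a free-text request to the action in the intent table.
# Substring matching is an illustration only; real requests need context.
route_intent() {
  # Lowercase the request so matching is case-insensitive.
  request=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$request" in
    # Checked first so "RAG isn't working" is not caught by deploy keywords.
    *troubleshoot*|*debug*|*fix*|*error*|*unhealthy*|*"isn't working"*|*"not working"*)
      echo "references/troubleshoot.md" ;;
    *stop*|*shutdown*|*"shut down"*|*"tear down"*|*"clean up"*)
      echo "references/shutdown.md" ;;
    *configure*|*enable*|*change*|*toggle*)
      echo "configure-section" ;;
    *deploy*|*install*|*"set up"*|*start*|*running*)
      echo "references/deploy.md" ;;
    *)
      # Nothing matched: the intent is genuinely unclear, so ask.
      echo "ask-user" ;;
  esac
}

route_intent "RAG isn't working"   # prints: references/troubleshoot.md
route_intent "get RAG running"     # prints: references/deploy.md
```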

Configure

Requires a running RAG deployment. If services are not running, deploy first via `references/deploy.md`.

Match the user's request to a reference file, then read and follow it:

| Feature Keywords | Reference |
|---|---|
| VLM, VLM embeddings, image captioning | `references/configure/vlm.md` |
| NeMo Guardrails | `references/configure/guardrails.md` |
| Query rewriting, decomposition, multi-turn | `references/configure/query-and-conversation.md` |
| Ingestion (text-only, audio, Nemotron Parse, OCR, batch CLI, NV-Ingest, volume mount, performance) | `references/configure/ingestion.md` |
| Search, retrieval, hybrid search, multi-collection, metadata, filters, reranker, topK, accuracy/performance | `references/configure/search-and-retrieval.md` |
| LLM/embedding/ranking model changes, vector DB, Milvus/Elasticsearch auth, service keys, model profiles, ports/GPU | `references/configure/models-and-infrastructure.md` |
| Reasoning, self-reflection, prompts, generation params (tokens, temperature, citations), per-request LLM params | `references/configure/reasoning-and-generation.md` |
| Summarization | `references/configure/summarization.md` |
| Observability (tracing, Zipkin, Grafana, Prometheus) | `references/configure/observability.md` |
| Multimodal query (image + text) | `references/configure/multimodal-query.md` |
| Data catalog (collection/document metadata) | `references/configure/data-catalog.md` |
| User interface (UI settings) | `references/configure/user-interface.md` |
| API reference (endpoints, schemas) | `references/configure/api-reference.md` |
| Evaluation (RAGAS metrics) | `references/configure/evaluation.md` |
| MCP server & client, agent toolkit | `references/configure/mcp.md` |
| Migration (version upgrades) | `references/configure/migration.md` |
| Notebooks (setup and catalog) | `references/configure/notebooks.md` |

Configure Flow

Match the user's request to a reference file from the table above.

Detect what's running:

```bash
echo "=== NIM ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(nim-llm|nemoretriever-embedding|nemoretriever-ranking|nemo-vlm|nemotron-vlm)' || echo "NO_LOCAL_NIMS"
echo "=== RAG ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(rag-server|ingestor-server|milvus)' || echo "NO_DOCKER_RAG"
# grep . makes the pipeline fail when kubectl prints nothing, so the NO_K8S fallback can fire
echo "=== K8S ===" && kubectl get pods -n rag 2>/dev/null | head -5 | grep . || echo "NO_K8S"
echo "=== LIBRARY ===" && ps aux 2>/dev/null | grep -E '(nvidia_rag|uvicorn.*rag)' | grep -v grep || echo "NO_LIBRARY"
```

Use this table to determine platform, deployment type, and where config lives:

| Local NIMs running? | RAG services running? | Deployment Type | Config Location |
|---|---|---|---|
| Yes (Docker) | Any | Self-hosted | `deploy/compose/.env` |
| No | Yes (Docker) | NVIDIA-hosted | `deploy/compose/nvdev.env` |
| Yes (K8s pods) | Any | Self-hosted | `values.yaml` (NIM sections) |
| No | Yes (K8s pods) | NVIDIA-hosted | `values.yaml` (envVars) |
| — | Library processes | Library mode | `notebooks/config.yaml` |
| No | No | Not running | Deploy first via `references/deploy.md` |
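
The decision table can be read as a first-match lookup. The sketch below encodes it that way; `classify_deployment` and its argument values (`docker`, `k8s`, `library`, `none`) are hypothetical names chosen for illustration, and the `Type|Location` output format is an assumption.

```shell
# Hypothetical helper: apply the deployment decision table, first match wins.
# $1 = where local NIMs run:    docker | k8s | none
# $2 = where RAG services run:  docker | k8s | library | none
classify_deployment() {
  case "$1:$2" in
    docker:*)    echo "Self-hosted|deploy/compose/.env" ;;
    k8s:*)       echo "Self-hosted|values.yaml (NIM sections)" ;;
    none:docker) echo "NVIDIA-hosted|deploy/compose/nvdev.env" ;;
    none:k8s)    echo "NVIDIA-hosted|values.yaml (envVars)" ;;
    *:library)   echo "Library mode|notebooks/config.yaml" ;;
    none:none)   echo "Not running|references/deploy.md" ;;
    *)           echo "Unknown|" ;;   # input outside the table
  esac
}

classify_deployment docker docker   # prints: Self-hosted|deploy/compose/.env
classify_deployment none k8s        # prints: NVIDIA-hosted|values.yaml (envVars)
```

Local NIMs are matched before anything else, mirroring the "Any" in the second column of the self-hosted rows.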

Tell the user what you detected and ask to confirm. Example: "I see local NIM containers running (nim-llm-ms, nemoretriever-embedding-ms), so this is a self-hosted deployment. The config file is `deploy/compose/.env`. Correct?"

Check the current feature state before changing anything. Read the config file at the location identified in the table above (step 3), then cross-check the live service:
- Docker:
```
docker exec rag-server env 2>/dev/null | grep -E "<VAR_NAME>"
```
- Helm:
```
kubectl get pod -n rag -l app=rag-server -o jsonpath='{.items[0].spec.containers[0].env}' 2>/dev/null
```
If the config file and live service disagree, tell the user the service has stale config and will need a restart.
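
The comparison between file and live service can be sketched as below. The helpers `env_file_value` and `config_state` and the variable `ENABLE_EXAMPLE` are hypothetical, and the sketch assumes simple `KEY=VALUE` lines with an optional `export ` prefix and no inline comments.

```shell
# Hypothetical helper: read a variable from an env file, skipping commented-out
# lines and accepting an optional leading "export ". The last assignment wins.
# $1 = env file path, $2 = variable name
env_file_value() {
  sed -n "s/^\(export \)\{0,1\}$2=//p" "$1" | tail -n 1 | tr -d '"'
}

# Hypothetical helper: compare the file value with the live value.
# $1 = value from the config file, $2 = value seen in the running service
config_state() {
  if [ "$1" = "$2" ]; then echo "in-sync"; else echo "stale"; fi
}

# Example with a throwaway env file (ENABLE_EXAMPLE is a made-up name).
tmp=$(mktemp)
printf '%s\n' '# export ENABLE_EXAMPLE=False' 'export ENABLE_EXAMPLE="True"' > "$tmp"
file_value=$(env_file_value "$tmp" ENABLE_EXAMPLE)
# In a real check, live_value would come from the docker exec or kubectl command above.
live_value="False"
config_state "$file_value" "$live_value"   # prints: stale
rm -f "$tmp"
```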

If the feature needs extra GPUs, check availability against hardware restrictions (see below):

```bash
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader 2>/dev/null || echo "NO_GPU"
```
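
One way to turn that output into a list of available GPUs is to filter on used memory. This is a sketch under assumptions: the helper `free_gpus` is hypothetical, the 1024 MiB cutoff is an arbitrary illustration rather than a documented threshold, and the sample rows are made up.

```shell
# Hypothetical helper: read `--format=csv,noheader` lines on stdin and print the
# index of every GPU whose used memory is at or below a cutoff.
# $1 = maximum used memory in MiB for a GPU to count as free (assumed cutoff)
free_gpus() {
  awk -F', ' -v max_used="$1" '{
    used = $4 + 0            # "12 MiB" becomes 12; awk keeps the leading number
    if (used <= max_used) print $1
  }'
}

# Sample rows in the index, name, memory.total, memory.used layout (made-up values).
printf '%s\n' \
  '0, NVIDIA H100 80GB HBM3, 81559 MiB, 70211 MiB' \
  '1, NVIDIA H100 80GB HBM3, 81559 MiB, 12 MiB' | free_gpus 1024   # prints: 1
```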

Read the reference file and apply changes:

Docker: edit the env file (uncomment to enable, re-comment to disable — the env file is the source of truth). Then restart the affected service:

```bash
source <env-file> && docker compose -f deploy/compose/<compose-file> up -d
```

| Service | Compose File |
|---|---|
| rag-server | `docker-compose-rag-server.yaml` |
| ingestor-server | `docker-compose-ingestor-server.yaml` |
| milvus, etcd, minio | `vectordb.yaml` |
| NIM containers (LLM, embedding, ranking, VLM, OCR) | `nims.yaml` |
| guardrails | `docker-compose-nemo-guardrails.yaml` |
| observability (Grafana, Prometheus, Zipkin) | `observability.yaml` |

Helm: edit `values.yaml`, then upgrade:

```bash
helm upgrade rag <chart> -n rag -f values.yaml
```

Library: edit `notebooks/config.yaml`, then restart the Python process.

Verify:

Docker:

```bash
docker ps --format "table {{.Names}}\t{{.Status}}" | head -20
# Quote the URL so the shell does not treat "?" as a glob character
curl -s "http://localhost:8081/v1/health?check_dependencies=true" 2>/dev/null | head -1
```

Helm:

```bash
kubectl get pods -n rag
kubectl rollout status deployment/rag-server -n rag --timeout=120s
```

Library:

```bash
curl -s http://localhost:8081/v1/health 2>/dev/null | head -1
```

If the restart fails, read `references/troubleshoot.md`. If multiple features were requested, repeat from step 1 (matching the request to a reference file) for each.

When User Says "Configure" Without Specifics

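Run steps 2–3 of the Configure Flow above (detect what's running, then determine the deployment type and config location), then read the identified config file to list what's currently enabled:

```bash
grep -E "^(export )?(ENABLE_|APP_)" <config-file> 2>/dev/null | sort
```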

Summarize what's running and enabled, then ask which feature to change.

Hardware Restrictions

Read `docs/support-matrix.md` for current GPU requirements per deployment mode. Read `docs/service-port-gpu-reference.md` for port mappings and GPU assignments.

| GPU | Feature Restrictions |
|---|---|
| B200 | No VLM, No Guardrails, No Nemotron Parse. May need multi-GPU LLM (`LLM_MS_GPU_ID`). |
| RTX PRO 6000 | No Nemotron Parse. No Audio on Helm. |
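
The two restriction rows can be checked mechanically before enabling a feature. The sketch below covers only what the table states; `feature_allowed`, its feature labels, and the `restricted` / `not-listed` outputs are hypothetical, and `not-listed` does not mean supported, so `docs/support-matrix.md` remains the reference.

```shell
# Hypothetical helper: check a feature against the restriction table only.
# $1 = GPU name as reported by nvidia-smi
# $2 = feature: vlm | guardrails | nemotron-parse | audio   (labels are assumptions)
# $3 = deployment: docker | helm
feature_allowed() {
  case "$1:$2:$3" in
    *B200*:vlm:*|*B200*:guardrails:*|*B200*:nemotron-parse:*) echo "restricted" ;;
    *"RTX PRO 6000"*:nemotron-parse:*)                        echo "restricted" ;;
    *"RTX PRO 6000"*:audio:helm)                              echo "restricted" ;;
    *)                                                        echo "not-listed" ;;
  esac
}

feature_allowed "NVIDIA B200" vlm docker   # prints: restricted
```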
