rag-blueprint

NVIDIA RAG Blueprint

Autonomy Principles

  • Auto-detect everything: GPU, VRAM, drivers, Docker, CUDA, disk, OS, ports, existing services, NGC key, repo state.
  • If it can be checked with a command, check it — don't ask the user.
  • Ask only when user action is required: providing an API key, confirming data deletion, or choosing between equally valid options.
  • Once analysis is done, route to the correct workflow and execute.

Intent Detection

Determine what the user wants and route immediately:
| User Intent | Action |
|---|---|
| Deploy, install, set up, start RAG | Read and follow references/deploy.md |
| Configure, enable, change, toggle a feature | Use the Configure section below |
| Troubleshoot, debug, fix, error, unhealthy | Read and follow references/troubleshoot.md |
| Stop, shutdown, tear down, clean up | Read and follow references/shutdown.md |
If the intent is ambiguous, infer from context (e.g., "RAG isn't working" → troubleshoot; "get RAG running" → deploy). Only ask if genuinely unclear.
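
The routing rule above can be sketched as a small shell function. This is only an illustration: `route_intent` is a hypothetical helper, the keyword patterns are a rough reading of the table, and naive substring matching will misroute some phrases (for example, "restart" contains "start"). Troubleshooting patterns are checked first so that "RAG isn't working" is not mistaken for a deploy request.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a free-form request to the workflow from the intent table.
# Prints a reference file path, "configure", or "ask-user" when nothing matches.
route_intent() {
  local request
  # Lowercase the request so matching is case-insensitive.
  request=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$request" in
    *troubleshoot*|*debug*|*fix*|*error*|*unhealthy*|*"isn't working"*|*"not working"*)
      echo "references/troubleshoot.md" ;;
    *stop*|*shutdown*|*"shut down"*|*"tear down"*|*"clean up"*)
      echo "references/shutdown.md" ;;
    *deploy*|*install*|*"set up"*|*start*|*running*)
      echo "references/deploy.md" ;;
    *configure*|*enable*|*change*|*toggle*)
      echo "configure" ;;
    *)
      # Genuinely unclear: fall back to asking the user.
      echo "ask-user" ;;
  esac
}
```

For example, `route_intent "get RAG running"` selects the deploy workflow, and an unmatched request falls through to `ask-user`.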

Configure

Requires a running RAG deployment. If services are not running, deploy first via references/deploy.md.
Match the user's request to a reference file, then read and follow it:
| Feature Keywords | Reference |
|---|---|
| VLM, VLM embeddings, image captioning | references/configure/vlm.md |
| NeMo Guardrails | references/configure/guardrails.md |
| Query rewriting, decomposition, multi-turn | references/configure/query-and-conversation.md |
| Ingestion (text-only, audio, Nemotron Parse, OCR, batch CLI, NV-Ingest, volume mount, performance) | references/configure/ingestion.md |
| Search, retrieval, hybrid search, multi-collection, metadata, filters, reranker, topK, accuracy/performance | references/configure/search-and-retrieval.md |
| LLM/embedding/ranking model changes, vector DB, Milvus/Elasticsearch auth, service keys, model profiles, ports/GPU | references/configure/models-and-infrastructure.md |
| Reasoning, self-reflection, prompts, generation params (tokens, temperature, citations), per-request LLM params | references/configure/reasoning-and-generation.md |
| Summarization | references/configure/summarization.md |
| Observability (tracing, Zipkin, Grafana, Prometheus) | references/configure/observability.md |
| Multimodal query (image + text) | references/configure/multimodal-query.md |
| Data catalog (collection/document metadata) | references/configure/data-catalog.md |
| User interface (UI settings) | references/configure/user-interface.md |
| API reference (endpoints, schemas) | references/configure/api-reference.md |
| Evaluation (RAGAS metrics) | references/configure/evaluation.md |
| MCP server & client, agent toolkit | references/configure/mcp.md |
| Migration (version upgrades) | references/configure/migration.md |
| Notebooks (setup and catalog) | references/configure/notebooks.md |

Configure Flow

  1. Match the user's request to a reference file from the table above.
  2. Detect what's running:
    bash
    echo "=== NIM ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(nim-llm|nemoretriever-embedding|nemoretriever-ranking|nemo-vlm|nemotron-vlm)' || echo "NO_LOCAL_NIMS"
    echo "=== RAG ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(rag-server|ingestor-server|milvus)' || echo "NO_DOCKER_RAG"
    # grep . fails on empty output, so NO_K8S prints when kubectl is missing or no pods exist
    echo "=== K8S ===" && kubectl get pods -n rag 2>/dev/null | head -5 | grep . || echo "NO_K8S"
    echo "=== LIBRARY ===" && ps aux 2>/dev/null | grep -E '(nvidia_rag|uvicorn.*rag)' | grep -v grep || echo "NO_LIBRARY"
  3. Use this table to determine platform, deployment type, and where config lives:
    | Local NIMs running? | RAG services running? | Deployment Type | Config Location |
    |---|---|---|---|
    | Yes (Docker) | Any | Self-hosted | deploy/compose/.env |
    | No | Yes (Docker) | NVIDIA-hosted | deploy/compose/nvdev.env |
    | Yes (K8s pods) | Any | Self-hosted | values.yaml (NIM sections) |
    | No | Yes (K8s pods) | NVIDIA-hosted | values.yaml (envVars) |
    |  | Library processes | Library mode | notebooks/config.yaml |
    | No | No | Not running | Deploy first via references/deploy.md |
    Tell the user what you detected and ask to confirm. Example: "I see local NIM containers running (nim-llm-ms, nemoretriever-embedding-ms), so this is a self-hosted deployment. Config file is deploy/compose/.env. Correct?"
  4. Check current feature state before changing anything — read the config location from step 3, then cross-check the live service:
    • Docker:
      docker exec rag-server env 2>/dev/null | grep -E "<VAR_NAME>"
    • Helm:
      kubectl get pod -n rag -l app=rag-server -o jsonpath='{.items[0].spec.containers[0].env}' 2>/dev/null
    If the config file and live service disagree, tell the user the service has stale config and will need a restart.
  5. If the feature needs extra GPUs, check availability against hardware restrictions (see below):
    bash
    nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader 2>/dev/null || echo "NO_GPU"
  6. Read the reference file and apply changes:
    • Docker: edit the env file (uncomment to enable, re-comment to disable — the env file is the source of truth). Then restart the affected service:
      source <env-file> && docker compose -f deploy/compose/<compose-file> up -d
      | Service | Compose File |
      |---|---|
      | rag-server | docker-compose-rag-server.yaml |
      | ingestor-server | docker-compose-ingestor-server.yaml |
      | milvus, etcd, minio | vectordb.yaml |
      | NIM containers (LLM, embedding, ranking, VLM, OCR) | nims.yaml |
      | guardrails | docker-compose-nemo-guardrails.yaml |
      | observability (Grafana, Prometheus, Zipkin) | observability.yaml |
    • Helm: edit values.yaml, then upgrade:
      helm upgrade rag <chart> -n rag -f values.yaml
    • Library: edit notebooks/config.yaml, then restart the Python process.
  7. Verify:
    • Docker:
      docker ps --format "table {{.Names}}\t{{.Status}}" | head -20; curl -s 'http://localhost:8081/v1/health?check_dependencies=true' 2>/dev/null | head -1
    • Helm:
      kubectl get pods -n rag; kubectl rollout status deployment/rag-server -n rag --timeout=120s
    • Library:
      curl -s http://localhost:8081/v1/health 2>/dev/null | head -1
  8. If the restart fails, read references/troubleshoot.md. If multiple features were requested, repeat from step 1 for each.
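
Two pieces of the flow above lend themselves to a short sketch: the decision table in step 3 and the stale-config comparison in step 4. The function names, argument order, and the variable name `ENABLE_EXAMPLE_FEATURE` are hypothetical, and the table does not say what to do when both Docker and Kubernetes services are detected, so trying the rows top to bottom is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: function names and argument order are hypothetical.

# Apply the step 3 table to five detection results, each "yes" or "no".
# Arguments: docker_nims docker_rag k8s_nims k8s_rag library
# Assumption: rows are tried top to bottom, so Docker wins over K8s.
classify_deployment() {
  local docker_nims=$1 docker_rag=$2 k8s_nims=$3 k8s_rag=$4 library=$5
  if   [ "$docker_nims" = yes ]; then echo "self-hosted deploy/compose/.env"
  elif [ "$docker_rag"  = yes ]; then echo "nvidia-hosted deploy/compose/nvdev.env"
  elif [ "$k8s_nims"    = yes ]; then echo "self-hosted values.yaml"
  elif [ "$k8s_rag"     = yes ]; then echo "nvidia-hosted values.yaml"
  elif [ "$library"     = yes ]; then echo "library notebooks/config.yaml"
  else echo "not-running references/deploy.md"
  fi
}

# Print the value of the last uncommented assignment of VAR in an env file.
# Matches "VAR=value" and "export VAR=value"; removes any quote characters.
env_file_value() {
  local var=$1 file=$2
  sed -nE "s/^(export[[:space:]]+)?${var}=//p" "$file" | tail -n 1 | tr -d "\"'"
}

# Succeed (exit 0) when the file value and the live value differ, meaning the
# running service has stale config and needs a restart.
config_is_stale() {
  [ "$1" != "$2" ]
}

# Example wiring for a Docker deployment (not executed here):
#   file_value=$(env_file_value ENABLE_EXAMPLE_FEATURE deploy/compose/.env)
#   live_value=$(docker exec rag-server env | sed -n 's/^ENABLE_EXAMPLE_FEATURE=//p')
#   config_is_stale "$file_value" "$live_value" && echo "restart needed"
```

A commented-out line in the env file yields no value from `env_file_value`, which matches the "uncomment to enable, re-comment to disable" convention in step 6.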

When User Says "Configure" Without Specifics

Run steps 2–3 above, then read the identified config file to list what's currently enabled:
bash
grep -E "^(export )?(ENABLE_|APP_)" <config-file> 2>/dev/null | sort
Summarize what's running and enabled, then ask which feature to change.

Hardware Restrictions

Read docs/support-matrix.md for current GPU requirements per deployment mode. Read docs/service-port-gpu-reference.md for port mappings and GPU assignments.
| GPU | Feature Restrictions |
|---|---|
| B200 | No VLM, No Guardrails, No Nemotron Parse. May need multi-GPU LLM (LLM_MS_GPU_ID). |
| RTX PRO 6000 | No Nemotron Parse. No Audio on Helm. |
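
As a rough sketch, the GPU query from step 5 of the Configure Flow and the restriction table above could be combined like this. Both function names and the feature labels (`vlm`, `guardrails`, `nemotron-parse`, `audio`) are hypothetical, and matching the GPU model by substring of the reported name is an assumption. Treat docs/support-matrix.md as the authority; this only encodes the two rows listed here.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and feature labels are hypothetical.

# Read rows in the format printed by
#   nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader
# on stdin and print "<index> <free MiB>" for each GPU.
gpu_free_mib() {
  awk -F', ' '{
    total = $3; used = $4
    sub(/ MiB$/, "", total); sub(/ MiB$/, "", used)   # drop the unit suffix
    print $1, total - used
  }'
}

# Succeed (exit 0) when the restriction table rules the feature out on this GPU.
# Arguments: gpu_name feature [mode], where mode is "docker" (default) or "helm".
feature_restricted() {
  local gpu=$1 feature=$2 mode=${3:-docker}
  case "$gpu" in
    *B200*)
      case "$feature" in vlm|guardrails|nemotron-parse) return 0 ;; esac ;;
    *"RTX PRO 6000"*)
      [ "$feature" = nemotron-parse ] && return 0
      [ "$feature" = audio ] && [ "$mode" = helm ] && return 0 ;;
  esac
  return 1
}

# Example wiring (not executed here):
#   nvidia-smi --query-gpu=index,name,memory.total,memory.used \
#     --format=csv,noheader 2>/dev/null | gpu_free_mib
```

Checking `feature_restricted` before editing any config lets the agent explain a hardware limit up front instead of discovering it through a failed restart.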