nemoclaw-user-configure-inference

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Use a Local Inference Server

使用本地推理服务器

Gotchas

注意事项

Ollama is convenient for local chat, but some model/template combinations can return tool calls as plain text under realistic agent load.

Ollama便于进行本地聊天，但在实际的Agent负载下，部分模型/模板组合可能会以纯文本形式返回工具调用。

Prerequisites

前提条件

NemoClaw installed.
A local model server running, or a supported Ollama, vLLM, or NIM setup that the NemoClaw onboard wizard can use, start, or install.

NemoClaw can route inference to a model server running on your machine instead of a cloud API. This page covers Ollama, compatible-endpoint paths for other servers, and experimental managed options for vLLM and NVIDIA NIM.

All approaches use the same

inference.local

routing model. The agent inside the sandbox never connects to your model server directly. OpenShell intercepts inference traffic and forwards it to the local endpoint you configure.

已安装NemoClaw。
本地模型服务器正在运行，或已完成Ollama、vLLM或NIM的支持性配置，可供NemoClaw初始化向导使用、启动或安装。

NemoClaw可将推理请求路由至本地机器上运行的模型服务器，而非云API。本文档涵盖Ollama、其他服务器的兼容端点路径，以及vLLM和NVIDIA NIM的实验性托管选项。

所有方案均使用相同的

inference.local

路由模型。沙箱内的Agent不会直接连接到你的模型服务器。 OpenShell会拦截推理流量，并将其转发至你配置的本地端点。

Ollama

Ollama is the default local inference option. The onboard wizard detects Ollama automatically when it is installed or running on the host.

If Ollama is installed but not running, NemoClaw starts it for you. On macOS and Linux, the wizard can also offer to install Ollama when it is not present. On WSL, the wizard can use, start, restart, or install Ollama on the Windows host through PowerShell interop. On Debian and Ubuntu, the native Linux install path checks for

zstd

before it runs the Ollama installer. If

zstd

is missing, NemoClaw installs it with

apt-get

and explains the sudo prompt before continuing. On non-apt Linux distributions, install

zstd

first, then rerun onboarding.

Run the onboard wizard.

console

$ nemoclaw onboard

Select Local Ollama from the provider list. NemoClaw lists installed models or offers starter models if none are installed. On hosts with at least 32 GiB of detected GPU memory, the starter list includes

qwen3.6:35b

and selects it by default. It pulls the selected model, loads it into memory, and validates it before continuing. If the selected model declares that it does not support tool calling, onboarding stops with guidance to choose a model whose

ollama show <model>

capabilities include

tools

. The validation also requires structured chat-completions tool calls. If the model leaks tool-call JSON as plain message text, onboarding stops so you can choose a model that returns tool calls in the expected response field. On WSL, if you choose the Windows-host Ollama path, NemoClaw uses

host.docker.internal:11434

and pulls missing models through the Ollama HTTP API instead of requiring the

ollama

CLI inside WSL.

Ollama是默认的本地推理选项。当Ollama已安装或在主机上运行时，初始化向导会自动检测到它。

如果Ollama已安装但未运行，NemoClaw会为你启动它。在macOS和Linux系统中，若Ollama未安装，向导还可提供安装选项。在WSL环境中，向导可通过PowerShell交互功能，在Windows主机上使用、启动、重启或安装Ollama。在Debian和Ubuntu系统中，原生Linux安装路径会在运行Ollama安装程序前检查

zstd

是否存在。若缺少

zstd

，NemoClaw会使用

apt-get

进行安装，并在继续操作前解释sudo提示的用途。在非apt包管理的Linux发行版中，请先安装

zstd

，再重新运行初始化向导。

运行初始化向导：

console

$ nemoclaw onboard

从提供商列表中选择Local Ollama。 NemoClaw会列出已安装的模型；若未安装任何模型，则会提供入门模型选项。在检测到至少32 GiB GPU内存的主机上，入门模型列表会包含

qwen3.6:35b

并将其设为默认选项。它会拉取所选模型、加载至内存，并在继续操作前进行验证。若所选模型声明不支持工具调用，初始化会停止并给出指引，要求选择

ollama show <model>

能力包含

tools

的模型。验证过程还要求模型支持结构化聊天补全工具调用。若模型将工具调用JSON以纯消息文本形式返回，初始化会停止，以便你选择能在预期响应字段中返回工具调用的模型。在WSL环境中，若你选择Windows主机Ollama路径，NemoClaw会使用

host.docker.internal:11434

，并通过Ollama HTTP API拉取缺失的模型，无需在WSL内使用

ollama

CLI。

WSL with Windows-Host Ollama

WSL搭配Windows主机Ollama

When NemoClaw runs inside WSL, the provider menu can include Windows-host Ollama actions:

Use Ollama on Windows host when the Windows daemon is already reachable.
Restart Ollama on Windows host when the daemon is installed but only bound to Windows loopback.
Start Ollama on Windows host when Ollama is installed but not running.
Install Ollama on Windows host when Windows does not have Ollama installed.

The install and restart paths set

OLLAMA_HOST=0.0.0.0:11434

on the Windows side so Docker and WSL can reach the daemon through

host.docker.internal

. After an install or restart action, NemoClaw relaunches Ollama from the detected Windows tray app or verified

ollama.exe

path and waits until

host.docker.internal:11434

responds. If the daemon does not become reachable, onboarding prints PowerShell commands you can run to inspect the Windows-side process and port state. Use one Ollama instance on port

at a time. If both WSL and Windows-host Ollama are running, pick the intended menu entry during onboarding so NemoClaw validates and pulls models against the right daemon.

Warning:

Ollama is convenient for local chat, but some model/template combinations can return tool calls as plain text under realistic agent load. If the TUI shows raw JSON such as

{"name":"memory_search","arguments":{...}}

instead of running a tool, switch to vLLM with

--enable-auto-tool-choice

and the correct

--tool-call-parser

. See Tool-Calling Reliability (use the

nemoclaw-user-configure-inference

skill).

当NemoClaw在WSL内运行时，提供商菜单会包含Windows主机Ollama操作选项：

Use Ollama on Windows host：当Windows守护进程已可访问时选择。
Restart Ollama on Windows host：当守护进程已安装但仅绑定到Windows回环地址时选择。
Start Ollama on Windows host：当Ollama已安装但未运行时选择。
Install Ollama on Windows host：当Windows未安装Ollama时选择。

安装和重启操作会在Windows端设置

OLLAMA_HOST=0.0.0.0:11434

，以便Docker和WSL可通过

host.docker.internal

访问守护进程。完成安装或重启操作后，NemoClaw会从检测到的Windows托盘应用或已验证的

ollama.exe

路径重新启动Ollama，并等待

host.docker.internal:11434

响应。若守护进程无法访问，初始化会打印可用于检查Windows端进程和端口状态的PowerShell命令。同一时间仅使用一个运行在端口

的Ollama实例。若WSL和Windows主机均运行Ollama，请在初始化期间选择对应的菜单选项，以便NemoClaw针对正确的守护进程验证并拉取模型。

警告：

Ollama便于进行本地聊天，但在实际的Agent负载下，部分模型/模板组合可能会以纯文本形式返回工具调用。若TUI显示原始JSON（如

{"name":"memory_search","arguments":{...}}

）而非运行工具，请切换至启用

--enable-auto-tool-choice

和正确

--tool-call-parser

的vLLM。请参阅工具调用可靠性（使用

nemoclaw-user-configure-inference

技能）。

Authenticated Reverse Proxy

带认证的反向代理

On non-WSL hosts, NemoClaw keeps Ollama bound to

127.0.0.1:11434

and starts a token-gated reverse proxy on

0.0.0.0:11435

. The native install/start paths also reset NemoClaw-managed systemd launches to the loopback binding. Containers and other hosts on the local network reach Ollama only through the proxy, which validates a Bearer token before forwarding requests. On that native path, NemoClaw never exposes Ollama without authentication.

WSL Ollama paths do not use this proxy. Windows-host Ollama uses the Windows daemon through

host.docker.internal

For non-WSL Ollama setups, the onboard wizard manages the proxy automatically:

Generates a random 24-byte token on first run and stores it in
```
~/.nemoclaw/ollama-proxy-token
```
with
```
0600
```
permissions.
Starts the proxy after Ollama and verifies it before continuing.
Cleans up stale proxy processes from previous runs.
Probes the sandbox Docker network path to the proxy before committing the inference route.
Stops matching proxy processes during uninstall before deleting NemoClaw state.
Reuses the persisted token after a host reboot so you do not need to re-run onboard.

On native Linux hosts, a firewall can allow the host proxy health check while still blocking sandbox containers on the OpenShell Docker bridge. When the sandbox-side proxy probe fails with a TCP error, onboarding exits before it saves the inference route and prints a command like:

console

$ sudo ufw allow from <openshell-docker-subnet> to any port 11435 proto tcp
$ nemoclaw onboard

If the probe cannot run, for example because Docker Desktop or WSL uses a different host routing model, onboarding continues and relies on the regular proxy health check.

The sandbox provider is configured to use proxy port

with the generated token as its

OPENAI_API_KEY

credential. OpenShell's L7 proxy injects the token at egress, so the agent inside the sandbox never sees the token directly.

All proxy endpoints require the Bearer token, including

GET /api/tags

. Internal health and reachability checks run via the proxy treat any HTTP response (including

) as proof the proxy is alive — they only fail when nothing answers at all.

If Ollama is already running on a non-loopback address when you start onboard, the wizard restarts it on

127.0.0.1:11434

so the proxy is the only network path to the model server.

在非WSL主机上，NemoClaw会将Ollama绑定到

127.0.0.1:11434

，并在

0.0.0.0:11435

启动一个令牌门控反向代理。原生安装/启动路径还会将NemoClaw管理的systemd启动重置为回环绑定。本地网络中的容器和其他主机只能通过代理访问Ollama，代理会在转发请求前验证Bearer令牌。在该原生路径下，NemoClaw绝不会在无认证的情况下暴露Ollama。

WSL Ollama路径不使用此代理。 Windows主机Ollama通过

host.docker.internal

使用Windows守护进程。

对于非WSL Ollama设置，初始化向导会自动管理代理：

首次运行时生成随机的24字节令牌，并以
```
0600
```
权限存储在
```
~/.nemoclaw/ollama-proxy-token
```
中。
在Ollama启动后启动代理，并在继续操作前进行验证。
清理之前运行留下的过期代理进程。
在提交推理路由前，探测沙箱Docker网络到代理的路径。
在卸载期间，删除NemoClaw状态前停止匹配的代理进程。
主机重启后重用持久化令牌，无需重新运行初始化。

在原生Linux主机上，防火墙可能允许主机代理健康检查，但仍会阻止OpenShell Docker桥上的沙箱容器。当沙箱端代理探测因TCP错误失败时，初始化会在保存推理路由前退出，并打印如下命令：

console

$ sudo ufw allow from <openshell-docker-subnet> to any port 11435 proto tcp
$ nemoclaw onboard

若无法运行探测（例如Docker Desktop或WSL使用不同的主机路由模型），初始化会继续进行，并依赖常规代理健康检查。

沙箱提供商配置为使用代理端口

，并将生成的令牌作为其

OPENAI_API_KEY

凭据。 OpenShell的L7代理会在出口处注入令牌，因此沙箱内的Agent不会直接看到令牌。

所有代理端点均需要Bearer令牌，包括

GET /api/tags

。通过代理运行的内部健康和可达性检查会将任何HTTP响应（包括

）视为代理存活的证明——只有当完全无响应时才会失败。

若启动初始化时Ollama已运行在非回环地址，向导会将其重启至

127.0.0.1:11434

，以便代理是访问模型服务器的唯一网络路径。

GPU Memory Cleanup

GPU内存清理

When you switch away from Ollama, stop host services, or destroy an Ollama-backed sandbox, NemoClaw asks Ollama to unload currently loaded models from GPU memory. The cleanup sends

keep_alive: 0

for each model reported by Ollama and runs on a best-effort basis, so shutdown continues if Ollama is already stopped. This does not delete downloaded model files.

当你切换出Ollama、停止主机服务或销毁基于Ollama的沙箱时，NemoClaw会请求Ollama将当前加载的模型从GPU内存中卸载。清理操作会向Ollama报告的每个模型发送

keep_alive: 0

，且为尽力而为操作，因此即使Ollama已停止，关闭仍会继续。此操作不会删除已下载的模型文件。

Non-Interactive Setup

非交互式设置

console

$ NEMOCLAW_PROVIDER=ollama \
  NEMOCLAW_MODEL=qwen2.5:14b \
  nemoclaw onboard --non-interactive --yes

NEMOCLAW_MODEL

is not set, NemoClaw selects a default model based on available memory.

--yes

(or

NEMOCLAW_YES=1

) authorises the Ollama model download without an interactive confirmation prompt. Under

--non-interactive

--yes

(or

NEMOCLAW_YES=1

) is required to authorise the download — onboard exits otherwise, since it cannot prompt. Run onboard without

--non-interactive

to get the interactive

[y/N]

prompt that shows the model size before downloading.

Variable	Purpose
`NEMOCLAW_PROVIDER`	Set to `ollama` .
`NEMOCLAW_MODEL`	Ollama model tag to use. Optional.
`NEMOCLAW_YES`	Set to `1` to auto-accept the model-download confirmation prompt. Optional.

console

$ NEMOCLAW_PROVIDER=ollama \
  NEMOCLAW_MODEL=qwen2.5:14b \
  nemoclaw onboard --non-interactive --yes

若未设置

NEMOCLAW_MODEL

，NemoClaw会根据可用内存选择默认模型。

--yes

（或

NEMOCLAW_YES=1

）会自动授权Ollama模型下载，无需交互式确认提示。在

--non-interactive

模式下，必须使用

--yes

（或

NEMOCLAW_YES=1

）才能授权下载——否则初始化会退出，因为无法进行提示。若要在下载前显示模型大小并获取交互式

[y/N]

提示，请不带

--non-interactive

运行初始化。

变量	用途
`NEMOCLAW_PROVIDER`	设置为 `ollama` 。
`NEMOCLAW_MODEL`	要使用的Ollama模型标签。可选。
`NEMOCLAW_YES`	设置为 `1` 以自动接受模型下载确认提示。可选。

OpenAI-Compatible Server

兼容OpenAI的服务器

This option works with any server that implements

/v1/chat/completions

, including vLLM, TensorRT-LLM, llama.cpp, LocalAI, and others. For compatible endpoints, NemoClaw uses

/v1/chat/completions

by default. This avoids a class of failures where local backends accept

/v1/responses

requests but silently drop the system prompt and tool definitions. To opt in to

/v1/responses

, set

NEMOCLAW_PREFERRED_API=openai-responses

before running onboard.

Start your model server. The examples below use vLLM, but any OpenAI-compatible server works.

console

$ vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Run the onboard wizard.

console

$ nemoclaw onboard

When the wizard asks you to choose an inference provider, select Other OpenAI-compatible endpoint. Enter the base URL of your local server, for example

http://localhost:8000/v1

The wizard prompts for an API key. If your server does not require authentication, enter any non-empty string (for example,

dummy

NemoClaw validates the endpoint by sending a test inference request before continuing. The wizard probes

/v1/chat/completions

by default for the compatible-endpoint provider. If you set

NEMOCLAW_PREFERRED_API=openai-responses

, NemoClaw probes

/v1/responses

instead and only selects it when the response includes the streaming events OpenClaw requires. If a reasoning model returns only reasoning content before producing a final answer, NemoClaw retries the smoke request with a larger response budget. Route, configuration, and authentication failures still fail immediately.

此选项适用于任何实现

/v1/chat/completions

的服务器，包括vLLM、TensorRT-LLM、llama.cpp、LocalAI等。对于兼容端点，NemoClaw默认使用

/v1/chat/completions

。这可避免一类故障：本地后端接受

/v1/responses

请求，但会静默丢弃系统提示和工具定义。若要选择

/v1/responses

，请在运行初始化前设置

NEMOCLAW_PREFERRED_API=openai-responses

。

启动你的模型服务器。以下示例使用vLLM，但任何兼容OpenAI的服务器均可。

console

$ vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

运行初始化向导：

console

$ nemoclaw onboard

当向导要求你选择推理提供商时，选择Other OpenAI-compatible endpoint。输入本地服务器的基础URL，例如

http://localhost:8000/v1

。

向导会提示输入API密钥。若你的服务器无需认证，输入任意非空字符串（例如

dummy

）即可。

NemoClaw会通过发送测试推理请求来验证端点，然后再继续操作。对于兼容端点提供商，向导默认探测

/v1/chat/completions

。若你设置了

NEMOCLAW_PREFERRED_API=openai-responses

，NemoClaw会改为探测

/v1/responses

，仅当响应包含OpenClaw所需的流事件时才会选择该路径。若推理模型在生成最终答案前仅返回推理内容，NemoClaw会使用更大的响应预算重试探测请求。路由、配置和认证失败会立即导致初始化失败。

Non-Interactive Setup

非交互式设置

Set the following environment variables for scripted or CI/CD deployments.

console

$ NEMOCLAW_PROVIDER=custom \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8000/v1 \
  NEMOCLAW_MODEL=meta-llama/Llama-3.1-8B-Instruct \
  COMPATIBLE_API_KEY=dummy \
  nemoclaw onboard --non-interactive

Variable	Purpose
`NEMOCLAW_PROVIDER`	Set to `custom` for an OpenAI-compatible endpoint.
`NEMOCLAW_ENDPOINT_URL`	Base URL of the local server.
`NEMOCLAW_MODEL`	Model ID as reported by the server.
`COMPATIBLE_API_KEY`	API key for the endpoint. Use any non-empty value if authentication is not required.

设置以下环境变量以支持脚本化或CI/CD部署。

console

$ NEMOCLAW_PROVIDER=custom \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8000/v1 \
  NEMOCLAW_MODEL=meta-llama/Llama-3.1-8B-Instruct \
  COMPATIBLE_API_KEY=dummy \
  nemoclaw onboard --non-interactive

变量	用途
`NEMOCLAW_PROVIDER`	对于兼容OpenAI的端点，设置为 `custom` 。
`NEMOCLAW_ENDPOINT_URL`	本地服务器的基础URL。
`NEMOCLAW_MODEL`	服务器报告的模型ID。
`COMPATIBLE_API_KEY`	端点的API密钥。若无需认证，使用任意非空值即可。

Selecting the API Path

选择API路径

For the compatible-endpoint provider,

/v1/chat/completions

is the default. NemoClaw tests streaming events during onboarding and uses chat completions without probing the Responses API.

To opt in to

/v1/responses

, set

NEMOCLAW_PREFERRED_API

before running onboard:

console

$ NEMOCLAW_PREFERRED_API=openai-responses nemoclaw onboard

The wizard then probes

/v1/responses

and only selects it when streaming support is complete. If the probe fails, the wizard falls back to

/v1/chat/completions

automatically. You can use this variable in both interactive and non-interactive mode.

Variable	Values	Default
`NEMOCLAW_PREFERRED_API`	`openai-completions` , `openai-responses`	`openai-completions` for compatible endpoints

If you already onboarded and the sandbox is failing at runtime, re-run

nemoclaw onboard

to re-probe the endpoint and bake the correct API path into the image. Refer to Switch Inference Models (use the

nemoclaw-user-configure-inference

skill) for details.

对于兼容端点提供商，默认路径为

/v1/chat/completions

。 NemoClaw会在初始化期间测试流事件，并使用聊天补全，无需探测Responses API。

若要选择

/v1/responses

，请在运行初始化前设置

NEMOCLAW_PREFERRED_API

：

console

$ NEMOCLAW_PREFERRED_API=openai-responses nemoclaw onboard

向导会探测

/v1/responses

，仅当完整支持流功能时才会选择该路径。若探测失败，向导会自动回退到

/v1/chat/completions

。此变量可用于交互式和非交互式模式。

变量	取值	默认值
`NEMOCLAW_PREFERRED_API`	`openai-completions` , `openai-responses`	兼容端点默认值为 `openai-completions`

若你已完成初始化但沙箱在运行时失败，请重新运行

nemoclaw onboard

以重新探测端点，并将正确的API路径嵌入镜像。详情请参阅切换推理模型（使用

nemoclaw-user-configure-inference

技能）。

Anthropic-Compatible Server

兼容Anthropic的服务器

If your local server implements the Anthropic Messages API (

/v1/messages

), choose Other Anthropic-compatible endpoint during onboarding instead.

console

$ nemoclaw onboard

For non-interactive setup, use

NEMOCLAW_PROVIDER=anthropicCompatible

and set

COMPATIBLE_ANTHROPIC_API_KEY

console

$ NEMOCLAW_PROVIDER=anthropicCompatible \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8080 \
  NEMOCLAW_MODEL=my-model \
  COMPATIBLE_ANTHROPIC_API_KEY=dummy \
  nemoclaw onboard --non-interactive

若你的本地服务器实现了Anthropic Messages API (

/v1/messages

)，请在初始化期间选择Other Anthropic-compatible endpoint。

console

$ nemoclaw onboard

对于非交互式设置，请使用

NEMOCLAW_PROVIDER=anthropicCompatible

并设置

COMPATIBLE_ANTHROPIC_API_KEY

。

console

$ NEMOCLAW_PROVIDER=anthropicCompatible \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8080 \
  NEMOCLAW_MODEL=my-model \
  COMPATIBLE_ANTHROPIC_API_KEY=dummy \
  nemoclaw onboard --non-interactive

vLLM

When vLLM is already running on

localhost:8000

, NemoClaw can detect it automatically and query the

/v1/models

endpoint to determine the loaded model. On supported Linux hosts with NVIDIA GPUs, the onboard wizard can also install or start a managed vLLM container for you.

For an already-running vLLM server, run

nemoclaw onboard

and select Local vLLM [experimental] from the provider list.

console

$ nemoclaw onboard

If vLLM is already running, NemoClaw detects the running model and validates the endpoint. If vLLM is not running and your host matches a DGX Spark or DGX Station managed profile, NemoClaw shows the Install vLLM or Start vLLM entry by default. Generic Linux NVIDIA GPU hosts still require

NEMOCLAW_EXPERIMENTAL=1

NEMOCLAW_PROVIDER=install-vllm

before the managed entry appears. NemoClaw pulls the vLLM image, downloads model weights into

~/.cache/huggingface

, starts the

nemoclaw-vllm

container on

localhost:8000

, and prints progress markers while the model loads. The first run can take 10 to 30 minutes. Later runs reuse the cached image and model weights.

Managed vLLM uses these profiles:

Host profile	Default model
DGX Spark	`Qwen/Qwen3.6-27B-FP8`
DGX Station	`Qwen/Qwen3.6-27B-FP8`
Linux with an NVIDIA GPU	`nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8`

Note:

NemoClaw forces the

chat/completions

API path for vLLM. The vLLM

/v1/responses

endpoint does not run the

--tool-call-parser

, so tool calls arrive as raw text.

当vLLM已在

localhost:8000

运行时，NemoClaw可自动检测到它，并查询

/v1/models

端点以确定已加载的模型。在支持NVIDIA GPU的Linux主机上，初始化向导还可为你安装或启动托管的vLLM容器。

对于已运行的vLLM服务器，运行

nemoclaw onboard

并从提供商列表中选择Local vLLM [experimental]。

console

$ nemoclaw onboard

若vLLM已运行，NemoClaw会检测到运行中的模型并验证端点。若vLLM未运行且你的主机匹配DGX Spark或DGX Station托管配置文件，NemoClaw会默认显示Install vLLM或Start vLLM选项。通用Linux NVIDIA GPU主机仍需设置

NEMOCLAW_EXPERIMENTAL=1

或

NEMOCLAW_PROVIDER=install-vllm

，才会显示托管选项。 NemoClaw会拉取vLLM镜像、将模型权重下载至

~/.cache/huggingface

、在

localhost:8000

启动

nemoclaw-vllm

容器，并在模型加载期间打印进度标记。首次运行可能需要10至30分钟。后续运行会重用缓存的镜像和模型权重。

托管vLLM使用以下配置文件：

主机配置文件	默认模型
DGX Spark	`Qwen/Qwen3.6-27B-FP8`
DGX Station	`Qwen/Qwen3.6-27B-FP8`
带NVIDIA GPU的Linux	`nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8`

注意：

NemoClaw强制vLLM使用

chat/completions

API路径。 vLLM的

/v1/responses

端点不会运行

--tool-call-parser

，因此工具调用会以原始文本形式返回。

Non-Interactive Setup

非交互式设置

Use an already-running vLLM server:

console

$ NEMOCLAW_PROVIDER=vllm \
  nemoclaw onboard --non-interactive

Install or start managed vLLM when a supported profile is detected. On DGX Spark and DGX Station,

NEMOCLAW_PROVIDER=install-vllm

is enough for non-interactive runs; add

NEMOCLAW_EXPERIMENTAL=1

on generic Linux NVIDIA GPU hosts.

console

$ NEMOCLAW_PROVIDER=install-vllm \
  nemoclaw onboard --non-interactive

NemoClaw records the model returned by vLLM's

/v1/models

endpoint. Start vLLM with the model you want before onboarding if you manage the server yourself.

使用已运行的vLLM服务器：

console

$ NEMOCLAW_PROVIDER=vllm \
  nemoclaw onboard --non-interactive

当检测到支持的配置文件时，安装或启动托管vLLM。在DGX Spark和DGX Station上，

NEMOCLAW_PROVIDER=install-vllm

即可支持非交互式运行；在通用Linux NVIDIA GPU主机上，需添加

NEMOCLAW_EXPERIMENTAL=1

。

console

$ NEMOCLAW_PROVIDER=install-vllm \
  nemoclaw onboard --non-interactive

NemoClaw会记录vLLM的

/v1/models

端点返回的模型。若你自行管理服务器，请在初始化前启动所需的模型。

Override the Managed-vLLM Model

覆盖托管vLLM的模型

Managed vLLM serves the profile default unless you select a different registry entry. Export

NEMOCLAW_VLLM_MODEL=<slug>

before invoking the installer to choose a different model from the registry. NemoClaw uses the matching

vllm serve

flags, including the reasoning parser, tool-call parser, and

--max-model-len

. Recognised slugs:

Slug	Hugging Face model	Notes
`qwen3.6-27b`	`Qwen/Qwen3.6-27B-FP8`	Default on DGX Spark and DGX Station profiles
`nemotron-3-nano-4b`	`nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8`	Default on the generic Linux + NVIDIA GPU profile
`deepseek-r1-distill-70b`	`deepseek-ai/DeepSeek-R1-Distill-Llama-70B`	Gated. Requires Hugging Face license acceptance

The slug is case-insensitive; the full Hugging Face id is also accepted. An unrecognised value fails fast with a list of valid slugs.

Gated models require a Hugging Face token; export it before onboarding so NemoClaw can forward it into the managed vLLM container:

console

$ export HF_TOKEN=<your-hf-token>
$ NEMOCLAW_PROVIDER=install-vllm \
  NEMOCLAW_VLLM_MODEL=deepseek-r1-distill-70b \
  nemoclaw onboard --non-interactive

HUGGING_FACE_HUB_TOKEN

is accepted as an alternative. The token check runs on the host before any docker pull, so a missing or empty token aborts onboarding before bandwidth is spent on a 401.

托管vLLM会使用配置文件默认模型，除非你选择不同的注册表条目。在调用安装程序前导出

NEMOCLAW_VLLM_MODEL=<slug>

，即可从注册表中选择不同的模型。 NemoClaw会使用匹配的

vllm serve

标志，包括推理解析器、工具调用解析器和

--max-model-len

。可识别的slug：

Slug	Hugging Face模型	说明
`qwen3.6-27b`	`Qwen/Qwen3.6-27B-FP8`	DGX Spark和DGX Station配置文件的默认模型
`nemotron-3-nano-4b`	`nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8`	通用Linux + NVIDIA GPU配置文件的默认模型
`deepseek-r1-distill-70b`	`deepseek-ai/DeepSeek-R1-Distill-Llama-70B`	gated模型，需接受Hugging Face许可

slug不区分大小写；也可接受完整的Hugging Face ID。若输入无法识别的值，会快速失败并列出有效的slug。

Gated模型需要Hugging Face令牌；请在初始化前导出令牌，以便NemoClaw可将其转发至托管vLLM容器：

console

$ export HF_TOKEN=<your-hf-token>
$ NEMOCLAW_PROVIDER=install-vllm \
  NEMOCLAW_VLLM_MODEL=deepseek-r1-distill-70b \
  nemoclaw onboard --non-interactive

HUGGING_FACE_HUB_TOKEN

可作为替代令牌。令牌检查会在主机上运行，之后才会进行docker拉取，因此若令牌缺失或为空，会在因401错误浪费带宽前终止初始化。

NVIDIA NIM (Experimental)

NVIDIA NIM（实验性）

NemoClaw can pull, start, and manage a NIM container on hosts with a NIM-capable NVIDIA GPU.

Set the experimental flag and run onboard.

console

$ NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard

Select Local NVIDIA NIM [experimental] from the provider list. NemoClaw filters available models by GPU VRAM, pulls the NIM container image, starts it, and waits for it to become healthy before continuing. On hosts with mixed NVIDIA GPU models, the preflight summary shows each detected GPU model and the total VRAM so you can confirm which device class the model selection used.

NIM container images are hosted on

nvcr.io

and require NGC registry authentication before

docker pull

succeeds. If Docker is not already logged in to

nvcr.io

, onboard prompts for an NGC API key and runs

docker login nvcr.io

over

--password-stdin

so the key is never written to disk or shell history. The prompt masks the key during input and retries once on a bad key before failing. In non-interactive mode, onboard exits with login instructions if Docker is not already authenticated; run

docker login nvcr.io

yourself, then re-run

nemoclaw onboard --non-interactive

. If

NGC_API_KEY

NVIDIA_API_KEY

is already exported, NemoClaw passes it into the managed NIM container through the process environment instead of command-line arguments. If the NIM container exits before the health endpoint becomes ready, onboarding stops early and prints the last container log lines.

Note:

NIM uses vLLM internally. The same

chat/completions

API path restriction applies.

NemoClaw可在具备NIM兼容NVIDIA GPU的主机上拉取、启动和管理NIM容器。

设置实验性标志并运行初始化：

console

$ NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard

从提供商列表中选择Local NVIDIA NIM [experimental]。 NemoClaw会根据GPU显存筛选可用模型、拉取NIM容器镜像、启动容器，并在继续操作前等待容器进入健康状态。在混合NVIDIA GPU模型的主机上，预检摘要会显示每个检测到的GPU模型和总显存，以便你确认模型选择使用的设备类别。

NIM容器镜像托管在

nvcr.io

，需要NGC注册表认证才能成功执行

docker pull

。若Docker尚未登录

nvcr.io

，初始化会提示输入NGC API密钥，并通过

--password-stdin

运行

docker login nvcr.io

，确保密钥不会写入磁盘或shell历史。输入时会掩码密钥，若密钥错误会重试一次后失败。在非交互式模式下，若Docker未认证，初始化会退出并给出登录指引；请自行运行

docker login nvcr.io

，再重新运行

nemoclaw onboard --non-interactive

。若已导出

NGC_API_KEY

或

NVIDIA_API_KEY

，NemoClaw会通过进程环境将其传递至托管NIM容器，而非命令行参数。若NIM容器在健康端点就绪前退出，初始化会提前停止并打印容器的最后日志行。

注意：

NIM内部使用vLLM。同样适用

chat/completions

API路径限制。

Non-Interactive Setup

非交互式设置

console

$ NEMOCLAW_EXPERIMENTAL=1 \
  NEMOCLAW_PROVIDER=nim \
  nemoclaw onboard --non-interactive

To select a specific model, set

NEMOCLAW_MODEL

console

$ NEMOCLAW_EXPERIMENTAL=1 \
  NEMOCLAW_PROVIDER=nim \
  nemoclaw onboard --non-interactive

若要选择特定模型，请设置

NEMOCLAW_MODEL

。

Timeout Configuration

超时配置

Local inference requests use a default timeout of 180 seconds. Large prompts on hardware such as DGX Spark can exceed shorter timeouts, so NemoClaw sets a higher default for Ollama, vLLM, NIM, and compatible-endpoint setup.

To override the timeout, set the

NEMOCLAW_LOCAL_INFERENCE_TIMEOUT

environment variable before onboarding:

console

$ export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
$ nemoclaw onboard

The value is in seconds. This setting is baked into the sandbox at build time. Changing it after onboarding requires re-running

nemoclaw onboard

NEMOCLAW_LOCAL_INFERENCE_TIMEOUT

only governs the inference-server validation probe. The post-create readiness wait (image build, gateway upload, in-sandbox boot) has its own budget,

NEMOCLAW_SANDBOX_READY_TIMEOUT

, also defaulting to 180 seconds. On hosts where the sandbox image takes minutes to build or upload — large quantised models, DGX Station first runs, or remote VMs over a slow link — raise both together:

console

$ export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
$ export NEMOCLAW_SANDBOX_READY_TIMEOUT=600
$ nemoclaw onboard

If onboard ends with

Sandbox '<name>' was created but did not become ready within 180s

, refer to Troubleshooting (use the

nemoclaw-user-reference

skill).

本地推理请求的默认超时时间为180秒。在DGX Spark等硬件上处理大型提示可能会超出较短的超时时间，因此NemoClaw为Ollama、vLLM、NIM和兼容端点设置了更高的默认超时。

若要覆盖超时时间，请在初始化前设置

NEMOCLAW_LOCAL_INFERENCE_TIMEOUT

环境变量：

console

$ export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
$ nemoclaw onboard

取值单位为秒。此设置会在构建时嵌入沙箱。若要在初始化后更改，需重新运行

nemoclaw onboard

。

NEMOCLAW_LOCAL_INFERENCE_TIMEOUT

仅控制推理服务器验证探测的超时。创建后的就绪等待（镜像构建、网关上传、沙箱内启动）有其自己的预算

NEMOCLAW_SANDBOX_READY_TIMEOUT

，默认也为180秒。在沙箱镜像构建或上传需要数分钟的主机上（如大量化模型、DGX Station首次运行、或慢速链路下的远程VM），请同时提高两个超时值：

console

$ export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
$ export NEMOCLAW_SANDBOX_READY_TIMEOUT=600
$ nemoclaw onboard

若初始化结束时显示

Sandbox '<name>' was created but did not become ready within 180s

，请参阅故障排除（使用

nemoclaw-user-reference

技能）。

Verify the Configuration

验证配置

After onboarding completes, confirm the active provider and model.

console

$ nemoclaw <name> status

The output shows the provider label (for example, "Local vLLM" or "Other OpenAI-compatible endpoint") and the active model. For Local Ollama, status also checks the authenticated proxy when a proxy token is available. If

Inference

is healthy but

Inference (auth proxy)

is not, rerun onboarding to repair the proxy path that sandbox requests use.

初始化完成后，确认当前活跃的提供商和模型：

console

$ nemoclaw <name> status

输出会显示提供商标签（例如"Local vLLM"或"Other OpenAI-compatible endpoint"）以及活跃模型。对于Local Ollama，当存在代理令牌时，状态还会检查认证代理。若

Inference

健康但

Inference (auth proxy)

不健康，请重新运行初始化以修复沙箱请求使用的代理路径。

Switch Models at Runtime

运行时切换模型

You can change the model without re-running onboard. Refer to Switch Inference Models (use the

nemoclaw-user-configure-inference

skill) for the full procedure.

For compatible endpoints, the command is:

console

$ nemoclaw inference set --provider compatible-endpoint --model <model-name>

If the provider itself needs to change (for example, switching from vLLM to a cloud API), pass the new provider to

nemoclaw inference set

你无需重新运行初始化即可更改模型。完整流程请参阅切换推理模型（使用

nemoclaw-user-configure-inference

技能）。

对于兼容端点，命令为：

console

$ nemoclaw inference set --provider compatible-endpoint --model <model-name>

若需要更改提供商本身（例如从vLLM切换到云API），请将新提供商传递给

nemoclaw inference set

。

References

参考资料

Load references/switch-inference-providers.md when switching inference providers, changing the model runtime, or reconfiguring inference routing. Changes the active inference model without restarting the sandbox.
Load references/set-up-sub-agent.md when users ask how to add a second model, configure a sub-agent model, use Omni for vision tasks, configure agents.list, or use sessions_spawn in NemoClaw. Shows the NemoClaw-specific file paths and update flow for adding an auxiliary OpenClaw sub-agent model.
references/tool-calling-reliability.md — Explains Ollama tool-call leak symptoms, when vLLM with a tool-call parser is recommended, and how to repoint NemoClaw to a parser-aware local endpoint.
Load references/inference-options.md when explaining which providers are available, what the onboard wizard presents, or how inference routing works. Lists all inference providers offered during NemoClaw onboarding.

加载references/switch-inference-providers.md：适用于切换推理提供商、更改模型运行时或重新配置推理路由的场景。无需重启沙箱即可更改活跃推理模型。
加载references/set-up-sub-agent.md：适用于用户询问如何添加第二个模型、配置子Agent模型、使用Omni处理视觉任务、配置agents.list或在NemoClaw中使用sessions_spawn的场景。展示了NemoClaw特定的文件路径和添加辅助OpenClaw子Agent模型的更新流程。
references/tool-calling-reliability.md — 解释Ollama工具调用泄露症状、何时推荐使用带工具调用解析器的vLLM，以及如何将NemoClaw重新指向支持解析器的本地端点。
加载references/inference-options.md：适用于解释可用提供商、初始化向导展示内容或推理路由工作原理的场景。列出NemoClaw初始化期间提供的所有推理提供商。

nemoclaw-user-configure-inference

Original

Translation

Use a Local Inference Server

使用本地推理服务器

Gotchas

注意事项

Prerequisites

前提条件

Ollama

Ollama

WSL with Windows-Host Ollama

WSL搭配Windows主机Ollama

Authenticated Reverse Proxy

带认证的反向代理

GPU Memory Cleanup

GPU内存清理

Non-Interactive Setup

非交互式设置

OpenAI-Compatible Server

兼容OpenAI的服务器

Non-Interactive Setup

非交互式设置

Selecting the API Path

选择API路径

Anthropic-Compatible Server

兼容Anthropic的服务器

vLLM

vLLM

Non-Interactive Setup

非交互式设置

Override the Managed-vLLM Model

覆盖托管vLLM的模型

NVIDIA NIM (Experimental)

NVIDIA NIM（实验性）

Non-Interactive Setup

非交互式设置

Timeout Configuration

超时配置

Verify the Configuration

验证配置

Switch Models at Runtime

运行时切换模型

References

参考资料

Related Skills

相关技能