tao-run-inference-service

TAO Inference Microservice

Instructions

To start an inference service:

1. Collect required inputs (Section 1) and resolve the container image (Section 2).
2. Build the job payload and inner command (Sections 3–4.1); use `references/code-templates.yaml` → `job_payload_builder`.
3. Read `skills/platform/<platform>/SKILL.md` and start the container (Section 4.2).
4. Write the service registry and poll readiness (Section 4.3); use `references/code-templates.yaml` → `registry_write.<platform>` and `readiness_check`.

To send an inference request:

1. Resolve which service receives the request per Section 6.0 (by `job_id`, by `network_arch`, or by explicit user choice when multiple services run; never silently default to `"latest"` when more than one service exists), then read the endpoint from `references/code-templates.yaml` → `request.registry_read` with the resolved `job_id`.
2. Before building the request body, prompt the user for the vLLM-style sampling parameters (Section 6.1). Present `max_tokens`, `top_p`, `temperature` (and any per-arch extras) with their defaults; let the user override or skip each one to accept the default. Never silently use defaults.
3. Build and send the body per Section 6.2; handle the response per Section 6.3.

To stop a service: Read `references/code-templates.yaml` → `stop.registry_read` to resolve the `job_id`, read `skills/platform/<platform>/SKILL.md`, then follow Section 5.

Reference data (schemas, mappings, valid values — no instructions):

- `references/service.yaml`: image mappings, valid `network_arch` names, job payload schema, env var names, secrets classification.
- `references/request.yaml`: endpoint definition, request field schema, response shapes, code examples.
- `references/code-templates.yaml`: Python templates for payload building, registry writes, readiness checks, and stop/request flows.

Secrets rule (applies to every generated code block in this skill)

Never ask the user to type a secret value into a prompt. For every secret value:

1. Tell the user which environment variable to set (e.g. `export HF_TOKEN=...`).
2. Generate code that reads it with `os.environ["VAR_NAME"]`; never hard-code, interpolate, or prompt for the value.

Secret env vars (full list in `references/service.yaml` → `secrets_handling`): `HF_TOKEN`, `WANDB_API_KEY`, `CLEARML_API_ACCESS_KEY`, `CLEARML_API_SECRET_KEY`, `TAO_API_KEY`, `TAO_USER_KEY`.

Safe to collect in the prompt: `network_arch`, `model_path`, `num_gpus`, prompt text, `WANDB_*` config URLs, `CLEARML_*_HOST` URLs.
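The secrets rule can be sketched as a small helper. This is an illustration only: the name `require_secret` is hypothetical and does not come from `references/code-templates.yaml`. It reads the value from the environment and, when the variable is unset, tells the user which variable to export instead of prompting for the value.

```python
import os


def require_secret(var_name: str) -> str:
    """Return the secret held in the environment variable `var_name`.

    Hypothetical helper: raises RuntimeError that names the variable
    (never its value) when the variable is unset or empty.
    """
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Set the secret first, e.g.: export {var_name}=...")
    return value
```

Generated code would call `require_secret("HF_TOKEN")` at the point of use and never log or print the returned value.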

1. What to collect from the user

Input	Role
`network_arch`	Chooses container image, the per-arch inner command shape ( `references/service.yaml` → `container_commands.<network_arch>` ), and `neural_network_name` in the job JSON when applicable. Must match a basename in `valid_network_arch_config_basenames` in `references/service.yaml` (e.g. `cosmos-rl` , `cosmos-predict2.5` ).
`model_path`	The trained model checkpoint. Valid forms: `hf_model://<org>/<model>` (HuggingFace Hub — set `HF_TOKEN` for gated models) or a local container filesystem path. Cloud URIs ( `s3://` , `gs://` , `az://` ) are NOT supported — the inference service has no cloud-storage dependency. Always ask the user; never substitute a placeholder. See `references/service.yaml` → `model_path_protocols` .
`platform`	Compute platform: `local-docker` , `brev` , `lepton` , `slurm` , or `kubernetes` .
`num_gpus`	Defaults to 1; minimum 1 for inference.

2. Image resolution

Each `network_arch` has a sidecar config file named `{network_arch}.config.json`. Resolve the container image as follows:

1. Read `{network_arch}.config.json` and take `api_params.image` (e.g. `COSMOS_RL`). This is a key into `docker_image_defaults.mapping` in `references/service.yaml`.
2. Look up that key in the mapping. If the host env var `IMAGE_<KEY>` is set (e.g. `IMAGE_COSMOS_RL`), it overrides the mapped default.
3. The mapped value is normally a dotted key into the repo-root `versions.yaml` manifest (e.g. `tao_toolkit.cosmos_rl`). Resolve it to a concrete `nvcr.io/...` image URI by looking up `versions.yaml` → `images.<group>.<name>`. Absolute URIs pass through unchanged, so an `IMAGE_<KEY>` env-var override that contains a full URI still works. The Python helper for this lives in `references/code-templates.yaml`.
4. If the config file is missing or `api_params.image` is empty, fall back to the `COSMOS_RL` key.

The config file also has `spec_params.inference.model_path`, which drives folder vs file path semantics: if the value contains the substring `folder`, the container treats the path as a directory.
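The resolution order above can be sketched as follows. This is not the helper from `references/code-templates.yaml`; the function name, the already-parsed `mapping` and `versions` dictionaries, and the rule that a `/` marks an absolute URI are all assumptions made for illustration.

```python
import os

FALLBACK_KEY = "COSMOS_RL"


def resolve_image(api_image_key, mapping, versions, environ=None):
    """Resolve a container image URI (illustrative sketch).

    api_image_key: value of api_params.image, or None/"" when missing.
    mapping:       parsed docker_image_defaults.mapping.
    versions:      parsed versions.yaml manifest.
    """
    environ = os.environ if environ is None else environ
    # Missing config file or empty api_params.image: fall back to COSMOS_RL.
    key = api_image_key or FALLBACK_KEY
    # A host env var IMAGE_<KEY> overrides the mapped default.
    value = environ.get(f"IMAGE_{key}") or mapping[key]
    # Assumption: anything containing "/" is an absolute URI and passes through.
    if "/" in value:
        return value
    # Otherwise treat it as a dotted key: images.<group>.<name>.
    group, name = value.split(".", 1)
    return versions["images"][group][name]
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.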

3. Environment variables (no callbacks)

Set these in `env_payload` before encoding `env_json`. Do not set `TAO_LOGGING_SERVER_URL` or `TAO_ADMIN_KEY`.

`TAO_EXECUTION_BACKEND` must match the platform:

Platform	`TAO_EXECUTION_BACKEND` value
local-docker	`local-docker`
brev	`local-docker`
lepton	`lepton`
slurm	`slurm`
kubernetes	`local-k8s`

`CLOUD_BASED`: always `"False"` for this skill (disables callback posting to `TAO_LOGGING_SERVER_URL`).

GPU env vars are only needed when the platform skill does not handle GPU injection automatically:

- Tegra / Jetson: `--runtime=nvidia` with `NVIDIA_DRIVER_CAPABILITIES=all` and `NVIDIA_VISIBLE_DEVICES=<ids>`.
- Standard x86 + nvidia-container-toolkit: use Docker `device_requests`. The platform skill handles this.
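A minimal sketch of building `env_payload` from the platform table above. The function name is hypothetical, and `TAO_API_JOB_ID` is included because Section 4.1 lists it for `cosmos-rl`; other arches may not need it.

```python
import json

# Platform -> TAO_EXECUTION_BACKEND, copied from the table in this section.
BACKEND_BY_PLATFORM = {
    "local-docker": "local-docker",
    "brev": "local-docker",
    "lepton": "lepton",
    "slurm": "slurm",
    "kubernetes": "local-k8s",
}


def build_env_payload(platform: str, job_id: str) -> dict:
    """Return the env payload; callback-related variables are deliberately absent."""
    if platform not in BACKEND_BY_PLATFORM:
        raise ValueError(f"unsupported platform: {platform}")
    return {
        "TAO_EXECUTION_BACKEND": BACKEND_BY_PLATFORM[platform],
        "TAO_API_JOB_ID": job_id,
        "CLOUD_BASED": "False",  # a string, not a bool
    }


# Encode once the payload is complete.
env_json = json.dumps(build_env_payload("brev", "example-job-id"))
```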

4. Executing across platforms

The job payload and inner command (Sections 1–3) are platform-agnostic. For each platform, read `skills/platform/<name>/SKILL.md` for preflight checks and credentials before generating any execution code.

4.1 Build the inner command (per arch)

The inner-command shape is per `network_arch`; there is no uniform template. Look up the per-arch entry in `references/service.yaml` → `container_commands.<network_arch>`; if it is not present, the arch is unsupported, so stop and ask. Pick the matching sub-block in `references/code-templates.yaml` → `job_payload_builder.<network_arch>`. Prefix the command with `umask 0 &&` and keep it identical across platforms (local-docker, brev, lepton, slurm, kubernetes).

Common across arches:

- `job_id`: fresh `uuid.uuid4()`; becomes the container name and registry key.
- `image`: resolve per Section 2.
- Secrets (`access_key`, `secret_key`, `HF_TOKEN`, etc.) are read from env vars at runtime; never hard-code, log, or print them.

Arch-specific notes (full details in `references/service.yaml` → `container_commands`):

- `cosmos-rl`: single `--job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'` blob; build it with `json.dumps(...)` + `shlex.quote(...)`. `env_payload` carries `TAO_EXECUTION_BACKEND` (per Section 3 table), `TAO_API_JOB_ID`, `CLOUD_BASED=False`. The inference service has no cloud-storage dependency; `HF_TOKEN` is the only cred env var that ever applies (for gated HuggingFace models).
- `cosmos-predict2.5`: flag-style `cosmos_predict inference_microservice start ... --port 8080` (no `setup.` prefix; uses `tyro.conf.OmitArgPrefixes`). `--job` / `--docker_env_vars` are not accepted. Translate `model_path` to `--checkpoint-path` (local path) or `--model <registered_key>` (`hf_model://`); cloud URIs are rejected. The only cred env var that ever applies is `HF_TOKEN` for gated HuggingFace models. Per-request params (`prompt`, `inference_type`, `num_output_frames`, `guidance`, `seed`, `num_steps`, `negative_prompt`) go in the request body, not at startup. `TAO_EXECUTION_BACKEND` / `TAO_API_JOB_ID` / `CLOUD_BASED` are unused and may be omitted.
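For the `cosmos-rl` shape, the `json.dumps(...)` + `shlex.quote(...)` combination might look like the sketch below. The `entrypoint` argument is a placeholder: the real executable and sub-command come from `container_commands.cosmos-rl` and are not reproduced here. The function name is also hypothetical.

```python
import json
import shlex


def build_inner_command(entrypoint: str, job_payload: dict, env_payload: dict) -> str:
    """Build the shell string passed to the container (illustrative sketch).

    `entrypoint` is a placeholder for the per-arch executable; look it up
    in references/service.yaml rather than hard-coding it.
    """
    # json.dumps produces the JSON text; shlex.quote makes it one safe shell word.
    job_arg = shlex.quote(json.dumps(job_payload))
    env_arg = shlex.quote(json.dumps(env_payload))
    return f"umask 0 && {entrypoint} --job {job_arg} --docker_env_vars {env_arg}"
```

Quoting each JSON blob as a whole means embedded quotes, spaces, or apostrophes in the payload survive the shell unchanged.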

4.2 Delegate execution to the platform skill

Read `skills/platform/<platform>/SKILL.md` and follow it to start the container.

Base parameters (all platforms):

Parameter	Value
`image`	resolved container image (Section 2)
`command`	`inner` — the shell string built in Section 4.1
`gpu_count`	`num_gpus`
`env_vars`	`env_payload`
job / container name	`job_id` — must equal the UUID from 4.1 so the registry can reference it
`host_port` (local-docker, brev)	host-side port to bind to container port 8080. Default `8080` , but must be unique per concurrent service — see the port-allocation rule below.

Platform-specific additional inputs:

Platform	Additional inputs
local-docker	None beyond base
brev	`instance_id` (optional — reuse an existing instance); on multi-credential / multi-workspace accounts also `cloud_cred_id` and `workspace_group_id` for first-create — see `skills/platform/tao-run-on-brev/SKILL.md`
lepton	`resource_shape` (GPU shape ID, e.g. `gpu.8xh100-sxm` ); `dedicated_node_group` (optional)
slurm	`partition` and `account` — check `SLURM_PARTITION` / `SLURM_ACCOUNT` env vars; ask user if unset
kubernetes	`namespace` (default: `default` ); `image_pull_secret` (required for `nvcr.io` images)

Port binding (local-docker and brev): use direct `docker run` (not DockerSDK) so that `-p <host_port>:8080` can be passed and the container name equals `job_id` exactly.

Port allocation rule (local-docker and brev, REQUIRED for concurrent services): Before starting a service, read the registry (`/tmp/tao-inf-ms-state.json`) and collect the set of `host_port` values from every existing entry on the same platform (and, for brev, the same `instance_id`). Pick the lowest free port starting from 8080 that is not in that set, e.g. `host_port = next(p for p in range(8080, 8200) if p not in used_ports)`. The default `8080` only applies when no other service is running. This is what makes "start 3 services, each reachable at a distinct `host_url`" work; without it, services 2 and 3 fail with `bind: address already in use`. Lepton, SLURM, and kubernetes get distinct endpoints from their own platform mechanisms and do not need this step.
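The port allocation rule, sketched against an already-loaded registry dictionary. The entry field names `platform` and `instance_id` are assumptions about the registry schema (only `host_port` is named in this section), and `pick_host_port` is a hypothetical name.

```python
def pick_host_port(state, platform, instance_id=None, start=8080, stop=8200):
    """Return the lowest free host port for a new service (illustrative sketch)."""
    used_ports = set()
    for key, entry in state.items():
        # Skip the "latest" pointer and anything that is not a service entry.
        if key == "latest" or not isinstance(entry, dict):
            continue
        if entry.get("platform") != platform:
            continue
        # On brev, only services on the same instance compete for ports.
        if platform == "brev" and entry.get("instance_id") != instance_id:
            continue
        if "host_port" in entry:
            used_ports.add(entry["host_port"])
    for port in range(start, stop):
        if port not in used_ports:
            return port
    raise RuntimeError(f"no free port in range {start}-{stop - 1}")
```

Unlike the one-line `next(...)` form, this version raises a descriptive error instead of `StopIteration` when the range is exhausted.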

4.3 After start: service registry and endpoint

Write the service registry immediately after the platform confirms the container is running. The registry (`/tmp/tao-inf-ms-state.json`) is keyed by `job_id`; `"latest"` always points to the most recently started service.

See `references/code-templates.yaml` → `registry_write.<platform>` for the Python template.

Platform	`host_url`	`platform_job_id`	Extra step before writing
local-docker	`http://localhost:{host_port}`	—	None
brev	`http://{brev_ip}:{host_port}`	—	`brev ls` → get instance IP ( `localhost` is invalid on remote VM)
lepton	Lepton endpoint URL	`job.id`	Poll `sdk.get_job_status` until Running; get endpoint from console or `lep job get <job.id>`
slurm	`http://localhost:{host_port}`	SLURM scheduler job ID	Wait until Running; SSH port-forward `localhost:{host_port}→{node}:8080`
kubernetes	`http://{external_ip}:8080`	k8s job name	`kubectl expose job … --type=LoadBalancer` ; wait for external IP
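A rough sketch of a registry write, independent of platform. It is not the `registry_write.<platform>` template; the function name and the exact set of entry fields are assumptions, apart from `host_url`, `host_port`, `platform_job_id`, `network_arch`, and `started_at`, which this document mentions.

```python
import json
import os
from datetime import datetime, timezone

REGISTRY_PATH = "/tmp/tao-inf-ms-state.json"


def write_registry_entry(job_id, entry, path=REGISTRY_PATH):
    """Add `entry` under `job_id` and repoint "latest" (illustrative sketch)."""
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    # Stamp the start time so candidates can later be listed with started_at.
    state[job_id] = {**entry, "started_at": datetime.now(timezone.utc).isoformat()}
    # "latest" always points to the most recently started service.
    state["latest"] = job_id
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
    return state
```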

After writing the registry, print the `job_id` and URL:

```python
# Fragment: job_id, network_arch, and state come from the registry-write step.
print("Inference service started.")
print(f"  Job ID : {job_id}")
print(f"  Arch   : {network_arch}")
print(f"  URL    : {state[job_id]['host_url']}/v1/chat/completions")
print("Use this Job ID to send requests or stop the service.")
```

Then poll for readiness; see `references/code-templates.yaml` → `readiness_check`. The container loads the model in the background; do not send requests before it returns 200.
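The polling loop itself can be sketched without committing to a particular readiness URL, which is defined by the `readiness_check` template and not repeated here. In this sketch `probe` is any caller-supplied function that returns the HTTP status code of one readiness request; the function name and the timeout and interval defaults are assumptions.

```python
import time


def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call `probe()` until it returns 200 or the timeout elapses.

    Returns True when the service reported ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe() == 200:
            return True
        # Not ready yet (e.g. the model is still loading): wait and retry.
        sleep(interval_s)
    return False
```

Injecting `sleep` keeps the loop testable without real delays.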

5. Stopping the inference service

Ask the user for the `job_id` to stop. If they don't provide one, default to `state["latest"]` and confirm which `job_id` is being stopped. Read the registry using `references/code-templates.yaml` → `stop.registry_read`, then read `skills/platform/<platform>/SKILL.md` and use its cancellation / stop mechanism.

Platform	Identifier to pass	Extra cleanup
local-docker	`job_id_to_stop` — container name	None
brev	`job_id_to_stop` — container name	None
lepton	`entry["platform_job_id"]` — Lepton job ID	None
slurm	`entry["platform_job_id"]` — SLURM job ID	`pkill -f "ssh.-L.{entry['host_port']}"`
kubernetes	`entry["platform_job_id"]` — k8s job name	`kubectl delete svc {entry["platform_job_id"]} -n <namespace>`

where `entry = state[job_id_to_stop]`. After stopping, clean up the registry: `references/code-templates.yaml` → `stop.registry_cleanup`.
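The identifier choice in the table above, plus a registry cleanup, can be sketched as below. Both function names are hypothetical, the `platform` field on each entry is an assumed part of the schema, and dropping `"latest"` when it pointed at the stopped job is a guess at what `stop.registry_cleanup` does, not a description of it.

```python
def stop_identifier(state, job_id_to_stop):
    """Return the identifier the platform skill needs to stop the service."""
    entry = state[job_id_to_stop]
    # local-docker and brev stop the container by name, which equals the job_id.
    if entry["platform"] in ("local-docker", "brev"):
        return job_id_to_stop
    # lepton, slurm, and kubernetes use the platform's own job identifier.
    return entry["platform_job_id"]


def remove_from_registry(state, job_id_to_stop):
    """Drop the stopped service from the in-memory registry (assumed behavior)."""
    state.pop(job_id_to_stop, None)
    # Assumption: a "latest" pointer to a stopped job is removed, not repointed.
    if state.get("latest") == job_id_to_stop:
        state.pop("latest")
    return state
```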

6. Sending inference requests

6.0 Resolve which service receives this request (REQUIRED)

Each request must be routed to the specific service that runs the matching model. Routing happens by `job_id`; the registry stores `network_arch` per entry, so you can resolve a target by arch when the user names a model instead of a `job_id`. Apply these rules in order:

1. User provided an explicit `job_id` → use it. Verify it exists in `state`.
2. User named a `network_arch` (e.g. "send this to the cosmos-rl service") → look up matching entries: `candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch]`.
   - Exactly one match → use it.
   - Multiple matches → prompt the user with the candidate `job_id`s and their `started_at`; do not auto-pick.
   - No match → stop and tell the user no service for that arch is running.
3. No `job_id` and no `network_arch` → count non-`"latest"` entries in `state`:
   - Exactly one running service → use it.
   - Two or more → do not silently default to `state["latest"]`. Prompt the user with the full list (`job_id`, `network_arch`, `host_url`) and require an explicit choice. The `"latest"` pointer is a convenience for single-service workflows, not a routing fallback when multiple services coexist.
   - Zero → stop and tell the user to start a service first.
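The routing rules can be condensed into one function. The names `resolve_target` and `NeedsUserChoice` are hypothetical; the exception stands in for "prompt the user", which the calling agent performs.

```python
class NeedsUserChoice(Exception):
    """Several services match; the user must pick one explicitly."""

    def __init__(self, candidates):
        super().__init__(f"multiple candidate services: {candidates}")
        self.candidates = candidates


def resolve_target(state, job_id=None, arch=None):
    """Return the job_id that should receive the request (illustrative sketch)."""
    entries = {j: e for j, e in state.items() if j != "latest" and isinstance(e, dict)}
    if job_id is not None:
        # Explicit job_id: use it, but verify it exists.
        if job_id not in entries:
            raise LookupError(f"job_id {job_id} is not in the registry")
        return job_id
    if arch is not None:
        # Arch named: only services running that arch are candidates.
        candidates = [j for j, e in entries.items() if e["network_arch"] == arch]
    else:
        # Neither given: every running service is a candidate.
        candidates = list(entries)
    if not candidates:
        raise LookupError("no matching service is running")
    if len(candidates) > 1:
        # Never fall back to state["latest"] when several services coexist.
        raise NeedsUserChoice(candidates)
    return candidates[0]
```

A caller would catch `NeedsUserChoice`, show `exc.candidates` with their details, and call `resolve_target` again with the chosen `job_id`.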

After resolving, read the endpoint from the registry (`references/code-templates.yaml` → `request.registry_read`), passing the resolved `job_id` as `user_provided_job_id`. Confirm to the user: "Sending to job_id=… arch=… url=…". If the service may still be loading, poll readiness first (`references/code-templates.yaml` → `readiness_check`).

Cross-check before sending: if the user-supplied request body contains arch-specific fields (e.g. `guidance`, `num_steps`, `seed`, `negative_prompt` → cosmos-predict2.5; required `image_url` / `video_url` content items → cosmos-rl), verify they are consistent with `state[job_id]["network_arch"]`. On mismatch, stop and ask: sending a cosmos-predict2.5 body to a cosmos-rl service will fail at the container with a 4xx/5xx that is harder to diagnose than catching it here.

6.1 Sampling parameters — REQUIRED user prompt before each request

Before constructing the request body, you MUST explicitly prompt the user for the vLLM-style sampling parameters. Do not silently apply defaults. Use a structured prompt (e.g. `AskUserQuestion` in Claude Code, one question per field) that:

- Lists every applicable field with its type and default value.
- Lets the user skip / accept any field to take that field's default; entering a value is never required.
- Collects all fields in one round.

After the prompt, apply each user-entered value verbatim and substitute the default for any skipped field. Do not invent values or silently clamp.

Field list, defaults, and per-arch applicability: `references/request.yaml` → `chat_completions_request_body` (base sampling fields: `max_tokens`, `top_p`, `temperature`) and `network_arch_constraints.<network_arch>` (per-arch overrides and extras such as `guidance`, `num_steps`, `seed`, `negative_prompt` for `cosmos-predict2.5`). If a field is marked unsupported for the active arch, do not prompt for it and do not include it in the body.
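Once the user has answered the prompt, merging answers with defaults could look like this sketch. The function name is hypothetical, a skipped field is assumed to arrive as a missing key or `None`, and the defaults must be read from `references/request.yaml`; none are hard-coded here.

```python
def merge_sampling_params(defaults, user_values, unsupported=()):
    """Combine user answers with defaults for one request (illustrative sketch).

    defaults:     field -> default value, loaded from references/request.yaml.
    user_values:  field -> value the user entered; skipped fields are absent or None.
    unsupported:  fields marked unsupported for the active arch.
    """
    merged = {}
    for field, default in defaults.items():
        if field in unsupported:
            # Unsupported fields are neither prompted for nor sent.
            continue
        value = user_values.get(field)
        # User-entered values are applied verbatim: no clamping, no rewriting.
        merged[field] = default if value is None else value
    return merged
```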

6.2 Request format

Send a `POST` to `{BASE_URL}/v1/chat/completions` with `Content-Type: application/json` and a timeout of at least 300 s. The body is OpenAI-compatible (vLLM chat completions); see `references/request.yaml` → `chat_completions_request_body` for the full field schema and content-item shapes (text / image_url / video_url), and `code_examples` for ready-to-run Python and curl samples.

Constraints: only the first user message is processed. No secret values in request bodies. Per-network constraints (e.g. cosmos-rl requires every request to include an image or video; cosmos-rl rejects `data:` URIs) are in `references/request.yaml` → `network_arch_constraints`.
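As a standard-library-only illustration (the maintained samples are in `code_examples`), the request can be assembled like this. The function name is hypothetical, and the top-level `messages` key is assumed from the OpenAI-compatible format; check the field schema in `references/request.yaml` before relying on it.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, sampling):
    """Build (but do not send) the POST request (illustrative sketch)."""
    # Sampling parameters sit at the top level next to the messages.
    body = {"messages": messages, **sampling}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send: urllib.request.urlopen(request, timeout=300)
```

Separating construction from sending makes it possible to inspect the body for secrets or arch mismatches before anything leaves the machine.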

6.3 Response handling

HTTP status	Meaning	Action
200	Success — `choices[0].message.content` has the generated text	Read result
202	Server still initializing or model still loading	Retry after a delay
503	Initialization failed, model load failed, or model not yet ready	Inspect `error.type` : `model_not_ready` → retry; `initialization_error` / `model_load_error` → give up and check logs
400	Missing or empty JSON body	Fix request
500	Unhandled exception during inference	Check container logs

For 202 and 503, the body contains `{"error": {"type": "<error_type>", "message": "<reason>"}}`. See `container_response_shapes` in `references/request.yaml` for error type strings.
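The status table can be turned into a small dispatcher, sketched below. The function name and the returned action labels are made up for illustration, and treating an unrecognized 503 `error.type` as non-retryable is an assumption.

```python
def next_action(status, body=None):
    """Map an HTTP status and parsed JSON body to an action (illustrative sketch)."""
    error_type = ((body or {}).get("error") or {}).get("type")
    if status == 200:
        return "read_result"  # choices[0].message.content holds the text
    if status == 202:
        return "retry"  # server still initializing or model still loading
    if status == 503:
        # Only model_not_ready is worth retrying; other types mean a failed load.
        return "retry" if error_type == "model_not_ready" else "give_up_check_logs"
    if status == 400:
        return "fix_request"
    if status == 500:
        return "check_container_logs"
    return "unexpected_status"
```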
