tao-run-inference-service

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

TAO Inference Microservice

TAO推理微服务

Instructions

操作说明

To start an inference service:
  1. Collect required inputs (Section 1) and resolve the container image (Section 2).
  2. Build the job payload and inner command (Sections 3–4.1); use
    references/code-templates.yaml
    job_payload_builder
    .
  3. Read
    skills/platform/<platform>/SKILL.md
    and start the container (Section 4.2).
  4. Write the service registry and poll readiness (Section 4.3); use
    references/code-templates.yaml
    registry_write.<platform>
    and
    readiness_check
    .
To send an inference request:
  1. Resolve which service receives the request per Section 6.0 (by
    job_id
    , by
    network_arch
    , or by explicit user choice when multiple services run — never silently default to
    "latest"
    when more than one service exists
    ), then read the endpoint from
    references/code-templates.yaml
    request.registry_read
    with the resolved
    job_id
    .
  2. Before building the request body, prompt the user for the vLLM-style sampling parameters (Section 6.1). Present
    max_tokens
    ,
    top_p
    ,
    temperature
    (and any per-arch extras) with their defaults; let the user override or skip each one to accept the default. Never silently use defaults.
  3. Build and send the body per Section 6.2; handle the response per Section 6.3.
To stop a service: Read
references/code-templates.yaml
stop.registry_read
to resolve the job_id, read
skills/platform/<platform>/SKILL.md
, then follow Section 5.
Reference data (schemas, mappings, valid values — no instructions):
  • references/service.yaml
    — image mappings, valid
    network_arch
    names, job payload schema, env var names, secrets classification.
  • references/request.yaml
    — endpoint definition, request field schema, response shapes, code examples.
  • references/code-templates.yaml
    — Python templates for payload building, registry writes, readiness checks, and stop/request flows.

启动推理服务:
  1. 收集所需输入(第1节)并解析容器镜像(第2节)。
  2. 构建作业负载和内部命令(第3–4.1节);使用
    references/code-templates.yaml
    中的
    job_payload_builder
    模板。
  3. 阅读
    skills/platform/<platform>/SKILL.md
    并启动容器(第4.2节)。
  4. 写入服务注册表并轮询就绪状态(第4.3节);使用
    references/code-templates.yaml
    中的
    registry_write.<platform>
    readiness_check
    模板。
发送推理请求:
  1. 根据第6.0节的规则确定接收请求的服务(通过
    job_id
    network_arch
    ,或者当多个服务运行时由用户明确选择——当存在多个服务时,绝不能默认使用
    "latest"
    而不告知用户
    ),然后通过
    references/code-templates.yaml
    中的
    request.registry_read
    模板,传入解析后的
    job_id
    来读取端点信息。
  2. 构建请求体之前,必须提示用户提供vLLM风格的采样参数(第6.1节)。列出
    max_tokens
    top_p
    temperature
    (以及各架构特有的额外参数)及其默认值;允许用户覆盖或跳过每个参数以接受默认值。绝不能静默使用默认值。
  3. 根据第6.2节构建并发送请求体;根据第6.3节处理响应。
停止服务: 通过
references/code-templates.yaml
中的
stop.registry_read
模板解析
job_id
,阅读
skills/platform/<platform>/SKILL.md
,然后按照第5节的步骤操作。
参考数据(模式、映射、有效值——不含操作说明):
  • references/service.yaml
    —— 镜像映射、有效的
    network_arch
    名称、作业负载模式、环境变量名称、密钥分类。
  • references/request.yaml
    —— 端点定义、请求字段模式、响应格式、代码示例。
  • references/code-templates.yaml
    —— 用于负载构建、注册表写入、就绪检查以及停止/请求流程的Python模板。

Secrets rule (applies to every generated code block in this skill)

密钥规则(适用于本skill中生成的所有代码块)

Never ask the user to type a secret value into a prompt. For every secret value:
  1. Tell the user which environment variable to set (e.g.
    export HF_TOKEN=...
    ).
  2. Generate code that reads it with
    os.environ["VAR_NAME"]
    — never hard-code, interpolate, or prompt for the value.
Secret env vars (full list in
references/service.yaml
secrets_handling
):
HF_TOKEN
,
WANDB_API_KEY
,
CLEARML_API_ACCESS_KEY
,
CLEARML_API_SECRET_KEY
,
TAO_API_KEY
,
TAO_USER_KEY
.
Safe to collect in the prompt:
network_arch
,
model_path
,
num_gpus
, prompt text,
WANDB_*
config URLs,
CLEARML_*_HOST
URLs.

绝不能要求用户在提示框中输入密钥值。对于每个密钥值:
  1. 告知用户需要设置哪个环境变量(例如
    export HF_TOKEN=...
    )。
  2. 生成使用
    os.environ["VAR_NAME"]
    读取该变量的代码——绝不能硬编码、插值或提示用户输入该值。
密钥环境变量(完整列表见
references/service.yaml
中的
secrets_handling
):
HF_TOKEN
,
WANDB_API_KEY
,
CLEARML_API_ACCESS_KEY
,
CLEARML_API_SECRET_KEY
,
TAO_API_KEY
,
TAO_USER_KEY
可在提示中收集的安全信息
network_arch
,
model_path
,
num_gpus
, 提示文本,
WANDB_*
配置URL,
CLEARML_*_HOST
URL。

1. What to collect from the user

1. 需要向用户收集的信息

InputRole
network_arch
Chooses container image, the per-arch inner command shape (
references/service.yaml
container_commands.<network_arch>
), and
neural_network_name
in the job JSON when applicable. Must match a basename in
valid_network_arch_config_basenames
in
references/service.yaml
(e.g.
cosmos-rl
,
cosmos-predict2.5
).
model_path
The trained model checkpoint. Valid forms:
hf_model://<org>/<model>
(HuggingFace Hub — set
HF_TOKEN
for gated models) or a local container filesystem path. Cloud URIs (
s3://
,
gs://
,
az://
) are NOT supported — the inference service has no cloud-storage dependency. Always ask the user; never substitute a placeholder. See
references/service.yaml
model_path_protocols
.
platform
Compute platform:
local-docker
,
brev
,
lepton
,
slurm
, or
kubernetes
.
num_gpus
Defaults to 1; minimum 1 for inference.

输入项作用
network_arch
选择容器镜像、对应架构的内部命令格式(
references/service.yaml
container_commands.<network_arch>
),以及适用时作业JSON中的
neural_network_name
。必须与
references/service.yaml
valid_network_arch_config_basenames
的基础名称匹配(例如
cosmos-rl
,
cosmos-predict2.5
)。
model_path
训练好的模型检查点。有效格式:
hf_model://<org>/<model>
(HuggingFace Hub——对于 gated 模型需设置
HF_TOKEN
)或本地容器文件系统路径。不支持云URI(
s3://
,
gs://
,
az://
)——推理服务不依赖云存储。必须询问用户,绝不能使用占位符替代。详见
references/service.yaml
model_path_protocols
platform
计算平台:
local-docker
,
brev
,
lepton
,
slurm
, 或
kubernetes
num_gpus
默认值为1;推理所需的最小值为1

2. Image resolution

2. 镜像解析

Each
network_arch
has a sidecar config file named
{network_arch}.config.json
. Resolve the container image as follows:
  1. Read
    {network_arch}.config.json
    and take
    api_params.image
    (e.g.
    COSMOS_RL
    ). This is a key into
    docker_image_defaults.mapping
    in
    references/service.yaml
    .
  2. Look up that key in the mapping. If the host env var
    IMAGE_<KEY>
    is set (e.g.
    IMAGE_COSMOS_RL
    ), it overrides the mapped default.
  3. The mapped value is normally a dotted key into the repo-root
    versions.yaml
    manifest (e.g.
    tao_toolkit.cosmos_rl
    ). Resolve it to a concrete
    nvcr.io/...
    image URI by looking up
    versions.yaml
    images.<group>.<name>
    . Absolute URIs pass through unchanged, so an
    IMAGE_<KEY>
    env-var override that contains a full URI still works. The Python helper for this lives in
    references/code-templates.yaml
    .
  4. If the config file is missing or
    api_params.image
    is empty, fall back to the
    COSMOS_RL
    key.
The config file also has
spec_params.inference.model_path
which drives folder vs file path semantics: if the value contains the substring
folder
, the container treats the path as a directory.

每个
network_arch
都有一个名为
{network_arch}.config.json
的辅助配置文件。按以下步骤解析容器镜像:
  1. 读取
    {network_arch}.config.json
    并获取
    api_params.image
    (例如
    COSMOS_RL
    )。这是
    references/service.yaml
    docker_image_defaults.mapping
    的一个键。
  2. 在映射中查找该键。如果主机环境变量
    IMAGE_<KEY>
    已设置(例如
    IMAGE_COSMOS_RL
    ),则会覆盖映射的默认值。
  3. 映射值通常是指向仓库根目录
    versions.yaml
    清单的点分隔键(例如
    tao_toolkit.cosmos_rl
    )。通过查找
    versions.yaml
    images.<group>.<name>
    将其解析为具体的
    nvcr.io/...
    镜像URI。绝对URI会直接保留,因此包含完整URI的
    IMAGE_<KEY>
    环境变量覆盖仍然有效。此操作的Python助手位于
    references/code-templates.yaml
    中。
  4. 如果配置文件缺失或
    api_params.image
    为空,则回退到
    COSMOS_RL
    键。
配置文件中还包含
spec_params.inference.model_path
,用于驱动文件夹 vs 文件路径语义:如果该值包含子字符串
folder
,则容器会将路径视为目录。

3. Environment variables (no callbacks)

3. 环境变量(无回调)

Set these in
env_payload
before encoding
env_json
. Do not set
TAO_LOGGING_SERVER_URL
or
TAO_ADMIN_KEY
.
TAO_EXECUTION_BACKEND
— must match the platform:
Platform
TAO_EXECUTION_BACKEND
value
local-docker
local-docker
brev
local-docker
lepton
lepton
slurm
slurm
kubernetes
local-k8s
CLOUD_BASED
— always
"False"
for this skill (disables callback posting to
TAO_LOGGING_SERVER_URL
).
GPU env vars — only needed when the platform skill does not handle GPU injection automatically:
  • Tegra / Jetson:
    --runtime=nvidia
    with
    NVIDIA_DRIVER_CAPABILITIES=all
    and
    NVIDIA_VISIBLE_DEVICES=<ids>
    .
  • Standard x86 + nvidia-container-toolkit: use Docker
    device_requests
    . The platform skill handles this.

在编码
env_json
之前,将这些变量设置到
env_payload
中。不要设置
TAO_LOGGING_SERVER_URL
TAO_ADMIN_KEY
TAO_EXECUTION_BACKEND
—— 必须与平台匹配:
平台
TAO_EXECUTION_BACKEND
local-docker
local-docker
brev
local-docker
lepton
lepton
slurm
slurm
kubernetes
local-k8s
CLOUD_BASED
—— 对于本skill始终设置为
"False"
(禁用向
TAO_LOGGING_SERVER_URL
发送回调)。
GPU环境变量 —— 仅当平台skill不自动处理GPU注入时才需要:
  • Tegra / Jetson:使用
    --runtime=nvidia
    ,并设置
    NVIDIA_DRIVER_CAPABILITIES=all
    NVIDIA_VISIBLE_DEVICES=<ids>
  • 标准x86 + nvidia-container-toolkit:使用Docker的
    device_requests
    。平台skill会处理此操作。

4. Executing across platforms

4. 跨平台执行

The job payload and inner command (Sections 1–3) are platform-agnostic. For each platform, read
skills/platform/<name>/SKILL.md
for preflight checks and credentials before generating any execution code.
作业负载和内部命令(第1–3节)是平台无关的。对于每个平台,在生成任何执行代码之前,请先阅读**
skills/platform/<name>/SKILL.md
**中的预检检查和凭证要求。

4.1 Build the inner command (per arch)

4.1 构建内部命令(按架构)

The inner-command shape is per
network_arch
— there is no uniform template. Look up the per-arch entry in
references/service.yaml
container_commands.<network_arch>
; if not present, the arch is unsupported — stop and ask. Pick the matching sub-block in
references/code-templates.yaml
job_payload_builder.<network_arch>
. Prefix the command with
umask 0 &&
and keep it identical across platforms (local-docker, brev, lepton, slurm, kubernetes).
Common across arches:
  • job_id
    : fresh
    uuid.uuid4()
    — becomes the container name and registry key.
  • image
    : resolve per Section 2.
  • Secrets (
    access_key
    ,
    secret_key
    ,
    HF_TOKEN
    , etc.) are read from env vars at runtime — never hard-code, never log or print.
Arch-specific notes (full details in
references/service.yaml
container_commands
):
  • cosmos-rl
    — single
    --job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'
    blob;
    json.dumps(...)
    +
    shlex.quote(...)
    .
    env_payload
    carries
    TAO_EXECUTION_BACKEND
    (per Section 3 table),
    TAO_API_JOB_ID
    ,
    CLOUD_BASED=False
    . The inference service has no cloud-storage dependency;
    HF_TOKEN
    is the only cred env var that ever applies (for gated HuggingFace models).
  • cosmos-predict2.5
    — flag-style
    cosmos_predict inference_microservice start ... --port 8080
    (no
    setup.
    prefix; uses
    tyro.conf.OmitArgPrefixes
    ).
    --job
    /
    --docker_env_vars
    are not accepted. Translate
    model_path
    to
    --checkpoint-path
    (local path) or
    --model <registered_key>
    (
    hf_model://
    ); cloud URIs are rejected. The only cred env var that ever applies is
    HF_TOKEN
    for gated HuggingFace models. Per-request params (prompt, inference_type, num_output_frames, guidance, seed, num_steps, negative_prompt) go in the request body, not at startup.
    TAO_EXECUTION_BACKEND
    /
    TAO_API_JOB_ID
    /
    CLOUD_BASED
    are unused and may be omitted.
内部命令格式是基于
network_arch
——没有统一模板。在
references/service.yaml
container_commands.<network_arch>
中查找对应架构的条目;如果不存在,则该架构不受支持——停止操作并询问用户。在
references/code-templates.yaml
job_payload_builder.<network_arch>
中选择匹配的子块。在命令前添加
umask 0 &&
,并确保在所有平台(local-docker, brev, lepton, slurm, kubernetes)上完全一致
各架构通用规则:
  • job_id
    :生成新的
    uuid.uuid4()
    ——将作为容器名称和注册表键。
  • image
    :根据第2节解析得到。
  • 密钥(
    access_key
    ,
    secret_key
    ,
    HF_TOKEN
    等)在运行时从环境变量读取——绝不能硬编码,绝不能记录或打印。
各架构特定说明(详细信息见
references/service.yaml
container_commands
):
  • cosmos-rl
    —— 单个
    --job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'
    块;使用
    json.dumps(...)
    +
    shlex.quote(...)
    处理。
    env_payload
    包含
    TAO_EXECUTION_BACKEND
    (按第3节表格)、
    TAO_API_JOB_ID
    CLOUD_BASED=False
    。推理服务不依赖云存储;
    HF_TOKEN
    是唯一适用的凭证环境变量(用于 gated HuggingFace模型)。
  • cosmos-predict2.5
    —— 标志式命令
    cosmos_predict inference_microservice start ... --port 8080
    (无
    setup.
    前缀;使用
    tyro.conf.OmitArgPrefixes
    )。不接受
    --job
    /
    --docker_env_vars
    参数。将
    model_path
    转换为
    --checkpoint-path
    (本地路径)或
    --model <registered_key>
    hf_model://
    格式);拒绝云URI。唯一适用的凭证环境变量是用于 gated HuggingFace模型的
    HF_TOKEN
    。每个请求的参数(提示、inference_type、num_output_frames、guidance、seed、num_steps、negative_prompt)放在请求体中,而不是启动时设置。
    TAO_EXECUTION_BACKEND
    /
    TAO_API_JOB_ID
    /
    CLOUD_BASED
    未使用,可以省略。

4.2 Delegate execution to the platform skill

4.2 将执行委托给平台skill

Read
skills/platform/<platform>/SKILL.md
and follow it to start the container.
Base parameters (all platforms):
ParameterValue
image
resolved container image (Section 2)
command
inner
— the shell string built in Section 4.1
gpu_count
num_gpus
env_vars
env_payload
job / container name
job_id
— must equal the UUID from 4.1 so the registry can reference it
host_port
(local-docker, brev)
host-side port to bind to container port 8080. Default
8080
, but must be unique per concurrent service — see the port-allocation rule below.
Platform-specific additional inputs:
PlatformAdditional inputs
local-dockerNone beyond base
brev
instance_id
(optional — reuse an existing instance); on multi-credential / multi-workspace accounts also
cloud_cred_id
and
workspace_group_id
for first-create — see
skills/platform/tao-run-on-brev/SKILL.md
lepton
resource_shape
(GPU shape ID, e.g.
gpu.8xh100-sxm
);
dedicated_node_group
(optional)
slurm
partition
and
account
— check
SLURM_PARTITION
/
SLURM_ACCOUNT
env vars; ask user if unset
kubernetes
namespace
(default:
default
);
image_pull_secret
(required for
nvcr.io
images)
Port binding (local-docker and brev): use direct docker run (not DockerSDK) so that
-p <host_port>:8080
can be passed and the container name equals
job_id
exactly.
Port allocation rule (local-docker and brev, REQUIRED for concurrent services): Before starting a service, read the registry (
/tmp/tao-inf-ms-state.json
) and collect the set of
host_port
values from every existing entry on the same platform (and, for brev, the same
instance_id
). Pick the lowest free port starting from 8080 that is not in that set — e.g.
host_port = next(p for p in range(8080, 8200) if p not in used_ports)
. The default
8080
only applies when no other service is running. This is what makes "start 3 services, each reachable at a distinct
host_url
" work; without it, services 2 and 3 fail with
bind: address already in use
. Lepton, SLURM, and kubernetes get distinct endpoints from their own platform mechanisms and do not need this step.
阅读**
skills/platform/<platform>/SKILL.md
**并按照说明启动容器。
基础参数(所有平台):
参数
image
解析后的容器镜像(第2节)
command
inner
—— 第4.1节构建的Shell字符串
gpu_count
num_gpus
env_vars
env_payload
作业/容器名称
job_id
—— 必须与4.1节中的UUID一致,以便注册表可以引用它
host_port
(local-docker, brev)
主机端绑定到容器端口8080的端口。默认值为
8080
,但每个并发服务必须唯一——请参阅下面的端口分配规则。
平台特定附加输入:
平台附加输入
local-docker除基础参数外无其他输入
brev
instance_id
(可选——重用现有实例);对于多凭证/多工作区账户,首次创建时还需要
cloud_cred_id
workspace_group_id
——详见
skills/platform/tao-run-on-brev/SKILL.md
lepton
resource_shape
(GPU规格ID,例如
gpu.8xh100-sxm
);
dedicated_node_group
(可选)
slurm
partition
account
——检查
SLURM_PARTITION
/
SLURM_ACCOUNT
环境变量;如果未设置则询问用户
kubernetes
namespace
(默认:
default
);
image_pull_secret
(对于
nvcr.io
镜像为必填项)
端口绑定(local-docker和brev): 使用直接docker run命令(而非DockerSDK),以便可以传递
-p <host_port>:8080
参数,且容器名称与
job_id
完全一致。
端口分配规则(local-docker和brev,并发服务必填): 启动服务前,读取注册表(
/tmp/tao-inf-ms-state.json
)并收集同一平台(对于brev,还包括同一
instance_id
)上所有现有条目的
host_port
值集合。选择从8080开始的最低可用端口,且该端口不在已使用集合中——例如
host_port = next(p for p in range(8080, 8200) if p not in used_ports)
。仅当没有其他服务运行时才使用默认值
8080
。此规则确保“启动3个服务,每个都可通过不同的
host_url
访问”可行;否则,第2和第3个服务会因
bind: address already in use
错误而失败。Lepton、SLURM和kubernetes通过各自的平台机制获取不同的端点,无需此步骤。

4.3 After start: service registry and endpoint

4.3 启动后:服务注册表和端点

Write the service registry immediately after the platform confirms the container is running. The registry (
/tmp/tao-inf-ms-state.json
) is keyed by
job_id
;
"latest"
always points to the most recently started service.
See
references/code-templates.yaml
registry_write.<platform>
for the Python template.
Platform
host_url
platform_job_id
Extra step before writing
local-docker
http://localhost:{host_port}
None
brev
http://{brev_ip}:{host_port}
brev ls
→ get instance IP (
localhost
is invalid on remote VM)
leptonLepton endpoint URL
job.id
Poll
sdk.get_job_status
until Running; get endpoint from console or
lep job get <job.id>
slurm
http://localhost:{host_port}
SLURM scheduler job IDWait until Running; SSH port-forward
localhost:{host_port}→{node}:8080
kubernetes
http://{external_ip}:8080
k8s job name
kubectl expose job … --type=LoadBalancer
; wait for external IP
After writing the registry, print the job_id and URL:
python
print(f"Inference service started.")
print(f"  Job ID : {job_id}")
print(f"  Arch   : {network_arch}")
print(f"  URL    : {state[job_id]['host_url']}/v1/chat/completions")
print(f"Use this Job ID to send requests or stop the service.")
Then poll for readiness — see
references/code-templates.yaml
readiness_check
. The container loads the model in the background; do not send requests before it returns 200.

平台确认容器运行后,立即写入服务注册表。注册表(
/tmp/tao-inf-ms-state.json
)以
job_id
为键;
"latest"
始终指向最近启动的服务。
请参阅
references/code-templates.yaml
registry_write.<platform>
中的Python模板。
平台
host_url
platform_job_id
写入前的额外步骤
local-docker
http://localhost:{host_port}
brev
http://{brev_ip}:{host_port}
执行
brev ls
→ 获取实例IP(远程VM上
localhost
无效)
leptonLepton端点URL
job.id
轮询
sdk.get_job_status
直到状态为Running;从控制台或
lep job get <job.id>
获取端点
slurm
http://localhost:{host_port}
SLURM调度器作业ID等待状态变为Running;通过SSH端口转发
localhost:{host_port}→{node}:8080
kubernetes
http://{external_ip}:8080
k8s作业名称执行
kubectl expose job … --type=LoadBalancer
;等待外部IP分配
写入注册表后,打印job_id和URL:
python
print(f"Inference service started.")
print(f"  Job ID : {job_id}")
print(f"  Arch   : {network_arch}")
print(f"  URL    : {state[job_id]['host_url']}/v1/chat/completions")
print(f"Use this Job ID to send requests or stop the service.")
然后轮询就绪状态——请参阅
references/code-templates.yaml
readiness_check
模板。容器会在后台加载模型;在返回200状态码之前不要发送请求。

5. Stopping the inference service

5. 停止推理服务

Ask the user for the
job_id
to stop. If they don't provide one, default to
state["latest"]
and confirm which job_id is being stopped. Read the registry using
references/code-templates.yaml
stop.registry_read
, then read
skills/platform/<platform>/SKILL.md
and use its cancellation / stop mechanism.
PlatformIdentifier to passExtra cleanup
local-docker
job_id_to_stop
— container name
None
brev
job_id_to_stop
— container name
None
lepton
entry["platform_job_id"]
— Lepton job ID
None
slurm
entry["platform_job_id"]
— SLURM job ID
pkill -f "ssh.*-L.*{entry['host_port']}"
kubernetes
entry["platform_job_id"]
— k8s job name
kubectl delete svc {entry["platform_job_id"]} -n <namespace>
where
entry = state[job_id_to_stop]
. After stopping, clean up the registry:
references/code-templates.yaml
stop.registry_cleanup
.

询问用户要停止的
job_id
。如果用户未提供,则默认使用
state["latest"]
并确认要停止的job_id。通过
references/code-templates.yaml
stop.registry_read
模板读取注册表,然后阅读**
skills/platform/<platform>/SKILL.md
**并使用其取消/停止机制。
平台需传递的标识符额外清理操作
local-docker
job_id_to_stop
—— 容器名称
brev
job_id_to_stop
—— 容器名称
lepton
entry["platform_job_id"]
—— Lepton作业ID
slurm
entry["platform_job_id"]
—— SLURM作业ID
执行
pkill -f "ssh.*-L.*{entry['host_port']}"
kubernetes
entry["platform_job_id"]
—— k8s作业名称
执行
kubectl delete svc {entry["platform_job_id"]} -n <namespace>
其中
entry = state[job_id_to_stop]
。停止后,清理注册表:使用
references/code-templates.yaml
stop.registry_cleanup
模板。

6. Sending inference requests

6. 发送推理请求

6.0 Resolve which service receives this request (REQUIRED)

6.0 确定接收请求的服务(必填)

Each request must be routed to the specific service that runs the matching model. Routing happens by
job_id
— the registry stores
network_arch
per entry, so you can resolve a target by arch when the user names a model instead of a
job_id
. Apply these rules in order:
  1. User provided an explicit
    job_id
    → use it. Verify it exists in
    state
    .
  2. User named a
    network_arch
    (e.g. "send this to the cosmos-rl service") → look up matching entries:
    candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch]
    .
    • Exactly one match → use it.
    • Multiple matches → prompt the user with the candidate
      job_id
      s and their
      started_at
      ; do not auto-pick.
    • No match → stop and tell the user no service for that arch is running.
  3. No
    job_id
    and no
    network_arch
    → count non-
    "latest"
    entries in
    state
    :
    • Exactly one running service → use it.
    • Two or more → do not silently default to
      state["latest"]
      . Prompt the user with the full list (
      job_id
      ,
      network_arch
      ,
      host_url
      ) and require an explicit choice. The
      "latest"
      pointer is a convenience for single-service workflows, not a routing fallback when multiple services coexist.
    • Zero → stop and tell the user to start a service first.
After resolving, read the endpoint from the registry (
references/code-templates.yaml
request.registry_read
), passing the resolved
job_id
as
user_provided_job_id
. Confirm to the user: "Sending to job_id=… arch=… url=…". If the service may still be loading, poll readiness first (
references/code-templates.yaml
readiness_check
).
Cross-check before sending: if the user-supplied request body contains arch-specific fields (e.g.
guidance
/
num_steps
/
seed
/
negative_prompt
→ cosmos-predict2.5; required
image_url
/
video_url
content items → cosmos-rl), verify they are consistent with
state[job_id]["network_arch"]
. On mismatch, stop and ask — sending a cosmos-predict2.5 body to a cosmos-rl service will fail at the container with a 4xx/5xx that is harder to diagnose than catching it here.
每个请求必须路由到运行匹配模型的特定服务。路由通过
job_id
实现——注册表存储每个条目的
network_arch
,因此当用户指定模型而非
job_id
时,你可以通过架构解析目标。按以下顺序应用规则:
  1. 用户提供了明确的
    job_id
    → 使用该ID。验证它是否存在于
    state
    中。
  2. 用户指定了
    network_arch
    (例如“将此发送到cosmos-rl服务”)→ 查找匹配条目:
    candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch]
    • 恰好一个匹配项 → 使用它。
    • 多个匹配项 → 提示用户提供候选
      job_id
      及其
      started_at
      ;不要自动选择。
    • 无匹配项 → 停止操作并告知用户没有运行该架构的服务。
  3. 未提供
    job_id
    network_arch
    → 统计
    state
    中非
    "latest"
    的条目数量:
    • 恰好一个运行中的服务 → 使用它。
    • 两个或更多 → 绝不能静默默认使用
      state["latest"]
      。向用户显示完整列表(
      job_id
      ,
      network_arch
      ,
      host_url
      )并要求明确选择。
      "latest"
      指针是单服务工作流的便利工具,而非多服务共存时的路由回退。
    • 零个 → 停止操作并告知用户先启动服务。
解析完成后,通过注册表读取端点(
references/code-templates.yaml
request.registry_read
),传入解析后的
job_id
作为
user_provided_job_id
。向用户确认:“正在发送到job_id=… arch=… url=…”。如果服务可能仍在加载中,请先轮询就绪状态(
references/code-templates.yaml
readiness_check
)。
发送前交叉检查: 如果用户提供的请求体包含架构特定字段(例如
guidance
/
num_steps
/
seed
/
negative_prompt
→ cosmos-predict2.5;必填的
image_url
/
video_url
内容项 → cosmos-rl),请验证它们与
state[job_id]["network_arch"]
是否一致。如果不匹配,停止操作并询问用户——将cosmos-predict2.5的请求体发送到cosmos-rl服务会导致容器返回难以诊断的4xx/5xx错误,不如在此处提前拦截。

6.1 Sampling parameters — REQUIRED user prompt before each request

6.1 采样参数 —— 每次请求前必须提示用户

Before constructing the request body, you MUST explicitly prompt the user for the vLLM-style sampling parameters. Do not silently apply defaults. Use a structured prompt (e.g.
AskUserQuestion
in Claude Code, one question per field) that:
  1. Lists every applicable field with its type and default value.
  2. Lets the user skip / accept any field to take that field's default — entering a value is never required.
  3. Collects all fields in one round.
After the prompt, apply each user-entered value verbatim and substitute the default for any skipped field. Do not invent values or silently clamp.
Field list, defaults, and per-arch applicability:
references/request.yaml
chat_completions_request_body
(base sampling fields:
max_tokens
,
top_p
,
temperature
) and
network_arch_constraints.<network_arch>
(per-arch overrides and extras such as
guidance
/
num_steps
/
seed
/
negative_prompt
for
cosmos-predict2.5
). If a field is marked unsupported for the active arch, do not prompt for it and do not include it in the body.
构建请求体之前,你必须明确提示用户提供vLLM风格的采样参数。绝不能静默应用默认值。使用结构化提示(例如Claude Code中的
AskUserQuestion
,每个字段一个问题):
  1. 列出每个适用字段及其类型默认值
  2. 允许用户跳过/接受任何字段以使用该字段的默认值——无需强制输入值。
  3. 一次性收集所有字段。
提示后,直接使用用户输入的值,对跳过的字段使用默认值。不要自行生成值或静默限制范围。
字段列表、默认值和各架构适用性:
references/request.yaml
chat_completions_request_body
(基础采样字段:
max_tokens
,
top_p
,
temperature
)和
network_arch_constraints.<network_arch>
(各架构的覆盖项和额外参数,例如cosmos-predict2.5的
guidance
/
num_steps
/
seed
/
negative_prompt
)。如果某个字段标记为当前架构不支持,则不要提示用户,也不要将其包含在请求体中。

6.2 Request format

6.2 请求格式

Send a
POST
to
{BASE_URL}/v1/chat/completions
with
Content-Type: application/json
and a timeout of at least 300 s. The body is OpenAI-compatible (vLLM chat completions); see
references/request.yaml
chat_completions_request_body
for the full field schema and content-item shapes (text / image_url / video_url), and
code_examples
for ready-to-run Python and curl samples.
Constraints: only the first user message is processed. No secret values in request bodies. Per-network constraints (e.g. cosmos-rl requires every request to include an image or video; cosmos-rl rejects
data:
URIs) are in
references/request.yaml
network_arch_constraints
.
{BASE_URL}/v1/chat/completions
发送
POST
请求,设置
Content-Type: application/json
,超时时间至少为300秒。请求体为OpenAI兼容格式(vLLM聊天补全);完整字段模式和内容项格式(文本/image_url/video_url)见
references/request.yaml
chat_completions_request_body
,可直接运行的Python和curl示例见
code_examples
约束: 仅处理第一条用户消息。请求体中不得包含密钥值。各网络约束(例如cosmos-rl要求每个请求包含图片或视频;cosmos-rl拒绝
data:
URI)见
references/request.yaml
network_arch_constraints

6.3 Response handling

6.3 响应处理

HTTP statusMeaningAction
200Success —
choices[0].message.content
has the generated text
Read result
202Server still initializing or model still loadingRetry after a delay
503Initialization failed, model load failed, or model not yet readyInspect
error.type
:
model_not_ready
→ retry;
initialization_error
/
model_load_error
→ give up and check logs
400Missing or empty JSON bodyFix request
500Unhandled exception during inferenceCheck container logs
For 202 and 503, the body contains
{"error": {"type": "<error_type>", "message": "<reason>"}}
. See
container_response_shapes
in
references/request.yaml
for error type strings.
HTTP状态码含义操作
200成功 ——
choices[0].message.content
包含生成的文本
读取结果
202服务器仍在初始化或模型仍在加载延迟后重试
503初始化失败、模型加载失败,或模型尚未就绪检查
error.type
model_not_ready
→ 重试;
initialization_error
/
model_load_error
→ 放弃并检查日志
400请求体缺失或为空修复请求
500推理过程中出现未处理的异常检查容器日志
对于202和503状态码,响应体包含
{"error": {"type": "<error_type>", "message": "<reason>"}}
。错误类型字符串见
references/request.yaml
中的
container_response_shapes