tao-run-inference-service
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseTAO Inference Microservice
TAO推理微服务
Instructions
操作说明
To start an inference service:
- Collect required inputs (Section 1) and resolve the container image (Section 2).
- Build the job payload and inner command (Sections 3–4.1); use →
references/code-templates.yaml.job_payload_builder - Read and start the container (Section 4.2).
skills/platform/<platform>/SKILL.md - Write the service registry and poll readiness (Section 4.3); use →
references/code-templates.yamlandregistry_write.<platform>.readiness_check
To send an inference request:
- Resolve which service receives the request per Section 6.0 (by , by
job_id, or by explicit user choice when multiple services run — never silently default tonetwork_archwhen more than one service exists), then read the endpoint from"latest"→references/code-templates.yamlwith the resolvedrequest.registry_read.job_id - Before building the request body, prompt the user for the vLLM-style sampling parameters (Section 6.1). Present ,
max_tokens,top_p(and any per-arch extras) with their defaults; let the user override or skip each one to accept the default. Never silently use defaults.temperature - Build and send the body per Section 6.2; handle the response per Section 6.3.
To stop a service: Read → to resolve the job_id, read , then follow Section 5.
references/code-templates.yamlstop.registry_readskills/platform/<platform>/SKILL.mdReference data (schemas, mappings, valid values — no instructions):
- — image mappings, valid
references/service.yamlnames, job payload schema, env var names, secrets classification.network_arch - — endpoint definition, request field schema, response shapes, code examples.
references/request.yaml - — Python templates for payload building, registry writes, readiness checks, and stop/request flows.
references/code-templates.yaml
启动推理服务:
- 收集所需输入(第1节)并解析容器镜像(第2节)。
- 构建作业负载和内部命令(第3–4.1节);使用中的
references/code-templates.yaml模板。job_payload_builder - 阅读并启动容器(第4.2节)。
skills/platform/<platform>/SKILL.md - 写入服务注册表并轮询就绪状态(第4.3节);使用中的
references/code-templates.yaml和registry_write.<platform>模板。readiness_check
发送推理请求:
- 根据第6.0节的规则确定接收请求的服务(通过、
job_id,或者当多个服务运行时由用户明确选择——当存在多个服务时,绝不能默认使用network_arch而不告知用户),然后通过"latest"中的references/code-templates.yaml模板,传入解析后的request.registry_read来读取端点信息。job_id - 构建请求体之前,必须提示用户提供vLLM风格的采样参数(第6.1节)。列出、
max_tokens、top_p(以及各架构特有的额外参数)及其默认值;允许用户覆盖或跳过每个参数以接受默认值。绝不能静默使用默认值。temperature - 根据第6.2节构建并发送请求体;根据第6.3节处理响应。
停止服务: 通过中的模板解析,阅读,然后按照第5节的步骤操作。
references/code-templates.yamlstop.registry_readjob_idskills/platform/<platform>/SKILL.md参考数据(模式、映射、有效值——不含操作说明):
- —— 镜像映射、有效的
references/service.yaml名称、作业负载模式、环境变量名称、密钥分类。network_arch - —— 端点定义、请求字段模式、响应格式、代码示例。
references/request.yaml - —— 用于负载构建、注册表写入、就绪检查以及停止/请求流程的Python模板。
references/code-templates.yaml
Secrets rule (applies to every generated code block in this skill)
密钥规则(适用于本skill中生成的所有代码块)
Never ask the user to type a secret value into a prompt. For every secret value:
- Tell the user which environment variable to set (e.g. ).
export HF_TOKEN=... - Generate code that reads it with — never hard-code, interpolate, or prompt for the value.
os.environ["VAR_NAME"]
Secret env vars (full list in → ):
, , , , , .
references/service.yamlsecrets_handlingHF_TOKENWANDB_API_KEYCLEARML_API_ACCESS_KEYCLEARML_API_SECRET_KEYTAO_API_KEYTAO_USER_KEYSafe to collect in the prompt: , , , prompt text, config URLs, URLs.
network_archmodel_pathnum_gpusWANDB_*CLEARML_*_HOST绝不能要求用户在提示框中输入密钥值。对于每个密钥值:
- 告知用户需要设置哪个环境变量(例如)。
export HF_TOKEN=... - 生成使用读取该变量的代码——绝不能硬编码、插值或提示用户输入该值。
os.environ["VAR_NAME"]
密钥环境变量(完整列表见中的):
, , , , , 。
references/service.yamlsecrets_handlingHF_TOKENWANDB_API_KEYCLEARML_API_ACCESS_KEYCLEARML_API_SECRET_KEYTAO_API_KEYTAO_USER_KEY可在提示中收集的安全信息:, , , 提示文本, 配置URL, URL。
network_archmodel_pathnum_gpusWANDB_*CLEARML_*_HOST1. What to collect from the user
1. 需要向用户收集的信息
| Input | Role |
|---|---|
| Chooses container image, the per-arch inner command shape ( |
| The trained model checkpoint. Valid forms: |
| Compute platform: |
| Defaults to 1; minimum 1 for inference. |
| 输入项 | 作用 |
|---|---|
| 选择容器镜像、对应架构的内部命令格式( |
| 训练好的模型检查点。有效格式: |
| 计算平台: |
| 默认值为1;推理所需的最小值为1。 |
2. Image resolution
2. 镜像解析
Each has a sidecar config file named . Resolve the container image as follows:
network_arch{network_arch}.config.json- Read and take
{network_arch}.config.json(e.g.api_params.image). This is a key intoCOSMOS_RLindocker_image_defaults.mapping.references/service.yaml - Look up that key in the mapping. If the host env var is set (e.g.
IMAGE_<KEY>), it overrides the mapped default.IMAGE_COSMOS_RL - The mapped value is normally a dotted key into the repo-root manifest (e.g.
versions.yaml). Resolve it to a concretetao_toolkit.cosmos_rlimage URI by looking upnvcr.io/...→versions.yaml. Absolute URIs pass through unchanged, so animages.<group>.<name>env-var override that contains a full URI still works. The Python helper for this lives inIMAGE_<KEY>.references/code-templates.yaml - If the config file is missing or is empty, fall back to the
api_params.imagekey.COSMOS_RL
The config file also has which drives folder vs file path semantics: if the value contains the substring , the container treats the path as a directory.
spec_params.inference.model_pathfolder每个都有一个名为的辅助配置文件。按以下步骤解析容器镜像:
network_arch{network_arch}.config.json- 读取并获取
{network_arch}.config.json(例如api_params.image)。这是COSMOS_RL中references/service.yaml的一个键。docker_image_defaults.mapping - 在映射中查找该键。如果主机环境变量已设置(例如
IMAGE_<KEY>),则会覆盖映射的默认值。IMAGE_COSMOS_RL - 映射值通常是指向仓库根目录清单的点分隔键(例如
versions.yaml)。通过查找tao_toolkit.cosmos_rl→versions.yaml将其解析为具体的images.<group>.<name>镜像URI。绝对URI会直接保留,因此包含完整URI的nvcr.io/...环境变量覆盖仍然有效。此操作的Python助手位于IMAGE_<KEY>中。references/code-templates.yaml - 如果配置文件缺失或为空,则回退到
api_params.image键。COSMOS_RL
配置文件中还包含,用于驱动文件夹 vs 文件路径语义:如果该值包含子字符串,则容器会将路径视为目录。
spec_params.inference.model_pathfolder3. Environment variables (no callbacks)
3. 环境变量(无回调)
Set these in before encoding . Do not set or .
env_payloadenv_jsonTAO_LOGGING_SERVER_URLTAO_ADMIN_KEYTAO_EXECUTION_BACKEND| Platform | |
|---|---|
| local-docker | |
| brev | |
| lepton | |
| slurm | |
| kubernetes | |
CLOUD_BASED"False"TAO_LOGGING_SERVER_URLGPU env vars — only needed when the platform skill does not handle GPU injection automatically:
- Tegra / Jetson: with
--runtime=nvidiaandNVIDIA_DRIVER_CAPABILITIES=all.NVIDIA_VISIBLE_DEVICES=<ids> - Standard x86 + nvidia-container-toolkit: use Docker . The platform skill handles this.
device_requests
在编码之前,将这些变量设置到中。不要设置或。
env_jsonenv_payloadTAO_LOGGING_SERVER_URLTAO_ADMIN_KEYTAO_EXECUTION_BACKEND| 平台 | |
|---|---|
| local-docker | |
| brev | |
| lepton | |
| slurm | |
| kubernetes | |
CLOUD_BASED"False"TAO_LOGGING_SERVER_URLGPU环境变量 —— 仅当平台skill不自动处理GPU注入时才需要:
- Tegra / Jetson:使用,并设置
--runtime=nvidia和NVIDIA_DRIVER_CAPABILITIES=all。NVIDIA_VISIBLE_DEVICES=<ids> - 标准x86 + nvidia-container-toolkit:使用Docker的。平台skill会处理此操作。
device_requests
4. Executing across platforms
4. 跨平台执行
The job payload and inner command (Sections 1–3) are platform-agnostic. For each platform, read for preflight checks and credentials before generating any execution code.
skills/platform/<name>/SKILL.md作业负载和内部命令(第1–3节)是平台无关的。对于每个平台,在生成任何执行代码之前,请先阅读****中的预检检查和凭证要求。
skills/platform/<name>/SKILL.md4.1 Build the inner command (per arch)
4.1 构建内部命令(按架构)
The inner-command shape is per — there is no uniform template. Look up the per-arch entry in → ; if not present, the arch is unsupported — stop and ask. Pick the matching sub-block in → . Prefix the command with and keep it identical across platforms (local-docker, brev, lepton, slurm, kubernetes).
network_archreferences/service.yamlcontainer_commands.<network_arch>references/code-templates.yamljob_payload_builder.<network_arch>umask 0 &&Common across arches:
- : fresh
job_id— becomes the container name and registry key.uuid.uuid4() - : resolve per Section 2.
image - Secrets (,
access_key,secret_key, etc.) are read from env vars at runtime — never hard-code, never log or print.HF_TOKEN
Arch-specific notes (full details in → ):
references/service.yamlcontainer_commands- — single
cosmos-rlblob;--job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'+json.dumps(...).shlex.quote(...)carriesenv_payload(per Section 3 table),TAO_EXECUTION_BACKEND,TAO_API_JOB_ID. The inference service has no cloud-storage dependency;CLOUD_BASED=Falseis the only cred env var that ever applies (for gated HuggingFace models).HF_TOKEN - — flag-style
cosmos-predict2.5(nocosmos_predict inference_microservice start ... --port 8080prefix; usessetup.).tyro.conf.OmitArgPrefixes/--jobare not accepted. Translate--docker_env_varstomodel_path(local path) or--checkpoint-path(--model <registered_key>); cloud URIs are rejected. The only cred env var that ever applies ishf_model://for gated HuggingFace models. Per-request params (prompt, inference_type, num_output_frames, guidance, seed, num_steps, negative_prompt) go in the request body, not at startup.HF_TOKEN/TAO_EXECUTION_BACKEND/TAO_API_JOB_IDare unused and may be omitted.CLOUD_BASED
内部命令格式是基于的——没有统一模板。在 → 中查找对应架构的条目;如果不存在,则该架构不受支持——停止操作并询问用户。在 → 中选择匹配的子块。在命令前添加,并确保在所有平台(local-docker, brev, lepton, slurm, kubernetes)上完全一致。
network_archreferences/service.yamlcontainer_commands.<network_arch>references/code-templates.yamljob_payload_builder.<network_arch>umask 0 &&各架构通用规则:
- :生成新的
job_id——将作为容器名称和注册表键。uuid.uuid4() - :根据第2节解析得到。
image - 密钥(,
access_key,secret_key等)在运行时从环境变量读取——绝不能硬编码,绝不能记录或打印。HF_TOKEN
各架构特定说明(详细信息见 → ):
references/service.yamlcontainer_commands- —— 单个
cosmos-rl块;使用--job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'+json.dumps(...)处理。shlex.quote(...)包含env_payload(按第3节表格)、TAO_EXECUTION_BACKEND、TAO_API_JOB_ID。推理服务不依赖云存储;CLOUD_BASED=False是唯一适用的凭证环境变量(用于 gated HuggingFace模型)。HF_TOKEN - —— 标志式命令
cosmos-predict2.5(无cosmos_predict inference_microservice start ... --port 8080前缀;使用setup.)。不接受tyro.conf.OmitArgPrefixes/--job参数。将--docker_env_vars转换为model_path(本地路径)或--checkpoint-path(--model <registered_key>格式);拒绝云URI。唯一适用的凭证环境变量是用于 gated HuggingFace模型的hf_model://。每个请求的参数(提示、inference_type、num_output_frames、guidance、seed、num_steps、negative_prompt)放在请求体中,而不是启动时设置。HF_TOKEN/TAO_EXECUTION_BACKEND/TAO_API_JOB_ID未使用,可以省略。CLOUD_BASED
4.2 Delegate execution to the platform skill
4.2 将执行委托给平台skill
Read and follow it to start the container.
skills/platform/<platform>/SKILL.mdBase parameters (all platforms):
| Parameter | Value |
|---|---|
| resolved container image (Section 2) |
| |
| |
| |
| job / container name | |
| host-side port to bind to container port 8080. Default |
Platform-specific additional inputs:
| Platform | Additional inputs |
|---|---|
| local-docker | None beyond base |
| brev | |
| lepton | |
| slurm | |
| kubernetes | |
Port binding (local-docker and brev): use direct docker run (not DockerSDK) so that can be passed and the container name equals exactly.
-p <host_port>:8080job_idPort allocation rule (local-docker and brev, REQUIRED for concurrent services): Before starting a service, read the registry () and collect the set of values from every existing entry on the same platform (and, for brev, the same ). Pick the lowest free port starting from 8080 that is not in that set — e.g. . The default only applies when no other service is running. This is what makes "start 3 services, each reachable at a distinct " work; without it, services 2 and 3 fail with . Lepton, SLURM, and kubernetes get distinct endpoints from their own platform mechanisms and do not need this step.
/tmp/tao-inf-ms-state.jsonhost_portinstance_idhost_port = next(p for p in range(8080, 8200) if p not in used_ports)8080host_urlbind: address already in use阅读****并按照说明启动容器。
skills/platform/<platform>/SKILL.md基础参数(所有平台):
| 参数 | 值 |
|---|---|
| 解析后的容器镜像(第2节) |
| |
| |
| |
| 作业/容器名称 | |
| 主机端绑定到容器端口8080的端口。默认值为 |
平台特定附加输入:
| 平台 | 附加输入 |
|---|---|
| local-docker | 除基础参数外无其他输入 |
| brev | |
| lepton | |
| slurm | |
| kubernetes | |
端口绑定(local-docker和brev): 使用直接docker run命令(而非DockerSDK),以便可以传递参数,且容器名称与完全一致。
-p <host_port>:8080job_id端口分配规则(local-docker和brev,并发服务必填): 启动服务前,读取注册表()并收集同一平台(对于brev,还包括同一)上所有现有条目的值集合。选择从8080开始的最低可用端口,且该端口不在已使用集合中——例如。仅当没有其他服务运行时才使用默认值。此规则确保“启动3个服务,每个都可通过不同的访问”可行;否则,第2和第3个服务会因错误而失败。Lepton、SLURM和kubernetes通过各自的平台机制获取不同的端点,无需此步骤。
/tmp/tao-inf-ms-state.jsoninstance_idhost_porthost_port = next(p for p in range(8080, 8200) if p not in used_ports)8080host_urlbind: address already in use4.3 After start: service registry and endpoint
4.3 启动后:服务注册表和端点
Write the service registry immediately after the platform confirms the container is running. The registry () is keyed by ; always points to the most recently started service.
/tmp/tao-inf-ms-state.jsonjob_id"latest"See → for the Python template.
references/code-templates.yamlregistry_write.<platform>| Platform | | | Extra step before writing |
|---|---|---|---|
| local-docker | | — | None |
| brev | | — | |
| lepton | Lepton endpoint URL | | Poll |
| slurm | | SLURM scheduler job ID | Wait until Running; SSH port-forward |
| kubernetes | | k8s job name | |
After writing the registry, print the job_id and URL:
python
print(f"Inference service started.")
print(f" Job ID : {job_id}")
print(f" Arch : {network_arch}")
print(f" URL : {state[job_id]['host_url']}/v1/chat/completions")
print(f"Use this Job ID to send requests or stop the service.")Then poll for readiness — see → . The container loads the model in the background; do not send requests before it returns 200.
references/code-templates.yamlreadiness_check平台确认容器运行后,立即写入服务注册表。注册表()以为键;始终指向最近启动的服务。
/tmp/tao-inf-ms-state.jsonjob_id"latest"请参阅 → 中的Python模板。
references/code-templates.yamlregistry_write.<platform>| 平台 | | | 写入前的额外步骤 |
|---|---|---|---|
| local-docker | | — | 无 |
| brev | | — | 执行 |
| lepton | Lepton端点URL | | 轮询 |
| slurm | | SLURM调度器作业ID | 等待状态变为Running;通过SSH端口转发 |
| kubernetes | | k8s作业名称 | 执行 |
写入注册表后,打印job_id和URL:
python
print(f"Inference service started.")
print(f" Job ID : {job_id}")
print(f" Arch : {network_arch}")
print(f" URL : {state[job_id]['host_url']}/v1/chat/completions")
print(f"Use this Job ID to send requests or stop the service.")然后轮询就绪状态——请参阅 → 模板。容器会在后台加载模型;在返回200状态码之前不要发送请求。
references/code-templates.yamlreadiness_check5. Stopping the inference service
5. 停止推理服务
Ask the user for the to stop. If they don't provide one, default to and confirm which job_id is being stopped. Read the registry using → , then read and use its cancellation / stop mechanism.
job_idstate["latest"]references/code-templates.yamlstop.registry_readskills/platform/<platform>/SKILL.md| Platform | Identifier to pass | Extra cleanup |
|---|---|---|
| local-docker | | None |
| brev | | None |
| lepton | | None |
| slurm | | |
| kubernetes | | |
where . After stopping, clean up the registry: → .
entry = state[job_id_to_stop]references/code-templates.yamlstop.registry_cleanup询问用户要停止的。如果用户未提供,则默认使用并确认要停止的job_id。通过 → 模板读取注册表,然后阅读****并使用其取消/停止机制。
job_idstate["latest"]references/code-templates.yamlstop.registry_readskills/platform/<platform>/SKILL.md| 平台 | 需传递的标识符 | 额外清理操作 |
|---|---|---|
| local-docker | | 无 |
| brev | | 无 |
| lepton | | 无 |
| slurm | | 执行 |
| kubernetes | | 执行 |
其中。停止后,清理注册表:使用 → 模板。
entry = state[job_id_to_stop]references/code-templates.yamlstop.registry_cleanup6. Sending inference requests
6. 发送推理请求
6.0 Resolve which service receives this request (REQUIRED)
6.0 确定接收请求的服务(必填)
Each request must be routed to the specific service that runs the matching model. Routing happens by — the registry stores per entry, so you can resolve a target by arch when the user names a model instead of a . Apply these rules in order:
job_idnetwork_archjob_id- User provided an explicit → use it. Verify it exists in
job_id.state - User named a (e.g. "send this to the cosmos-rl service") → look up matching entries:
network_arch.candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch]- Exactly one match → use it.
- Multiple matches → prompt the user with the candidate s and their
job_id; do not auto-pick.started_at - No match → stop and tell the user no service for that arch is running.
- No and no
job_id→ count non-network_archentries in"latest":state- Exactly one running service → use it.
- Two or more → do not silently default to . Prompt the user with the full list (
state["latest"],job_id,network_arch) and require an explicit choice. Thehost_urlpointer is a convenience for single-service workflows, not a routing fallback when multiple services coexist."latest" - Zero → stop and tell the user to start a service first.
After resolving, read the endpoint from the registry ( → ), passing the resolved as . Confirm to the user: "Sending to job_id=… arch=… url=…". If the service may still be loading, poll readiness first ( → ).
references/code-templates.yamlrequest.registry_readjob_iduser_provided_job_idreferences/code-templates.yamlreadiness_checkCross-check before sending: if the user-supplied request body contains arch-specific fields (e.g. / / / → cosmos-predict2.5; required / content items → cosmos-rl), verify they are consistent with . On mismatch, stop and ask — sending a cosmos-predict2.5 body to a cosmos-rl service will fail at the container with a 4xx/5xx that is harder to diagnose than catching it here.
guidancenum_stepsseednegative_promptimage_urlvideo_urlstate[job_id]["network_arch"]每个请求必须路由到运行匹配模型的特定服务。路由通过实现——注册表存储每个条目的,因此当用户指定模型而非时,你可以通过架构解析目标。按以下顺序应用规则:
job_idnetwork_archjob_id- 用户提供了明确的→ 使用该ID。验证它是否存在于
job_id中。state - 用户指定了(例如“将此发送到cosmos-rl服务”)→ 查找匹配条目:
network_arch。candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch]- 恰好一个匹配项 → 使用它。
- 多个匹配项 → 提示用户提供候选及其
job_id;不要自动选择。started_at - 无匹配项 → 停止操作并告知用户没有运行该架构的服务。
- 未提供和
job_id→ 统计network_arch中非state的条目数量:"latest"- 恰好一个运行中的服务 → 使用它。
- 两个或更多 → 绝不能静默默认使用。向用户显示完整列表(
state["latest"],job_id,network_arch)并要求明确选择。host_url指针是单服务工作流的便利工具,而非多服务共存时的路由回退。"latest" - 零个 → 停止操作并告知用户先启动服务。
解析完成后,通过注册表读取端点( → ),传入解析后的作为。向用户确认:“正在发送到job_id=… arch=… url=…”。如果服务可能仍在加载中,请先轮询就绪状态( → )。
references/code-templates.yamlrequest.registry_readjob_iduser_provided_job_idreferences/code-templates.yamlreadiness_check发送前交叉检查: 如果用户提供的请求体包含架构特定字段(例如/// → cosmos-predict2.5;必填的/内容项 → cosmos-rl),请验证它们与是否一致。如果不匹配,停止操作并询问用户——将cosmos-predict2.5的请求体发送到cosmos-rl服务会导致容器返回难以诊断的4xx/5xx错误,不如在此处提前拦截。
guidancenum_stepsseednegative_promptimage_urlvideo_urlstate[job_id]["network_arch"]6.1 Sampling parameters — REQUIRED user prompt before each request
6.1 采样参数 —— 每次请求前必须提示用户
Before constructing the request body, you MUST explicitly prompt the user for the vLLM-style sampling parameters. Do not silently apply defaults. Use a structured prompt (e.g. in Claude Code, one question per field) that:
AskUserQuestion- Lists every applicable field with its type and default value.
- Lets the user skip / accept any field to take that field's default — entering a value is never required.
- Collects all fields in one round.
After the prompt, apply each user-entered value verbatim and substitute the default for any skipped field. Do not invent values or silently clamp.
Field list, defaults, and per-arch applicability: → (base sampling fields: , , ) and (per-arch overrides and extras such as /// for ). If a field is marked unsupported for the active arch, do not prompt for it and do not include it in the body.
references/request.yamlchat_completions_request_bodymax_tokenstop_ptemperaturenetwork_arch_constraints.<network_arch>guidancenum_stepsseednegative_promptcosmos-predict2.5构建请求体之前,你必须明确提示用户提供vLLM风格的采样参数。绝不能静默应用默认值。使用结构化提示(例如Claude Code中的,每个字段一个问题):
AskUserQuestion- 列出每个适用字段及其类型和默认值。
- 允许用户跳过/接受任何字段以使用该字段的默认值——无需强制输入值。
- 一次性收集所有字段。
提示后,直接使用用户输入的值,对跳过的字段使用默认值。不要自行生成值或静默限制范围。
字段列表、默认值和各架构适用性: 见 → (基础采样字段:, , )和(各架构的覆盖项和额外参数,例如cosmos-predict2.5的///)。如果某个字段标记为当前架构不支持,则不要提示用户,也不要将其包含在请求体中。
references/request.yamlchat_completions_request_bodymax_tokenstop_ptemperaturenetwork_arch_constraints.<network_arch>guidancenum_stepsseednegative_prompt6.2 Request format
6.2 请求格式
Send a to with and a timeout of at least 300 s. The body is OpenAI-compatible (vLLM chat completions); see → for the full field schema and content-item shapes (text / image_url / video_url), and for ready-to-run Python and curl samples.
POST{BASE_URL}/v1/chat/completionsContent-Type: application/jsonreferences/request.yamlchat_completions_request_bodycode_examplesConstraints: only the first user message is processed. No secret values in request bodies. Per-network constraints (e.g. cosmos-rl requires every request to include an image or video; cosmos-rl rejects URIs) are in → .
data:references/request.yamlnetwork_arch_constraints向发送请求,设置,超时时间至少为300秒。请求体为OpenAI兼容格式(vLLM聊天补全);完整字段模式和内容项格式(文本/image_url/video_url)见 → ,可直接运行的Python和curl示例见。
{BASE_URL}/v1/chat/completionsPOSTContent-Type: application/jsonreferences/request.yamlchat_completions_request_bodycode_examples约束: 仅处理第一条用户消息。请求体中不得包含密钥值。各网络约束(例如cosmos-rl要求每个请求包含图片或视频;cosmos-rl拒绝 URI)见 → 。
data:references/request.yamlnetwork_arch_constraints6.3 Response handling
6.3 响应处理
| HTTP status | Meaning | Action |
|---|---|---|
| 200 | Success — | Read result |
| 202 | Server still initializing or model still loading | Retry after a delay |
| 503 | Initialization failed, model load failed, or model not yet ready | Inspect |
| 400 | Missing or empty JSON body | Fix request |
| 500 | Unhandled exception during inference | Check container logs |
For 202 and 503, the body contains . See in for error type strings.
{"error": {"type": "<error_type>", "message": "<reason>"}}container_response_shapesreferences/request.yaml| HTTP状态码 | 含义 | 操作 |
|---|---|---|
| 200 | 成功 —— | 读取结果 |
| 202 | 服务器仍在初始化或模型仍在加载 | 延迟后重试 |
| 503 | 初始化失败、模型加载失败,或模型尚未就绪 | 检查 |
| 400 | 请求体缺失或为空 | 修复请求 |
| 500 | 推理过程中出现未处理的异常 | 检查容器日志 |
对于202和503状态码,响应体包含。错误类型字符串见中的。
{"error": {"type": "<error_type>", "message": "<reason>"}}references/request.yamlcontainer_response_shapes