TAO Inference Microservice
Instructions
To start an inference service:
- Collect required inputs (Section 1) and resolve the container image (Section 2).
- Build the job payload and inner command (Sections 3–4.1); use
references/code-templates.yaml
→ .
- Read
skills/platform/<platform>/SKILL.md
and start the container (Section 4.2).
- Write the service registry and poll readiness (Section 4.3); use
references/code-templates.yaml
→ registry_write.<platform>
and .
To send an inference request:
- Resolve which service receives the request per Section 6.0 (by , by , or by explicit user choice when multiple services run — never silently default to when more than one service exists), then read the endpoint from
references/code-templates.yaml
→ with the resolved .
- Before building the request body, prompt the user for the vLLM-style sampling parameters (Section 6.1). Present , , (and any per-arch extras) with their defaults; let the user override or skip each one to accept the default. Never silently use defaults.
- Build and send the body per Section 6.2; handle the response per Section 6.3.
To stop a service: Read
references/code-templates.yaml
→
to resolve the job_id, read
skills/platform/<platform>/SKILL.md
, then follow Section 5.
Reference data (schemas, mappings, valid values — no instructions):
- — image mappings, valid names, job payload schema, env var names, secrets classification.
- — endpoint definition, request field schema, response shapes, code examples.
references/code-templates.yaml
— Python templates for payload building, registry writes, readiness checks, and stop/request flows.
Secrets rule (applies to every generated code block in this skill)
Never ask the user to type a secret value into a prompt. For every secret value:
- Tell the user which environment variable to set (e.g. ).
- Generate code that reads it with — never hard-code, interpolate, or prompt for the value.
Secret env vars (full list in
→
):
,
,
,
,
,
.
Safe to collect in the prompt: ,
,
, prompt text,
config URLs,
URLs.
1. What to collect from the user
| Input | Role |
|---|
| Chooses container image, the per-arch inner command shape ( → container_commands.<network_arch>
), and in the job JSON when applicable. Must match a basename in valid_network_arch_config_basenames
in (e.g. , ). |
| The trained model checkpoint. Valid forms: (HuggingFace Hub — set for gated models) or a local container filesystem path. Cloud URIs (, , ) are NOT supported — the inference service has no cloud-storage dependency. Always ask the user; never substitute a placeholder. See → . |
| Compute platform: , , , , or . |
| Defaults to 1; minimum 1 for inference. |
2. Image resolution
Each
has a sidecar config file named
{network_arch}.config.json
. Resolve the container image as follows:
- Read
{network_arch}.config.json
and take (e.g. ). This is a key into docker_image_defaults.mapping
in .
- Look up that key in the mapping. If the host env var is set (e.g. ), it overrides the mapped default.
- The mapped value is normally a dotted key into the repo-root manifest (e.g. ). Resolve it to a concrete image URI by looking up → . Absolute URIs pass through unchanged, so an env-var override that contains a full URI still works. The Python helper for this lives in
references/code-templates.yaml
.
- If the config file is missing or is empty, fall back to the key.
The config file also has
spec_params.inference.model_path
which drives
folder vs file path semantics: if the value contains the substring
, the container treats the path as a directory.
3. Environment variables (no callbacks)
Set these in
before encoding
. Do
not set
or
.
— must match the platform:
| Platform | value |
|---|
| local-docker | |
| brev | |
| lepton | |
| slurm | |
| kubernetes | |
— always
for this skill (disables callback posting to
).
GPU env vars — only needed when the platform skill does not handle GPU injection automatically:
- Tegra / Jetson: with
NVIDIA_DRIVER_CAPABILITIES=all
and NVIDIA_VISIBLE_DEVICES=<ids>
.
- Standard x86 + nvidia-container-toolkit: use Docker . The platform skill handles this.
4. Executing across platforms
The job payload and inner command (Sections 1–3) are
platform-agnostic. For each platform, read
skills/platform/<name>/SKILL.md
for preflight checks and credentials
before generating any execution code.
4.1 Build the inner command (per arch)
The inner-command shape is
per — there is no uniform template. Look up the per-arch entry in
→
container_commands.<network_arch>
; if not present, the arch is unsupported — stop and ask. Pick the matching sub-block in
references/code-templates.yaml
→
job_payload_builder.<network_arch>
. Prefix the command with
and keep it
identical across platforms (local-docker, brev, lepton, slurm, kubernetes).
Common across arches:
- : fresh — becomes the container name and registry key.
- : resolve per Section 2.
- Secrets (, , , etc.) are read from env vars at runtime — never hard-code, never log or print.
Arch-specific notes (full details in
→
):
- — single
--job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'
blob; + . carries (per Section 3 table), , . The inference service has no cloud-storage dependency; is the only cred env var that ever applies (for gated HuggingFace models).
- — flag-style
cosmos_predict inference_microservice start ... --port 8080
(no prefix; uses tyro.conf.OmitArgPrefixes
). / are not accepted. Translate to (local path) or (); cloud URIs are rejected. The only cred env var that ever applies is for gated HuggingFace models. Per-request params (prompt, inference_type, num_output_frames, guidance, seed, num_steps, negative_prompt) go in the request body, not at startup. // are unused and may be omitted.
4.2 Delegate execution to the platform skill
Read
skills/platform/<platform>/SKILL.md
and follow it to start the container.
Base parameters (all platforms):
| Parameter | Value |
|---|
| resolved container image (Section 2) |
| — the shell string built in Section 4.1 |
| |
| |
| job / container name | — must equal the UUID from 4.1 so the registry can reference it |
| (local-docker, brev) | host-side port to bind to container port 8080. Default , but must be unique per concurrent service — see the port-allocation rule below. |
Platform-specific additional inputs:
| Platform | Additional inputs |
|---|
| local-docker | None beyond base |
| brev | (optional — reuse an existing instance); on multi-credential / multi-workspace accounts also and for first-create — see skills/platform/tao-run-on-brev/SKILL.md
|
| lepton | (GPU shape ID, e.g. ); (optional) |
| slurm | and — check / env vars; ask user if unset |
| kubernetes | (default: ); (required for images) |
Port binding (local-docker and brev): use
direct docker run (not DockerSDK) so that
can be passed and the container name equals
exactly.
Port allocation rule (local-docker and brev, REQUIRED for concurrent services): Before starting a service, read the registry (
/tmp/tao-inf-ms-state.json
) and collect the set of
values from every existing entry on the same platform (and, for brev, the same
). Pick the
lowest free port starting from 8080 that is not in that set — e.g.
host_port = next(p for p in range(8080, 8200) if p not in used_ports)
. The default
only applies when no other service is running. This is what makes "start 3 services, each reachable at a distinct
" work; without it, services 2 and 3 fail with
bind: address already in use
. Lepton, SLURM, and kubernetes get distinct endpoints from their own platform mechanisms and do not need this step.
4.3 After start: service registry and endpoint
Write the service registry immediately after the platform confirms the container is running. The registry (
/tmp/tao-inf-ms-state.json
) is keyed by
;
always points to the most recently started service.
See
references/code-templates.yaml
→
registry_write.<platform>
for the Python template.
| Platform | | | Extra step before writing |
|---|
| local-docker | http://localhost:{host_port}
| — | None |
| brev | http://{brev_ip}:{host_port}
| — | → get instance IP ( is invalid on remote VM) |
| lepton | Lepton endpoint URL | | Poll until Running; get endpoint from console or |
| slurm | http://localhost:{host_port}
| SLURM scheduler job ID | Wait until Running; SSH port-forward localhost:{host_port}→{node}:8080
|
| kubernetes | http://{external_ip}:8080
| k8s job name | kubectl expose job … --type=LoadBalancer
; wait for external IP |
After writing the registry, print the job_id and URL:
python
print(f"Inference service started.")
print(f" Job ID : {job_id}")
print(f" Arch : {network_arch}")
print(f" URL : {state[job_id]['host_url']}/v1/chat/completions")
print(f"Use this Job ID to send requests or stop the service.")
Then poll for readiness — see
references/code-templates.yaml
→
. The container loads the model in the background; do not send requests before it returns 200.
5. Stopping the inference service
Ask the user for the
to stop. If they don't provide one, default to
and confirm which job_id is being stopped. Read the registry using
references/code-templates.yaml
→
, then read
skills/platform/<platform>/SKILL.md
and use its cancellation / stop mechanism.
| Platform | Identifier to pass | Extra cleanup |
|---|
| local-docker | — container name | None |
| brev | — container name | None |
| lepton | — Lepton job ID | None |
| slurm | — SLURM job ID | pkill -f "ssh.*-L.*{entry['host_port']}"
|
| kubernetes | — k8s job name | kubectl delete svc {entry["platform_job_id"]} -n <namespace>
|
where
entry = state[job_id_to_stop]
. After stopping, clean up the registry:
references/code-templates.yaml
→
.
6. Sending inference requests
6.0 Resolve which service receives this request (REQUIRED)
Each request must be routed to the
specific service that runs the matching model. Routing happens by
— the registry stores
per entry, so you can resolve a target by arch when the user names a model instead of a
. Apply these rules in order:
- User provided an explicit → use it. Verify it exists in .
- User named a (e.g. "send this to the cosmos-rl service") → look up matching entries:
candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch]
.
- Exactly one match → use it.
- Multiple matches → prompt the user with the candidate s and their ; do not auto-pick.
- No match → stop and tell the user no service for that arch is running.
- No and no → count non- entries in :
- Exactly one running service → use it.
- Two or more → do not silently default to . Prompt the user with the full list (, , ) and require an explicit choice. The pointer is a convenience for single-service workflows, not a routing fallback when multiple services coexist.
- Zero → stop and tell the user to start a service first.
After resolving, read the endpoint from the registry (
references/code-templates.yaml
→
), passing the resolved
as
. Confirm to the user: "Sending to job_id=… arch=… url=…". If the service may still be loading, poll readiness first (
references/code-templates.yaml
→
).
Cross-check before sending: if the user-supplied request body contains arch-specific fields (e.g.
/
/
/
→ cosmos-predict2.5; required
/
content items → cosmos-rl), verify they are consistent with
state[job_id]["network_arch"]
. On mismatch, stop and ask — sending a cosmos-predict2.5 body to a cosmos-rl service will fail at the container with a 4xx/5xx that is harder to diagnose than catching it here.
6.1 Sampling parameters — REQUIRED user prompt before each request
Before constructing the request body, you
MUST explicitly prompt the user for the vLLM-style sampling parameters. Do
not silently apply defaults. Use a structured prompt (e.g.
in Claude Code, one question per field) that:
- Lists every applicable field with its type and default value.
- Lets the user skip / accept any field to take that field's default — entering a value is never required.
- Collects all fields in one round.
After the prompt, apply each user-entered value verbatim and substitute the default for any skipped field. Do not invent values or silently clamp.
Field list, defaults, and per-arch applicability: →
chat_completions_request_body
(base sampling fields:
,
,
) and
network_arch_constraints.<network_arch>
(per-arch overrides and extras such as
/
/
/
for
). If a field is marked unsupported for the active arch, do
not prompt for it and do
not include it in the body.
6.2 Request format
Send a
to
{BASE_URL}/v1/chat/completions
with
Content-Type: application/json
and a timeout of
at least 300 s. The body is OpenAI-compatible (vLLM chat completions); see
→
chat_completions_request_body
for the full field schema and content-item shapes (text / image_url / video_url), and
for ready-to-run Python and curl samples.
Constraints: only the first user message is processed. No secret values in request bodies.
Per-network constraints (e.g. cosmos-rl requires every request to include an image or video; cosmos-rl rejects
URIs) are in
→
.
6.3 Response handling
| HTTP status | Meaning | Action |
|---|
| 200 | Success — choices[0].message.content
has the generated text | Read result |
| 202 | Server still initializing or model still loading | Retry after a delay |
| 503 | Initialization failed, model load failed, or model not yet ready | Inspect : → retry; / → give up and check logs |
| 400 | Missing or empty JSON body | Fix request |
| 500 | Unhandled exception during inference | Check container logs |
For 202 and 503, the body contains
{"error": {"type": "<error_type>", "message": "<reason>"}}
. See
container_response_shapes
in
for error type strings.