physical-ai-infrastructure-setup-and-resilient-scaling

Physical AI Infrastructure Setup And Resilient Scaling

Canonical skill for the Physical AI infrastructure stack. Use it to compose cluster, inference, OSMO, and workload stages into a reproducible Physical AI synthetic data generation (SDG) environment, then keep the environment observable and recoverable.

Operating Rules

  • Read only the component references needed for the selected target. Do not load every component by default.
  • Keep the repo as the durable artifact. Fix checked-in config or scripts, then rerun. Do not recover a failed install with untracked one-off changes.
  • Run mutating cluster, OSMO, Helm, Terraform, or Azure operations through checked-in scripts when a script exists. Read-only diagnostics are allowed.
  • Stop at the first red gate. Fix the lowest owning layer in this order: config, script, then skill guidance.
  • Derive values from the environment when possible. Ask only for values that cannot be inferred, such as API keys, target choice, or quota tradeoffs.
  • Store secrets in `${REPO_ROOT}/.env`. Cluster-derived values such as storage, database, Redis, and endpoint names come from Terraform outputs or platform queries, not `.env`.
  • Preflight means no deployed state: no cluster API, Terraform outputs, Helm releases, OSMO pools, or workflow state. Those belong to deploy/verify gates.
  • Never print, echo, or paste raw keys into commands, YAML, logs, or transcripts. Prefer credential handles, Kubernetes `secretKeyRef`, and runtime-only secret injection. Scan raw transcript exports with `scripts/scan_transcript_secrets.py` before sharing.
  • Use absolute paths. Derive repo root with `git rev-parse --show-toplevel`.
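The path and secret-handling rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's own tooling: the helper names `repo_root`, `load_env`, and `describe` are hypothetical, and the `.env` parser assumes plain `KEY=VALUE` lines.

```python
import subprocess
from pathlib import Path


def repo_root(cwd: str = ".") -> Path:
    """Return the absolute repo root as reported by git."""
    out = subprocess.run(
        ["git", "rev-parse", "--show-toplevel"],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return Path(out.stdout.strip()).resolve()


def load_env(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments (assumed format)."""
    values: dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def describe(values: dict[str, str]) -> list[str]:
    """Report which keys are set without ever emitting their values."""
    return [f"{k}: {'set' if v else 'EMPTY'}" for k, v in sorted(values.items())]
```

A caller would typically run `describe(load_env(repo_root() / ".env"))` and log only that summary, so raw keys never reach a transcript.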

Component References

Each component lives inside this skill so the stack has one canonical trigger. Load the component reference only when the selected target needs that slice.
| Concern | Load | Assets |
| --- | --- | --- |
| Stage matrix and old driver notes | `components/driver/reference.md` | None |
| MicroK8s cluster | `components/cluster-microk8s/reference.md` | `components/cluster-microk8s/scripts/`, `components/cluster-microk8s/runtimeclass-nvidia-runc.yaml` |
| Azure AKS cluster | `components/cluster-azure/reference.md` | `components/cluster-azure/scripts/`, `components/cluster-azure/terraform/` |
| NIM Operator inference | `components/inference-nim-operator/reference.md` | `components/inference-nim-operator/scripts/`, `components/inference-nim-operator/nims/` |
| NVCF inference | `components/inference-nvcf/reference.md` | `components/inference-nvcf/scripts/` |
| Azure AI Foundry inference | `components/inference-azure/reference.md` | `components/inference-azure/scripts/` |
| MicroK8s OSMO | `components/osmo-k8s/reference.md` | `components/osmo-k8s/scripts/`, upstream OSMO deploy scripts |
| Azure OSMO | `components/osmo-azure/reference.md` | `components/osmo-azure/scripts/`, upstream OSMO deploy scripts plus Azure TF outputs |
| Azure access setup | `components/azure-access/reference.md` | None |
| OSMO CLI and workflow operations | `components/osmo-cli/reference.md` | `components/osmo-cli/scripts/`, `components/osmo-cli/references/`, `components/osmo-cli/agents/`, `components/osmo-cli/tests/` |
| OpenClaw Azure device login | `components/openclaw-azure-login/reference.md` | None |

OSMO CLI Support Files

The OSMO CLI component has second-level support files because its command and workflow surface is large. Load these directly only for the stated case.
| File | Read when |
| --- | --- |
| `components/osmo-cli/agents/workflow-expert.md` | Spawning a workflow-generation or workflow-failure subagent. |
| `components/osmo-cli/agents/logs-reader.md` | Spawning a log summarization subagent for OSMO workflow failures. |
| `components/osmo-cli/references/cli-commands.md` | Exact OSMO CLI flags, payloads, or command syntax are needed. |
| `components/osmo-cli/references/workflow-spec.md` | Workflow YAML schema, credentials, outputs, or provider fields are needed. |
| `components/osmo-cli/references/workflow-patterns.md` | Multi-task, data dependency, Jinja, serial, or parallel workflow design is needed. |
| `components/osmo-cli/references/advanced-patterns.md` | Checkpointing, retry/exit behavior, or node exclusion is needed. |
| `components/osmo-cli/tests/orchestrator-runtime-failure.md` | Validating or debugging the OSMO orchestration review pattern. |

Target Selection

Pick exactly one option per stage. The stage 2 choice is determined by the stage 1 choice.
  1. Kubernetes: `MicroK8s` or `Azure`
  2. OSMO: `MicroK8s OSMO` when Kubernetes is MicroK8s, `Azure OSMO` when Kubernetes is Azure
  3. Inference: `NIM Operator`, `NVCF`, `Azure AI Foundry`, or `None`
  4. Workload: Video Data Augmentation, Defect Image Generation, NuRec Carline Adaptation, NRE, NCore, Asset Harvester, or custom workflow YAML
Reject invalid combinations before provisioning:
| Cluster | NIM Operator | NVCF | Azure AI Foundry |
| --- | --- | --- | --- |
| MicroK8s | yes | yes | no, Foundry requires Azure identities |
| Azure | yes | yes | yes |
For OpenClaw or any chat-only environment that cannot open a browser, read `components/openclaw-azure-login/reference.md` before Azure prerequisites. For any Azure target, read `components/azure-access/reference.md` before Azure component preflights.
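The combination rules above lend themselves to a small check that runs before any provisioning. The following is a sketch only: `validate_targets` is a hypothetical helper, and it assumes that `None` is an acceptable inference choice on either cluster, which the matrix does not state explicitly.

```python
CLUSTERS = {"MicroK8s", "Azure"}

# Stage 2 is fully determined by stage 1.
OSMO_FOR_CLUSTER = {"MicroK8s": "MicroK8s OSMO", "Azure": "Azure OSMO"}

# Inference backends permitted per cluster; "None" is assumed valid everywhere.
INFERENCE_ALLOWED = {
    "MicroK8s": {"NIM Operator", "NVCF", "None"},
    "Azure": {"NIM Operator", "NVCF", "Azure AI Foundry", "None"},
}


def validate_targets(cluster: str, osmo: str, inference: str) -> list[str]:
    """Return a list of problems; an empty list means the combination is valid."""
    if cluster not in CLUSTERS:
        return [f"unknown cluster: {cluster}"]
    errors = []
    expected_osmo = OSMO_FOR_CLUSTER[cluster]
    if osmo != expected_osmo:
        errors.append(f"OSMO must be {expected_osmo} when Kubernetes is {cluster}")
    if inference not in INFERENCE_ALLOWED[cluster]:
        errors.append(f"{inference} is not available on {cluster}")
    return errors
```

Returning a list of problems rather than raising on the first one lets the caller report every invalid choice in a single pass.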

Setup Flow

  1. Confirm target choices and workload compute requirements.
  2. Load the selected component references.
  3. Resolve prerequisites up front, including API keys, Azure access, caller CIDR, GPU quota, storage class, and OSMO login requirements.
  4. Run `scripts/preflight.sh` for every selected infrastructure component plus any OSMO CLI/workload preflight before provisioning; build the implementation plan from the results and stop on red preflight.
  5. Deploy Kubernetes first. Nothing else starts until the cluster gate is green.
  6. Deploy OSMO and inference after Kubernetes. These can proceed in parallel once the cluster exists, but workload submission waits for both selected gates.
  7. Submit the workload only after OSMO, storage credentials, compute pool, and selected inference endpoints are verified. For Video Data Augmentation (VDA), this includes `preflight_credentials.sh`, `pre_submit_guard.py` with resolved `--set` values, non-empty model-cache prefixes, and workflow-namespace endpoint smoke checks.
  8. Monitor through completion. On failed workflow state, inspect events and logs from `components/osmo-cli/reference.md`; do not resubmit blindly.
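The ordering in the steps above, together with the "stop at the first red gate" rule, can be sketched as a simple gate runner. This is an illustration under stated assumptions: `run_gates` and the stage names are hypothetical, and the runner is strictly sequential even though OSMO and inference may proceed in parallel in practice.

```python
from typing import Callable, Optional

Gate = Callable[[], bool]


def run_gates(gates: list[tuple[str, Gate]]) -> tuple[list[str], Optional[str]]:
    """Run gates in order and stop at the first one that fails.

    Returns the names of the gates that passed and the name of the
    failing gate, or None when every gate is green.
    """
    passed: list[str] = []
    for name, check in gates:
        if not check():
            return passed, name  # later stages are never started
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Placeholder checks; real gates would call the checked-in scripts.
    stages = [
        ("preflight", lambda: True),
        ("kubernetes", lambda: True),
        ("osmo", lambda: False),
        ("inference", lambda: True),
        ("workload", lambda: True),
    ]
    print(run_gates(stages))
```

Because the runner returns as soon as a gate is red, a failure in `osmo` here means the `workload` check is never evaluated.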

Inference Discovery

Avoid over-deploying expensive endpoints.
  1. Scan the chosen workflow spec and default values for endpoint references: `*.osmo-nims.svc.cluster.local`, `api.nvcf.nvidia.com/*`, `*.inference.ai.azure.com`, or `*.cognitiveservices.azure.com`.
  2. Map each reference to the selected backend:
    • NIM Operator: service name must match a directory under `components/inference-nim-operator/nims/`.
    • NVCF: function URL or function ID must be supplied by the environment.
    • Azure AI Foundry: endpoint name must be deployed through `components/inference-azure/scripts/install.sh`.
  3. If the workflow needs a capability the selected backend lacks, stop and report the mismatch. Do not silently substitute another model.
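The scan-and-map steps above could look roughly like the following. It is a sketch, not the skill's actual scanner: the regular expressions are one plausible reading of the wildcard patterns listed in step 1, and `find_endpoints` and `mismatches` are hypothetical names.

```python
import re

# One regex per backend, derived from the wildcard patterns in step 1.
BACKEND_PATTERNS = {
    "NIM Operator": re.compile(r"\b[\w.-]+\.osmo-nims\.svc\.cluster\.local\b"),
    "NVCF": re.compile(r"\bapi\.nvcf\.nvidia\.com/\S*"),
    "Azure AI Foundry": re.compile(
        r"\b[\w.-]+\.(?:inference\.ai|cognitiveservices)\.azure\.com\b"
    ),
}


def find_endpoints(text: str) -> dict[str, list[str]]:
    """Group endpoint references found in a workflow spec by backend."""
    found: dict[str, list[str]] = {}
    for backend, pattern in BACKEND_PATTERNS.items():
        hits = sorted(set(pattern.findall(text)))
        if hits:
            found[backend] = hits
    return found


def mismatches(found: dict[str, list[str]], selected: str) -> list[str]:
    """Backends the workflow references that differ from the selected one."""
    return sorted(backend for backend in found if backend != selected)
```

A non-empty result from `mismatches` corresponds to step 3: stop and report rather than substitute another model.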

Verification Gates

Each stage has its own Verify section in the component reference. These gates are mandatory:
| Stage | Gate |
| --- | --- |
| Kubernetes | Cluster API reachable, nodes Ready, GPU capacity advertised for GPU paths, and CPU+NVCF paths have `runtimeclass/nvidia` mapped to `runc`. |
| Inference | Every endpoint referenced by the workload is reachable. NIM readiness uses `/v1/health/ready`; NVCF and Foundry still need task-specific authenticated checks. |
| OSMO | OSMO pods Ready, pool ONLINE, port-forward watchdogs alive, storage credentials configured, and verify-hello workflow COMPLETED. |
| Workload | Selected workload pre-submit guards pass before submit. `osmo workflow query <id>` reports `COMPLETED` and every task is green. Failed terminal states require events and logs before retry. |
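As an illustration of the Kubernetes gate, the node checks could be evaluated against the JSON that a read-only `kubectl get nodes -o json` returns. This is a minimal sketch: `node_gate` is a hypothetical helper, and it assumes GPUs are advertised under the `nvidia.com/gpu` resource name used by the NVIDIA device plugin.

```python
def node_gate(nodes_json: dict, require_gpu: bool) -> list[str]:
    """Return problems found in a node list; an empty list means the gate is green."""
    problems: list[str] = []
    items = nodes_json.get("items", [])
    if not items:
        problems.append("no nodes returned")
    gpu_total = 0
    for node in items:
        name = node["metadata"]["name"]
        status = node.get("status", {})
        # A node is Ready when its "Ready" condition has status "True".
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        if not ready:
            problems.append(f"{name}: not Ready")
        # Capacity values are strings in the Kubernetes API.
        gpu_total += int(status.get("capacity", {}).get("nvidia.com/gpu", "0"))
    if require_gpu and gpu_total == 0:
        problems.append("no nvidia.com/gpu capacity advertised")
    return problems
```

For CPU+NVCF paths, `require_gpu=False` skips the capacity check; the RuntimeClass mapping would need its own check, which this sketch does not cover.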

Resilient Scaling

  • Size the cluster from workload needs before provisioning. For Azure, check CPU and GPU quota for the selected VM families before `terraform apply`.
  • For NIM Operator, deploy only the NIMServices referenced by the workload. Each service pins GPU and model-cache storage for the lifetime of the cluster.
  • Keep OSMO storage URL schemes aligned with the active backend: local MicroK8s uses MinIO, while Azure uses Blob-backed configuration.
  • Treat Pending, Unknown, ImagePullBackOff, unbound PVCs, or 0 Ready replicas as layer failures. Investigate scheduling, storage, image credentials, and adjacent platform state before retrying the same command.
  • For long deploys or workflow watches, provide heartbeat updates with current state, elapsed time, last useful observation, and next check.
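The last two rules above can be sketched together: a classifier for the states treated as layer failures, and a heartbeat record carrying the four fields the rule names. Both `is_layer_failure` and `Heartbeat` are hypothetical helpers, and the output format is an assumption chosen for readability.

```python
from dataclasses import dataclass

# Signals the scaling rules treat as layer failures, not retryable noise.
LAYER_FAILURE_SIGNALS = {"Pending", "Unknown", "ImagePullBackOff"}


def is_layer_failure(signal: str, ready_replicas: int, pvc_bound: bool = True) -> bool:
    """True when the observed state calls for investigation before any retry."""
    return signal in LAYER_FAILURE_SIGNALS or ready_replicas == 0 or not pvc_bound


@dataclass
class Heartbeat:
    state: str
    elapsed_s: int
    last_observation: str
    next_check_s: int

    def render(self) -> str:
        """Format a one-line status update for long deploys or workflow watches."""
        minutes, seconds = divmod(self.elapsed_s, 60)
        return (
            f"state={self.state} elapsed={minutes}m{seconds:02d}s "
            f"last={self.last_observation!r} next_check_in={self.next_check_s}s"
        )
```

For example, `Heartbeat("RUNNING", 125, "pool ONLINE", 30).render()` yields `state=RUNNING elapsed=2m05s last='pool ONLINE' next_check_in=30s`.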

Workload Routing

  • Video Data Augmentation: use `skills/physical-ai-video-data-augmentation/SKILL.md`.
  • Defect Image Generation: use `skills/physical-ai-defect-image-generation/SKILL.md`.
  • NuRec carline adaptation: use `skills/carline-adaptation/SKILL.md`.
  • NRE, NCore, and Asset Harvester live in the canonical NuRec catalog listed in `skills/INDEX.md`.
  • Custom workload: submit the provided workflow YAML through OSMO after checking resource requests, image credentials, data credentials, and inference URLs.

Evaluation Prompts And Results

  • Positive trigger: "Set up resilient Physical AI infrastructure for VDA on Azure AKS with NIM Operator." Expected: use this skill.
  • Negative trigger: "Summarize recent OSMO workflow logs for this workflow ID." Expected: do not use this infrastructure setup skill unless the request also involves setup, scaling, validation, or recovery of the infrastructure stack.
Latest static review: 2026-05-26, description keywords match the expected routes above.