physical-ai-infrastructure-setup-and-resilient-scaling
Physical AI Infrastructure Setup And Resilient Scaling
Canonical skill for the Physical AI infrastructure stack. Use it to compose cluster,
inference, OSMO, and workload stages into a reproducible Physical AI SDG
environment, then keep the environment observable and recoverable.
Operating Rules
- Read only the component references needed for the selected target. Do not load every component by default.
- Keep the repo as the durable artifact. Fix checked-in config or scripts, then rerun. Do not recover a failed install with untracked one-off changes.
- Run mutating cluster, OSMO, Helm, Terraform, or Azure operations through checked-in scripts when a script exists. Read-only diagnostics are allowed.
- Stop at the first red gate. Fix the lowest owning layer in this order: config, script, then skill guidance.
- Derive values from the environment when possible. Ask only for values that cannot be inferred, such as API keys, target choice, or quota tradeoffs.
- Store secrets in `${REPO_ROOT}/.env`. Cluster-derived values such as storage, database, Redis, and endpoint names come from Terraform outputs or platform queries, not `.env`.
- Preflight means no deployed state: no cluster API, Terraform outputs, Helm releases, OSMO pools, or workflow state. Those belong to deploy/verify gates.
- Never print, echo, or paste raw keys into commands, YAML, logs, or transcripts. Prefer credential handles, Kubernetes `secretKeyRef`, and runtime-only secret injection. Scan raw transcript exports with `scripts/scan_transcript_secrets.py` before sharing.
- Use absolute paths. Derive repo root with `git rev-parse --show-toplevel`.
Component References
Each component lives inside this skill so the stack has one canonical trigger.
Load the component reference only when the selected target needs that slice.
| Concern | Load | Assets |
|---|---|---|
| Stage matrix and old driver notes | | None |
| MicroK8s cluster | | |
| Azure AKS cluster | | |
| NIM Operator inference | | |
| NVCF inference | | |
| Azure AI Foundry inference | | |
| MicroK8s OSMO | | |
| Azure OSMO | | |
| Azure access setup | | None |
| OSMO CLI and workflow operations | | |
| OpenClaw Azure device login | | None |
OSMO CLI Support Files
The OSMO CLI component has second-level support files because its command and
workflow surface is large. Load these directly only for the stated case.
| File | Read when |
|---|---|
| | Spawning a workflow-generation or workflow-failure subagent. |
| | Spawning a log summarization subagent for OSMO workflow failures. |
| | Exact OSMO CLI flags, payloads, or command syntax are needed. |
| | Workflow YAML schema, credentials, outputs, or provider fields are needed. |
| | Multi-task, data dependency, Jinja, serial, or parallel workflow design is needed. |
| | Checkpointing, retry/exit behavior, or node exclusion is needed. |
| | Validating or debugging the OSMO orchestration review pattern. |
Target Selection
Pick exactly one option per stage. Stage 2 follows stage 1.
- Kubernetes: `MicroK8s` or `Azure`
- OSMO: `MicroK8s OSMO` when Kubernetes is MicroK8s, `Azure OSMO` when Kubernetes is Azure
- Inference: `NIM Operator`, `NVCF`, `Azure AI Foundry`, or `None`
- Workload: Video Data Augmentation, Defect Image Generation, NuRec Carline Adaptation, NRE, NCore, Asset Harvester, or custom workflow YAML
Reject invalid combinations before provisioning:
| Cluster | NIM Operator | NVCF | Azure AI Foundry |
|---|---|---|---|
| MicroK8s | yes | yes | no, Foundry requires Azure identities |
| Azure | yes | yes | yes |
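The compatibility table can be checked mechanically before any provisioning step. The following is a minimal sketch, not code from the repo: the names `SUPPORTED_INFERENCE`, `OSMO_FOR_CLUSTER`, and `validate_targets` are hypothetical, and the matrix is transcribed from the table above.

```python
from typing import List, Optional

# Compatibility matrix transcribed from the table above.
SUPPORTED_INFERENCE = {
    "MicroK8s": {"NIM Operator", "NVCF"},
    "Azure": {"NIM Operator", "NVCF", "Azure AI Foundry"},
}

# The OSMO stage follows the Kubernetes choice.
OSMO_FOR_CLUSTER = {"MicroK8s": "MicroK8s OSMO", "Azure": "Azure OSMO"}


def validate_targets(cluster: str, osmo: str, inference: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the combination is valid."""
    if cluster not in SUPPORTED_INFERENCE:
        return [f"unknown cluster target: {cluster!r}"]
    problems = []
    if osmo != OSMO_FOR_CLUSTER[cluster]:
        problems.append(f"{cluster} requires {OSMO_FOR_CLUSTER[cluster]}, got {osmo!r}")
    # inference=None models the "None" option (no inference stage selected).
    if inference is not None and inference not in SUPPORTED_INFERENCE[cluster]:
        problems.append(f"{inference} is not supported on {cluster}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every invalid choice in a single pass.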
For OpenClaw or any chat-only environment that cannot open a browser, read
`components/openclaw-azure-login/reference.md` before Azure prerequisites.
For any Azure target, read `components/azure-access/reference.md` before Azure
component preflights.
Setup Flow
- Confirm target choices and workload compute requirements.
- Load the selected component references.
- Resolve prerequisites up front, including API keys, Azure access, caller CIDR, GPU quota, storage class, and OSMO login requirements.
- Run `scripts/preflight.sh` for every selected infrastructure component plus any OSMO CLI/workload preflight before provisioning; build the implementation plan from the results and stop on red preflight.
- Deploy Kubernetes first. Nothing else starts until the cluster gate is green.
- Deploy OSMO and inference after Kubernetes. These can proceed in parallel once the cluster exists, but workload submission waits for both selected gates.
- Submit the workload only after OSMO, storage credentials, compute pool, and selected inference endpoints are verified. For VDA, this includes `preflight_credentials.sh`, `pre_submit_guard.py` with resolved `--set` values, non-empty model-cache prefixes, and workflow-namespace endpoint smoke checks.
- Monitor through completion. On failed workflow state, inspect events and logs from `components/osmo-cli/reference.md`; do not resubmit blindly.
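The stop-at-first-red-gate ordering of this flow can be sketched as a small driver. This is an illustrative sketch only: `run_gates` and the gate callables are hypothetical and stand in for the checked-in scripts. It runs the stages serially and does not model the parallel OSMO and inference deploys.

```python
from typing import Callable, List, Optional, Tuple

Gate = Callable[[], bool]  # returns True when the gate is green


def run_gates(stages: List[Tuple[str, Gate]]) -> Tuple[List[str], Optional[str]]:
    """Run gates in order; return (green stage names, first red stage or None)."""
    green: List[str] = []
    for name, gate in stages:
        if not gate():
            # Stop at the first red gate; later stages never start.
            return green, name
        green.append(name)
    return green, None


# Example wiring with placeholder gates (assumed outcomes, not real checks).
stages = [
    ("preflight", lambda: True),
    ("kubernetes", lambda: True),
    ("osmo", lambda: False),
    ("workload", lambda: True),
]
green, red = run_gates(stages)
```

In this example `green` ends up as `["preflight", "kubernetes"]` and `red` as `"osmo"`, so the workload gate is never evaluated.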
Inference Discovery
Avoid over-deploying expensive endpoints.
- Scan the chosen workflow spec and default values for endpoint references: `*.osmo-nims.svc.cluster.local`, `api.nvcf.nvidia.com/*`, `*.inference.ai.azure.com`, or `*.cognitiveservices.azure.com`.
- Map each reference to the selected backend:
  - NIM Operator: service name must match a directory under `components/inference-nim-operator/nims/`.
  - NVCF: function URL or function ID must be supplied by the environment.
  - Azure AI Foundry: endpoint name must be deployed through `components/inference-azure/scripts/install.sh`.
- If the workflow needs a capability the selected backend lacks, stop and report the mismatch. Do not silently substitute another model.
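The scan-and-map steps above can be approximated with a few regular expressions over the workflow spec text. This is a hedged sketch: the regexes are one possible reading of the four endpoint patterns, and `find_endpoint_refs`, `mismatched_backends`, and the `cosmos` service name in the example are hypothetical.

```python
import re
from typing import Dict, List

# One illustrative regex per backend, derived from the endpoint patterns listed above.
ENDPOINT_PATTERNS = {
    "NIM Operator": re.compile(r"\b[\w.-]+\.osmo-nims\.svc\.cluster\.local\b"),
    "NVCF": re.compile(r"\bapi\.nvcf\.nvidia\.com/[\w./-]*"),
    "Azure AI Foundry": re.compile(
        r"\b[\w.-]+\.(?:inference\.ai|cognitiveservices)\.azure\.com\b"
    ),
}


def find_endpoint_refs(spec_text: str) -> Dict[str, List[str]]:
    """Group the endpoint references found in a workflow spec by backend."""
    found: Dict[str, List[str]] = {}
    for backend, pattern in ENDPOINT_PATTERNS.items():
        matches = sorted(set(pattern.findall(spec_text)))
        if matches:
            found[backend] = matches
    return found


def mismatched_backends(spec_text: str, selected_backend: str) -> List[str]:
    """Backends the spec references other than the selected one; report, do not substitute."""
    return sorted(b for b in find_endpoint_refs(spec_text) if b != selected_backend)


# Example spec fragment; the hostnames and path are placeholders.
spec = (
    "url: http://cosmos.osmo-nims.svc.cluster.local:8000/v1\n"
    "fallback: https://api.nvcf.nvidia.com/example-path\n"
)
refs = find_endpoint_refs(spec)
```

A non-empty result from `mismatched_backends` is a reason to stop and report, in line with the last rule above.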
Verification Gates
Each stage has its own Verify section in the component reference. These gates
are mandatory:
| Stage | Gate |
|---|---|
| Kubernetes | Cluster API reachable, nodes Ready, GPU capacity advertised for GPU paths, and CPU+NVCF paths have |
| Inference | Every endpoint referenced by the workload is reachable. NIM readiness uses |
| OSMO | OSMO pods Ready, pool ONLINE, port-forward watchdogs alive, storage credentials configured, and verify-hello workflow COMPLETED. |
| Workload | Selected workload pre-submit guards pass before submit. |
Resilient Scaling
- Size the cluster from workload needs before provisioning. For Azure, check CPU and GPU quota for the selected VM families before `terraform apply`.
- For NIM Operator, deploy only the NIMServices referenced by the workload. Each service pins GPU and model-cache storage for the lifetime of the cluster.
- Keep OSMO storage URL schemes aligned with the active backend: local MicroK8s uses MinIO, while Azure uses Blob-backed configuration.
- Treat Pending, Unknown, ImagePullBackOff, unbound PVCs, or 0 Ready replicas as layer failures. Investigate scheduling, storage, image credentials, and adjacent platform state before retrying the same command.
- For long deploys or workflow watches, provide heartbeat updates with current state, elapsed time, last useful observation, and next check.
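The last two rules can be sketched as two small helpers: one that flags the listed states as layer failures, and one that formats a heartbeat line. Both are illustrative assumptions; `layer_failure`, `heartbeat`, and the output format are hypothetical, and the inputs are expected to come from read-only diagnostics.

```python
from typing import Optional

# States the rules above treat as layer failures rather than transient noise.
FAILURE_STATES = {"Pending", "Unknown", "ImagePullBackOff"}


def layer_failure(status: str, ready_replicas: int, unbound_pvcs: int = 0) -> Optional[str]:
    """Return a short reason if the observation is a layer failure, else None."""
    if status in FAILURE_STATES:
        return f"status {status}"
    if unbound_pvcs > 0:
        return f"{unbound_pvcs} unbound PVC(s)"
    if ready_replicas == 0:
        return "0 Ready replicas"
    return None


def heartbeat(state: str, elapsed_s: float, last_observation: str, next_check: str) -> str:
    """Format a heartbeat with state, elapsed time, last observation, and next check."""
    minutes, seconds = divmod(int(elapsed_s), 60)
    return (
        f"state={state} | elapsed={minutes}m{seconds:02d}s | "
        f"last={last_observation} | next={next_check}"
    )
```

A non-`None` result from `layer_failure` signals that scheduling, storage, or image credentials should be investigated before the same command is retried.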
Workload Routing
- Video Data Augmentation: use `skills/physical-ai-video-data-augmentation/SKILL.md`.
- Defect Image Generation: use `skills/physical-ai-defect-image-generation/SKILL.md`.
- NuRec carline adaptation: use `skills/carline-adaptation/SKILL.md`.
- NRE, NCore, and Asset Harvester live in the canonical NuRec catalog listed in `skills/INDEX.md`.
- Custom workload: submit the provided workflow YAML through OSMO after checking resource requests, image credentials, data credentials, and inference URLs.
Evaluation Prompts And Results
- Positive trigger: "Set up resilient Physical AI infrastructure for VDA on Azure AKS with NIM Operator." Expected: use this skill.
- Negative trigger: "Summarize recent OSMO workflow logs for this workflow ID." Expected: do not use this infrastructure setup skill unless the request also involves setup, scaling, validation, or recovery of the infrastructure stack.
Latest static review: 2026-05-26, description keywords match the expected
routes above.