airunway-aks-setup

AI Runway AKS Setup

This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides `skip-to-step N` to resume from a specific phase.

Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
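
To make the cost conversation concrete, a rough estimate can be worked out before provisioning. The sketch below is a back-of-the-envelope calculation that assumes the $3 to $5 per hour range quoted above for an A100-80GB node. The function name `estimate_gpu_cost` and the example node count and duration are illustrative, and actual Azure pricing varies by region and VM size (the "+" in the quoted range means the upper figure is not a cap).

```shell
# Rough GPU cost estimate: nodes x hours x hourly rate.
# The 3 and 5 USD/hr defaults come from the range quoted above; they are not a price quote.
estimate_gpu_cost() {
  nodes=$1
  hours=$2
  low_rate=${3:-3}
  high_rate=${4:-5}
  awk -v n="$nodes" -v h="$hours" -v lo="$low_rate" -v hi="$high_rate" \
    'BEGIN { printf "%.2f-%.2f USD\n", n * h * lo, n * h * hi }'
}

# Example: 2 GPU nodes running for 8 hours
estimate_gpu_cost 2 8   # prints 48.00-80.00 USD
```

Showing the user a figure like this before asking for confirmation keeps the cost check specific rather than generic.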

Prerequisites

This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the `azure-kubernetes` skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.
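
One way to check whether an existing cluster already has GPU capacity is to look at each node's allocatable `nvidia.com/gpu` resource. The sketch below assumes the NVIDIA device plugin is advertising that resource on GPU nodes. `count_gpu_nodes` is a hypothetical helper name and the node names in the sample input are made up. The helper only parses text, so it can be tried without a cluster.

```shell
# Count nodes that report at least one allocatable GPU.
# Expects lines of "<node-name> <gpu-count>" on stdin, preceded by a header row;
# kubectl prints "<none>" when the resource is absent on a node.
count_gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { n++ } END { print n + 0 }'
}

# Against a live cluster (assumes the current kubectl context points at it):
#   kubectl get nodes \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | count_gpu_nodes

# Offline example with sample output
printf 'NAME GPU\ncpu-node-0 <none>\ngpu-node-0 1\n' | count_gpu_nodes   # prints 1
```

A result of 0 suggests either CPU-only inference or a hand-off to add a GPU node pool.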

Quick Reference

| Property | Value |
| --- | --- |
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | `kubectl`, `make`, `curl` |
| MCP tools | None |
| Related skills | `azure-kubernetes` (cluster setup), `azure-diagnostics` (troubleshooting) |
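
Before starting, it may help to confirm that the CLI tools listed above are on `PATH`. This is a minimal sketch; `missing_tools` is a hypothetical helper name, not part of AI Runway.

```shell
# Print any required CLI tools that are not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the three tools listed in the table above
missing=$(missing_tools kubectl make curl)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```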

When to Use This Skill

Use this skill when the user wants to:

- Set up AI Runway on an existing AKS cluster from scratch
- Install the AI Runway controller and CRDs
- Assess GPU hardware compatibility for model deployment
- Choose and install an inference provider (KAITO, Dynamo, KubeRay)
- Deploy their first AI model to AKS via AI Runway
- Resume a partially complete AI Runway setup from a specific step

MCP Tools

This skill uses no MCP tools. All cluster operations are performed directly via `kubectl` and `make`.

Rules

- Execute steps in sequence, loading the reference for each step as you reach it
- Report cluster state at each step: ✓ healthy, ✗ missing/failed
- Ask for user confirmation before any install or deployment action
- If a step is already complete, report status and skip to the next step
- If the user provides `skip-to-step N`, start at step N; assume prior steps are complete
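
The resume and reporting rules above can be sketched as two small helpers. The names `start_step` and `report` are hypothetical. Falling back to step 1 when N is missing or outside 1 to 6 is an assumption of this sketch, not something the skill specifies.

```shell
# Extract the starting step from user input.
# Prints N when the text contains "skip-to-step N" with N in 1..6, otherwise 1.
start_step() {
  n=$(printf '%s\n' "$1" \
    | sed -n 's/.*skip-to-step[[:space:]]\{1,\}\([0-9]\{1,\}\).*/\1/p' \
    | head -n 1)
  case "$n" in
    [1-6]) echo "$n" ;;
    *)     echo 1 ;;   # assumption: invalid or absent N restarts from step 1
  esac
}

# Report component status using the markers from the rules above.
# $1 = component name, $2 = "ok" for healthy, anything else for missing/failed
report() {
  if [ "$2" = "ok" ]; then
    echo "✓ $1 healthy"
  else
    echo "✗ $1 missing/failed"
  fi
}

start_step "please skip-to-step 3"   # prints 3
start_step "set up AI Runway"        # prints 1
report "controller" ok               # prints: ✓ controller healthy
```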

Steps

| # | Step | Reference |
| --- | --- | --- |
| 1 | Cluster Verification: context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation: CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment: detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup: recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment: pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary: recap, smoke test, next steps | step-6-summary.md |
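
As an illustration of how the step table drives reference loading, the sketch below maps a step number to its reference file and lists what remains when resuming partway through. `step_reference` is a hypothetical helper name; the file names are taken from the table above.

```shell
# Map a step number to its reference file, as listed in the table above.
step_reference() {
  case "$1" in
    1) echo "step-1-verify.md" ;;
    2) echo "step-2-controller.md" ;;
    3) echo "step-3-gpu.md" ;;
    4) echo "step-4-provider.md" ;;
    5) echo "step-5-deploy.md" ;;
    6) echo "step-6-summary.md" ;;
    *) echo "unknown step: $1" >&2; return 1 ;;
  esac
}

# List the references still to be loaded when resuming from step 4
step=4
while [ "$step" -le 6 ]; do
  step_reference "$step"
  step=$((step + 1))
done
```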

Error Handling

| Error / Symptom | Likely Cause | Remediation |
| --- | --- | --- |
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl logs <pod-name> -n <namespace>` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | `kubectl describe modeldeployment <name> -n <namespace>` events |
| `bfloat16` errors at inference | T4 or V100 lacks bfloat16 support | Add `--dtype float16` to serving args |

For full error handling and rollback procedures, see troubleshooting.md.
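
The `bfloat16` row above can be applied proactively by choosing the dtype from the detected GPU model. This is a minimal sketch: `pick_dtype` is a hypothetical helper, the GPU model strings are illustrative, and defaulting to `bfloat16` for every other GPU is an assumption that may not hold for other older hardware.

```shell
# Choose a serving dtype from a GPU model string.
# T4 and V100 fall back to float16 per the error table above;
# every other model keeps bfloat16 (an assumption of this sketch).
pick_dtype() {
  case "$1" in
    *T4*|*V100*) echo "float16" ;;
    *)           echo "bfloat16" ;;
  esac
}

pick_dtype "Tesla-T4"                # prints float16
pick_dtype "NVIDIA-A100-SXM4-80GB"   # prints bfloat16
```

When the result is `float16`, it would be passed as `--dtype float16` in the serving args, as the table suggests.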
