airunway-aks-setup

AI Runway AKS Setup
This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides `skip-to-step N` to resume from a specific phase.

Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
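Before provisioning anything new, it can help to show the user which GPU node pools already exist and bill by the hour. A minimal sketch, assuming the `az` CLI is logged in; `RESOURCE_GROUP` and `CLUSTER_NAME` are placeholder values:

```shell
#!/usr/bin/env bash
# Sketch: surface existing node pools and their VM sizes so GPU cost
# implications can be confirmed with the user before provisioning more.
set -uo pipefail

RESOURCE_GROUP="${RESOURCE_GROUP:-my-rg}"     # placeholder
CLUSTER_NAME="${CLUSTER_NAME:-my-aks}"        # placeholder

if command -v az >/dev/null 2>&1; then
  # N-series VM sizes (Standard_NC*, Standard_ND*, ...) are GPU SKUs
  # and incur charges whenever the pool is scaled above zero.
  az aks nodepool list \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --query "[].{name:name, vmSize:vmSize, count:count}" \
    --output table
else
  echo "az CLI not found; skipping node pool check"
fi
```

Scaling a GPU pool to zero when idle is the simplest way to cap these costs.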
Prerequisites
This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the `azure-kubernetes` skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.

Quick Reference
| Property | Value |
|---|---|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | `kubectl`, `make` |
| MCP tools | None |
| Related skills | `azure-kubernetes` |
When to Use This Skill
Use this skill when the user wants to:
- Set up AI Runway on an existing AKS cluster from scratch
- Install the AI Runway controller and CRDs
- Assess GPU hardware compatibility for model deployment
- Choose and install an inference provider (KAITO, Dynamo, KubeRay)
- Deploy their first AI model to AKS via AI Runway
- Resume a partially-complete AI Runway setup from a specific step
MCP Tools
This skill uses no MCP tools. All cluster operations are performed directly via `kubectl` and `make`.

Rules
- Execute steps in sequence — load the reference for each step as you reach it
- Report cluster state at each step: ✓ healthy, ✗ missing/failed
- Ask for user confirmation before any install or deployment action
- If a step is already complete, report status and skip to the next step
- If the user provides `skip-to-step N`, start at step N; assume prior steps are complete
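The ✓/✗ status-reporting rule can be sketched as a small shell helper. This is an illustrative sketch, not part of the skill; the `nvidia.com/gpu` capacity key is the standard NVIDIA device-plugin convention:

```shell
#!/usr/bin/env bash
# Sketch: report cluster state per step as "✓ healthy" or "✗ missing/failed".
set -uo pipefail

# check LABEL CMD... — runs CMD, prints ✓ or ✗ with LABEL
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "✓ $label"
  else
    echo "✗ $label"
  fi
}

check "kubeconfig context set" kubectl config current-context
check "nodes reachable"        kubectl get nodes
# GPU capacity is advertised by the NVIDIA device plugin as nvidia.com/gpu
check "GPU nodes present" sh -c \
  'kubectl get nodes -o jsonpath="{range .items[*]}{.status.capacity.nvidia\.com/gpu}{end}" | grep -q "[0-9]"'
```

On a cluster with no GPU pool, the last line reports `✗ GPU nodes present`, which is the cue to hand back to the `azure-kubernetes` skill.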
Steps
| # | Step | Reference |
|---|---|---|
| 1 | Cluster Verification — context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation — CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment — detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup — recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment — pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary — recap, smoke test, next steps | step-6-summary.md |
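Step 5's "verify Ready" check might look like the following. This is a sketch under assumptions: the `modeldeployment` resource kind and its `Ready` condition are taken from the step names above, not from the AI Runway CRDs themselves — confirm the actual names with `kubectl get crds` after step 2:

```shell
#!/usr/bin/env bash
# Sketch: wait for a ModelDeployment (assumed CRD name) to report Ready.
set -uo pipefail

NAME="${1:-demo-model}"   # placeholder deployment name

if command -v kubectl >/dev/null 2>&1; then
  kubectl wait "modeldeployment/$NAME" --for=condition=Ready --timeout=600s \
    && kubectl get "modeldeployment/$NAME" -o wide \
    || echo "✗ $NAME not Ready; see troubleshooting.md"
else
  echo "kubectl not found; skipping verification"
fi
```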
Error Handling
| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` to set the cluster context |
| Controller in CrashLoopBackOff | Config or RBAC issue | Inspect controller logs with `kubectl logs` and check RBAC bindings |
| Provider not ready | Image pull or RBAC issue | Run `kubectl describe` / `kubectl logs` on the provider pods |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | Run `kubectl describe` on the ModelDeployment and check GPU node availability |
| dtype error at inference time | T4 or V100 lacks bfloat16 support | Add a dtype override (e.g. float16) to the serving args |
For full error handling and rollback procedures, see troubleshooting.md.
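The T4/V100 row reflects a hardware constraint: bfloat16 requires Ampere-or-newer NVIDIA GPUs, while Turing (T4) and Volta (V100) only support float16. A small helper like this (hypothetical, not part of the skill) can pre-check the dtype constraint from a GPU product name before deploying:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a GPU product name to bfloat16 support.
# Ampere and newer (A100, A10, H100, L4, L40, ...) support bf16;
# Turing (T4) and Volta (V100) do not, so serving args should fall
# back to float16 on those SKUs.
supports_bf16() {
  case "$1" in
    *H100*|*H200*|*A100*|*A10*|*L4*|*L40*) echo "yes" ;;
    *T4*|*V100*) echo "no" ;;
    *) echo "unknown" ;;
  esac
}

supports_bf16 "Tesla-T4"   # no
supports_bf16 "A100-80GB"  # yes
```

The product name can be read from the node label the NVIDIA device plugin sets (commonly `nvidia.com/gpu.product`), which step 3's GPU assessment inspects.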