airunway-aks-setup

AI Runway AKS Setup

This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides `skip-to-step N` to resume from a specific phase.
Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
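
To make the cost conversation concrete, the arithmetic can be sketched as a small shell helper. This is only an illustration: `gpu_cost_range` is a hypothetical name, and the $3 and $5 hourly rates are simply the rough A100-80GB range quoted above, not actual Azure pricing, which varies by region and SKU and may exceed the upper figure.

```sh
# Hypothetical helper: rough GPU spend for NODES nodes running HOURS hours.
# Rates are the approximate $3-5/hr range quoted above (an assumption,
# not a price lookup); real charges may be higher.
gpu_cost_range() {
  nodes=$1
  hours=$2
  low=$((3 * nodes * hours))
  high=$((5 * nodes * hours))
  printf '$%s-$%s\n' "$low" "$high"
}
```

For example, `gpu_cost_range 2 8` prints `$48-$80` for two GPU nodes running for eight hours.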

Prerequisites

This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the `azure-kubernetes` skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.

Quick Reference

| Property | Value |
| --- | --- |
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | `kubectl`, `make`, `curl` |
| MCP tools | None |
| Related skills | `azure-kubernetes` (cluster setup), `azure-diagnostics` (troubleshooting) |

When to Use This Skill

Use this skill when the user wants to:
  • Set up AI Runway on an existing AKS cluster from scratch
  • Install the AI Runway controller and CRDs
  • Assess GPU hardware compatibility for model deployment
  • Choose and install an inference provider (KAITO, Dynamo, KubeRay)
  • Deploy their first AI model to AKS via AI Runway
  • Resume a partially-complete AI Runway setup from a specific step

MCP Tools

This skill uses no MCP tools. All cluster operations are performed directly via `kubectl` and `make`.

Rules

  1. Execute steps in sequence — load the reference for each step as you reach it
  2. Report cluster state at each step: ✓ healthy, ✗ missing/failed
  3. Ask for user confirmation before any install or deployment action
  4. If a step is already complete, report status and skip to the next step
  5. If the user provides `skip-to-step N`, start at step N; assume prior steps are complete
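
The `skip-to-step N` rule can be sketched as a small shell function. This is a minimal illustration, not part of the skill itself: `start_step` is a hypothetical name, and falling back to step 1 when the directive is missing or outside the six documented steps is an assumption.

```sh
# Hypothetical helper: pick the starting step from a user instruction.
# Prints N when the text contains "skip-to-step N" with N in 1..6;
# otherwise prints 1 (fallback behaviour is an assumption).
start_step() (
  # Extract the digits that follow the directive, if any.
  n=$(printf '%s\n' "$1" | sed -n 's/.*skip-to-step \([0-9][0-9]*\).*/\1/p')
  case $n in
    [1-6]) echo "$n" ;;
    *) echo 1 ;;
  esac
)
```

For example, `start_step "skip-to-step 4"` prints `4`, while an instruction with no directive prints `1`.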

Steps

| # | Step | Reference |
| --- | --- | --- |
| 1 | Cluster Verification: context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation: CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment: detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup: recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment: pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary: recap, smoke test, next steps | step-6-summary.md |
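
As a rough illustration of what step 1 might involve, the sketch below pairs the ✓/✗ reporting convention from the rules with standard `kubectl` queries. The function names `report` and `verify_cluster` are hypothetical, and the actual procedure lives in step-1-verify.md; this assumes the NVIDIA device plugin exposes GPUs as the `nvidia.com/gpu` allocatable resource.

```sh
# report LABEL CMD...: run CMD quietly and print a one-line status.
# (Hypothetical helper; mirrors the ✓ healthy / ✗ missing/failed convention.)
report() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "✓ $label"
  else
    echo "✗ $label"
  fi
}

# Hypothetical outline of a cluster check: context, node inventory, GPUs.
verify_cluster() {
  report "kubeconfig context" kubectl config current-context
  report "nodes reachable" kubectl get nodes
  # Allocatable NVIDIA GPUs per node; nodes without GPUs show <none>.
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Calling `verify_cluster` against a connected cluster prints one status line per check followed by the per-node GPU table.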

Error Handling

| Error / Symptom | Likely Cause | Remediation |
| --- | --- | --- |
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl logs <pod-name> -n <namespace>` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | `kubectl describe modeldeployment <name> -n <namespace>` events |
| `bfloat16` errors at inference | T4 or V100 lacks bfloat16 support | Add `--dtype float16` to serving args |
For full error handling and rollback procedures, see troubleshooting.md.
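
The bfloat16 remediation above can be expressed as a small lookup. This is a sketch under assumptions: `dtype_arg` is a hypothetical name, and the GPU model string is assumed to contain `T4` or `V100` (as in product labels such as `Tesla-T4`); only the two GPU families named in the table are handled.

```sh
# Hypothetical helper: print the extra serving argument for GPUs that,
# per the table above, lack bfloat16 support. Prints nothing otherwise.
dtype_arg() {
  case $1 in
    *T4*|*V100*) echo "--dtype float16" ;;
    *) ;;
  esac
}
```

For example, `dtype_arg "Tesla-T4"` prints `--dtype float16`, while an A100 model string produces no output.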