azure-kubernetes
Azure Kubernetes Service
AUTHORITATIVE GUIDANCE: MANDATORY COMPLIANCE
This skill produces a recommended AKS cluster configuration based on user requirements, distinguishing Day-0 decisions (networking and API server access, which are hard to change later) from Day-1 features (which can be enabled after creation). See the CLI reference for commands.
Quick Reference
| Property | Value |
|---|---|
| Best for | AKS cluster planning and Day-0 decisions |
| MCP Tools | `mcp_azure_mcp_aks` |
| CLI | `az aks` |
| Related skills | azure-diagnostics (troubleshooting AKS), azure-validate (readiness checks) |
When to Use This Skill
Activate this skill when the user wants to:
- Create a new AKS cluster
- Plan AKS cluster configuration for production workloads
- Design AKS networking (API server access, pod IP model, egress)
- Set up AKS identity and secrets management
- Configure AKS governance (Azure Policy, Deployment Safeguards)
- Enable AKS observability (Container Insights, Managed Prometheus, Grafana)
- Define AKS upgrade and patching strategy
- Enable AKS cost visibility and analysis
- Understand AKS Automatic vs Standard SKU differences
- Get a Day-0 checklist for AKS cluster setup and configuration
Rules
- Start with the user's requirements for provisioning compute, networking, security, and other settings.
- Use the `azure` MCP server and select `mcp_azure_mcp_aks` first to discover the exact AKS-specific MCP tools surfaced by the client. Choose the smallest discovered AKS tool that fits the task, and fall back to Azure CLI (`az aks`) only when the needed functionality is not exposed through the AKS MCP surface.
- Determine whether AKS Automatic or Standard SKU is more appropriate based on the user's need for control versus convenience. Default to AKS Automatic unless specific customizations are required.
- Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).
Required Inputs (Ask only what’s needed)
If the user is unsure, use safe defaults.
- AKS environment type: dev/test or production
- Region(s), availability zones, preferred node VM sizes
- Expected scale (node/cluster count, workload size)
- Networking requirements (API server access, pod IP model, ingress/egress control)
- Security and identity requirements, including image registry
- Upgrade and observability preferences
- Cost constraints
Workflow
1. Cluster Type
- AKS Automatic (default): Best for most production workloads, provides a curated experience with pre-configured best practices for security, reliability, and performance. Use unless you have specific custom requirements for networking, autoscaling, or node pool configurations not supported by Node Auto-Provisioning (NAP).
- AKS Standard: Use if you need full control over environment configuration, which requires additional overhead to set up and manage.
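The two cluster types map to different `az aks create` invocations. The sketch below is a minimal illustration, not a complete production command: the resource group, cluster name, and region are hypothetical placeholders, and flag availability depends on the installed Azure CLI version.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
LOCATION="${LOCATION:-eastus2}"

# AKS Automatic: curated defaults, minimal configuration surface.
create_automatic_cluster() {
  az aks create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --location "$LOCATION" \
    --sku automatic
}

# AKS Standard: node pools, networking, and add-ons are chosen explicitly.
create_standard_cluster() {
  az aks create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --location "$LOCATION" \
    --sku base \
    --tier standard \
    --node-count 3 \
    --generate-ssh-keys
}
```

Sourcing the snippet only defines the functions; nothing is created until one of them is called. Defining `az() { echo "az $*"; }` beforehand turns either call into a dry run that prints the command.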
2. Networking (Pod IP, Egress, Ingress, Dataplane)
Pod IP Model (Key Day-0 decision):
- Azure CNI Overlay (recommended): pod IPs come from a private overlay range and are not VNet-routable; scales to large environments and suits most workloads
- Azure CNI (VNet-routable): pod IPs come directly from the VNet (pod subnet or node subnet); use when pods must be directly addressable from the VNet or from on-premises
Dataplane & Network Policy:
- Azure CNI powered by Cilium (recommended): eBPF-based for high-performance packet processing, network policies, and observability
Egress:
- Static Egress Gateway for stable, predictable outbound IPs
- For restricted egress: UDR + Azure Firewall or NVA
Ingress:
- App Routing add-on with Gateway API: recommended default for HTTP/HTTPS workloads
- Istio service mesh with Gateway API: for advanced traffic management, mTLS, and canary releases
- Application Gateway for Containers: for L7 load balancing with WAF integration
DNS:
- Enable LocalDNS on all node pools for reliable, performant DNS resolution
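Because the pod IP model and dataplane are Day-0 choices, they are set at creation time. The following is a minimal sketch of the recommended combination (Azure CNI Overlay with the Cilium dataplane); the names and the pod CIDR are hypothetical placeholders, and the egress, ingress, and LocalDNS options are deliberately left out.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
LOCATION="${LOCATION:-eastus2}"
# Assumed overlay range; it must not overlap the VNet or peered networks.
POD_CIDR="${POD_CIDR:-192.168.0.0/16}"

# Create a cluster with Azure CNI Overlay and the Cilium dataplane.
create_overlay_cilium_cluster() {
  az aks create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --location "$LOCATION" \
    --network-plugin azure \
    --network-plugin-mode overlay \
    --network-dataplane cilium \
    --pod-cidr "$POD_CIDR" \
    --generate-ssh-keys
}
```

For restricted egress, the same command is usually extended with `--outbound-type userDefinedRouting` and a pre-created subnet whose route table points at the firewall or NVA.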
3. Security
- Use Microsoft Entra ID everywhere (control plane, Workload Identity for pods, node access). Avoid static credentials.
- Azure Key Vault via Secrets Store CSI Driver for secrets
- Enable Azure Policy + Deployment Safeguards
- Enable encryption at rest for etcd/API server data and encryption in transit for node-to-node traffic
- Allow only signed, policy-approved images (Azure Policy + Ratify), prefer Azure Container Registry
- Isolation: Use namespaces, network policies, scoped logging
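Several of the security items above can be enabled on an existing cluster. The sketch below assumes a cluster already exists; the resource group, cluster, and registry names are hypothetical placeholders. It covers Workload Identity, the Key Vault and Azure Policy add-ons, and ACR attachment only.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
ACR_NAME="${ACR_NAME:-acrdemo}"

# OIDC issuer + Workload Identity, so pods can use Entra ID without static credentials.
enable_workload_identity() {
  az aks update \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --enable-oidc-issuer \
    --enable-workload-identity
}

# Secrets Store CSI Driver (Key Vault provider) and Azure Policy add-ons.
enable_security_addons() {
  az aks enable-addons \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --addons azure-keyvault-secrets-provider,azure-policy
}

# Grant the cluster pull access to an Azure Container Registry.
attach_registry() {
  az aks update \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --attach-acr "$ACR_NAME"
}
```

Workload Identity still requires a federated credential on the managed identity and an annotated Kubernetes service account; those steps are not shown here.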
4. Observability
- Use Managed Prometheus and Container Insights with Grafana for AKS observability (logs + metrics).
- Enable Diagnostic Settings to collect control plane logs and audit logs in a Log Analytics workspace for security monitoring and troubleshooting.
- For additional monitoring and troubleshooting, use tools such as the Agentic CLI for AKS, Application Insights, Resource Health Center, AppLens detectors, and Azure Advisor.
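The first two observability items can be sketched as CLI calls against an existing cluster. The resource IDs below are hypothetical placeholders, and the log categories are one reasonable selection rather than a required set.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
WORKSPACE_ID="${WORKSPACE_ID:-/subscriptions/<subscription-id>/resourceGroups/rg-aks-demo/providers/Microsoft.OperationalInsights/workspaces/law-aks-demo}"
CLUSTER_ID="${CLUSTER_ID:-/subscriptions/<subscription-id>/resourceGroups/rg-aks-demo/providers/Microsoft.ContainerService/managedClusters/aks-demo}"

# Managed Prometheus metrics collection.
enable_managed_prometheus() {
  az aks update \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --enable-azure-monitor-metrics
}

# Container Insights (logs) sent to a Log Analytics workspace.
enable_container_insights() {
  az aks enable-addons \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --addons monitoring \
    --workspace-resource-id "$WORKSPACE_ID"
}

# Diagnostic setting that routes control plane and audit logs to the workspace.
enable_control_plane_logs() {
  az monitor diagnostic-settings create \
    --name aks-control-plane \
    --resource "$CLUSTER_ID" \
    --workspace "$WORKSPACE_ID" \
    --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-audit-admin","enabled":true}]'
}
```

`kube-audit-admin` is chosen over the full `kube-audit` category here because it omits read-only requests, which typically lowers ingestion volume; switch categories if full audit coverage is required.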
5. Upgrades & Patching
- Configure Maintenance Windows for controlled upgrade timing
- Enable auto-upgrades for control plane and node OS to stay up-to-date with security patches and Kubernetes versions
- Consider LTS versions for enterprise stability (2-year support) by upgrading your AKS environment to the Premium tier
- Fleet upgrades: Use AKS Fleet Manager for staged rollout across test to production environments
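Auto-upgrade channels and a maintenance window are typically configured together. The sketch below is one possible setup under stated assumptions: the `stable` and `NodeImage` channels, and a weekly Sunday 02:00 window of four hours, are illustrative choices, and the resource names are hypothetical placeholders.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"

# Auto-upgrade for the Kubernetes version and for node OS images.
enable_auto_upgrade() {
  az aks update \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --auto-upgrade-channel stable \
    --node-os-upgrade-channel NodeImage
}

# Weekly maintenance window that constrains when auto-upgrades may run.
set_maintenance_window() {
  az aks maintenanceconfiguration add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name aksManagedAutoUpgradeSchedule \
    --schedule-type Weekly \
    --day-of-week Sunday \
    --interval-weeks 1 \
    --start-time 02:00 \
    --duration 4
}
```

A separate configuration named `aksManagedNodeOSUpgradeSchedule` can be added the same way if node OS upgrades should follow a different window.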
6. Performance
- Use Ephemeral OS disks (`--node-osdisk-type Ephemeral`) for faster node startup
- Select Azure Linux as node OS (smaller footprint, faster boot)
- Enable KEDA for event-driven autoscaling beyond HPA
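As a rough sketch, the performance settings above translate to a node pool definition plus one cluster-level switch. The pool name and VM size are hypothetical; the size is assumed to have enough local storage to host an ephemeral OS disk, which should be verified for the SKU actually chosen.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
POOL_NAME="${POOL_NAME:-userpool1}"
# Assumed SKU with local temp storage large enough for an ephemeral OS disk.
VM_SIZE="${VM_SIZE:-Standard_D4ds_v5}"

# User node pool with an ephemeral OS disk and Azure Linux.
add_user_pool() {
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$POOL_NAME" \
    --node-vm-size "$VM_SIZE" \
    --node-osdisk-type Ephemeral \
    --os-sku AzureLinux
}

# KEDA add-on for event-driven autoscaling.
enable_keda() {
  az aks update \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --enable-keda
}
```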
7. Node Pools & Compute
- Dedicated system node pool: At least 2 nodes, tainted for system workloads only (`CriticalAddonsOnly`)
- Enable Node Auto Provisioning (NAP) on all pools for cost savings and responsive scaling
- Use latest generation SKUs (v5/v6) for host-level optimizations
- Avoid B-series VMs — burstable SKUs cause performance/reliability issues
- Use SKUs with at least 4 vCPUs for production workloads
- Set topology spread constraints to distribute pods across hosts/zones per SLO
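The dedicated system pool described in the first item could look like the sketch below. The pool name and VM size are hypothetical choices that satisfy the guidance above (two nodes, a 4-vCPU v5 SKU, and the `CriticalAddonsOnly` taint); NAP and topology spread constraints are not shown.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
SYSTEM_VM_SIZE="${SYSTEM_VM_SIZE:-Standard_D4ds_v5}"

# System-mode pool that only schedules pods tolerating CriticalAddonsOnly.
add_system_pool() {
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name syspool \
    --mode System \
    --node-count 2 \
    --node-vm-size "$SYSTEM_VM_SIZE" \
    --node-taints CriticalAddonsOnly=true:NoSchedule
}
```

With this taint in place, application pods land on user pools unless they explicitly tolerate it, which keeps system components isolated from workload pressure.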
8. Reliability
- Deploy across 3 Availability Zones (`--zones 1 2 3`)
- Use Standard tier for zone-redundant control plane + 99.95% SLA for API server availability
- Enable Microsoft Defender for Containers for runtime protection
- Configure PodDisruptionBudgets for all production workloads
- Use topology spread constraints to ensure pod distribution across failure domains
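A minimal sketch of the zone, tier, and PodDisruptionBudget items follows. The pool name, application label, namespace, and `--min-available` value are hypothetical, and the PDB assumes the workload runs at least three replicas.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
APP_NAME="${APP_NAME:-web}"
NAMESPACE="${NAMESPACE:-default}"

# Node pool spread across three availability zones.
add_zonal_pool() {
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name zonalpool \
    --node-count 3 \
    --zones 1 2 3
}

# Standard tier for the financially backed API server SLA.
set_standard_tier() {
  az aks update \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --tier standard
}

# PodDisruptionBudget keeping at least two pods of the app available during drains.
create_pdb() {
  kubectl create poddisruptionbudget "${APP_NAME}-pdb" \
    --namespace "$NAMESPACE" \
    --selector "app=${APP_NAME}" \
    --min-available 2
}
```

Availability zones for a node pool are fixed at creation, so the zone list belongs with the Day-0 decisions even though the tier can be changed later.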
9. Cost Controls
- Use Spot node pools for batch/interruptible workloads (up to 90% savings)
- Stop/Start dev/test clusters: `az aks stop` / `az aks start`
- Consider Reserved Instances or Savings Plans for steady-state workloads
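The Spot pool and stop/start items can be sketched as follows. The names are hypothetical placeholders, and `--spot-max-price -1` is an assumed choice meaning the pool pays up to the on-demand price rather than being evicted on price.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"

# Spot node pool for batch or interruptible workloads.
add_spot_pool() {
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name spotpool \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-count 1
}

# Stop a dev/test cluster outside working hours.
stop_cluster() {
  az aks stop --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}

# Start it again when needed.
start_cluster() {
  az aks start --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}
```

Workloads need a matching toleration to schedule onto Spot nodes, so only pods that can handle eviction end up there.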
Guardrails / Safety
- Do not request or output secrets (tokens, keys).
- If requirements are ambiguous for Day-0 critical decisions, ask the user clarifying questions. For Day-1 features, propose 2–3 safe options with tradeoffs and choose a conservative default.
- Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades along with best practices for reliability and performance.
MCP Tools
| Tool | Purpose | Key Parameters |
|---|---|---|
| `mcp_azure_mcp_aks` | AKS MCP entry point used to discover the exact AKS-specific tools exposed by the client | Discover the callable AKS tool first, then use that tool's parameters |
Error Handling
| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| MCP tool call fails or times out | Invalid credentials, subscription, or AKS context | Verify credentials, the active subscription, and the AKS context |
| Quota exceeded | Regional vCPU or resource limits | Request quota increase or select different region/VM SKU |
| Networking conflict (IP exhaustion) | Pod subnet too small for overlay/CNI | Re-plan IP ranges; may require cluster recreation (Day-0) |
| Workload Identity not working | Missing OIDC issuer or federated credential | Enable the OIDC issuer and Workload Identity on the cluster, then configure the federated credential |
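A few read-only checks can help narrow down the causes listed in the table. The sketch below is illustrative only; the resource names and region are hypothetical placeholders, and none of these commands change cluster state.

```shell
# Hypothetical placeholder values; replace with your own.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-demo}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-demo}"
LOCATION="${LOCATION:-eastus2}"

# Confirm which account and subscription the CLI is currently using.
check_context() {
  az account show --output table
}

# Show regional compute usage against quota limits.
check_vcpu_quota() {
  az vm list-usage --location "$LOCATION" --output table
}

# Print the OIDC issuer URL; an empty result suggests the issuer is not enabled.
check_oidc_issuer() {
  az aks show \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --query "oidcIssuerProfile.issuerUrl" \
    --output tsv
}
```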