azure-kubernetes

Azure Kubernetes Service

AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE
This skill produces a recommended AKS cluster configuration based on user requirements, distinguishing Day-0 decisions (networking, API server — hard to change later) from Day-1 features (can enable post-creation). See CLI reference for commands.

Quick Reference

| Property | Value |
| --- | --- |
| Best for | AKS cluster planning and Day-0 decisions |
| MCP Tools | `mcp_azure_mcp_aks` |
| CLI | `az aks create`, `az aks show`, `kubectl get`, `kubectl describe` |
| Related skills | azure-diagnostics (troubleshooting AKS), azure-validate (readiness checks) |

When to Use This Skill

Activate this skill when the user wants to:
  • Create a new AKS cluster
  • Plan AKS cluster configuration for production workloads
  • Design AKS networking (API server access, pod IP model, egress)
  • Set up AKS identity and secrets management
  • Configure AKS governance (Azure Policy, Deployment Safeguards)
  • Enable AKS observability (Container Insights, Managed Prometheus, Grafana)
  • Define AKS upgrade and patching strategy
  • Enable AKS cost visibility and analysis
  • Understand AKS Automatic vs Standard SKU differences
  • Get a Day-0 checklist for AKS cluster setup and configuration

Rules

  1. Start with the user's requirements for provisioning compute, networking, security, and other settings.
  2. Use the `azure` MCP server and select `mcp_azure_mcp_aks` first to discover the exact AKS-specific MCP tools surfaced by the client. Choose the smallest discovered AKS tool that fits the task, and fall back to Azure CLI (`az aks`) only when the needed functionality is not exposed through the AKS MCP surface.
  3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required.
  4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).

Required Inputs (Ask only what’s needed)

If the user is unsure, use safe defaults.
  • AKS environment type: dev/test or production
  • Region(s), availability zones, preferred node VM sizes
  • Expected scale (node/cluster count, workload size)
  • Networking requirements (API server access, pod IP model, ingress/egress control)
  • Security and identity requirements, including image registry
  • Upgrade and observability preferences
  • Cost constraints

Workflow

1. Cluster Type

  • AKS Automatic (default): Best for most production workloads, provides a curated experience with pre-configured best practices for security, reliability, and performance. Use unless you have specific custom requirements for networking, autoscaling, or node pool configurations not supported by Node Auto-Provisioning (NAP).
  • AKS Standard: Use if you need full control over environment configuration, which requires additional overhead to set up and manage.
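
The default-to-Automatic rule can be sketched as a tiny helper. This is an illustration only: `choose_cluster_sku` and its `custom_requirements` argument are hypothetical names, and the returned strings assume the `--sku automatic` and `--sku base` values accepted by `az aks create` in recent Azure CLI versions (check `az aks create --help` for your version).

```python
def choose_cluster_sku(custom_requirements: list[str]) -> str:
    """Return a value for `az aks create --sku` (assumed values, see lead-in).

    `custom_requirements` is a hypothetical list of needs that AKS Automatic
    cannot satisfy, such as custom networking, autoscaling, or node pool settings.
    """
    # Default to AKS Automatic; choose Standard ("base") only when at least
    # one requirement calls for full control over the configuration.
    return "base" if custom_requirements else "automatic"


print(choose_cluster_sku([]))
print(choose_cluster_sku(["custom node pool configuration"]))
```

Recording the contents of `custom_requirements` alongside the result also gives you the rationale that the Rules section asks you to document.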

2. Networking (Pod IP, Egress, Ingress, Dataplane)

Pod IP Model (Key Day-0 decision):
  • Azure CNI Overlay (recommended): pod IPs from private overlay range, not VNet-routable, scales to large environments and good for most workloads
  • Azure CNI (VNet-routable): pod IPs directly from VNet (pod subnet or node subnet), use when pods must be directly addressable from VNet or on-prem
Dataplane & Network Policy:
  • Azure CNI powered by Cilium (recommended): eBPF-based for high-performance packet processing, network policies, and observability
Egress:
  • Static Egress Gateway for stable, predictable outbound IPs
  • For restricted egress: UDR + Azure Firewall or NVA
Ingress:
  • App Routing addon with Gateway API — recommended default for HTTP/HTTPS workloads
  • Istio service mesh with Gateway API - for advanced traffic management, mTLS, canary releases
  • Application Gateway for Containers — for L7 load balancing with WAF integration
DNS:
  • Enable LocalDNS on all node pools for reliable, performant DNS resolution
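
As a rough sketch of how the recommended pod IP model and dataplane translate into cluster creation flags, the helper below assembles arguments for `az aks create` without running anything. The function name and the example CIDRs are hypothetical; the flags are those documented for the Azure CLI as far as I know, so confirm them with `az aks create --help` before relying on them. The overlap check uses Python's standard `ipaddress` module.

```python
import ipaddress


def network_flags(vnet_cidr: str, pod_cidr: str, restricted_egress: bool = False) -> list[str]:
    """Build networking flags for `az aks create` (Azure CNI Overlay + Cilium)."""
    vnet = ipaddress.ip_network(vnet_cidr)
    pods = ipaddress.ip_network(pod_cidr)
    # Overlay pod IPs come from a private range; it should not collide with the VNet.
    if vnet.overlaps(pods):
        raise ValueError(f"pod CIDR {pods} overlaps VNet CIDR {vnet}")

    flags = [
        "--network-plugin", "azure",
        "--network-plugin-mode", "overlay",
        "--network-dataplane", "cilium",
        "--pod-cidr", str(pods),
    ]
    if restricted_egress:
        # Egress through a UDR to Azure Firewall or an NVA; the route table
        # is assumed to exist on the node subnet already.
        flags += ["--outbound-type", "userDefinedRouting"]
    return flags


print(" ".join(network_flags("10.0.0.0/16", "192.168.0.0/16")))
```

Because the pod IP model is a Day-0 decision, validating the ranges before creation is cheaper than discovering a conflict afterwards.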

3. Security

  • Use Microsoft Entra ID everywhere (control plane, Workload Identity for pods, node access). Avoid static credentials.
  • Azure Key Vault via Secrets Store CSI Driver for secrets
  • Enable Azure Policy + Deployment Safeguards
  • Enable encryption at rest for etcd/API server data and encryption in transit for node-to-node traffic
  • Allow only signed, policy-approved images (Azure Policy + Ratify), prefer Azure Container Registry
  • Isolation: Use namespaces, network policies, scoped logging
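
A minimal sketch of how the identity and secrets recommendations might map onto `az aks create` flags follows. The helper names are hypothetical and nothing is executed; the flags and add-on names are the ones I understand the Azure CLI to use, so verify them against your CLI version. Deployment Safeguards and image signing with Ratify are not covered here. The second helper builds the subject string that a federated identity credential expects for a Kubernetes service account.

```python
def security_flags(use_key_vault: bool = True, use_azure_policy: bool = True) -> list[str]:
    """Build identity and secrets-related flags for `az aks create`."""
    flags = [
        # Microsoft Entra ID for the control plane, no static local accounts.
        "--enable-aad", "--enable-azure-rbac", "--disable-local-accounts",
        # Prerequisites for Workload Identity on pods.
        "--enable-oidc-issuer", "--enable-workload-identity",
    ]
    addons = []
    if use_key_vault:
        addons.append("azure-keyvault-secrets-provider")  # Secrets Store CSI Driver
    if use_azure_policy:
        addons.append("azure-policy")
    if addons:
        flags += ["--enable-addons", ",".join(addons)]
    return flags


def federated_subject(namespace: str, service_account: str) -> str:
    """Subject claim for a federated credential bound to a service account."""
    return f"system:serviceaccount:{namespace}:{service_account}"


print(" ".join(security_flags()))
print(federated_subject("payments", "api"))
```

The namespace `payments` and service account `api` are placeholder values.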

4. Observability

  • Use Managed Prometheus and Container Insights with Grafana for AKS observability (logs + metrics).
  • Enable Diagnostic Settings to collect control plane logs and audit logs in a Log Analytics workspace for security monitoring and troubleshooting.
  • For other monitoring and troubleshooting needs, use features such as the Agentic CLI for AKS, Application Insights, Resource Health Center, AppLens detectors, and Azure Advisor.

5. Upgrades & Patching

  • Configure Maintenance Windows for controlled upgrade timing
  • Enable auto-upgrades for control plane and node OS to stay up-to-date with security patches and Kubernetes versions
  • Consider LTS versions for enterprise stability (2-year support) by upgrading your AKS environment to the Premium tier
  • Fleet upgrades: Use AKS Fleet Manager for staged rollout across test to production environments
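
The upgrade settings above could be expressed roughly as follows. Both helpers are hypothetical and only build argument lists. The channel values (`stable`, `NodeImage`), the LTS support plan name, and the maintenance configuration name `aksManagedAutoUpgradeSchedule` reflect my understanding of the Azure CLI. The weekly Sunday 02:00 window of four hours is an arbitrary example, not a recommendation from this guide.

```python
def upgrade_flags(long_term_support: bool = False) -> list[str]:
    """Build auto-upgrade flags for `az aks create` or `az aks update`."""
    flags = [
        "--auto-upgrade-channel", "stable",         # control plane and node Kubernetes version
        "--node-os-upgrade-channel", "NodeImage",   # node OS image updates
    ]
    if long_term_support:
        # LTS is tied to the Premium tier.
        flags += ["--tier", "premium", "--k8s-support-plan", "AKSLongTermSupport"]
    return flags


def maintenance_window_args(resource_group: str, cluster_name: str) -> list[str]:
    """Arguments for a weekly auto-upgrade maintenance window (example schedule)."""
    return [
        "az", "aks", "maintenanceconfiguration", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", "aksManagedAutoUpgradeSchedule",
        "--schedule-type", "Weekly",
        "--day-of-week", "Sunday",
        "--interval-weeks", "1",
        "--start-time", "02:00",
        "--duration", "4",
    ]


print(" ".join(upgrade_flags(long_term_support=True)))
print(" ".join(maintenance_window_args("rg-demo", "aks-demo")))
```

The resource group `rg-demo` and cluster `aks-demo` are placeholders.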

6. Performance

  • Use Ephemeral OS disks (`--node-osdisk-type Ephemeral`) for faster node startup
  • Select Azure Linux as node OS (smaller footprint, faster boot)
  • Enable KEDA for event-driven autoscaling beyond HPA

7. Node Pools & Compute

  • Dedicated system node pool: At least 2 nodes, tainted for system workloads only (`CriticalAddonsOnly`)
  • Enable Node Auto Provisioning (NAP) on all pools for cost savings and responsive scaling
  • Use latest generation SKUs (v5/v6) for host-level optimizations
  • Avoid B-series VMs — burstable SKUs cause performance/reliability issues
  • Use SKUs with at least 4 vCPUs for production workloads
  • Set topology spread constraints to distribute pods across hosts/zones per SLO
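
The SKU guidance above (no B-series, at least 4 vCPUs, v5/v6 generation) can be checked mechanically from the VM size name. The sketch below is a heuristic that assumes the usual Azure naming pattern, `Standard_<family><vCPUs><features>_v<version>`. The function name is hypothetical, and the generation threshold simply mirrors the bullet above, so it may be too strict for specialized families such as GPU sizes.

```python
import re

# Heuristic pattern for names like Standard_D4ds_v5 or Standard_NC24ads_A100_v4.
_SIZE = re.compile(
    r"^Standard_(?P<family>[A-Z]+)(?P<vcpus>\d+)(?:-\d+)?(?P<features>[a-z]*)"
    r"(?:_[A-Za-z0-9]+)*?(?:_v(?P<version>\d+))?$"
)


def check_node_sku(vm_size: str) -> list[str]:
    """Return a list of concerns for a production node pool; empty means none found."""
    match = _SIZE.match(vm_size)
    if match is None:
        return [f"unrecognized VM size name: {vm_size}"]

    problems = []
    if match["family"] == "B":
        problems.append("B-series is burstable; avoid for AKS nodes")
    if int(match["vcpus"]) < 4:
        problems.append("fewer than 4 vCPUs")
    # Sizes without a version suffix are treated as first generation.
    if int(match["version"] or 1) < 5:
        problems.append("older than v5 generation")
    return problems


for size in ("Standard_D4ds_v5", "Standard_D2s_v3", "Standard_B2ms"):
    print(size, check_node_sku(size))
```

Treat the result as a prompt for discussion with the user rather than a hard gate, since regional availability and quota also constrain the choice.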

8. Reliability

  • Deploy across 3 Availability Zones (`--zones 1 2 3`)
  • Use Standard tier for zone-redundant control plane + 99.95% SLA for API server availability
  • Enable Microsoft Defender for Containers for runtime protection
  • Configure PodDisruptionBudgets for all production workloads
  • Use topology spread constraints to ensure pod distribution across failure domains
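
To make the last two bullets concrete, here is a small sketch that generates a PodDisruptionBudget and a zone-level topology spread constraint as plain dictionaries, which can be dumped to JSON and applied with `kubectl apply -f`. The helper names, the `app` label convention, and the example values are assumptions for illustration. The field names follow the Kubernetes `policy/v1` and pod spec APIs.

```python
import json


def pdb_manifest(app: str, min_available: int) -> dict:
    """PodDisruptionBudget keeping at least `min_available` pods during disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app}},
        },
    }


def zone_spread_constraint(app: str, max_skew: int = 1) -> dict:
    """Entry for `spec.topologySpreadConstraints` spreading pods across zones."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "topology.kubernetes.io/zone",
        # ScheduleAnyway prefers balance but does not block scheduling;
        # use DoNotSchedule when the SLO requires strict spreading.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app}},
    }


print(json.dumps(pdb_manifest("checkout", 2), indent=2))
print(json.dumps(zone_spread_constraint("checkout"), indent=2))
```

With three replicas and `minAvailable` set to 2, a node drain during an upgrade can evict only one pod of the workload at a time.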
  • 3个可用区部署(
    --zones 1 2 3
  • 使用Standard tier实现跨可用区的控制平面冗余 + API服务器99.95%的SLA可用性
  • 启用Microsoft Defender for Containers以实现运行时防护
  • 为所有生产工作负载配置PodDisruptionBudgets
  • 使用拓扑分布约束确保Pod在故障域间的分布

9. Cost Controls

  • Use Spot node pools for batch/interruptible workloads (up to 90% savings)
  • Stop/Start dev/test clusters: `az aks stop` / `az aks start`
  • Consider Reserved Instances or Savings Plans for steady-state workloads
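
The Spot and stop/start items might look like this when assembled as Azure CLI argument lists. The helper names, the pool name `spotpool`, and the autoscaler bounds are hypothetical. The flags are the ones I understand `az aks nodepool add` to accept; `--spot-max-price -1` means the price is capped at the on-demand rate, so nodes are not evicted on price alone.

```python
def spot_nodepool_args(resource_group: str, cluster_name: str, pool: str = "spotpool") -> list[str]:
    """Arguments for adding a Spot user node pool for interruptible workloads."""
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool,
        "--priority", "Spot",
        "--eviction-policy", "Delete",
        "--spot-max-price", "-1",
        "--enable-cluster-autoscaler",
        "--min-count", "1",
        "--max-count", "3",
    ]


def power_args(action: str, resource_group: str, cluster_name: str) -> list[str]:
    """Arguments for `az aks stop` or `az aks start` on a dev/test cluster."""
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    return ["az", "aks", action, "--resource-group", resource_group, "--name", cluster_name]


print(" ".join(spot_nodepool_args("rg-demo", "aks-demo")))
print(" ".join(power_args("stop", "rg-demo", "aks-demo")))
```

Workloads need a matching toleration to land on Spot nodes, so keep at least one regular pool for anything that cannot tolerate eviction.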

Guardrails / Safety

  • Do not request or output secrets (tokens, keys).
  • If requirements are ambiguous for day-0 critical decisions, ask the user clarifying questions. For day-1 enabled features, propose 2–3 safe options with tradeoffs and choose a conservative default.
  • Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades along with best practices for reliability and performance.

MCP Tools

| Tool | Purpose | Key Parameters |
| --- | --- | --- |
| `mcp_azure_mcp_aks` | AKS MCP entry point used to discover the exact AKS-specific tools exposed by the client | Discover the callable AKS tool first, then use that tool's parameters |

Error Handling

| Error / Symptom | Likely Cause | Remediation |
| --- | --- | --- |
| MCP tool call fails or times out | Invalid credentials, subscription, or AKS context | Verify `az login`, confirm the active subscription context with `az account show`, and check the target resource group without echoing subscription identifiers back to the user |
| Quota exceeded | Regional vCPU or resource limits | Request quota increase or select different region/VM SKU |
| Networking conflict (IP exhaustion) | Pod subnet too small for overlay/CNI | Re-plan IP ranges; may require cluster recreation (Day-0) |
| Workload Identity not working | Missing OIDC issuer or federated credential | Enable `--enable-oidc-issuer --enable-workload-identity`, configure federated identity |