kubernetes-specialist

Kubernetes Specialist

Purpose

Provides expert guidance on Kubernetes orchestration and cloud-native applications, with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.

When to Use

  • Designing Kubernetes cluster architecture for production workloads
  • Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
  • Troubleshooting cluster issues (networking, storage, performance)
  • Planning Kubernetes upgrades or multi-cluster strategies
  • Optimizing resource utilization and cost in Kubernetes environments
  • Setting up service mesh (Istio, Linkerd) and observability
  • Implementing Kubernetes security and RBAC policies

Quick Start

Invoke this skill when:
  • Designing Kubernetes cluster architecture for production workloads
  • Implementing Helm charts, operators, or GitOps workflows
  • Troubleshooting cluster issues (networking, storage, performance)
  • Planning Kubernetes upgrades or multi-cluster strategies
  • Optimizing resource utilization and cost in Kubernetes environments
Do NOT invoke when:
  • Simple Docker container needs (use docker commands directly)
  • Cloud infrastructure provisioning (use cloud-architect instead)
  • Application code debugging (use backend-developer/frontend-developer)
  • Database-specific issues (use database-administrator instead)

Decision Framework

Deployment Strategy Selection

├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
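
The tree above can be read as two independent choices: which resource kind to use, and which rollout strategy to apply. A minimal Python sketch of that reading follows. The `Workload` class, its field names, and both function names are hypothetical, and checking "instant rollback" before "gradual rollout" is an assumption, since the tree does not rank the two.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Answers to the questions in the decision tree (hypothetical field names)."""
    zero_downtime: bool = False
    instant_rollback: bool = False
    gradual_rollout: bool = False
    stateful: bool = False
    batch: bool = False
    scheduled: bool = False
    parallel: bool = False


def choose_workload_kind(w: Workload) -> str:
    """Pick the resource kind from the 'Stateful?' and 'Batch?' branches."""
    if w.batch:
        if w.scheduled:
            return "CronJob"
        return "Job with parallelism" if w.parallel else "Job"
    return "StatefulSet + PVC" if w.stateful else "Deployment"


def choose_rollout_strategy(w: Workload) -> str:
    """Pick the rollout strategy from the 'Zero downtime required?' branch."""
    if w.zero_downtime and w.instant_rollback:
        return "Blue-Green Deployment"
    if w.zero_downtime and w.gradual_rollout:
        return "Canary Deployment"
    # Simple updates fall through to the built-in default.
    return "Rolling Update (default)"
```

For example, `choose_workload_kind(Workload(batch=True, scheduled=True))` selects a CronJob, while a workload with no flags set resolves to a Deployment with a rolling update.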

Resource Configuration Matrix

Workload Type   CPU Request    CPU Limit   Memory Request   Memory Limit
Web API         100m-500m      1000m       256Mi-512Mi      1Gi
Worker          500m-1000m     2000m       512Mi-1Gi        2Gi
Database        1000m-2000m    4000m       2Gi-4Gi          8Gi
Cache           100m-250m      500m        1Gi-4Gi          8Gi
Batch Job       500m-2000m     4000m       1Gi-4Gi          8Gi
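
As a rough illustration of how these figures map onto a container spec, the sketch below stores each row of the matrix and renders a `resources` mapping in the shape Kubernetes expects (`requests` and `limits`, each with `cpu` and `memory`). The names `RESOURCE_MATRIX` and `resources_for` are hypothetical, and defaulting to the lower end of each request range is an assumption, not a recommendation from the matrix.

```python
# Each entry: (request low, request high, limit), copied from the matrix.
RESOURCE_MATRIX = {
    "Web API":   {"cpu": ("100m", "500m", "1000m"),   "memory": ("256Mi", "512Mi", "1Gi")},
    "Worker":    {"cpu": ("500m", "1000m", "2000m"),  "memory": ("512Mi", "1Gi", "2Gi")},
    "Database":  {"cpu": ("1000m", "2000m", "4000m"), "memory": ("2Gi", "4Gi", "8Gi")},
    "Cache":     {"cpu": ("100m", "250m", "500m"),    "memory": ("1Gi", "4Gi", "8Gi")},
    "Batch Job": {"cpu": ("500m", "2000m", "4000m"),  "memory": ("1Gi", "4Gi", "8Gi")},
}


def resources_for(workload_type: str, use_upper_request: bool = False) -> dict:
    """Return a container `resources` mapping for one row of the matrix.

    Assumption: start from the low end of the request range unless asked otherwise.
    """
    row = RESOURCE_MATRIX[workload_type]
    idx = 1 if use_upper_request else 0
    return {
        "requests": {"cpu": row["cpu"][idx], "memory": row["memory"][idx]},
        "limits": {"cpu": row["cpu"][2], "memory": row["memory"][2]},
    }
```

The returned mapping can be placed under a container's `resources` key; observed usage should then drive adjustments within the listed ranges.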

Node Pool Strategy

Use Case        Instance Type        Scaling      Cost
System pods     t3.large (3 nodes)   Fixed        Low
Applications    m5.xlarge            Auto 3-20    Medium
Batch/Spot      m5.large-2xlarge     Auto 0-50    Very Low
GPU workloads   p3.2xlarge           Manual       High

Red Flags → Escalate

STOP and escalate if:
  • Cluster upgrade involves breaking API changes (deprecated or removed API versions)
  • Multi-region active-active requirements
  • Compliance requirements (PCI-DSS, HIPAA) need validation
  • Custom scheduler or controller development needed
  • etcd corruption or cluster state issues

Quality Checklist

Cluster Configuration

  • Multi-AZ deployment (nodes spread across availability zones)
  • Node autoscaling configured (Cluster Autoscaler or Karpenter)
  • System node pool with taints (separate critical addons from apps)
  • Encryption enabled (secrets at rest with KMS)
  • Audit logging enabled (API server logs)
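
To show why a tainted system node pool keeps application pods off those nodes, here is a simplified Python sketch of the taint and toleration matching rule. The function names are hypothetical, the `CriticalAddonsOnly` taint in the usage note is only an illustrative key, and the sketch ignores details such as `tolerationSeconds`.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only meaningful with Exists and matches every key.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """True if every hard taint on the node is tolerated by the pod."""
    for taint in node_taints:
        if taint["effect"] == "PreferNoSchedule":
            continue  # soft preference, never blocks scheduling
        if not any(tolerates(t, taint) for t in pod_tolerations):
            return False
    return True
```

With a node tainted `{"key": "CriticalAddonsOnly", "value": "true", "effect": "NoSchedule"}`, a pod with no tolerations is rejected, while an addon carrying `{"key": "CriticalAddonsOnly", "operator": "Exists"}` is admitted.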

Security

  • Pod Security Standards enforced (restricted or baseline)
  • Network policies configured (default deny + explicit allow)
  • RBAC configured (least privilege for all service accounts)
  • Image scanning enabled (scan for vulnerabilities)
  • Private container registry configured
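
The "default deny + explicit allow" item can be sketched as two NetworkPolicy manifests built as plain Python dictionaries and serialized to JSON, which `kubectl apply -f` accepts alongside YAML. The function names, policy names, namespace, labels, and port are all hypothetical placeholders.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector selects every pod in the namespace
            # Both types listed with no rules, so nothing is allowed either way.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_ingress(namespace: str, name: str, to_labels: dict,
                  from_labels: dict, port: int) -> dict:
    """NetworkPolicy that allows TCP ingress to `to_labels` pods from `from_labels` pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": to_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": from_labels}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace and labels, for illustration only.
    policies = [
        default_deny("shop"),
        allow_ingress("shop", "allow-web-to-api", {"app": "api"}, {"app": "web"}, 8080),
    ]
    for policy in policies:
        print(json.dumps(policy, indent=2))
```

Because the default-deny policy also blocks egress, DNS lookups stop working until an explicit egress rule permits them, so an allow rule for cluster DNS usually has to accompany it.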

Resource Management

  • All pods have resource requests and limits
  • HorizontalPodAutoscalers configured for scalable workloads
  • PodDisruptionBudgets defined (prevent too many pods from being down at once)
  • ResourceQuotas set per namespace
  • LimitRanges defined (default limits for pods)
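
The first item in this list is easy to check mechanically. The sketch below walks a pod spec, represented as a plain dictionary such as one loaded from a manifest, and reports every container that lacks a CPU or memory request or limit. The function name `missing_resources` is hypothetical.

```python
def missing_resources(pod_spec: dict) -> list:
    """List every container field missing from requests or limits."""
    problems = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            for name in ("cpu", "memory"):
                if name not in values:
                    problems.append(f"{container['name']}: missing {section}.{name}")
    return problems
```

An empty result means every container in the spec declares all four values; anything else names the container and the missing field.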

High Availability

  • Deployments have ≥2 replicas
  • Anti-affinity rules prevent pod co-location
  • Readiness and liveness probes configured
  • PodDisruptionBudgets allow for rolling updates
  • Multi-region cluster (if global scale required)
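
Two of these items interact: a PodDisruptionBudget whose `minAvailable` equals the replica count permits no voluntary evictions, which stalls node drains during rolling node upgrades. The sketch below checks for that case. It handles integer values only (percentage values are out of scope), assumes all replicas are healthy, and uses the hypothetical function names `allowed_disruptions` and `check_ha`.

```python
def allowed_disruptions(replicas: int, min_available=None, max_unavailable=None) -> int:
    """Number of pods that may be evicted at once, assuming all replicas are healthy."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return max(replicas - min_available, 0)
    return min(max_unavailable, replicas)


def check_ha(replicas: int, min_available=None, max_unavailable=None) -> list:
    """Return the checklist items this workload fails (empty list means it passes)."""
    problems = []
    if replicas < 2:
        problems.append("fewer than 2 replicas")
    if allowed_disruptions(replicas, min_available, max_unavailable) < 1:
        problems.append("PDB allows no voluntary disruptions")
    return problems
```

For instance, three replicas with `minAvailable: 2` pass both checks, whereas two replicas with `minAvailable: 2` leave no room for an eviction.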

Observability

  • Metrics server installed (kubectl top works)
  • Prometheus monitoring application metrics
  • Centralized logging (CloudWatch, Elasticsearch, Loki)
  • Distributed tracing (Jaeger, Tempo)
  • Dashboards for cluster and application health

Disaster Recovery

  • Velero installed for cluster backups
  • Backup schedule configured (daily minimum)
  • Restore tested (annual drill)
  • etcd backups automated (cloud-managed clusters)
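
A daily backup schedule can be expressed as a Velero `Schedule` resource. The sketch below builds one as a Python dictionary. The function name, the schedule name, the `velero` namespace, the 02:00 run time, and the 30-day retention are all assumptions for illustration; check the field names against the Velero version in use before applying.

```python
import json


def velero_daily_schedule(name: str, hour: int = 2, ttl: str = "720h0m0s") -> dict:
    """Velero Schedule that backs up all namespaces once a day (assumed defaults)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},  # assumed install namespace
        "spec": {
            "schedule": f"0 {hour} * * *",  # cron: minute 0 of the given hour, every day
            "template": {
                "includedNamespaces": ["*"],
                "ttl": ttl,  # how long each backup is retained
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(velero_daily_schedule("daily-cluster-backup"), indent=2))
```

A schedule only covers the "backup" half of the checklist; the restore drill still has to be run against a scratch cluster or namespace to confirm the backups are usable.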

Additional Resources

  • Detailed Technical Reference: See REFERENCE.md
  • Code Examples & Patterns: See EXAMPLES.md