Kubernetes Specialist

Purpose

Provides expert guidance on Kubernetes orchestration and cloud-native applications, covering cluster management and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises clusters.

When to Use

Designing Kubernetes cluster architecture for production workloads
Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
Troubleshooting cluster issues (networking, storage, performance)
Planning Kubernetes upgrades or multi-cluster strategies
Optimizing resource utilization and cost in Kubernetes environments
Setting up service mesh (Istio, Linkerd) and observability
Implementing Kubernetes security and RBAC policies

Quick Start

Invoke this skill when:

Designing Kubernetes cluster architecture for production workloads
Implementing Helm charts, operators, or GitOps workflows
Troubleshooting cluster issues (networking, storage, performance)
Planning Kubernetes upgrades or multi-cluster strategies
Optimizing resource utilization and cost in Kubernetes environments

Do NOT invoke when:

Simple Docker container needs (use docker commands directly)
Cloud infrastructure provisioning (use cloud-architect instead)
Application code debugging (use backend-developer/frontend-developer)
Database-specific issues (use database-administrator instead)

Decision Framework

Deployment Strategy Selection

├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no duplicate environment needed
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
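
The tree above can be read as two small lookups: one for the rollout strategy and one for the workload kind. The sketch below is only an illustration of that reading. The function names and boolean flags are hypothetical, and the precedence when several flags are set (instant rollback wins over gradual rollout, batch wins over stateful) is an assumption the tree itself does not state.

```python
def pick_rollout_strategy(instant_rollback: bool = False,
                          gradual_rollout: bool = False) -> str:
    """Map zero-downtime requirements to a rollout strategy (hypothetical helper)."""
    if instant_rollback:
        return "blue-green"      # instant switch, but 2x resources during deployment
    if gradual_rollout:
        return "canary"          # tests with a subset of traffic, needs routing setup
    return "rolling-update"      # built-in default for simple updates


def pick_workload_kind(stateful: bool = False,
                       batch: bool = False,
                       scheduled: bool = False) -> str:
    """Map workload traits to a Kubernetes workload kind (hypothetical helper)."""
    if batch:
        # Parallel batch work is still a Job; parallelism is a field on its spec.
        return "CronJob" if scheduled else "Job"
    return "StatefulSet" if stateful else "Deployment"


print(pick_rollout_strategy(gradual_rollout=True))
print(pick_workload_kind(stateful=True))
```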

Resource Configuration Matrix

Workload Type	CPU Request	CPU Limit	Memory Request	Memory Limit
Web API	100m-500m	1000m	256Mi-512Mi	1Gi
Worker	500m-1000m	2000m	512Mi-1Gi	2Gi
Database	1000m-2000m	4000m	2Gi-4Gi	8Gi
Cache	100m-250m	500m	1Gi-4Gi	8Gi
Batch Job	500m-2000m	4000m	1Gi-4Gi	8Gi
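
One way to apply the matrix is to turn each row into a container `resources` stanza. The sketch below assumes the lower bound of each request range as the starting value, which is a choice made here for illustration and not something the matrix prescribes. The profile keys and the `container_resources` helper are hypothetical names.

```python
# Starting points derived from the matrix above: lower bound of each request
# range (an assumption) plus the listed limit. Tune against observed usage.
RESOURCE_PROFILES = {
    "web-api":   {"requests": {"cpu": "100m",  "memory": "256Mi"},
                  "limits":   {"cpu": "1000m", "memory": "1Gi"}},
    "worker":    {"requests": {"cpu": "500m",  "memory": "512Mi"},
                  "limits":   {"cpu": "2000m", "memory": "2Gi"}},
    "database":  {"requests": {"cpu": "1000m", "memory": "2Gi"},
                  "limits":   {"cpu": "4000m", "memory": "8Gi"}},
    "cache":     {"requests": {"cpu": "100m",  "memory": "1Gi"},
                  "limits":   {"cpu": "500m",  "memory": "8Gi"}},
    "batch-job": {"requests": {"cpu": "500m",  "memory": "1Gi"},
                  "limits":   {"cpu": "4000m", "memory": "8Gi"}},
}


def container_resources(workload_type: str) -> dict:
    """Return a `resources` stanza for a container spec (hypothetical helper)."""
    try:
        profile = RESOURCE_PROFILES[workload_type]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload_type!r}") from None
    # Copy each section so callers can adjust values without mutating the table.
    return {section: dict(values) for section, values in profile.items()}


container = {
    "name": "api",                       # placeholder container name
    "image": "example.invalid/api:1.0",  # placeholder image reference
    "resources": container_resources("web-api"),
}
print(container["resources"])
```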

Node Pool Strategy

Use Case	Instance Type	Scaling	Cost
System pods	t3.large (3 nodes)	Fixed	Low
Applications	m5.xlarge	Auto 3-20	Medium
Batch/Spot	m5.large to m5.2xlarge	Auto 0-50	Very Low
GPU workloads	p3.2xlarge	Manual	High

Red Flags → Escalate

STOP and escalate if:

Cluster upgrade crosses breaking API changes (deprecated or removed API versions)
Multi-region active-active requirements
Compliance requirements (PCI-DSS, HIPAA) need validation
Custom scheduler or controller development needed
etcd corruption or cluster state issues

Quality Checklist

Cluster Configuration

Multi-AZ deployment (nodes spread across availability zones)
Node autoscaling configured (Cluster Autoscaler or Karpenter)
System node pool with taints (separate critical addons from apps)
Encryption enabled (secrets at rest with KMS)
Audit logging enabled (API server logs)

Security

Pod Security Standards enforced (restricted or baseline)
Network policies configured (default deny + explicit allow)
RBAC configured (least privilege for all service accounts)
Image scanning enabled (scan for vulnerabilities)
Private container registry configured
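
The "default deny + explicit allow" item can be sketched as two NetworkPolicy manifests, built here as plain dicts and printed as JSON, which `kubectl apply -f` accepts alongside YAML. The namespace, the `app` labels, and the port are placeholders chosen for illustration.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # empty selector = all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so both are denied
        },
    }


def allow_ingress(namespace: str, target_app: str, source_app: str, port: int) -> dict:
    """NetworkPolicy that lets pods labelled app=source_app reach app=target_app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{source_app}-to-{target_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": source_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


# Placeholder names: namespace "shop", apps "frontend" and "api", port 8080.
policies = [default_deny("shop"), allow_ingress("shop", "api", "frontend", 8080)]
print(json.dumps(policies, indent=2))
```

Two caveats apply to this sketch: NetworkPolicies only take effect when the cluster's network plugin enforces them, and denying all egress also blocks DNS lookups until an explicit egress rule allows them.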

Resource Management

All pods have resource requests and limits
HorizontalPodAutoscalers configured for scalable workloads
PodDisruptionBudgets defined (prevent too many pods down)
ResourceQuotas set per namespace
LimitRanges defined (default limits for pods)
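
A HorizontalPodAutoscaler and a PodDisruptionBudget for the same workload interact: if the budget demands as many available pods as the autoscaler's minimum, voluntary evictions can stall. The sketch below builds both manifests together and checks that relationship. The 70% CPU target, the `app` label convention, and the `scaling_guards` helper name are assumptions for illustration.

```python
import json


def scaling_guards(name: str, namespace: str, min_replicas: int,
                   max_replicas: int, min_available: int,
                   cpu_utilization: int = 70) -> list:
    """Return [HPA, PDB] manifests for a Deployment called `name` (hypothetical helper)."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    if not 1 <= min_available < min_replicas:
        # A budget equal to the replica floor would leave no pod free to evict.
        raise ValueError("need 1 <= min_available < min_replicas")

    hpa = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": cpu_utilization},
                },
            }],
        },
    }
    pdb = {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": name}},  # assumes pods carry app=<name>
        },
    }
    return [hpa, pdb]


print(json.dumps(scaling_guards("api", "shop", 3, 10, 2), indent=2))
```

CPU utilization targets are computed relative to the pods' CPU requests, so this only works when requests are set, which ties back to the first item in the list above.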

High Availability

Deployments have ≥2 replicas
Anti-affinity rules prevent pod co-location
Readiness and liveness probes configured
PodDisruptionBudgets leave headroom for rolling updates and node drains (not so strict that evictions are blocked)
Multi-region deployment (if global scale is required)
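
The first three items in this list map directly onto fields of a Deployment. The sketch below builds such a manifest as a dict: at least two replicas, a preferred pod anti-affinity rule that spreads replicas across nodes, and HTTP readiness and liveness probes. The `/healthz` path, the probe timings, the image reference, and the `ha_deployment` helper name are placeholders, not values taken from this document.

```python
import json


def ha_deployment(name: str, image: str, port: int, replicas: int = 2) -> dict:
    """Deployment manifest with replicas >= 2, anti-affinity, and probes (a sketch)."""
    if replicas < 2:
        raise ValueError("high availability needs at least 2 replicas")
    labels = {"app": name}
    probe = {"httpGet": {"path": "/healthz", "port": port}}  # assumed health endpoint
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {"podAntiAffinity": {
                        # "preferred" keeps pods schedulable on small clusters;
                        # the "required" variant enforces strict separation.
                        "preferredDuringSchedulingIgnoredDuringExecution": [{
                            "weight": 100,
                            "podAffinityTerm": {
                                "labelSelector": {"matchLabels": labels},
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }],
                    }},
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        "readinessProbe": {**probe, "periodSeconds": 10},
                        "livenessProbe": {**probe, "initialDelaySeconds": 15,
                                          "periodSeconds": 20},
                    }],
                },
            },
        },
    }


manifest = ha_deployment("api", "example.invalid/api:1.0", 8080, replicas=3)
print(json.dumps(manifest, indent=2))
```

Swapping the topology key for `topology.kubernetes.io/zone` spreads replicas across availability zones instead of nodes.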

Observability

Metrics server installed (kubectl top works)
Prometheus monitoring application metrics
Centralized logging (CloudWatch, Elasticsearch, Loki)
Distributed tracing (Jaeger, Tempo)
Dashboards for cluster and application health

Disaster Recovery

Velero installed for cluster backups
Backup schedule configured (daily minimum)
Restore tested (at least one drill per year)
etcd backups automated (handled by the provider on cloud-managed clusters; scheduled snapshots on self-managed clusters)
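
A daily backup schedule could be expressed as a Velero `Schedule` resource. The sketch below builds one as a dict. It assumes Velero is installed in the `velero` namespace and that the installed version uses the `velero.io/v1` Schedule fields shown here (`schedule`, `template.includedNamespaces`, `template.ttl`), so check them against that version's documentation before applying. The 02:00 UTC run time, the 30-day retention, and the `daily_backup_schedule` helper name are placeholders.

```python
import json


def daily_backup_schedule(name: str = "daily-cluster-backup",
                          hour_utc: int = 2,
                          retention_days: int = 30) -> dict:
    """Velero Schedule manifest for one backup per day (a sketch, not verified
    against any particular Velero release)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},  # assumed install namespace
        "spec": {
            "schedule": f"0 {hour_utc} * * *",              # cron: once a day
            "template": {
                "includedNamespaces": ["*"],                # back up every namespace
                "ttl": f"{retention_days * 24}h0m0s",       # how long backups are kept
            },
        },
    }


print(json.dumps(daily_backup_schedule(), indent=2))
```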

Additional Resources

Detailed Technical Reference: See REFERENCE.md
Code Examples & Patterns: See EXAMPLES.md
