kubernetes-specialist
Kubernetes Specialist
Purpose
Provides expertise in Kubernetes orchestration and cloud-native applications, with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.
When to Use
- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments
- Setting up service mesh (Istio, Linkerd) and observability
- Implementing Kubernetes security and RBAC policies
Quick Start
Invoke this skill when:
- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments
Do NOT invoke when:
- Simple Docker container needs (use docker commands directly)
- Cloud infrastructure provisioning (use cloud-architect instead)
- Application code debugging (use backend-developer/frontend-developer)
- Database-specific issues (use database-administrator instead)
Decision Framework
Deployment Strategy Selection
├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
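The decision tree can be sketched as two small functions, one for the workload resource and one for the rollout strategy. This is a minimal illustration, not part of any Kubernetes API: the `Workload` fields and function names are hypothetical, and falling back to a rolling update when zero downtime is not required is an assumption, since the tree does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    stateful: bool = False
    batch: bool = False
    scheduled: bool = False
    parallel: bool = False
    zero_downtime: bool = False
    instant_rollback: bool = False
    gradual_rollout: bool = False


def choose_controller(w: Workload) -> str:
    """Pick the workload resource following the stateful and batch branches."""
    if w.batch:
        if w.scheduled:
            return "CronJob"
        return "Job with parallelism" if w.parallel else "Job"
    return "StatefulSet + PVC" if w.stateful else "Deployment"


def choose_rollout(w: Workload) -> str:
    """Pick the rollout strategy following the zero-downtime branch."""
    if w.zero_downtime and w.instant_rollback:
        return "Blue-Green Deployment"
    if w.zero_downtime and w.gradual_rollout:
        return "Canary Deployment"
    # Assumption: anything else falls back to the built-in default.
    return "Rolling Update"
```

For example, `choose_rollout(Workload(zero_downtime=True, gradual_rollout=True))` selects the canary branch, while a scheduled batch workload maps to `CronJob`.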
Resource Configuration Matrix
| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Web API | 100m-500m | 1000m | 256Mi-512Mi | 1Gi |
| Worker | 500m-1000m | 2000m | 512Mi-1Gi | 2Gi |
| Database | 1000m-2000m | 4000m | 2Gi-4Gi | 8Gi |
| Cache | 100m-250m | 500m | 1Gi-4Gi | 8Gi |
| Batch Job | 500m-2000m | 4000m | 1Gi-4Gi | 8Gi |
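One way to apply the matrix is to encode it as data and generate the `resources` block of a container spec from it. The sketch below assumes requests start at the low end of each range (a starting point to be tuned against observed usage); the dictionary keys such as `web-api` and the helper names are hypothetical.

```python
import copy

# Requests take the low end of each range in the matrix; limits are copied as-is.
# The keys ("web-api", "worker", ...) are hypothetical identifiers.
RESOURCE_MATRIX = {
    "web-api": {"requests": {"cpu": "100m", "memory": "256Mi"},
                "limits": {"cpu": "1000m", "memory": "1Gi"}},
    "worker": {"requests": {"cpu": "500m", "memory": "512Mi"},
               "limits": {"cpu": "2000m", "memory": "2Gi"}},
    "database": {"requests": {"cpu": "1000m", "memory": "2Gi"},
                 "limits": {"cpu": "4000m", "memory": "8Gi"}},
    "cache": {"requests": {"cpu": "100m", "memory": "1Gi"},
              "limits": {"cpu": "500m", "memory": "8Gi"}},
    "batch-job": {"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "4000m", "memory": "8Gi"}},
}


def resources_for(workload_type: str) -> dict:
    """Return a fresh copy of the requests/limits block for a workload type."""
    try:
        return copy.deepcopy(RESOURCE_MATRIX[workload_type])
    except KeyError:
        raise ValueError(f"unknown workload type: {workload_type!r}") from None


def container_spec(name: str, image: str, workload_type: str) -> dict:
    """Build a container entry with the matrix defaults attached."""
    return {"name": name, "image": image, "resources": resources_for(workload_type)}
```

The returned dictionary follows the shape of a container entry in a pod spec, so it can be serialized to JSON or YAML and embedded in a manifest.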
Node Pool Strategy
| Use Case | Instance Type | Scaling | Cost |
|---|---|---|---|
| System pods | t3.large (3 nodes) | Fixed | Low |
| Applications | m5.xlarge | Auto 3-20 | Medium |
| Batch/Spot | m5.large-2xlarge | Auto 0-50 | Very Low |
| GPU workloads | p3.2xlarge | Manual | High |
Red Flags → Escalate
STOP and escalate if:
- Cluster upgrade with breaking API changes (deprecated versions)
- Multi-region active-active requirements
- Compliance requirements (PCI-DSS, HIPAA) need validation
- Custom scheduler or controller development needed
- etcd corruption or cluster state issues
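The first red flag can be detected before an upgrade by scanning manifests for API versions that the target release no longer serves. The sketch below is a partial illustration: the table lists only a handful of well-known removals and is not exhaustive, and the function names are hypothetical. Manifests are assumed to be already parsed into dictionaries.

```python
# Partial table of (apiVersion, kind) pairs and the release that stopped serving them.
# Not exhaustive; consult the Kubernetes deprecation guide for the full list.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "1.26",
}


def _parse_version(version: str) -> tuple:
    """Turn '1.25' or '1.25.3' into (1, 25) for comparison."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def blocked_by_upgrade(manifests: list, target_version: str) -> list:
    """List manifests whose API version is removed at or before the target release."""
    target = _parse_version(target_version)
    hits = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in and _parse_version(removed_in) <= target:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            hits.append(f"{key[1]}/{name}: {key[0]} removed in {removed_in}")
    return hits
```

A non-empty result is a signal to stop and escalate rather than proceed with the upgrade.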
Quality Checklist
Cluster Configuration
- Multi-AZ deployment (nodes spread across availability zones)
- Node autoscaling configured (Cluster Autoscaler or Karpenter)
- System node pool with taints (separate critical addons from apps)
- Encryption enabled (secrets at rest with KMS)
- Audit logging enabled (API server logs)
Security
- Pod Security Standards enforced (restricted or baseline)
- Network policies configured (default deny + explicit allow)
- RBAC configured (least privilege for all service accounts)
- Image scanning enabled (scan for vulnerabilities)
- Private container registry configured
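The "default deny + explicit allow" pattern for network policies can be sketched as two manifest builders. The manifests use the `networking.k8s.io/v1` NetworkPolicy schema; the policy names, the `app` label key, and the function names are assumptions chosen for illustration.

```python
def default_deny(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector matches every pod in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace: str, name: str, to_app: str, from_app: str, port: int) -> dict:
    """Explicitly allow TCP traffic from one labelled app to another."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumption: workloads are labelled with an "app" key.
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }
```

Because the deny policy also blocks egress, DNS lookups stop working until an egress rule for the cluster DNS service is added. Policies only take effect when the cluster's network plugin enforces NetworkPolicy.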
Resource Management
- All pods have resource requests and limits
- HorizontalPodAutoscalers configured for scalable workloads
- PodDisruptionBudgets defined (prevent too many pods from going down at once)
- ResourceQuotas set per namespace
- LimitRanges defined (default limits for pods)
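The first item in this list lends itself to an automated check. The sketch below takes a pod spec as a parsed dictionary and reports every container that lacks a CPU or memory request or limit. The function name is hypothetical, and only `containers` is inspected; covering `initContainers` would need the same loop applied to that field.

```python
def missing_resources(pod_spec: dict) -> list:
    """Report containers without CPU/memory requests or limits."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            for resource in ("cpu", "memory"):
                if resource not in values:
                    problems.append(f"{name}: missing {section}.{resource}")
    return problems
```

An empty list means every container in the spec sets all four values. A namespace LimitRange can fill in defaults at admission time, so a check like this is best run against the manifests in source control.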
High Availability
- Deployments have ≥2 replicas
- Anti-affinity rules prevent pod co-location
- Readiness and liveness probes configured
- PodDisruptionBudgets leave headroom for node drains and rolling upgrades (minAvailable below the replica count)
- Multi-region cluster (if global scale required)
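A PodDisruptionBudget that is too strict blocks voluntary evictions, for example when `minAvailable` equals the replica count and a node needs to be drained. The arithmetic can be sketched as follows; the function name is hypothetical and only integer values are handled, although the real fields also accept percentages.

```python
from typing import Optional


def allowed_disruptions(replicas: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Pods that may be evicted at once, assuming all replicas are healthy."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return max(replicas - min_available, 0)
    return min(max_unavailable, replicas)
```

With two replicas and `min_available=2` the result is zero, which means a node drain would wait indefinitely; raising the replica count to three or lowering `min_available` to one leaves room for one eviction at a time.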
Observability
- Metrics server installed (kubectl top works)
- Prometheus monitoring application metrics
- Centralized logging (CloudWatch, Elasticsearch, Loki)
- Distributed tracing (Jaeger, Tempo)
- Dashboards for cluster and application health
Disaster Recovery
- Velero installed for cluster backups
- Backup schedule configured (daily minimum)
- Restore tested (annual drill)
- etcd backups automated (cloud-managed clusters)
Additional Resources
- Detailed Technical Reference: See REFERENCE.md
- Code Examples & Patterns: See EXAMPLES.md