kubernetes-best-practices

Kubernetes Best Practices

This skill provides guidance for writing production-ready Kubernetes manifests and managing cloud-native applications.

Resource Management

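Memory: Set requests and limits to the same value. The scheduler then reserves exactly what the container may use, which makes the pod less likely to be evicted or OOM-killed when the node runs short of memory. A container that exceeds its own limit is still OOM-killed.
CPU: Set requests only and omit limits, so the container can burst into idle CPU and avoids throttling. Without CPU limits the pod falls in the Burstable QoS class; the Guaranteed class requires requests equal to limits for both CPU and memory.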
yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    # No CPU limit

Image Versioning

Always pin a specific version. Never use the :latest tag unless explicitly requested:
yaml
# Good
image: nginx:1.25.3

# Bad
image: nginx:latest
For immutability, consider pinning to a specific digest.
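As a rough illustration of digest pinning, the fragment below keeps the tag for readability and appends a digest. The container name `web` is hypothetical, and `<digest>` is a placeholder for the value your registry reports for the image; it is left unfilled on purpose.

```yaml
# Sketch only: container name and digest placeholder are assumptions.
containers:
  - name: web
    # When a digest is present, the image is resolved by digest;
    # the tag is kept only as a hint for human readers.
    image: nginx:1.25.3@sha256:<digest>
```

A digest identifies one exact image, so a re-pushed tag cannot silently change what runs.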

Configuration Management

Secrets: Sensitive data (passwords, tokens, certificates)
ConfigMaps: Non-sensitive configuration (feature flags, URLs, settings)
yaml
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: database-url
  - name: LOG_LEVEL
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: log-level
Best practices:
  • Never hardcode secrets in manifests
  • Use external secret management (Sealed Secrets, External Secrets Operator)
  • Rotate secrets regularly
  • Limit access with RBAC
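The following sketch shows one possible shape for the non-sensitive half of the example and for an RBAC rule that limits Secret access. The `production` namespace, the `info` log level, and the Role name `read-app-secrets` are assumptions chosen for illustration; the Secret itself is deliberately not written out, in line with the advice not to hardcode secrets in manifests.

```yaml
# ConfigMap backing the LOG_LEVEL variable (value is an assumption)
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  log-level: "info"
---
# Role that allows reading only the app-secrets Secret, nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-app-secrets   # hypothetical name
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["app-secrets"]
    verbs: ["get"]
```

The Role takes effect only once a RoleBinding attaches it to a user, group, or service account.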

Workload Selection

Choose the appropriate workload type:
  • Deployment: Stateless applications (web servers, APIs, microservices)
  • StatefulSet: Stateful applications (databases, message queues)
  • DaemonSet: Node-level services (log collectors, monitoring agents)
  • Job/CronJob: Batch processing and scheduled tasks
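To make the last item concrete, here is a minimal sketch of a scheduled task. The name `nightly-report`, the schedule, and the image reference are all hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report            # hypothetical name
spec:
  schedule: "0 2 * * *"           # every day at 02:00
  concurrencyPolicy: Forbid       # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/report:1.4.2   # placeholder image
```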

Security Context

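Always implement security best practices. fsGroup is a pod-level field, while capabilities, readOnlyRootFilesystem, and allowPrivilegeEscalation are container-level fields, so the settings are split across the two securityContext blocks:
yaml
# Pod-level settings (spec.securityContext)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
containers:
  - name: app   # placeholder container name
    # Container-level settings
    securityContext:
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false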
Security checklist:
  • Run as non-root user
  • Drop all capabilities by default
  • Use read-only root filesystem
  • Disable privilege escalation
  • Implement network policies
  • Scan images for vulnerabilities
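For the network policy item, a common starting point is a default-deny rule for incoming traffic, sketched below. The `production` namespace is an assumption, and the policy is enforced only if the cluster's network plugin supports NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production   # assumed namespace
spec:
  podSelector: {}         # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress             # no ingress rules listed, so all inbound traffic is denied
```

Further policies can then allow specific traffic, since NetworkPolicy rules are additive.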

Health Checks

Implement all three probe types:
Liveness: Restart container if unhealthy
Readiness: Remove from service endpoints if not ready
Startup: Allow slow-starting containers time to initialize
yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

startupProbe:
  httpGet:
    path: /startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
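The timing these settings imply can be worked out from periodSeconds and failureThreshold. The annotated sketch below reuses the startup and liveness probes from the example; the explicit `failureThreshold: 3` and `timeoutSeconds: 2` on the liveness probe are illustrative additions, not values from the original.

```yaml
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 failures x 10 s = up to 300 s allowed for startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3    # assumed value: restart after about 30 s of consecutive failures
  timeoutSeconds: 2      # assumed value: each check fails if no response within 2 s
```

Liveness and readiness checks do not run until the startup probe has succeeded, so a long startup budget does not delay failure detection once the container is up.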

High Availability

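Replica counts: Set a minimum of 2 for production workloads
Pod Disruption Budgets: Maintain availability during voluntary disruptions such as node drains. Keep minAvailable below the replica count: with 2 replicas and minAvailable: 2, no pod can be evicted and drains will block, so the example below assumes at least 3 replicas.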
yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
Additional HA considerations:
  • Use anti-affinity rules for pod distribution across nodes
  • Configure graceful shutdown periods
  • Implement horizontal pod autoscaling
  • Set appropriate resource requests for scheduling
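The first three considerations might be expressed roughly as follows. The 30-second grace period, the replica bounds, and the 70% CPU target are illustrative assumptions, and the `web-app` Deployment is assumed to exist with the `app: web-app` label used in the PodDisruptionBudget example.

```yaml
# Fragment of a pod template: spread replicas across nodes, allow graceful shutdown
spec:
  terminationGracePeriodSeconds: 30
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web-app
            topologyKey: kubernetes.io/hostname
---
# Horizontal pod autoscaling on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Utilization is measured against the CPU request, which is one reason the requests need to be set sensibly.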

Namespace Organization

Use namespaces for environment isolation and apply resource quotas:
yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    persistentvolumeclaims: "10"
Benefits: Logical separation, resource limits, RBAC boundaries, cost tracking
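One practical consequence of a quota on requests.cpu and requests.memory is that pods which do not declare those requests are rejected in that namespace. A LimitRange can supply defaults; the sketch below is one way to do it, with the name `prod-defaults` and all values chosen as assumptions.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-defaults     # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:     # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:            # applied when a container omits limits
        memory: 256Mi
```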

Labels and Annotations

Use consistent, recommended labels:
yaml
metadata:
  labels:
    app.kubernetes.io/name: myapp
    app.kubernetes.io/instance: myapp-prod
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: ecommerce
    app.kubernetes.io/managed-by: helm

Service Types

  • ClusterIP: Internal cluster communication (default)
  • NodePort: External access via node ports (dev/test)
  • LoadBalancer: Cloud provider load balancer (production)
  • ExternalName: DNS CNAME record (external services)
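As an example of the default type, a ClusterIP Service for internal traffic could look like the sketch below. The name `web-app`, the selector label, and the port numbers are assumptions borrowed from the other examples in this guide.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  type: ClusterIP        # the default; shown explicitly for clarity
  selector:
    app: web-app         # must match the pod labels
  ports:
    - port: 80           # port exposed inside the cluster
      targetPort: 8080   # port the container listens on
```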

Storage

Choose appropriate storage class and access mode:
Access Modes:
  • ReadWriteOnce (RWO): Single node read-write
  • ReadOnlyMany (ROX): Multiple nodes read-only
  • ReadWriteMany (RWX): Multiple nodes read-write
yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
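A pod consumes the claim by referencing it as a volume. The fragment below is a sketch of a pod spec; the container name, image, volume name, and mount path are hypothetical.

```yaml
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /var/lib/app             # assumed path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data                   # the PVC defined above
```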

Validation and Testing

Always validate before applying to production:
  1. Client-side validation:
    kubectl apply --dry-run=client -f manifest.yaml
  2. Server-side validation:
    kubectl apply --dry-run=server -f manifest.yaml
  3. Test in staging: Deploy to non-production environment first
  4. Monitor metrics: Watch resource usage and application health
  5. Gradual rollout: Use rolling updates with health checks
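For step 5, the rollout behaviour of a Deployment is set in its strategy field. The fragment below is one conservative configuration, with the replica count and surge values given as assumptions.

```yaml
# Fragment of a Deployment spec
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # add one new pod at a time
```

With a readiness probe in place, each new pod must report ready before an old one is removed.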

Application Checklist

When creating or reviewing Kubernetes manifests:
  • Resource requests and limits configured
  • Specific image version pinned (not :latest)
  • Secrets and ConfigMaps used for configuration
  • Security context implemented (non-root, dropped capabilities)
  • Health checks configured (liveness, readiness, startup)
  • Pod Disruption Budget defined for HA workloads
  • Consistent labels applied
  • Appropriate workload type selected
  • Namespace and resource quotas configured
  • Validated with dry-run before applying