gcp-gke

Google Kubernetes Engine (GKE)

Deploy, operate, and scale managed Kubernetes clusters on Google Cloud Platform.

When to Use

  • Running containerized microservices at scale with automatic scaling and healing
  • Workloads requiring fine-grained orchestration, service mesh, or custom scheduling
  • Teams already invested in Kubernetes tooling (Helm, Argo CD, Flux)
  • When Cloud Run's request-based model does not fit (long-running, stateful workloads)

Prerequisites

  • Google Cloud SDK (`gcloud`) and `kubectl` installed
  • APIs enabled: Kubernetes Engine, Compute Engine
  • IAM role `roles/container.admin` for cluster management
bash
gcloud services enable container.googleapis.com compute.googleapis.com
gcloud components install kubectl
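
Before running the later commands, it can help to confirm that both CLIs are actually on the PATH. The helper below is a minimal sketch; the function name `check_tools` is hypothetical and not part of the Google Cloud SDK.

```shell
# Return 0 only if every named tool is on PATH; report each missing one on stderr.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical use at the top of a setup script:
#   check_tools gcloud kubectl || exit 1
```

The check only proves the binaries exist. It does not verify that `gcloud` is authenticated or that the APIs are enabled.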

Standard vs Autopilot

| Feature | Standard | Autopilot |
|---|---|---|
| Node management | You manage node pools | Google manages nodes |
| Pricing | Pay per node (VM) | Pay per pod resource request |
| GPU/TPU | Full support | Supported (with limits) |
| DaemonSets | Allowed | Restricted |
| Best for | Full control, specialized HW | Hands-off, cost-optimized |

Create a Standard Cluster

bash
gcloud container clusters create prod-cluster \
  --region=us-central1 --num-nodes=2 \
  --machine-type=e2-standard-4 --disk-size=100 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5 \
  --enable-autorepair --enable-autoupgrade \
  --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --enable-ip-alias --enable-network-policy \
  --enable-shielded-nodes \
  --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM,WORKLOAD \
  --labels=env=production,team=platform

gcloud container clusters get-credentials prod-cluster --region=us-central1
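
Because the cluster is created with `--region`, the `--num-nodes`, `--min-nodes`, and `--max-nodes` values apply per zone, not to the cluster as a whole. The sketch below works out the totals, assuming the region's default three-zone layout; `total_nodes` is a hypothetical helper.

```shell
# Total nodes = per-zone count x number of zones in the regional cluster.
total_nodes() {
  per_zone=$1
  zones=$2
  echo $(( per_zone * zones ))
}

zones=3   # assumption: default three-zone regional cluster
echo "initial: $(total_nodes 2 "$zones")"   # --num-nodes=2
echo "minimum: $(total_nodes 1 "$zones")"   # --min-nodes=1
echo "maximum: $(total_nodes 5 "$zones")"   # --max-nodes=5
```

Under that assumption the command above starts 6 nodes and can scale between 3 and 15, which matters when sizing quota and budget.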

Create an Autopilot Cluster

bash
gcloud container clusters create-auto autopilot-prod \
  --region=us-central1 --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --network=my-vpc --subnetwork=gke-subnet

Node Pools

bash
# High-memory pool with taint
gcloud container node-pools create highmem-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n2-highmem-8 --disk-size=200 --disk-type=pd-ssd \
  --num-nodes=1 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-labels=workload=memory-intensive \
  --node-taints=dedicated=highmem:NoSchedule

# GPU pool
gcloud container node-pools create gpu-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-taints=nvidia.com/gpu=present:NoSchedule

# Spot pool for batch workloads
gcloud container node-pools create spot-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=e2-standard-4 --spot \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=20 \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule
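
Each pool above carries a `NoSchedule` taint, so pods land there only if their spec includes a matching toleration. As a rough sketch, the hypothetical helper `toleration_for_taint` turns a `key=value:effect` taint string into the corresponding toleration stanza:

```shell
# Print a pod-spec toleration that matches a taint given as key=value:effect.
toleration_for_taint() {
  taint=$1
  key=${taint%%=*}       # text before the first "="
  rest=${taint#*=}       # text after the first "="
  value=${rest%%:*}      # text before the ":"
  effect=${rest##*:}     # text after the last ":"
  printf 'tolerations:\n'
  printf -- '- key: "%s"\n' "$key"
  printf '  operator: "Equal"\n'
  printf '  value: "%s"\n' "$value"
  printf '  effect: "%s"\n' "$effect"
}

toleration_for_taint "dedicated=highmem:NoSchedule"
```

A toleration only permits scheduling onto the tainted nodes. To steer a workload to the high-memory pool, pair it with a `nodeSelector` on the `workload=memory-intensive` label set above.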

Workload Identity

bash
# Create GSA and grant permissions
gcloud iam service-accounts create app-gsa
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:app-gsa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Create KSA and bind to GSA
kubectl create namespace myapp
kubectl create serviceaccount app-ksa --namespace=myapp
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[myapp/app-ksa]"
kubectl annotate serviceaccount app-ksa --namespace=myapp \
  iam.gke.io/gcp-service-account=app-gsa@${PROJECT_ID}.iam.gserviceaccount.com
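
The `--member` value in the binding has to name the exact namespace and KSA, and a typo there tends to go unnoticed until the pod is denied access. One way to reduce that risk is to build the string in one place. This is a sketch; `wi_member` and the project ID `my-project` are hypothetical.

```shell
# Build the Workload Identity member string for a namespace/KSA pair.
wi_member() {
  project=$1
  namespace=$2
  ksa=$3
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$project" "$namespace" "$ksa"
}

wi_member "my-project" "myapp" "app-ksa"
```

The output can be passed to `--member` in the `add-iam-policy-binding` call, so the namespace and KSA name come from the same variables used when creating them.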

Deploying Workloads

yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: myapp
spec:
  replicas: 3
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      serviceAccountName: app-ksa
      containers:
      - name: web
        image: us-central1-docker.pkg.dev/PROJECT_ID/repo/web-app:v1.2.0
        ports: [{ containerPort: 8080 }]
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits: { cpu: 500m, memory: 1Gi }
        readinessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 5
        livenessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 15
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app: web-app }
---
apiVersion: v1
kind: Service
metadata: { name: web-app, namespace: myapp }
spec:
  selector: { app: web-app }
  ports: [{ port: 80, targetPort: 8080 }]
  type: ClusterIP
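
The comparison table notes that Autopilot bills per pod resource request, so the `requests` values in this Deployment drive cost there. A back-of-the-envelope sketch of the total requested by the three replicas follows. `total_request` is a hypothetical helper, and the result ignores any adjustments Autopilot itself may apply to requests.

```shell
# Multiply a per-pod quantity (millicores or MiB) by the replica count.
total_request() {
  replicas=$1
  per_pod=$2
  echo $(( replicas * per_pod ))
}

echo "cpu: $(total_request 3 250)m"       # 3 replicas x 250m
echo "memory: $(total_request 3 512)Mi"   # 3 replicas x 512Mi
```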

Ingress with Managed SSL

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: myapp
  annotations:
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/managed-certificates: "web-cert"
    kubernetes.io/ingress.global-static-ip-name: "web-static-ip"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service: { name: web-app, port: { number: 80 } }
---
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata: { name: web-cert, namespace: myapp }
spec:
  domains: [app.example.com]
bash
gcloud compute addresses create web-static-ip --global
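
Managed certificates are provisioned asynchronously, so a deploy script usually needs to wait before treating HTTPS as ready. The generic polling helper below is a sketch; `wait_for_output` is a hypothetical name. The `kubectl` query in the usage comment assumes the ManagedCertificate reports its state in `.status.certificateStatus`, so check that against your cluster before relying on it.

```shell
# wait_for_output EXPECTED MAX_TRIES SLEEP_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until its stdout equals EXPECTED; return 1 if it never does.
wait_for_output() {
  expected=$1
  max_tries=$2
  pause=$3
  shift 3
  try=1
  while [ "$try" -le "$max_tries" ]; do
    current=$("$@" 2>/dev/null) || current=""
    if [ "$current" = "$expected" ]; then
      return 0
    fi
    try=$(( try + 1 ))
    if [ "$try" -le "$max_tries" ]; then
      sleep "$pause"
    fi
  done
  return 1
}

# Usage (field name is an assumption, see above):
#   wait_for_output Active 60 30 \
#     kubectl get managedcertificate web-cert --namespace=myapp \
#     -o jsonpath='{.status.certificateStatus}'
```

Provisioning also depends on the DNS record for `app.example.com` pointing at `web-static-ip`, so create the address and DNS record before applying the Ingress.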

Terraform Configuration

hcl
resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  release_channel { channel = "REGULAR" }
  workload_identity_config { workload_pool = "${var.project_id}.svc.id.goog" }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke.name

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  private_cluster_config {
    enable_private_nodes   = true
    master_ipv4_cidr_block = "172.16.0.0/28"
  }

  network_policy { enabled = true }
  logging_config { enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus { enabled = true }
  }

  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  initial_node_count = 2
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-balanced"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    metadata = { disable-legacy-endpoints = "true" }
  }
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = "us-central1"
  network       = google_compute_network.vpc.id

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
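
The secondary ranges cap how large the cluster can grow. As a rough sizing sketch, assuming GKE's default of 110 max pods per node (which reserves a /24 of the pod range for each node), the hypothetical helpers below estimate the ceilings implied by the `pods` and `services` ranges:

```shell
# Nodes supported by a pod range, assuming each node reserves a /24.
max_nodes_for_pod_range() {
  prefix=$1
  node_prefix=24   # assumption: default 110 max pods per node
  echo $(( 1 << (node_prefix - prefix) ))
}

# Number of addresses in a CIDR block with the given prefix length.
addresses_in_range() {
  echo $(( 1 << (32 - $1) ))
}

echo "max nodes: $(max_nodes_for_pod_range 14)"   # pods range 10.4.0.0/14
echo "service IPs: $(addresses_in_range 20)"      # services range 10.8.0.0/20
```

If the max-pods-per-node setting is lowered, each node reserves a smaller block and the same pod range supports more nodes.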

Common Operations

bash
gcloud container clusters list
gcloud container clusters upgrade prod-cluster --region=us-central1 --master
kubectl top nodes && kubectl top pods --namespace=myapp
kubectl scale deployment web-app --replicas=5 --namespace=myapp
kubectl autoscale deployment web-app --namespace=myapp --min=3 --max=20 --cpu-percent=70
kubectl logs -f deployment/web-app --namespace=myapp --all-containers
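
To get a feel for what `--cpu-percent=70` with `--min=3 --max=20` does, the core Horizontal Pod Autoscaler rule is: desired replicas = ceil(current replicas x observed utilization / target utilization), clamped to the min and max. The sketch below applies that rule with integer arithmetic. `hpa_desired` is a hypothetical helper, and it ignores the controller's tolerance band and stabilization windows.

```shell
# hpa_desired CURRENT_REPLICAS OBSERVED_PCT TARGET_PCT MIN MAX
hpa_desired() {
  current=$1
  observed=$2
  target=$3
  min=$4
  max=$5
  # Integer ceiling of current * observed / target.
  desired=$(( (current * observed + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then
    desired=$min
  fi
  if [ "$desired" -gt "$max" ]; then
    desired=$max
  fi
  echo "$desired"
}

hpa_desired 3 140 70 3 20   # 3 replicas at 140% of request
```

Utilization is measured against the container's CPU request, so the 250m request in the Deployment determines what 70% means.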

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Pods stuck in `Pending` | No nodes with enough resources | Check autoscaler; add larger node pool; verify resource requests |
| `ImagePullBackOff` | Wrong image path or missing AR access | Verify image URL; grant `roles/artifactregistry.reader` to node SA |
| Workload Identity wrong account | KSA annotation missing | Re-annotate KSA; restart pods to pick up new token |
| Nodes `NotReady` | Disk/memory pressure or network issue | Run `kubectl describe node`; check taints and conditions |
| Ingress returns 502 | Backend pods failing health check | Verify readiness probe; check NEG health in Console |
| Cluster create quota error | Insufficient regional CPU/IP quota | Request quota increase in IAM & Admin > Quotas |
| Network policy not working | Not enabled on cluster | Recreate with `--enable-network-policy` or use Dataplane V2 |
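
A quick first pass for the pod-level symptoms above is to list every pod that is not `Running` or `Completed`. The filter below is a sketch that reads the default `kubectl get pods` table from stdin. `unhealthy_pods` is a hypothetical helper, and it assumes the default column order (NAME, READY, STATUS, ...).

```shell
# Print "name status" for each pod whose STATUS is not Running or Completed.
# Expects the default `kubectl get pods` table (with header) on stdin.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
#   kubectl get pods --namespace=myapp | unhealthy_pods
```

Each pod it reports can then be inspected with `kubectl describe pod`, which shows the scheduling or image-pull events behind the status.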

Related Skills

  • gcp-networking - VPC, firewall rules, and load balancers for GKE clusters
  • terraform-gcp - Provision GKE clusters with Infrastructure as Code
  • gcp-compute - When workloads are better suited for VMs than containers
  • gcp-cloud-sql - Connecting GKE pods to Cloud SQL via sidecar proxy