gcp-gke
Google Kubernetes Engine (GKE)
Deploy, operate, and scale managed Kubernetes clusters on Google Cloud Platform.
When to Use
- Running containerized microservices at scale with automatic scaling and healing
- Workloads requiring fine-grained orchestration, service mesh, or custom scheduling
- Teams already invested in Kubernetes tooling (Helm, Argo CD, Flux)
- When Cloud Run's request-based model does not fit (long-running, stateful workloads)
Prerequisites
- Google Cloud SDK (`gcloud`) and `kubectl` installed
- APIs enabled: Kubernetes Engine, Compute Engine
- IAM role `roles/container.admin` for cluster management
bash
gcloud services enable container.googleapis.com compute.googleapis.com
gcloud components install kubectl
Standard vs Autopilot
| Feature | Standard | Autopilot |
|---|---|---|
| Node management | You manage node pools | Google manages nodes |
| Pricing | Pay per node (VM) | Pay per pod resource request |
| GPU/TPU | Full support | Supported (with limits) |
| DaemonSets | Allowed | Restricted |
| Best for | Full control, specialized HW | Hands-off, cost-optimized |
Create a Standard Cluster
bash
gcloud container clusters create prod-cluster \
  --region=us-central1 --num-nodes=2 \
  --machine-type=e2-standard-4 --disk-size=100 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5 \
  --enable-autorepair --enable-autoupgrade \
  --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --enable-ip-alias --enable-network-policy \
  --enable-shielded-nodes \
  --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM,WORKLOAD \
  --labels=env=production,team=platform
gcloud container clusters get-credentials prod-cluster --region=us-central1
Create an Autopilot Cluster
bash
gcloud container clusters create-auto autopilot-prod \
  --region=us-central1 --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --network=my-vpc --subnetwork=gke-subnet
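An Autopilot cluster has no node pools to create, so workloads request capacity through their pod spec instead. As a minimal sketch, the script below writes a Job manifest that asks for Spot capacity through the `cloud.google.com/gke-spot` node selector. The Job name, image, output path, and resource figures are placeholders chosen for illustration, not values from this guide.

```shell
#!/bin/sh
set -eu

# Hypothetical output path; change it to suit your repository layout.
manifest=/tmp/spot-batch-job.yaml

# Quoted heredoc delimiter: the manifest is written verbatim, with no shell expansion.
cat > "$manifest" <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-batch-demo            # placeholder name
spec:
  template:
    spec:
      restartPolicy: OnFailure     # retry if the Spot node is reclaimed
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      containers:
        - name: worker
          image: busybox:1.36      # placeholder image
          command: ["sh", "-c", "echo processing; sleep 30"]
          resources:
            requests: { cpu: 500m, memory: 512Mi }
EOF

echo "wrote $manifest"
# To submit it (requires credentials for the Autopilot cluster):
#   kubectl apply -f /tmp/spot-batch-job.yaml
```

Because Autopilot bills per pod resource request, the `requests` values in the manifest are what drive cost, so they are worth setting deliberately.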
Node Pools
bash
# High-memory pool with taint
gcloud container node-pools create highmem-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n2-highmem-8 --disk-size=200 --disk-type=pd-ssd \
  --num-nodes=1 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-labels=workload=memory-intensive \
  --node-taints=dedicated=highmem:NoSchedule
# GPU pool
gcloud container node-pools create gpu-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-taints=nvidia.com/gpu=present:NoSchedule
# Spot pool for batch workloads
gcloud container node-pools create spot-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=e2-standard-4 --spot \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=20 \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule
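A taint keeps ordinary pods off a pool, so a workload meant for that pool needs a matching toleration, and a node selector on the pool's label keeps it from landing elsewhere. The sketch below writes a Pod manifest targeting the `highmem-pool` defined above. The Pod name, image, output path, and memory request are illustrative assumptions.

```shell
#!/bin/sh
set -eu

# Hypothetical output path.
manifest=/tmp/highmem-pod.yaml

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: highmem-demo               # placeholder name
spec:
  nodeSelector:
    workload: memory-intensive     # matches --node-labels on highmem-pool
  tolerations:
    - key: dedicated               # matches --node-taints=dedicated=highmem:NoSchedule
      operator: Equal
      value: highmem
      effect: NoSchedule
  containers:
    - name: app
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
      resources:
        requests: { cpu: 500m, memory: 4Gi }
EOF

echo "wrote $manifest"
# To schedule it on the Standard cluster:
#   kubectl apply -f /tmp/highmem-pod.yaml
```

The same pattern should apply to the GPU and Spot pools by swapping in their taint keys and values.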
Workload Identity
bash
# Create GSA and grant permissions
gcloud iam service-accounts create app-gsa
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:app-gsa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
# Create KSA and bind to GSA
kubectl create namespace myapp
kubectl create serviceaccount app-ksa --namespace=myapp
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[myapp/app-ksa]"
kubectl annotate serviceaccount app-ksa --namespace=myapp \
  iam.gke.io/gcp-service-account=app-gsa@${PROJECT_ID}.iam.gserviceaccount.com
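One way to confirm the binding is to start a throwaway pod under `app-ksa` and ask which Google identity it sees. The sketch below writes such a Pod manifest. The Pod name and output path are hypothetical, and the `google/cloud-sdk:slim` image is an assumption chosen because it ships `gcloud`.

```shell
#!/bin/sh
set -eu

# Hypothetical output path.
manifest=/tmp/wi-test-pod.yaml

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: wi-test                    # placeholder name
  namespace: myapp
spec:
  serviceAccountName: app-ksa      # the KSA annotated above
  containers:
    - name: cloud-sdk
      image: google/cloud-sdk:slim # assumed image that includes gcloud
      command: ["sleep", "infinity"]
EOF

echo "wrote $manifest"
# Manual check against the cluster:
#   kubectl apply -f /tmp/wi-test-pod.yaml
#   kubectl exec --namespace=myapp wi-test -- gcloud auth list
#   kubectl delete pod wi-test --namespace=myapp
```

If the binding and annotation are in place, `gcloud auth list` inside the pod is expected to report the `app-gsa` service account rather than the node's default account.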
Deploying Workloads
yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: myapp
spec:
  replicas: 3
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      serviceAccountName: app-ksa
      containers:
        - name: web
          image: us-central1-docker.pkg.dev/PROJECT_ID/repo/web-app:v1.2.0
          ports: [{ containerPort: 8080 }]
          resources:
            requests: { cpu: 250m, memory: 512Mi }
            limits: { cpu: 500m, memory: 1Gi }
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 15
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: web-app }
---
apiVersion: v1
kind: Service
metadata: { name: web-app, namespace: myapp }
spec:
  selector: { app: web-app }
  ports: [{ port: 80, targetPort: 8080 }]
  type: ClusterIP
Ingress with Managed SSL
yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: myapp
  annotations:
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/managed-certificates: "web-cert"
    kubernetes.io/ingress.global-static-ip-name: "web-static-ip"
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: web-app, port: { number: 80 } }
---
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata: { name: web-cert, namespace: myapp }
spec:
  domains: [app.example.com]
bash
gcloud compute addresses create web-static-ip --global
Terraform Configuration
hcl
# Assumes google_compute_network.vpc and var.project_id are defined elsewhere.
resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"
  release_channel { channel = "REGULAR" }
  workload_identity_config { workload_pool = "${var.project_id}.svc.id.goog" }
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke.name
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
  private_cluster_config {
    enable_private_nodes   = true
    master_ipv4_cidr_block = "172.16.0.0/28"
  }
  network_policy { enabled = true }
  logging_config { enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus { enabled = true }
  }
  # Drop the default pool; the dedicated pool below replaces it.
  remove_default_node_pool = true
  initial_node_count       = 1
}
resource "google_container_node_pool" "primary" {
  name               = "primary-pool"
  cluster            = google_container_cluster.primary.name
  location           = "us-central1"
  initial_node_count = 2
  # HCL does not accept ";" separators, so multi-argument blocks span several lines.
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  management {
    auto_repair  = true
    auto_upgrade = true
  }
  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-balanced"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    metadata = { disable-legacy-endpoints = "true" }
  }
}
resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = "us-central1"
  network       = google_compute_network.vpc.id
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
Common Operations
bash
gcloud container clusters list
gcloud container clusters upgrade prod-cluster --region=us-central1 --master
kubectl top nodes && kubectl top pods --namespace=myapp
kubectl scale deployment web-app --replicas=5 --namespace=myapp
kubectl autoscale deployment web-app --namespace=myapp --min=3 --max=20 --cpu-percent=70
kubectl logs -f deployment/web-app --namespace=myapp --all-containers
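The `kubectl autoscale` command above creates a HorizontalPodAutoscaler imperatively. For teams that keep manifests in version control, a declarative equivalent may be preferable. The sketch below writes an `autoscaling/v2` manifest with the same bounds (3 to 20 replicas, 70% CPU). The output path and the choice to name the autoscaler after the Deployment are assumptions.

```shell
#!/bin/sh
set -eu

# Hypothetical output path.
manifest=/tmp/web-app-hpa.yaml

cat > "$manifest" <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
EOF

echo "wrote $manifest"
# To apply it:
#   kubectl apply -f /tmp/web-app-hpa.yaml
```

Utilization is measured against the container's CPU request, so the autoscaler only works for pods that declare one, as `web-app` does.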
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Pods stuck in `Pending` | No nodes with enough resources | Check autoscaler; add larger node pool; verify resource requests |
| `ImagePullBackOff` | Wrong image path or missing Artifact Registry access | Verify image URL; grant `roles/artifactregistry.reader` to the node service account |
| Workload Identity uses the wrong account | KSA annotation missing | Re-annotate KSA; restart pods to pick up new token |
| Nodes `NotReady` | Disk/memory pressure or network issue | Inspect node conditions, for example with `kubectl describe node <node-name>` |
| Ingress returns 502 | Backend pods failing health check | Verify readiness probe; check NEG health in Console |
| Cluster create quota error | Insufficient regional CPU/IP quota | Request quota increase in IAM & Admin > Quotas |
| Network policy not working | Not enabled on cluster | Recreate with `--enable-network-policy` |
Related Skills
- gcp-networking - VPC, firewall rules, and load balancers for GKE clusters
- terraform-gcp - Provision GKE clusters with Infrastructure as Code
- gcp-compute - When workloads are better suited for VMs than containers
- gcp-cloud-sql - Connecting GKE pods to Cloud SQL via sidecar proxy