gcp-gke-troubleshooting

GKE Troubleshooting

Purpose

Systematically diagnose and resolve common GKE issues. This skill provides structured debugging workflows, common causes, and proven solutions for the most frequent problems encountered in production deployments.

When to Use

Use this skill when you need to:
  • Debug pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff status
  • Troubleshoot networking issues (DNS failures, service connectivity)
  • Fix Cloud SQL connection problems or IAM authentication errors
  • Resolve Pub/Sub message processing issues
  • Investigate resource exhaustion or scheduling failures
  • Debug health probe failures
  • Diagnose application crashes or startup issues
Trigger phrases: "pod not starting", "CrashLoopBackOff", "debug GKE issue", "Cloud SQL connection failed", "Pub/Sub not working", "pod pending"

Quick Start

Quick diagnostic flow for any pod issue:
```bash
# 1. Check pod status
kubectl get pods -n wtr-supplier-charges

# 2. View detailed pod information
kubectl describe pod <pod-name> -n wtr-supplier-charges

# 3. Check logs
kubectl logs <pod-name> -n wtr-supplier-charges

# 4. Check previous logs if crashed
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# 5. Check events for scheduling issues
kubectl get events -n wtr-supplier-charges --sort-by='.lastTimestamp'

# 6. Check resource availability
kubectl top nodes
kubectl top pods -n wtr-supplier-charges
```
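
The six checks can be bundled into one function for repeated use. This is a minimal sketch, not part of the original skill: the function name `quick_diagnose`, the `KUBECTL` override variable, and the `--tail=100` limit are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs the quick diagnostic flow for a single pod.
# Set KUBECTL=echo to print the commands instead of executing them.
quick_diagnose() {
  local pod="${1:-}" ns="${2:-wtr-supplier-charges}"
  local kubectl="${KUBECTL:-kubectl}"
  if [ -z "$pod" ]; then
    echo "usage: quick_diagnose <pod-name> [namespace]" >&2
    return 2
  fi
  "$kubectl" get pods -n "$ns"
  "$kubectl" describe pod "$pod" -n "$ns"
  "$kubectl" logs "$pod" -n "$ns" --tail=100
  # --previous fails when the container has never restarted; that is not fatal here.
  "$kubectl" logs "$pod" -n "$ns" --previous --tail=100 || true
  "$kubectl" get events -n "$ns" --sort-by='.lastTimestamp'
  "$kubectl" top nodes
  "$kubectl" top pods -n "$ns"
}
```

Calling `KUBECTL=echo quick_diagnose <pod-name>` gives a dry run that lists the seven commands without touching the cluster.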

Instructions

Step 1: Identify the Pod Status

Understand what the pod status means:
```bash
kubectl get pods -n wtr-supplier-charges -o wide
```

| Status | Meaning | Action |
|---|---|---|
| Running | Pod is executing | Check logs if there are issues |
| Pending | Waiting to be scheduled | Check events, node resources |
| CrashLoopBackOff | App crashes repeatedly | Check logs, configuration |
| ImagePullBackOff | Can't pull image | Verify image, permissions |
| Completed | Pod ran successfully and exited | Normal for batch jobs |
| Error | Pod exited with error | Check logs |
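
The status table can be encoded as a lookup so that a pod listing is annotated with its next step. This is an illustrative sketch: `next_action` and `annotate_pods` are hypothetical names, and the parsing assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```bash
#!/usr/bin/env bash
# Hypothetical lookup mirroring the status table.
next_action() {
  case "$1" in
    Running)          echo "Check logs if there are issues" ;;
    Pending)          echo "Check events and node resources" ;;
    CrashLoopBackOff) echo "Check logs and configuration" ;;
    ImagePullBackOff) echo "Verify image and permissions" ;;
    Completed)        echo "Normal for batch jobs" ;;
    Error)            echo "Check logs" ;;
    *)                echo "Run kubectl describe pod for details" ;;
  esac
}

# Reads `kubectl get pods --no-headers` output on stdin and prints
# tab-separated: pod name, status, suggested action.
annotate_pods() {
  local name _ready status _rest
  while read -r name _ready status _rest; do
    [ -n "$name" ] || continue
    printf '%s\t%s\t%s\n' "$name" "$status" "$(next_action "$status")"
  done
}
```

A typical call would be `kubectl get pods -n wtr-supplier-charges --no-headers | annotate_pods`.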

Step 2: Investigate Based on Status

Pod Status: ImagePullBackOff

Diagnose:
```bash
# Get detailed error
kubectl describe pod <pod-name> -n wtr-supplier-charges

# Look for "Failed to pull image" in Events section
# Example: "Failed to pull image ... access denied"

# Check if image exists in registry
gcloud artifacts docker images list \
  europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images
```

**Solutions:**

1. **Image doesn't exist:**
```bash
# Verify image tag is correct
kubectl get deployment supplier-charges-hub -n wtr-supplier-charges \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

2. **Missing Artifact Registry permissions:**
```bash
# Grant Artifact Registry Reader role
gcloud artifacts repositories add-iam-policy-binding \
  wtr-supplier-charges-container-images \
  --location=europe-west2 \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```

3. **Private image registry authentication:**
```bash
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=europe-west2-docker.pkg.dev \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  -n wtr-supplier-charges
```
Add to deployment (in the pod template spec):
```yaml
spec:
  imagePullSecrets:
  - name: regcred
```
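
When checking that an image tag is correct, it helps to split the image reference into its parts before comparing it with the registry listing. The sketch below is illustrative only: `parse_ar_image` is a hypothetical helper, it assumes the `<location>-docker.pkg.dev/<project>/<repository>/<image>:<tag>` layout, and it deliberately rejects digest (`@sha256:...`) references.

```bash
#!/usr/bin/env bash
# Hypothetical helper: splits an Artifact Registry image reference into fields.
parse_ar_image() {
  local ref="$1" host rest project repository image tag
  case "$ref" in
    *@*) echo "digest references are not handled in this sketch" >&2; return 1 ;;
  esac
  case "$ref" in
    */*/*/*) ;;  # need host/project/repository/image at minimum
    *) echo "expected host/project/repository/image[:tag]" >&2; return 1 ;;
  esac
  host="${ref%%/*}"
  case "$host" in
    *-docker.pkg.dev) ;;
    *) echo "not an Artifact Registry host: $host" >&2; return 1 ;;
  esac
  rest="${ref#*/}"
  project="${rest%%/*}";    rest="${rest#*/}"
  repository="${rest%%/*}"; rest="${rest#*/}"
  case "$rest" in
    *:*) image="${rest%:*}"; tag="${rest##*:}" ;;
    *)   image="$rest";      tag="latest" ;;  # no tag given: the runtime pulls "latest"
  esac
  printf 'location=%s\nproject=%s\nrepository=%s\nimage=%s\ntag=%s\n' \
    "${host%-docker.pkg.dev}" "$project" "$repository" "$image" "$tag"
}
```

Feeding it the output of the `jsonpath` query for the deployment image makes a wrong location, repository, or tag easy to spot.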

Pod Status: CrashLoopBackOff

Diagnose:
```bash
# Check current logs
kubectl logs <pod-name> -n wtr-supplier-charges

# Check logs from previous container (if crashed)
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# Check liveness probe configuration
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Liveness"
```

**Common Causes:**

1. **Application exits immediately:**
```bash
# Check startup logs for Java/Spring Boot errors
kubectl logs <pod-name> -n wtr-supplier-charges | head -50

# Look for: ClassNotFoundException, ConfigurationException, connection errors
```

2. **Liveness probe fails too early:**
```bash
# Increase initialDelaySeconds from 20 to 60
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
```

3. **Out of memory:**
```bash
# Check memory usage
kubectl top pods <pod-name> -n wtr-supplier-charges

# Increase memory limits
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
```

4. **Missing environment variables:**
```bash
# Check what env vars are set
kubectl exec <pod-name> -n wtr-supplier-charges -- env | sort

# Verify ConfigMap/Secret values (Secret values are base64-encoded)
kubectl get configmap supplier-charges-hub-config -n wtr-supplier-charges -o yaml
kubectl get secret db-credentials -n wtr-supplier-charges -o yaml
```
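
To tell an out-of-memory kill apart from an ordinary application failure, the exit code of the last terminated container is a useful signal: codes above 128 mean the process was killed by signal `code - 128`. The helper below is a sketch with a hypothetical name, `describe_exit_code`; the wording of its messages is illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: interprets a container exit code.
# The code can be read from the pod status, for example:
#   kubectl get pod <pod-name> -n wtr-supplier-charges \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
describe_exit_code() {
  local code="$1"
  case "$code" in
    ''|*[!0-9]*) echo "not a numeric exit code: $code" >&2; return 1 ;;
    0)   echo "exit 0: container finished normally" ;;
    137) echo "exit 137: killed by signal 9 (SIGKILL); check for OOMKilled" ;;
    143) echo "exit 143: terminated by signal 15 (SIGTERM)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "exit $code: killed by signal $((code - 128))"
      else
        echo "exit $code: application error; check the previous container logs"
      fi
      ;;
  esac
}
```

An exit code of 137 together with the reason `OOMKilled` in `lastState.terminated.reason` points at the memory limit rather than at the application code.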

Pod Status: Pending (Unschedulable)

Diagnose:
```bash
# Check events for scheduling messages
kubectl describe pod <pod-name> -n wtr-supplier-charges

# Look for: "Insufficient memory", "Insufficient cpu", "PersistentVolumeClaim"

# Check node capacity
kubectl top nodes
kubectl describe nodes
```

**Solutions:**

1. **Insufficient cluster resources:**
```bash
# Scale deployment down
kubectl scale deployment supplier-charges-hub --replicas=1 -n wtr-supplier-charges

# Or trigger autoscaling (if available)
# GKE Autopilot automatically provisions capacity
```

2. **Node affinity/taints preventing scheduling:**
```bash
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# View pod's node affinity/tolerations (-E enables the | alternation)
kubectl get pod <pod-name> -n wtr-supplier-charges -o yaml | grep -E -A 10 -B 2 "affinity|toleration"
```
Add toleration to deployment if needed:
```yaml
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "compute"
    effect: "NoSchedule"
```

3. **PersistentVolumeClaim not bound:**
```bash
# Check PVC status
kubectl get pvc -n wtr-supplier-charges

# If Pending, check storage class
kubectl get storageclass
```
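
"Insufficient cpu" and "Insufficient memory" events mean the pod's resource requests exceed what any node has left, but requests and node allocatable values are often written in different units (`500m` versus `2`, `4Gi` versus `3786940Ki`). The converters below are a minimal sketch for comparing them; `cpu_to_millicores` and `mem_to_mib` are hypothetical names, and only the `Ki`, `Mi`, and `Gi` suffixes are handled.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: normalise Kubernetes resource quantities.

# CPU: "500m" -> 500, "2" -> 2000, "0.5" -> 500 (millicores)
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
  esac
}

# Memory: "4Gi" -> 4096, "512Mi" -> 512, "2097152Ki" -> 2048 (MiB, rounded down)
mem_to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported memory quantity: $1" >&2; return 1 ;;
  esac
}
```

With both values in the same unit, a plain integer comparison such as `[ "$(mem_to_mib 4Gi)" -le "$(mem_to_mib 3786940Ki)" ]` shows whether a request could ever fit on a node.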

Step 3: Network and Connectivity Issues

DNS Resolution Failures

Diagnose:
```bash
# Test DNS from pod
kubectl exec <pod-name> -n wtr-supplier-charges -- nslookup postgres

# Test connectivity to service
# (curl only confirms that the TCP connection opens; PostgreSQL does not speak HTTP)
kubectl exec <pod-name> -n wtr-supplier-charges -- curl -v http://postgres:5432
```

**Solutions:**

1. **CoreDNS pods not running:**
```bash
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Restart CoreDNS if needed
# (confirm the deployment name first; GKE clusters often run kube-dns rather than coredns)
kubectl rollout restart deployment coredns -n kube-system
```

2. **Service doesn't exist or wrong namespace:**
```bash
# Verify service exists
kubectl get svc postgres -n wtr-supplier-charges

# Use fully qualified DNS name if in different namespace
# service-name.namespace.svc.cluster.local
```
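
The fully qualified name follows a fixed pattern, so it can be built rather than typed. A small sketch, assuming the default cluster domain `cluster.local`; `svc_fqdn` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the fully qualified DNS name of a Service.
# The third argument overrides the cluster domain (assumed to be cluster.local).
svc_fqdn() {
  local service="$1" namespace="$2" domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}
```

For example, `kubectl exec <pod-name> -n wtr-supplier-charges -- nslookup "$(svc_fqdn postgres wtr-supplier-charges)"` tests resolution without relying on the pod's DNS search path.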

Service Not Accessible

Diagnose:
```bash
# Check service endpoints
kubectl get endpoints supplier-charges-hub -n wtr-supplier-charges

# If empty, no pods match the selector
kubectl get svc supplier-charges-hub -n wtr-supplier-charges -o yaml | grep -A 3 selector
kubectl get pods -n wtr-supplier-charges --show-labels
```

**Solutions:**

1. **Pod labels don't match service selector:**
```bash
# Add/update labels on deployment
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"metadata":{"labels":{"app":"supplier-charges-hub"}}}}}'
```

2. **Pods not in Ready state:**
```bash
# Check readiness probe
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Readiness"

# Check health endpoint
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  curl localhost:8080/actuator/health/readiness
```
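
A Service selects a pod only when every key/value pair in its selector appears in the pod's labels. The check can be sketched as below; `selector_matches` is a hypothetical helper that takes both sides as comma-separated `key=value` lists, which is the format shown in the SELECTOR column of `kubectl get svc -o wide` and the LABELS column of `kubectl get pods --show-labels`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when every selector pair is present in the labels.
# Runs in a subshell (parenthesised body) so the IFS and globbing changes stay local.
selector_matches() (
  selector="$1"
  labels=",$2,"
  [ -n "$selector" ] || exit 1   # an empty selector selects nothing in this sketch
  set -f                         # disable globbing while splitting on commas
  IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;            # this pair is present
      *) exit 1 ;;               # a missing pair means no match
    esac
  done
  exit 0
)
```

A pod may carry extra labels (such as `pod-template-hash`) and still match; only a missing or different selector pair leaves the endpoints list empty.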

Step 4: Database Connection Issues

Diagnose:
```bash
# Test connectivity to Cloud SQL Proxy
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv localhost 5432

# Check Cloud SQL Proxy logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check application startup logs for DB connection errors
kubectl logs <pod-name> -c supplier-charges-hub-container -n wtr-supplier-charges | grep -iE "database|connection"
```

**Solutions:**

1. **IAM Authentication fails:**
```bash
# Verify Workload Identity binding
kubectl get sa app-runtime -n wtr-supplier-charges -o yaml | grep iam.gke.io

# Grant cloudsql.client role
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/cloudsql.client"

# Check service account email format (must be {name}@{project}.iam)
```

2. **Wrong connection string:**
```bash
# Verify DB_CONNECTION_NAME format: project:region:instance
kubectl get configmap db-config -n wtr-supplier-charges -o yaml

# Should be something like: ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub
```

3. **Cloud SQL Proxy not running:**
```bash
# Check sidecar logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check sidecar resources
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 15 "cloud-sql-proxy"
```
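
Both format rules named above (`project:region:instance` for the connection name and `{name}@{project}.iam` for the service account) can be checked mechanically before digging into logs. The sketch below uses hypothetical helper names, and the connection-name check only validates the three-part shape, not whether the instance exists.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the two format checks.

# Succeeds when the value has exactly three non-empty, colon-separated parts.
valid_connection_name() {
  printf '%s\n' "$1" | grep -Eq '^[^:[:space:]]+:[^:[:space:]]+:[^:[:space:]]+$'
}

# Derives the {name}@{project}.iam form from a full service account email
# by dropping the ".gserviceaccount.com" suffix.
iam_db_user() {
  printf '%s\n' "${1%.gserviceaccount.com}"
}
```

For instance, `valid_connection_name "$DB_CONNECTION_NAME" || echo "bad connection name"` catches a missing region or project segment early.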

Step 5: Pub/Sub Issues

Diagnose:
```bash
# Check subscription configuration
# (the backlog size itself is reported by the Cloud Monitoring metric
#  pubsub.googleapis.com/subscription/num_undelivered_messages)
gcloud pubsub subscriptions describe supplier-charges-incoming-sub \
  --project=ecp-wtr-supplier-charges-labs

# Check application Pub/Sub logs
kubectl logs <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -iE "pubsub|subscription"

# Test Pub/Sub connectivity from pod (requires gcloud in the container image)
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  gcloud pubsub topics list --project=ecp-wtr-supplier-charges-labs
```

**Solutions:**

1. **Missing Pub/Sub permissions:**
```bash
# Grant Pub/Sub roles
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/pubsub.subscriber"
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/pubsub.publisher"
```

2. **High subscription backlog (messages not being consumed):**
```bash
# Check if pod is running
kubectl get pods -n wtr-supplier-charges

# Check application logs for processing errors
kubectl logs -f <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -iE "error|exception"

# Increase message processing timeout
# In application.yaml:
# spring.cloud.gcp.pubsub.subscriber.max-ack-extension-period: 600
```

3. **Message processing failures:**
  • Check for poison messages (messages that cause repeated failures)
  • Review the DLQ (dead-letter queue) if configured
  • Implement retry logic with exponential backoff
  • See the Spring Cloud GCP documentation for retry configuration
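
Retry with exponential backoff means waiting twice as long after each failed attempt, up to a fixed number of attempts. In the application this would normally be configured through Spring Cloud GCP rather than hand-written; the shell function below is only a sketch of the technique, with a hypothetical name and argument order, and is handy for wrapping flaky `gcloud` or `kubectl` calls during debugging.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry_with_backoff <max-attempts> <initial-delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up,
# doubling the delay after every failure.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

With `retry_with_backoff 5 2 <command>`, the waits between attempts are 2, 4, 8, and 16 seconds. A message that still fails after the final attempt is a candidate for the dead-letter queue.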

Examples

See examples/examples.md for comprehensive examples including:
  • Complete troubleshooting workflow
  • Database connectivity debugging
  • Pub/Sub debugging

Requirements

  • kubectl access to the cluster
  • gcloud CLI configured
  • Permissions to view pod logs and describe resources
  • For database debugging: access to view Cloud SQL configuration
  • For Pub/Sub debugging: access to view subscription details

See Also

  • gcp-gke-deployment-strategies - Understand deployment health checks
  • gcp-gke-monitoring-observability - Monitor applications
  • gcp-gke-workload-identity - Debug IAM/Workload Identity issues