kubernetes-troubleshooting

Kubernetes Troubleshooting

Systematic approach to debugging Kubernetes issues.

When to Use This Skill

Pod stuck in CrashLoopBackOff
OOMKilled errors
ImagePullBackOff failures
Pod not starting or scheduling
Service connectivity issues
Resource constraint problems

Quick Diagnostic Commands

Start with these commands to understand the current state:

bash

# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
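
Note that `grep -v Running` also lists Completed pods and hides pods that are Running but not Ready. The filter below is a minimal sketch of a stricter check. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE), and the function name `filter_unhealthy` is illustrative rather than part of any tool.

```shell
# Keep the header plus any pod that is not Running/Completed,
# or that is Running with fewer ready containers than expected.
# Assumes the default `kubectl get pods -A` columns:
# NAMESPACE NAME READY STATUS RESTARTS AGE
filter_unhealthy() {
  awk '
    NR == 1 { print; next }            # pass the header through
    $4 == "Completed" { next }         # finished pods are skipped
    {
      split($3, ready, "/")            # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -A | filter_unhealthy
```

Because the function only reads standard input, it can also be run against saved `kubectl` output when the cluster is not reachable.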

Pod Debugging Workflow

Step 1: Check Pod Status

bash

kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

Look for:

Status: What state is the pod in?
Conditions: Ready, ContainersReady, PodScheduled
Events: Recent events at the bottom of describe output

Step 2: Identify the Problem Category

Symptom	Likely Cause	Go To Section
Pending	Scheduling issue	Scheduling Issues
CrashLoopBackOff	Application crash	CrashLoopBackOff
ImagePullBackOff	Image/registry issue	Image Pull Issues
OOMKilled	Memory exhaustion	OOMKilled
Running but not Ready	Health check failing	Readiness Issues
Error	Container error	Container Errors
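
The table above can be expressed as a small lookup, which is handy inside a triage script. This is only a sketch: `next_section` is a hypothetical helper name, and "Running but not Ready" is left out because it is not a single STATUS value.

```shell
# Map a pod STATUS value to the section of this guide to read next.
# The function name is illustrative; the mapping mirrors the table above.
next_section() {
  case "$1" in
    Pending)                        echo "Scheduling Issues" ;;
    CrashLoopBackOff)               echo "CrashLoopBackOff" ;;
    ImagePullBackOff|ErrImagePull)  echo "Image Pull Issues" ;;
    OOMKilled)                      echo "OOMKilled" ;;
    Error)                          echo "Container Errors" ;;
    *)                              echo "Unknown status: check describe output" ;;
  esac
}

# Usage (requires cluster access); STATUS is the third column here:
#   next_section "$(kubectl get pod <pod-name> -n <namespace> --no-headers | awk '{print $3}')"
```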

Common Issues

Scheduling Issues

Pod stuck in Pending state.

Diagnostic:

bash

kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

Common Causes:

Event Message	Cause	Fix
Insufficient cpu/memory	Not enough resources	Add nodes or reduce requests
node(s) had taints	Node taints	Add tolerations or remove taints
no nodes available	No matching nodes	Check node selector/affinity
persistentvolumeclaim not found	PVC missing	Create the PVC

Fix Resource Issues:

bash

# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod> -o yaml | grep -A 10 resources

---

CrashLoopBackOff

Container keeps crashing and restarting.

Diagnostic:

bash

# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"


**Common Exit Codes**:

| Exit Code | Meaning           | Common Cause                                        |
| --------- | ----------------- | --------------------------------------------------- |
| 0         | Success           | Process completed (might be wrong for long-running) |
| 1         | Application error | Check application logs                              |
| 137       | SIGKILL           | Usually memory limit exceeded (OOM kill)            |
| 139       | SIGSEGV           | Segmentation fault                                  |
| 143       | SIGTERM           | Graceful termination                                |
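
Exit codes above 128 conventionally mean the process was terminated by signal number (code minus 128), which is where 137, 139, and 143 in the table come from. The helper below is a minimal sketch of that rule; `explain_exit_code` is an illustrative name, not an existing tool.

```shell
# Decode a container exit code. Codes above 128 conventionally mean
# "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by SIGKILL (signal 9), often OOM" ;;
      11) echo "killed by SIGSEGV (signal 11)" ;;
      15) echo "killed by SIGTERM (signal 15)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit $code)"
  fi
}

# Usage (requires cluster access); reads the first container's last exit code:
#   explain_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```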

**Common Fixes**:

- Check application logs for startup errors
- Verify environment variables and secrets
- Check if dependencies are available
- Verify resource limits aren't too restrictive

---

Image Pull Issues

ImagePullBackOff or ErrImagePull.

Diagnostic:

bash

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events

Common Causes:

Error	Cause	Fix
repository does not exist	Wrong image name	Fix image name/tag
unauthorized	Auth failure	Check imagePullSecrets
manifest unknown	Tag doesn't exist	Verify tag exists
connection refused	Registry unreachable	Check network/firewall

Fix Registry Auth:

bash

# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

yaml

# Reference in pod spec
spec:
  imagePullSecrets:
    - name: regcred
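
When authentication still fails after the secret exists, it can help to confirm what the secret actually contains. The sketch below decodes the `.dockerconfigjson` key that `kubectl create secret docker-registry` writes. The function name `show_pull_secret` is illustrative, and `base64 -d` assumes GNU coreutils. The output includes credentials, so avoid running it in shared terminals or logs.

```shell
# Print the decoded Docker config stored in an image pull secret.
# Arguments: secret name, namespace.
# `base64 -d` is the GNU coreutils spelling; other platforms may differ.
show_pull_secret() {
  kubectl get secret "$1" -n "$2" \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
}

# Usage (requires cluster access):
#   show_pull_secret regcred <namespace>
```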

---

OOMKilled

Container killed due to memory exhaustion.

Diagnostic:

bash

kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState

Fix Options:

Increase memory limit (if available):

yaml

resources:
  limits:
    memory: '512Mi' # Increase this
  requests:
    memory: '256Mi'

Profile memory usage:

bash

kubectl top pod <pod-name> -n <namespace> --containers

Check for memory leaks in application code
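
To confirm an OOM kill without grepping free-form `describe` output, the termination reason can be read directly from the pod status. This is a minimal sketch that assumes the container of interest is the first one in the pod; both function names are illustrative.

```shell
# Print the reason the first container last terminated (e.g. OOMKilled).
# Arguments: pod name, namespace.
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Succeed only if that reason is OOMKilled.
was_oom_killed() {
  [ "$(last_termination_reason "$1" "$2")" = "OOMKilled" ]
}

# Usage (requires cluster access):
#   was_oom_killed <pod-name> <namespace> && echo "raise the memory limit or look for a leak"
```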

Readiness Issues

Pod is Running but not Ready.

Diagnostic:

bash

# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health

**Common Causes**:

- Application not listening on expected port
- Readiness endpoint returning non-200
- Probe timeout too short
- Dependencies not available
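
Before changing the probe, it can be useful to check what the API server reports for the pod's Ready condition, for example in a retry loop after a fix. A minimal sketch follows; `pod_is_ready` is a hypothetical helper name.

```shell
# Succeed if the pod's Ready condition is "True".
# Arguments: pod name, namespace.
pod_is_ready() {
  status=$(kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}

# Usage (requires cluster access):
#   pod_is_ready <pod-name> <namespace> || echo "still not ready"
```

For a one-off blocking check, `kubectl wait --for=condition=Ready pod/<pod-name> -n <namespace> --timeout=60s` covers the same ground without a helper.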

**Fix Readiness Probe**:

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3
```

Container Errors

Diagnostic:

bash

# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

---

Networking Troubleshooting

Service Not Reachable

bash

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from another pod
kubectl run debug --rm -it --image=busybox -- wget -qO- <service>:<port>
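
An empty endpoints list is the most common sign that the Service selector does not match any ready pod. The check can be scripted as in the sketch below, which reads the endpoint IPs via JSONPath. The function names are illustrative.

```shell
# Print the pod IPs currently backing a Service. Empty output means
# no ready pod matches the Service selector.
# Arguments: service name, namespace.
service_endpoint_ips() {
  kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Succeed if the Service has at least one ready endpoint.
service_has_endpoints() {
  [ -n "$(service_endpoint_ips "$1" "$2")" ]
}

# Usage (requires cluster access):
#   service_has_endpoints <service-name> <namespace> || echo "selector matches no ready pods"
```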

DNS Issues

bash

# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

---

Resource Analysis

Node Pressure

bash

# Check node conditions
kubectl describe nodes | grep -A 5 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20

PVC Issues

bash

# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>

---

Quick Reference Commands

bash

# Pod debugging
kubectl logs <pod> -n <ns>                    # Current logs
kubectl logs <pod> -n <ns> --previous         # Previous container logs
kubectl logs <pod> -n <ns> -c <container>     # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f      # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh     # Shell into container
kubectl exec <pod> -n <ns> -- env             # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/hosts  # Check static host entries

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml         # Full pod spec
kubectl describe pod <pod> -n <ns>            # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running         # Non-running pods
kubectl top pods -A --sort-by=cpu             # CPU usage
kubectl top pods -A --sort-by=memory          # Memory usage

Additional Resources

Error Message Decoder
kubectl Cheat Sheet
