kubernetes-troubleshooting

Kubernetes Troubleshooting

Systematic approach to debugging Kubernetes issues.

When to Use This Skill

  • Pod stuck in CrashLoopBackOff
  • OOMKilled errors
  • ImagePullBackOff failures
  • Pod not starting or scheduling
  • Service connectivity issues
  • Resource constraint problems

Quick Diagnostic Commands

Start with these commands to understand the current state:

```bash
# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
```
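
Note that `grep -v Running` still prints the header row, lists finished Job pods (`Completed`), and hides pods that are `Running` but not Ready. A stricter filter could look like the sketch below. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...), and `unhealthy_pods` is a hypothetical helper name, not a kubectl feature.

```bash
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header plus every pod that is not fully ready.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # keep the header row
    $4 == "Completed" { next }         # finished Job pods are not a problem
    {
      split($3, ready, "/")            # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage against a live cluster:
#   kubectl get pods -A | unhealthy_pods
```

If the output was produced with custom columns or `-o wide` variants that reorder fields, the field numbers in the sketch would need adjusting.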

Pod Debugging Workflow

Step 1: Check Pod Status

```bash
kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
```

Look for:

  • Status: What state is the pod in?
  • Conditions: Ready, ContainersReady, PodScheduled
  • Events: Recent events at the bottom of the describe output

Step 2: Identify the Problem Category

| Symptom               | Likely Cause         | Go To Section     |
| --------------------- | -------------------- | ----------------- |
| Pending               | Scheduling issue     | Scheduling Issues |
| CrashLoopBackOff      | Application crash    | CrashLoopBackOff  |
| ImagePullBackOff      | Image/registry issue | Image Pull Issues |
| OOMKilled             | Memory exhaustion    | OOMKilled         |
| Running but not Ready | Health check failing | Readiness Issues  |
| Error                 | Container error      | Container Errors  |

Common Issues

Scheduling Issues

Pod stuck in Pending state.

Diagnostic:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
```

Common Causes:

| Event Message                   | Cause                | Fix                              |
| ------------------------------- | -------------------- | -------------------------------- |
| Insufficient cpu/memory         | Not enough resources | Add nodes or reduce requests     |
| node(s) had taints              | Node taints          | Add tolerations or remove taints |
| no nodes available              | No matching nodes    | Check node selector/affinity     |
| persistentvolumeclaim not found | PVC missing          | Create the PVC                   |

Fix Resource Issues:

```bash
# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod> -n <namespace> -o yaml | grep -A 10 resources
```
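
The causes table above can be turned into a small lookup, sketched here as a shell function. `scheduling_hint` is a hypothetical helper, and the patterns are loose substring matches chosen for illustration; the exact wording of scheduler events varies between Kubernetes versions, so treat a miss as a prompt to read the event text rather than as a clean bill of health.

```bash
# Hypothetical helper: maps a FailedScheduling event message to the fix
# suggested in the causes table. Patterns are loose substring matches.
scheduling_hint() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "Add nodes or reduce requests" ;;
    *taint*)
      echo "Add tolerations or remove taints" ;;
    *persistentvolumeclaim*"not found"*)
      echo "Create the PVC" ;;
    *"no nodes available"*|*"node affinity"*|*"node selector"*)
      echo "Check node selector/affinity" ;;
    *)
      echo "No known pattern; read the full event text" ;;
  esac
}

# Usage: paste the message from the Events section of `kubectl describe pod`.
#   scheduling_hint '0/3 nodes are available: 3 Insufficient cpu.'
```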

---

CrashLoopBackOff

Container keeps crashing and restarting.

Diagnostic:

```bash
# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"
```

**Common Exit Codes**:

| Exit Code | Meaning             | Common Cause                                               |
| --------- | ------------------- | ---------------------------------------------------------- |
| 0         | Success             | Process exited cleanly (unexpected for a long-running app) |
| 1         | Application error   | Check application logs                                     |
| 137       | SIGKILL (often OOM) | Memory limit exceeded, or the container was force-killed   |
| 139       | SIGSEGV             | Segmentation fault                                         |
| 143       | SIGTERM             | Graceful termination                                       |

**Common Fixes**:

- Check application logs for startup errors
- Verify environment variables and secrets
- Check if dependencies are available
- Verify resource limits aren't too restrictive
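
The codes above 128 in the table follow the usual convention of 128 plus the signal number (137 = 128 + 9, 139 = 128 + 11, 143 = 128 + 15). A minimal sketch of that arithmetic, using a hypothetical helper named `decode_exit_code` that only names the three signals listed in the table:

```bash
# Hypothetical helper: explains a container exit code.
# Codes above 128 mean "terminated by signal (code - 128)".
decode_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="see: kill -l $sig" ;;   # not in the table; look it up locally
    esac
    echo "killed by signal $sig ($name)"
  else
    echo "application error (exit $code)"
  fi
}

# Usage:
#   decode_exit_code 137
```

The code itself can be read from the pod status, for example with `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` (index 0 assumes a single-container pod).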

---

Image Pull Issues

ImagePullBackOff or ErrImagePull.

Diagnostic:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events
```

Common Causes:

| Error                     | Cause                | Fix                    |
| ------------------------- | -------------------- | ---------------------- |
| repository does not exist | Wrong image name     | Fix image name/tag     |
| unauthorized              | Auth failure         | Check imagePullSecrets |
| manifest unknown          | Tag doesn't exist    | Verify tag exists      |
| connection refused        | Registry unreachable | Check network/firewall |

Fix Registry Auth:

```bash
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
```

Reference in pod spec:

```yaml
spec:
  imagePullSecrets:
    - name: regcred
```

---

OOMKilled

Container killed due to memory exhaustion.

Diagnostic:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState
```

Fix Options:

  1. Increase the memory limit (if the node has capacity to spare):

```yaml
resources:
  limits:
    memory: '512Mi' # Increase this
  requests:
    memory: '256Mi'
```

  2. Profile memory usage:

```bash
kubectl top pod <pod-name> -n <namespace> --containers
```

  3. Check for memory leaks in application code
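
To judge how close a container runs to its limit, the memory figure from `kubectl top` can be compared with the configured limit. The sketch below does that arithmetic with two hypothetical helpers, `to_mib` and `mem_pct`. It assumes whole-number quantities with a `Mi` or `Gi` suffix only; other Kubernetes quantity formats (for example `M`, `G`, or fractional values) are deliberately not handled.

```bash
# Hypothetical helper: converts a quantity such as 512Mi or 2Gi to MiB.
# Only whole numbers with the Mi and Gi suffixes are supported.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: prints usage as an integer percentage of the limit
# (rounded down).
mem_pct() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# Usage: first argument from `kubectl top pod`, second from the pod spec.
#   mem_pct 450Mi 512Mi
```

A container that sits near 100% under normal load is a candidate for a higher limit; one that climbs steadily over time points more towards a leak.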

Readiness Issues

Pod is Running but not Ready.

Diagnostic:

```bash
# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health
```

**Common Causes**:

- Application not listening on expected port
- Readiness endpoint returning non-200
- Probe timeout too short
- Dependencies not available

**Fix Readiness Probe**:

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3
```

Container Errors

Diagnostic:

```bash
# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>
```

---

Networking Troubleshooting

Service Not Reachable

```bash
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from another pod
kubectl run debug --rm -it --image=busybox -- wget -qO- <service>:<port>
```
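
An empty endpoints list usually means the selector does not match any pod's labels. Comparing the two by eye is error-prone, so the check can be sketched as a function. `selector_matches` is a hypothetical helper; it assumes both arguments are comma-separated `key=value` lists, which is how `--show-labels` prints labels, and that the selector has been written out in the same form by hand.

```bash
# Hypothetical helper: succeeds when every key=value pair in the selector
# (first argument) also appears in the pod's labels (second argument).
selector_matches() {
  selector=$1
  labels=",$2,"                 # wrap in commas so whole pairs are matched
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do     # split the selector on commas
    case "$labels" in
      *",$pair,"*) ;;           # pair present, keep checking
      *) IFS=$old_ifs; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage:
#   selector_matches "app=web" "app=web,pod-template-hash=abc123" && echo match
```

On a live cluster the same question can be asked directly with `kubectl get pods -n <namespace> -l <key>=<value>`, which lists only the pods a selector would pick up.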

DNS Issues

```bash
# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
```

---

Resource Analysis

Node Pressure

```bash
# Check node conditions
kubectl describe nodes | grep -A 5 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20
```

PVC Issues

```bash
# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>
```

---

Quick Reference Commands

```bash
# Pod debugging
kubectl logs <pod> -n <ns>                         # Current logs
kubectl logs <pod> -n <ns> --previous              # Previous container logs
kubectl logs <pod> -n <ns> -c <container>          # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f           # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh          # Shell into container
kubectl exec <pod> -n <ns> -- env                  # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf # Check DNS config

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml              # Full pod spec
kubectl describe pod <pod> -n <ns>                 # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running              # Non-running pods
kubectl top pods -A --sort-by=cpu                  # CPU usage
kubectl top pods -A --sort-by=memory               # Memory usage
```

Additional Resources

  • Error Message Decoder
  • kubectl Cheat Sheet