kubernetes-troubleshooting
Kubernetes Troubleshooting
Systematic approach to debugging Kubernetes issues.
When to Use This Skill
- Pod stuck in CrashLoopBackOff
- OOMKilled errors
- ImagePullBackOff failures
- Pod not starting or scheduling
- Service connectivity issues
- Resource constraint problems
Quick Diagnostic Commands
Start with these commands to understand the current state:

```bash
# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
```
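
Note that `grep -v Running` still lists Completed pods and hides pods that are Running but not Ready. A minimal sketch of a stricter filter follows; it assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS), and the function name `unhealthy_pods` is hypothetical:

```bash
# List pods that are neither Completed nor fully ready.
# Assumes default -A columns: NAMESPACE NAME READY STATUS ...
unhealthy_pods() {
  kubectl get pods -A --no-headers | awk '
    $4 == "Completed" { next }               # finished Jobs are not a problem
    {
      split($3, r, "/")                      # READY looks like "1/2"
      if ($4 != "Running" || r[1] != r[2])   # bad status, or not all containers ready
        print $1 "/" $2 "\t" $3 "\t" $4
    }'
}
```

Calling `unhealthy_pods` prints one `namespace/name`, READY, STATUS line per suspect pod.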
Pod Debugging Workflow
Step 1: Check Pod Status

```bash
kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
```

Look for:
- Status: What state is the pod in?
- Conditions: Ready, ContainersReady, PodScheduled
- Events: Recent events at the bottom of describe output
Step 2: Identify the Problem Category
| Symptom | Likely Cause | Go To Section |
|---|---|---|
| Pending | Scheduling issue | Scheduling Issues |
| CrashLoopBackOff | Application crash | CrashLoopBackOff |
| ImagePullBackOff | Image/registry issue | Image Pull Issues |
| OOMKilled | Memory exhaustion | OOMKilled |
| Running but not Ready | Health check failing | Readiness Issues |
| Error | Container error | Container Errors |
Common Issues
Scheduling Issues
Pod stuck in Pending state.

Diagnostic:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
```

Common Causes:

| Event Message | Cause | Fix |
|---|---|---|
| Insufficient cpu/memory | Not enough resources | Add nodes or reduce requests |
| node(s) had taints | Node taints | Add tolerations or remove taints |
| no nodes available | No matching nodes | Check node selector/affinity |
| persistentvolumeclaim not found | PVC missing | Create the PVC |

Fix Resource Issues:

```bash
# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod> -o yaml | grep -A 10 resources
```
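
The scheduler reports why it could not place a pod in a `FailedScheduling` event. As a sketch, the events for one pod can be isolated with field selectors instead of grepping `describe` output; the function name `scheduling_events` is hypothetical:

```bash
# Show only the FailedScheduling events recorded for one pod.
# Usage: scheduling_events <pod-name> <namespace>
scheduling_events() {
  kubectl get events -n "$2" \
    --field-selector "reason=FailedScheduling,involvedObject.name=$1" \
    --sort-by='.lastTimestamp'
}
```

The message column of the result should correspond to one of the event messages in the table above.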
---

CrashLoopBackOff
Container keeps crashing and restarting.

Diagnostic:

```bash
# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"
```

**Common Exit Codes**:
| Exit Code | Meaning | Common Cause |
| --------- | ----------------- | --------------------------------------------------- |
| 0 | Success | Process completed (might be wrong for long-running) |
| 1 | Application error | Check application logs |
| 137 | SIGKILL (128 + 9) | Memory limit exceeded (OOMKilled) or forced kill |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful termination |
**Common Fixes**:
- Check application logs for startup errors
- Verify environment variables and secrets
- Check if dependencies are available
- Verify resource limits aren't too restrictive
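
Exit codes above 128 follow the convention 128 + signal number, which is why 137, 139, and 143 map to SIGKILL (9), SIGSEGV (11), and SIGTERM (15). A small sketch of that arithmetic, with a hypothetical helper name `decode_exit_code`:

```bash
# Translate a container exit code into a short description.
# Codes above 128 mean "terminated by signal (code - 128)".
decode_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal" ;;
    esac
    echo "killed by $name ($sig)"
  else
    echo "exited with error code $code"
  fi
}
```

For example, `decode_exit_code 137` prints `killed by SIGKILL (9)`.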
---

Image Pull Issues
ImagePullBackOff or ErrImagePull.

Diagnostic:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events
```

Common Causes:

| Error | Cause | Fix |
|---|---|---|
| repository does not exist | Wrong image name | Fix image name/tag |
| unauthorized | Auth failure | Check imagePullSecrets |
| manifest unknown | Tag doesn't exist | Verify tag exists |
| connection refused | Registry unreachable | Check network/firewall |

Fix Registry Auth:

```bash
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
```

Reference in pod spec:

```yaml
spec:
  imagePullSecrets:
    - name: regcred
```
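
As an alternative to editing every pod spec, the secret can be attached to a service account so that pods using that account pick it up. The sketch below assumes the pods run under the `default` service account and that the secret is named `regcred` as above; the function name `attach_pull_secret` is hypothetical:

```bash
# Attach an image pull secret to a service account.
# Usage: attach_pull_secret <namespace> [secret-name] [service-account]
attach_pull_secret() {
  ns="$1"
  secret="${2:-regcred}"   # defaults to the secret name used above
  sa="${3:-default}"       # assumption: pods use the default service account
  kubectl patch serviceaccount "$sa" -n "$ns" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$secret\"}]}"
}
```

Check the service account's existing `imagePullSecrets` first, since a patch like this may replace the current list. Only pods created after the change are affected.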
---

OOMKilled
Container killed due to memory exhaustion.

Diagnostic:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState
```

Fix Options:

- Increase memory limit (if available):

  ```yaml
  resources:
    limits:
      memory: '512Mi' # Increase this
    requests:
      memory: '256Mi'
  ```

- Profile memory usage:

  ```bash
  kubectl top pod <pod-name> -n <namespace> --containers
  ```

- Check for memory leaks in application code
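
Grepping YAML works for a quick look, but the same `lastState` fields can be read directly with JSONPath. A minimal sketch, with hypothetical function names `last_termination` and `was_oomkilled`:

```bash
# Print each container's name, last termination reason, and exit code (tab-separated).
# Usage: last_termination <pod-name> <namespace>
last_termination() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Succeed (exit 0) if any container was last terminated with reason OOMKilled.
was_oomkilled() {
  last_termination "$1" "$2" |
    awk -F'\t' '$2 == "OOMKilled" { found = 1 } END { exit !found }'
}
```

Usage might look like `was_oomkilled <pod-name> <namespace> && echo "raise the memory limit or look for a leak"`.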
Readiness Issues
Pod is Running but not Ready.

Diagnostic:

```bash
# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health
```

**Common Causes**:
- Application not listening on expected port
- Readiness endpoint returning non-200
- Probe timeout too short
- Dependencies not available

**Fix Readiness Probe**:

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3
```

Container Errors
Diagnostic:

```bash
# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>
```

---

Networking Troubleshooting
Service Not Reachable

```bash
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from another pod
kubectl run debug --rm -it --image=busybox -- wget -qO- <service>:<port>
```
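
Comparing the selector and the pod labels by eye is error-prone. The sketch below turns the Service's selector into a label query and lists the pods it matches, so an empty result points to a selector/label mismatch. The function name `service_pods` is hypothetical:

```bash
# List the pods that a Service's selector matches.
# Usage: service_pods <service-name> <namespace>
service_pods() {
  # Render the selector map as "key=value,key=value,"
  selector="$(kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  selector="${selector%,}"   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$2" -l "$selector" -o wide
}
```

If this lists pods but the endpoints are still empty, the readiness of those pods is the next thing to check.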
DNS Issues

```bash
# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
```
---

Resource Analysis
Node Pressure

```bash
# Check node conditions
kubectl describe nodes | grep -A 5 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20
```
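
On larger clusters, the `describe nodes` output is long. As a sketch, the node conditions can be printed one node per line and filtered for anything other than `Ready` that is `True` (for example MemoryPressure, DiskPressure, or PIDPressure). The function name `nodes_under_pressure` is hypothetical, and it does not report nodes that are simply NotReady:

```bash
# Print nodes with any non-Ready condition set to True.
nodes_under_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{"\n"}{end}' |
    awk -F: '{
      n = split($2, conds, ";")          # e.g. "MemoryPressure=False", "Ready=True"
      bad = ""
      for (i = 1; i <= n; i++)
        if (conds[i] ~ /=True$/ && conds[i] != "Ready=True")
          bad = bad " " conds[i]
      if (bad != "") print $1 ":" bad
    }'
}
```

No output means no node currently reports a pressure condition.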
PVC Issues

```bash
# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>
```
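
To find the claims worth describing, a short filter can list every PVC that is not `Bound`. This sketch assumes the default column layout of `kubectl get pvc -A` (NAMESPACE, NAME, STATUS); the function name `unbound_pvcs` is hypothetical:

```bash
# List PVCs that are not Bound, across all namespaces.
# Assumes default -A columns: NAMESPACE NAME STATUS ...
unbound_pvcs() {
  kubectl get pvc -A --no-headers |
    awk '$3 != "Bound" { print $1 "/" $2 "\t" $3 }'
}
```

Each reported claim can then be passed to `kubectl describe pvc` to read its events.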
---

Quick Reference Commands

```bash
# Pod debugging
kubectl logs <pod> -n <ns>                      # Current logs
kubectl logs <pod> -n <ns> --previous           # Previous container logs
kubectl logs <pod> -n <ns> -c <container>       # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f        # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh       # Shell into container
kubectl exec <pod> -n <ns> -- env               # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf  # Check DNS config

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml           # Full pod spec
kubectl describe pod <pod> -n <ns>              # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running           # Non-running pods
kubectl top pods -A --sort-by=cpu               # CPU usage
kubectl top pods -A --sort-by=memory            # Memory usage
```
Additional Resources
- Error Message Decoder
- kubectl Cheat Sheet