k8s-debug

Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging and troubleshooting Kubernetes clusters, pods, services, and deployments. Provides scripts, workflows, and reference guides for identifying and resolving common Kubernetes issues efficiently.

When to Use This Skill

Invoke this skill when encountering:
  • Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
  • Service connectivity or DNS resolution issues
  • Network policy or ingress problems
  • Volume and storage mount failures
  • Deployment rollout issues
  • Cluster health or performance degradation
  • Resource exhaustion (CPU/memory)
  • Configuration problems (ConfigMaps, Secrets, RBAC)

Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Identify the Problem Layer

Categorize the issue:
  • Application Layer: Application crashes, errors, bugs
  • Pod Layer: Pod not starting, restarting, or pending
  • Service Layer: Network connectivity, DNS issues
  • Node Layer: Node not ready, resource exhaustion
  • Cluster Layer: Control plane issues, API problems
  • Storage Layer: Volume mount failures, PVC issues
  • Configuration Layer: ConfigMap, Secret, RBAC issues
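As a rough illustration of this triage step, the sketch below maps a few well-known status values and event reasons to the layers listed above. The table contents and the name `guess_layer` are assumptions made for illustration and are not part of the skill's scripts; real incidents often span more than one layer, so treat the result as a first guess only.

```python
# Hypothetical triage table: status or event reason -> likely problem layer.
LAYER_BY_REASON = {
    "CrashLoopBackOff": "Application",
    "ImagePullBackOff": "Pod",
    "ErrImagePull": "Pod",
    "Pending": "Pod",
    "NodeNotReady": "Node",
    "FailedMount": "Storage",
    "CreateContainerConfigError": "Configuration",
}


def guess_layer(reason: str) -> str:
    """Return the likely layer for a reason, or 'Unknown' if it is not in the table."""
    return LAYER_BY_REASON.get(reason, "Unknown")


print(guess_layer("FailedMount"))
```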

2. Gather Diagnostic Information

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use scripts/pod_diagnostics.py for comprehensive pod analysis:
bash
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers:
  • Pod status and description
  • Pod events
  • Container logs (current and previous)
  • Resource usage
  • Node information
  • YAML configuration
Output can be saved for analysis:
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
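The internals of scripts/pod_diagnostics.py are not shown here. As a sketch of how the items listed above could be gathered by hand, the following builds one kubectl invocation per item. The function name `pod_diagnostic_commands` and the choice of flags are assumptions, not a description of the actual script.

```python
import shlex


def pod_diagnostic_commands(pod: str, namespace: str) -> list:
    """Build kubectl argument lists covering status, events, logs, usage, and YAML."""
    ns = ["-n", namespace]
    return [
        ["kubectl", "get", "pod", pod, *ns, "-o", "wide"],  # status and node
        ["kubectl", "describe", "pod", pod, *ns],  # description
        ["kubectl", "get", "events", *ns,
         "--field-selector", f"involvedObject.name={pod}"],  # pod events
        ["kubectl", "logs", pod, *ns, "--all-containers"],  # current logs
        ["kubectl", "logs", pod, *ns, "--all-containers", "--previous"],  # previous logs
        ["kubectl", "top", "pod", pod, *ns, "--containers"],  # resource usage
        ["kubectl", "get", "pod", pod, *ns, "-o", "yaml"],  # YAML configuration
    ]


# Print the commands instead of running them, so they can be reviewed first.
for argv in pod_diagnostic_commands("web-0", "prod"):
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run` as-is; `kubectl top` additionally requires a metrics server in the cluster.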

Cluster-Level Health Check

Use scripts/cluster_health.sh for overall cluster diagnostics:
bash
./scripts/cluster_health.sh
This script checks:
  • Cluster info and version
  • Node status and resources
  • Pods across all namespaces
  • Failed/pending pods
  • Recent events
  • Deployments, services, statefulsets, daemonsets
  • PVCs and PVs
  • Component health
  • Common error states (CrashLoopBackOff, ImagePullBackOff)
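To illustrate the failed/pending and error-state checks, here is a minimal sketch that scans the JSON from `kubectl get pods -A -o json`. It is not taken from cluster_health.sh; the function name `unhealthy_pods` and the set of reasons treated as errors are assumptions.

```python
# Waiting reasons treated as error states in this sketch (an assumption).
BAD_WAITING = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def unhealthy_pods(pod_list: dict) -> list:
    """Return (namespace, name, problem) tuples for pods that look unhealthy."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        problem = phase if phase in ("Pending", "Failed", "Unknown") else None
        # A pod can report phase Running while a container is crash-looping.
        for cs in status.get("containerStatuses", []):
            reason = cs.get("state", {}).get("waiting", {}).get("reason")
            if reason in BAD_WAITING:
                problem = reason
        if problem:
            found.append((meta["namespace"], meta["name"], problem))
    return found


sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "ok"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"state": {"running": {}}}]}},
        {"metadata": {"namespace": "prod", "name": "api"},
         "status": {"phase": "Running",
                    "containerStatuses": [
                        {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"namespace": "prod", "name": "job"},
         "status": {"phase": "Pending"}},
    ]
}
print(unhealthy_pods(sample))
```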

Network Diagnostics

Use scripts/network_debug.sh for connectivity issues:
bash
./scripts/network_debug.sh <namespace> <pod-name>
This script analyzes:
  • Pod network configuration
  • DNS setup and resolution
  • Service endpoints
  • Network policies
  • Connectivity tests
  • CoreDNS logs
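For the DNS resolution item, a small sketch like the one below can be run from inside a pod that has Python available. It assumes the default cluster domain `cluster.local`; the helper names `service_fqdn` and `resolve` are hypothetical and unrelated to network_debug.sh.

```python
import socket


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(hostname: str) -> list:
    """Return the sorted IP addresses for a hostname, or an empty list on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


print(service_fqdn("web", "prod"))
```

An empty list from `resolve(service_fqdn(...))` inside a pod points at DNS or at a missing Service rather than at the application.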

3. Follow Issue-Specific Workflow

Based on the identified issue, consult references/troubleshooting_workflow.md for detailed workflows:
  • Pod Pending: Resource/scheduling workflow
  • CrashLoopBackOff: Application crash workflow
  • ImagePullBackOff: Image pull workflow
  • Service issues: Network connectivity workflow
  • DNS failures: DNS troubleshooting workflow
  • Resource exhaustion: Performance investigation workflow
  • Storage issues: PVC binding workflow
  • Deployment stuck: Rollout workflow

4. Apply Targeted Fixes

Refer to references/common_issues.md for specific solutions to common problems.

Common Debugging Patterns

Pattern 1: Pod Not Starting

bash
# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

# Check common causes:
# - ImagePullBackOff: Verify the image exists and the pull credentials are valid
# - CrashLoopBackOff: Check logs with the --previous flag
# - Pending: Check node resources and scheduling

Pattern 2: Service Connectivity Issues

bash
# Verify service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>

# Test connectivity from debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside: curl <service-name>.<namespace>.svc.cluster.local:<port>

# Check network policies
kubectl get networkpolicies -n <namespace>
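An empty endpoints list usually means the Service selector matches no ready pods. The sketch below compares a Service's selector with pod labels, using the JSON from `kubectl get svc <service-name> -o json` and `kubectl get pods -o json`. The function names are hypothetical, and readiness is deliberately ignored to keep the example short.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True when every key/value pair in the Service selector appears in the pod labels."""
    if not selector:
        return False  # a Service without a selector does not select pods by itself
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(service: dict, pod_list: dict) -> list:
    """Names of pods whose labels satisfy the Service selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if selector_matches(selector, pod["metadata"].get("labels") or {})
    ]


svc = {"spec": {"selector": {"app": "web"}}}
pods = {
    "items": [
        {"metadata": {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}}},
        {"metadata": {"name": "db-1", "labels": {"app": "db"}}},
    ]
}
print(matching_pods(svc, pods))
```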

Pattern 3: Application Performance Issues

bash
# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --containers

# Inspect configured resource requests and limits
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState

# Review application logs
kubectl logs <pod-name> -n <namespace> --tail=100
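The `grep -A 10 lastState` check can miss or clutter the relevant field. As an alternative sketch, the function below reads the same information from `kubectl get pod <pod-name> -o json`. The name `oom_killed_containers` is an assumption for illustration.

```python
def oom_killed_containers(pod: dict) -> list:
    """Names of containers whose last termination reason was OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated", {})
        if terminated.get("reason") == "OOMKilled":
            names.append(cs["name"])
    return names


pod = {
    "status": {
        "containerStatuses": [
            {"name": "app",
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "sidecar", "lastState": {}},
        ]
    }
}
print(oom_killed_containers(pod))
```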

Pattern 4: Cluster Health Assessment

bash
# Run comprehensive health check
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

# Review output for:
# - Node conditions and resource pressure
# - Failed or pending pods
# - Recent error events
# - Component health status
# - Resource quota usage
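For the node conditions and resource pressure item, the following sketch flags problem nodes from `kubectl get nodes -o json`. It is an illustration under assumed names (`node_problems`, `PRESSURE`), not part of cluster_health.sh.

```python
# Pressure conditions that indicate trouble when their status is "True".
PRESSURE = ("MemoryPressure", "DiskPressure", "PIDPressure")


def node_problems(node_list: dict) -> dict:
    """Map node name -> list of problems (NotReady or an active pressure condition)."""
    problems = {}
    for node in node_list.get("items", []):
        issues = []
        for cond in node.get("status", {}).get("conditions", []):
            ctype, cstatus = cond.get("type"), cond.get("status")
            if ctype == "Ready" and cstatus != "True":
                issues.append("NotReady")
            elif ctype in PRESSURE and cstatus == "True":
                issues.append(ctype)
        if issues:
            problems[node["metadata"]["name"]] = issues
    return problems


nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"},
                                   {"type": "MemoryPressure", "status": "False"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"},
                                   {"type": "DiskPressure", "status": "True"}]}},
    ]
}
print(node_problems(nodes))
```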

Essential Manual Commands

While scripts automate diagnostics, understand these core commands:

Pod Debugging

bash
# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

Service and Network Debugging

bash
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
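When the event list is long, it can help to keep only warnings, newest first. The sketch below does that on the JSON from `kubectl get events -n <namespace> -o json`. The function name `recent_warnings` is hypothetical, and the sort assumes `lastTimestamp` is populated with RFC 3339 UTC strings, which may not hold for every event.

```python
def recent_warnings(event_list: dict, limit: int = 5) -> list:
    """Return up to `limit` Warning events as text lines, newest first."""
    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    # Timestamps in the same UTC format sort correctly as plain strings.
    warnings.sort(key=lambda e: e.get("lastTimestamp") or "", reverse=True)
    return [
        f"{e.get('lastTimestamp')} {e.get('reason')}: {e.get('message')}"
        for e in warnings[:limit]
    ]


events = {
    "items": [
        {"type": "Normal", "reason": "Pulled", "message": "ok",
         "lastTimestamp": "2024-05-01T10:00:00Z"},
        {"type": "Warning", "reason": "BackOff", "message": "restarting",
         "lastTimestamp": "2024-05-01T10:05:00Z"},
        {"type": "Warning", "reason": "FailedMount", "message": "timeout",
         "lastTimestamp": "2024-05-01T10:07:00Z"},
    ]
}
for line in recent_warnings(events):
    print(line)
```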

Resource Monitoring

bash
# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
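To compare usage across pods or over time, the text output of `kubectl top pods -n <namespace>` can be converted to numbers. The sketch below assumes the plain three-column layout (name, CPU, memory) and binary memory suffixes. The helper names are hypothetical.

```python
_BINARY = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def parse_cpu(value: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value: str) -> int:
    """Convert a memory quantity such as '128Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def parse_top_pods(text: str) -> list:
    """Parse `kubectl top pods` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, cpu, mem = line.split()[:3]
        rows.append({"pod": name, "cpu_cores": parse_cpu(cpu),
                     "memory_bytes": parse_memory(mem)})
    return rows


sample_top = "NAME    CPU(cores)   MEMORY(bytes)\nweb-1   250m         128Mi\n"
print(parse_top_pods(sample_top))
```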

Emergency Operations

bash
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>

Reference Documentation

Detailed Troubleshooting Guides

Consult references/troubleshooting_workflow.md for:
  • Step-by-step workflows for each issue type
  • Decision trees for diagnosis
  • Command sequences for systematic debugging
  • Quick reference command cheat sheet

Common Issues Database

Consult references/common_issues.md for:
  • Detailed explanations of each common issue
  • Symptoms and causes
  • Specific debugging steps
  • Solutions and fixes
  • Prevention strategies

Best Practices

Systematic Approach

  1. Observe: Gather facts before making changes
  2. Analyze: Use diagnostic scripts to collect comprehensive data
  3. Hypothesize: Form theory about root cause
  4. Test: Verify hypothesis with targeted commands
  5. Fix: Apply appropriate solution
  6. Verify: Confirm issue is resolved
  7. Document: Record findings for future reference

Data Collection

  • Save diagnostic output to files for analysis
  • Capture logs before restarting failing pods
  • Record events timeline for incident reports
  • Export resource metrics for trend analysis
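One way to follow these points is to write every captured output to a timestamped file before touching the failing pod. The sketch below uses the same `%Y%m%d-%H%M%S` stamp as the cluster health example. The function names and the directory layout are assumptions.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot_path(directory: str, label: str, now: datetime) -> Path:
    """Build a path such as <directory>/<label>-20240501-100730.txt."""
    return Path(directory) / f"{label}-{now.strftime('%Y%m%d-%H%M%S')}.txt"


def save_snapshot(directory: str, label: str, text: str,
                  now: Optional[datetime] = None) -> Path:
    """Write diagnostic text to a timestamped file and return its path."""
    path = snapshot_path(directory, label, now or datetime.now())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


print(snapshot_path("incident", "pod-logs", datetime(2024, 5, 1, 10, 7, 30)).name)
```

The text passed to `save_snapshot` would typically be the captured stdout of a command such as `kubectl logs <pod-name> -n <namespace> --previous`.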

Prevention

  • Set appropriate resource requests and limits
  • Implement health checks (liveness/readiness probes)
  • Use proper logging and monitoring
  • Apply network policies incrementally
  • Test changes in non-production environments
  • Maintain documentation of cluster architecture
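The first two points can be checked mechanically. The sketch below inspects one container entry from a pod spec (for example, from `kubectl get pod <pod-name> -o json`) and reports which safeguards are absent. The function name `missing_safeguards` is hypothetical, and whether every container needs both probes is a judgment call for each workload.

```python
def missing_safeguards(container: dict) -> list:
    """List resource settings and probes that a container spec does not define."""
    missing = []
    resources = container.get("resources", {})
    if not resources.get("requests"):
        missing.append("resources.requests")
    if not resources.get("limits"):
        missing.append("resources.limits")
    for probe in ("livenessProbe", "readinessProbe"):
        if probe not in container:
            missing.append(probe)
    return missing


container = {
    "name": "web",
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
}
print(missing_safeguards(container))
```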

Advanced Debugging Techniques

Debug Containers (Kubernetes 1.23+)

bash
# Attach ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

# Create debug copy of pod
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container>

Port Forwarding for Testing

bash
# Forward pod port to local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>

# Forward service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>

Proxy for API Access

bash
# Start kubectl proxy
kubectl proxy --port=8080

# Access API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
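The same request can be made from a script while `kubectl proxy` is running. The sketch below builds the URL used in the curl example and reads the pod phase from the response. The function names are hypothetical, and port 8080 simply mirrors the command above.

```python
import json
from urllib.request import urlopen


def pod_api_url(namespace: str, pod: str, port: int = 8080) -> str:
    """URL of a pod object behind a local kubectl proxy."""
    return f"http://localhost:{port}/api/v1/namespaces/{namespace}/pods/{pod}"


def fetch_pod_phase(namespace: str, pod: str, port: int = 8080,
                    timeout: float = 5.0) -> str:
    """Fetch the pod through the proxy and return status.phase."""
    with urlopen(pod_api_url(namespace, pod, port), timeout=timeout) as resp:
        return json.load(resp)["status"]["phase"]


print(pod_api_url("prod", "web-1"))
```

`fetch_pod_phase` raises `urllib.error.URLError` if the proxy is not running, which is itself a useful signal.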

Custom Column Output

bash
# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Troubleshooting Checklist

Before escalating issues, verify:
  • Reviewed pod events: kubectl describe pod
  • Checked pod logs (current and previous)
  • Verified resource availability on nodes
  • Confirmed image exists and is accessible
  • Validated service selectors match pod labels
  • Tested DNS resolution from pods
  • Checked network policies
  • Reviewed recent cluster events
  • Confirmed ConfigMaps/Secrets exist
  • Validated RBAC permissions
  • Checked for resource quotas/limits
  • Reviewed cluster component health

Related Tools

Useful additional tools for Kubernetes debugging:
  • kubectl-debug: Advanced debugging plugin
  • stern: Multi-pod log tailing
  • kubectx/kubens: Context and namespace switching
  • k9s: Terminal UI for Kubernetes
  • lens: Desktop IDE for Kubernetes
  • Prometheus/Grafana: Monitoring and alerting
  • Jaeger/Zipkin: Distributed tracing