k8s-debug

Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging and troubleshooting Kubernetes clusters, pods, services, and deployments. Provides scripts, workflows, and reference guides for identifying and resolving common Kubernetes issues efficiently.

When to Use This Skill

Invoke this skill when encountering:

Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
Service connectivity or DNS resolution issues
Network policy or ingress problems
Volume and storage mount failures
Deployment rollout issues
Cluster health or performance degradation
Resource exhaustion (CPU/memory)
Configuration problems (ConfigMaps, Secrets, RBAC)

Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Identify the Problem Layer

Categorize the issue:

Application Layer: Application crashes, errors, bugs
Pod Layer: Pod not starting, restarting, or pending
Service Layer: Network connectivity, DNS issues
Node Layer: Node not ready, resource exhaustion
Cluster Layer: Control plane issues, API problems
Storage Layer: Volume mount failures, PVC issues
Configuration Layer: ConfigMap, Secret, RBAC issues

2. Gather Diagnostic Information

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use scripts/pod_diagnostics.py for comprehensive pod analysis:

python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

This script gathers:

Pod status and description
Pod events
Container logs (current and previous)
Resource usage
Node information
YAML configuration

Output can be saved for analysis:

python3 scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
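The script's source is not reproduced here, so the following is only a sketch of kubectl invocations that would collect the items listed above. The helper name `diagnostic_commands` and the exact flag choices are assumptions for illustration, not the script's actual contents.

```python
from typing import List


def diagnostic_commands(pod: str, namespace: str) -> List[List[str]]:
    """Return kubectl argv lists covering the data the script is said to gather.

    Hypothetical helper: the real pod_diagnostics.py may run different commands.
    """
    target = ["-n", namespace]
    return [
        # Status, including the node the pod is scheduled on
        ["kubectl", "get", "pod", pod, *target, "-o", "wide"],
        # Full description
        ["kubectl", "describe", "pod", pod, *target],
        # Events that reference this pod
        ["kubectl", "get", "events", *target,
         "--field-selector", f"involvedObject.name={pod}"],
        # Logs of the current containers
        ["kubectl", "logs", pod, *target, "--all-containers"],
        # Logs of the previous container instances (fails if none exist)
        ["kubectl", "logs", pod, *target, "--all-containers", "--previous"],
        # Resource usage (requires metrics-server)
        ["kubectl", "top", "pod", pod, *target, "--containers"],
        # YAML configuration
        ["kubectl", "get", "pod", pod, *target, "-o", "yaml"],
    ]
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. A non-zero exit from the `--previous` call should probably be treated as non-fatal, since it fails when the container has never restarted.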

Cluster-Level Health Check

Use scripts/cluster_health.sh for overall cluster diagnostics:

./scripts/cluster_health.sh

This script checks:

Cluster info and version
Node status and resources
Pods across all namespaces
Failed/pending pods
Recent events
Deployments, services, statefulsets, daemonsets
PVCs and PVs
Component health
Common error states (CrashLoopBackOff, ImagePullBackOff)
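As an illustration of the "failed/pending pods" and "common error states" checks, here is a minimal sketch that filters the JSON form of a pod listing. It assumes the input is the parsed output of `kubectl get pods -A -o json`; the function name `unhealthy_pods` is hypothetical, and the shell script itself may work differently.

```python
from typing import Dict, List, Tuple


def unhealthy_pods(pod_list: Dict) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, state) for pods that are not Running or Succeeded.

    A container waiting reason such as CrashLoopBackOff takes precedence over
    the pod phase, because such pods still report phase "Running".
    """
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        state = status.get("phase", "Unknown")
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason"):
                state = waiting["reason"]
                break
        if state not in ("Running", "Succeeded"):
            found.append((meta.get("namespace", ""), meta.get("name", ""), state))
    return found
```

The input can be obtained with `json.loads(subprocess.run(["kubectl", "get", "pods", "-A", "-o", "json"], capture_output=True, text=True, check=True).stdout)`.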

Network Diagnostics

Use scripts/network_debug.sh for connectivity issues:

./scripts/network_debug.sh <namespace> <pod-name>

This script analyzes:

Pod network configuration
DNS setup and resolution
Service endpoints
Network policies
Connectivity tests
CoreDNS logs

3. Follow Issue-Specific Workflow

Based on the identified issue, consult references/troubleshooting_workflow.md for detailed workflows:

Pod Pending: Resource/scheduling workflow
CrashLoopBackOff: Application crash workflow
ImagePullBackOff: Image pull workflow
Service issues: Network connectivity workflow
DNS failures: DNS troubleshooting workflow
Resource exhaustion: Performance investigation workflow
Storage issues: PVC binding workflow
Deployment stuck: Rollout workflow
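The mapping above can be expressed as a simple lookup from an observed pod state to a workflow name. This is a sketch only: the `ErrImagePull` and `OOMKilled` entries are assumptions about where those states belong, and `pick_workflow` is a hypothetical name.

```python
# Keys are pod states; values are the workflow names from the list above.
WORKFLOWS = {
    "Pending": "Resource/scheduling workflow",
    "CrashLoopBackOff": "Application crash workflow",
    "ImagePullBackOff": "Image pull workflow",
    "ErrImagePull": "Image pull workflow",  # assumption: handled like ImagePullBackOff
    "OOMKilled": "Performance investigation workflow",  # assumption: resource exhaustion
}


def pick_workflow(state: str) -> str:
    """Return the workflow for a pod state, or a pointer to the reference file."""
    return WORKFLOWS.get(state, "see references/troubleshooting_workflow.md")
```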

4. Apply Targeted Fixes

Refer to references/common_issues.md for specific solutions to common problems.

Common Debugging Patterns

Pattern 1: Pod Not Starting

# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

# Check common causes:
# - ImagePullBackOff: Verify that the image exists and the pull credentials are valid
# - CrashLoopBackOff: Check logs with the --previous flag
# - Pending: Check node resources and scheduling

Pattern 2: Service Connectivity Issues

# Verify service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>

# Test connectivity from debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside: curl <service-name>.<namespace>.svc.cluster.local:<port>

# Check network policies
kubectl get networkpolicies -n <namespace>
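The connectivity test above relies on the in-cluster DNS name of the Service. The sketch below builds that name and attempts a plain TCP connection to it, which can stand in for `curl` when run from inside a pod that has Python available. The helper names are hypothetical, and the `cluster.local` default assumes the cluster domain has not been customized.

```python
import socket


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name used in the curl example above."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections and timeouts.
        return False
```

For example, `can_connect(service_fqdn("api", "prod"), 8080)` returning False while the Service has endpoints would point toward DNS or network policy rather than the application.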

Pattern 3: Application Performance Issues

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --containers

# Get pod resource requests and limits
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState

# Review application logs
kubectl logs <pod-name> -n <namespace> --tail=100
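The `grep -A 10 lastState` check above is quick but depends on YAML layout. A sketch of the same OOMKilled check against the JSON form of the pod follows; it assumes the input is the parsed output of `kubectl get pod <pod-name> -n <namespace> -o json`, and `oom_killed_containers` is a hypothetical name.

```python
from typing import Dict, List


def oom_killed_containers(pod: Dict) -> List[str]:
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty for containers that have never been restarted.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(cs.get("name", ""))
    return names
```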

Pattern 4: Cluster Health Assessment

# Run comprehensive health check
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

# Review output for:
# - Node conditions and resource pressure
# - Failed or pending pods
# - Recent error events
# - Component health status
# - Resource quota usage

Essential Manual Commands

While scripts automate diagnostics, understand these core commands:

Pod Debugging

# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

Service and Network Debugging

# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
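When the event list is long, it can help to keep only the warnings. The sketch below does that on the JSON form of the listing, assuming the input is the parsed output of `kubectl get events -n <namespace> -o json`. The function name and the output line format are illustrative choices.

```python
from typing import Dict, List


def recent_warnings(event_list: Dict, limit: int = 10) -> List[str]:
    """Return the newest Warning events as 'timestamp reason kind/name: message' lines."""

    def stamp(ev: Dict) -> str:
        # lastTimestamp can be null on some events; fall back to other time fields.
        return (ev.get("lastTimestamp")
                or ev.get("eventTime")
                or ev.get("metadata", {}).get("creationTimestamp")
                or "")

    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    # Timestamp strings in the same UTC format sort chronologically as plain text.
    warnings.sort(key=stamp, reverse=True)
    lines = []
    for ev in warnings[:limit]:
        obj = ev.get("involvedObject", {})
        lines.append(f"{stamp(ev)} {ev.get('reason', '')} "
                     f"{obj.get('kind', '')}/{obj.get('name', '')}: {ev.get('message', '')}")
    return lines
```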

Resource Monitoring

# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers

Emergency Operations

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>

Reference Documentation

Detailed Troubleshooting Guides

Consult references/troubleshooting_workflow.md for:

Step-by-step workflows for each issue type
Decision trees for diagnosis
Command sequences for systematic debugging
Quick reference command cheat sheet

Common Issues Database

Consult references/common_issues.md for:

Detailed explanations of each common issue
Symptoms and causes
Specific debugging steps
Solutions and fixes
Prevention strategies

Best Practices

Systematic Approach

Observe: Gather facts before making changes
Analyze: Use diagnostic scripts to collect comprehensive data
Hypothesize: Form theory about root cause
Test: Verify hypothesis with targeted commands
Fix: Apply appropriate solution
Verify: Confirm issue is resolved
Document: Record findings for future reference

Data Collection

Save diagnostic output to files for analysis
Capture logs before restarting failing pods
Record events timeline for incident reports
Export resource metrics for trend analysis
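One way to follow the first two points is to write every captured output to a timestamped file, mirroring the `cluster-health-$(date +%Y%m%d-%H%M%S).txt` naming used earlier in this document. The helper names below are hypothetical; this is a sketch, not part of the provided scripts.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def diagnostic_path(prefix: str, directory: str = ".", now: Optional[datetime] = None) -> Path:
    """Build a timestamped file path such as cluster-health-20240131-235959.txt."""
    now = now or datetime.now()
    return Path(directory) / f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}.txt"


def save_output(text: str, prefix: str, directory: str = ".") -> Path:
    """Write captured diagnostic text to a new timestamped file and return its path."""
    path = diagnostic_path(prefix, directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Calling `save_output(logs, "web-0-logs", "incident")` before restarting a failing pod keeps the pre-restart logs available for the incident timeline.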

Prevention

Set appropriate resource requests and limits
Implement health checks (liveness/readiness probes)
Use proper logging and monitoring
Apply network policies incrementally
Test changes in non-production environments
Maintain documentation of cluster architecture
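The first two items can be audited mechanically. The following sketch reports, per container, which of resource requests, limits, and probes are missing from a pod spec. It assumes the input is the parsed output of `kubectl get pod <pod-name> -n <namespace> -o json`; `missing_safeguards` is a hypothetical name.

```python
from typing import Dict, List


def missing_safeguards(pod: Dict) -> Dict[str, List[str]]:
    """List, per container, which of requests, limits and probes are not set."""
    report = {}
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        missing = [f"resources.{key}" for key in ("requests", "limits")
                   if not resources.get(key)]
        missing += [probe for probe in ("livenessProbe", "readinessProbe")
                    if not container.get(probe)]
        if missing:
            report[container.get("name", "")] = missing
    return report
```

An empty result means every container declares requests, limits, and both probes. It says nothing about whether the values are appropriate.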

Advanced Debugging Techniques

Debug Containers (Kubernetes 1.23+)

# Attach ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

# Create debug copy of pod, replacing the container command with a shell
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container> -- /bin/sh

Port Forwarding for Testing

# Forward pod port to local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>

# Forward service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>

Proxy for API Access

# Start kubectl proxy (runs in the foreground; use a second terminal for the request)
kubectl proxy --port=8080

# Access API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
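The same request can be made from Python using only the standard library. This sketch assumes `kubectl proxy --port=8080` is already running locally; the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def pod_api_url(namespace: str, pod: str, port: int = 8080) -> str:
    """Build the pod URL served by `kubectl proxy --port=<port>`."""
    return (f"http://localhost:{port}/api/v1/namespaces/"
            f"{quote(namespace, safe='')}/pods/{quote(pod, safe='')}")


def pod_phase(namespace: str, pod: str, port: int = 8080) -> str:
    """Fetch the pod through the proxy and return status.phase (needs a running proxy)."""
    with urllib.request.urlopen(pod_api_url(namespace, pod, port), timeout=5) as resp:
        return json.load(resp)["status"]["phase"]
```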

Custom Column Output

# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
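The TAINTS column prints the raw taint objects, which can be hard to scan. As an alternative, this sketch formats taints from the JSON form of the node listing, assuming the input is the parsed output of `kubectl get nodes -o json`. The function name and the `key=value:effect` rendering are illustrative choices.

```python
from typing import Dict, List


def node_taints(node_list: Dict) -> Dict[str, List[str]]:
    """Map each node name to its taints formatted as key=value:effect."""
    result = {}
    for node in node_list.get("items", []):
        taints = node.get("spec", {}).get("taints") or []
        formatted = []
        for taint in taints:
            key = taint.get("key", "")
            value = taint.get("value")
            effect = taint.get("effect", "")
            # Taints without a value are rendered as key:effect.
            formatted.append(f"{key}={value}:{effect}" if value else f"{key}:{effect}")
        result[node.get("metadata", {}).get("name", "")] = formatted
    return result
```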

Troubleshooting Checklist

Before escalating issues, verify:

Reviewed pod events (kubectl describe pod)
Checked pod logs (current and previous)
Verified resource availability on nodes
Confirmed image exists and is accessible
Validated service selectors match pod labels
Tested DNS resolution from pods
Checked network policies
Reviewed recent cluster events
Confirmed ConfigMaps/Secrets exist
Validated RBAC permissions
Checked for resource quotas/limits
Reviewed cluster component health
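The selector check in this list is a frequent cause of Services with no endpoints. A minimal sketch of the matching rule follows: every key/value pair in the Service's `spec.selector` must appear in the pod's `metadata.labels`. The function name is hypothetical.

```python
from typing import Dict


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Return True if every key/value in the Service selector appears in the pod labels."""
    if not selector:
        return False  # a Service without a selector does not select any pods
    return all(labels.get(key) == value for key, value in selector.items())
```

Extra labels on the pod do not matter, but a single mismatched value (for example `tier: be` versus `tier: fe`) is enough to exclude it.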
