k8s-troubleshoot

Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

When to Apply

Use this skill when:

User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"
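
As a rough illustration of how these triggers could be checked, the sketch below does a case-insensitive substring match against the phrases listed above. The function name `matched_triggers` is hypothetical, and plain substring matching is an assumption; a real activation mechanism may work differently.

```python
# Trigger phrases, pod states, node conditions, and keywords copied from the list above.
USER_PHRASES = ["debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"]
POD_STATES = ["Pending", "CrashLoopBackOff", "ImagePullBackOff", "OOMKilled", "Error", "Unknown"]
NODE_CONDITIONS = ["NotReady", "MemoryPressure", "DiskPressure", "NetworkUnavailable", "PIDPressure"]
KEYWORDS = ["logs", "events", "describe", "why isn't working", "stuck", "not responding"]


def matched_triggers(message: str) -> list[str]:
    """Return every listed trigger that appears in the message (case-insensitive)."""
    text = message.lower()
    triggers = USER_PHRASES + POD_STATES + NODE_CONDITIONS + KEYWORDS
    return [t for t in triggers if t.lower() in text]


# A non-empty result suggests the skill applies.
print(matched_triggers("My pod is in CrashLoopBackOff"))
```

Substring matching is deliberately loose: short terms such as "Error" will match many messages, so a stricter matcher may be preferable in practice.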

Priority Rules

Priority	Rule	Impact	Tools
1	Check pod status first	CRITICAL	`get_pods`, `describe_pod`
2	View recent events	CRITICAL	`get_events`
3	Inspect logs (including previous)	HIGH	`get_pod_logs`
4	Check resource metrics	HIGH	`get_pod_metrics`
5	Verify endpoints	MEDIUM	`get_endpoints`
6	Review network policies	MEDIUM	`get_network_policies`
7	Examine node status	LOW	`get_nodes`, `describe_node`


Quick Reference

Symptom	First Tool	Next Steps
Pod Pending	`describe_pod`	Check events, node capacity, resource requests
CrashLoopBackOff	`get_pod_logs(previous=True)`	Check exit code, resources, liveness probes
ImagePullBackOff	`describe_pod`	Verify image name, registry auth, network
OOMKilled	`get_pod_metrics`	Increase memory limits, check for memory leaks
ContainerCreating	`describe_pod`	Check PVC binding, secrets, configmaps
Terminating (stuck)	`describe_pod`	Check finalizers, PDBs, preStop hooks
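
The table above can be encoded as a simple lookup. This is a minimal sketch: `QUICK_REFERENCE` and `first_step` are hypothetical names, and the fallback for unlisted symptoms is an assumption based on priority rule 1 (check pod status first).

```python
# Symptom -> (first tool, next steps), transcribed from the table above.
QUICK_REFERENCE = {
    "Pod Pending": ("describe_pod", "Check events, node capacity, resource requests"),
    "CrashLoopBackOff": ("get_pod_logs(previous=True)", "Check exit code, resources, liveness probes"),
    "ImagePullBackOff": ("describe_pod", "Verify image name, registry auth, network"),
    "OOMKilled": ("get_pod_metrics", "Increase memory limits, check for memory leaks"),
    "ContainerCreating": ("describe_pod", "Check PVC binding, secrets, configmaps"),
    "Terminating (stuck)": ("describe_pod", "Check finalizers, PDBs, preStop hooks"),
}


def first_step(symptom: str) -> tuple[str, str]:
    """Look up the first tool and follow-up checks for a symptom.

    Unlisted symptoms fall back to checking pod status first (an assumption).
    """
    return QUICK_REFERENCE.get(symptom, ("get_pods", "Check pod status, then events and logs"))


tool, next_steps = first_step("ImagePullBackOff")
print(f"{tool}: {next_steps}")
```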

Diagnostic Workflows

Pod Not Starting

1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
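
The four steps above can be sketched as one function. This assumes the tools are available as Python callables that accept the keyword arguments shown in the steps; the `tools` dictionary, the `crash_looping` flag, and the stub tools are hypothetical and exist only to make the example runnable.

```python
from typing import Any, Callable, Dict, List, Tuple

Tool = Callable[..., Any]


def diagnose_pod_not_starting(tools: Dict[str, Tool], name: str, namespace: str,
                              label_selector: str = "",
                              crash_looping: bool = False) -> Dict[str, Any]:
    """Run the workflow in order and collect each tool's output."""
    report: Dict[str, Any] = {}
    report["pods"] = tools["get_pods"](namespace=namespace, label_selector=label_selector)
    report["describe"] = tools["describe_pod"](name=name, namespace=namespace)
    report["events"] = tools["get_events"](
        namespace=namespace, field_selector=f"involvedObject.name={name}")
    if crash_looping:
        # Previous-container logs are only useful once the container has restarted.
        report["previous_logs"] = tools["get_pod_logs"](
            name=name, namespace=namespace, previous=True)
    return report


# Hypothetical stubs that record calls instead of contacting a cluster.
calls: List[Tuple[str, Dict[str, Any]]] = []


def make_stub(tool_name: str) -> Tool:
    def stub(**kwargs: Any) -> str:
        calls.append((tool_name, kwargs))
        return f"<{tool_name} output>"
    return stub


stub_tools = {n: make_stub(n) for n in ("get_pods", "describe_pod", "get_events", "get_pod_logs")}
report = diagnose_pod_not_starting(stub_tools, "myapp-xyz", "prod", crash_looping=True)
print(list(report))
```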

Common Pod States

State	Likely Cause	Tools to Use
Pending	Scheduling issues	`describe_pod`, `get_nodes`, `get_events`
ImagePullBackOff	Registry/auth	`describe_pod`, check image name
CrashLoopBackOff	App crash	`get_pod_logs(previous=True)`
OOMKilled	Memory limit	`get_pod_metrics`, adjust limits
ContainerCreating	Volume/network	`describe_pod`, `get_pvc`

Node Issues

1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
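
Step 3 can be sketched as a check over the node's condition list. The example assumes the conditions are available in the shape of the Kubernetes `status.conditions` field, where each entry has a `type` and a `status` of "True", "False", or "Unknown"; whether `describe_node` returns that structure is not stated here, and the function name is hypothetical.

```python
# For pressure-style conditions, the healthy status is "False".
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def unhealthy_conditions(conditions: list[dict]) -> list[str]:
    """Return 'Type=Status' strings for every condition that is not healthy."""
    problems = []
    for cond in conditions:
        ctype, status = cond["type"], cond["status"]
        if ctype == "Ready" and status != "True":
            # Ready must be "True"; "False" and "Unknown" both indicate a problem.
            problems.append(f"{ctype}={status}")
        elif ctype in PRESSURE_CONDITIONS and status != "False":
            problems.append(f"{ctype}={status}")
    return problems


example = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "False"},
]
print(unhealthy_conditions(example))
```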

Deep Debugging Workflows

CrashLoopBackOff Investigation

1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Current memory/CPU usage (metrics reflect the running container, not the moment of the crash)
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
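
The branch in steps 4 and 5 can be sketched as a classifier over the previous container termination. The example assumes access to one entry of the pod's `status.containerStatuses` in Kubernetes JSON form, which records `lastState.terminated.reason` and `exitCode`; the function name, category labels, and the wording of the "clean_exit" and "none" suggestions are hypothetical.

```python
def classify_last_termination(container_status: dict) -> str:
    """Classify why the previous container instance stopped."""
    terminated = (container_status.get("lastState") or {}).get("terminated")
    if not terminated:
        return "none"
    if terminated.get("reason") == "OOMKilled":
        return "oom"
    if terminated.get("exitCode", 0) != 0:
        return "app_error"
    return "clean_exit"


# Suggested follow-up per category, following steps 4 and 5 above.
NEXT_STEP = {
    "oom": "compare requests/limits to actual usage (get_pod_metrics)",
    "app_error": "check previous logs for a stack trace (get_pod_logs with previous=True)",
    "clean_exit": "review probes and the container command (describe_pod)",
    "none": "no previous termination recorded; check events (get_events)",
}

status = {"name": "app", "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
print(NEXT_STEP[classify_last_termination(status)])
```

Exit code 137 alone is not treated as an OOM kill here, because it only indicates SIGKILL; the `reason` field is the more specific signal.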

Networking Issues

1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If endpoints are empty: no Ready pods match the service selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
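
Step 3 depends on how a service selector is matched against pod labels: a pod backs the service only if every key/value pair in the selector appears in the pod's labels. The sketch below illustrates that rule on plain dictionaries; the function names and the sample data are hypothetical, and pod readiness is not modeled.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every selector key/value pair is present in the pod labels.

    A service without a selector gets no automatically managed endpoints,
    so an empty selector is treated as matching nothing.
    """
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def pods_backing_service(selector: dict, pods: list[dict]) -> list[str]:
    """Names of pods whose labels satisfy the service selector."""
    return [p["name"] for p in pods if selector_matches(selector, p["labels"])]


pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web-v2", "tier": "frontend"}},
]
# An empty result here corresponds to the "empty endpoints" case in step 3.
print(pods_backing_service({"app": "web"}, pods))
```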

Storage Problems

1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
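
Step 4 can be sketched as a small check that turns a Pending PVC into a list of hints. The example assumes the PVC is available as Kubernetes-style JSON (`status.phase`, `spec.storageClassName`, `spec.accessModes`) and that the storage class names have already been collected; the function name and hint wording are hypothetical.

```python
def pending_pvc_hints(pvc: dict, storage_class_names: set[str]) -> list[str]:
    """Suggest what to check for a PVC that is stuck in Pending."""
    if pvc.get("status", {}).get("phase") != "Pending":
        return []
    spec = pvc.get("spec", {})
    hints = []
    sc = spec.get("storageClassName")
    if sc is None:
        hints.append("no storageClassName set: a default StorageClass must exist")
    elif sc == "":
        hints.append("storageClassName is empty: a matching pre-provisioned PV is required")
    elif sc not in storage_class_names:
        hints.append(f"StorageClass '{sc}' not found")
    modes = spec.get("accessModes", [])
    if modes:
        hints.append("confirm the provisioner supports access modes: " + ", ".join(modes))
    return hints


pvc = {
    "status": {"phase": "Pending"},
    "spec": {"storageClassName": "fast", "accessModes": ["ReadWriteMany"]},
}
for hint in pending_pvc_hints(pvc, {"standard"}):
    print(hint)
```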

DNS Resolution

1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="<coredns-pod>", namespace="kube-system") - Logs from a CoreDNS pod listed in step 3

Multi-Cluster Debugging

All tools support the `context` parameter for targeting different clusters:

get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
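
When the same check is needed on several clusters, the `context` parameter makes it easy to loop. The sketch below assumes the tools can be called as Python functions taking `context` as a keyword argument; `across_contexts` is a hypothetical helper, and the `get_pods` defined here is a stand-in stub, not the real tool.

```python
from typing import Any, Callable, Dict, Iterable


def across_contexts(tool: Callable[..., Any], contexts: Iterable[str],
                    **kwargs: Any) -> Dict[str, Any]:
    """Call the same tool once per cluster context and key results by context."""
    return {ctx: tool(context=ctx, **kwargs) for ctx in contexts}


def get_pods(namespace: str, context: str) -> str:
    # Hypothetical stand-in for the real get_pods tool.
    return f"pods in {namespace} on {context}"


results = across_contexts(get_pods, ["production-cluster", "staging-cluster"],
                          namespace="kube-system")
for ctx, output in results.items():
    print(ctx, "->", output)
```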

Diagnostic Scripts

For comprehensive diagnostics, run the bundled scripts:

See scripts/diagnose-pod.py for automated pod analysis
See scripts/health-check.sh for cluster health checks

Decision Tree

See references/DECISION-TREE.md for visual troubleshooting flowcharts.

Common Errors Reference

See references/COMMON-ERRORS.md for error message explanations and fixes.
