k8s-troubleshoot

Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

When to Apply

Use this skill when:
  • User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
  • Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
  • Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
  • Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"

Priority Rules

| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check pod status first | CRITICAL | get_pods, describe_pod |
| 2 | View recent events | CRITICAL | get_events |
| 3 | Inspect logs (including previous) | HIGH | get_pod_logs |
| 4 | Check resource metrics | HIGH | get_pod_metrics |
| 5 | Verify endpoints | MEDIUM | get_endpoints |
| 6 | Review network policies | MEDIUM | get_network_policies |
| 7 | Examine node status | LOW | get_nodes, describe_node |
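
The priority order can be captured as data so that a script or agent walks the checks in the same sequence every time. This is a minimal sketch: the `Check` dataclass and the `ordered_tools` helper are hypothetical names introduced here for illustration, and only the rules, impact levels, and tool names come from the priority rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    priority: int
    rule: str
    impact: str
    tools: tuple[str, ...]


# Rows copied from the priority rules; nothing here calls a real tool.
PRIORITY_RULES = [
    Check(1, "Check pod status first", "CRITICAL", ("get_pods", "describe_pod")),
    Check(2, "View recent events", "CRITICAL", ("get_events",)),
    Check(3, "Inspect logs (including previous)", "HIGH", ("get_pod_logs",)),
    Check(4, "Check resource metrics", "HIGH", ("get_pod_metrics",)),
    Check(5, "Verify endpoints", "MEDIUM", ("get_endpoints",)),
    Check(6, "Review network policies", "MEDIUM", ("get_network_policies",)),
    Check(7, "Examine node status", "LOW", ("get_nodes", "describe_node")),
]

IMPACT_LEVELS = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]


def ordered_tools(min_impact: str = "LOW") -> list[str]:
    """Return tool names in priority order, stopping at min_impact."""
    allowed = set(IMPACT_LEVELS[: IMPACT_LEVELS.index(min_impact) + 1])
    tools: list[str] = []
    for check in sorted(PRIORITY_RULES, key=lambda c: c.priority):
        if check.impact in allowed:
            tools.extend(check.tools)
    return tools
```

Calling `ordered_tools("CRITICAL")` limits a first pass to pod status and events, which is usually enough to decide where to dig next.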

Quick Reference

| Symptom | First Tool | Next Steps |
|---|---|---|
| Pod Pending | describe_pod | Check events, node capacity, resource requests |
| CrashLoopBackOff | get_pod_logs(previous=True) | Check exit code, resources, liveness probes |
| ImagePullBackOff | describe_pod | Verify image name, registry auth, network |
| OOMKilled | get_pod_metrics | Increase memory limits, check for memory leaks |
| ContainerCreating | describe_pod | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | describe_pod | Check finalizers, PDBs, preStop hooks |
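
The quick reference is effectively a lookup from symptom to first tool. A minimal sketch of that lookup follows; the `QUICK_REFERENCE` mapping and `first_step` function are hypothetical names, and the fallback to `describe_pod` for unlisted symptoms is an assumption made here rather than a rule from the reference.

```python
# Symptom -> (first tool, keyword arguments, next steps), copied from the quick reference.
QUICK_REFERENCE = {
    "Pending": ("describe_pod", {}, "Check events, node capacity, resource requests"),
    "CrashLoopBackOff": ("get_pod_logs", {"previous": True},
                         "Check exit code, resources, liveness probes"),
    "ImagePullBackOff": ("describe_pod", {}, "Verify image name, registry auth, network"),
    "OOMKilled": ("get_pod_metrics", {}, "Increase memory limits, check for memory leaks"),
    "ContainerCreating": ("describe_pod", {}, "Check PVC binding, secrets, configmaps"),
    "Terminating": ("describe_pod", {}, "Check finalizers, PDBs, preStop hooks"),
}


def first_step(symptom: str) -> tuple[str, dict, str]:
    """Return the first tool, its extra arguments, and the suggested next steps."""
    # Assumption: describe_pod is a reasonable default for symptoms not listed.
    default = ("describe_pod", {}, "Check events and conditions")
    return QUICK_REFERENCE.get(symptom, default)
```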

Diagnostic Workflows

Pod Not Starting

1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
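
The four steps above can be sketched as one function that runs them in order and collects the results. How the tools are actually invoked depends on the MCP client in use, so this sketch takes them as a dictionary of callables; the `tools` dictionary, the `diagnose_pod_not_starting` name, and the report keys are all hypothetical. Only the tool names and parameters come from the steps above.

```python
from typing import Any, Callable


def diagnose_pod_not_starting(
    tools: dict[str, Callable[..., Any]],
    name: str,
    namespace: str,
    label_selector: str = "",
) -> dict[str, Any]:
    """Run the four steps in order and return each tool's raw result."""
    report: dict[str, Any] = {}
    report["pods"] = tools["get_pods"](namespace=namespace, label_selector=label_selector)
    report["describe"] = tools["describe_pod"](name=name, namespace=namespace)
    report["events"] = tools["get_events"](
        namespace=namespace,
        field_selector=f"involvedObject.name={name}",
    )
    # Previous logs only exist once a container has restarted, so a failure
    # here is recorded instead of aborting the whole report.
    try:
        report["previous_logs"] = tools["get_pod_logs"](
            name=name, namespace=namespace, previous=True
        )
    except Exception as exc:
        report["previous_logs"] = f"unavailable: {exc}"
    return report
```

Passing the tools in as callables keeps the sequence testable with stand-ins before it is wired to a live cluster.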

Common Pod States

| State | Likely Cause | Tools to Use |
|---|---|---|
| Pending | Scheduling issues | describe_pod, get_nodes, get_events |
| ImagePullBackOff | Registry/auth | describe_pod, check image name |
| CrashLoopBackOff | App crash | get_pod_logs(previous=True) |
| OOMKilled | Memory limit | get_pod_metrics, adjust limits |
| ContainerCreating | Volume/network | describe_pod, get_pvc |
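
To pick the right row automatically, the state has to be read from the pod object. The sketch below assumes a pod dictionary shaped like the output of `kubectl get pod -o json`; whether the kubectl-mcp-server tools return that exact shape is not confirmed here, and `pod_symptom` is a hypothetical helper name.

```python
def pod_symptom(pod: dict) -> str:
    """Derive a single symptom string from a Pod object (kubectl JSON shape)."""
    status = pod.get("status", {})
    statuses = status.get("containerStatuses", [])

    # An OOM kill is reported first, because a crash-looping container shows
    # CrashLoopBackOff as its waiting reason while the real cause sits in
    # lastState.terminated.
    for cs in statuses:
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"

    # Otherwise use the waiting reason, e.g. ImagePullBackOff or ContainerCreating.
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason"):
            return waiting["reason"]

    # With no container-level reason, fall back to the pod phase (e.g. Pending).
    return status.get("phase", "Unknown")
```

Checking for OOMKilled before the waiting reason is a design choice of this sketch: it points the investigation at memory limits rather than at the application logs.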

Node Issues

1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
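
Step 3 amounts to comparing each node condition with its healthy value: Ready should be "True", and the pressure conditions should be "False". A minimal sketch, assuming a node dictionary shaped like `kubectl get node -o json`; the `unhealthy_conditions` helper is a hypothetical name.

```python
# Healthy value for each condition named in step 3.
EXPECTED_CONDITIONS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
}


def unhealthy_conditions(node: dict) -> list[str]:
    """List conditions whose status differs from the healthy value."""
    conditions = node.get("status", {}).get("conditions", [])
    found = {c["type"]: c["status"] for c in conditions}
    problems = []
    for cond, expected in EXPECTED_CONDITIONS.items():
        # A condition that is not reported at all is treated as Unknown.
        actual = found.get(cond, "Unknown")
        if actual != expected:
            problems.append(f"{cond}={actual}")
    return problems
```

An empty list means all four conditions look healthy; anything else is a cue to read the kubelet logs in step 4.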

Deep Debugging Workflows

CrashLoopBackOff Investigation

1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
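
For step 4, comparing usage with limits means converting Kubernetes memory quantities such as "512Mi" to a common unit first. The sketch below handles only the common binary and decimal suffixes, not the full quantity grammar; `parse_memory` and `memory_headroom` are hypothetical helper names.

```python
# Multipliers for the most common Kubernetes memory suffixes.
_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    # No recognized suffix: treat the value as plain bytes.
    return int(float(quantity))


def memory_headroom(usage: str, limit: str) -> float:
    """Fraction of the limit still unused; values near 0 suggest OOM risk."""
    return 1 - parse_memory(usage) / parse_memory(limit)
```

A headroom close to zero shortly before a restart is consistent with an OOM kill, while ample headroom suggests looking at the application logs instead.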

Networking Issues

1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If endpoints are empty: no Ready pods match the service's selector (check pod labels and readiness probes)
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
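
The empty-endpoints case in step 3 can be checked offline by comparing the service's selector with pod labels. A minimal sketch, assuming service and pod dictionaries shaped like `kubectl get ... -o json` output; `selector_matches` and `pods_behind_service` are hypothetical helper names.

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every selector key/value pair is present in the pod labels."""
    # A service without a selector gets no automatically managed endpoints.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def pods_behind_service(service: dict, pods: list[dict]) -> list[str]:
    """Names of pods whose labels satisfy the service's selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels", {}))
    ]
```

An empty result points to a label or selector typo. A non-empty result with empty endpoints suggests the matching pods are failing their readiness probes, since only Ready pods are listed as endpoint addresses.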

Storage Problems

1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
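
Step 4 can be sketched as a small check that cross-references a Pending PVC with the storage classes found in step 3. This assumes a PVC dictionary shaped like `kubectl get pvc -o json` and a set of storage class names gathered separately; `pending_pvc_hints` is a hypothetical helper, and the hints are prompts for a human to verify, not definitive diagnoses.

```python
def pending_pvc_hints(pvc: dict, storage_class_names: set[str]) -> list[str]:
    """Return things worth checking for a PVC that is stuck in Pending."""
    if pvc.get("status", {}).get("phase") != "Pending":
        return []

    hints = []
    spec = pvc.get("spec", {})
    storage_class = spec.get("storageClassName")
    if not storage_class:
        hints.append(
            "no storageClassName set: binding depends on a default "
            "StorageClass or a pre-provisioned volume"
        )
    elif storage_class not in storage_class_names:
        known = ", ".join(sorted(storage_class_names)) or "none"
        hints.append(f"StorageClass '{storage_class}' not found among: {known}")

    modes = ", ".join(spec.get("accessModes", []))
    if modes:
        hints.append(f"requested accessModes: {modes}; confirm the provisioner supports them")
    return hints
```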

DNS Resolution

1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")
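
Step 4 writes the pod name as `coredns-*`. Whether `get_pod_logs` expands wildcards is not stated here, so a cautious approach is to expand the pattern against the pod names returned in step 3 and fetch logs one pod at a time. In this sketch `matching_pods` and `collect_logs` are hypothetical helpers, and `get_pod_logs` is passed in as a callable because its real invocation depends on the MCP client.

```python
from fnmatch import fnmatchcase
from typing import Any, Callable


def matching_pods(pod_names: list[str], pattern: str) -> list[str]:
    """Expand a shell-style pattern against known pod names (case-sensitive)."""
    return [name for name in pod_names if fnmatchcase(name, pattern)]


def collect_logs(
    pod_names: list[str],
    pattern: str,
    get_pod_logs: Callable[..., Any],
    namespace: str = "kube-system",
) -> dict[str, Any]:
    """Fetch logs for every pod whose name matches the pattern."""
    return {
        name: get_pod_logs(name=name, namespace=namespace)
        for name in matching_pods(pod_names, pattern)
    }
```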

Multi-Cluster Debugging

All tools accept a context parameter for targeting a specific cluster:

```python
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
```
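
When the same check needs to run against several clusters, the context parameter can be looped over. A minimal sketch: `run_across_contexts` is a hypothetical helper, and the tool is passed in as a callable since the real invocation depends on the MCP client. Errors are kept per context so one unreachable cluster does not hide results from the others.

```python
from typing import Any, Callable


def run_across_contexts(
    tool: Callable[..., Any],
    contexts: list[str],
    **kwargs: Any,
) -> dict[str, Any]:
    """Call one tool once per context and collect results or errors."""
    results: dict[str, Any] = {}
    for context in contexts:
        try:
            results[context] = tool(context=context, **kwargs)
        except Exception as exc:
            # Keep the exception so the caller can see which cluster failed.
            results[context] = exc
    return results
```

For example, `run_across_contexts(get_pods, ["production-cluster", "staging-cluster"], namespace="kube-system")` would return one entry per cluster, assuming `get_pods` is exposed as a Python callable.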

Diagnostic Scripts

For comprehensive diagnostics, run the bundled scripts:
  • See scripts/diagnose-pod.py for automated pod analysis
  • See scripts/health-check.sh for cluster health checks

Decision Tree

See references/DECISION-TREE.md for visual troubleshooting flowcharts.

Common Errors Reference

See references/COMMON-ERRORS.md for error message explanations and fixes.

Related Tools

Core Diagnostics

  • get_pods, describe_pod, get_pod_logs, get_pod_metrics
  • get_events, get_nodes, describe_node
  • get_resource_usage, compare_namespaces

Advanced (Ecosystem)

  • Cilium: cilium_endpoints_list_tool, hubble_flows_query_tool
  • Istio: istio_proxy_status_tool, istio_analyze_tool

Related Skills

  • k8s-diagnostics - Metrics and health checks
  • k8s-incident - Emergency runbooks
  • k8s-networking - Network troubleshooting