kubernetes-debug

Kubernetes Debugging

Core Principle: Events Before Logs

ALWAYS check pod events BEFORE logs. Events explain roughly 80% of issues, and faster than logs do:
  • OOMKilled → Memory limit exceeded
  • ImagePullBackOff → Image not found or auth issue
  • FailedScheduling → No nodes with enough resources
  • CrashLoopBackOff → Container crashing repeatedly

Available Scripts

All scripts are in .claude/skills/infrastructure-kubernetes/scripts/

list_pods.py - List pods with status

bash
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>]

Examples:

bash
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
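
The source does not show how list_pods.py works internally. As a rough sketch only, assuming the pod list comes from `kubectl get pods -o json` in the standard Pod JSON shape, a status summary might be built like this (the function name `summarize_pods` is hypothetical):

```python
import json


def summarize_pods(pod_list_json: str) -> list[dict]:
    """Summarize name, phase, and total restarts from `kubectl get pods -o json` output."""
    rows = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        container_statuses = status.get("containerStatuses", [])
        rows.append({
            "name": pod["metadata"]["name"],
            "phase": status.get("phase", "Unknown"),
            # Sum restarts across all containers in the pod.
            "restarts": sum(s.get("restartCount", 0) for s in container_statuses),
        })
    return rows
```

A pod with a high restart count or a phase other than Running is the natural candidate to pass to get_events.py next.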

get_events.py - Get pod events (USE FIRST!)

bash
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace>

Example:

bash
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
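
The internals of get_events.py are not shown in the source. One plausible way to fetch events for a single pod is a `kubectl get events` call filtered by `involvedObject.name`. The sketch below only builds that command and formats one event; both function names are hypothetical, and the event fields assume the core v1 Event JSON shape.

```python
def events_command(pod: str, namespace: str) -> list[str]:
    """Build a kubectl command that lists events for one pod, oldest first."""
    return [
        "kubectl", "get", "events",
        "-n", namespace,
        "--field-selector", f"involvedObject.name={pod}",
        "--sort-by=.metadata.creationTimestamp",
        "-o", "json",
    ]


def format_event(event: dict) -> str:
    """Render one event as '[timestamp] reason: message'."""
    timestamp = event.get("lastTimestamp") or event.get("eventTime") or "unknown"
    return f"[{timestamp}] {event.get('reason', '?')}: {event.get('message', '')}"
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and each entry of the resulting `items` array passed to `format_event`.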

get_logs.py - Get pod logs

bash
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME]

Examples:

bash
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment
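
To illustrate how the documented flags could map onto `kubectl logs`, here is a minimal sketch. It assumes the script is a thin argparse wrapper around kubectl, which the source does not confirm; `build_parser` and `logs_command` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented usage: <pod-name> -n <namespace> [--tail N] [--container NAME]."""
    parser = argparse.ArgumentParser(prog="get_logs.py")
    parser.add_argument("pod")
    parser.add_argument("-n", dest="namespace", required=True)
    parser.add_argument("--tail", type=int, default=None)
    parser.add_argument("--container", default=None)
    return parser


def logs_command(args: argparse.Namespace) -> list[str]:
    """Translate parsed arguments into a kubectl logs invocation."""
    cmd = ["kubectl", "logs", args.pod, "-n", args.namespace]
    if args.tail is not None:
        cmd.append(f"--tail={args.tail}")
    if args.container:
        cmd += ["-c", args.container]
    return cmd
```

Limiting output with `--tail` keeps the log excerpt small enough to read alongside the events.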

describe_pod.py - Detailed pod info

bash
python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace>

get_resources.py - Resource usage vs limits

bash
python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>
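
Comparing usage against limits means converting Kubernetes quantity strings such as `128Mi` into numbers. The sketch below handles a common subset of memory suffixes and computes a usage ratio. It is an illustration under that assumption, not the script's actual logic, and both function names are hypothetical.

```python
# Common Kubernetes memory suffixes; this is a subset, not the full quantity grammar.
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,   # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,      # decimal suffixes
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '1G' to bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # plain byte count, no suffix


def memory_usage_ratio(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit."""
    return parse_memory(usage) / parse_memory(limit)
```

A ratio close to 1.0 shortly before a restart is consistent with an OOMKilled container.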

describe_deployment.py - Deployment status

bash
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace>

get_history.py - Rollout history

bash
python .claude/skills/infrastructure-kubernetes/scripts/get_history.py <deployment-name> -n <namespace>

Debugging Workflows

Pod Not Starting (Pending/CrashLoopBackOff)

  1. list_pods.py - Check pod status
  2. get_events.py - Look for scheduling/pull/crash events
  3. describe_pod.py - Check conditions and container states
  4. get_logs.py - Only if the events don't explain the problem
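
The steps above can be sketched as an ordered list of script invocations. The script paths and arguments come from the usage lines in this document. The helper name `pod_not_starting_steps` and its `include_logs` flag are hypothetical, added only to show that the log step is optional.

```python
SCRIPTS = ".claude/skills/infrastructure-kubernetes/scripts"


def pod_not_starting_steps(pod: str, namespace: str, include_logs: bool = True) -> list[list[str]]:
    """Return the commands for the 'Pod Not Starting' workflow, in order."""
    steps = [
        ["python", f"{SCRIPTS}/list_pods.py", "-n", namespace],
        ["python", f"{SCRIPTS}/get_events.py", pod, "-n", namespace],
        ["python", f"{SCRIPTS}/describe_pod.py", pod, "-n", namespace],
    ]
    if include_logs:
        # Logs come last, and only when the events did not explain the failure.
        steps.append(["python", f"{SCRIPTS}/get_logs.py", pod, "-n", namespace])
    return steps
```

Each entry can be handed to `subprocess.run` one at a time, which leaves room to stop as soon as an event explains the failure.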

Pod Restarting (OOMKilled/Crashes)

  1. get_events.py - Check for OOMKilled or error events
  2. get_resources.py - Compare usage vs limits
  3. get_logs.py - Check for errors before crash
  4. describe_pod.py - Check restart count and state
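
In the Pod API object, a container's previous termination reason and its restart count are available under `status.containerStatuses`. As a minimal sketch, assuming pod JSON from `kubectl get pod <name> -o json`, the restart check could look like this (the function name is hypothetical):

```python
def oom_killed_containers(pod: dict) -> list[tuple[str, int]]:
    """Return (container name, restart count) for containers last terminated by OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits
```

An empty result suggests the restarts have another cause, so the logs from before the crash become the next thing to read.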

Deployment Not Progressing

  1. describe_deployment.py - Check replica counts
  2. list_pods.py - Find stuck pods
  3. get_events.py - Check events on stuck pods
  4. get_history.py - Check rollout history to find a revision to roll back to
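
For the replica-count check, the Deployment object exposes the desired count in `spec.replicas` and progress in `status`. The following sketch assumes deployment JSON from `kubectl get deployment <name> -o json`. The function name and the returned keys are hypothetical.

```python
def rollout_summary(deployment: dict) -> dict:
    """Summarize desired vs. updated/available replicas and whether the rollout stalled."""
    status = deployment.get("status", {})
    progressing = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Progressing"),
        {},
    )
    return {
        "desired": deployment.get("spec", {}).get("replicas", 1),  # API default is 1
        "updated": status.get("updatedReplicas", 0),
        "available": status.get("availableReplicas", 0),
        # Kubernetes sets this reason when the progress deadline has passed.
        "stalled": progressing.get("reason") == "ProgressDeadlineExceeded",
    }
```

When `updated` or `available` is below `desired`, the pods that are not ready are the ones to inspect with list_pods.py and get_events.py.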

Common Issues & Solutions

Event Reason       Meaning                           Action
OOMKilled          Container exceeded memory limit   Increase limits or fix memory leak
ImagePullBackOff   Can't pull image                  Check image name, registry auth
CrashLoopBackOff   Container keeps crashing          Check logs for startup errors
FailedScheduling   No node can run pod               Check node resources, taints
Unhealthy          Liveness probe failed             Check probe config, app health
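
The table can also be expressed as a lookup, which may be handy when turning event reasons into recommendations programmatically. The entries below are copied from the table. The `recommend` helper and its fallback text are assumptions for illustration.

```python
# Event reason -> (meaning, action), taken from the table above.
COMMON_ISSUES = {
    "OOMKilled": ("Container exceeded memory limit", "Increase limits or fix memory leak"),
    "ImagePullBackOff": ("Can't pull image", "Check image name, registry auth"),
    "CrashLoopBackOff": ("Container keeps crashing", "Check logs for startup errors"),
    "FailedScheduling": ("No node can run pod", "Check node resources, taints"),
    "Unhealthy": ("Liveness probe failed", "Check probe config, app health"),
}


def recommend(reason: str) -> str:
    """Return the suggested action for an event reason, or a generic fallback."""
    _meaning, action = COMMON_ISSUES.get(
        reason, ("Unknown reason", "Inspect pod description and logs")
    )
    return action
```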

Output Format

When reporting findings, use this structure:

Kubernetes Analysis

Pod: <name> Namespace: <namespace> Status: <phase> (Restarts: N)

Events

  • [timestamp] <reason>: <message>

Issues Found

  1. [Issue description with evidence]

Root Cause Hypothesis

[Based on events and logs]

Recommended Action

[Specific remediation step]
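
For reports assembled in code, the structure above could be filled in with a small helper. This is a sketch only: the function name, parameter names, and the tuple shape used for events are assumptions, and the layout simply follows the template.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action) -> str:
    """Render findings in the report structure; events are (timestamp, reason, message) tuples."""
    lines = [
        "Kubernetes Analysis",
        "",
        f"Pod: {pod} Namespace: {namespace} Status: {phase} (Restarts: {restarts})",
        "",
        "Events",
        "",
    ]
    lines += [f"  • [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines += ["", "Issues Found", ""]
    lines += [f"  {i}. {issue}" for i, issue in enumerate(issues, start=1)]
    lines += ["", "Root Cause Hypothesis", "", hypothesis]
    lines += ["", "Recommended Action", "", action]
    return "\n".join(lines)
```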