kubernetes-debug
Kubernetes Debugging
Core Principle: Events Before Logs
ALWAYS check pod events BEFORE logs. Events explain roughly 80% of issues, and do so faster than logs:
- OOMKilled → Memory limit exceeded
- ImagePullBackOff → Image not found or auth issue
- FailedScheduling → No nodes with enough resources
- CrashLoopBackOff → Container crashing repeatedly
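The events-first rule above can be sketched as a small lookup. This is only an illustration: `KNOWN_REASONS` and `explain_events` are hypothetical names, not part of the scripts in this skill, and the mapping simply restates the list above.

```python
# Hypothetical lookup that restates the reasons listed above.
KNOWN_REASONS = {
    "OOMKilled": "Memory limit exceeded",
    "ImagePullBackOff": "Image not found or auth issue",
    "FailedScheduling": "No nodes with enough resources",
    "CrashLoopBackOff": "Container crashing repeatedly",
}


def explain_events(event_reasons):
    """Return explanations for the event reasons that are recognized.

    An empty result means the events did not explain the issue,
    so logs are the next place to look.
    """
    return [
        f"{reason}: {KNOWN_REASONS[reason]}"
        for reason in event_reasons
        if reason in KNOWN_REASONS
    ]


print(explain_events(["Pulled", "OOMKilled"]))
```

Keeping the fallback to logs as "empty result" makes the order of checks explicit: events first, logs only when nothing matches.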
Available Scripts
All scripts are in .claude/skills/infrastructure-kubernetes/scripts/
list_pods.py - List pods with status
bash
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>]
Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
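The internals of list_pods.py are not shown here. As a rough sketch of what "pods with status" could mean, the snippet below summarizes the JSON that `kubectl get pods -o json` returns. The function name `summarize_pods` and the sample data are assumptions made for illustration.

```python
import json


def summarize_pods(pod_list_json):
    """Return (name, phase, restarts) tuples from `kubectl get pods -o json` output."""
    rows = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        # Restart counts are tracked per container, so sum them per pod.
        restarts = sum(
            cs.get("restartCount", 0)
            for cs in status.get("containerStatuses", [])
        )
        rows.append((pod["metadata"]["name"], status.get("phase", "Unknown"), restarts))
    return rows


# Made-up sample data in the shape of a PodList.
sample = json.dumps({"items": [{
    "metadata": {"name": "payment-7f8b9c6d5-x2k4m"},
    "status": {"phase": "Running", "containerStatuses": [{"restartCount": 3}]},
}]})
print(summarize_pods(sample))
```

Note that the phase alone can hide problems: a pod in CrashLoopBackOff often still reports phase Running, which is one more reason to read events next.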
get_events.py - Get pod events (USE FIRST!)
bash
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace>
Example:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
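How get_events.py fetches events is not documented here. If it wraps kubectl (an assumption), the equivalent query might look like the sketch below. `events_command` is a hypothetical helper; the kubectl flags it uses (`--field-selector`, `--sort-by`) are standard.

```python
def events_command(pod_name, namespace):
    """Build a kubectl argv list that lists events for one pod, sorted by last timestamp."""
    return [
        "kubectl", "get", "events",
        "-n", namespace,
        # Restrict the result to events whose involved object is this pod.
        "--field-selector", f"involvedObject.kind=Pod,involvedObject.name={pod_name}",
        "--sort-by=.lastTimestamp",
    ]


print(" ".join(events_command("payment-7f8b9c6d5-x2k4m", "otel-demo")))
```

The list can be passed directly to `subprocess.run(cmd, capture_output=True, text=True)`, which avoids shell quoting issues with pod names.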
get_logs.py - Get pod logs
bash
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME]
Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment
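Once logs are in hand, the useful part is usually the last few error lines before the crash. A minimal sketch of that filter follows; `error_lines` and the keyword pattern are assumptions for illustration and would need tuning to the application's log format.

```python
import re

# Illustrative, case-sensitive keywords; adjust to the application's logging style.
ERROR_PATTERN = re.compile(r"ERROR|FATAL|panic|Traceback|Exception")


def error_lines(log_text, limit=20):
    """Return up to `limit` of the most recent lines that look like errors."""
    matches = [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]
    return matches[-limit:] if limit > 0 else []


log = "INFO start\nERROR db down\nINFO retry\njava.lang.NullPointerException at X\n"
print(error_lines(log))
```

For a container that has already restarted, the relevant output is in the previous instance's logs, which plain kubectl exposes through `kubectl logs --previous`.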
describe_pod.py - Detailed pod info
bash
python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace>
get_resources.py - Resource usage vs limits
bash
python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>
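Comparing usage against limits means converting Kubernetes quantity strings such as 240Mi and 256Mi to the same unit. The sketch below shows one way to do that for memory. `parse_memory` and `memory_ratio` are hypothetical helpers, and only a subset of quantity suffixes is handled (no exponent notation, no milli-units).

```python
# Subset of Kubernetes quantity suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a memory quantity such as '128Mi' to bytes."""
    # Check two-letter suffixes first so 'Mi' is not mistaken for another unit.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # plain byte count, e.g. "1048576"


def memory_ratio(usage, limit):
    """Return usage as a fraction of the limit."""
    return parse_memory(usage) / parse_memory(limit)


print(f"{memory_ratio('240Mi', '256Mi'):.2f}")
```

A ratio close to 1.0 on a pod that keeps restarting is consistent with OOMKilled events and points at the memory limit rather than at an application bug.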
describe_deployment.py - Deployment status
bash
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace>
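When reading deployment status, the key question is whether the status counters have caught up with the desired replica count. A minimal sketch of that comparison, assuming the Deployment object has been parsed from JSON (for example from `kubectl get deployment <name> -o json`); `rollout_gaps` is a hypothetical name.

```python
def rollout_gaps(deployment):
    """Return status counters that are below the desired replica count, with the shortfall."""
    desired = deployment.get("spec", {}).get("replicas", 1)  # replicas defaults to 1
    status = deployment.get("status", {})
    return {
        field: desired - status.get(field, 0)
        for field in ("updatedReplicas", "readyReplicas", "availableReplicas")
        if status.get(field, 0) < desired
    }


# Made-up example: 3 desired, all updated, only 1 ready, none available yet.
example = {"spec": {"replicas": 3}, "status": {"updatedReplicas": 3, "readyReplicas": 1}}
print(rollout_gaps(example))
```

An empty result suggests the rollout has converged; a gap in readyReplicas with updatedReplicas already at the target usually means the new pods exist but are not passing their probes.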
get_history.py - Rollout history
bash
python .claude/skills/infrastructure-kubernetes/scripts/get_history.py <deployment-name> -n <namespace>
Debugging Workflows
Pod Not Starting (Pending/CrashLoopBackOff)
- list_pods.py - Check pod status
- get_events.py - Look for scheduling/pull/crash events
- describe_pod.py - Check conditions and container states
- get_logs.py - Only if events don't explain the failure
Pod Restarting (OOMKilled/Crashes)
- get_events.py - Check for OOMKilled or error events
- get_resources.py - Compare usage vs limits
- get_logs.py - Check for errors before crash
- describe_pod.py - Check restart count and state
Deployment Not Progressing
- describe_deployment.py - Check replica counts
- list_pods.py - Find stuck pods
- get_events.py - Check events on stuck pods
- get_history.py - Check rollout history for rollback
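The three workflows above are ordered sequences of the same scripts, so they can be written down as data. The sketch below is one possible encoding, not part of the skill itself: the symptom keys, `WORKFLOWS`, and `plan` are hypothetical, while the script names, order, and arguments follow the usage lines above.

```python
SCRIPTS_DIR = ".claude/skills/infrastructure-kubernetes/scripts"

# Hypothetical symptom keys; the script order mirrors the workflows above.
WORKFLOWS = {
    "pod-not-starting": ["list_pods.py", "get_events.py", "describe_pod.py", "get_logs.py"],
    "pod-restarting": ["get_events.py", "get_resources.py", "get_logs.py", "describe_pod.py"],
    "deployment-not-progressing": [
        "describe_deployment.py", "list_pods.py", "get_events.py", "get_history.py",
    ],
}

DEPLOYMENT_SCRIPTS = {"describe_deployment.py", "get_history.py"}


def plan(symptom, namespace, pod="<pod-name>", deployment="<deployment-name>"):
    """Return the ordered argv lists for one workflow."""
    commands = []
    for script in WORKFLOWS[symptom]:
        argv = ["python", f"{SCRIPTS_DIR}/{script}"]
        if script in DEPLOYMENT_SCRIPTS:
            argv.append(deployment)
        elif script != "list_pods.py":  # list_pods.py takes no positional name
            argv.append(pod)
        argv += ["-n", namespace]
        commands.append(argv)
    return commands


for argv in plan("pod-restarting", "otel-demo", pod="payment-7f8b9c6d5-x2k4m"):
    print(" ".join(argv))
```

The plan only lists commands. Whether to run the later steps still depends on what the earlier ones show, for example skipping get_logs.py when the events already explain the failure.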
Common Issues & Solutions
| Event Reason | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded memory limit | Increase limits or fix memory leak |
| ImagePullBackOff | Can't pull image | Check image name, registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run pod | Check node resources, taints |
| Unhealthy | Liveness, readiness, or startup probe failed | Check probe config, app health |
Output Format
When reporting findings, use this structure:
Kubernetes Analysis
Pod: <name>
Namespace: <namespace>
Status: <phase> (Restarts: N)
Events
- [timestamp] <reason>: <message>
Issues Found
- [Issue description with evidence]
Root Cause Hypothesis
[Based on events and logs]
Recommended Action
[Specific remediation step]
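To keep reports consistent, the structure above can be filled in programmatically. The sketch below is one way to do it. `render_report` and its parameters are hypothetical, and the sample values are made up for illustration.

```python
def render_report(pod, namespace, phase, restarts, events, issues, hypothesis, action):
    """Render findings in the report structure described above.

    `events` is a list of (timestamp, reason, message) tuples;
    `issues` is a list of strings.
    """
    lines = [
        "Kubernetes Analysis",
        f"Pod: {pod}",
        f"Namespace: {namespace}",
        f"Status: {phase} (Restarts: {restarts})",
        "Events",
    ]
    lines += [f"- [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines.append("Issues Found")
    lines += [f"- {issue}" for issue in issues]
    lines += ["Root Cause Hypothesis", hypothesis, "Recommended Action", action]
    return "\n".join(lines)


# Made-up sample values, for illustration only.
print(render_report(
    pod="payment-7f8b9c6d5-x2k4m",
    namespace="otel-demo",
    phase="Running",
    restarts=4,
    events=[("10:02:11", "OOMKilled", "Container exceeded memory limit")],
    issues=["Container restarted 4 times after OOMKilled events"],
    hypothesis="Memory limit is too low for the workload, or the service leaks memory",
    action="Increase the memory limit or investigate the leak",
))
```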