k8s-debug


Kubernetes Debugging Expertise


Golden Rule: Events Before Logs


When debugging Kubernetes issues, ALWAYS check events first:
  1. get_pod_events → shows scheduling, pulling, starting, probes, OOM
  2. THEN get_pod_logs → application-level errors
Events explain most crash/scheduling issues faster than logs.
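The events-first rule can be sketched in plain shell. The event sample below is hypothetical (column layout follows `kubectl get events` output; the reason names match the table later in this guide):

```shell
# Hypothetical sample of `kubectl get events` output for one pod; on a real
# cluster you would run something like:
#   kubectl get events --field-selector involvedObject.name=<pod>
events='LAST SEEN  TYPE     REASON     MESSAGE
3m         Normal   Scheduled  Successfully assigned default/api-7d4f9 to node-1
3m         Normal   Pulled     Container image "api:1.4" already present on machine
90s        Warning  BackOff    Back-off restarting failed container
30s        Warning  OOMKilled  Container api exceeded its memory limit'

# Warning-type events usually explain a crash before the logs do:
warnings=$(printf '%s\n' "$events" | awk '$2 == "Warning" {print $3}')
echo "$warnings"
```

Only when these warnings do not explain the pod's state is it worth moving on to get_pod_logs.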

Typical Investigation Flow


1. list_pods        → Get overview of pod health in namespace
2. get_pod_events   → Understand WHY pods are in their state
3. get_pod_logs     → Only if events don't explain the issue
4. get_pod_resources → For performance/resource issues
5. describe_deployment → Check deployment status and conditions
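The flow above can be expressed as a toy dispatcher: given the STATUS column that list_pods reports, pick the next tool. This mapping is an illustration, not an official rule:

```shell
# Toy dispatcher: map a pod STATUS to the next tool in the flow above.
next_tool() {
  case "$1" in
    Pending|CrashLoopBackOff|ImagePullBackOff|Evicted) echo "get_pod_events" ;;
    Running) echo "get_pod_resources" ;;  # up but misbehaving: resource check
    *)       echo "get_pod_logs" ;;
  esac
}

next_tool CrashLoopBackOff   # prints: get_pod_events
```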

Common Issue Patterns


CrashLoopBackOff


First check: get_pod_events

Event Reason | Likely Cause                        | Next Step
OOMKilled    | Memory limit too low or memory leak | Check get_pod_resources, increase limits
Error        | Application crash                   | Check get_pod_logs for stack trace
BackOff      | Repeated failures                   | Check logs for startup errors

Checklist:
  • Memory limits vs actual usage
  • Recent deployment changes (get_deployment_history)
  • Missing config/secrets
  • Dependency failures (database, external services)
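The BackOff row reflects the kubelet's restart back-off, which doubles from a 10s base up to a 5-minute cap; that is why a pod in CrashLoopBackOff looks "stuck" between restarts. A quick sketch of the delay growth:

```shell
# CrashLoopBackOff restart delay: doubles each restart (10s, 20s, 40s, ...)
# and is capped at 300s (5 minutes) by the kubelet.
delay=10
restarts=6
i=1
while [ "$i" -lt "$restarts" ]; do
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
  i=$((i + 1))
done
echo "after $restarts restarts, next retry in ~${delay}s"
```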

OOMKilled


First check: get_pod_events (confirms OOMKilled)
Then: get_pod_resources (compare usage to limits)
Common causes:
  • Memory limit set too low for workload
  • Memory leak (usage increases over time)
  • Sudden traffic spike causing memory pressure
  • Large request payloads cached in memory
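Comparing usage to limits is simple arithmetic. A minimal sketch with hypothetical numbers, in the Mi units get_pod_resources would report:

```shell
# Hypothetical values: memory limit and current usage in Mi.
limit_mi=256
usage_mi=243

# Containers running close to their limit (>= 90% here) are at risk of
# the next spike being OOMKilled.
pct=$((usage_mi * 100 / limit_mi))
if [ "$pct" -ge 90 ]; then
  verdict="at-risk"
else
  verdict="ok"
fi
echo "memory: ${usage_mi}Mi / ${limit_mi}Mi (${pct}%) -> $verdict"
```

Steadily rising usage at constant load points at a leak; a one-off spike points at traffic or payload size.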

ImagePullBackOff


First check: get_pod_events
Common causes:
  • Wrong image name or tag
  • Private registry without imagePullSecrets
  • Rate limiting from registry
  • Network issues reaching registry
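A rough first check on the image reference itself (the reference below is hypothetical): a missing tag silently pulls `:latest`, a frequent source of the "wrong tag" case above.

```shell
# Hypothetical image reference from the pod spec.
image="registry.example.com/team/api"

# Classify the reference: digest-pinned, explicitly tagged, or untagged.
# Only the last path component is inspected, so registry ports don't confuse it.
last=${image##*/}
case "$last" in
  *@*) kind="pinned by digest" ;;
  *:*) kind="tagged" ;;
  *)   kind="untagged (pulls :latest)" ;;
esac
echo "$image -> $kind"
```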

Pending Pods


First check: get_pod_events
Look for:
  • FailedScheduling
    - Insufficient resources
  • Unschedulable
    - Node affinity/taints
  • No matching nodes for nodeSelector
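A FailedScheduling event message already contains the per-node breakdown. A sketch that pulls out the resource-shortage part (the message below is a hypothetical example in the scheduler's usual format), to separate "add capacity" from "fix taints/affinity":

```shell
# Hypothetical FailedScheduling event message.
msg='0/5 nodes are available: 3 Insufficient memory, 2 node(s) had untolerated taint {dedicated: gpu}.'

# Extract any "Insufficient <resource>" clauses.
shortage=$(printf '%s\n' "$msg" | grep -o 'Insufficient [a-z]*')
echo "$shortage"
```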

Readiness/Liveness Probe Failures


First check: describe_pod (shows probe config)
Then: get_pod_events (probe failure events)
Then: get_pod_logs (why the endpoint isn't responding)
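The probe config from describe_pod tells you how much slack the app has: a container is killed roughly periodSeconds * failureThreshold after liveness probes start failing. A sketch with hypothetical settings:

```shell
# Hypothetical liveness probe settings, as describe_pod would show them.
period_s=10
failure_threshold=3
initial_delay_s=15

# Time from first failed probe to restart ~= periodSeconds * failureThreshold.
# Too-tight values keep restarting slow-starting apps forever.
restart_after_s=$((period_s * failure_threshold))
echo "restart ~${restart_after_s}s after probes start failing (first probe at ${initial_delay_s}s)"
```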

Evicted Pods


First check: get_pod_events
Causes:
  • Node resource pressure (disk, memory)
  • Priority preemption
  • Taint-based eviction

Deployment Issues


Stuck Rollout


describe_deployment  → Check replicas (desired vs ready vs available)
get_deployment_history → Compare current vs previous revision
get_pod_events → For pods in new ReplicaSet
Common causes:
  • New pods failing (CrashLoopBackOff)
  • Readiness probes failing
  • Resource constraints preventing scheduling
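The replica comparison from describe_deployment can be reduced to one check (counts below are hypothetical): a rollout is only healthy once ready has climbed to desired, and flat numbers across repeated checks usually mean the new pods are failing.

```shell
# Hypothetical replica counts as describe_deployment reports them.
desired=5
updated=2
ready=1

if [ "$ready" -lt "$desired" ]; then
  status="stuck-or-progressing: $ready/$desired ready, $updated updated"
else
  status="complete"
fi
echo "$status"
```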

Rollback Decision


Use get_deployment_history to see previous working versions.

Error Classification


Non-Retryable (Stop Immediately)


  • 401 Unauthorized - Invalid credentials
  • 403 Forbidden - No permission
  • 404 Not Found - Resource doesn't exist
  • "config_required": true - Integration not configured

Retryable (May Retry Once)


  • 429 Too Many Requests
  • 500/502/503/504 Server errors
  • Timeout
  • Connection refused
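The two lists above can be sketched as one classifier. Status codes are from the lists; the `timeout` and `connection-refused` labels are hypothetical names for the non-HTTP cases:

```shell
# Classify an error by status code or kind: stop immediately, or allow
# a single retry.
classify() {
  case "$1" in
    401|403|404|config_required) echo "non-retryable" ;;
    429|500|502|503|504|timeout|connection-refused) echo "retryable" ;;
    *) echo "unknown" ;;
  esac
}

classify 503   # prints: retryable
classify 404   # prints: non-retryable
```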

Resource Investigation Pattern


For memory/CPU issues:
1. get_pod_resources → See allocation vs usage
2. describe_pod → See full container spec
3. get_cloudwatch_metrics/query_datadog_metrics → Historical usage
4. detect_anomalies on historical data → Find when issue started
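A toy version of step 4: given a memory series from the metrics tools (the numbers below are made up, in MiB per sample), find where usage first rises well above its starting baseline. The 1.5x threshold is an arbitrary illustration, not what detect_anomalies actually uses:

```shell
# Hypothetical memory series, MiB per sample interval.
series="210 212 215 214 390 405 410"

# Baseline = first sample; flag the first sample above 1.5x baseline.
baseline=${series%% *}
idx=0
start=0
for v in $series; do
  idx=$((idx + 1))
  if [ "$start" -eq 0 ] && [ "$v" -gt $((baseline * 3 / 2)) ]; then
    start=$idx
  fi
done
echo "anomaly starts at sample $start (baseline ${baseline}MiB)"
```

Correlating that start time with deployment history is usually the fastest way to pin the regression on a change.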