k8s-debug


Kubernetes Debugging Expertise


Golden Rule: Events Before Logs


When debugging Kubernetes issues, ALWAYS check events first:
  1. get_pod_events → shows scheduling, pulling, starting, probes, OOM
  2. THEN get_pod_logs → application-level errors
Events explain most crash/scheduling issues faster than logs.
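The events-first rule can be sketched in plain shell. The event sample below is hypothetical (column layout follows `kubectl get events` output; the reason names match the table later in this guide):

```shell
# Hypothetical sample of `kubectl get events` output for one pod; on a real
# cluster you would run something like:
#   kubectl get events --field-selector involvedObject.name=<pod>
events='LAST SEEN  TYPE     REASON     MESSAGE
3m         Normal   Scheduled  Successfully assigned default/api-7d4f9 to node-1
3m         Normal   Pulled     Container image "api:1.4" already present on machine
90s        Warning  BackOff    Back-off restarting failed container
30s        Warning  OOMKilled  Container api exceeded its memory limit'

# Warning-type events usually explain a crash before the logs do:
warnings=$(printf '%s\n' "$events" | awk '$2 == "Warning" {print $3}')
echo "$warnings"
```

Only when these warnings do not explain the pod's state is it worth moving on to get_pod_logs.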

Typical Investigation Flow


1. list_pods        → Get overview of pod health in namespace
2. get_pod_events   → Understand WHY pods are in their state
3. get_pod_logs     → Only if events don't explain the issue
4. get_pod_resources → For performance/resource issues
5. describe_deployment → Check deployment status and conditions
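The flow above can be expressed as a toy dispatcher: given the STATUS column that list_pods reports, pick the next tool. This mapping is an illustration, not an official rule:

```shell
# Toy dispatcher: map a pod STATUS to the next tool in the flow above.
next_tool() {
  case "$1" in
    Pending|CrashLoopBackOff|ImagePullBackOff|Evicted) echo "get_pod_events" ;;
    Running) echo "get_pod_resources" ;;  # up but misbehaving: resource check
    *)       echo "get_pod_logs" ;;
  esac
}

next_tool CrashLoopBackOff   # prints: get_pod_events
```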

Common Issue Patterns


CrashLoopBackOff


First check: get_pod_events

Event Reason | Likely Cause                        | Next Step
OOMKilled    | Memory limit too low or memory leak | Check get_pod_resources, increase limits
Error        | Application crash                   | Check get_pod_logs for stack trace
BackOff      | Repeated failures                   | Check logs for startup errors

Checklist:
  • Memory limits vs actual usage
  • Recent deployment changes (get_deployment_history)
  • Missing config/secrets
  • Dependency failures (database, external services)
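The BackOff row reflects the kubelet's restart back-off, which doubles from a 10s base up to a 5-minute cap; that is why a pod in CrashLoopBackOff looks "stuck" between restarts. A quick sketch of the delay growth:

```shell
# CrashLoopBackOff restart delay: doubles each restart (10s, 20s, 40s, ...)
# and is capped at 300s (5 minutes) by the kubelet.
delay=10
restarts=6
i=1
while [ "$i" -lt "$restarts" ]; do
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
  i=$((i + 1))
done
echo "after $restarts restarts, next retry in ~${delay}s"
```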

OOMKilled


First check: get_pod_events (confirms OOMKilled)
Then: get_pod_resources (compare usage to limits)
Common causes:
  • Memory limit set too low for workload
  • Memory leak (usage increases over time)
  • Sudden traffic spike causing memory pressure
  • Large request payloads cached in memory
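Comparing usage to limits is simple arithmetic. A minimal sketch with hypothetical numbers, in the Mi units get_pod_resources would report:

```shell
# Hypothetical values: memory limit and current usage in Mi.
limit_mi=256
usage_mi=243

# Containers running close to their limit (>= 90% here) are at risk of
# the next spike being OOMKilled.
pct=$((usage_mi * 100 / limit_mi))
if [ "$pct" -ge 90 ]; then
  verdict="at-risk"
else
  verdict="ok"
fi
echo "memory: ${usage_mi}Mi / ${limit_mi}Mi (${pct}%) -> $verdict"
```

Steadily rising usage at constant load points at a leak; a one-off spike points at traffic or payload size.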

ImagePullBackOff


First check: get_pod_events
Common causes:
  • Wrong image name or tag
  • Private registry without imagePullSecrets
  • Rate limiting from registry
  • Network issues reaching registry
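A rough first check on the image reference itself (the reference below is hypothetical): a missing tag silently pulls `:latest`, a frequent source of the "wrong tag" case above.

```shell
# Hypothetical image reference from the pod spec.
image="registry.example.com/team/api"

# Classify the reference: digest-pinned, explicitly tagged, or untagged.
# Only the last path component is inspected, so registry ports don't confuse it.
last=${image##*/}
case "$last" in
  *@*) kind="pinned by digest" ;;
  *:*) kind="tagged" ;;
  *)   kind="untagged (pulls :latest)" ;;
esac
echo "$image -> $kind"
```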

Pending Pods


First check: get_pod_events
Look for:
  • FailedScheduling
    - Insufficient resources
  • Unschedulable
    - Node affinity/taints
  • No matching nodes for nodeSelector
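A FailedScheduling event message already contains the per-node breakdown. A sketch that pulls out the resource-shortage part (the message below is a hypothetical example in the scheduler's usual format), to separate "add capacity" from "fix taints/affinity":

```shell
# Hypothetical FailedScheduling event message.
msg='0/5 nodes are available: 3 Insufficient memory, 2 node(s) had untolerated taint {dedicated: gpu}.'

# Extract any "Insufficient <resource>" clauses.
shortage=$(printf '%s\n' "$msg" | grep -o 'Insufficient [a-z]*')
echo "$shortage"
```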

Readiness/Liveness Probe Failures


First check: describe_pod (shows probe config)
Then: get_pod_events (probe failure events)
Then: get_pod_logs (why the endpoint isn't responding)
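The probe config from describe_pod tells you how much slack the app has: a container is killed roughly periodSeconds * failureThreshold after liveness probes start failing. A sketch with hypothetical settings:

```shell
# Hypothetical liveness probe settings, as describe_pod would show them.
period_s=10
failure_threshold=3
initial_delay_s=15

# Time from first failed probe to restart ~= periodSeconds * failureThreshold.
# Too-tight values keep restarting slow-starting apps forever.
restart_after_s=$((period_s * failure_threshold))
echo "restart ~${restart_after_s}s after probes start failing (first probe at ${initial_delay_s}s)"
```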

Evicted Pods


First check: get_pod_events
Causes:
  • Node resource pressure (disk, memory)
  • Priority preemption
  • Taint-based eviction

Deployment Issues


Stuck Rollout


describe_deployment  → Check replicas (desired vs ready vs available)
get_deployment_history → Compare current vs previous revision
get_pod_events → For pods in new ReplicaSet
Common causes:
  • New pods failing (CrashLoopBackOff)
  • Readiness probes failing
  • Resource constraints preventing scheduling
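The replica comparison from describe_deployment can be reduced to one check (counts below are hypothetical): a rollout is only healthy once ready has climbed to desired, and flat numbers across repeated checks usually mean the new pods are failing.

```shell
# Hypothetical replica counts as describe_deployment reports them.
desired=5
updated=2
ready=1

if [ "$ready" -lt "$desired" ]; then
  status="stuck-or-progressing: $ready/$desired ready, $updated updated"
else
  status="complete"
fi
echo "$status"
```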

Rollback Decision


Use get_deployment_history to see previous working versions.

Error Classification


Non-Retryable (Stop Immediately)


  • 401 Unauthorized - Invalid credentials
  • 403 Forbidden - No permission
  • 404 Not Found - Resource doesn't exist
  • "config_required": true - Integration not configured

Retryable (May Retry Once)


  • 429 Too Many Requests
  • 500/502/503/504 Server errors
  • Timeout
  • Connection refused
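The two lists above can be sketched as one classifier. Status codes are from the lists; the `timeout` and `connection-refused` labels are hypothetical names for the non-HTTP cases:

```shell
# Classify an error by status code or kind: stop immediately, or allow
# a single retry.
classify() {
  case "$1" in
    401|403|404|config_required) echo "non-retryable" ;;
    429|500|502|503|504|timeout|connection-refused) echo "retryable" ;;
    *) echo "unknown" ;;
  esac
}

classify 503   # prints: retryable
classify 404   # prints: non-retryable
```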

Resource Investigation Pattern


For memory/CPU issues:
1. get_pod_resources → See allocation vs usage
2. describe_pod → See full container spec
3. get_cloudwatch_metrics/query_datadog_metrics → Historical usage
4. detect_anomalies on historical data → Find when issue started
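A toy version of step 4: given a memory series from the metrics tools (the numbers below are made up, in MiB per sample), find where usage first rises well above its starting baseline. The 1.5x threshold is an arbitrary illustration, not what detect_anomalies actually uses:

```shell
# Hypothetical memory series, MiB per sample interval.
series="210 212 215 214 390 405 410"

# Baseline = first sample; flag the first sample above 1.5x baseline.
baseline=${series%% *}
idx=0
start=0
for v in $series; do
  idx=$((idx + 1))
  if [ "$start" -eq 0 ] && [ "$v" -gt $((baseline * 3 / 2)) ]; then
    start=$idx
  fi
done
echo "anomaly starts at sample $start (baseline ${baseline}MiB)"
```

Correlating that start time with deployment history is usually the fastest way to pin the regression on a change.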