k8s-debug
Kubernetes Debugging Expertise
Golden Rule: Events Before Logs
When debugging Kubernetes issues, ALWAYS check events first:
- `get_pod_events` - Shows scheduling, pulling, starting, probes, OOM
- THEN `get_pod_logs` - Application-level errors
Events explain most crash/scheduling issues faster than logs.
Typical Investigation Flow
1. list_pods → Get overview of pod health in namespace
2. get_pod_events → Understand WHY pods are in their state
3. get_pod_logs → Only if events don't explain the issue
4. get_pod_resources → For performance/resource issues
5. describe_deployment → Check deployment status and conditions
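The first three steps of this flow could be sketched as a small driver. The tool names come from the list; everything else is an assumption for illustration: `tools` as a dict of callables, pods as dicts with `name` and `status`, events as dicts with `reason`, and the set of reasons treated as self-explanatory. Treating `status == "Running"` as healthy is also a simplification.

```python
from typing import Any, Callable, Dict, List

# Event reasons assumed to explain a pod's state without reading logs.
# Illustrative only; taken from reasons named elsewhere in this guide.
SELF_EXPLANATORY = {"OOMKilled", "FailedScheduling", "Unschedulable"}


def investigate(namespace: str, tools: Dict[str, Callable[..., Any]]) -> List[dict]:
    """Walk unhealthy pods: events first, logs only if events don't explain."""
    findings = []
    for pod in tools["list_pods"](namespace):
        if pod["status"] == "Running":
            continue  # assumed healthy, no further calls
        events = tools["get_pod_events"](namespace, pod["name"])
        reasons = [event["reason"] for event in events]
        finding = {"pod": pod["name"], "reasons": reasons, "logs": None}
        # Fetch logs only when no event reason already explains the state.
        if not SELF_EXPLANATORY.intersection(reasons):
            finding["logs"] = tools["get_pod_logs"](namespace, pod["name"])
        findings.append(finding)
    return findings
```

Passing the tools in as callables keeps the sketch independent of how they are actually invoked.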
Common Issue Patterns
CrashLoopBackOff
First check: `get_pod_events`

| Event Reason | Likely Cause | Next Step |
|---|---|---|
| OOMKilled | Memory limit too low or memory leak | Check `get_pod_resources` |
| Error | Application crash | Check `get_pod_logs` |
| BackOff | Repeated failures | Check logs for startup errors |

Checklist:
- Memory limits vs actual usage
- Recent deployment changes (`get_deployment_history`)
- Missing config/secrets
- Dependency failures (database, external services)
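The reason-to-next-step mapping can be expressed as a small lookup. This is a minimal sketch: the tool chosen for OOMKilled and Error follows the OOMKilled section and the events-before-logs rule of this guide, and the function name and input shape (a list of reason strings) are hypothetical.

```python
# Follow-up tool per event reason, for CrashLoopBackOff triage.
NEXT_STEP = {
    "OOMKilled": "get_pod_resources",  # compare usage to limits
    "Error": "get_pod_logs",           # application crash details
    "BackOff": "get_pod_logs",         # look for startup errors
}


def next_step(event_reasons):
    """Return the follow-up tool, checking the most specific reason first."""
    for reason in ("OOMKilled", "Error", "BackOff"):
        if reason in event_reasons:
            return NEXT_STEP[reason]
    # Assumed fallback: events did not explain the crash, so read the logs.
    return "get_pod_logs"
```

Checking OOMKilled before BackOff matters because a pod that is repeatedly OOM-killed will usually show both.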
OOMKilled
First check: `get_pod_events` (confirms OOMKilled)
Then: `get_pod_resources` (compare usage to limits)
Common causes:
- Memory limit set too low for workload
- Memory leak (usage increases over time)
- Sudden traffic spike causing memory pressure
- Large request payloads cached in memory
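To separate "limit too low" from "leak", it can help to look at both headroom and trend. The sketch below assumes memory samples are available as a chronological list of byte counts. The function names and the 10% growth threshold are illustrative choices, not part of any tool described here.

```python
from statistics import mean


def memory_headroom(usage_bytes: int, limit_bytes: int) -> float:
    """Fraction of the memory limit still unused (0.0 means at or over the limit)."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return max(0.0, 1.0 - usage_bytes / limit_bytes)


def looks_like_leak(samples, min_growth: float = 0.10) -> bool:
    """Heuristic: average usage in the last third of the samples exceeds
    the first third by at least min_growth (relative)."""
    if len(samples) < 6:
        return False  # too few points to call a trend
    third = len(samples) // 3
    early = mean(samples[:third])
    late = mean(samples[-third:])
    return early > 0 and (late - early) / early >= min_growth
```

Low headroom with flat usage points toward an undersized limit. Steady growth regardless of traffic points toward a leak. A heuristic this simple cannot tell a leak from a sustained traffic increase, so it is a prompt for further investigation rather than a verdict.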
ImagePullBackOff
First check: `get_pod_events`
Common causes:
- Wrong image name or tag
- Private registry without imagePullSecrets
- Rate limiting from registry
- Network issues reaching registry
Pending Pods
First check: `get_pod_events`
Look for:
- `FailedScheduling` - Insufficient resources
- `Unschedulable` - Node affinity/taints
- No matching nodes for nodeSelector
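A scheduling event's message usually names the cause, so a rough keyword match can sort Pending pods into the categories above. The keywords below are assumptions based on typical kube-scheduler wording, which varies between Kubernetes versions. The function name is hypothetical.

```python
def classify_pending(message: str) -> list:
    """Map a scheduling event message to likely causes by keyword (heuristic)."""
    text = message.lower()
    causes = []
    if "insufficient" in text:
        causes.append("insufficient resources")
    if "taint" in text:
        causes.append("taint not tolerated")
    if "affinity" in text or "selector" in text:
        causes.append("node affinity/selector mismatch")
    # No keyword matched: leave the decision to a human reading the event.
    return causes or ["unknown"]
```

A single message can list several causes, one per group of nodes, which is why the function returns a list.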
Readiness/Liveness Probe Failures
First check: `describe_pod` (shows probe config)
Then: `get_pod_events` (probe failure events)
Then: `get_pod_logs` (why endpoint isn't responding)
Evicted Pods
First check: `get_pod_events`
Causes:
- Node resource pressure (disk, memory)
- Priority preemption
- Taint-based eviction
Deployment Issues
Stuck Rollout
describe_deployment → Check replicas (desired vs ready vs available)
get_deployment_history → Compare current vs previous revision
get_pod_events → For pods in new ReplicaSet
Common causes:
- New pods failing (CrashLoopBackOff)
- Readiness probes failing
- Resource constraints preventing scheduling
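The replica comparison from `describe_deployment` can be reduced to a short decision. This is a sketch under the assumption that desired, ready, and available counts are available as plain integers. The function name and the hint strings are hypothetical.

```python
def rollout_hint(desired: int, ready: int, available: int) -> str:
    """Suggest where to look next based on replica counts."""
    if ready < desired:
        # Pods exist but fail readiness, crash, or cannot be scheduled.
        return "pods not ready: check get_pod_events for the new ReplicaSet"
    if available < desired:
        # Ready pods count as available only after minReadySeconds has elapsed.
        return "pods ready but not yet available: wait or check minReadySeconds"
    return "rollout complete"
```

For example, `rollout_hint(3, 1, 1)` points back to events for the new pods, which is where the causes listed above will show up.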
Rollback Decision
Use `get_deployment_history` to see previous working versions.
Error Classification
Non-Retryable (Stop Immediately)
- 401 Unauthorized - Invalid credentials
- 403 Forbidden - No permission
- 404 Not Found - Resource doesn't exist
- "config_required": true - Integration not configured
Retryable (May retry once)
- 429 Too Many Requests
- 500/502/503/504 Server errors
- Timeout
- Connection refused
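The two classes above could be encoded as follows. This is a minimal sketch that assumes a tool call returns a dict with a `status` code and an optional `config_required` flag, and that timeouts and refused connections surface as Python's built-in `TimeoutError` and `ConnectionRefusedError`. The response shape and both function names are assumptions.

```python
NON_RETRYABLE = {401, 403, 404}
RETRYABLE = {429, 500, 502, 503, 504}


def is_retryable(status, config_required: bool = False) -> bool:
    """True only for statuses this guide lists as retryable."""
    if config_required or status in NON_RETRYABLE:
        return False
    return status in RETRYABLE


def call_with_single_retry(call):
    """Invoke call() and retry at most once on a retryable failure."""
    for attempt in range(2):
        try:
            result = call()
        except (TimeoutError, ConnectionRefusedError):
            if attempt == 0:
                continue  # one retry for timeouts and refused connections
            raise
        retry = is_retryable(result.get("status"), bool(result.get("config_required")))
        if attempt == 0 and retry:
            continue
        return result
```

The second attempt's result is returned as-is, even if it is another retryable error, so the caller never loops.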
Resource Investigation Pattern
For memory/CPU issues:
1. get_pod_resources → See allocation vs usage
2. describe_pod → See full container spec
3. get_cloudwatch_metrics/query_datadog_metrics → Historical usage
4. detect_anomalies on historical data → Find when issue started
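To illustrate what "find when the issue started" means in step 4, here is a simple stand-in: a z-score check against an early baseline window. It is not how `detect_anomalies` works internally, which this guide does not describe. The function name, window size, and threshold are all hypothetical.

```python
from statistics import mean, pstdev


def first_anomaly_index(values, baseline_size: int = 5, threshold: float = 3.0):
    """Index of the first sample more than `threshold` standard deviations
    above the baseline mean, or None if no sample qualifies."""
    if len(values) <= baseline_size:
        return None  # nothing to compare against the baseline
    baseline = values[:baseline_size]
    mu = mean(baseline)
    # Guard against a perfectly flat baseline (zero standard deviation).
    sigma = pstdev(baseline) or 1e-9
    for i in range(baseline_size, len(values)):
        if (values[i] - mu) / sigma > threshold:
            return i
    return None
```

Mapping the returned index back to its timestamp gives a candidate start time to compare against `get_deployment_history`.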