troubleshooting-kubernetes

Troubleshooting Kubernetes

Overview

Systematic K8s troubleshooting with interactive remediation. Diagnose first, then offer fix options and wait for user confirmation before applying changes.

Core principle: Never apply fixes without user approval. Present options, let user choose.

CRITICAL: Overrides Autonomous Execution

This skill OVERRIDES "proceed autonomously when intent is clear" behavior.

Even when:

User says "just do X NOW"
User says "skip diagnosis"
User says "I know what's wrong"
User's intent seems crystal clear

You MUST still present numbered options and wait for selection.

Why: OOMKilled may be a symptom of a memory leak, in which case raising the limit only delays the real fix. The "obvious" fix can also fail because of resource quotas. Thirty seconds of diagnosis can prevent hours of debugging the wrong fix.

Workflow

1. GET STATE    → kubectl get pods,svc,deploy,events
2. IDENTIFY     → Match symptom to category
3. DRILL DOWN   → logs, describe, specific checks
4. ROOT CAUSE   → Pattern match to known issues
5. OFFER FIX    → Present numbered options
6. WAIT         → User confirms before proceeding
7. APPLY        → Execute chosen fix
8. VERIFY       → Confirm resolution

Symptom → Commands

| Symptom | First Commands | Look For |
| --- | --- | --- |
| CrashLoopBackOff | `logs --previous`, `describe pod` | Exit code 137 = SIGKILL (usually OOMKilled), 1 = application crash |
| ImagePullBackOff | `describe pod` | Registry auth, wrong tag |
| Pending | `describe pod`, `get nodes`, `get events` | Resources, affinity, taints |
| Service 502/503 | `get endpoints`, `describe svc` | Empty endpoints, selector mismatch |
| Deployment stuck | `rollout status`, `describe deploy` | Quota, node selector, image |
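
The exit codes in the table above can be folded into a small lookup helper. This is a minimal sketch, not part of the original skill: `classify_exit` is a hypothetical function name, it covers only the two exit codes this guide mentions, and the commented `kubectl` line assumes a single-container pod whose name is in `$POD`.

```bash
#!/usr/bin/env bash
# classify_exit: hypothetical helper mapping a container exit code to the
# likely cause named in this guide. Anything else falls through to a
# "go look at describe pod" hint.
classify_exit() {
  case "$1" in
    137) echo "SIGKILL, usually OOMKilled: check memory limit and usage" ;;
    1)   echo "Application crash: check logs --previous for a stack trace" ;;
    *)   echo "Unrecognized exit code $1: inspect describe pod output" ;;
  esac
}

# Example usage (assumes a single-container pod named in $POD):
# code=$(kubectl get pod "$POD" \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
# classify_exit "$code"
```

The helper only labels the symptom; it deliberately proposes no fix, so the numbered-options step still has to happen.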

Error Patterns

Exit 137 + "OOMKilled"      → Memory limit too low, or a memory leak
Exit 1 + stack trace        → Application bug
"Insufficient cpu/memory"   → Node capacity or requests too high
"ImagePullBackOff"          → Wrong tag, missing secret, registry down
"0/3 endpoints"             → Selector doesn't match pod labels
"FailedScheduling"          → No nodes match requirements

Interactive Fix Presentation

ALWAYS present fixes as numbered options:

DIAGNOSIS: Pod OOMKilled (using 450Mi, limit 256Mi)

OPTIONS:
1. Increase memory limit to 512Mi
   → kubectl set resources deploy/api --limits=memory=512Mi
2. Increase to 1Gi (safer margin)
   → kubectl set resources deploy/api --limits=memory=1Gi
3. Show me the full patch YAML first
4. I'll fix manually

Which option? (1-4):

Wait for user response before executing.
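
One way to make the "present, then wait" step mechanical is a prompt helper that returns the chosen option and runs nothing itself. The sketch below is an assumption about how this could be scripted, not something the skill prescribes: `present_options` is a hypothetical name, the menu goes to stderr so that stdout carries only the selection, and it relies on bash indirect expansion.

```bash
#!/usr/bin/env bash
# present_options: hypothetical helper. Prints each argument as a numbered
# option, reads one choice from stdin, and echoes the chosen text on stdout.
# It never executes anything; the caller decides what to do with the result.
present_options() {
  local i=1 opt choice
  for opt in "$@"; do
    printf '%d. %s\n' "$i" "$opt" >&2
    i=$((i + 1))
  done
  printf 'Which option? (1-%d): ' "$#" >&2
  read -r choice || return 1
  # Reject empty input and anything that is not purely digits.
  case "$choice" in
    ''|*[!0-9]*) echo "Invalid selection: $choice" >&2; return 1 ;;
  esac
  # Reject numbers outside the menu range.
  if [ "$choice" -lt 1 ] || [ "$choice" -gt "$#" ]; then
    echo "Invalid selection: $choice" >&2
    return 1
  fi
  # Indirect expansion: pick the positional parameter numbered $choice.
  echo "${!choice}"
}

# Example usage:
# fix=$(present_options \
#   "kubectl set resources deploy/api --limits=memory=512Mi" \
#   "kubectl set resources deploy/api --limits=memory=1Gi" \
#   "I'll fix manually") || exit 1
```

Because an invalid or missing answer returns non-zero, the caller cannot fall through to a default fix by accident.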

Red Flags - STOP

These indicate you're skipping the process:

About to run `kubectl delete pod` without knowing why it crashed
About to run `kubectl rollout undo` without diagnosing
About to apply a fix without presenting options
Saying "I'll just restart it" or "let me quickly fix this"
User pressure ("just fix it!") → Still diagnose, still offer options

User urgency never justifies skipping the confirmation step.

Handling "Skip Diagnosis" Requests

User says: "Just fix it!" / "Skip the diagnosis" / "I know what's wrong"

Response pattern:

Acknowledge urgency
Explain quick diagnosis prevents wrong fix
Offer rapid diagnosis (2 min) + options
Still present numbered options at the end

I understand the urgency. A 2-minute diagnosis ensures we fix the right thing.

Let me quickly verify: [run 2-3 key commands]

Based on that:
OPTIONS:
1. [What user suggested]
2. [Alternative if diagnosis shows different issue]
3. I'll fix manually

Which option?

Never: Just execute what user asked without verification.

Anti-Pattern Example

❌ WRONG (even when user is explicit):
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: [runs kubectl set resources immediately]

✅ CORRECT:
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: "Understood. Quick check (30s)..."
     [runs describe pod, checks current limits]
     "Confirmed OOM. Options:
      1. Increase to 1Gi (your suggestion)
      2. Increase to 512Mi first (conservative)
      3. Check for memory leak patterns first
      Which option?"

Rationalization Table

| Excuse | Reality |
| --- | --- |
| "User said exactly what to do" | Still present options; the user might not know the alternatives |
| "Intent is crystal clear" | This skill overrides autonomous execution for K8s |
| "It's obviously OOM, just fix it" | OOM can mask a memory leak; increasing memory delays the real fix |
| "User is frustrated, just do it" | Frustration doesn't reduce the value of a 30-second verification |
| "I'm being helpful by acting fast" | Wrong fix = more frustration; options = user control |

Common Mistakes

| Mistake | Correct Approach |
| --- | --- |
| Delete pod first | Diagnose first; deleting is rarely the fix |
| Restart without logs | Always check `logs --previous` before restarting |
| Assume it's the obvious thing | Verify with actual output |
| Apply fix immediately | Present options, wait for confirmation |
| Skip verification | Always run `get pods -w` after the fix |

Quick Diagnosis Cheat Sheet

```bash
# Full state snapshot
kubectl get pods,svc,deploy,rs,events --sort-by='.lastTimestamp'

# Pod deep dive (-E so that | is treated as alternation)
kubectl describe pod <pod> | grep -E -A5 "State:|Events:"
kubectl logs <pod> --previous --tail=50

# Service connectivity
kubectl get endpoints <svc>
kubectl describe svc <svc> | grep Selector

# Resource issues
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top pods
```

Verification After Fix

```bash
# Watch pod come up
kubectl get pods -w

# Verify running and ready
kubectl get pods -o wide  # STATUS=Running, READY=1/1

# Check no new crashes
kubectl describe pod <new-pod> | grep "Restart Count"
```

Only mark issue resolved after pod is stable for 2+ minutes.
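
The two-minute stability rule can be checked by polling the restart count rather than watching by eye. The following is a rough sketch under stated assumptions: `restart_count` and `wait_stable` are hypothetical helper names, the pod is assumed to have a single container, and the 120-second duration and 10-second interval are illustrative defaults.

```bash
#!/usr/bin/env bash
# restart_count: reads the restart count of the first container in a pod.
# Assumes a single-container pod; adjust the index for multi-container pods.
restart_count() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# wait_stable: hypothetical helper. Samples the restart count every INTERVAL
# seconds for DURATION seconds and fails as soon as the count changes.
# INTERVAL must be greater than zero.
wait_stable() {
  local pod=$1 duration=${2:-120} interval=${3:-10}
  local start now elapsed=0
  start=$(restart_count "$pod") || return 1
  while [ "$elapsed" -lt "$duration" ]; do
    sleep "$interval"
    elapsed=$((elapsed + interval))
    now=$(restart_count "$pod") || return 1
    if [ "$now" != "$start" ]; then
      echo "Pod $pod restarted ($start -> $now) after ${elapsed}s" >&2
      return 1
    fi
  done
  echo "Pod $pod stable for ${duration}s (restarts: $start)"
}

# Example usage:
# wait_stable <new-pod> 120 10 && echo "safe to mark resolved"
```

A non-zero return means the pod restarted during the window, which is a cue to go back to diagnosis instead of closing the issue.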

Mandatory Checkpoint

Before running ANY kubectl command that modifies resources:

Have I presented at least 2 numbered options?
Has user explicitly selected one?
Did I wait for their response?

If any unchecked → STOP, present options first.
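
The checkpoint could also be enforced in a script with a thin wrapper around `kubectl`. This is only a sketch: `is_mutating` and `guarded_kubectl` are hypothetical names, the `USER_CONFIRMED` variable is an assumed convention for "the user picked an option", the verb list is illustrative and not exhaustive, and the wrapper only inspects the first two arguments, so global flags placed before the verb are not handled.

```bash
#!/usr/bin/env bash
# is_mutating: returns 0 when the kubectl verb changes cluster state.
# The verb list is an illustrative assumption, not an exhaustive one.
is_mutating() {
  case "$1" in
    rollout)
      # rollout status/history only read; undo/restart/pause/resume modify.
      case "${2:-}" in
        status|history) return 1 ;;
        *) return 0 ;;
      esac ;;
    apply|create|delete|edit|patch|replace|scale|set|label|annotate|cordon|uncordon|drain|taint)
      return 0 ;;
    *) return 1 ;;
  esac
}

# guarded_kubectl: hypothetical wrapper. Read-only verbs pass straight
# through; mutating verbs run only when USER_CONFIRMED=yes was set after
# the user selected a numbered option.
guarded_kubectl() {
  if is_mutating "${1:-}" "${2:-}" && [ "${USER_CONFIRMED:-no}" != "yes" ]; then
    echo "STOP: present options and wait for selection before 'kubectl $1'" >&2
    return 1
  fi
  kubectl "$@"
}

# Example usage:
# guarded_kubectl get pods                      # runs immediately
# guarded_kubectl delete pod api-123            # refused
# USER_CONFIRMED=yes guarded_kubectl delete pod api-123
```

Diagnostic commands such as `get`, `describe`, `logs`, and `rollout status` stay frictionless, so the guard adds no delay to the diagnosis phase.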
