troubleshooting-kubernetes
Troubleshooting Kubernetes
Overview
Systematic K8s troubleshooting with interactive remediation. Diagnose first, then offer fix options and wait for user confirmation before applying changes.
Core principle: Never apply fixes without user approval. Present options, let user choose.
CRITICAL: Overrides Autonomous Execution
This skill OVERRIDES "proceed autonomously when intent is clear" behavior.
Even when:
- User says "just do X NOW"
- User says "skip diagnosis"
- User says "I know what's wrong"
- User's intent seems crystal clear
You MUST still present numbered options and wait for selection.
Why: OOMKilled might be a symptom of a memory leak, in which case raising the limit only delays the real fix. The "obvious" fix might also fail because of resource quotas. Thirty seconds of diagnosis can prevent hours of debugging the wrong fix.
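As a rough illustration of that 30-second check, the sketch below reads a pod's last termination reason and lists any resource quotas in the namespace before a memory change is proposed. The function name `quick_oom_check` and its arguments are hypothetical, and it assumes the first container in the pod is the one that crashed.

```bash
# Hypothetical helper: read-only checks before proposing a memory increase.
# Assumes the first container in the pod is the one of interest.
quick_oom_check() {
  pod="$1"
  ns="${2:-default}"
  # Reason for the previous termination, e.g. OOMKilled or Error
  reason=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  echo "Last termination reason: ${reason:-none recorded}"
  # A quota in the namespace can reject a larger limit
  echo "Resource quotas in $ns:"
  kubectl get resourcequota -n "$ns"
}
```

Both commands only read state, so running them does not bypass the confirmation step.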
Workflow
1. GET STATE → kubectl get pods,svc,deploy,events
2. IDENTIFY → Match symptom to category
3. DRILL DOWN → logs, describe, specific checks
4. ROOT CAUSE → Pattern match to known issues
5. OFFER FIX → Present numbered options
6. WAIT → User confirms before proceeding
7. APPLY → Execute chosen fix
8. VERIFY → Confirm resolution
Symptom → Commands
| Symptom | First Commands | Look For |
|---|---|---|
| CrashLoopBackOff | | Exit code 137=OOM, 1=app crash |
| ImagePullBackOff | | Registry auth, wrong tag |
| Pending | | Resources, affinity, taints |
| Service 502/503 | | Empty endpoints, selector mismatch |
| Deployment stuck | | Quota, node selector, image |
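One possible way to encode a symptom-to-command lookup is sketched below. The function name `first_commands`, the symptom keys, and the command choices per symptom are all assumptions for illustration (most are taken from the cheat sheet in this skill); the function only prints read-only commands and never runs them.

```bash
# Hypothetical lookup: print (never run) read-only first commands per symptom.
# The command choices are assumptions, not the skill's official mapping.
first_commands() {
  case "$1" in
    CrashLoopBackOff)
      echo 'kubectl describe pod <pod>'
      echo 'kubectl logs <pod> --previous --tail=50' ;;
    ImagePullBackOff)
      echo 'kubectl describe pod <pod>' ;;
    Pending)
      echo 'kubectl describe pod <pod>'
      echo 'kubectl describe nodes' ;;
    Service5xx)
      echo 'kubectl get endpoints <svc>'
      echo 'kubectl describe svc <svc>' ;;
    DeploymentStuck)
      echo 'kubectl describe deploy <name>'
      echo 'kubectl get rs' ;;
    *)
      # Unknown symptom: fall back to the general state snapshot
      echo 'kubectl get pods,svc,deploy,events' ;;
  esac
}
```

For example, `first_commands Pending` prints the two describe commands to run before reading the "Look For" column.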
Error Patterns
Exit 137 + "OOMKilled" → Memory limit too low
Exit 1 + stack trace → Application bug
"Insufficient cpu/memory" → Node capacity or requests too high
"ImagePullBackOff" → Wrong tag, missing secret, registry down
"0/3 endpoints" → Selector doesn't match pod labels
"FailedScheduling" → No nodes match requirements
Interactive Fix Presentation
ALWAYS present fixes as numbered options:
DIAGNOSIS: Pod OOMKilled (using 450Mi, limit 256Mi)
OPTIONS:
1. Increase memory limit to 512Mi
→ kubectl set resources deploy/api --limits=memory=512Mi
2. Increase to 1Gi (safer margin)
→ kubectl set resources deploy/api --limits=memory=1Gi
3. Show me the full patch YAML first
4. I'll fix manually
Which option? (1-4):
Wait for user response before executing.
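The presentation above could be scripted roughly as follows. This is a minimal sketch: `choose_fix` is a hypothetical name, the menu goes to stderr, and the function only echoes the selected command to stdout without executing it, so the caller still decides what happens next.

```bash
# Hypothetical prompt: print numbered fixes, read a choice, echo the chosen command.
# It never executes anything; "fix manually" or invalid input returns non-zero.
choose_fix() {
  diagnosis="$1"
  shift
  echo "DIAGNOSIS: $diagnosis" >&2
  echo "OPTIONS:" >&2
  i=1
  for cmd in "$@"; do
    echo "$i. $cmd" >&2
    i=$((i + 1))
  done
  echo "$i. I'll fix manually" >&2
  printf 'Which option? (1-%s): ' "$i" >&2
  read -r choice || return 1
  case "$choice" in
    ''|*[!0-9]*) return 1 ;;   # not a number: do nothing
  esac
  # The last number is the manual option; anything outside the list is rejected
  [ "$choice" -ge 1 ] && [ "$choice" -lt "$i" ] || return 1
  shift $((choice - 1))
  echo "$1"
}
```

A caller might capture the result with `fix=$(choose_fix "Pod OOMKilled" "kubectl set resources deploy/api --limits=memory=512Mi" "kubectl set resources deploy/api --limits=memory=1Gi")` and run `$fix` only if the function succeeded.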
Red Flags - STOP
These indicate you're skipping the process:
- About to run kubectl delete pod without knowing why it crashed
- About to run kubectl rollout undo without diagnosing
- About to apply a fix without presenting options
- Saying "I'll just restart it" or "let me quickly fix this"
- User pressure ("just fix it!") → Still diagnose, still offer options
User urgency doesn't skip the confirmation step.
Handling "Skip Diagnosis" Requests
User says: "Just fix it!" / "Skip the diagnosis" / "I know what's wrong"
Response pattern:
- Acknowledge urgency
- Explain quick diagnosis prevents wrong fix
- Offer rapid diagnosis (2 min) + options
- Still present numbered options at the end
I understand the urgency. A 2-minute diagnosis ensures we fix the right thing.
Let me quickly verify: [run 2-3 key commands]
Based on that:
OPTIONS:
1. [What user suggested]
2. [Alternative if diagnosis shows different issue]
3. I'll fix manually
Which option?
Never: Just execute what user asked without verification.
Anti-Pattern Example
❌ WRONG (even when user is explicit):
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: [runs kubectl set resources immediately]
✅ CORRECT:
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: "Understood. Quick check (30s)..."
[runs describe pod, checks current limits]
"Confirmed OOM. Options:
1. Increase to 1Gi (your suggestion)
2. Increase to 512Mi first (conservative)
3. Check for memory leak patterns first
Which option?"
Rationalization Table
| Excuse | Reality |
|---|---|
| "User said exactly what to do" | Still present options - user might not know alternatives |
| "Intent is crystal clear" | Skill overrides autonomous execution for K8s |
| "It's obviously OOM, just fix it" | OOM can mask memory leaks; increasing memory delays real fix |
| "User is frustrated, just do it" | Frustration doesn't change the 30s verification value |
| "I'm being helpful by acting fast" | Wrong fix = more frustration; options = user control |
Common Mistakes
| Mistake | Correct Approach |
|---|---|
| Delete pod first | Diagnose first, delete is rarely the fix |
| Restart without logs | Always check the logs before restarting |
| Assume it's the obvious thing | Verify with actual output |
| Apply fix immediately | Present options, wait for confirmation |
| Skip verification | Always verify after applying the fix |
Quick Diagnosis Cheat Sheet
```bash
# Full state snapshot
kubectl get pods,svc,deploy,rs,events --sort-by='.lastTimestamp'

# Pod deep dive (-E so that "|" is treated as alternation)
kubectl describe pod <name> | grep -E -A5 "State:|Events:"
kubectl logs <pod> --previous --tail=50

# Service connectivity
kubectl get endpoints <svc>
kubectl describe svc <svc> | grep Selector

# Resource issues (kubectl top needs metrics-server in the cluster)
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top pods
```
Verification After Fix
```bash
# Watch pod come up
kubectl get pods -w

# Verify running and ready
kubectl get pods -o wide   # STATUS=Running, READY=1/1

# Check no new crashes
kubectl describe pod <new-pod> | grep "Restart Count"
```
Only mark issue resolved after pod is stable for 2+ minutes.
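The 2-minute stability rule could be checked with something like the sketch below. `restarts_stable` is a hypothetical helper that assumes a single-container pod; it compares the restart count at the start and end of an observation window (120 seconds by default) and succeeds only if the count did not change.

```bash
# Hypothetical check: succeed only if the restart count is unchanged
# over the observation window. Assumes a single-container pod.
restarts_stable() {
  pod="$1"
  window="${2:-120}"
  before=$(kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}')
  sleep "$window"
  after=$(kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}')
  # An empty count (pod missing or not started) counts as not stable
  [ -n "$before" ] && [ "$before" = "$after" ]
}
```

For example, `restarts_stable <new-pod> && echo "stable for 2 minutes"` prints the message only when no restart happened during the window.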
Mandatory Checkpoint
Before running ANY kubectl command that modifies resources:
- Have I presented at least 2 numbered options?
- Has user explicitly selected one?
- Did I wait for their response?
If any unchecked → STOP, present options first.
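The checkpoint can be approximated mechanically with a small wrapper, sketched here under stated assumptions: `is_mutating`, `guarded_kubectl`, and the `USER_CONFIRMED` variable are hypothetical names, and the verb list is illustrative rather than exhaustive.

```bash
# Hypothetical guard: classify a kubectl invocation by its verb.
# The verb list is an assumption and is not exhaustive.
is_mutating() {
  case "$1" in
    rollout)
      # rollout status/history only read state; other subcommands change it
      case "$2" in status|history) return 1 ;; *) return 0 ;; esac ;;
    apply|create|delete|edit|patch|replace|scale|set|annotate|label|drain|cordon|uncordon|taint)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Refuse mutating commands until the user has explicitly picked an option.
guarded_kubectl() {
  if is_mutating "$1" "$2" && [ "${USER_CONFIRMED:-no}" != "yes" ]; then
    echo "STOP: present options and wait for a selection first" >&2
    return 1
  fi
  kubectl "$@"
}
```

With this in place, `guarded_kubectl get pods` runs immediately, while `guarded_kubectl delete pod <name>` is refused unless `USER_CONFIRMED=yes` was set after the user chose an option.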