troubleshooting-kubernetes

Troubleshooting Kubernetes

Overview

Systematic K8s troubleshooting with interactive remediation. Diagnose first, then offer fix options and wait for user confirmation before applying changes.
Core principle: Never apply fixes without user approval. Present options, let user choose.

CRITICAL: Overrides Autonomous Execution

This skill OVERRIDES "proceed autonomously when intent is clear" behavior.
Even when:
  • User says "just do X NOW"
  • User says "skip diagnosis"
  • User says "I know what's wrong"
  • User's intent seems crystal clear
You MUST still present numbered options and wait for selection.
Why: OOMKilled can be a symptom of a memory leak, in which case raising the limit only delays the real fix. The "obvious" fix can also fail because of resource quotas. Thirty seconds of diagnosis prevents hours of debugging the wrong fix.

Workflow

1. GET STATE    → kubectl get pods,svc,deploy,events
2. IDENTIFY     → Match symptom to category
3. DRILL DOWN   → logs, describe, specific checks
4. ROOT CAUSE   → Pattern match to known issues
5. OFFER FIX    → Present numbered options
6. WAIT         → User confirms before proceeding
7. APPLY        → Execute chosen fix
8. VERIFY       → Confirm resolution

Symptom → Commands

Symptom            First Commands                         Look For
CrashLoopBackOff   logs --previous, describe pod          Exit code 137=OOM, 1=app crash
ImagePullBackOff   describe pod                           Registry auth, wrong tag
Pending            describe pod, get nodes, get events    Resources, affinity, taints
Service 502/503    get endpoints, describe svc            Empty endpoints, selector mismatch
Deployment stuck   rollout status, describe deploy        Quota, node selector, image
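
The symptom table above can be expressed as a small lookup. This is only a sketch: the function name `first_commands`, the symptom keys `Service5xx` and `DeploymentStuck`, and the placeholders `<pod>`, `<svc>`, and `<deploy>` are illustrative assumptions, not part of the original skill. It prints read-only commands and runs nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the read-only first commands for a symptom.
# The mapping mirrors the symptom table; placeholders are left unfilled.
first_commands() {
  case "$1" in
    CrashLoopBackOff)
      echo "kubectl logs <pod> --previous"
      echo "kubectl describe pod <pod>"
      ;;
    ImagePullBackOff)
      echo "kubectl describe pod <pod>"
      ;;
    Pending)
      echo "kubectl describe pod <pod>"
      echo "kubectl get nodes"
      echo "kubectl get events"
      ;;
    Service5xx)
      echo "kubectl get endpoints <svc>"
      echo "kubectl describe svc <svc>"
      ;;
    DeploymentStuck)
      echo "kubectl rollout status deploy/<deploy>"
      echo "kubectl describe deploy <deploy>"
      ;;
    *)
      # Unknown symptom: report it and fail so the caller notices.
      echo "unknown symptom: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `first_commands Pending` lists the three commands from the Pending row, one per line.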

Error Patterns

Exit 137 + "OOMKilled"      → Memory limit too low, or a memory leak
Exit 1 + stack trace        → Application bug
"Insufficient cpu/memory"   → Node capacity or requests too high
"ImagePullBackOff"          → Wrong tag, missing secret, registry down
"0/3 endpoints"             → Selector doesn't match pod labels
"FailedScheduling"          → No nodes match requirements

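The error patterns above lend themselves to simple substring matching on `kubectl describe` or event output. The sketch below assumes the text has already been captured into a string; the function name `likely_cause` is hypothetical, and only patterns with a fixed substring from the list are covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a snippet of describe/event output to the
# likely cause from the pattern list. "Insufficient ..." is checked before
# "FailedScheduling" because scheduling events often contain both and the
# former is more specific.
likely_cause() {
  case "$1" in
    *OOMKilled*)
      echo "Memory limit too low (or a leak; check before raising it)" ;;
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "Node capacity or requests too high" ;;
    *ImagePullBackOff*)
      echo "Wrong tag, missing secret, registry down" ;;
    *FailedScheduling*)
      echo "No nodes match requirements" ;;
    *)
      # Nothing matched: say so and return non-zero.
      echo "No known pattern; keep drilling down"
      return 1 ;;
  esac
}
```

The result is a hint for the diagnosis step, not a decision: it still feeds into numbered options for the user.
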
Interactive Fix Presentation

ALWAYS present fixes as numbered options:
DIAGNOSIS: Pod OOMKilled (using 450Mi, limit 256Mi)

OPTIONS:
1. Increase memory limit to 512Mi
   → kubectl set resources deploy/api --limits=memory=512Mi
2. Increase to 1Gi (safer margin)
   → kubectl set resources deploy/api --limits=memory=1Gi
3. Show me the full patch YAML first
4. I'll fix manually

Which option? (1-4):
Wait for user response before executing.
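
One way to enforce the wait in a script is to separate choosing from executing. The sketch below is an assumption about how that could look, not part of the original skill: the hypothetical `choose_fix` prints the numbered options to stderr, reads a selection from stdin, and echoes only the chosen entry. It never runs anything itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show numbered options, read one choice from stdin,
# and print the selected option on stdout. Prompts go to stderr so the
# caller can capture the selection cleanly.
choose_fix() {
  local diagnosis="$1"; shift
  local i=1 opt choice
  echo "DIAGNOSIS: $diagnosis" >&2
  echo "OPTIONS:" >&2
  for opt in "$@"; do
    echo "$i. $opt" >&2
    i=$((i + 1))
  done
  printf 'Which option? (1-%d): ' "$#" >&2
  read -r choice || return 1
  case "$choice" in
    # Reject empty input, leading zeros, and anything non-numeric.
    ''|0*|*[!0-9]*) echo "No valid selection; nothing applied." >&2; return 1 ;;
  esac
  if [ "$choice" -gt "$#" ]; then
    echo "No valid selection; nothing applied." >&2
    return 1
  fi
  # Indirect expansion: choice=2 yields the second option.
  echo "${!choice}"
}
```

A caller might capture the result with `cmd=$(choose_fix "Pod OOMKilled" "kubectl set resources deploy/api --limits=memory=512Mi" "kubectl set resources deploy/api --limits=memory=1Gi")` and run `$cmd` only when the function succeeds.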

Red Flags - STOP

These indicate you're skipping the process:
  • About to run kubectl delete pod without knowing why it crashed
  • About to run kubectl rollout undo without diagnosing
  • About to apply a fix without presenting options
  • Saying "I'll just restart it" or "let me quickly fix this"
  • User pressure ("just fix it!") → Still diagnose, still offer options
User urgency doesn't skip the confirmation step.

Handling "Skip Diagnosis" Requests

User says: "Just fix it!" / "Skip the diagnosis" / "I know what's wrong"
Response pattern:
  1. Acknowledge urgency
  2. Explain quick diagnosis prevents wrong fix
  3. Offer rapid diagnosis (2 min) + options
  4. Still present numbered options at the end
I understand the urgency. A 2-minute diagnosis ensures we fix the right thing.

Let me quickly verify: [run 2-3 key commands]

Based on that:
OPTIONS:
1. [What user suggested]
2. [Alternative if diagnosis shows different issue]
3. I'll fix manually

Which option?
Never: Just execute what user asked without verification.

Anti-Pattern Example

❌ WRONG (even when user is explicit):
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: [runs kubectl set resources immediately]

✅ CORRECT:
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: "Understood. Quick check (30s)..."
     [runs describe pod, checks current limits]
     "Confirmed OOM. Options:
      1. Increase to 1Gi (your suggestion)
      2. Increase to 512Mi first (conservative)
      3. Check for memory leak patterns first
      Which option?"

Rationalization Table

Excuse                                Reality
"User said exactly what to do"        Still present options - user might not know alternatives
"Intent is crystal clear"             Skill overrides autonomous execution for K8s
"It's obviously OOM, just fix it"     OOM can mask memory leaks; increasing memory delays real fix
"User is frustrated, just do it"      Frustration doesn't change the 30s verification value
"I'm being helpful by acting fast"    Wrong fix = more frustration; options = user control

Common Mistakes

Mistake                          Correct Approach
Delete pod first                 Diagnose first; delete is rarely the fix
Restart without logs             Always check logs --previous before restart
Assume it's the obvious thing    Verify with actual output
Apply fix immediately            Present options, wait for confirmation
Skip verification                Always run get pods -w after fix

Quick Diagnosis Cheat Sheet

Full state snapshot

kubectl get pods,svc,deploy,rs,events --sort-by='.lastTimestamp'

Pod deep dive

kubectl describe pod <name> | grep -E -A5 "State:|Events:"
kubectl logs <pod> --previous --tail=50

Service connectivity

kubectl get endpoints <svc>
kubectl describe svc <svc> | grep Selector
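
When the endpoints list is empty, the usual next step is comparing the service selector with the pod labels. The sketch below assumes both are already available as comma-separated `key=value` strings, the form shown on the Selector line of `kubectl describe svc` and in the LABELS column of `kubectl get pods --show-labels`. The function name `selector_matches` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every key=value pair in the service
# selector ($1) also appears in the pod's labels ($2).
selector_matches() {
  local selector="$1" labels=",$2," pair
  local IFS=','
  for pair in $selector; do
    case "$labels" in
      # Surrounding commas force whole-pair matches (app=api != app=api-v2).
      *",$pair,"*) ;;
      *) echo "selector pair not on pod: $pair" >&2; return 1 ;;
    esac
  done
  return 0
}
```

For example, `selector_matches "app=api,tier=web" "app=api"` fails and names `tier=web` as the missing pair, which is the kind of mismatch that leaves a service with no endpoints.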

Resource issues

kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top pods

Verification After Fix

Watch pod come up

kubectl get pods -w

Verify running and ready

kubectl get pods -o wide # STATUS=Running, READY=1/1

Check no new crashes

kubectl describe pod <new-pod> | grep "Restart Count"

Only mark issue resolved after pod is stable for 2+ minutes.
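
The two-minute stability rule can be checked by sampling the restart count twice. This is a minimal sketch under stated assumptions: the function names `restart_count` and `wait_stable` are hypothetical, and only the first container's count is read, so a multi-container pod would need a loop over `containerStatuses`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the restart count of the pod's first container.
restart_count() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# Hypothetical helper: succeed only if the restart count is unchanged
# after the observation window (default 120 seconds).
wait_stable() {
  local pod="$1" window="${2:-120}" before after
  before=$(restart_count "$pod") || return 1
  sleep "$window"
  after=$(restart_count "$pod") || return 1
  [ "$after" -eq "$before" ]
}
```

Calling `wait_stable <new-pod>` then returns success only when no restart happened during the window.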

Mandatory Checkpoint

Before running ANY kubectl command that modifies resources:
  • Have I presented at least 2 numbered options?
  • Has user explicitly selected one?
  • Did I wait for their response?
If any unchecked → STOP, present options first.
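
The checkpoint depends on telling read-only commands from modifying ones. One conservative sketch is an allowlist that treats anything unrecognized as modifying. The function name `is_read_only` and the verb list are assumptions for illustration; the list is deliberately short and not exhaustive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for kubectl verbs known to be read-only.
# $1 is the verb, $2 the sub-verb if any. Unknown verbs fail closed, so
# they are routed through the options-and-confirmation step.
is_read_only() {
  case "$1" in
    get|describe|logs|top|explain|version|api-resources)
      return 0 ;;
    rollout)
      # rollout status/history only inspect; undo/restart modify.
      case "$2" in
        status|history) return 0 ;;
      esac
      return 1 ;;
    *)
      return 1 ;;
  esac
}
```

With this, `is_read_only rollout undo` and `is_read_only delete pod` both fail, so neither would run before the user has picked an option.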