Diagnoses and fixes Kubernetes issues with interactive remediation. Use when pods crash (CrashLoopBackOff, OOMKilled), services unreachable (502/503, empty endpoints), deployments stuck (ImagePullBackOff, pending). Also use when tempted to run kubectl fix commands directly without presenting options, or when user says "just fix it" for K8s issues.
npx skill4agent add galihcitta/dotclaudeskills troubleshooting-kubernetes

1. GET STATE → kubectl get pods,svc,deploy,events
2. IDENTIFY → Match symptom to category
3. DRILL DOWN → logs, describe, specific checks
4. ROOT CAUSE → Pattern match to known issues
5. OFFER FIX → Present numbered options
6. WAIT → User confirms before proceeding
7. APPLY → Execute chosen fix
8. VERIFY → Confirm resolution

| Symptom | First Commands | Look For |
|---|---|---|
| CrashLoopBackOff | `kubectl logs <pod> --previous`, `kubectl describe pod <pod>` | Exit code 137=OOM, 1=app crash |
| ImagePullBackOff | `kubectl describe pod <pod>` | Registry auth, wrong tag |
| Pending | `kubectl describe pod <pod>`, `kubectl describe nodes` | Resources, affinity, taints |
| Service 502/503 | `kubectl get endpoints <svc>` | Empty endpoints, selector mismatch |
| Deployment stuck | `kubectl describe deploy <name>`, `kubectl get events` | Quota, node selector, image |
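The exit codes in the "Look For" column can be mapped mechanically. A minimal sketch — the `interpret_exit` helper is hypothetical, not part of kubectl; codes follow the 128+signal convention:

```shell
# interpret_exit: map a container exit code to a likely cause (hypothetical helper).
interpret_exit() {
  case "$1" in
    137) echo "OOMKilled or SIGKILL (128+9): memory limit likely too low" ;;
    143) echo "SIGTERM (128+15): normal shutdown or failed liveness probe" ;;
    1)   echo "application error: check logs for a stack trace" ;;
    0)   echo "clean exit: check restartPolicy and container command" ;;
    *)   echo "unrecognized exit code: $1" ;;
  esac
}

# The exit code itself comes from, e.g.:
#   kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Run as `interpret_exit 137` after pulling the code from the pod's last terminated state.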
Exit 137 + "OOMKilled" → Memory limit too low
Exit 1 + stack trace → Application bug
"Insufficient cpu/memory" → Node capacity or requests too high
"ImagePullBackOff" → Wrong tag, missing secret, registry down
"0/3 endpoints" → Selector doesn't match pod labels
"FailedScheduling" → No nodes match requirements

DIAGNOSIS: Pod OOMKilled (using 450Mi, limit 256Mi)
OPTIONS:
1. Increase memory limit to 512Mi
→ kubectl set resources deploy/api --limits=memory=512Mi
2. Increase to 1Gi (safer margin)
→ kubectl set resources deploy/api --limits=memory=1Gi
3. Show me the full patch YAML first
4. I'll fix manually
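For option 3, the patch might look roughly like this — a sketch only; the deployment and container name `api` are assumed from the example:

```yaml
# patch.yaml — hypothetical memory-limit patch for deploy/api
spec:
  template:
    spec:
      containers:
        - name: api            # container name assumed
          resources:
            limits:
              memory: 512Mi
            requests:
              memory: 256Mi    # keep requests at or below the new limit
```

Applied with `kubectl patch deployment api --patch-file patch.yaml`, or previewed via `kubectl set resources deploy/api --limits=memory=512Mi --dry-run=client -o yaml`.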
Which option? (1-4):

Do not reach for `kubectl delete pod` or `kubectl rollout undo` before diagnosing.

When the user says "just fix it":

I understand the urgency. A 2-minute diagnosis ensures we fix the right thing.
Let me quickly verify: [run 2-3 key commands]
Based on that:
OPTIONS:
1. [What user suggested]
2. [Alternative if diagnosis shows different issue]
3. I'll fix manually
Which option?

❌ WRONG (even when user is explicit):
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: [runs kubectl set resources immediately]
✅ CORRECT:
User: "Pod OOMKilled, increase memory to 1Gi NOW!"
You: "Understood. Quick check (30s)..."
[runs describe pod, checks current limits]
"Confirmed OOM. Options:
1. Increase to 1Gi (your suggestion)
2. Increase to 512Mi first (conservative)
3. Check for memory leak patterns first
Which option?"

| Excuse | Reality |
|---|---|
| "User said exactly what to do" | Still present options - user might not know alternatives |
| "Intent is crystal clear" | Skill overrides autonomous execution for K8s |
| "It's obviously OOM, just fix it" | OOM can mask memory leaks; increasing memory delays real fix |
| "User is frustrated, just do it" | Frustration doesn't change the 30s verification value |
| "I'm being helpful by acting fast" | Wrong fix = more frustration; options = user control |

| Mistake | Correct Approach |
|---|---|
| Delete pod first | Diagnose first, delete is rarely the fix |
| Restart without logs | Always check `kubectl logs --previous` first |
| Assume it's the obvious thing | Verify with actual output |
| Apply fix immediately | Present options, wait for confirmation |
| Skip verification | Always confirm the pod is Running and Ready after the fix |
```shell
# Full state snapshot
kubectl get pods,svc,deploy,rs
kubectl get events --sort-by='.lastTimestamp'

# Pod deep dive
kubectl describe pod <name> | grep -A5 "State:\|Events:"
kubectl logs <pod> --previous --tail=50

# Service connectivity
kubectl get endpoints <svc>
kubectl describe svc <svc> | grep Selector

# Resource issues
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top pods
```

```shell
# Watch pod come up
kubectl get pods -w

# Verify running and ready
kubectl get pods -o wide   # STATUS=Running, READY=1/1

# Check no new crashes
kubectl describe pod <new-pod> | grep "Restart Count"
```
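The READY column check can be scripted for the verify step. A minimal sketch — the `is_ready` helper is hypothetical:

```shell
# is_ready: succeed when a kubectl READY value like "2/2" means all
# containers are ready (hypothetical helper).
is_ready() {
  ready="${1%/*}"
  total="${1#*/}"
  [ "$ready" = "$total" ] && [ "$total" -gt 0 ]
}

# Example wiring (requires a cluster):
#   kubectl get pods --no-headers | awk '{print $2}' | while read r; do
#     is_ready "$r" || echo "not ready: $r"
#   done
```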