k8s-incident
Kubernetes Incident Response
Runbooks and diagnostic workflows for common Kubernetes incidents.
When to Apply
Use this skill when:
- User mentions: "incident", "outage", "emergency", "down", "not working"
- Operations: emergency response, production issues, service degradation
- Keywords: "urgent", "broken", "fix", "restore", "recover"
Priority Rules
| Priority | Rule | Impact | Tools |
|---|---|---|---|
| 1 | Check control plane first | CRITICAL | |
| 2 | Assess node health | CRITICAL | |
| 3 | Gather events before changes | HIGH | |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | |
Quick Reference
| Incident | First Tool | Next Steps |
|---|---|---|
| Pod failure | | |
| Node down | | Check kubelet logs |
| Service unreachable | | |
| Control plane | | Check API server logs |
Incident Triage
Quick Health Check
```python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
```
Severity Assessment
| Indicator | Severity | Action |
|---|---|---|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
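The severity table can be read as a small decision rule in which the Critical rows are checked first. The following is a minimal sketch of that reading, not part of any tool API: `ClusterSnapshot`, its fields, and `assess_severity` are hypothetical names, and the snapshot is assumed to be filled in by hand from the quick health check output. Situations the table does not cover fall through to an "Unclassified" result rather than a guess.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical summary of the quick health check results."""
    not_ready_nodes: int = 0
    kube_system_pods_failing: bool = False
    crashlooping_pods: int = 0
    high_latency: bool = False


def assess_severity(snap: ClusterSnapshot) -> tuple[str, str]:
    """Return (severity, action), evaluating the Critical rows first."""
    if snap.not_ready_nodes > 1:
        return ("Critical", "Escalate immediately")
    if snap.kube_system_pods_failing:
        return ("Critical", "Control plane issue")
    if snap.crashlooping_pods == 1:
        return ("Medium", "Debug pod")
    if snap.high_latency:
        return ("Medium", "Check resources")
    # The table has no row for this combination of indicators.
    return ("Unclassified", "Gather more data")


print(assess_severity(ClusterSnapshot(not_ready_nodes=3)))
```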
Runbook: Pod Failures
CrashLoopBackOff
```python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
```
Common Causes:
- OOMKilled → Increase memory limits
- Exit code 1 → Application error; check the logs
- Exit code 137 → Killed by SIGKILL (128 + 9), often by the OOM killer
- Exit code 143 → Terminated by SIGTERM (128 + 15), usually a graceful shutdown
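Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 and 143 map to SIGKILL and SIGTERM. A small sketch of that decoding follows. The helper name `describe_exit_code` and the short signal table are illustrative assumptions. The exit code alone does not prove an OOM kill, so confirm the reason in the `describe_pod` output.

```python
# Signal numbers commonly seen in container exit codes on Linux.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return "application error, check logs"


for code in (1, 137, 143):
    print(code, "->", describe_exit_code(code))
```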
ImagePullBackOff
```python
describe_pod(name, namespace)
get_secrets(namespace)
```
Pending Pod
```python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
```
Runbook: Node Issues
Node NotReady
```python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
```
Node DiskPressure
```python
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
```
Runbook: Network Issues
Service Not Accessible
```python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
```
DNS Resolution Failures
```python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
```
With Cilium
```python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
```
With Istio
```python
istio_analyze_tool(namespace)
istio_proxy_status_tool()
```
Runbook: Storage Issues
PVC Pending
```python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
```
Pod Stuck in ContainerCreating
```python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
```
Runbook: Control Plane Issues
API Server Unavailable
```python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
```
etcd Issues
```python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
```
Emergency Actions
Force Delete Pod
```python
delete_pod(name, namespace, grace_period=0, force=True)
```
Rollback Deployment
```python
rollback_deployment(name, namespace, revision=0)
```
Helm Rollback
```python
rollback_helm_release(name, namespace, revision=1)
```
Diagnostic Collection Script
For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.
Multi-Cluster Incident Response
Check all clusters:
```python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
```
Post-Incident
Document Timeline
- When did the incident start?
- What was the impact?
- What was the root cause?
- What fixed it?
Prevent Recurrence
- Add monitoring/alerting
- Improve resource limits
- Add readiness probes
- Document runbook
Related Skills
- k8s-troubleshoot - Detailed debugging
- k8s-security - Security incidents