k8s-incident

Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

When to Apply

Use this skill when:
  • User mentions: "incident", "outage", "emergency", "down", "not working"
  • Operations: emergency response, production issues, service degradation
  • Keywords: "urgent", "broken", "fix", "restore", "recover"
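The trigger list above can be sketched as a simple matcher. This is only an illustration: the function name `matches_incident_skill` is hypothetical, and whole-word matching is an assumption (it keeps "download" from matching "down", but it also means "recovery" does not match "recover").

```python
import re

# Trigger terms copied from the "Use this skill when" list.
TRIGGER_TERMS = (
    "incident", "outage", "emergency", "down", "not working",
    "urgent", "broken", "fix", "restore", "recover",
)


def matches_incident_skill(message: str) -> bool:
    """Return True if the message contains any trigger term as a whole word or phrase."""
    text = message.lower()
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in TRIGGER_TERMS)
```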

Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | get_pods(namespace="kube-system") |
| 2 | Assess node health | CRITICAL | get_nodes |
| 3 | Gather events before changes | HIGH | get_events |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | rollback_deployment |

Quick Reference

| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | get_pod_logs(previous=True) | describe_pod, get_events |
| Node down | describe_node | Check kubelet logs |
| Service unreachable | get_endpoints | get_network_policies |
| Control plane | get_pods(namespace="kube-system") | Check API server logs |

Incident Triage

Quick Health Check

python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
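As a sketch of how the node part of this health check might be interpreted, the helper below lists nodes whose Ready condition is not "True". It assumes `get_nodes()` returns Node objects as dicts in the shape of the Kubernetes API (`metadata.name`, `status.conditions`); the source does not state the return format, and `not_ready_nodes` is a hypothetical name.

```python
def not_ready_nodes(nodes: list[dict]) -> list[str]:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    result = []
    for node in nodes:
        conditions = node.get("status", {}).get("conditions", [])
        # Each node reports one condition of type "Ready" with status True/False/Unknown.
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            result.append(node["metadata"]["name"])
    return result
```

Treating a missing Ready condition as unhealthy is a deliberate choice here: during an incident, a node that reports nothing is worth a look.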

Severity Assessment

| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
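The severity table can be expressed as a small decision function, checked from most to least severe. This is a minimal sketch: the function name and its parameters are hypothetical, and any combination the table does not cover (for example a single NotReady node, or several crash-looping pods) is reported as "Unknown" rather than guessed.

```python
def assess_severity(
    not_ready_node_count: int,
    kube_system_failing: int,
    crashloop_pods: int,
    high_latency: bool,
) -> tuple[str, str]:
    """Map the indicators from the severity table to (severity, action)."""
    if not_ready_node_count > 1:
        return ("Critical", "Escalate immediately")
    if kube_system_failing > 0:
        return ("Critical", "Control plane issue")
    if crashloop_pods == 1:
        return ("Medium", "Debug pod")
    if high_latency:
        return ("Medium", "Check resources")
    # The table has no row for this combination.
    return ("Unknown", "No indicator from the table matched")
```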

Runbook: Pod Failures

CrashLoopBackOff

python
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
Common Causes:
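  • OOMKilled → Container exceeded its memory limit; increase the limit or reduce memory use
  • Exit code 1 → Application error; check the logs
  • Exit code 137 → Killed by SIGKILL (128 + 9), often by the OOM killer
  • Exit code 143 → Terminated by SIGTERM (128 + 15), usually a graceful shutdown request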
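Exit codes above 128 follow the convention 128 + signal number, which is why 137 and 143 map to SIGKILL (9) and SIGTERM (15). A minimal sketch of that decoding follows; `explain_exit_code` is a hypothetical helper, and the signal table is deliberately limited to the two signals listed above.

```python
# Only the signals mentioned in the list above; extend as needed.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Give a short interpretation of a container exit code."""
    if code == 0:
        return "exited successfully"
    if code > 128:
        signum = code - 128  # 128 + N means the process was terminated by signal N
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return "application error; check logs"
```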

ImagePullBackOff

python
describe_pod(name, namespace)
get_secrets(namespace)
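One common reason to list secrets here is to confirm that every image pull secret the pod references actually exists in the namespace. The sketch below assumes the pod is available as a dict in Kubernetes API shape (`spec.imagePullSecrets`) and that the secret names have already been collected into a set; `missing_pull_secrets` is a hypothetical name, not one of the tools above.

```python
def missing_pull_secrets(pod: dict, secret_names: set[str]) -> list[str]:
    """Return imagePullSecrets referenced by the pod that are absent from the namespace."""
    refs = pod.get("spec", {}).get("imagePullSecrets") or []
    return [ref["name"] for ref in refs if ref["name"] not in secret_names]
```

An empty result does not rule out a bad credential inside an existing secret; it only rules out a dangling reference.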

Pending Pod

python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)

Runbook: Node Issues

Node NotReady

python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")

Node DiskPressure

python
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
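To read the node description programmatically, the pressure conditions can be filtered out of `status.conditions`. This sketch assumes the node is a dict in Kubernetes API shape; `active_pressure_conditions` is a hypothetical helper name.

```python
# Standard node pressure condition types reported by the kubelet.
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def active_pressure_conditions(node: dict) -> list[str]:
    """Return the pressure condition types currently reported as "True" on the node."""
    conditions = node.get("status", {}).get("conditions", [])
    return sorted(
        c["type"]
        for c in conditions
        if c.get("type") in PRESSURE_TYPES and c.get("status") == "True"
    )
```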

Runbook: Network Issues

Service Not Accessible

python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
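A Service with no endpoints often comes down to a selector that matches no pod labels. The sketch below reproduces that equality-based matching on plain dicts, assuming pods in Kubernetes API shape (`metadata.labels`); `pods_matching_selector` is a hypothetical helper, not one of the tools above.

```python
def pods_matching_selector(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """Return names of pods whose labels contain every key/value pair in the selector."""
    if not selector:
        # A Service without a selector gets no endpoints created automatically.
        return []
    matched = []
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels") or {}
        if all(labels.get(key) == value for key, value in selector.items()):
            matched.append(pod["metadata"]["name"])
    return matched
```

If this returns an empty list while pods are running, compare the Service selector with the pod template labels before looking at network policies.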

DNS Resolution Failures

python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")

With Cilium

python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)

With Istio

python
istio_analyze_tool(namespace)
istio_proxy_status_tool()

Runbook: Storage Issues

PVC Pending

python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
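A frequent cause of a Pending PVC is a StorageClass that does not exist, or no default class when none is named. The check below is a sketch under the assumption that the PVC and StorageClass objects are dicts in Kubernetes API shape; `storage_class_problem` is a hypothetical helper, and it covers only these two causes.

```python
from typing import Optional

# Annotation Kubernetes uses to mark the default StorageClass.
DEFAULT_SC_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def storage_class_problem(pvc: dict, storage_classes: list[dict]) -> Optional[str]:
    """Return a description of a StorageClass problem for the PVC, or None if none is found."""
    names = {sc["metadata"]["name"] for sc in storage_classes}
    requested = pvc.get("spec", {}).get("storageClassName")
    if requested == "":
        # An empty string explicitly opts out of dynamic provisioning.
        return None
    if requested is not None:
        if requested not in names:
            return f"StorageClass {requested!r} does not exist"
        return None
    has_default = any(
        (sc["metadata"].get("annotations") or {}).get(DEFAULT_SC_ANNOTATION) == "true"
        for sc in storage_classes
    )
    if not has_default:
        return "no storageClassName set and no default StorageClass found"
    return None
```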

Pod Stuck in ContainerCreating

python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)

Runbook: Control Plane Issues

API Server Unavailable

python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")

etcd Issues

python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")

Emergency Actions

Force Delete Pod

python
delete_pod(name, namespace, grace_period=0, force=True)

Rollback Deployment

python
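# Assumption: revision=0 mirrors `kubectl rollout undo --to-revision=0`, i.e. the previous revision.
rollback_deployment(name, namespace, revision=0)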

Helm Rollback

python
# revision=1 is an example value; choose the last known-good revision from the release history.
rollback_helm_release(name, namespace, revision=1)

Diagnostic Collection Script

For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.

Multi-Cluster Incident Response

Check all clusters:
python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
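During an incident one of the clusters may itself be unreachable, and a plain loop would stop at the first failure. A possible variant, sketched below, records the error per context and keeps going. `check_clusters` and its `check` callable are hypothetical; in practice `check` would wrap the calls shown in the loop above.

```python
from typing import Any, Callable


def check_clusters(
    contexts: list[str], check: Callable[[str], Any]
) -> tuple[dict[str, Any], dict[str, str]]:
    """Run `check` for every context; return (results, errors) keyed by context name."""
    results: dict[str, Any] = {}
    errors: dict[str, str] = {}
    for context in contexts:
        try:
            results[context] = check(context)
        except Exception as exc:
            # An unreachable cluster is a finding in itself, so record it and continue.
            errors[context] = str(exc)
    return results, errors
```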

Post-Incident

Document Timeline

  1. When did the incident start?
  2. What was the impact?
  3. What was the root cause?
  4. What fixed it?
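Answering these questions is easier when notes are timestamped as the incident unfolds. The following is one possible way to keep such notes; the `IncidentTimeline` class is hypothetical and not part of the skill's tools.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentTimeline:
    """Timestamped notes for one incident, rendered in chronological order."""

    title: str
    entries: list[tuple[datetime, str]] = field(default_factory=list)

    def add(self, note: str, when: Optional[datetime] = None) -> None:
        # Default to the current UTC time so notes from different responders line up.
        self.entries.append((when or datetime.now(timezone.utc), note))

    def render(self) -> str:
        lines = [self.title]
        for when, note in sorted(self.entries, key=lambda entry: entry[0]):
            lines.append(f"{when.isoformat()}  {note}")
        return "\n".join(lines)
```

Notes can be added out of order, for example when the start time is only established afterwards; `render()` sorts them by timestamp.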

Prevent Recurrence

  • Add monitoring/alerting
  • Improve resource limits
  • Add readiness probes
  • Document runbook

Related Skills

  • k8s-troubleshoot - Detailed debugging
  • k8s-security - Security incidents