kubernetes-operations

Kubernetes Operations

Comprehensive kubectl assistance for debugging, resource management, and cluster operations with token-efficient scripts.

BEFORE YOU START

This skill prevents 5 common errors and saves ~70% of tokens.

| Metric | Without Skill | With Skill |
|---|---|---|
| Pod Debugging | ~1200 tokens | ~400 tokens |
| Resource Listing | ~800 tokens | ~200 tokens |
| Cluster Health | ~1500 tokens | ~300 tokens |

Known Issues This Skill Prevents

  1. Running kubectl commands in wrong namespace/context
  2. Verbose output flooding context with unnecessary data
  3. Missing critical debugging steps (events, previous logs)
  4. Exposing secrets in plain text output
  5. Destructive operations without dry-run verification

Quick Start

Step 1: Verify Context

bash
kubectl config current-context
kubectl config get-contexts
Why this matters: Running commands in the wrong cluster can cause production incidents.
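
One way to make the context check hard to skip is to wrap it in a small guard. The sketch below is only an illustration: the allow-list `ALLOWED_CONTEXTS`, its entries, and both function names are hypothetical, and the only command taken from this step is `kubectl config current-context`.

```python
import subprocess

# Hypothetical allow-list; replace with the context names you actually use.
ALLOWED_CONTEXTS = {"dev-cluster", "staging-cluster"}


def current_context() -> str:
    """Return the active kubectl context (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def check_context(context: str, allowed=ALLOWED_CONTEXTS) -> None:
    """Raise if the given context is not one we expect to operate on."""
    if context not in allowed:
        raise RuntimeError(
            f"Refusing to continue: context {context!r} is not in {sorted(allowed)}"
        )
```

Calling `check_context(current_context())` before any mutating command turns a silent wrong-cluster mistake into an immediate error.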

Step 2: Debug a Pod

bash
uv run scripts/debug_pod.py <pod-name> [-n namespace]
Why this matters: The script combines describe, logs, and events into a condensed summary, saving ~800 tokens.

Step 3: Check Cluster Health

bash
uv run scripts/cluster_health.py
Why this matters: Quick overview of node status and unhealthy pods without verbose output.
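
The implementation of the bundled script is not shown here. As a rough illustration of the kind of check this step describes, the sketch below picks out nodes whose Ready condition is not "True" from JSON shaped like the output of `kubectl get nodes -o json`. The function name and the sample data are hypothetical.

```python
import json


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is anything but "True"."""
    unhealthy = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            unhealthy.append(node["metadata"]["name"])
    return unhealthy


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "node-b"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
]}
""")
print(not_ready_nodes(sample))  # ['node-b']
```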

Critical Rules

Always Do

  • Always verify kubectl config current-context before operations
  • Always use -n namespace to be explicit about target
  • Always use --dry-run=client -o yaml before applying changes
  • Always check events when debugging: kubectl get events --sort-by='.lastTimestamp'
  • Always use --previous flag when pod is in CrashLoopBackOff

Never Do

  • Never run kubectl delete in production without a --dry-run=client pass first
  • Never output secrets without filtering: avoid kubectl get secret -o yaml
  • Never assume the default namespace; always specify -n
  • Never ignore resource limits when debugging OOMKilled pods
  • Never skip describe when logs show no errors
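
Some of these rules can be checked mechanically before a command runs. The sketch below is a deliberately naive illustration, not part of the skill: `preflight` is a hypothetical helper that inspects a kubectl command given as an argument list and reports violations of three of the rules above. It will also flag cluster-scoped commands (for example on nodes) that legitimately take no namespace.

```python
def preflight(argv: list[str]) -> list[str]:
    """Return rule violations for a kubectl command given as an argv list."""
    problems = []

    # Rule: be explicit about the namespace.
    has_namespace = any(
        a in ("-n", "--namespace", "-A", "--all-namespaces")
        or a.startswith("--namespace=")
        for a in argv
    )
    if not has_namespace:
        problems.append("no explicit namespace (-n)")

    # Rule: deletes should be previewed with a dry run first.
    has_dry_run = any(
        a.startswith("--dry-run=") and a != "--dry-run=none" for a in argv
    )
    if "delete" in argv and not has_dry_run:
        problems.append("delete without a --dry-run pass")

    # Rule: do not dump whole secrets; -o yaml/json prints encoded values.
    full_dump = any(
        argv[i:i + 2] in (["-o", "yaml"], ["-o", "json"]) for i in range(len(argv))
    ) or any(a in ("-oyaml", "-ojson", "--output=yaml", "--output=json") for a in argv)
    if "get" in argv and ("secret" in argv or "secrets" in argv) and full_dump:
        problems.append("secret dumped with -o yaml/json")

    return problems


print(preflight(["kubectl", "delete", "pod", "web", "-n", "prod"]))
# ['delete without a --dry-run pass']
```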

Common Mistakes

Wrong:
bash
kubectl logs my-pod
Correct:
bash
kubectl logs my-pod -n my-namespace --tail=100 --timestamps
Why: The default namespace may not be the right one, unlimited logs flood the context, and timestamps help correlate log lines with events.
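
When logs are fetched from a script, the same defaults can be baked into a helper so the bare form is never produced. This is a minimal sketch; `logs_command` and its parameters are hypothetical names, and the flags are the ones from the correct example above plus `--previous`.

```python
def logs_command(pod: str, namespace: str, tail: int = 100,
                 previous: bool = False) -> list[str]:
    """Build a kubectl logs command with explicit namespace, tail and timestamps."""
    cmd = ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}", "--timestamps"]
    if previous:
        # Logs of the previous container instance, useful after a restart.
        cmd.append("--previous")
    return cmd


print(" ".join(logs_command("my-pod", "my-namespace")))
# kubectl logs my-pod -n my-namespace --tail=100 --timestamps
```

Returning an argument list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.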

Known Issues Prevention

| Issue | Root Cause | Solution |
|---|---|---|
| CrashLoopBackOff | App crash on startup | Check kubectl logs --previous and describe for exit codes |
| ImagePullBackOff | Registry auth or image tag | Verify image exists and check pull secrets |
| Pending pods | No schedulable nodes | Check node resources and pod affinity/tolerations |
| OOMKilled | Memory limit exceeded | Check container limits vs actual usage with kubectl top |
| Connection refused | Service selector mismatch | Verify pod labels match service selector |
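
The table above translates naturally into a lookup keyed by the symptom. The sketch below is illustrative only: the dictionary keys, the function name, and the fallback message are assumptions, while the suggested steps are copied from the table.

```python
# Symptom -> first thing to check, taken from the table above.
NEXT_STEPS = {
    "CrashLoopBackOff": "Check kubectl logs --previous and describe for exit codes",
    "ImagePullBackOff": "Verify image exists and check pull secrets",
    "Pending": "Check node resources and pod affinity/tolerations",
    "OOMKilled": "Check container limits vs actual usage with kubectl top",
    "Connection refused": "Verify pod labels match service selector",
}


def next_step(symptom: str) -> str:
    """Return the suggested first check for a symptom, with a generic fallback."""
    # Fallback wording is an assumption, based on the describe/events rules.
    return NEXT_STEPS.get(symptom, "Run kubectl describe and check recent events")


print(next_step("OOMKilled"))
# Check container limits vs actual usage with kubectl top
```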

Debugging Workflows

Pod Not Starting

bash
# 1. Get pod status and events
kubectl describe pod <name> -n <namespace>

# 2. Check logs (current or previous)
kubectl logs <name> -n <namespace> --tail=100
kubectl logs <name> -n <namespace> --previous  # If restarting

# 3. Check events for scheduling issues
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <name>

# 4. Interactive debugging
kubectl exec -it <name> -n <namespace> -- /bin/sh
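
Much of what `describe` reports about a failing container is also available as structured data from `kubectl get pod <name> -n <namespace> -o json`. The sketch below shows one way to condense it; it is not the bundled `debug_pod.py`, and the function name and sample pod are hypothetical. The field names follow the Pod status schema (`containerStatuses`, `restartCount`, `state.waiting`, `lastState.terminated`).

```python
def summarize_containers(pod: dict) -> list[dict]:
    """Condense per-container status from a Pod object into a few fields."""
    summary = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        last = cs.get("lastState", {}).get("terminated") or {}
        summary.append({
            "name": cs["name"],
            "restarts": cs.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),   # e.g. CrashLoopBackOff
            "last_exit_code": last.get("exitCode"),    # exit code of previous run
            "last_reason": last.get("reason"),         # e.g. OOMKilled
        })
    return summary


# Hypothetical sample shaped like `kubectl get pod -o json` output.
sample_pod = {
    "status": {
        "containerStatuses": [{
            "name": "app",
            "restartCount": 4,
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
        }]
    }
}
print(summarize_containers(sample_pod))
```

A non-zero restart count is the signal that the `--previous` logs in step 2 are worth fetching.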

Service Connectivity

bash
# 1. Verify service exists and has endpoints
kubectl get svc <name> -n <namespace>
kubectl get endpoints <name> -n <namespace>

# 2. Check pod labels match service selector
kubectl get pods -n <namespace> --show-labels

# 3. Test from within cluster (same namespace as the service)
kubectl run debug --rm -it --restart=Never --image=busybox -n <namespace> -- wget -qO- http://<service>:<port>

# 4. Port-forward for local testing
kubectl port-forward svc/<name> 8080:80 -n <namespace>
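
Step 2 is an equality check: every key/value pair in the Service selector must appear in the pod's labels. A minimal sketch of that comparison follows; the function name and sample labels are hypothetical, and only simple equality selectors (the kind a Service uses) are handled.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every selector key/value pair is present in the pod labels."""
    if not selector:
        # A Service without a selector does not pick up pods automatically.
        return False
    return all(labels.get(key) == value for key, value in selector.items())


service_selector = {"app": "web", "tier": "frontend"}
print(selector_matches(service_selector, {"app": "web", "tier": "frontend", "version": "v2"}))  # True
print(selector_matches(service_selector, {"app": "web", "tier": "backend"}))                    # False
```

A pod may carry extra labels and still match; a single differing or missing value is enough to leave the Service with no endpoints.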

Resource Management

Deployments

bash
# List deployments
kubectl get deployments -n <namespace>

# Scale
kubectl scale deployment <name> --replicas=3 -n <namespace>

# Rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Rollback
kubectl rollout undo deployment/<name> -n <namespace>

# History
kubectl rollout history deployment/<name> -n <namespace>

ConfigMaps and Secrets

bash
# List
kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>

# View ConfigMap data
kubectl get configmap <name> -n <namespace> -o jsonpath='{.data}'

# View Secret keys (NOT values)
kubectl get secret <name> -n <namespace> -o jsonpath='{.data}' | jq 'keys'

# Create from file
kubectl create configmap <name> --from-file=<path> -n <namespace> --dry-run=client -o yaml
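
The "keys, not values" idea can also be applied in a script that already holds the Secret as JSON. The sketch below mirrors what `jq 'keys'` does; the function name and the sample Secret (with placeholder values) are hypothetical.

```python
import json


def secret_keys(secret_json: str) -> list[str]:
    """Return the sorted key names of a Secret without touching the values."""
    secret = json.loads(secret_json)
    return sorted((secret.get("data") or {}).keys())


# Hypothetical sample shaped like `kubectl get secret <name> -o json` output.
sample_secret = """
{"kind": "Secret",
 "data": {"username": "YWRtaW4=", "password": "cGxhY2Vob2xkZXI="}}
"""
print(secret_keys(sample_secret))  # ['password', 'username']
```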

Cluster Operations

Node Management

bash
# List nodes with status
kubectl get nodes -o wide

# Node details
kubectl describe node <name>

# Cordon (prevent scheduling)
kubectl cordon <node>

# Drain (evict pods)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Uncordon
kubectl uncordon <node>

Resource Usage

bash
# Node resources
kubectl top nodes

# Pod resources
kubectl top pods -n <namespace>

# Sort by memory
kubectl top pods -n <namespace> --sort-by=memory
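
To compare the memory column of `kubectl top pods` with a container's limit (the OOMKilled check), both values need to be in the same unit. The sketch below converts common Kubernetes memory quantities to bytes; the function names are hypothetical, and only the Ki/Mi/Gi/Ti and k/M/G/T suffixes are handled, which is an assumption about the values you will encounter.

```python
# Multipliers for the suffixes handled by this sketch.
_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '128Mi' to a number of bytes."""
    # Try two-letter suffixes before one-letter ones.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # no suffix: already a byte count


def usage_ratio(used: str, limit: str) -> float:
    """Fraction of the limit currently in use."""
    return parse_memory(used) / parse_memory(limit)


print(parse_memory("128Mi"))          # 134217728
print(usage_ratio("450Mi", "512Mi"))  # 0.87890625
```

A ratio that sits close to 1.0 shortly before a restart is consistent with the container being killed for exceeding its memory limit.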

Bundled Resources

Scripts

Located in scripts/:
  • debug_pod.py - Comprehensive pod debugging with condensed output
  • get_resources.py - Resource summary using jsonpath for minimal tokens
  • cluster_health.py - Quick cluster status overview

References

Located in references/:
  • kubectl-cheatsheet.md - Condensed command reference
  • jsonpath-patterns.md - Common JSONPath expressions
  • debugging-flowchart.md - Decision tree for pod issues

Note: For deep dives on specific topics, see the reference files above.

Dependencies

Required

| Package | Version | Purpose |
|---|---|---|
| kubectl | 1.25+ | Kubernetes CLI |
| jq | 1.6+ | JSON parsing for scripts |

Optional

| Package | Version | Purpose |
|---|---|---|
| k9s | 0.27+ | Terminal UI for Kubernetes |
| stern | 1.25+ | Multi-pod log tailing |

Official Documentation

Troubleshooting

kubectl command not found

Symptoms:
command not found: kubectl
Solution:
bash
# macOS
brew install kubectl

# Verify
kubectl version --client

Context not set

Symptoms:
error: no context is currently set
Solution:
bash
# List available contexts
kubectl config get-contexts

# Set context
kubectl config use-context <context-name>

Permission denied

Symptoms:
Error from server (Forbidden)
Solution:
bash
# Check current user (auth whoami is not available in older kubectl releases)
kubectl auth whoami

# Check permissions
kubectl auth can-i get pods -n <namespace>
kubectl auth can-i --list -n <namespace>

Timeout connecting to cluster

Symptoms:
Unable to connect to the server: dial tcp: i/o timeout
Solution:
bash
# Check cluster endpoint
kubectl cluster-info

# Verify network connectivity
curl -k https://<cluster-api-endpoint>/healthz

# Check kubeconfig (config view redacts credentials, unlike cat ~/.kube/config)
kubectl config view --minify

Setup Checklist

Before using this skill, verify:
  • kubectl installed (kubectl version --client)
  • Kubeconfig configured (~/.kube/config exists)
  • Context set to correct cluster (kubectl config current-context)
  • Permissions verified (kubectl auth can-i get pods)
  • jq installed for JSON parsing (jq --version)
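
The tool-presence items in this checklist can be automated with the standard library. The sketch below is a minimal illustration with a hypothetical function name; it only checks that the executables are on PATH and does not verify versions, context, or permissions.

```python
import shutil


def missing_tools(tools=("kubectl", "jq")) -> list[str]:
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing required tools:", ", ".join(missing))
else:
    print("kubectl and jq found on PATH")
```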