eks-workload-best-practice-assessment

EKS Workload Best Practice Assessment

Assess Kubernetes workloads on Amazon EKS against best practices from K8s official documentation and the EKS Best Practices Guide. Covers 8 dimensions: workload configuration, security, observability, networking, storage, EKS platform integration, CI/CD, and image security.

Prerequisites

This skill requires:
  • aws knowledge mcp server tools:
    • `aws___search_documentation`: search AWS documentation
    • `aws___read_documentation`: read full documentation pages
    • `aws___recommend`: get related documentation
  • context7 MCP tools:
    • `context7_resolve-library-id`: resolve K8s library ID
    • `context7_query-docs`: query K8s documentation
  • AWS CLI (`aws`): configured with read access to the target EKS cluster and ECR
  • kubectl: configured to access the target EKS cluster
  • jq: for parsing JSON output from AWS CLI and kubectl commands

Scope Boundary

This skill focuses on workload-level checks: items that require `kubectl` or in-cluster inspection. It complements `aws-best-practice-research`, which covers the infrastructure layer (control plane, node groups, addons, etc.).

| This Skill (Workload Layer) | aws-best-practice-research (Infra Layer) |
| --- | --- |
| Pod resource requests/limits | Control plane configuration |
| Probes (liveness/readiness/startup) | Node group sizing and AZ distribution |
| PDB, topology constraints | Addon versions |
| Pod security context, PSA | Secrets envelope encryption |
| Network Policies | Cluster networking (VPC, subnets) |
| Service Accounts, RBAC | Authentication mode, Access Entries |
| Container image scanning | GuardDuty EKS protection |
| HPA/VPA/Karpenter workload config | Karpenter/CA infrastructure config |

Workflow

Step 1: Confirm Assessment Scope

Determine from user input:
  • Cluster name and AWS Region
  • Assessment scope:
    • Full cluster: assess all namespaces (excluding `kube-system`, `kube-public`, and `kube-node-lease` by default)
    • Specific namespaces: user-specified list
    • Specific workloads: user-specified Deployments/StatefulSets
  • Include infrastructure layer? Whether to also invoke `aws-best-practice-research` for the EKS infrastructure layer and merge results (default: yes)
If the user provides only a cluster name, default to full cluster assessment.
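
The default full-cluster scope could be sketched in shell roughly as follows. `in_scope_namespaces` is a hypothetical helper name, not part of the skill, and it assumes namespace names arrive one per line on stdin, for example from `kubectl get namespaces -o name`.

```bash
# Hypothetical helper: drop the namespaces excluded by the default scope.
# Reads namespace names on stdin, one per line, with or without the
# "namespace/" prefix that `kubectl get namespaces -o name` prints.
in_scope_namespaces() {
  # -x matches whole lines only, so "kube-system-tools" is kept.
  sed 's|^namespace/||' | grep -vxE 'kube-system|kube-public|kube-node-lease'
}

# Example against a live cluster (requires kubectl access):
#   kubectl get namespaces -o name | in_scope_namespaces
```

When the user names specific namespaces or workloads, that list would replace the output of this filter rather than pass through it.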

Step 2: Environment Detection & Version Awareness

Run the following commands to detect the environment:
```bash
# Cluster info via AWS CLI
aws eks describe-cluster --name {CLUSTER} --region {REGION}

# K8s version
kubectl version --output=json

# Node distribution
kubectl get nodes -o wide --no-headers
```

Record:
- **K8s server version** (e.g., 1.30) — used for version-aware filtering
- **EKS platform version** (e.g., eks.15)
- **Node count and AZ distribution**
- **Node instance types**

**Version-aware filtering rules** (apply in Step 3):
- K8s >= 1.25: Check Pod Security Admission (PSA), skip PodSecurityPolicy (PSP)
- K8s < 1.25: Check PSP, note PSA as upgrade recommendation
- K8s >= 1.20: Check Startup Probes
- K8s >= 1.19: Check Topology Spread Constraints
- K8s >= 1.29 + VPC CNI >= 1.21.1: Check Admin Network Policies
- EKS with Pod Identity available: Prefer Pod Identity over IRSA
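
A minimal sketch of how the version thresholds above might be evaluated in shell. Both function names are hypothetical, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does. A plain numeric comparison would wrongly rank 1.9 above 1.25, which is why version sort is used here.

```bash
# Hypothetical helper: succeed when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: print which pod-security mechanism to check for a
# given K8s version, following the 1.25 threshold in the rules above.
pod_security_check() {
  if version_ge "$1" "1.25"; then
    echo "PSA"
  else
    echo "PSP"
  fi
}

# Example: pod_security_check 1.30   (prints PSA)
```

The server version itself would come from the `kubectl version --output=json` output collected earlier; how the major and minor fields are normalized before comparison is left open here.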

Step 3: Dynamic Best Practice Research

Research the latest best practices using context7 and aws-knowledge-mcp-server. Run all queries sequentially (one at a time) to avoid rate limiting.
For each of the 8 assessment dimensions, execute the search queries defined in `references/search-queries.md`. The general flow per dimension is:
  1. Query context7 (`/websites/kubernetes_io`) for K8s official best practices
  2. Query aws-knowledge-mcp-server for EKS-specific best practices
  3. Read key documentation pages from search results (max 2-3 pages per dimension)
  4. Extract check items with specific thresholds and conditions
After all research is complete, merge results with the baseline framework in `references/check-dimensions.md` to ensure no critical dimension is missed.
Apply version-aware filtering from Step 2 to remove inapplicable items and add version-specific recommendations.
Rate limit protection: If any MCP request returns "Too many requests", wait 5 seconds and retry once. If it fails again, skip and continue. Sequential execution is mandatory.
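
The retry rule can be illustrated with a small shell wrapper. MCP requests are issued by the agent rather than by a shell, so this only sketches the control flow. `run_with_one_retry` and `RETRY_DELAY_SECONDS` are hypothetical names, and the command passed in stands for whatever actually sends the request.

```bash
# Hypothetical wrapper illustrating the rate-limit rule: run a command and,
# if its output contains "Too many requests", wait and retry exactly once.
# Returns 0 and prints the output on success; returns 1 when the caller
# should skip this query and continue with the next one.
RETRY_DELAY_SECONDS="${RETRY_DELAY_SECONDS:-5}"

run_with_one_retry() {
  local output attempt
  for attempt in 1 2; do
    output="$("$@" 2>&1)"
    case "$output" in
      *"Too many requests"*) ;;                      # rate limited
      *) printf '%s\n' "$output"; return 0 ;;
    esac
    # Sleep only between the first and the second attempt.
    if [ "$attempt" -eq 1 ]; then
      sleep "$RETRY_DELAY_SECONDS"
    fi
  done
  return 1
}
```

Because each call blocks until it returns or gives up, invoking the wrapper once per query in a plain loop also keeps the queries sequential.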

Step 4: Infrastructure Layer Assessment (Optional)

If infrastructure layer assessment is included (default: yes):
  1. Invoke the `aws-best-practice-research` skill for the EKS cluster
  2. Store the infrastructure-layer checklist and assessment results
  3. These will be merged into the final report in Step 7
If the user opts out, skip this step.

Step 5: Workload Data Collection

Collect workload configurations using `kubectl`. Independent commands can run in parallel (they are not subject to MCP rate limits).
See `references/kubectl-assessment-commands.md` for the complete command list. Key data to collect:
```bash
# Core workloads
kubectl get deployments,statefulsets,daemonsets,jobs,cronjobs --all-namespaces -o json

# Pod specifications (within workloads above)
# Already included in the -o json output

# Disruption and scaling
kubectl get pdb,hpa --all-namespaces -o json

# Networking
kubectl get networkpolicies,services,ingresses --all-namespaces -o json

# Security
kubectl get serviceaccounts --all-namespaces -o json
# --all-namespaces is needed so the namespaced rolebindings are not limited
# to the current namespace
kubectl get clusterrolebindings,rolebindings --all-namespaces -o json

# Storage (--all-namespaces is needed for the namespaced PVCs)
kubectl get pvc,storageclass --all-namespaces -o json

# Namespace labels (for PSA)
kubectl get namespaces -o json

# Events (recent issues)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' -o json
```

For **ECR image scanning** (if images are from ECR):
```bash
# For each unique ECR image found in workloads
aws ecr describe-image-scan-findings --repository-name {REPO} --image-id imageTag={TAG}
aws ecr describe-repositories --repository-names {REPO}
aws ecr get-lifecycle-policy --repository-name {REPO}
```

Filter collected data to the assessment scope (namespaces/workloads from Step 1).
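
One way the `{REPO}` and `{TAG}` placeholders might be derived from a container image reference is sketched here. `parse_ecr_image` is a hypothetical helper, and the registry host pattern it matches is an assumption about how private ECR image URIs are formed. Images pinned by digest are mapped to `imageDigest=` instead of `imageTag=`.

```bash
# Hypothetical helper: split an ECR image reference into the repository
# name and an --image-id value for the aws ecr commands.
# Prints "<repo> <image-id>"; returns 1 for images not hosted in ECR.
parse_ecr_image() {
  local image="$1" path repo
  case "$image" in
    # Assumed host form: <account>.dkr.ecr.<region>.amazonaws.com
    # (the trailing * also admits .com.cn hosts)
    *.dkr.ecr.*.amazonaws.com*/*) ;;
    *) return 1 ;;
  esac
  path="${image#*/}"                    # strip the registry host
  case "$path" in
    *@*) repo="${path%%@*}"             # digest reference
         echo "${repo%%:*} imageDigest=${path##*@}" ;;
    *:*) echo "${path%:*} imageTag=${path##*:}" ;;   # tag reference
    *)   echo "$path imageTag=latest" ;;             # untagged means latest
  esac
}

# Example usage (requires AWS credentials):
#   if ref=$(parse_ecr_image "$image"); then
#     aws ecr describe-image-scan-findings \
#       --repository-name "${ref% *}" --image-id "${ref#* }"
#   fi
```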

Step 6: Per-Dimension Assessment

For each check item from the research phase (Step 3), evaluate every in-scope workload:

| Status | Meaning |
| --- | --- |
| PASS | The workload configuration meets or exceeds the recommendation |
| FAIL | The workload configuration does not meet the recommendation |
| WARN | Cannot be fully verified, or partially meets the recommendation |
| N/A | The check does not apply (e.g., storage checks for stateless workloads) |

For each finding, record:
  • Check item ID and name
  • Status (PASS/FAIL/WARN/N/A)
  • Actual value observed (not just "not configured")
  • The specific workload(s) affected
  • Version relevance notes (if any)
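
As an illustration of recording the observed value per workload, the following sketch lists containers without a memory request. `find_missing_memory_requests` is a hypothetical helper and the tab-separated output layout is an assumption, not the report format. It reads the workload JSON collected in Step 5 on stdin and relies on `jq` from the prerequisites.

```bash
# Hypothetical helper: emit one FAIL finding per container that has no
# memory request. Reads `kubectl get ... -o json` List output on stdin.
# Output columns (tab-separated, an assumed layout):
#   status, kind, namespace/name, container, observed value
find_missing_memory_requests() {
  jq -r '
    .items[]
    | . as $w
    # Kinds without .spec.template (e.g. CronJob) yield no containers here.
    | ($w.spec.template.spec.containers // [])[]
    | select(.resources.requests.memory == null)
    | ["FAIL", $w.kind, "\($w.metadata.namespace)/\($w.metadata.name)",
       .name, "resources.requests.memory: not set"]
    | @tsv
  '
}

# Example usage (requires kubectl access):
#   kubectl get deployments,statefulsets,daemonsets --all-namespaces -o json \
#     | find_missing_memory_requests
```

CronJobs nest their pod template under `spec.jobTemplate`, so a complete check would need a second path for them; this sketch simply skips them.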

Step 7: Generate Report and Save to Local File

Generate a single comprehensive report using the template in `references/output-template.md` and write it directly to a local markdown file.
IMPORTANT: File Writing Rules
  • Use the Write/file tool (not bash heredoc/echo/cat) to create the report file
  • If the report is too large for a single write, split it into sections: write the file with the first half, then use an append/edit operation to add the remaining sections
  • Do NOT output the full report content to the terminal
Use the following file naming convention:
```bash
# Timestamp prefix in year-month-day-hour-minute-second form
TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
# Lowercase the cluster name and replace spaces, colons, and slashes with hyphens
CLUSTER_SLUG=$(echo "{CLUSTER_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '-')
```
Assessment Report (see `references/output-template.md`):
  • Full cluster overview
  • Compliance scorecard with rating scale, top 3 priorities, and quick stats
  • Dimension-by-dimension assessment tables
  • Per-workload detail section
  • Critical issues and prioritized remediation
  • Data sources and reference links
  • Save to: `${TIMESTAMP}-${CLUSTER_SLUG}-assessment-report.md`
If infrastructure layer results exist from Step 4, merge them into the report.
After saving, print a brief summary to the terminal listing only:
  • The file path of the generated report
  • Overall compliance score
  • Number of PASS / FAIL / WARN findings

Step 8: Remediation Guidance & Next Steps

After saving the report, offer:
  • "I can help fix specific FAIL items — which ones would you like to address?"
  • "I can re-run the assessment after remediation to verify improvements."
For Critical Issues (FAIL + High priority), provide:
  • Specific remediation commands or manifest changes
  • Whether the fix requires workload restart or is in-place
  • Impact assessment of the change

Important Guidelines

  • Be comprehensive: The value of this skill is thoroughness. Better to include a check and mark it N/A than to miss it.
  • Always cite sources: Every check item must reference its source (EKS Best Practices Guide, K8s official docs, etc.).
  • Sequential MCP queries: All context7 and aws-knowledge-mcp requests must be sequential. kubectl commands can be parallel.
  • Rate limit protection: Wait 5s and retry once on "Too many requests". Skip on second failure.
  • Version awareness: Always filter checks by detected K8s/EKS version. Never recommend features unavailable in the cluster's version.
  • Actual values in findings: Always report what was observed, not just "not configured". Good: "`resources.requests.memory: not set` (container has no memory request)". Bad: "Memory request missing".
  • Per-workload granularity: Report findings at the individual Deployment/StatefulSet level, not just cluster-wide summaries.
  • Exclude system namespaces by default: Skip `kube-system`, `kube-public`, and `kube-node-lease` unless the user explicitly includes them.
  • Respect language: Output in the same language as the user's conversation.
  • Infrastructure vs workload boundary: Never duplicate checks from `aws-best-practice-research`. This skill handles ONLY what requires kubectl/in-cluster access.