eks-workload-best-practice-assessment
EKS Workload Best Practice Assessment
Assess Kubernetes workloads on Amazon EKS against best practices from K8s official documentation
and the EKS Best Practices Guide. Covers 8 dimensions: workload configuration, security,
observability, networking, storage, EKS platform integration, CI/CD, and image security.
Prerequisites
This skill requires:
- aws-knowledge MCP server tools:
  - `aws___search_documentation`: search AWS documentation
  - `aws___read_documentation`: read full documentation pages
  - `aws___recommend`: get related documentation
- context7 MCP tools:
  - `context7_resolve-library-id`: resolve the K8s library ID
  - `context7_query-docs`: query K8s documentation
- AWS CLI (`aws`): configured with read access to the target EKS cluster and ECR
- kubectl: configured to access the target EKS cluster
- jq: for parsing JSON output from AWS CLI and kubectl commands
Scope Boundary
This skill focuses on workload-level checks: items that require `kubectl` or in-cluster inspection. It complements `aws-best-practice-research`, which covers the infrastructure layer (control plane, node groups, add-ons, etc.).

| This Skill (Workload Layer) | aws-best-practice-research (Infra Layer) |
|---|---|
| Pod resource requests/limits | Control plane configuration |
| Probes (liveness/readiness/startup) | Node group sizing and AZ distribution |
| PDB, topology constraints | Addon versions |
| Pod security context, PSA | Secrets envelope encryption |
| Network Policies | Cluster networking (VPC, subnets) |
| Service Accounts, RBAC | Authentication mode, Access Entries |
| Container image scanning | GuardDuty EKS protection |
| HPA/VPA/Karpenter workload config | Karpenter/CA infrastructure config |
Workflow
Step 1: Confirm Assessment Scope
Determine from user input:
- Cluster name and AWS Region
- Assessment scope:
  - Full cluster: assess all namespaces (excluding `kube-system`, `kube-public`, and `kube-node-lease` by default)
  - Specific namespaces: user-specified list
  - Specific workloads: user-specified Deployments/StatefulSets
- Include infrastructure layer? Whether to also invoke `aws-best-practice-research` for the EKS infrastructure layer and merge the results (default: yes)
If the user provides only a cluster name, default to full cluster assessment.
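The default scoping rule can be written as a small predicate. The following is a minimal sketch that assumes the user's namespace list arrives as a space-separated string. The function name `ns_in_scope` is hypothetical. An explicit list is treated as an opt-in, so it can include system namespaces.

```bash
# Hypothetical helper: decide whether a namespace is in the assessment scope.
# System namespaces excluded by default (full-cluster mode).
DEFAULT_EXCLUDED="kube-system kube-public kube-node-lease"

# ns_in_scope NAMESPACE [SCOPE_NAMESPACES]
#   SCOPE_NAMESPACES: space-separated user list; empty means "full cluster".
#   Returns 0 if the namespace is in scope, 1 otherwise.
ns_in_scope() {
  local ns="$1" scope="${2:-}" item
  if [[ -n "$scope" ]]; then
    # Specific namespaces: only the listed ones count (explicit opt-in,
    # so a listed system namespace is included).
    for item in $scope; do
      [[ "$item" == "$ns" ]] && return 0
    done
    return 1
  fi
  # Full cluster: everything except the default system namespaces.
  for item in $DEFAULT_EXCLUDED; do
    [[ "$item" == "$ns" ]] && return 1
  done
  return 0
}

# Usage (illustrative):
#   kubectl get namespaces -o name | sed 's|^namespace/||' |
#     while read -r ns; do ns_in_scope "$ns" && echo "$ns"; done
```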
Step 2: Environment Detection & Version Awareness
Run the following commands to detect the environment:

```bash
# Cluster info via AWS CLI
aws eks describe-cluster --name {CLUSTER} --region {REGION}

# K8s version
kubectl version --output=json

# Node distribution (-o wide omits AZ and instance type, so add them as label columns)
kubectl get nodes -o wide --no-headers \
  -L topology.kubernetes.io/zone,node.kubernetes.io/instance-type
```

Record:
- **K8s server version** (e.g., 1.30): used for version-aware filtering
- **EKS platform version** (e.g., eks.15)
- **Node count and AZ distribution**
- **Node instance types**

**Version-aware filtering rules** (apply in Step 3):
- K8s >= 1.25: Check Pod Security Admission (PSA), skip PodSecurityPolicy (PSP)
- K8s < 1.25: Check PSP, note PSA as an upgrade recommendation
- K8s >= 1.20: Check Startup Probes
- K8s >= 1.19: Check Topology Spread Constraints
- K8s >= 1.29 + VPC CNI >= 1.21.1: Check Admin Network Policies
- EKS with Pod Identity available: Prefer Pod Identity over IRSA
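The filtering rules above reduce to comparisons on the minor version. The following is a minimal sketch. It assumes the server version comes from the `serverVersion.minor` field of `kubectl version --output=json`, which on EKS can carry a trailing `+` (e.g. `30+`), and it assumes major version 1. The helper names and check labels are hypothetical; they are not IDs from `references/check-dimensions.md`. The VPC CNI condition for Admin Network Policies is not evaluated here.

```bash
# Hypothetical helpers for version-aware filtering (Step 2 rules).

# k8s_minor MINOR_FIELD -> integer minor version, e.g. "30+" -> 30
k8s_minor() {
  local m="${1//[^0-9]/}"   # drop non-digit suffixes such as "+"
  echo "$((10#$m))"
}

# applicable_checks MINOR -> prints labels of checks that apply to a 1.MINOR cluster
applicable_checks() {
  local minor="$1"
  if (( minor >= 25 )); then
    echo "pod-security-admission"
  else
    echo "pod-security-policy"
    echo "psa-upgrade-recommendation"
  fi
  (( minor >= 20 )) && echo "startup-probes"
  (( minor >= 19 )) && echo "topology-spread-constraints"
  # Also requires VPC CNI >= 1.21.1, which must be checked separately.
  (( minor >= 29 )) && echo "admin-network-policies"
  return 0
}

# Usage (illustrative, requires kubectl and jq):
#   minor=$(k8s_minor "$(kubectl version --output=json | jq -r '.serverVersion.minor')")
#   applicable_checks "$minor"
```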
Step 3: Dynamic Best Practice Research
Research the latest best practices using context7 and aws-knowledge-mcp-server.
Run all queries sequentially (one at a time) to avoid rate limiting.
For each of the 8 assessment dimensions, execute the search queries defined in `references/search-queries.md`. The general flow per dimension is:
- Query context7 (`/websites/kubernetes_io`) for K8s official best practices
- Query aws-knowledge-mcp-server for EKS-specific best practices
- Read key documentation pages from the search results (max 2-3 pages per dimension)
- Extract check items with specific thresholds and conditions

After all research is complete, merge the results with the baseline framework in `references/check-dimensions.md` to ensure no critical dimension is missed. Apply the version-aware filtering from Step 2 to remove inapplicable items and add version-specific recommendations.
Rate limit protection: If any MCP request returns "Too many requests", wait 5 seconds
and retry once. If it fails again, skip and continue. Sequential execution is mandatory.
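The agent makes MCP calls directly rather than through a shell, so this is only an illustration of the retry policy, applied to an arbitrary command. It assumes that a rate-limited call exits non-zero and prints "Too many requests". `with_rate_limit_retry` and `RETRY_DELAY` are hypothetical names, and the delay defaults to the 5 seconds stated above. Treating other errors as hard failures is this sketch's own choice.

```bash
# Hypothetical wrapper: run a command; on a "Too many requests" failure,
# wait RETRY_DELAY seconds (default 5) and retry exactly once.
# After a second failure it prints a skip notice and returns 0,
# so a sequential loop moves on to the next query.
with_rate_limit_retry() {
  local delay="${RETRY_DELAY:-5}" out
  if out=$("$@" 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  if [[ "$out" != *"Too many requests"* ]]; then
    printf '%s\n' "$out" >&2
    return 1              # not a rate-limit error: surface it
  fi
  sleep "$delay"
  if out=$("$@" 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  echo "SKIPPED after retry: $*" >&2
  return 0
}
```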
Step 4: Infrastructure Layer Assessment (Optional)
If infrastructure layer assessment is included (default: yes):
- Invoke the `aws-best-practice-research` skill for the EKS cluster
- Store the infrastructure-layer checklist and assessment results
- These will be merged into the final report in Step 7
If the user opts out, skip this step.
Step 5: Workload Data Collection
Collect workload configurations using `kubectl`. Independent commands can run in parallel (they are not subject to MCP rate limits).
See `references/kubectl-assessment-commands.md` for the complete command list. Key data to collect:

```bash
# Core workloads (Pod specifications are already included in this -o json output)
kubectl get deployments,statefulsets,daemonsets,jobs,cronjobs --all-namespaces -o json

# Disruption and scaling
kubectl get pdb,hpa --all-namespaces -o json

# Networking
kubectl get networkpolicies,services,ingresses --all-namespaces -o json

# Security (RoleBindings are namespaced, so --all-namespaces is required)
kubectl get serviceaccounts --all-namespaces -o json
kubectl get clusterrolebindings,rolebindings --all-namespaces -o json

# Storage (PVCs are namespaced; StorageClasses are cluster-scoped)
kubectl get pvc,storageclass --all-namespaces -o json

# Namespace labels (for PSA)
kubectl get namespaces -o json

# Events (recent issues)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' -o json
```
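Because these reads are independent, they can run as background jobs. The following is a minimal sketch that assumes bash, a writable output directory, and `--all-namespaces` on every namespaced read. `collect_workload_data` and the output file names are hypothetical.

```bash
# Hypothetical collector: run the independent kubectl reads in parallel,
# one JSON file per resource group; returns non-zero if any read fails.
collect_workload_data() {
  local out_dir="$1" pid status=0
  local -a pids=()
  mkdir -p "$out_dir"

  kubectl get deployments,statefulsets,daemonsets,jobs,cronjobs --all-namespaces -o json > "$out_dir/workloads.json" & pids+=($!)
  kubectl get pdb,hpa --all-namespaces -o json > "$out_dir/scaling.json" & pids+=($!)
  kubectl get networkpolicies,services,ingresses --all-namespaces -o json > "$out_dir/networking.json" & pids+=($!)
  kubectl get serviceaccounts --all-namespaces -o json > "$out_dir/serviceaccounts.json" & pids+=($!)
  kubectl get clusterrolebindings,rolebindings --all-namespaces -o json > "$out_dir/rbac.json" & pids+=($!)
  kubectl get pvc,storageclass --all-namespaces -o json > "$out_dir/storage.json" & pids+=($!)
  kubectl get namespaces -o json > "$out_dir/namespaces.json" & pids+=($!)
  kubectl get events --all-namespaces --sort-by='.lastTimestamp' -o json > "$out_dir/events.json" & pids+=($!)

  # Wait on each job individually so that a single failure is not masked.
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1
  done
  return "$status"
}
```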
For **ECR image scanning** (if images are from ECR):

```bash
# For each unique ECR image found in workloads
aws ecr describe-image-scan-findings --repository-name {REPO} --image-id imageTag={TAG}
aws ecr describe-repositories --repository-names {REPO}
aws ecr get-lifecycle-policy --repository-name {REPO}
```
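To fill in `{REPO}` and `{TAG}`, split each container image reference into a repository and a tag or digest. The following is a minimal sketch. It assumes private ECR references of the form `<account>.dkr.ecr.<region>.amazonaws.com[.cn]/<repo>[:tag][@digest]`. `parse_ecr_image` is hypothetical, and its pipe-separated output is an arbitrary choice. For digest-pinned images it emits `imageDigest=...`, which `--image-id` accepts in place of `imageTag=...`.

```bash
# Hypothetical helper: print "<repo>|<image-id argument>" for an ECR image.
# Returns 1 (no output) for non-ECR images such as Docker Hub or public.ecr.aws.
parse_ecr_image() {
  local image="$1" path repo
  local re='^[0-9]{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com(\.cn)?/(.+)$'
  [[ "$image" =~ $re ]] || return 1
  path="${BASH_REMATCH[2]}"

  if [[ "$path" == *@* ]]; then
    # Digest-pinned (optionally also tagged): repo[:tag]@sha256:...
    repo="${path%%@*}"
    repo="${repo%:*}"          # drop an optional tag; repo names contain no ':'
    echo "${repo}|imageDigest=${path#*@}"
  elif [[ "${path##*/}" == *:* ]]; then
    # The tag follows the last ':' in the final path segment
    echo "${path%:*}|imageTag=${path##*:}"
  else
    # No tag given: the runtime pulls "latest"
    echo "${path}|imageTag=latest"
  fi
}

# Usage (illustrative, requires jq; CronJobs nest the Pod template deeper):
#   jq -r '.items[] | (.spec.template.spec // .spec.jobTemplate.spec.template.spec)
#          | .containers[].image' workloads.json | sort -u |
#     while read -r img; do parse_ecr_image "$img"; done
```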
Filter the collected data to the assessment scope (namespaces/workloads from Step 1).
Step 6: Per-Dimension Assessment
For each check item from the research phase (Step 3), evaluate every in-scope workload:
| Status | Meaning |
|---|---|
| PASS | The workload configuration meets or exceeds the recommendation |
| FAIL | The workload configuration does not meet the recommendation |
| WARN | Cannot be fully verified, or partially meets the recommendation |
| N/A | The check does not apply (e.g., storage checks for stateless workloads) |
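Some checks map directly onto these statuses. The following is a minimal sketch for Pod Security Admission (K8s >= 1.25). It reads the namespace list from `kubectl get namespaces -o json` and assumes jq is installed. The mapping is an illustrative policy rather than one taken from the EKS Best Practices Guide: a `restricted` or `baseline` enforce label is PASS, `privileged` is WARN, and a missing label is FAIL. `psa_status` is a hypothetical name.

```bash
# Hypothetical PSA check: one TSV line per namespace
# (namespace, status, actual observed value). Input: namespaces JSON on stdin.
psa_status() {
  jq -r '
    .items[]
    | .metadata.name as $ns
    | (.metadata.labels["pod-security.kubernetes.io/enforce"] // "not set") as $level
    | (if $level == "restricted" or $level == "baseline" then "PASS"
       elif $level == "privileged" then "WARN"
       else "FAIL" end) as $status
    | [$ns, $status, "pod-security.kubernetes.io/enforce: " + $level]
    | @tsv'
}
```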
For each finding, record:
- Check item ID and name
- Status (PASS/FAIL/WARN/N/A)
- Actual value observed (not just "not configured")
- The specific workload(s) affected
- Version relevance notes (if any)
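The following is a minimal sketch of one per-workload check that records the actual observed value. It assumes jq and the workload list from `kubectl get deployments,statefulsets,daemonsets,jobs,cronjobs --all-namespaces -o json`. Only regular `containers` are covered; init and ephemeral containers are not. `memory_request_findings` is a hypothetical name.

```bash
# Hypothetical check: resources.requests.memory per container.
# Output (TSV): kind, namespace, workload, container, status, actual value.
# CronJobs nest the Pod template one level deeper than other controllers.
memory_request_findings() {
  jq -r '
    .items[]
    | . as $w
    | (if .kind == "CronJob" then .spec.jobTemplate.spec.template.spec
       else .spec.template.spec end) as $pod
    | ($pod.containers // [])[]
    | (.resources.requests.memory // "not set") as $mem
    | [$w.kind, $w.metadata.namespace, $w.metadata.name, .name,
       (if $mem == "not set" then "FAIL" else "PASS" end),
       "resources.requests.memory: " + $mem]
    | @tsv'
}
```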
Step 7: Generate Report and Save to Local File
Generate a single comprehensive report using the template in `references/output-template.md` and write it directly to a local markdown file.
IMPORTANT (file writing rules):
- Use the Write/file tool (not bash heredoc/echo/cat) to create the report file
- If the report is too large for a single write, split it into sections: write the file with the first half, then use an append/edit operation to add the remaining sections
- Do NOT output the full report content to the terminal

Use the following file naming convention:

```bash
# Timestamp and a lowercase, filename-safe cluster slug
TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
# Map space, colon, and slash to hyphens (one replacement character per input character)
CLUSTER_SLUG=$(echo "{CLUSTER_NAME}" | tr '[:upper:]' '[:lower:]' | tr ' :/' '---')
```

Assessment Report (see `references/output-template.md`):
- Full cluster overview
- Compliance scorecard with rating scale, top 3 priorities, and quick stats
- Dimension-by-dimension assessment tables
- Per-workload detail section
- Critical issues and prioritized remediation
- Data sources and reference links
- Save to: `${TIMESTAMP}-${CLUSTER_SLUG}-assessment-report.md`
If infrastructure layer results exist from Step 4, merge them into the report.
After saving, print a brief summary to the terminal listing only:
- The file path of the generated report
- Overall compliance score
- Number of PASS / FAIL / WARN findings
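The terminal summary can be derived from the findings. The following is a minimal sketch. It assumes findings are stored as TSV lines with the status in a known column, and it uses an illustrative score of PASS / (PASS + FAIL + WARN), rounded and excluding N/A. The actual rating scale is defined in `references/output-template.md`. `summarize_findings` is a hypothetical name.

```bash
# Hypothetical summary: count statuses in column STATUS_COL (default 5) of
# TSV findings and print an illustrative score that excludes N/A.
summarize_findings() {
  local status_col="${1:-5}"
  awk -F'\t' -v col="$status_col" '
    { count[$col]++ }
    END {
      p = count["PASS"] + 0; f = count["FAIL"] + 0; w = count["WARN"] + 0
      total = p + f + w
      score = (total > 0) ? int(100 * p / total + 0.5) : 0
      printf "Score: %d%% | PASS: %d | FAIL: %d | WARN: %d | N/A: %d\n",
             score, p, f, w, count["N/A"] + 0
    }'
}
```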
Step 8: Remediation Guidance & Next Steps
After saving the report, offer:
- "I can help fix specific FAIL items — which ones would you like to address?"
- "I can re-run the assessment after remediation to verify improvements."
For Critical Issues (FAIL + High priority), provide:
- Specific remediation commands or manifest changes
- Whether the fix requires workload restart or is in-place
- Impact assessment of the change
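For a missing memory request, the remediation can be generated per finding. The following is a minimal sketch that only prints commands for review. It uses `kubectl set resources` with `--dry-run=server`, so nothing changes until that flag is removed. The `{MEMORY}` placeholder is hypothetical and should be replaced with a value based on observed usage. Only Deployment, StatefulSet, and DaemonSet are handled. `suggest_memory_fix` is a hypothetical name. Changing a Pod template triggers a rolling restart, so this fix is not in-place.

```bash
# Hypothetical generator: print a reviewable fix for a missing memory request.
# Args: KIND NAMESPACE WORKLOAD CONTAINER [MEMORY]
suggest_memory_fix() {
  local kind="$1" ns="$2" name="$3" container="$4" mem="${5:-}" res
  [ -n "$mem" ] || mem='{MEMORY}'      # placeholder: derive from observed usage
  case "$kind" in
    Deployment|StatefulSet|DaemonSet) ;;
    *) echo "# $kind/$name: edit the manifest (not handled by this sketch)"; return 1 ;;
  esac
  # kubectl accepts lower-case resource names such as deployment/web
  res=$(printf '%s' "$kind" | tr '[:upper:]' '[:lower:]')
  echo "# Impact: Pod template change -> rolling restart of $kind/$name"
  echo "kubectl set resources ${res}/${name} -n ${ns} -c ${container} --requests=memory=${mem} --dry-run=server -o yaml"
}
```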
Important Guidelines
- Be comprehensive: The value of this skill is thoroughness. It is better to include a check and mark it N/A than to miss it.
- Always cite sources: Every check item must reference its source (EKS Best Practices Guide, K8s official docs, etc.).
- Sequential MCP queries: All context7 and aws-knowledge-mcp requests must be sequential. kubectl commands can run in parallel.
- Rate limit protection: Wait 5 seconds and retry once on "Too many requests". Skip on the second failure.
- Version awareness: Always filter checks by the detected K8s/EKS version. Never recommend features unavailable in the cluster's version.
- Actual values in findings: Always report what was observed, not just "not configured".
  Good: "`resources.requests.memory: not set` (container has no memory request)". Bad: "Memory request missing".
- Per-workload granularity: Report findings at the individual Deployment/StatefulSet level, not just as cluster-wide summaries.
- Exclude system namespaces by default: Skip `kube-system`, `kube-public`, and `kube-node-lease` unless the user explicitly includes them.
- Respect language: Output in the same language as the user's conversation.
- Infrastructure vs workload boundary: Never duplicate checks from `aws-best-practice-research`. This skill handles ONLY what requires kubectl/in-cluster access.