k8s-hpa-cost-tuning
Kubernetes HPA Cost & Scale-Down Tuning
Mode selection (mandatory)

Declare a mode before executing this skill. All reasoning, thresholds, and recommendations depend on this choice.

```text
mode = audit | incident
```

If no mode is provided, refuse to run and request clarification.
When to use

mode = audit — Periodic cost-savings audit

Run on a schedule (weekly or bi-weekly) to:
- Detect over-reservation early
- Validate that scale-down and node consolidation still work
- Identify safe opportunities to reduce cluster cost

This mode assumes no active incident and prioritizes stability-preserving recommendations.
mode = incident — Post-incident scaling analysis

Run after a production incident or anomaly, attaching:
- Production logs
- HPA events
- Scaling timelines

This mode focuses on:
- Explaining why scaling behaved the way it did
- Distinguishing traffic-driven vs configuration-driven incidents
- Preventing recurrence without overcorrecting

This skill assumes Datadog for observability and standard Kubernetes HPA + Cluster Autoscaler.
Core mental model
Kubernetes scaling is a three-layer system:
- HPA decides how many pods (based on usage / requests)
- Scheduler decides where pods go (based on requests + constraints)
- Cluster Autoscaler decides how many nodes exist (only when nodes can empty)
Cost optimization only works if all three layers can move downward.
Key takeaway: HPA decides quantity, scheduler decides placement, autoscaler decides cost. Scale-up can be aggressive; scale-down must be possible. If replicas drop but nodes do not, the scheduler is the bottleneck.
Key Datadog metrics

The utility scripts query three metric families:
- CPU used % — real utilization (kubernetes.cpu.usage.total / node.cpu_allocatable)
- CPU requested % — reserved on paper (kubernetes.cpu.requests / node.cpu_allocatable)
- Memory used vs requests — HPA-relevant ratio

CPU requested % must go down after scale-down for cost savings to be real. If memory usage stays above target, memory drives scale-up even when CPU is idle.
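As an illustrative sketch, the two ratios can be expressed as Datadog dashboard formulas. The tag names (e.g. cluster_name) are assumptions that depend on your tagging setup, and kubernetes.cpu.usage.total is reported in nanocores, so a unit conversion against allocatable cores may be needed:

```text
CPU used %:      100 * sum:kubernetes.cpu.usage.total{cluster_name:<cluster>} / sum:node.cpu_allocatable{cluster_name:<cluster>}
CPU requested %: 100 * sum:kubernetes.cpu.requests{cluster_name:<cluster>} / sum:node.cpu_allocatable{cluster_name:<cluster>}
```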
Scale-down as a first-class cost control
When scale-down is slow or blocked:
- Replicas plateau
- Pods remain evenly spread
- Nodes never empty
- Cluster Autoscaler cannot remove nodes
Result: permanent over-reservation.
Recommended HPA scale-down policy

```yaml
scaleDown:
  stabilizationWindowSeconds: 60
  selectPolicy: Max
  policies:
    - type: Percent
      value: 50
      periodSeconds: 30
```

Effects: fast reaction once load drops, predictable replica collapse, low flapping risk.
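For context, this scaleDown block lives under spec.behavior in an autoscaling/v2 HorizontalPodAutoscaler. A minimal sketch (the name web and the replica bounds are placeholders, not recommendations):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web              # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # placeholder name
  minReplicas: 2         # placeholder bounds
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
      selectPolicy: Max
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
```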
Topology spread: critical cost lever
Topology spread must never prevent pod consolidation during scale-down.
Strict constraints block scheduler flexibility and freeze cluster size.
Anti-pattern (breaks cost optimization)

```yaml
maxSkew: 1
whenUnsatisfiable: DoNotSchedule
```

Pods cannot collapse onto fewer nodes. Nodes never drain. Reserved CPU/memory never decreases.
Recommended default (cost-safe)

```yaml
topologySpreadConstraints:
  - topologyKey: kubernetes.io/hostname
    maxSkew: 2
    whenUnsatisfiable: ScheduleAnyway
```

Strong preference for spreading while allowing bin-packing during scale-down and enabling node removal.
Strict isolation (AZ-level only)

When hard guarantees are required:

```yaml
topologySpreadConstraints:
  - topologyKey: topology.kubernetes.io/zone
    maxSkew: 1
    whenUnsatisfiable: DoNotSchedule
```

Do not combine this with strict hostname-level spread.
Anti-affinity as a soft alternative

To avoid hot nodes without blocking scale-down:

```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: your-app
```

Anti-affinity is advisory and cost-safe.
Resource requests tuning

- Over-requesting CPU = slower scale-down
- Over-requesting memory = unexpected scale-ups

Practical defaults:
- targetCPUUtilizationPercentage: 70
- targetMemoryUtilizationPercentage: 75–80

Adjust one knob at a time.
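In the autoscaling/v2 API these percentage targets are expressed as resource metrics rather than the legacy target*UtilizationPercentage fields. A minimal sketch mirroring the defaults above:

```yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
```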
Validation loop

Run weekly (or after changes):
- Check HPA current/target values
- Compare CPU used % vs CPU requested %
- Observe replica collapse after load drops
- Verify nodes drain and disappear
- Re-check latency, errors, OOMs
Quick validation commands

```bash
kubectl -n <namespace> get hpa <deployment>
kubectl -n <namespace> describe hpa <deployment>
kubectl -n <namespace> top pod --containers
kubectl top node
kubectl -n <namespace> get pods -o wide | sort -k7
```

Utility scripts
Both scripts require Datadog credentials:

```bash
export DD_API_KEY=...
export DD_APP_KEY=...
export DD_SITE=datadoghq.com  # optional, defaults to datadoghq.com
```

audit-metrics.mjs — Cost-savings discovery

Scan a cluster over a wide window (default 24 h) to find over-reservation and waste.
Cluster-wide audit

```bash
node scripts/audit-metrics.mjs --cluster <cluster>
```

With deployment deep-dive

```bash
node scripts/audit-metrics.mjs \
  --cluster <cluster> \
  --namespace <namespace> \
  --deployment <deployment>
```
Reports:
- **Cluster**: CPU/memory used %, requested %, and **waste %** (requested minus used)
- **Deployment** (when provided): CPU/memory usage vs requests, HPA replica range
- **Savings opportunities**: actionable recommendations based on thresholds

incident-metrics.mjs — Post-incident analysis

Collect metrics for a narrow incident window and get a tuning recommendation.
```bash
node scripts/incident-metrics.mjs \
  --cluster <cluster> \
  --namespace <namespace> \
  --deployment <deployment> \
  --from <ISO8601> \
  --to <ISO8601>
```

Reports:
- Cluster: CPU used % and requested % of allocatable
- Deployment: CPU/memory usage vs requests, unavailable %
- HPA: current / desired / max replicas
- Capacity planning: required allocatable cores for 80% and 70% reservation ceilings
- Tuning order: step-by-step recommendation (one knob at a time)
Interpretation notes

- Keep limits.memory unchanged unless OOMKills or near-limit memory usage are confirmed
- Use --out <path> to save full JSON for deeper analysis or diffing across runs
- Run --help on either script for all options (relative windows, custom HPA name, pretty JSON)