k8s-hpa-cost-tuning


Kubernetes HPA Cost & Scale-Down Tuning

Mode selection (mandatory)

Declare a mode before executing this skill. All reasoning, thresholds, and recommendations depend on this choice.

```text
mode = audit | incident
```

If no mode is provided, refuse to run and request clarification.

When to use

`mode = audit` — Periodic cost-savings audit

Run on a schedule (weekly or bi-weekly) to:
  • Detect over-reservation early
  • Validate that scale-down and node consolidation still work
  • Identify safe opportunities to reduce cluster cost
This mode assumes no active incident and prioritizes stability-preserving recommendations.

`mode = incident` — Post-incident scaling analysis

Run after a production incident or anomaly, attaching:
  • Production logs
  • HPA events
  • Scaling timelines
This mode focuses on:
  • Explaining why scaling behaved the way it did
  • Distinguishing traffic-driven vs configuration-driven incidents
  • Preventing recurrence without overcorrecting
This skill assumes Datadog for observability and standard Kubernetes HPA + Cluster Autoscaler.

Core mental model

Kubernetes scaling is a three-layer system:
  1. HPA decides how many pods (based on usage / requests)
  2. Scheduler decides where pods go (based on requests + constraints)
  3. Cluster Autoscaler decides how many nodes exist (only when nodes can empty)
Cost optimization only works if all three layers can move downward.
Key takeaway: HPA decides quantity, scheduler decides placement, autoscaler decides cost. Scale-up can be aggressive; scale-down must be possible. If replicas drop but nodes do not, the scheduler is the bottleneck.
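
The "how many pods" decision in layer 1 follows the standard formula from the Kubernetes HPA documentation, which makes it easy to sanity-check replica counts by hand:

```text
desiredReplicas = ceil( currentReplicas * currentMetricValue / targetMetricValue )

Example: 10 replicas running at 90% CPU utilization with a 70% target:
ceil(10 * 90 / 70) = ceil(12.86) = 13 replicas
```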

Key Datadog metrics

The utility scripts query three metric families:
  • CPU used % — real utilization (`kubernetes.cpu.usage.total` / `node.cpu_allocatable`)
  • CPU requested % — reserved on paper (`kubernetes.cpu.requests` / `node.cpu_allocatable`)
  • Memory used vs requests — HPA-relevant ratio
CPU requested % must go down after scale-down for cost savings to be real. If memory usage stays above target, memory drives scale-up even when CPU is idle.
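
As a sketch, the first two ratios can be written as Datadog metric-query arithmetic. The tag name `kube_cluster_name` is an assumption and depends on your agent/tagging setup; this is not verified against the scripts themselves:

```text
# CPU used % of allocatable (sketch; usage.total is reported in
# nanocores, so a unit-conversion factor may be needed)
sum:kubernetes.cpu.usage.total{kube_cluster_name:<cluster>} /
  sum:node.cpu_allocatable{kube_cluster_name:<cluster>} * 100

# CPU requested % of allocatable
sum:kubernetes.cpu.requests{kube_cluster_name:<cluster>} /
  sum:node.cpu_allocatable{kube_cluster_name:<cluster>} * 100
```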

Scale-down as a first-class cost control

When scale-down is slow or blocked:
  • Replicas plateau
  • Pods remain evenly spread
  • Nodes never empty
  • Cluster Autoscaler cannot remove nodes
Result: permanent over-reservation.

Recommended HPA scale-down policy

```yaml
scaleDown:
  stabilizationWindowSeconds: 60
  selectPolicy: Max
  policies:
    - type: Percent
      value: 50
      periodSeconds: 30
```

Effects: fast reaction once load drops, predictable replica collapse, low flapping risk.
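
For context, this scaleDown block lives under `spec.behavior` in an `autoscaling/v2` HorizontalPodAutoscaler. A minimal sketch — the name `your-app` and the replica bounds are placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  minReplicas: 2        # placeholder bounds
  maxReplicas: 20
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
      selectPolicy: Max
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
```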

Topology spread: critical cost lever

Topology spread must never prevent pod consolidation during scale-down.
Strict constraints block scheduler flexibility and freeze cluster size.

Anti-pattern (breaks cost optimization)

```yaml
maxSkew: 1
whenUnsatisfiable: DoNotSchedule
```

Pods cannot collapse onto fewer nodes. Nodes never drain. Reserved CPU/memory never decreases.

Recommended default (cost-safe)

```yaml
topologySpreadConstraints:
- topologyKey: kubernetes.io/hostname
  maxSkew: 2
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:        # required: skew is computed over matching pods only
    matchLabels:
      app: your-app
```

Strong preference for spreading while allowing bin-packing during scale-down and enabling node removal.

Strict isolation (AZ-level only)

When hard guarantees are required:

```yaml
topologySpreadConstraints:
- topologyKey: topology.kubernetes.io/zone
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:        # required: skew is computed over matching pods only
    matchLabels:
      app: your-app
```

Do not combine this with strict hostname-level spread.

Anti-affinity as a soft alternative

To avoid hot nodes without blocking scale-down:

```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      topologyKey: kubernetes.io/hostname
      labelSelector:
        matchLabels:
          app: your-app
```

Anti-affinity is advisory and cost-safe.

Resource requests tuning

  • Over-requesting CPU = slower scale-down
  • Over-requesting memory = unexpected scale-ups
Practical defaults:
  • targetCPUUtilizationPercentage: 70
  • targetMemoryUtilizationPercentage: 75–80
Adjust one knob at a time.
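
Note that `targetCPUUtilizationPercentage` is the legacy `autoscaling/v1` spelling; under `autoscaling/v2` the same defaults are expressed as resource metrics. A sketch:

```yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # targetCPUUtilizationPercentage: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75    # lower end of the 75-80 range
```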

Validation loop

Run weekly (or after changes):
  1. Check HPA `current/target` values
  2. Compare CPU used % vs CPU requested %
  3. Observe replica collapse after load drops
  4. Verify nodes drain and disappear
  5. Re-check latency, errors, OOMs

Quick validation commands

```bash
kubectl -n <namespace> get hpa <deployment>
kubectl -n <namespace> describe hpa <deployment>
kubectl -n <namespace> top pod --containers
kubectl top node
kubectl -n <namespace> get pods -o wide | sort -k7   # column 7 = NODE: shows pod packing per node
```

Utility scripts

Both scripts require Datadog credentials:

```bash
export DD_API_KEY=...
export DD_APP_KEY=...
export DD_SITE=datadoghq.com   # optional, defaults to datadoghq.com
```

`audit-metrics.mjs` — Cost-savings discovery

Scan a cluster over a wide window (default 24 h) to find over-reservation and waste.

Cluster-wide audit

```bash
node scripts/audit-metrics.mjs --cluster <cluster>
```

With deployment deep-dive

```bash
node scripts/audit-metrics.mjs \
  --cluster <cluster> \
  --namespace <namespace> \
  --deployment <deployment>
```

Reports:

- **Cluster**: CPU/memory used %, requested %, and **waste %** (requested minus used)
- **Deployment** (when provided): CPU/memory usage vs requests, HPA replica range
- **Savings opportunities**: actionable recommendations based on thresholds

`incident-metrics.mjs` — Post-incident analysis

Collect metrics for a narrow incident window and get a tuning recommendation.

```bash
node scripts/incident-metrics.mjs \
  --cluster <cluster> \
  --namespace <namespace> \
  --deployment <deployment> \
  --from <ISO8601> \
  --to <ISO8601>
```
Reports:
  • Cluster: CPU used % and requested % of allocatable
  • Deployment: CPU/memory usage vs requests, unavailable %
  • HPA: current / desired / max replicas
  • Capacity planning: required allocatable cores for 80 % and 70 % reservation ceilings
  • Tuning order: step-by-step recommendation (one knob at a time)

Interpretation notes

  • Keep `limits.memory` unchanged unless OOMKills or near-limit memory usage are confirmed
  • Use `--out <path>` to save full JSON for deeper analysis or diffing across runs
  • Run `--help` on either script for all options (relative windows, custom HPA name, pretty JSON)