deco-site-scaling-tuning

Deco Site Scaling Tuning

Analyze a site's Prometheus metrics to discover the optimal autoscaling parameters. This skill helps you find the CPU/concurrency threshold where latency degrades and recommends scaling configuration accordingly.

When to Use This Skill

  • A site is overscaled (too many pods for its traffic)
  • A site oscillates between scaling up and down (panic mode loop)
  • Need to switch scaling metric (concurrency vs CPU vs RPS)
  • Need to find the right target value for a site
  • After deploying scaling changes, to verify they're working

Prerequisites

  • `kubectl` access to the target cluster
  • Prometheus accessible via port-forward (from `kube-prometheus-stack` in the monitoring namespace)
  • Python 3 for analysis scripts
  • At least 6 hours of metric history for meaningful analysis
  • For direct latency data: the queue-proxy PodMonitor must be applied (see Step 0)

Quick Start

```
0. ENABLE METRICS   → Apply queue-proxy PodMonitor if not already done
1. PORT-FORWARD     → kubectl port-forward prometheus-pod 19090:9090
2. COLLECT DATA     → Run analysis scripts against Prometheus
3. ANALYZE          → Find CPU threshold where latency degrades
4. RECOMMEND        → Choose scaling metric and target
5. APPLY            → Use deco-site-deployment skill to apply changes
6. VERIFY           → Monitor for 1-2 hours after change
```

Files in This Skill

| File | Purpose |
| --- | --- |
| SKILL.md | Overview, methodology, analysis procedures |
| analysis-scripts.md | Ready-to-use Python scripts for Prometheus queries |

Step 0: Enable Queue-Proxy Metrics (one-time)

Queue-proxy runs as a sidecar on every Knative pod and exposes request latency histograms. These are critical for precise tuning but are not scraped by default.
Apply this PodMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: knative-queue-proxy
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  podMetricsEndpoints:
    - port: http-usermetric
      path: /metrics
      interval: 15s
```
```bash
kubectl apply -f queue-proxy-podmonitor.yaml
```

Wait 2-3 hours for data to accumulate before running latency analysis.

**Metrics unlocked by this PodMonitor:**
- `revision_app_request_latencies_bucket` — request latency histogram (p50/p95/p99)
- `revision_app_request_latencies_sum` / `_count` — for avg latency
- `revision_app_request_count` — request rate by response code

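For example, a p95 latency query can be built from the histogram buckets. A sketch, assuming Knative's default metric labels (`namespace_name` and friends may differ across Knative versions):

```promql
histogram_quantile(0.95,
  sum(rate(revision_app_request_latencies_bucket{namespace_name="sites-<sitename>"}[5m])) by (le)
)
```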

Step 1: Establish Prometheus Connection

```bash
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n monitoring $PROM_POD 19090:9090 &
```

Verify

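A quick check, assuming the port-forward above is running against a stock Prometheus:

```shell
# Health endpoint; prints "Prometheus Server is Healthy." when ready
curl -s http://localhost:19090/-/healthy

# The query API should answer with status "success"
curl -s 'http://localhost:19090/api/v1/query?query=up' | jq -r .status
```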

Step 2: Collect Current State

Before analyzing, understand what the site is currently configured for.

2a. Read current autoscaler config

```bash
SITENAME="<sitename>"
NS="sites-${SITENAME}"
```

Current revision annotations

```bash
kubectl get rev -n $NS -o json |
  jq '.items[]
      | select(any(.status.conditions[]?; .type == "Active" and .status == "True"))
      | {name: .metadata.name,
         annotations: (.metadata.annotations | with_entries(select(.key | startswith("autoscaling"))))}'
```

Global autoscaler defaults

```bash
kubectl get cm config-autoscaler -n knative-serving -o json | jq '.data | del(._example)'
```

2b. Current pod count and resources

```bash
kubectl get pods -n $NS --no-headers | wc -l
kubectl top pods -n $NS --no-headers | head -20
```
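When feeding `kubectl top` output into the analysis, CPU values come as strings like `412m` or `1`. A small helper to normalize them to millicores (a sketch; not part of `analysis-scripts.md`):

```python
def parse_cpu_millicores(value: str) -> int:
    """Convert a kubectl CPU string ('412m', '1', '0.5') to millicores."""
    value = value.strip()
    if value.endswith("m"):
        return int(value[:-1])           # already in millicores
    return int(float(value) * 1000)      # whole/fractional cores -> millicores

print(parse_cpu_millicores("412m"))  # 412
print(parse_cpu_millicores("1"))     # 1000
```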

Step 3: Run Analysis

Use the scripts in `analysis-scripts.md`. The analysis follows this methodology:

Methodology: Finding the Optimal CPU Target

Goal: Find the CPU level at which latency starts to degrade. This is your scaling target — keep pods below this CPU to maintain good latency.
Approach:
  1. Collect CPU per pod, concurrency per pod, pod count, and (if available) request latency over 6-12 hours
  2. Bucket data by CPU range (0-200m, 200-300m, ..., 700m+)
  3. For each bucket, compute avg/p95 concurrency per pod
  4. Compute the "latency inflation factor" — how much concurrency increases beyond what the pod count reduction explains:
    excess = (avg_conc_above_threshold / avg_conc_below_threshold) / (avg_pods_below / avg_pods_above)
    • excess = 1.0 → concurrency increase fully explained by fewer pods (no latency degradation)
    • excess > 1.0 → latency is inflating concurrency (pods are slowing down)
    • The CPU level where excess crosses ~1.5x is your inflection point
  5. If queue-proxy latency is available, directly plot avg latency vs CPU — the hockey stick inflection is your target
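The excess calculation in step 4 can be sketched in Python, using illustrative numbers (not real site data):

```python
# Illustrative aggregates from the two CPU regimes (assumed values)
avg_conc_below = 6.0    # avg concurrency/pod while CPU is below the threshold
avg_conc_above = 15.0   # avg concurrency/pod while CPU is above the threshold
avg_pods_below = 10.0   # avg pod count while CPU is below the threshold
avg_pods_above = 6.0    # avg pod count while CPU is above the threshold

conc_ratio = avg_conc_above / avg_conc_below  # how much concurrency rose
pod_ratio = avg_pods_below / avg_pods_above   # how much the pod count fell
excess = conc_ratio / pod_ratio

# excess ~1.5 here: right at the inflection threshold from step 4
print(f"excess = {excess:.2f}")
```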

What to Look For

CPU vs Concurrency/pod:

```
  Low CPU    (0-200m)   →  Low conc/pod    →  Pods are idle (overprovisioned)
  Medium CPU (200-400m) →  Moderate conc   →  Healthy range
  ★ INFLECTION ★        →  Conc jumps      →  Latency starting to degrade
  High CPU   (500m+)    →  High conc/pod   →  Pods overloaded, latency bad
```

The inflection point is where you want your scaling target.

Decision Matrix

IMPORTANT: CPU target is in millicores (not percentage). E.g., `target: 400` means scale when CPU reaches 400m.
| Inflection CPU | Recommended metric | Target | Notes |
| --- | --- | --- | --- |
| < CPU request | CPU scaling | target = inflection value in millicores | Standard case |
| ~ CPU request | CPU scaling | target = CPU_request × 0.8 | Conservative |
| > CPU request (no limit) | CPU scaling | target = CPU_request × 0.8, increase CPU request | Need more CPU headroom |
| No clear inflection | Concurrency scaling | Keep current but tune target | CPU isn't the bottleneck |
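For the "conservative" rows, the arithmetic is just 80% of the pod's CPU request. For example, assuming a 500m request:

```python
cpu_request_m = 500                  # pod CPU request in millicores (assumed)
target_m = int(cpu_request_m * 0.8)  # conservative target from the matrix
print(target_m)  # 400
```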

Common Patterns

Pattern: CPU-bound app (Deno SSR)
  • Baseline CPU: 200-300m (Deno runtime + V8 JIT)
  • Inflection: 400-500m
  • Recommendation: CPU scaling with target = inflection (e.g., 400 millicores)
Pattern: IO-bound app (mostly external API calls)
  • CPU stays low even under high concurrency
  • Inflection not visible in CPU
  • Recommendation: Keep concurrency scaling, tune the target
Pattern: Oscillating (panic loop)
  • Symptoms: pods cycle between min and max
  • Cause: concurrency scaling + low target + `scale-down-delay` ratchet
  • Fix: Switch to CPU scaling (breaks the latency→concurrency feedback loop)

Step 4: Apply Changes

Use the `deco-site-deployment` skill to:
  1. Update the `state` secret with the new scaling config
  2. Redeploy on both clouds
Example for CPU-based scaling (target is in millicores):
```bash
NEW_STATE=$(echo "$STATE" | jq '
  .scaling.metric = {
    "type": "cpu",
    "target": 400
  }
')
```

Step 5: Verify After Change

Monitor for 1-2 hours after applying changes:

Watch pod count stabilize

```bash
watch -n 10 "kubectl get pods -n sites-<sitename> --no-headers | wc -l"
```

Check if panic mode triggers (should be N/A for HPA/CPU)

HPA doesn't have panic mode — this is one of the advantages

Verify HPA is active

```bash
kubectl get hpa -n sites-<sitename>
```

Check HPA status

```bash
kubectl describe hpa -n sites-<sitename>
```

Success Criteria

  • Pod count stabilizes (no more oscillation)
  • Avg CPU per pod stays below your target during normal traffic
  • CPU crosses target only during genuine traffic spikes (and scales up proportionally)
  • No panic mode events (HPA doesn't have panic mode)
  • Latency stays acceptable (check with queue-proxy metrics if available)

Rollback

If the new scaling is worse, revert by changing the state secret back to concurrency scaling:
```bash
NEW_STATE=$(echo "$STATE" | jq '
  .scaling.metric = {
    "type": "concurrency",
    "target": 15,
    "targetUtilizationPercentage": 70
  }
')
```

Related Skills

  • deco-site-deployment — Apply scaling changes and redeploy
  • deco-site-memory-debugging — Debug memory issues on running pods
  • deco-incident-debugging — Incident response and triage