deco-site-scaling-tuning

Deco Site Scaling Tuning

Analyze a site's Prometheus metrics to discover the optimal autoscaling parameters. This skill helps you find the CPU/concurrency threshold where latency degrades and recommends scaling configuration accordingly.

When to Use This Skill

  • A site is overscaled (too many pods for its traffic)
  • A site oscillates between scaling up and down (panic mode loop)
  • Need to switch scaling metric (concurrency vs CPU vs RPS)
  • Need to find the right target value for a site
  • After deploying scaling changes, to verify they're working

Prerequisites

  • `kubectl` access to the target cluster
  • Prometheus accessible via port-forward (from `kube-prometheus-stack` in the monitoring namespace)
  • Python 3 for analysis scripts
  • At least 6 hours of metric history for meaningful analysis
  • For direct latency data: the queue-proxy PodMonitor must be applied (see Step 0)

Quick Start

```
0. ENABLE METRICS   → Apply queue-proxy PodMonitor if not already done
1. PORT-FORWARD     → kubectl port-forward prometheus-pod 19090:9090
2. COLLECT DATA     → Run analysis scripts against Prometheus
3. ANALYZE          → Find CPU threshold where latency degrades
4. RECOMMEND        → Choose scaling metric and target
5. APPLY            → Use deco-site-deployment skill to apply changes
6. VERIFY           → Monitor for 1-2 hours after change
```

Files in This Skill

| File | Purpose |
| --- | --- |
| SKILL.md | Overview, methodology, analysis procedures |
| analysis-scripts.md | Ready-to-use Python scripts for Prometheus queries |

Step 0: Enable Queue-Proxy Metrics (one-time)

Queue-proxy runs as a sidecar on every Knative pod and exposes request latency histograms. These are critical for precise tuning but are not scraped by default.
Apply this PodMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: knative-queue-proxy
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  podMetricsEndpoints:
    - port: http-usermetric
      path: /metrics
      interval: 15s
```
```bash
kubectl apply -f queue-proxy-podmonitor.yaml
```

Wait 2-3 hours for data to accumulate before running latency analysis.

**Metrics unlocked by this PodMonitor:**
- `revision_app_request_latencies_bucket` — request latency histogram (p50/p95/p99)
- `revision_app_request_latencies_sum` / `_count` — for avg latency
- `revision_app_request_count` — request rate by response code

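For example, a p95 latency query can be built from the histogram buckets. A sketch, assuming Knative's default metric labels (`namespace_name` and friends may differ across Knative versions):

```promql
histogram_quantile(0.95,
  sum(rate(revision_app_request_latencies_bucket{namespace_name="sites-<sitename>"}[5m])) by (le)
)
```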

Step 1: Establish Prometheus Connection

```bash
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n monitoring $PROM_POD 19090:9090 &
```

Verify

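A quick check, assuming the port-forward above is running against a stock Prometheus:

```shell
# Health endpoint; prints "Prometheus Server is Healthy." when ready
curl -s http://localhost:19090/-/healthy

# The query API should answer with status "success"
curl -s 'http://localhost:19090/api/v1/query?query=up' | jq -r .status
```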

Step 2: Collect Current State

Before analyzing, understand what the site is currently configured for.

2a. Read current autoscaler config

```bash
SITENAME="<sitename>"
NS="sites-${SITENAME}"
```

Current revision annotations

```bash
kubectl get rev -n $NS -o json |
  jq '.items[]
      | select(any(.status.conditions[]?; .type == "Active" and .status == "True"))
      | {name: .metadata.name,
         annotations: (.metadata.annotations | with_entries(select(.key | startswith("autoscaling"))))}'
```

Global autoscaler defaults

```bash
kubectl get cm config-autoscaler -n knative-serving -o json | jq '.data | del(._example)'
```

2b. Current pod count and resources

```bash
kubectl get pods -n $NS --no-headers | wc -l
kubectl top pods -n $NS --no-headers | head -20
```
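When feeding `kubectl top` output into the analysis, CPU values come as strings like `412m` or `1`. A small helper to normalize them to millicores (a sketch; not part of `analysis-scripts.md`):

```python
def parse_cpu_millicores(value: str) -> int:
    """Convert a kubectl CPU string ('412m', '1', '0.5') to millicores."""
    value = value.strip()
    if value.endswith("m"):
        return int(value[:-1])           # already in millicores
    return int(float(value) * 1000)      # whole/fractional cores -> millicores

print(parse_cpu_millicores("412m"))  # 412
print(parse_cpu_millicores("1"))     # 1000
```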

Step 3: Run Analysis

Use the scripts in `analysis-scripts.md`. The analysis follows this methodology:

Methodology: Finding the Optimal CPU Target

Goal: Find the CPU level at which latency starts to degrade. This is your scaling target — keep pods below this CPU to maintain good latency.
Approach:
  1. Collect CPU per pod, concurrency per pod, pod count, and (if available) request latency over 6-12 hours
  2. Bucket data by CPU range (0-200m, 200-300m, ..., 700m+)
  3. For each bucket, compute avg/p95 concurrency per pod
  4. Compute the "latency inflation factor" — how much concurrency increases beyond what the pod count reduction explains:
    excess = (avg_conc_above_threshold / avg_conc_below_threshold) / (avg_pods_below / avg_pods_above)
    • excess = 1.0 → concurrency increase fully explained by fewer pods (no latency degradation)
    • excess > 1.0 → latency is inflating concurrency (pods are slowing down)
    • The CPU level where excess crosses ~1.5x is your inflection point
  5. If queue-proxy latency is available, directly plot avg latency vs CPU — the hockey stick inflection is your target
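The excess calculation in step 4 can be sketched in Python, using illustrative numbers (not real site data):

```python
# Illustrative aggregates from the two CPU regimes (assumed values)
avg_conc_below = 6.0    # avg concurrency/pod while CPU is below the threshold
avg_conc_above = 15.0   # avg concurrency/pod while CPU is above the threshold
avg_pods_below = 10.0   # avg pod count while CPU is below the threshold
avg_pods_above = 6.0    # avg pod count while CPU is above the threshold

conc_ratio = avg_conc_above / avg_conc_below  # how much concurrency rose
pod_ratio = avg_pods_below / avg_pods_above   # how much the pod count fell
excess = conc_ratio / pod_ratio

# excess ~1.5 here: right at the inflection threshold from step 4
print(f"excess = {excess:.2f}")
```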

What to Look For

CPU vs Concurrency/pod:

```
  Low CPU    (0-200m)   →  Low conc/pod    →  Pods are idle (overprovisioned)
  Medium CPU (200-400m) →  Moderate conc   →  Healthy range
  ★ INFLECTION ★        →  Conc jumps      →  Latency starting to degrade
  High CPU   (500m+)    →  High conc/pod   →  Pods overloaded, latency bad
```

The inflection point is where you want your scaling target.

Decision Matrix

IMPORTANT: CPU target is in millicores (not percentage). E.g., `target: 400` means scale when CPU reaches 400m.
| Inflection CPU | Recommended metric | Target | Notes |
| --- | --- | --- | --- |
| < CPU request | CPU scaling | target = inflection value in millicores | Standard case |
| ~ CPU request | CPU scaling | target = CPU_request × 0.8 | Conservative |
| > CPU request (no limit) | CPU scaling | target = CPU_request × 0.8, increase CPU request | Need more CPU headroom |
| No clear inflection | Concurrency scaling | Keep current but tune target | CPU isn't the bottleneck |
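For the "conservative" rows, the arithmetic is just 80% of the pod's CPU request. For example, assuming a 500m request:

```python
cpu_request_m = 500                  # pod CPU request in millicores (assumed)
target_m = int(cpu_request_m * 0.8)  # conservative target from the matrix
print(target_m)  # 400
```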

Common Patterns

Pattern: CPU-bound app (Deno SSR)
  • Baseline CPU: 200-300m (Deno runtime + V8 JIT)
  • Inflection: 400-500m
  • Recommendation: CPU scaling with target = inflection (e.g., 400 millicores)
Pattern: IO-bound app (mostly external API calls)
  • CPU stays low even under high concurrency
  • Inflection not visible in CPU
  • Recommendation: Keep concurrency scaling, tune the target
Pattern: Oscillating (panic loop)
  • Symptoms: pods cycle between min and max
  • Cause: concurrency scaling + low target + `scale-down-delay` ratchet
  • Fix: Switch to CPU scaling (breaks the latency→concurrency feedback loop)

Step 4: Apply Changes

Use the `deco-site-deployment` skill to:
  1. Update the `state` secret with the new scaling config
  2. Redeploy on both clouds
Example for CPU-based scaling (target is in millicores):
```bash
NEW_STATE=$(echo "$STATE" | jq '
  .scaling.metric = {
    "type": "cpu",
    "target": 400
  }
')
```

Step 5: Verify After Change

Monitor for 1-2 hours after applying changes:

Watch pod count stabilize

```bash
watch -n 10 "kubectl get pods -n sites-<sitename> --no-headers | wc -l"
```

Check if panic mode triggers (should be N/A for HPA/CPU)

HPA doesn't have panic mode — this is one of the advantages

Verify HPA is active

```bash
kubectl get hpa -n sites-<sitename>
```

Check HPA status

```bash
kubectl describe hpa -n sites-<sitename>
```

Success Criteria

  • Pod count stabilizes (no more oscillation)
  • Avg CPU per pod stays below your target during normal traffic
  • CPU crosses target only during genuine traffic spikes (and scales up proportionally)
  • No panic mode events (HPA doesn't have panic mode)
  • Latency stays acceptable (check with queue-proxy metrics if available)

Rollback

If the new scaling is worse, revert by changing the state secret back to concurrency scaling:
```bash
NEW_STATE=$(echo "$STATE" | jq '
  .scaling.metric = {
    "type": "concurrency",
    "target": 15,
    "targetUtilizationPercentage": 70
  }
')
```

Related Skills

  • deco-site-deployment — Apply scaling changes and redeploy
  • deco-site-memory-debugging — Debug memory issues on running pods
  • deco-incident-debugging — Incident response and triage