SLO Architect

Define SLOs that mean something. Most "SLOs" in the wild are arbitrary numbers no one believes — 99.9% on every endpoint, no SLI definition, no error budget, no policy for what happens when budget burns. This skill enforces the discipline from Google's SRE Workbook: pick the right SLI, set a target users actually care about, calculate the error budget, wire multi-window burn-rate alerts, and have a written policy for when budget runs out.

When to use

Defining a new SLO for a service or feature
Reviewing existing SLOs for common bugs
Picking the right SLI (event-based vs time-window based vs request-based)
Computing error budgets and burn-rate alert thresholds
Tying SLOs to existing controls: feature-flag abort criteria, chaos blast radius, operator capability levels

When NOT to use

General observability strategy (metrics + logs + traces) → use `observability-designer`
Customer-facing SLAs with legal teeth → that's contract drafting, not engineering
Performance load testing (capacity, not reliability) → use `performance-profiler`
Active incident response → use `incident-response`

Core principle: an SLO is a promise about user experience

SLI  ⟶  measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO  ⟶  target for the SLI over a window (e.g., 99.9% over 30 days)
SLA  ⟶  customer-facing commitment with consequences (separate concern)
EB   ⟶  error budget: 100% − SLO target = how much "bad" you can spend
BR   ⟶  burn rate: how fast you're consuming the error budget
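
As a minimal sketch of how these quantities relate, the Python below derives the error budget and burn rate from raw request counts. The names `SloStatus` and `slo_status` are illustrative assumptions and are not part of the skill's scripts.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    sli: float           # observed good / total ratio
    error_budget: float  # allowed bad ratio, i.e. 1 - SLO target
    burn_rate: float     # observed bad ratio divided by allowed bad ratio


def slo_status(good: int, total: int, target_pct: float) -> SloStatus:
    """Relate SLI, SLO target, error budget (EB), and burn rate (BR)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be strictly between 0 and 100")
    sli = good / total
    error_budget = 1 - target_pct / 100
    burn_rate = (1 - sli) / error_budget
    return SloStatus(sli=sli, error_budget=error_budget, burn_rate=burn_rate)


# Hypothetical counts: 500 bad requests out of one million, against 99.9%.
status = slo_status(good=999_500, total=1_000_000, target_pct=99.9)
print(status)
```

A burn rate of 1 means the budget would run out exactly at the end of the window; in this made-up example the service is spending budget at roughly half that pace.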

The four cardinal mistakes:

Target too high (99.99%+ on services that can't support it) — every minor blip violates SLO; alerts become noise.
Wrong SLI (CPU usage as proxy for user experience) — system can be "green" while users suffer.
No error budget policy — burning budget means nothing if there's no agreed action.
Single-window burn-rate alert — either too noisy (page on a 5-min spike) or too slow (notice budget exhausted after the fact).

The 3 tools below catch each of these.

Quick start

bash

SKILL=engineering/slo-architect/skills/slo-architect

# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30

# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
  --target 99.9 --window-days 30

# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/

The 3 Python tools

All stdlib-only.

slo_designer.py

Generates a structured SLO definition with required fields. Refuses to render if any required field is missing (`exit 1`).

bash

python scripts/slo_designer.py \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30 \
  --owner team-checkout

SLI types supported:

`request-success-rate`: `(total_requests - bad_requests) / total_requests`
`request-latency`: `count(requests with latency < threshold) / total_requests`
`availability-time`: `(window - downtime) / window`
`data-freshness`: `count(data_age < threshold) / total_data_points`
`correctness`: `count(correct_outputs) / total_outputs`
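
The ratios above can be sketched in a few lines of Python. This is only an illustration of the formulas, not the code inside `slo_designer.py`; the function names and the sample latencies are assumptions.

```python
from typing import Sequence


def request_success_rate(total_requests: int, bad_requests: int) -> float:
    """(total_requests - bad_requests) / total_requests"""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - bad_requests) / total_requests


def fraction_below(values: Sequence[float], threshold: float) -> float:
    """Share of observations strictly below threshold.

    Covers request-latency (values are per-request latencies) and
    data-freshness (values are data ages).
    """
    if not values:
        raise ValueError("need at least one observation")
    return sum(1 for v in values if v < threshold) / len(values)


def availability_time(window_minutes: float, downtime_minutes: float) -> float:
    """(window - downtime) / window"""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return (window_minutes - downtime_minutes) / window_minutes


latencies_ms = [120, 250, 90, 310]  # made-up sample data
print(fraction_below(latencies_ms, threshold=300))
```

Note that the latency SLI counts individual requests under the threshold; it is not derived from a percentile such as p99.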

Output is markdown by default, with all required fields filled or marked `<must define>`. JSON output (`--format json`) is consumed by `slo_review.py`.

error_budget_calculator.py

Given a target availability and a window, computes:

Allowed downtime in the window
Multi-window burn-rate thresholds per Google SRE Workbook (Chapter 5):
- Fast burn: page if 2% of monthly budget is consumed in 1 hour
- Slow burn: page if 5% is consumed in 6 hours, ticket if 10% is consumed in 3 days
Recommended alerting rules (PromQL-shaped output)

bash

python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json
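
A rough sketch of the arithmetic behind such a calculator, assuming the SRE Workbook's recommended pairs of budget fraction and alert window (2% in 1 hour, 5% in 6 hours, 10% in 3 days). The function names are hypothetical and the real script's internals may differ.

```python
def allowed_downtime_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of full unavailability the error budget permits in the window."""
    return (1 - target_pct / 100) * window_days * 24 * 60


def burn_rate_threshold(budget_fraction: float, alert_hours: float,
                        window_days: int) -> float:
    """Burn rate at which budget_fraction of the budget is spent in alert_hours."""
    return budget_fraction * (window_days * 24) / alert_hours


# (budget fraction, alert window in hours, action); values assumed from the Workbook
RULES = [(0.02, 1, "page"), (0.05, 6, "page"), (0.10, 72, "ticket")]

print(f"allowed downtime: {allowed_downtime_minutes(99.9, 30):.1f} min")
for fraction, hours, action in RULES:
    rate = burn_rate_threshold(fraction, hours, window_days=30)
    print(f"{action}: burn rate > {rate:.1f} over {hours}h")
```

For a 30-day window these work out to burn rates of 14.4, 6, and 1 respectively.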

slo_review.py

Audits a directory of SLO definitions (markdown or JSON) for the common bugs.

bash

python scripts/slo_review.py --slo-doc docs/slos/

Checks:

`target_too_high`: target ≥ 99.99% (sustainable only with massive engineering investment)
`target_too_low`: target ≤ 99.0% (probably wrong SLI; users will notice)
`window_too_short`: window < 7 days (statistical noise dominates)
`window_too_long`: window > 90 days (slow feedback)
`no_sli_definition`: SLI section missing or vague ("everything OK")
`no_error_budget_policy`: no documented action when budget burns
`cpu_as_sli`: CPU/memory used as user-experience proxy (wrong signal)
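
To make the checks concrete, here is a hedged sketch of how they could be expressed over a parsed SLO definition. The dictionary keys (`target`, `window_days`, `sli`, `error_budget_policy`), the vague-phrase list, and the function name are assumptions for illustration; the actual `slo_review.py` input schema is not shown in this document.

```python
from typing import Any, Dict, List

VAGUE_SLI_PHRASES = {"everything ok", "tbd", "n/a"}  # assumed examples of vague text


def review_slo(slo: Dict[str, Any]) -> List[str]:
    """Return the names of the checks that a single SLO definition fails."""
    findings: List[str] = []
    target = slo.get("target")
    window_days = slo.get("window_days")
    sli_text = str(slo.get("sli") or "").strip().lower()
    policy_text = str(slo.get("error_budget_policy") or "").strip()

    if target is not None and target >= 99.99:
        findings.append("target_too_high")
    if target is not None and target <= 99.0:
        findings.append("target_too_low")
    if window_days is not None and window_days < 7:
        findings.append("window_too_short")
    if window_days is not None and window_days > 90:
        findings.append("window_too_long")
    if not sli_text or sli_text in VAGUE_SLI_PHRASES:
        findings.append("no_sli_definition")
    if not policy_text:
        findings.append("no_error_budget_policy")
    if "cpu" in sli_text or "memory" in sli_text:
        findings.append("cpu_as_sli")
    return findings


bad = {"target": 99.995, "window_days": 3, "sli": "CPU usage below 80%"}
print(review_slo(bad))
```

The keyword match for `cpu_as_sli` is deliberately crude; a real reviewer would likely need something more careful than substring search.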

SLI selection cheatsheet

| User experience | SLI type | What you measure |
| --- | --- | --- |
| "Did the request succeed?" | request-success-rate | `2xx / total` |
| "Was the response fast?" | request-latency | `count(latency < threshold) / total` |
| "Was the service up?" | availability-time | `(window - downtime) / window` |
| "Is the data current?" | data-freshness | `count(data_age < threshold) / total` |
| "Was the answer correct?" | correctness | `count(correct) / total` |

See `references/sli_design.md` for examples and anti-patterns.

Error budget math (the basics)

For a 99.9% SLO over 30 days:

Allowed unavailability:
```
0.1% × 30 × 24 × 60 = 43.2 minutes
```
1-hour fast-burn threshold (2% of monthly budget burned in 1 hour):
```
2% × (30 × 24 h) / 1 h = 14.4× burn rate
```
6-hour slow-burn threshold (5% of monthly budget burned in 6 hours):
```
5% × (30 × 24 h) / 6 h = 6× burn rate
```

`error_budget_calculator.py` does this math for you and emits ready-to-paste alert rules.
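
For reference, the burn-rate thresholds can be written as one general formula. The symbols are introduced here for illustration: \(f\) is the fraction of the error budget consumed, \(W\) is the SLO window, and \(w\) is the alert window, both in the same unit.

```latex
\[
\text{burn rate} \;=\; \frac{\text{observed error ratio}}{1 - \text{SLO target}},
\qquad
\text{burn-rate threshold} \;=\; f \cdot \frac{W}{w}
\]
\[
W = 30\ \text{days} = 720\ \text{h}:\quad
0.02 \cdot \tfrac{720}{1} = 14.4,\qquad
0.05 \cdot \tfrac{720}{6} = 6,\qquad
0.10 \cdot \tfrac{720}{72} = 1
\]
```

Multiplying a burn-rate threshold by the error budget gives the error ratio to alert on. For a 99.9% target, the fast-burn alert fires when the error ratio exceeds 14.4 × 0.1% = 1.44%.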

Composition with the rest of the portfolio

This skill explicitly composes with three others:

| Skill | Composition |
| --- | --- |
| `feature-flags-architect` | Rollout abort criteria reference SLO burn-rate thresholds |
| `chaos-engineering` | Blast-radius calculator already takes monthly error budget as input; define it here |
| `kubernetes-operator` | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |

The `error_budget_calculator.py` output is in the same shape that `chaos-engineering/scripts/blast_radius_calculator.py` expects on stdin.

Workflows

Workflow 1: Define a new SLO

1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
     target = floor(p50 of last 30 days × 100) / 100
   This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py — must pass before the SLO is "live".
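
Step 4's target rule can be sketched as follows, assuming the history is available as one SLI value per day expressed in percent. The function name and the sample values are made up for illustration.

```python
import math
import statistics
from typing import Sequence


def suggest_target(daily_sli_pct: Sequence[float]) -> float:
    """floor(p50 of the samples × 100) / 100, with SLI values in percent."""
    if not daily_sli_pct:
        raise ValueError("need at least one historical SLI value")
    p50 = statistics.median(daily_sli_pct)
    # Floor to two decimals so the target never exceeds what was observed.
    return math.floor(p50 * 100) / 100


history = [99.97, 99.91, 99.958, 99.99, 99.93]  # hypothetical daily SLI values
print(suggest_target(history))
```

With these sample values the median is 99.958, which floors to a target of 99.95.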

Workflow 2: Quarterly SLO review

1. For every active SLO, run slo_review.py — fix any FAIL findings.
2. Look at last quarter's data:
   - Was the SLO too easy (never burned budget)? Tighten target.
   - Was it too hard (frequently burned)? Loosen target OR fix the system.
   - Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.
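
One possible way to turn step 2 into a repeatable check is sketched below. The input is the fraction of error budget consumed in each SLO window of the quarter (1.0 means fully spent). The cutoffs for "never burned" (under 10% spent in every window) and "frequently burned" (budget exhausted in more than half the windows) are assumptions chosen for this example and are not prescribed by the skill.

```python
from typing import Sequence

EASY_CUTOFF = 0.10  # assumed: below this in every window counts as "never burned"


def review_recommendation(budget_spent: Sequence[float]) -> str:
    """Classify a quarter of per-window budget consumption."""
    if not budget_spent:
        raise ValueError("need at least one window of data")
    exhausted = sum(1 for spent in budget_spent if spent >= 1.0)
    if exhausted > len(budget_spent) / 2:
        return "loosen target or fix the system"
    if max(budget_spent) < EASY_CUTOFF:
        return "tighten target"
    return "keep target"


print(review_recommendation([0.02, 0.05, 0.01]))
```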

Workflow 3: SLO-driven rollback

1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.
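
The rollback trigger in steps 2 and 3 can be sketched as a multi-window check: the alert condition holds only when both a long and a short window exceed the burn-rate threshold, so a spike that has already ended does not trigger a rollback. The 1-hour and 5-minute pairing with a 14.4 burn rate follows the SRE Workbook's fast-burn example; `disable_flag` stands in for whatever kill switch the feature-flag system provides and is purely hypothetical.

```python
from typing import Callable


def burning_too_fast(long_error_ratio: float, short_error_ratio: float,
                     target_pct: float, burn_rate: float) -> bool:
    """True when both windows exceed burn_rate times the error budget."""
    limit = burn_rate * (1 - target_pct / 100)
    return long_error_ratio > limit and short_error_ratio > limit


def maybe_rollback(long_error_ratio: float, short_error_ratio: float,
                   target_pct: float, burn_rate: float,
                   disable_flag: Callable[[], None]) -> bool:
    """Flip the kill switch if the multi-window condition holds."""
    if burning_too_fast(long_error_ratio, short_error_ratio, target_pct, burn_rate):
        disable_flag()
        return True
    return False


actions = []  # records kill-switch invocations in this demo
rolled_back = maybe_rollback(
    long_error_ratio=0.02,   # hypothetical 1-hour error ratio
    short_error_ratio=0.03,  # hypothetical 5-minute error ratio
    target_pct=99.9,
    burn_rate=14.4,
    disable_flag=lambda: actions.append("flag disabled"),
)
print(rolled_back, actions)
```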

References

`references/slo_principles.md`: SLI vs SLO vs SLA, Google SRE Workbook canon
`references/sli_design.md`: picking the right SLI; 5 types with examples
`references/error_budget.md`: error budget math, burn-rate alerts, budget policy
`references/composition.md`: how SLOs feed feature flags, chaos, operators

Slash command

`/slo-design`: interactive SLO design wizard that runs all 3 tools.

Asset templates

`assets/slo_template.yaml`: fillable SLO YAML
`assets/error_budget_policy.md`: fillable policy template

Anti-patterns

99.99% on every endpoint: copy-pasted SLOs that nobody verified the system can sustain
CPU usage as SLI: system metrics aren't user experience
Single-window burn-rate alert: too noisy if 5-min, too slow if 30-day
No error budget policy: burning budget means nothing without an agreed action
SLOs without owners: no one is responsible, so they bit-rot
SLOs reviewed once a year: system characteristics change faster than that
SLAs in the SLO doc: different audience, different stakes; keep them separate
SLO target = SLA target: the SLO must be tighter than the SLA, so internal alerts fire before the contract is breached

Verifiable success

A team using this skill should achieve:

100% of SLOs pass `slo_review.py` with 0 FAIL findings
Every SLO has a documented owner, error budget, burn-rate alerts, and policy
Burn-rate alerts fire at most 2 times/month per SLO (signal, not noise)
Mean time to detect SLO violation: <30 min (multi-window burn-rate alerts working)
Quarterly SLO review happens every quarter (not annually)
