slo-architect
SLO Architect
Define SLOs that mean something. Most "SLOs" in the wild are arbitrary numbers no one believes — 99.9% on every endpoint, no SLI definition, no error budget, no policy for what happens when budget burns. This skill enforces the discipline from Google's SRE Workbook: pick the right SLI, set a target users actually care about, calculate the error budget, wire multi-window burn-rate alerts, and have a written policy for when budget runs out.
When to use
- Defining a new SLO for a service or feature
- Reviewing existing SLOs for common bugs
- Picking the right SLI (event-based vs time-window based vs request-based)
- Computing error budgets and burn-rate alert thresholds
- Tying SLOs to existing controls — feature flags abort, chaos blast radius, operator capability levels
When NOT to use
- General observability strategy (metrics + logs + traces) → use `observability-designer`
- Customer-facing SLAs with legal teeth → that's contract drafting, not engineering
- Performance load testing (capacity, not reliability) → use `performance-profiler`
- Active incident response → use `incident-response`
Core principle: an SLO is a promise about user experience
SLI ⟶ measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO ⟶ target for the SLI over a window (e.g., 99.9% over 30 days)
SLA ⟶ customer-facing commitment with consequences (separate concern)
EB ⟶ error budget: 100% − SLO target = how much "bad" you can spend
BR ⟶ burn rate: how fast you're consuming the error budget

The four cardinal mistakes:
- Target too high (99.99%+ on services that can't support it) — every minor blip violates SLO; alerts become noise.
- Wrong SLI (CPU usage as proxy for user experience) — system can be "green" while users suffer.
- No error budget policy — burning budget means nothing if there's no agreed action.
- Single-window burn-rate alert — either too noisy (page on a 5-min spike) or too slow (notice budget exhausted after the fact).
The 3 tools below catch each of these.
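The EB and BR definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: the function names and the traffic counts are assumptions made up for the example.

```python
def error_budget(slo_target: float) -> float:
    """Allowed fraction of bad events: 1 - SLO target (0.999 -> 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    1.0 means the budget lasts exactly one SLO window;
    anything above 1.0 exhausts it early.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (bad_events / total_events) / error_budget(slo_target)


# Hypothetical hour of traffic: 300 failures out of 100,000 requests, 99.9% SLO.
rate = burn_rate(bad_events=300, total_events=100_000, slo_target=0.999)
print(f"budget={error_budget(0.999):.4f}  burn rate={rate:.1f}x")
```

A burn rate of 3x sustained for the whole window would spend the budget in a third of the window, which is why the alerts below key on burn rate and not on raw error counts.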
Quick start
```bash
SKILL=engineering/slo-architect/skills/slo-architect

# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30

# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
  --target 99.9 --window-days 30

# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/
```
The 3 Python tools
All stdlib-only.
slo_designer.py
Generates a structured SLO definition with required fields. Refuses to render if any required field is missing (`exit 1`).

```bash
python scripts/slo_designer.py \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30 \
  --owner team-checkout
```

SLI types supported:
- `request-success-rate`: `(total_requests - bad_requests) / total_requests`
- `request-latency`: `count(requests < threshold) / total_requests`
- `availability-time`: `(window - downtime) / window`
- `data-freshness`: `count(data_age < threshold) / total_data_points`
- `correctness`: `count(correct_outputs) / total_outputs`

Output is markdown by default with all required fields filled or marked `<must define>`. JSON output (`--format json`) is consumed by `slo_review.py`.
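The five SLI formulas listed above translate directly into code. The sketch below is illustrative only: the function signatures, units, and sample values are assumptions and do not describe how `slo_designer.py` is implemented.

```python
from typing import Sequence


def _ratio(good: float, total: float) -> float:
    """Shared good/total helper; every SLI here is a fraction in [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def request_success_rate(total_requests: int, bad_requests: int) -> float:
    return _ratio(total_requests - bad_requests, total_requests)


def request_latency(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return _ratio(fast, len(latencies_ms))


def availability_time(window_minutes: float, downtime_minutes: float) -> float:
    return _ratio(window_minutes - downtime_minutes, window_minutes)


def data_freshness(data_ages_s: Sequence[float], threshold_s: float) -> float:
    fresh = sum(1 for age in data_ages_s if age < threshold_s)
    return _ratio(fresh, len(data_ages_s))


def correctness(correct_outputs: int, total_outputs: int) -> float:
    return _ratio(correct_outputs, total_outputs)


# Hypothetical measurements.
print(request_success_rate(total_requests=10_000, bad_requests=7))
print(request_latency([120, 180, 250, 900], threshold_ms=300))
print(availability_time(window_minutes=43_200, downtime_minutes=20))
```

Every type reduces to the same good/total ratio, so a single target and a single error budget calculation apply regardless of which SLI type is chosen.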
error_budget_calculator.py
Given target availability + window, computes:
- Allowed downtime in the window
- Multi-window burn-rate thresholds per Google SRE Workbook (Chapter 5):
  - Fast burn: page if 2% of monthly budget consumed in 1 hour
  - Slow burn: page if 5% consumed in 6 hours, ticket if 10% in 3 days
- Recommended alerting rules (PromQL-shaped output)

```bash
python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json
```
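To make the multi-window idea concrete, here is a rough sketch that derives burn-rate thresholds and a PromQL-shaped condition. It uses the figures published in the SRE Workbook's alerting chapter (2% in 1 h, 5% in 6 h, 10% in 3 d, each paired with a short confirmation window one twelfth as long). The class, function names, and the `slo:error_ratio` recording-rule name are hypothetical, and the real output of `error_budget_calculator.py` may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    severity: str           # "page" or "ticket"
    budget_fraction: float  # share of the window's budget consumed
    long_window_h: float    # window over which that share is consumed
    short_window_h: float   # confirmation window (long window / 12)

    def burn_rate(self, slo_window_h: float) -> float:
        """Burn rate that consumes budget_fraction within long_window_h."""
        return self.budget_fraction * slo_window_h / self.long_window_h


ALERTS = (
    BurnAlert("page", 0.02, 1, 5 / 60),
    BurnAlert("page", 0.05, 6, 30 / 60),
    BurnAlert("ticket", 0.10, 72, 6),
)


def _fmt(hours: float) -> str:
    """Render a window as a Prometheus-style duration such as 5m or 6h."""
    return f"{round(hours * 60)}m" if hours < 1 else f"{round(hours)}h"


def promql_condition(alert: BurnAlert, target_pct: float, window_days: int,
                     metric: str = "slo:error_ratio") -> str:
    """Both windows must exceed burn_rate x error budget for the alert to fire."""
    threshold = alert.burn_rate(window_days * 24) * (1 - target_pct / 100)
    return (f"{metric}:rate{_fmt(alert.long_window_h)} > {threshold:.5f} and "
            f"{metric}:rate{_fmt(alert.short_window_h)} > {threshold:.5f}")


for alert in ALERTS:
    print(alert.severity, f"{alert.burn_rate(30 * 24):.1f}x",
          promql_condition(alert, target_pct=99.9, window_days=30))
```

Requiring the short window as well as the long one means the alert stops firing soon after the problem is fixed, which is the main reason to prefer this shape over a single-window rule.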
slo_review.py
Audits a directory of SLO definitions (markdown or JSON) for the common bugs.

```bash
python scripts/slo_review.py --slo-doc docs/slos/
```

Checks:
- `target_too_high`: target ≥ 99.99% (sustainable only with massive engineering investment)
- `target_too_low`: target ≤ 99.0% (probably wrong SLI; users will notice)
- `window_too_short`: window < 7 days (statistical noise dominates)
- `window_too_long`: window > 90 days (slow feedback)
- `no_sli_definition`: SLI section missing or vague ("everything OK")
- `no_error_budget_policy`: no documented action when budget burns
- `cpu_as_sli`: CPU/memory used as user-experience proxy (wrong signal)
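The checks above are simple enough to express as a small rule function. The following is a sketch under assumptions: the dictionary keys (`target`, `window_days`, `sli`, `error_budget_policy`) are invented for illustration, the CPU/memory detection is a naive keyword match, and detecting a "vague" SLI is not attempted. The real `slo_review.py` may work differently.

```python
import re


def review_slo(slo: dict) -> list[str]:
    """Return the IDs of the checks that a single SLO definition fails."""
    findings = []

    target = slo.get("target")
    if target is not None:
        if target >= 99.99:
            findings.append("target_too_high")
        elif target <= 99.0:
            findings.append("target_too_low")

    window = slo.get("window_days")
    if window is not None:
        if window < 7:
            findings.append("window_too_short")
        elif window > 90:
            findings.append("window_too_long")

    sli = (slo.get("sli") or "").strip()
    if not sli:
        findings.append("no_sli_definition")
    elif re.search(r"\b(cpu|memory)\b", sli, re.IGNORECASE):
        findings.append("cpu_as_sli")

    if not (slo.get("error_budget_policy") or "").strip():
        findings.append("no_error_budget_policy")

    return findings


# Hypothetical definition that trips four checks.
print(review_slo({"target": 99.99, "window_days": 3, "sli": "CPU usage below 80%"}))
```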
SLI selection cheatsheet
| User experience | SLI type | What you measure |
|---|---|---|
| "Did the request succeed?" | `request-success-rate` | `(total_requests - bad_requests) / total_requests` |
| "Was the response fast?" | `request-latency` | `count(requests < threshold) / total_requests` |
| "Was the service up?" | `availability-time` | `(window - downtime) / window` |
| "Is the data current?" | `data-freshness` | `count(data_age < threshold) / total_data_points` |
| "Was the answer correct?" | `correctness` | `count(correct_outputs) / total_outputs` |

See `references/sli_design.md` for examples and anti-patterns.
Error budget math (the basics)
For 99.9% SLO over 30 days (720 hours):
- Allowed unavailability: `0.1% × 30 × 24 × 60 = 43.2 minutes`
- 1-hour fast-burn threshold (2% of monthly budget burned in 1h): `2% × 720 / 1 = 14.4× burn rate`
- 6-hour slow-burn threshold (5% of monthly budget burned in 6h): `5% × 720 / 6 = 6× burn rate`

A burn rate of 1× spends the budget exactly over the full window, so the threshold is the budget fraction multiplied by the ratio of the SLO window to the alert window. `error_budget_calculator.py` computes these values.
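The arithmetic above can be checked with a short script. This is a sketch with assumed function names, not part of the skill. It also shows one consequence of the burn-rate figures: how long the budget lasts if a given rate is sustained.

```python
def allowed_bad_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of full unavailability the error budget permits in the window."""
    return (100 - target_pct) / 100 * window_days * 24 * 60


def hours_to_exhaustion(burn_rate: float, window_days: int) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24 / burn_rate


print(f"{allowed_bad_minutes(99.9, 30):.1f} minutes of budget")
for rate in (14.4, 6.0, 1.0):
    print(f"{rate:>4}x burn -> budget gone in {hours_to_exhaustion(rate, 30):.0f} h")
```

At 14.4x the 30-day budget is gone in about 50 hours, and at 6x in about 5 days, which is why both of those rates page while a 1x burn only opens a ticket.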
Composition with the rest of the portfolio
This skill explicitly composes with three others:

| Skill | Composition |
|---|---|
| `feature-flags-architect` | Rollout abort criteria reference SLO burn-rate thresholds |
| `chaos-engineering` | Blast-radius calculator already takes monthly error budget as input; define it here |
| | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |

The `error_budget_calculator.py` output is in the same shape `chaos-engineering/scripts/blast_radius_calculator.py` expects on stdin.
Workflows
Workflow 1: Define a new SLO
1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
target = floor(p50 of last 30 days × 100) / 100
This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py; it must pass before the SLO is "live".
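Step 4's formula can be tried on historical data with the standard library. The sketch below assumes the history is a list of daily SLI values expressed as percentages, which the workflow does not specify; the function name and sample numbers are made up for illustration.

```python
import math
import statistics


def pick_target(daily_sli_pct: list[float]) -> float:
    """target = floor(p50 of the history x 100) / 100, i.e. the median
    truncated (not rounded) to two decimal places."""
    p50 = statistics.median(daily_sli_pct)
    return math.floor(p50 * 100) / 100


# Hypothetical daily SLI readings, in percent.
history = [99.97, 99.91, 99.99, 99.957, 99.93]
print(pick_target(history))
```

Truncating downward keeps the target at or below the measured median, so the result is never a number the history has not already reached.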
Workflow 2: Quarterly SLO review
1. For every active SLO, run slo_review.py — fix any FAIL findings.
2. Look at last quarter's data:
- Was the SLO too easy (never burned budget)? Tighten target.
- Was it too hard (frequently burned)? Loosen target OR fix the system.
- Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.
Workflow 3: SLO-driven rollback
1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.
References
- `references/slo_principles.md`: SLI vs SLO vs SLA, Google SRE Workbook canon
- `references/sli_design.md`: picking the right SLI; 5 types with examples
- `references/error_budget.md`: error budget math, burn-rate alerts, budget policy
- `references/composition.md`: how SLOs feed feature flags, chaos, operators
Slash command
`/slo-design`
Asset templates
- `assets/slo_template.yaml`: fillable SLO YAML
- `assets/error_budget_policy.md`: fillable policy template
Anti-patterns
- 99.99% on every endpoint — copy-paste SLOs that nobody verified the system can sustain
- CPU usage as SLI — system metrics aren't user experience
- Single-window burn-rate alert — too noisy if 5-min, too slow if 30-day
- No error budget policy — burning budget means nothing without an action
- SLOs without owners — no one is responsible; they bit-rot
- SLOs reviewed once a year — system characteristics change faster than that
- SLAs in the SLO doc — different audience, different stakes; keep them separate
- SLO target = SLA target — SLO must be tighter (you should beat your contract before customers notice)
Verifiable success
A team using this skill should achieve:
- 100% of SLOs pass `slo_review.py` with 0 FAIL findings
- Every SLO has a documented owner, error budget, burn-rate alerts, and policy
- Burn-rate alerts fire ≤2 times/month per SLO that's hit (signal, not noise)
- Mean time to detect SLO violation: <30 min (multi-window burn-rate alerts working)
- Quarterly SLO review happens every quarter (not annually)