Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships SLO designer, error-budget calculator with multi-window burn-rate thresholds, and SLO reviewer that catches the common bugs (target too aggressive, window too short, conflicting SLOs, no SLI definition). 4 references on SLO principles + SLI design + error budget math + composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill — specifically the SLO discipline.
npx skill4agent add alirezarezvani/claude-skills slo-architect

observability-designer, performance-profiler, incident-response

SLI ⟶ measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO ⟶ target for the SLI over a window (e.g., 99.9% over 30 days)
SLA ⟶ customer-facing commitment with consequences (separate concern)
EB ⟶ error budget: 100% − SLO target = how much "bad" you can spend
BR ⟶ burn rate: how fast you're consuming the error budget

SKILL=engineering/slo-architect/skills/slo-architect
# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
--service checkout-svc \
--sli-type request-success-rate \
--target 99.9 \
--window-days 30
# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
--target 99.9 --window-days 30
# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/

slo_designer.py
exit 1

python scripts/slo_designer.py \
--service checkout-svc \
--sli-type request-success-rate \
--target 99.9 \
--window-days 30 \
--owner team-checkout

request-success-rate: (total_requests - bad_requests) / total_requests
request-latency: count(requests < threshold) / total_requests
availability-time: (window - downtime) / window
data-freshness: count(data_age < threshold) / total_data_points
correctness: count(correct_outputs) / total_outputs

<must define>
--format json
slo_review.py

error_budget_calculator.py

python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json

slo_review.py

python scripts/slo_review.py --slo-doc docs/slos/

target_too_high, target_too_low, window_too_short, window_too_long, no_sli_definition, no_error_budget_policy, cpu_as_sli

| User experience | SLI type | What you measure |
|---|---|---|
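| "Did the request succeed?" | request-success-rate | (total_requests - bad_requests) / total_requests |
| "Was the response fast?" | request-latency | count(requests < threshold) / total_requests |
| "Was the service up?" | availability-time | (window - downtime) / window |
| "Is the data current?" | data-freshness | count(data_age < threshold) / total_data_points |
| "Was the answer correct?" | correctness | count(correct_outputs) / total_outputs |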
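
Each SLI type listed in the table reduces to a ratio of good events to total events. The sketch below illustrates that idea for three of the types. It is a hypothetical helper, not code from the skill's scripts: the function names, the sample inputs, and the choice to treat an empty window as 100% are all assumptions made for the example.

```python
def ratio_sli(good: int, total: int) -> float:
    """Return good/total as a percentage.

    Assumption: a window with no events counts as 100% (nothing failed).
    """
    if total < 0 or good < 0 or good > total:
        raise ValueError("need 0 <= good <= total")
    if total == 0:
        return 100.0
    return 100.0 * good / total


def request_success_rate(total_requests: int, bad_requests: int) -> float:
    # (total_requests - bad_requests) / total_requests
    return ratio_sli(total_requests - bad_requests, total_requests)


def request_latency(latencies_ms: list[float], threshold_ms: float) -> float:
    # count(requests < threshold) / total_requests
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return ratio_sli(fast, len(latencies_ms))


def availability_time(window_minutes: int, downtime_minutes: int) -> float:
    # (window - downtime) / window
    return ratio_sli(window_minutes - downtime_minutes, window_minutes)


if __name__ == "__main__":
    print(request_success_rate(total_requests=1_000_000, bad_requests=700))
    print(request_latency([120.0, 80.0, 450.0, 95.0], threshold_ms=300.0))
    print(availability_time(window_minutes=43_200, downtime_minutes=30))
```

Keeping every SLI in the same good/total shape means one error-budget calculation can serve all SLI types.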
references/sli_design.md

0.1% × 30 × 24 × 60 = 43.2 minutes
2% × 43.2 / 60 ≈ 1.44% error ratio (14.4× burn rate)
5% × 43.2 / 360 ≈ 0.6% error ratio (6× burn rate)

error_budget_calculator.py

| Skill | Composition |
|---|---|
| feature-flags-architect | Rollout abort criteria reference SLO burn-rate thresholds |
| chaos-engineering | Blast-radius calculator already takes monthly error budget as input; define it here |
| kubernetes-operator | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |
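
The error-budget arithmetic (0.1% of a 30-day window is 43.2 minutes) generalizes to any target and window, and the resulting budget is the figure the composition table refers to. The following is a minimal sketch, assuming the burn-rate thresholds follow the Google SRE workbook convention of "a fraction of the budget consumed within an alert window". It does not reproduce error_budget_calculator.py, and every function name here is hypothetical.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of total failure the SLO tolerates over the window."""
    return (100.0 - target_pct) / 100.0 * window_days * 24 * 60


def burn_rate(budget_fraction: float, alert_window_hours: float, window_days: int) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (window_days * 24) / alert_window_hours


def error_ratio_threshold(target_pct: float, rate: float) -> float:
    """Error ratio, in percent, that corresponds to a given burn rate."""
    return (100.0 - target_pct) * rate


if __name__ == "__main__":
    target, days = 99.9, 30
    print(f"budget: {error_budget_minutes(target, days):.1f} min")
    # (fraction of budget, alert window in hours): workbook-style pairs, assumed here.
    for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 72)]:
        rate = burn_rate(fraction, hours, days)
        ratio = error_ratio_threshold(target, rate)
        print(f"{fraction:.0%} in {hours}h -> burn rate {rate:.1f}, error ratio {ratio:.2f}%")
```

A burn rate of 1 spends exactly the whole budget by the end of the window; anything above 1 exhausts it early.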
error_budget_calculator.py
chaos-engineering/scripts/blast_radius_calculator.py

1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
target = floor(p50 of last 30 days × 100) / 100
This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py; it must pass before the SLO is "live".

1. For every active SLO, run slo_review.py and fix any FAIL findings.
2. Look at last quarter's data:
- Was the SLO too easy (never burned budget)? Tighten target.
- Was it too hard (frequently burned)? Loosen target OR fix the system.
- Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.

1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.

references/slo_principles.md
references/sli_design.md
references/error_budget.md
references/composition.md

/slo-design

assets/slo_template.yaml
assets/error_budget_policy.md

slo_review.py
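
The deploy scenario (a new release burns budget, an alert fires, a rollback follows) depends on how the burn-rate alert is evaluated. A common multi-window formulation, taken from the Google SRE workbook rather than from this skill's scripts, fires only when both a long window and a short window exceed the same burn-rate threshold, so the alert stops once the problem has stopped. The sketch below assumes hypothetical error-ratio inputs and a 14.4 threshold (2% of a 30-day budget in one hour); all names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    threshold: float  # burn-rate multiple that triggers the rule


def observed_burn_rate(error_ratio: float, target_pct: float) -> float:
    """Observed error ratio (0..1) divided by the error ratio the SLO allows."""
    allowed = (100.0 - target_pct) / 100.0
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return error_ratio / allowed


def should_alert(rule: BurnRateRule, long_ratio: float, short_ratio: float,
                 target_pct: float) -> bool:
    """Fire only if both the long and the short window burn above the threshold."""
    long_burn = observed_burn_rate(long_ratio, target_pct)
    short_burn = observed_burn_rate(short_ratio, target_pct)
    return long_burn > rule.threshold and short_burn > rule.threshold


if __name__ == "__main__":
    page = BurnRateRule(name="page", threshold=14.4)
    # Hypothetical readings: 2% errors over the last hour, 3% over the last 5 minutes.
    print(should_alert(page, long_ratio=0.02, short_ratio=0.03, target_pct=99.9))
    # Same hour, but the last 5 minutes have recovered: no alert.
    print(should_alert(page, long_ratio=0.02, short_ratio=0.0005, target_pct=99.9))
```

Requiring the short window as well keeps a rollback from being triggered by an incident that has already ended.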