sre-engineer


Site Reliability Engineer

Purpose

Provides expert site reliability engineering for building and maintaining highly available, scalable, and resilient systems. Specializes in SLOs, error budgets, incident management, chaos engineering, capacity planning, and observability platforms, with a focus on reliability, availability, and performance.

When to Use

  • Defining and implementing SLOs (Service Level Objectives) and error budgets
  • Managing incidents from detection → resolution → post-mortem
  • Building high availability architectures (multi-region, fault tolerance)
  • Conducting chaos engineering experiments (failure injection, resilience testing)
  • Capacity planning and auto-scaling strategies
  • Implementing observability platforms (metrics, logs, traces)
  • Designing toil reduction and automation strategies

Quick Start

Invoke this skill when:
  • Defining and implementing SLOs (Service Level Objectives) and error budgets
  • Managing incidents from detection → resolution → post-mortem
  • Building high availability architectures (multi-region, fault tolerance)
  • Conducting chaos engineering experiments (failure injection, resilience testing)
  • Capacity planning and auto-scaling strategies
  • Implementing observability platforms (metrics, logs, traces)
Do NOT invoke when:
  • Only DevOps automation needed (use devops-engineer for CI/CD pipelines)
  • Application-level debugging (use debugger skill)
  • Infrastructure provisioning without reliability context (use cloud-architect)
  • Database performance tuning (use database-optimizer)
  • Security incident response (use incident-responder for security)

Core Workflows

Workflow 1: Define and Implement SLOs

Use case: A new microservice needs an SLO definition and monitoring

Step 1: SLI (Service Level Indicator) Selection
```yaml
service: User Authentication API
critical_user_journey: Login flow

sli_candidates:
  - name: Availability (request success rate)
    definition: (successful_requests / total_requests) * 100
    measurement: HTTP 2xx responses vs. 5xx errors
    rationale: Core indicator of service health
  - name: Latency (response time)
    definition: P99 response time < 500ms
    measurement: Time from request received to response sent
    rationale: User experience is directly impacted by slow logins
  - name: Correctness (authentication accuracy)
    definition: Valid tokens issued / authentication attempts
    measurement: JWT validation failures within 1 hour of issuance
    rationale: Security and functional correctness

selected_slis:
  - Availability: 99.9% (primary SLO)
  - Latency P99: 500ms (secondary SLO)
```
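
These SLIs can be pinned down as Prometheus recording rules so that dashboards and alerts share a single definition. A minimal sketch, assuming the metric names used by the alerting rules later in this workflow:
```yaml
groups:
  - name: auth_service_sli
    interval: 30s
    rules:
      # Availability SLI: share of successful responses over 5 minutes
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="auth",code=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{service="auth"}[5m]))
      # Latency SLI: P99 response time over 5 minutes
      - record: sli:latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="auth"}[5m])) by (le)
          )
```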

Step 2: SLO Definition Document
```markdown
# Authentication Service SLO

## Service Overview
- Service: User Authentication API
- Owner: Platform Team
- Criticality: Tier 1 (blocks all user actions)

## SLO Commitments

### Primary SLO: Availability
- Target: 99.9% availability over a 28-day rolling window
- Error Budget: 0.1% = 40.3 minutes of downtime per 28 days
- Measurement: (count(http_response_code=2xx) / count(http_requests)) >= 0.999
- Exclusions: Planned maintenance windows, client errors (4xx)

### Secondary SLO: Latency
- Target: P99 latency < 500ms
- Error Budget: 1% of requests may exceed 500ms
- Measurement: histogram_quantile(0.99, http_request_duration_seconds) < 0.5
- Measurement Window: 5-minute sliding window

## Error Budget Policy

Budget remaining → actions:
- > 50%: Normal development velocity; feature releases allowed
- 25-50%: Slow down feature releases; prioritize reliability
- 10-25%: Feature freeze; focus on SLO improvement
- < 10%: Incident declared; all hands on reliability

Budget exhausted (0%):
- Immediate feature freeze
- Roll back recent changes
- Root cause analysis required
- Executive notification
```
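
The budget-remaining number that drives this policy is straightforward to compute. A minimal sketch (the measured availability here is illustrative):
```bash
# budget_remaining = 1 - (1 - measured_availability) / (1 - slo_target)
availability=0.9995   # measured over the 28-day window (illustrative)
slo_target=0.999
echo "scale=4; 1 - (1 - $availability) / (1 - $slo_target)" | bc -l
# => .5000, i.e. 50% of the error budget remains
```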

Monitoring and Alerting

Prometheus Alerting Rules:
```yaml
groups:
  - name: auth_service_slo
    interval: 30s
    rules:
      # Availability SLO alert
      - alert: AuthServiceSLOBreach
        expr: |
          (
            sum(rate(http_requests_total{service="auth",code=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="auth"}[5m]))
          ) < 0.999
        for: 5m
        labels:
          severity: critical
          service: auth
        annotations:
          summary: "Auth service availability below SLO"
          description: "Current availability: {{ $value | humanizePercentage }}"
      
      # Error budget burn rate alert (fast burn)
      - alert: AuthServiceErrorBudgetFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="auth",code=~"2.."}[1h]))
              /
              sum(rate(http_requests_total{service="auth"}[1h]))
            )
          ) > 14.4 * (1 - 0.999)  # 2% of monthly budget in 1 hour
        for: 5m
        labels:
          severity: critical
          service: auth
        annotations:
          summary: "Auth service burning error budget at 14.4x rate"
          description: "At this rate, monthly budget exhausted in 2 days"
      
      # Latency SLO alert
      - alert: AuthServiceLatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="auth"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          service: auth
        annotations:
          summary: "Auth service P99 latency above SLO"
          description: "Current P99: {{ $value }}s (SLO: 0.5s)"
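      # A slow-burn companion to the fast-burn alert above, following the
      # multiwindow burn-rate guidance in the Google SRE Workbook. Thresholds
      # here are a sketch: 6x burn over 6 hours consumes ~5% of a 30-day budget.
      - alert: AuthServiceErrorBudgetSlowBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="auth",code=~"2.."}[6h]))
              /
              sum(rate(http_requests_total{service="auth"}[6h]))
            )
          ) > 6 * (1 - 0.999)  # ~5% of monthly budget in 6 hours
        for: 30m
        labels:
          severity: warning
          service: auth
        annotations:
          summary: "Auth service burning error budget at 6x rate"
          description: "At this rate, monthly budget exhausted in 5 days"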
```

Step 3: Grafana Dashboard
```json
{
  "dashboard": {
    "title": "Auth Service SLO Dashboard",
    "panels": [
      {
        "title": "30-Day Availability SLO",
        "targets": [{
          "expr": "avg_over_time((sum(rate(http_requests_total{service=\"auth\",code=~\"2..\"}[5m])) / sum(rate(http_requests_total{service=\"auth\"}[5m])))[30d:5m])"
        }],
        "thresholds": [
          {"value": 0.999, "color": "green"},
          {"value": 0.995, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      },
      {
        "title": "Error Budget Remaining",
        "targets": [{
          "expr": "1 - ((1 - avg_over_time((sum(rate(http_requests_total{service=\"auth\",code=~\"2..\"}[5m])) / sum(rate(http_requests_total{service=\"auth\"}[5m])))[30d:5m])) / (1 - 0.999))"
        }],
        "visualization": "gauge",
        "thresholds": [
          {"value": 0.5, "color": "green"},
          {"value": 0.25, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      }
    ]
  }
}
```


Workflow 3: Chaos Engineering Experiment

Use case: Validate resilience to database failover

Experiment Design:
```yaml
# chaos-experiment-db-failover.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: database-primary-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgresql
      role: primary
  scheduler:
    cron: "@every 2h"  # Run experiment every 2 hours
  duration: "0s"       # Instant kill
```
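
Assuming Chaos Mesh is installed in the cluster, the experiment is applied and observed like any other Kubernetes resource:
```bash
kubectl apply -f chaos-experiment-db-failover.yaml
kubectl describe podchaos database-primary-kill -n chaos-testing
```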

Hypothesis:
```markdown
Steady State:
- Application maintains 99.9% availability
- P99 latency < 500ms
- Database queries succeed, with automatic failover to a replica

Perturbation:
- Kill the primary database pod (simulates an AZ failure)

Expected Behavior:
- Kubernetes detects the pod failure within 10 seconds
- A replica is promoted to primary within 30 seconds
- The application reconnects to the new primary within 5 seconds
- Total impact: <45 seconds of elevated error rate (<5%)
- No data loss (synchronous replication)

Abort Conditions:
- Error rate >20% for >60 seconds
- Manual rollback command issued
- Customer complaints spike to >10x normal
```

Execution Steps:
```bash
#!/bin/bash
# chaos-experiment-runner.sh
set -e

# Prometheus HTTP API endpoint (adjust for your environment)
PROM_URL="http://prometheus:9090"

# Query the Prometheus HTTP API; the jq path below matches its JSON response.
query_prom() {
  curl -s "${PROM_URL}/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1]'
}

echo "=== Chaos Experiment: Database Failover ==="
echo "Start time: $(date)"

# Step 1: Baseline metrics (5 minutes)
echo "[1/7] Collecting baseline metrics..."
sleep 300
BASELINE_ERROR_RATE=$(query_prom \
  'sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))')
echo "Baseline error rate: ${BASELINE_ERROR_RATE}"

# Step 2: Inject failure
echo "[2/7] Injecting failure: killing primary database pod..."
kubectl delete pod -l app=postgresql,role=primary -n production

# Step 3: Monitor failover
echo "[3/7] Monitoring failover process..."
for i in {1..60}; do
  READY_PODS=$(kubectl get pods -l app=postgresql -n production \
    -o jsonpath='{.items[?(@.status.conditions[?(@.type=="Ready")].status=="True")].metadata.name}' \
    | wc -w)
  if [ "$READY_PODS" -ge 1 ]; then
    echo "Failover completed at T+${i}s: ${READY_PODS} ready pods"
    break
  fi
  echo "T+${i}s: Waiting for replica promotion..."
  sleep 1
done

# Step 4: Measure impact
echo "[4/7] Measuring incident impact..."
sleep 60  # Wait for metrics to stabilize
INCIDENT_ERROR_RATE=$(query_prom \
  'max_over_time((sum(rate(http_requests_total{code=~"5.."}[1m])) / sum(rate(http_requests_total[1m])))[5m:])')
echo "Peak error rate during incident: ${INCIDENT_ERROR_RATE}"

# Step 5: Validate recovery
echo "[5/7] Validating service recovery..."
for i in {1..30}; do
  CURRENT_ERROR_RATE=$(query_prom \
    'sum(rate(http_requests_total{code=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))')
  if (( $(echo "$CURRENT_ERROR_RATE < 0.01" | bc -l) )); then
    echo "Service recovered at T+$((60 + i))s"
    break
  fi
  sleep 1
done

# Step 6: Data integrity check
echo "[6/7] Running data integrity checks..."
psql -h postgres-primary-service -U app \
  -c "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '10 minutes';"

# Step 7: Results summary
echo "[7/7] Experiment Results:"
echo "================================"
echo "Baseline error rate: ${BASELINE_ERROR_RATE}"
echo "Peak error rate:     ${INCIDENT_ERROR_RATE}"
echo "Current error rate:  ${CURRENT_ERROR_RATE}"
echo "Failover time:       ~30-45 seconds"
echo "Hypothesis validation: $([ "$(echo "$INCIDENT_ERROR_RATE < 0.05" | bc -l)" -eq 1 ] && echo "PASS" || echo "FAIL")"
echo "================================"

# Output experiment report
cat > "experiment-report-$(date +%Y%m%d-%H%M%S).md" <<EOF
# Chaos Experiment Report: Database Failover

## Experiment Details
- Date: $(date)
- Hypothesis: Application survives primary database failure with <5% error rate
- Perturbation: Kill primary PostgreSQL pod

## Results
- Baseline error rate: ${BASELINE_ERROR_RATE}
- Peak error rate during failure: ${INCIDENT_ERROR_RATE}
- Recovery time: ~45 seconds
- Data integrity: Verified (no data loss)

## Hypothesis Validation
$([ "$(echo "$INCIDENT_ERROR_RATE < 0.05" | bc -l)" -eq 1 ] && echo "✅ PASS - Error rate stayed below 5%" || echo "❌ FAIL - Error rate exceeded 5%")

## Action Items
1. Reduce failover time from 45s to <30s (tune health check intervals)
2. Add connection pool retry logic (reduce client-side errors)
3. Improve monitoring alerts for database failover events
EOF

echo "Experiment report generated."
```

Expected Results:
  • Failover time: 30-45 seconds
  • Peak error rate: 3-4% (below the 5% threshold)
  • Data integrity: 100% preserved
  • SLO impact: 45 seconds at a 4% error rate = 1.8 seconds of error budget consumed (~0.07% of the 40.3-minute monthly budget)

---

❌ Anti-Pattern 2: No Incident Command Structure

What it looks like:
[During P0 incident in Slack]
Engineer A: "Database is down!"
Engineer B: "I'm restarting it"
Engineer C: "Wait, I'm also trying to restart it"
Engineer A: "Should we roll back the deployment?"
Engineer B: "I don't know, maybe?"
Engineer C: "Who's talking to customers?"
[15 minutes of chaos, uncoordinated actions]
Why it fails:
  • No single decision maker
  • Duplicate/conflicting actions
  • No stakeholder communication
  • Timeline not documented
  • Learning opportunities lost
Correct approach (Incident Command System):
Incident Roles:
1. Incident Commander (IC) - Makes decisions, coordinates
2. Tech Lead - Investigates root cause, implements fixes
3. Communications Lead - Updates stakeholders
4. Scribe - Documents timeline

[Incident starts]
IC: "@team P0 incident declared. I'm IC. @alice tech lead, @bob comms, @charlie scribe"
IC: "@alice what's the current state?"
Alice: "Database primary down, replica healthy. Investigating cause."
IC: "Decision: Promote replica to primary now. @alice proceed."
Bob: "Posted status page update: investigating database issue."
Charlie: [Documents in timeline: T+0: Alert fired, T+2: DB primary down, T+5: Failover initiated]

IC: "Mitigation complete. @alice confirm service health."
Alice: "Error rate back to 0.1%, latency normal."
IC: "Incident resolved. @bob final status update. @charlie compile timeline for post-mortem."
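
A lightweight timeline template for the scribe (illustrative) keeps post-mortem input consistent:
```markdown
# Incident Timeline: INC-XXXX
Roles: IC @... | Tech Lead @... | Comms @... | Scribe @...

| Time | Event                                       |
|------|---------------------------------------------|
| T+0  | Alert fired: auth availability below SLO    |
| T+2  | Database primary confirmed down             |
| T+5  | Failover initiated (IC decision)            |
| T+9  | Error rate back to 0.1%; incident resolved  |
```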


Quality Checklist

SLO Implementation

  • SLIs clearly defined and measurable
  • Error budget calculated and tracked
  • Prometheus/monitoring queries validated
  • Alert thresholds set (avoid alert fatigue)
  • Error budget policy documented
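
For the "queries validated" item, alerting and recording rule files can be checked before they are deployed. A sketch, with an assumed file name:
```bash
# Validate rule syntax and expressions (file name illustrative)
promtool check rules auth_service_slo_rules.yaml
```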

Incident Response

  • Runbooks exist for all critical services
  • Incident command roles defined
  • Communication templates ready
  • On-call rotation sustainable (<5 pages/week)
  • Post-mortem process established (blameless)

High Availability

  • Multi-AZ deployment verified
  • Automated failover tested
  • RTO/RPO documented and validated
  • Disaster recovery tested quarterly
  • Chaos experiments run monthly

This SRE skill provides production-ready reliability engineering practices, with an emphasis on SLOs, incident management, and continuous improvement.