sre-engineer


Site Reliability Engineer

Purpose

Provides expert site reliability engineering for building and maintaining highly available, scalable, and resilient systems. Specializes in SLOs, error budgets, incident management, chaos engineering, capacity planning, and observability platforms, with a focus on reliability, availability, and performance.

When to Use

  • Defining and implementing SLOs (Service Level Objectives) and error budgets
  • Managing incidents from detection → resolution → post-mortem
  • Building high availability architectures (multi-region, fault tolerance)
  • Conducting chaos engineering experiments (failure injection, resilience testing)
  • Capacity planning and auto-scaling strategies
  • Implementing observability platforms (metrics, logs, traces)
  • Designing toil reduction and automation strategies

Quick Start

Invoke this skill when:
  • Defining and implementing SLOs (Service Level Objectives) and error budgets
  • Managing incidents from detection → resolution → post-mortem
  • Building high availability architectures (multi-region, fault tolerance)
  • Conducting chaos engineering experiments (failure injection, resilience testing)
  • Capacity planning and auto-scaling strategies
  • Implementing observability platforms (metrics, logs, traces)
Do NOT invoke when:
  • Only DevOps automation needed (use devops-engineer for CI/CD pipelines)
  • Application-level debugging (use debugger skill)
  • Infrastructure provisioning without reliability context (use cloud-architect)
  • Database performance tuning (use database-optimizer)
  • Security incident response (use incident-responder for security)

Core Workflows

Workflow 1: Define and Implement SLOs

Use case: A new microservice needs an SLO definition and monitoring

Step 1: SLI (Service Level Indicator) Selection
```yaml
service: User Authentication API
critical_user_journey: Login flow

sli_candidates:
  - name: Availability (request success rate)
    definition: (successful_requests / total_requests) * 100
    measurement: HTTP 2xx responses vs. 5xx errors
    rationale: Core indicator of service health
  - name: Latency (response time)
    definition: P99 response time < 500ms
    measurement: Time from request received to response sent
    rationale: User experience is directly impacted by slow logins
  - name: Correctness (authentication accuracy)
    definition: Valid tokens issued / authentication attempts
    measurement: JWT validation failures within 1 hour of issuance
    rationale: Security and functional correctness

selected_slis:
  - Availability: 99.9% (primary SLO)
  - Latency P99: 500ms (secondary SLO)
```
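
These SLIs can be pinned down as Prometheus recording rules so that dashboards and alerts share a single definition. A minimal sketch, assuming the metric names used by the alerting rules later in this workflow:
```yaml
groups:
  - name: auth_service_sli
    interval: 30s
    rules:
      # Availability SLI: share of successful responses over 5 minutes
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="auth",code=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{service="auth"}[5m]))
      # Latency SLI: P99 response time over 5 minutes
      - record: sli:latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="auth"}[5m])) by (le)
          )
```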

Step 2: SLO Definition Document
```markdown
# Authentication Service SLO

## Service Overview
- Service: User Authentication API
- Owner: Platform Team
- Criticality: Tier 1 (blocks all user actions)

## SLO Commitments

### Primary SLO: Availability
- Target: 99.9% availability over a 28-day rolling window
- Error Budget: 0.1% = 40.3 minutes of downtime per 28 days
- Measurement: (count(http_response_code=2xx) / count(http_requests)) >= 0.999
- Exclusions: Planned maintenance windows, client errors (4xx)

### Secondary SLO: Latency
- Target: P99 latency < 500ms
- Error Budget: 1% of requests may exceed 500ms
- Measurement: histogram_quantile(0.99, http_request_duration_seconds) < 0.5
- Measurement Window: 5-minute sliding window

## Error Budget Policy

Budget remaining → actions:
- > 50%: Normal development velocity; feature releases allowed
- 25-50%: Slow down feature releases; prioritize reliability
- 10-25%: Feature freeze; focus on SLO improvement
- < 10%: Incident declared; all hands on reliability

Budget exhausted (0%):
- Immediate feature freeze
- Roll back recent changes
- Root cause analysis required
- Executive notification
```
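
The budget-remaining number that drives this policy is straightforward to compute. A minimal sketch (the measured availability here is illustrative):
```bash
# budget_remaining = 1 - (1 - measured_availability) / (1 - slo_target)
availability=0.9995   # measured over the 28-day window (illustrative)
slo_target=0.999
echo "scale=4; 1 - (1 - $availability) / (1 - $slo_target)" | bc -l
# => .5000, i.e. 50% of the error budget remains
```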

Monitoring and Alerting

Prometheus Alerting Rules:
```yaml
groups:
  - name: auth_service_slo
    interval: 30s
    rules:
      # Availability SLO alert
      - alert: AuthServiceSLOBreach
        expr: |
          (
            sum(rate(http_requests_total{service="auth",code=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="auth"}[5m]))
          ) < 0.999
        for: 5m
        labels:
          severity: critical
          service: auth
        annotations:
          summary: "Auth service availability below SLO"
          description: "Current availability: {{ $value | humanizePercentage }}"
      
      # Error budget burn rate alert (fast burn)
      - alert: AuthServiceErrorBudgetFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="auth",code=~"2.."}[1h]))
              /
              sum(rate(http_requests_total{service="auth"}[1h]))
            )
          ) > 14.4 * (1 - 0.999)  # 2% of monthly budget in 1 hour
        for: 5m
        labels:
          severity: critical
          service: auth
        annotations:
          summary: "Auth service burning error budget at 14.4x rate"
          description: "At this rate, monthly budget exhausted in 2 days"
      
      # Latency SLO alert
      - alert: AuthServiceLatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="auth"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          service: auth
        annotations:
          summary: "Auth service P99 latency above SLO"
          description: "Current P99: {{ $value }}s (SLO: 0.5s)"
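      # A slow-burn companion to the fast-burn alert above, following the
      # multiwindow burn-rate guidance in the Google SRE Workbook. Thresholds
      # here are a sketch: 6x burn over 6 hours consumes ~5% of a 30-day budget.
      - alert: AuthServiceErrorBudgetSlowBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="auth",code=~"2.."}[6h]))
              /
              sum(rate(http_requests_total{service="auth"}[6h]))
            )
          ) > 6 * (1 - 0.999)  # ~5% of monthly budget in 6 hours
        for: 30m
        labels:
          severity: warning
          service: auth
        annotations:
          summary: "Auth service burning error budget at 6x rate"
          description: "At this rate, monthly budget exhausted in 5 days"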
```

Step 3: Grafana Dashboard
```json
{
  "dashboard": {
    "title": "Auth Service SLO Dashboard",
    "panels": [
      {
        "title": "30-Day Availability SLO",
        "targets": [{
          "expr": "avg_over_time((sum(rate(http_requests_total{service=\"auth\",code=~\"2..\"}[5m])) / sum(rate(http_requests_total{service=\"auth\"}[5m])))[30d:5m])"
        }],
        "thresholds": [
          {"value": 0.999, "color": "green"},
          {"value": 0.995, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      },
      {
        "title": "Error Budget Remaining",
        "targets": [{
          "expr": "1 - ((1 - avg_over_time((sum(rate(http_requests_total{service=\"auth\",code=~\"2..\"}[5m])) / sum(rate(http_requests_total{service=\"auth\"}[5m])))[30d:5m])) / (1 - 0.999))"
        }],
        "visualization": "gauge",
        "thresholds": [
          {"value": 0.5, "color": "green"},
          {"value": 0.25, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      }
    ]
  }
}
```


Workflow 3: Chaos Engineering Experiment

Use case: Validate resilience to database failover

Experiment Design:
```yaml
# chaos-experiment-db-failover.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: database-primary-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: postgresql
      role: primary
  scheduler:
    cron: "@every 2h"  # Run experiment every 2 hours
  duration: "0s"       # Instant kill
```
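
Assuming Chaos Mesh is installed in the cluster, the experiment is applied and observed like any other Kubernetes resource:
```bash
kubectl apply -f chaos-experiment-db-failover.yaml
kubectl describe podchaos database-primary-kill -n chaos-testing
```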

Hypothesis:
```markdown
Steady State:
- Application maintains 99.9% availability
- P99 latency < 500ms
- Database queries succeed, with automatic failover to a replica

Perturbation:
- Kill the primary database pod (simulates an AZ failure)

Expected Behavior:
- Kubernetes detects the pod failure within 10 seconds
- A replica is promoted to primary within 30 seconds
- The application reconnects to the new primary within 5 seconds
- Total impact: <45 seconds of elevated error rate (<5%)
- No data loss (synchronous replication)

Abort Conditions:
- Error rate >20% for >60 seconds
- Manual rollback command issued
- Customer complaints spike to >10x normal
```

Execution Steps:
```bash
#!/bin/bash
# chaos-experiment-runner.sh
set -e

# Prometheus HTTP API endpoint (adjust for your environment)
PROM_URL="http://prometheus:9090"

# Query the Prometheus HTTP API; the jq path below matches its JSON response.
query_prom() {
  curl -s "${PROM_URL}/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1]'
}

echo "=== Chaos Experiment: Database Failover ==="
echo "Start time: $(date)"

# Step 1: Baseline metrics (5 minutes)
echo "[1/7] Collecting baseline metrics..."
sleep 300
BASELINE_ERROR_RATE=$(query_prom \
  'sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))')
echo "Baseline error rate: ${BASELINE_ERROR_RATE}"

# Step 2: Inject failure
echo "[2/7] Injecting failure: killing primary database pod..."
kubectl delete pod -l app=postgresql,role=primary -n production

# Step 3: Monitor failover
echo "[3/7] Monitoring failover process..."
for i in {1..60}; do
  READY_PODS=$(kubectl get pods -l app=postgresql -n production \
    -o jsonpath='{.items[?(@.status.conditions[?(@.type=="Ready")].status=="True")].metadata.name}' \
    | wc -w)
  if [ "$READY_PODS" -ge 1 ]; then
    echo "Failover completed at T+${i}s: ${READY_PODS} ready pods"
    break
  fi
  echo "T+${i}s: Waiting for replica promotion..."
  sleep 1
done

# Step 4: Measure impact
echo "[4/7] Measuring incident impact..."
sleep 60  # Wait for metrics to stabilize
INCIDENT_ERROR_RATE=$(query_prom \
  'max_over_time((sum(rate(http_requests_total{code=~"5.."}[1m])) / sum(rate(http_requests_total[1m])))[5m:])')
echo "Peak error rate during incident: ${INCIDENT_ERROR_RATE}"

# Step 5: Validate recovery
echo "[5/7] Validating service recovery..."
for i in {1..30}; do
  CURRENT_ERROR_RATE=$(query_prom \
    'sum(rate(http_requests_total{code=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))')
  if (( $(echo "$CURRENT_ERROR_RATE < 0.01" | bc -l) )); then
    echo "Service recovered at T+$((60 + i))s"
    break
  fi
  sleep 1
done

# Step 6: Data integrity check
echo "[6/7] Running data integrity checks..."
psql -h postgres-primary-service -U app \
  -c "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '10 minutes';"

# Step 7: Results summary
echo "[7/7] Experiment Results:"
echo "================================"
echo "Baseline error rate: ${BASELINE_ERROR_RATE}"
echo "Peak error rate:     ${INCIDENT_ERROR_RATE}"
echo "Current error rate:  ${CURRENT_ERROR_RATE}"
echo "Failover time:       ~30-45 seconds"
echo "Hypothesis validation: $([ "$(echo "$INCIDENT_ERROR_RATE < 0.05" | bc -l)" -eq 1 ] && echo "PASS" || echo "FAIL")"
echo "================================"

# Output experiment report
cat > "experiment-report-$(date +%Y%m%d-%H%M%S).md" <<EOF
# Chaos Experiment Report: Database Failover

## Experiment Details
- Date: $(date)
- Hypothesis: Application survives primary database failure with <5% error rate
- Perturbation: Kill primary PostgreSQL pod

## Results
- Baseline error rate: ${BASELINE_ERROR_RATE}
- Peak error rate during failure: ${INCIDENT_ERROR_RATE}
- Recovery time: ~45 seconds
- Data integrity: Verified (no data loss)

## Hypothesis Validation
$([ "$(echo "$INCIDENT_ERROR_RATE < 0.05" | bc -l)" -eq 1 ] && echo "✅ PASS - Error rate stayed below 5%" || echo "❌ FAIL - Error rate exceeded 5%")

## Action Items
1. Reduce failover time from 45s to <30s (tune health check intervals)
2. Add connection pool retry logic (reduce client-side errors)
3. Improve monitoring alerts for database failover events
EOF

echo "Experiment report generated."
```

Expected Results:
  • Failover time: 30-45 seconds
  • Peak error rate: 3-4% (below the 5% threshold)
  • Data integrity: 100% preserved
  • SLO impact: 45 seconds at a 4% error rate = 1.8 seconds of error budget consumed (~0.07% of the 40.3-minute monthly budget)

---

❌ Anti-Pattern 2: No Incident Command Structure

What it looks like:
[During P0 incident in Slack]
Engineer A: "Database is down!"
Engineer B: "I'm restarting it"
Engineer C: "Wait, I'm also trying to restart it"
Engineer A: "Should we roll back the deployment?"
Engineer B: "I don't know, maybe?"
Engineer C: "Who's talking to customers?"
[15 minutes of chaos, uncoordinated actions]
Why it fails:
  • No single decision maker
  • Duplicate/conflicting actions
  • No stakeholder communication
  • Timeline not documented
  • Learning opportunities lost
Correct approach (Incident Command System):
Incident Roles:
1. Incident Commander (IC) - Makes decisions, coordinates
2. Tech Lead - Investigates root cause, implements fixes
3. Communications Lead - Updates stakeholders
4. Scribe - Documents timeline

[Incident starts]
IC: "@team P0 incident declared. I'm IC. @alice tech lead, @bob comms, @charlie scribe"
IC: "@alice what's the current state?"
Alice: "Database primary down, replica healthy. Investigating cause."
IC: "Decision: Promote replica to primary now. @alice proceed."
Bob: "Posted status page update: investigating database issue."
Charlie: [Documents in timeline: T+0: Alert fired, T+2: DB primary down, T+5: Failover initiated]

IC: "Mitigation complete. @alice confirm service health."
Alice: "Error rate back to 0.1%, latency normal."
IC: "Incident resolved. @bob final status update. @charlie compile timeline for post-mortem."
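
A lightweight timeline template for the scribe (illustrative) keeps post-mortem input consistent:
```markdown
# Incident Timeline: INC-XXXX
Roles: IC @... | Tech Lead @... | Comms @... | Scribe @...

| Time | Event                                       |
|------|---------------------------------------------|
| T+0  | Alert fired: auth availability below SLO    |
| T+2  | Database primary confirmed down             |
| T+5  | Failover initiated (IC decision)            |
| T+9  | Error rate back to 0.1%; incident resolved  |
```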


Quality Checklist

SLO Implementation

  • SLIs clearly defined and measurable
  • Error budget calculated and tracked
  • Prometheus/monitoring queries validated
  • Alert thresholds set (avoid alert fatigue)
  • Error budget policy documented
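
For the "queries validated" item, alerting and recording rule files can be checked before they are deployed. A sketch, with an assumed file name:
```bash
# Validate rule syntax and expressions (file name illustrative)
promtool check rules auth_service_slo_rules.yaml
```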

Incident Response

  • Runbooks exist for all critical services
  • Incident command roles defined
  • Communication templates ready
  • On-call rotation sustainable (<5 pages/week)
  • Post-mortem process established (blameless)

High Availability

  • Multi-AZ deployment verified
  • Automated failover tested
  • RTO/RPO documented and validated
  • Disaster recovery tested quarterly
  • Chaos experiments run monthly

This SRE skill provides production-ready reliability engineering practices, with an emphasis on SLOs, incident management, and continuous improvement.