sre-expert

Site Reliability Engineering Expert

Expert guidance for SRE practices, reliability engineering, SLOs/SLIs, incident management, and operational excellence.

Core Concepts

SRE Fundamentals

  • Service Level Objectives (SLOs)
  • Service Level Indicators (SLIs)
  • Error budgets
  • Toil reduction
  • Monitoring and alerting
  • Capacity planning

Reliability Practices

  • Incident management
  • Post-incident reviews (PIRs)
  • On-call rotations
  • Chaos engineering
  • Disaster recovery
  • Change management

Automation

  • Infrastructure as Code
  • Configuration management
  • Deployment automation
  • Self-healing systems
  • Runbook automation
  • Automated remediation

SLO/SLI Management

python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Dict
import numpy as np

@dataclass
class SLI:
    """Service Level Indicator"""
    name: str
    description: str
    query: str
    unit: str  # 'percentage', 'milliseconds', etc.

@dataclass
class SLO:
    """Service Level Objective"""
    name: str
    sli: SLI
    target: float
    window_days: int

class SLOTracker:
    """Track and manage SLOs"""

    def __init__(self):
        self.slos: Dict[str, SLO] = {}
        self.measurements: Dict[str, List[Dict]] = {}

    def define_slo(self, slo: SLO):
        """Define a new SLO"""
        self.slos[slo.name] = slo
        self.measurements[slo.name] = []

    def record_measurement(self, slo_name: str, value: float, timestamp: datetime):
        """Record SLI measurement"""
        if slo_name in self.slos:
            self.measurements[slo_name].append({
                'value': value,
                'timestamp': timestamp
            })

    def calculate_slo_compliance(self, slo_name: str) -> Dict:
        """Calculate SLO compliance"""
        slo = self.slos.get(slo_name)
        if not slo:
            return {}

        measurements = self.measurements.get(slo_name, [])
        window_start = datetime.now() - timedelta(days=slo.window_days)

        recent_measurements = [
            m for m in measurements
            if m['timestamp'] > window_start
        ]

        if not recent_measurements:
            return {'status': 'no_data'}

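        values = [m['value'] for m in recent_measurements]
        actual = float(np.mean(values))  # plain float rather than np.float64

        return {
            'slo_name': slo_name,
            'target': slo.target,
            'actual': actual,
            # Assumes "higher is better" (e.g. availability); invert the
            # comparison for latency-style SLOs where lower values are better.
            'compliant': actual >= slo.target,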
            'window_days': slo.window_days,
            'sample_count': len(recent_measurements)
        }

    def calculate_error_budget(self, slo_name: str) -> Dict:
        """Calculate remaining error budget (assumes a percentage SLI)"""
        compliance = self.calculate_slo_compliance(slo_name)

        # An unknown SLO yields {}; guard before indexing into the result
        if not compliance:
            return {}
        if compliance.get('status') == 'no_data':
            return {'status': 'no_data'}

        target = compliance['target']
        actual = compliance['actual']

        # Budget is the allowed failure percentage, e.g. 0.1 for a 99.9 target
        error_budget_target = 100 - target
        errors_actual = 100 - actual

        remaining = error_budget_target - errors_actual
        remaining_pct = (remaining / error_budget_target) * 100 if error_budget_target > 0 else 100

        return {
            'slo_name': slo_name,
            'error_budget_target': error_budget_target,
            'errors_actual': errors_actual,
            'remaining': remaining,
            'remaining_percentage': remaining_pct,
            'exhausted': remaining < 0
        }
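
To make the percentage-based budget above more tangible, it can help to express it as allowed downtime. The sketch below is illustrative only: the helper name `allowed_downtime_minutes` is hypothetical and not part of `SLOTracker`, and it assumes an availability-style SLO measured over a rolling window.

```python
def allowed_downtime_minutes(target_pct: float, window_days: int) -> float:
    """Convert an availability target into minutes of permitted downtime.

    Hypothetical helper for illustration; assumes the SLI is the percentage
    of "good" time over a rolling window of `window_days`.
    """
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    # The error budget is the share of the window that is allowed to be "bad"
    return window_minutes * (100 - target_pct) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target, window_days=30)
        print(f"{target}% over 30 days: {minutes:.1f} minutes of budget")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of budget, which is often an easier number to reason about than 0.1%.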

Example SLOs

python
def define_standard_slos() -> List[SLO]:
    """Define standard SLOs for a web service"""
    return [
        SLO(
            name="api_availability",
            sli=SLI(
                name="availability",
                description="Percentage of successful requests",
                query="sum(rate(http_requests_total{code!~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
                unit="percentage"
            ),
            target=99.9,
            window_days=30
        ),
        SLO(
            name="api_latency",
            sli=SLI(
                name="latency_p95",
                description="95th percentile latency",
                query="histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
                unit="seconds"
            ),
            target=0.5,  # 500ms
            window_days=30
        )
    ]
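
The latency SLO above has a target where lower is better, whereas `calculate_slo_compliance` checks `actual >= target`. One possible way to reconcile the two is to express latency as the percentage of requests that finish within a threshold, so that it can be tracked like availability. The sketch below assumes this approach; `latency_good_ratio` is a hypothetical helper, and the 0.5 s default simply reuses the threshold from the example.

```python
from typing import Sequence


def latency_good_ratio(durations_s: Sequence[float], threshold_s: float = 0.5) -> float:
    """Percentage of requests that completed within `threshold_s` seconds.

    Hypothetical helper: turns raw latencies into a "higher is better"
    percentage that can be compared against a target such as 95.0.
    """
    if not durations_s:
        raise ValueError("no latency samples provided")
    good = sum(1 for d in durations_s if d <= threshold_s)
    return 100.0 * good / len(durations_s)


if __name__ == "__main__":
    samples = [0.12, 0.30, 0.45, 0.51, 0.20, 0.90, 0.33, 0.48, 0.10, 0.25]
    ratio = latency_good_ratio(samples)  # 8 of the 10 samples are <= 0.5 s
    print(f"{ratio:.1f}% of requests met the 500 ms threshold")
```

A value produced this way could be recorded with `record_measurement` against a percentage target, which keeps the `>=` comparison meaningful.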

Incident Management

python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional

class Severity(Enum):
    SEV1 = "sev1"  # Critical
    SEV2 = "sev2"  # High
    SEV3 = "sev3"  # Medium
    SEV4 = "sev4"  # Low

class IncidentStatus(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"

@dataclass
class Incident:
    incident_id: str
    title: str
    severity: Severity
    status: IncidentStatus
    started_at: datetime
    detected_at: datetime
    resolved_at: Optional[datetime]
    incident_commander: str
    responders: List[str]
    affected_services: List[str]
    timeline: List[Dict]
    root_cause: Optional[str] = None

class IncidentManager:
    """Manage incidents following SRE best practices"""

    def __init__(self):
        self.incidents: Dict[str, Incident] = {}

    def create_incident(self, incident: Incident) -> str:
        """Create new incident"""
        self.incidents[incident.incident_id] = incident

        # Notify on-call
        self.notify_oncall(incident)

        # Start incident timeline
        self.add_timeline_event(
            incident.incident_id,
            "Incident created",
            datetime.now()
        )

        return incident.incident_id

    def update_status(self, incident_id: str, new_status: IncidentStatus,
                     note: str):
        """Update incident status"""
        if incident_id in self.incidents:
            incident = self.incidents[incident_id]
            incident.status = new_status

            self.add_timeline_event(
                incident_id,
                f"Status changed to {new_status.value}: {note}",
                datetime.now()
            )

            if new_status == IncidentStatus.RESOLVED:
                incident.resolved_at = datetime.now()

    def add_timeline_event(self, incident_id: str, event: str,
                          timestamp: datetime):
        """Add event to incident timeline"""
        if incident_id in self.incidents:
            self.incidents[incident_id].timeline.append({
                'timestamp': timestamp,
                'event': event
            })

    def calculate_mttr(self, incident_id: str) -> Optional[float]:
        """Time to resolution for one incident, in minutes, measured from
        detection; average this across incidents to obtain MTTR"""
        incident = self.incidents.get(incident_id)

        if incident and incident.resolved_at:
            duration = incident.resolved_at - incident.detected_at
            return duration.total_seconds() / 60  # minutes

        return None

    def generate_incident_report(self, incident_id: str) -> Dict:
        """Generate incident report"""
        incident = self.incidents.get(incident_id)

        if not incident:
            return {}

        return {
            'incident_id': incident.incident_id,
            'title': incident.title,
            'severity': incident.severity.value,
            'status': incident.status.value,
            'duration_minutes': self.calculate_mttr(incident_id),
            'affected_services': incident.affected_services,
            'incident_commander': incident.incident_commander,
            'responders': incident.responders,
            'timeline': incident.timeline,
            'root_cause': incident.root_cause
        }

    def notify_oncall(self, incident: Incident):
        """Notify on-call engineer (integrate with PagerDuty, etc.)"""
        # Implementation would integrate with alerting system
        pass
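
`calculate_mttr` returns the duration of a single incident; the "mean" only appears once durations are averaged over many incidents. The following is a minimal sketch of that aggregation, together with mean time to detect. To stay self-contained it uses a small stand-in record, `IncidentTimes`, instead of the `Incident` class; that name and the `summarize` function are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, NamedTuple, Optional


class IncidentTimes(NamedTuple):
    """Minimal stand-in for an incident's timestamp fields (hypothetical)."""
    started_at: datetime
    detected_at: datetime
    resolved_at: Optional[datetime]


def mean_minutes(deltas: List[timedelta]) -> Optional[float]:
    """Average a list of durations in minutes, or None if the list is empty."""
    if not deltas:
        return None
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents: List[IncidentTimes]) -> dict:
    """Compute mean time to detect and mean time to resolve.

    Unresolved incidents are skipped for MTTR, since their duration is
    undefined until `resolved_at` is set.
    """
    detect = [i.detected_at - i.started_at for i in incidents]
    resolve = [i.resolved_at - i.detected_at for i in incidents if i.resolved_at]
    return {
        "mttd_minutes": mean_minutes(detect),
        "mttr_minutes": mean_minutes(resolve),
        "resolved_count": len(resolve),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    history = [
        IncidentTimes(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
        IncidentTimes(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=75)),
        IncidentTimes(t0, t0 + timedelta(minutes=10), None),  # still open
    ]
    print(summarize(history))
```

Measuring resolution from `detected_at` matches the method above; measuring from `started_at` instead would fold detection delay into the same number, so the choice is worth stating in any report.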

Monitoring and Alerting

python
import time
from typing import Dict, List

import numpy as np
from prometheus_client import Counter, Histogram, Gauge

# Metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')
active_connections = Gauge('active_connections', 'Number of active connections')

class MonitoringSystem:
    """Implement monitoring best practices"""

    def __init__(self):
        self.alerts = []

    def record_request(self, method: str, endpoint: str, status: int, duration: float):
        """Record HTTP request metrics"""
        request_count.labels(method=method, endpoint=endpoint, status=status).inc()
        request_duration.observe(duration)

    def define_alert(self, name: str, expression: str, threshold: float,
                     duration: str, severity: str) -> Dict:
        """Define alerting rule"""
        alert = {
            'name': name,
            'expression': expression,
            'threshold': threshold,
            'duration': duration,
            'severity': severity,
            'annotations': {
                'summary': f'{name} alert triggered',
                'runbook_url': f'https://runbooks.example.com/{name}'
            }
        }

        self.alerts.append(alert)
        return alert

    def check_golden_signals(self, metrics: Dict) -> Dict:
        """Check the four golden signals"""
        return {
            'latency': self._check_latency(metrics.get('latency', [])),
            'traffic': self._check_traffic(metrics.get('traffic', 0)),
            'errors': self._check_errors(metrics.get('error_rate', 0)),
            'saturation': self._check_saturation(metrics.get('cpu_usage', 0))
        }

    def _check_latency(self, latencies: List[float]) -> Dict:
        if not latencies:
            return {'status': 'unknown'}

        # Latencies are expected in milliseconds; critical above 1000 ms
        p95 = np.percentile(latencies, 95)
        return {
            'status': 'critical' if p95 > 1000 else 'ok',
            'p95_ms': p95
        }

    def _check_traffic(self, requests_per_second: float) -> Dict:
        # Traffic is reported for context only; no threshold is applied
        return {
            'status': 'ok',
            'rps': requests_per_second
        }

    def _check_errors(self, error_rate: float) -> Dict:
        # error_rate is a percentage; critical above 1%
        return {
            'status': 'critical' if error_rate > 1.0 else 'ok',
            'error_rate': error_rate
        }

    def _check_saturation(self, cpu_usage: float) -> Dict:
        # cpu_usage is a percentage; warn above 80%
        return {
            'status': 'warning' if cpu_usage > 80 else 'ok',
            'cpu_usage': cpu_usage
        }
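
The dictionaries built by `define_alert` resemble Prometheus alerting rules. As a rough sketch, they could be mapped onto the rule-file fields (`alert`, `expr`, `for`, `labels`, `annotations`) as shown below. The helper names `to_prometheus_rule` and `to_rule_group`, the group name, and the example expression are assumptions, as is the choice to fire when the expression exceeds the threshold, since the original does not say how the threshold is applied.

```python
import json
from typing import Dict, List


def to_prometheus_rule(alert: Dict) -> Dict:
    """Map an alert dict (name, expression, threshold, duration, severity,
    annotations) onto the fields of a Prometheus alerting rule.

    Hypothetical helper; assumes the alert should fire when the
    expression is greater than the threshold.
    """
    return {
        "alert": alert["name"],
        "expr": f"({alert['expression']}) > {alert['threshold']}",
        "for": alert["duration"],
        "labels": {"severity": alert["severity"]},
        "annotations": dict(alert.get("annotations", {})),
    }


def to_rule_group(group_name: str, alerts: List[Dict]) -> Dict:
    """Wrap rules in the `groups` structure used by Prometheus rule files."""
    rules = [to_prometheus_rule(a) for a in alerts]
    return {"groups": [{"name": group_name, "rules": rules}]}


if __name__ == "__main__":
    high_error_rate = {
        "name": "HighErrorRate",
        "expression": "sum(rate(http_requests_total{status=~'5..'}[5m])) "
                      "/ sum(rate(http_requests_total[5m])) * 100",
        "threshold": 1.0,
        "duration": "5m",
        "severity": "critical",
        "annotations": {
            "summary": "HighErrorRate alert triggered",
            "runbook_url": "https://runbooks.example.com/HighErrorRate",
        },
    }
    # Rule files are normally YAML; JSON is printed here to avoid a dependency
    print(json.dumps(to_rule_group("api-alerts", [high_error_rate]), indent=2))
```

Carrying the `runbook_url` annotation through to the rule keeps every alert tied to a runbook.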

Chaos Engineering

python
import random
import time
from datetime import datetime
from typing import Callable, Dict, List

class ChaosExperiment:
    """Run chaos engineering experiments"""

    def __init__(self, name: str, hypothesis: str):
        self.name = name
        self.hypothesis = hypothesis
        self.results = []

    def inject_latency(self, service_call: Callable, delay_ms: int):
        """Inject latency into service call"""
        time.sleep(delay_ms / 1000)
        return service_call()

    def inject_failure(self, service_call: Callable, failure_rate: float):
        """Randomly fail service calls"""
        if random.random() < failure_rate:
            raise Exception("Chaos: Simulated failure")
        return service_call()

    def kill_random_instance(self, instances: List[str]) -> str:
        """Kill random instance"""
        victim = random.choice(instances)
        # Implementation would actually kill the instance
        return victim

    def run_experiment(self, experiment_func: Callable) -> Dict:
        """Run chaos experiment"""
        start_time = datetime.now()

        try:
            result = experiment_func()
            status = "success"
            error = None
        except Exception as e:
            result = None
            status = "failed"
            error = str(e)

        end_time = datetime.now()

        experiment_result = {
            'name': self.name,
            'hypothesis': self.hypothesis,
            'status': status,
            'result': result,
            'error': error,
            'duration': (end_time - start_time).total_seconds(),
            'timestamp': start_time
        }

        self.results.append(experiment_result)
        return experiment_result
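
A chaos experiment is most useful when it tests a stated hypothesis. The sketch below illustrates one such check: whether retrying keeps the success rate high while failures are being injected. It is a standalone illustration rather than an extension of `ChaosExperiment`; the names `SimulatedFailure`, `with_failure_injection`, `call_with_retries`, and `success_rate` are hypothetical, and a seeded random generator is assumed so that runs are repeatable.

```python
import random
from typing import Callable


class SimulatedFailure(Exception):
    """Raised by the fault injector (hypothetical name)."""


def with_failure_injection(call: Callable[[], str], failure_rate: float,
                           rng: random.Random) -> Callable[[], str]:
    """Wrap `call` so that it fails with probability `failure_rate`."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise SimulatedFailure("Chaos: simulated failure")
        return call()
    return wrapped


def call_with_retries(call: Callable[[], str], attempts: int) -> bool:
    """Return True if any of `attempts` tries succeeds."""
    for _ in range(attempts):
        try:
            call()
            return True
        except SimulatedFailure:
            continue
    return False


def success_rate(failure_rate: float, attempts: int, trials: int = 1000,
                 seed: int = 42) -> float:
    """Fraction of trials in which the retried call eventually succeeded."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    flaky = with_failure_injection(lambda: "ok", failure_rate, rng)
    wins = sum(call_with_retries(flaky, attempts) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    # Hypothesis: with 3 attempts, a 30% injected failure rate still
    # leaves the large majority of calls succeeding.
    print("no retries:", success_rate(0.3, attempts=1))
    print("3 attempts:", success_rate(0.3, attempts=3))
```

Comparing the two printed rates gives a concrete pass or fail signal for the hypothesis, which can then be recorded alongside the experiment result.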

Best Practices

SRE Principles

  • Embrace risk management
  • Set SLOs based on user experience
  • Use error budgets for decision making
  • Automate toil away
  • Monitor the four golden signals
  • Practice blameless post-mortems
  • Gradual rollouts and canary deployments
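
Using error budgets for decision making usually means agreeing in advance on what happens as the budget runs down. The sketch below shows one possible shape for such a policy. The `ReleaseDecision` enum, the `release_decision` function, and the 25% caution threshold are all assumptions chosen for illustration; the input corresponds to a remaining-budget percentage such as `remaining_percentage` from `calculate_error_budget`.

```python
from enum import Enum


class ReleaseDecision(Enum):
    PROCEED = "proceed"   # normal release cadence
    CAUTION = "caution"   # e.g. smaller canary steps, extra review
    FREEZE = "freeze"     # budget spent: prioritize reliability work


def release_decision(remaining_budget_pct: float,
                     caution_below: float = 25.0) -> ReleaseDecision:
    """Map remaining error budget (as a percentage) to a release policy.

    Hypothetical policy: freeze once the budget is exhausted, slow down
    when less than `caution_below` percent remains, otherwise proceed.
    """
    if remaining_budget_pct <= 0:
        return ReleaseDecision.FREEZE
    if remaining_budget_pct < caution_below:
        return ReleaseDecision.CAUTION
    return ReleaseDecision.PROCEED


if __name__ == "__main__":
    for remaining in (80.0, 10.0, -5.0):
        print(f"{remaining:>6.1f}% remaining: {release_decision(remaining).value}")
```

The exact thresholds matter less than having them written down before an incident, so that the decision is not renegotiated under pressure.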

Incident Management

  • Clear incident severity definitions
  • Defined incident commander role
  • Communicate proactively
  • Document timeline during incident
  • Conduct post-incident reviews
  • Track action items to completion
  • Share learnings across teams

On-Call

  • Reasonable on-call rotations
  • Comprehensive runbooks
  • Alert on symptoms, not causes
  • Actionable alerts only
  • Escalation policies
  • Support on-call engineers
  • Measure and reduce alert fatigue
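
Measuring alert fatigue requires some record of what paged and whether it needed action. The following sketch assumes a simple page log and derives a few indicators from it. The `Page` record, its fields, and `alert_fatigue_summary` are hypothetical; a real paging tool would supply its own export format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """One page sent to an on-call engineer (hypothetical record shape)."""
    alert_name: str
    actionable: bool    # did the responder need to do anything?
    out_of_hours: bool  # did it arrive outside working hours?


def alert_fatigue_summary(pages: List[Page], shifts: int) -> dict:
    """Summarize paging load and noise over a number of on-call shifts."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    total = len(pages)
    actionable = sum(1 for p in pages if p.actionable)
    # Alerts that page without needing action are candidates for tuning
    noisy = Counter(p.alert_name for p in pages if not p.actionable)
    return {
        "pages_per_shift": total / shifts,
        "actionable_pct": 100.0 * actionable / total if total else None,
        "out_of_hours_pages": sum(1 for p in pages if p.out_of_hours),
        "noisiest_alerts": noisy.most_common(3),
    }


if __name__ == "__main__":
    log = [
        Page("HighErrorRate", actionable=True, out_of_hours=True),
        Page("DiskSpaceLow", actionable=False, out_of_hours=True),
        Page("DiskSpaceLow", actionable=False, out_of_hours=False),
        Page("HighLatency", actionable=True, out_of_hours=False),
    ]
    print(alert_fatigue_summary(log, shifts=2))
```

A low actionable percentage or a short list of repeat offenders points at alerts to retune or remove, in line with the "actionable alerts only" item above.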

Anti-Patterns

  ❌ No SLOs defined
  ❌ Alerts without runbooks
  ❌ Blame culture for incidents
  ❌ No post-incident reviews
  ❌ 100% uptime expectations
  ❌ Toil not tracked or reduced
  ❌ Manual processes for common tasks

Resources
