site-reliability-engineer

Site Reliability Engineer (SRE)

When to Use

  • Define SLIs, SLOs, and error budgets per service or user journey
  • Configure burn-rate alerts and reliability dashboards
  • Run production readiness reviews before launch or major change
  • Analyze incidents for reliability gaps and SLO impact
  • Plan capacity for traffic growth and failure scenarios (N+1, regional loss)
  • Measure and reduce toil; prioritize automation with highest reliability ROI
  • Map dependencies and failure modes; design graceful degradation
  • Gate releases on SLO/error-budget policy (canary, rollback triggers)
  • Conduct chaos or game days when org maturity supports it
  • Partner with engineering on reliability backlog (timeouts, retries, circuit breakers)

When NOT to Use

  • Build or fix Jenkins/GitHub Actions/GitLab pipelines → devops
  • Design SEV levels, on-call rotations, postmortem program → incident-management-engineer
  • IAM grants, VM patching, snapshot restores → cloud-system-administrator
  • Stand up VPC, RDS, or new managed services → cloud-engineer
  • JMeter/k6 load tests and app profiling → performance-engineer
  • Blue-green cutover playbooks and change tiers → deployment-strategist
  • K8s cluster upgrades and Helm platform → cluster-deployment-engineer
  • Customer status page copy and comms approval → communication-lead
  • Org-wide reliability posture, tiering, investment themes → vp-of-infrastructure

Related skills

Need → Skill
  • CI/CD, GitOps, pipeline observability → devops
  • Incident program and paging policy → incident-management-engineer
  • Cloud day-2 operations → cloud-system-administrator
  • Cloud service implementation → cloud-engineer
  • Performance testing and tuning → performance-engineer
  • Release cutover strategy → deployment-strategist
  • Kubernetes platform ops → cluster-deployment-engineer
  • Data pipeline SLAs → data-system-ops-lead
  • Security incidents → defensive-security-analyst, cybersecurity
  • BCP/DRP, RTO/RPO for security/IdP, ransomware recovery planning → bcm-disaster-recovery-specialist
  • Architecture review → senior-system-architecture
  • VP infrastructure leadership → vp-of-infrastructure

Core Workflows

1. Scope and SRE principles

Service ownership, error budget policy, boundaries with DevOps and incident management (IM). See references/sre_scope_and_principles.md.

2. SLI, SLO, and error budgets

Select SLIs, set targets, alert on burn. See references/sli_slo_error_budgets.md.
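
To make "alert on burn" concrete, here is a minimal Python sketch of the arithmetic behind an error budget and a burn rate. The `SLO` record, the function names, and the example threshold of 14.4 are illustrative assumptions rather than part of this skill's reference files; 14.4 is a commonly cited fast-burn threshold for a 30-day window (2% of the budget spent in one hour), and your policy may choose differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO record: a target success ratio over a rolling window."""
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # e.g. 30


def error_budget(slo: SLO) -> float:
    """Fraction of requests that may fail within the window."""
    return 1.0 - slo.target


def burn_rate(bad: int, total: int, slo: SLO) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    A value of 1.0 means the budget would be spent exactly at the end
    of the window; higher values spend it proportionally faster.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo)


def should_page(short_rate: float, long_rate: float, threshold: float) -> bool:
    """Multi-window check: page only if both windows exceed the threshold."""
    return short_rate >= threshold and long_rate >= threshold


# Example: 144 failures out of 10,000 requests against a 99.9% target.
slo = SLO(target=0.999, window_days=30)
rate = burn_rate(bad=144, total=10_000, slo=slo)
print(f"burn rate: {rate:.1f}")
```

Requiring both a short and a long window to exceed the threshold is one way to keep brief spikes from paging while still catching sustained burn quickly.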

3. Observability for reliability

Metrics, logs, traces, alert hygiene. See references/observability_reliability.md.

4. Incident response (reliability lens)

Mitigation, SLO impact, follow-up actions. See references/incident_reliability_response.md.
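
One way to quantify "SLO impact" for an incident is to express the failed events as a share of the window's error budget. The sketch below assumes a request-based SLI; the function name and the example figures are hypothetical.

```python
def budget_consumed(bad_events: int, window_total_events: int, slo_target: float) -> float:
    """Fraction of the window's error budget consumed by one incident.

    bad_events: events that failed the SLI during the incident.
    window_total_events: expected total events in the whole SLO window.
    slo_target: target success ratio, e.g. 0.999.
    """
    allowed_bad = (1.0 - slo_target) * window_total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target leaves no error budget for this window")
    return bad_events / allowed_bad


# Example: 12,000 failed requests against a 99.9% target, with roughly
# 50 million requests expected over the window (allowed failures: 50,000).
share = budget_consumed(12_000, 50_000_000, 0.999)
print(f"incident consumed {share:.0%} of the error budget")
```

A figure like this can go straight into the incident reliability summary alongside contributing factors and action items.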

5. Capacity, toil, and automation

Scaling, toil metrics, reliability automation. See references/capacity_toil_automation.md.
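
The N+1 and regional-loss scenarios mentioned under "When to Use" reduce to simple headroom arithmetic. The following sketch is illustrative only: the function names, the 70% target utilization, and the example traffic figures are assumptions, and real plans should use measured per-instance capacity.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     target_utilization: float = 0.7, spare: int = 1) -> int:
    """Instances required to serve peak load at the target utilization,
    plus `spare` extra instances (spare=1 gives N+1)."""
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps) + spare


def per_region_capacity(peak_rps: float, regions: int) -> float:
    """Capacity each region must hold so the remaining regions can absorb
    the full peak after losing any single region."""
    if regions < 2:
        raise ValueError("surviving a regional loss needs at least 2 regions")
    return peak_rps / (regions - 1)


# Example: 4,000 rps peak, 500 rps per instance, 70% target utilization.
print(instances_needed(4_000, 500))    # 12 for load + 1 spare
# Example: 9,000 rps peak spread across 3 regions.
print(per_region_capacity(9_000, 3))   # each region sized for 4,500 rps
```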

6. Release reliability and resilience testing

Production readiness review (PRR), canaries, chaos, failure modes. See references/release_reliability_chaos.md.
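
Gating a release on SLO/error-budget policy can be sketched as a small decision function that compares the canary with the baseline and checks the remaining budget. The names, the `max_error_delta` tolerance, and the `min_budget` floor below are hypothetical defaults, not values prescribed by this skill.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    HOLD = "hold"
    ROLLBACK = "rollback"


def gate_release(budget_remaining: float,
                 canary_error_ratio: float,
                 baseline_error_ratio: float,
                 max_error_delta: float = 0.005,
                 min_budget: float = 0.1) -> Decision:
    """Decide what to do with a canary.

    budget_remaining: fraction of the error budget still unspent (0.0 to 1.0).
    max_error_delta: how much worse than baseline the canary may be.
    min_budget: below this remaining budget, new rollouts are paused.
    """
    # Rollback trigger: the canary is measurably worse than the baseline.
    if canary_error_ratio - baseline_error_ratio > max_error_delta:
        return Decision.ROLLBACK
    # Budget policy: a healthy canary still waits if the budget is nearly spent.
    if budget_remaining < min_budget:
        return Decision.HOLD
    return Decision.PROMOTE


print(gate_release(0.5, canary_error_ratio=0.002, baseline_error_ratio=0.001))
```

Checking the rollback trigger before the budget floor means a bad canary is always reverted, even when the budget policy would otherwise only pause the rollout.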

Outputs

  • SLO document — SLI definition, target, window, exclusions, owners
  • Error budget report — burn %, policy actions (freeze, focus week)
  • PRR checklist — pass/fail with required fixes before launch
  • Reliability backlog — ranked items with estimated SLO impact
  • Incident reliability summary — budget consumed, contributing factors, action items
  • Capacity plan — headroom, scaling triggers, regional failover notes
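
As an illustration of the SLO document output, the fields listed above could be captured in a small record with basic validation. This is a sketch under assumptions: the class name, the extra `service` field, and the validation rules are hypothetical, not a required schema.

```python
from dataclasses import dataclass, field


@dataclass
class SLODocument:
    """Hypothetical container for the SLO document fields listed above."""
    service: str                  # assumed identifier, not in the original list
    sli_definition: str           # e.g. "HTTP 2xx/3xx responses / all responses"
    target: float                 # success ratio, e.g. 0.999
    window_days: int              # rolling window length
    exclusions: list[str] = field(default_factory=list)
    owners: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the document is usable."""
        problems = []
        if not 0.0 < self.target < 1.0:
            problems.append("target must be a ratio strictly between 0 and 1")
        if self.window_days <= 0:
            problems.append("window_days must be positive")
        if not self.owners:
            problems.append("at least one owner is required")
        return problems


doc = SLODocument(
    service="checkout",
    sli_definition="successful checkout requests / all checkout requests",
    target=0.999,
    window_days=30,
    exclusions=["planned maintenance"],
    owners=["team-payments"],
)
print(doc.validate())
```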

Principles

  • User-centric SLIs — measure what customers experience
  • Error budgets drive decisions — balance velocity and reliability
  • Automate toil — repetitive manual work is a reliability risk
  • Blameless learning — fix systems, not people
  • Progressive delivery — small releases with measurable rollback criteria