site-reliability-engineer

Site Reliability Engineer (SRE)

When to Use

Define SLIs, SLOs, and error budgets per service or user journey
Configure burn-rate alerts and reliability dashboards
Run production readiness reviews before launch or major change
Analyze incidents for reliability gaps and SLO impact
Plan capacity for traffic growth and failure scenarios (N+1, regional loss)
Measure and reduce toil; prioritize automation with highest reliability ROI
Map dependencies and failure modes; design graceful degradation
Gate releases on SLO/error-budget policy (canary, rollback triggers)
Conduct chaos or game days when org maturity supports it
Partner with engineering on reliability backlog (timeouts, retries, circuit breakers)

When NOT to Use

Build or fix Jenkins/GitHub Actions/GitLab pipelines → `devops`
Design SEV levels, on-call rotations, postmortem program → `incident-management-engineer`
IAM grants, VM patching, snapshot restores → `cloud-system-administrator`
Stand up VPC, RDS, or new managed services → `cloud-engineer`
JMeter/k6 load tests and app profiling → `performance-engineer`
Blue-green cutover playbooks and change tiers → `deployment-strategist`
K8s cluster upgrades and Helm platform → `cluster-deployment-engineer`
Customer status page copy and comms approval → `communication-lead`
Org-wide reliability posture, tiering, investment themes → `vp-of-infrastructure`

Related skills

Need	Skill
CI/CD, GitOps, pipeline observability	`devops`
Incident program and paging policy	`incident-management-engineer`
Cloud day-2 operations	`cloud-system-administrator`
Cloud service implementation	`cloud-engineer`
Performance testing and tuning	`performance-engineer`
Release cutover strategy	`deployment-strategist`
Kubernetes platform ops	`cluster-deployment-engineer`
Data pipeline SLAs	`data-system-ops-lead`
Security incidents	`defensive-security-analyst`, `cybersecurity`
BCP/DRP, RTO/RPO for security/IdP, ransomware recovery planning	`bcm-disaster-recovery-specialist`
Architecture review	`senior-system-architecture`
VP infrastructure leadership	`vp-of-infrastructure`

Core Workflows

1. Scope and SRE principles

Service ownership, error budget policy, boundaries with DevOps and incident management (IM).

See `references/sre_scope_and_principles.md`.

2. SLI, SLO, and error budgets

Select SLIs, set targets, alert on burn.

See `references/sli_slo_error_budgets.md`.
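
As a rough illustration of the arithmetic behind alerting on burn, the sketch below computes an error budget and a burn rate for an availability-style SLI. The `Slo` class, the function names, and the default threshold of 14.4 are assumptions made for this example and are not taken from the referenced file; 14.4 is simply the rate that would spend 2% of a 30-day budget in one hour (0.02 × 720 hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability-style SLO: fraction of good events per window."""
    target: float      # e.g. 0.999 means 99.9% of events must be good
    window_days: int   # rolling compliance window, in days

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget(slo: Slo) -> float:
    """Fraction of events that may be bad over the window."""
    return 1.0 - slo.target


def burn_rate(bad: int, total: int, slo: Slo) -> float:
    """Observed error rate divided by the budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / error_budget(slo)


def should_page(short_rate: float, long_rate: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    Requiring the short window as well lets the alert clear soon after the
    burn stops. The default threshold is an illustrative assumption.
    """
    return short_rate >= threshold and long_rate >= threshold
```

With a 99.9% target, 5 bad requests out of 1,000 gives a burn rate of about 5, meaning the budget would be exhausted in roughly a fifth of the window if that rate held.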

3. Observability for reliability

Metrics, logs, traces, alert hygiene.

See `references/observability_reliability.md`.

4. Incident response (reliability lens)

Mitigation, SLO impact, follow-up actions.

See `references/incident_reliability_response.md`.
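
One way to quantify SLO impact for an incident is to express the bad events it caused as a share of the window's error budget. The sketch below assumes a request-based SLI; the function name and parameters are hypothetical and chosen only for illustration.

```python
def budget_consumed(bad_events: int, window_events: int, slo_target: float) -> float:
    """Fraction of the window's error budget spent by a single incident.

    bad_events:    failed events attributed to the incident
    window_events: expected total events over the SLO window (an estimate)
    slo_target:    e.g. 0.999 for a 99.9% objective
    """
    allowed_bad = window_events * (1.0 - slo_target)
    if allowed_bad <= 0:
        raise ValueError("SLO target leaves no error budget for this window")
    return bad_events / allowed_bad
```

For example, 2,500 failed requests against an expected 10 million requests at a 99.9% target is about a quarter of the budget, since the window allows roughly 10,000 failures.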

5. Capacity, toil, and automation

Scaling, toil metrics, reliability automation.

See `references/capacity_toil_automation.md`.
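
The headroom arithmetic behind N+1 and regional-loss planning can be sketched as follows. This is a simplified model that assumes load spreads evenly across instances and regions; the function names, the 0.7 default utilization, and the parameters are illustrative assumptions, not values from the referenced file.

```python
import math


def instances_needed(
    peak_rps: float,
    per_instance_rps: float,
    target_utilization: float = 0.7,
    spare: int = 1,
) -> int:
    """Instances to serve peak load at the target utilization, plus spares.

    spare=1 corresponds to N+1: one instance can fail at peak without
    pushing the remaining instances past the target utilization.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    base = math.ceil(peak_rps / (per_instance_rps * target_utilization))
    return base + spare


def per_region_rps(peak_rps: float, regions: int) -> float:
    """Load each surviving region must absorb if one region is lost."""
    if regions < 2:
        raise ValueError("need at least two regions to survive a regional loss")
    return peak_rps / (regions - 1)
```

Sizing each region with `instances_needed(per_region_rps(...), ...)` would combine both failure scenarios, at the cost of more idle capacity in normal operation.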

6. Release reliability and resilience testing

Production readiness reviews (PRR), canaries, chaos experiments, failure modes.

See `references/release_reliability_chaos.md`.
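
A canary gate with a measurable rollback trigger might look like the sketch below. It compares the canary's error rate with the baseline and holds the release when the error budget is exhausted. The verdict strings and every threshold (`max_ratio`, `absolute_floor`, `min_requests`) are hypothetical defaults for illustration; a real policy would take them from the service's own SLO document.

```python
def canary_verdict(
    canary_bad: int,
    canary_total: int,
    baseline_bad: int,
    baseline_total: int,
    budget_remaining: float,
    max_ratio: float = 2.0,
    absolute_floor: float = 0.001,
    min_requests: int = 500,
) -> str:
    """Return 'hold', 'wait', 'rollback', or 'promote' for a canary release.

    budget_remaining: fraction of the error budget left (<= 0 means exhausted)
    max_ratio:        tolerated canary/baseline error-rate ratio
    absolute_floor:   error rate below which the canary is never rolled back,
                      so a near-zero baseline does not trigger on noise
    min_requests:     minimum canary traffic before judging
    """
    if budget_remaining <= 0:
        return "hold"  # budget policy freezes releases
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    canary_rate = canary_bad / canary_total
    baseline_rate = baseline_bad / baseline_total if baseline_total else 0.0
    limit = max(baseline_rate * max_ratio, absolute_floor)
    return "rollback" if canary_rate > limit else "promote"
```

Checking the budget first means a release is held even when the canary itself looks healthy, which keeps the gate consistent with an error-budget freeze.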

Outputs

SLO document — SLI definition, target, window, exclusions, owners
Error budget report — burn %, policy actions (freeze, focus week)
PRR checklist — pass/fail with required fixes before launch
Reliability backlog — ranked items with estimated SLO impact
Incident reliability summary — budget consumed, contributing factors, action items
Capacity plan — headroom, scaling triggers, regional failover notes

Principles

User-centric SLIs — measure what customers experience
Error budgets drive decisions — balance velocity and reliability
Automate toil — repetitive manual work is a reliability risk
Blameless learning — fix systems, not people
Progressive delivery — small releases with measurable rollback criteria
