site-reliability-engineer
Site Reliability Engineer (SRE)
When to Use
- Define SLIs, SLOs, and error budgets per service or user journey
- Configure burn-rate alerts and reliability dashboards
- Run production readiness reviews before launch or major change
- Analyze incidents for reliability gaps and SLO impact
- Plan capacity for traffic growth and failure scenarios (N+1, regional loss)
- Measure and reduce toil; prioritize automation with highest reliability ROI
- Map dependencies and failure modes; design graceful degradation
- Gate releases on SLO/error-budget policy (canary, rollback triggers)
- Conduct chaos or game days when org maturity supports it
- Partner with engineering on reliability backlog (timeouts, retries, circuit breakers)
When NOT to Use
- Build or fix Jenkins/GitHub Actions/GitLab pipelines → devops
- Design SEV levels, on-call rotations, postmortem program → incident-management-engineer
- IAM grants, VM patching, snapshot restores → cloud-system-administrator
- Stand up VPC, RDS, or new managed services → cloud-engineer
- JMeter/k6 load tests and app profiling → performance-engineer
- Blue-green cutover playbooks and change tiers → deployment-strategist
- K8s cluster upgrades and Helm platform → cluster-deployment-engineer
- Customer status page copy and comms approval → communication-lead
- Org-wide reliability posture, tiering, investment themes → vp-of-infrastructure
Related skills
| Need | Skill |
|---|---|
| CI/CD, GitOps, pipeline observability | |
| Incident program and paging policy | |
| Cloud day-2 operations | |
| Cloud service implementation | |
| Performance testing and tuning | |
| Release cutover strategy | |
| Kubernetes platform ops | |
| Data pipeline SLAs | |
| Security incidents | |
| BCP/DRP, RTO/RPO for security/IdP, ransomware recovery planning | |
| Architecture review | |
| VP infrastructure leadership | |
Core Workflows
1. Scope and SRE principles
Service ownership, error budget policy, boundaries with DevOps and IM.
See references/sre_scope_and_principles.md.
2. SLI, SLO, and error budgets
Select SLIs, set targets, alert on burn.
See references/sli_slo_error_budgets.md.
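As a rough illustration of "alert on burn", the sketch below derives an error budget from an SLO target and computes a burn rate from request counts. The function names are hypothetical. The 14.4 paging threshold is an assumed default, commonly cited for a 1-hour window paired with a 5-minute window against a 30-day SLO, and is not taken from this skill. The referenced file may define different windows and thresholds.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the budget.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; higher values exhaust it before the window ends.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / error_budget(slo_target)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    The short window confirms the problem is still happening, which cuts
    down on pages for spikes that have already recovered.
    The 14.4 default is an assumption, not a value from this document.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    long_rate = burn_rate(bad=1_800, total=100_000, slo_target=0.999)
    short_rate = burn_rate(bad=200, total=10_000, slo_target=0.999)
    print(f"long={long_rate:.1f} short={short_rate:.1f} "
          f"page={should_page(long_rate, short_rate)}")
```

Requiring both windows to agree is a design choice that trades a little detection latency for fewer false pages.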
3. Observability for reliability
Metrics, logs, traces, alert hygiene.
See references/observability_reliability.md.
4. Incident response (reliability lens)
Mitigation, SLO impact, follow-up actions.
See references/incident_reliability_response.md.
5. Capacity, toil, and automation
Scaling, toil metrics, reliability automation.
See references/capacity_toil_automation.md.
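The N+1 and regional-loss scenarios mentioned earlier can be checked with simple arithmetic. The sketch below is one possible way to do it. It assumes capacity is measured in requests per second and that per-instance throughput is known from load testing. All names and figures are hypothetical.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     spare: int = 1) -> int:
    """Instances required to serve peak load plus `spare` extra (N+1 by default)."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return math.ceil(peak_rps / per_instance_rps) + spare


def headroom(peak_rps: float, per_instance_rps: float, instances: int) -> float:
    """Fraction of total capacity left unused at peak (0.25 means 25% free)."""
    capacity = per_instance_rps * instances
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 1.0 - peak_rps / capacity


def survives_region_loss(region_capacity_rps: dict[str, float],
                         peak_rps: float) -> bool:
    """True if the remaining regions can carry peak load after losing any one region."""
    total = sum(region_capacity_rps.values())
    return all(total - cap >= peak_rps for cap in region_capacity_rps.values())


if __name__ == "__main__":
    n = instances_needed(peak_rps=4_500, per_instance_rps=500)
    print("instances:", n)
    print("headroom:", round(headroom(4_500, 500, n), 2))
    print("survives region loss:",
          survives_region_loss({"a": 3_000, "b": 3_000, "c": 3_000}, 5_000))
```

This treats regions as interchangeable. A real plan would also account for traffic that cannot move between regions, such as data-residency constraints.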
6. Release reliability and resilience testing
PRR, canaries, chaos, failure modes.
See references/release_reliability_chaos.md.
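A canary gate with a measurable rollback trigger might look like the following minimal sketch. It compares the canary's error rate against the baseline and returns one of three decisions. The `Sample` type, the thresholds, and the decision labels are illustrative assumptions, not values defined by this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Sample, canary: Sample,
                    max_abs_increase: float = 0.005,
                    min_requests: int = 500) -> str:
    """Return "wait", "rollback", or "promote".

    - "wait": the canary has too little traffic to judge.
    - "rollback": the canary error rate exceeds the baseline by more than
      `max_abs_increase` (an absolute difference, assumed here).
    - "promote": otherwise.
    """
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate - baseline.error_rate > max_abs_increase:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = Sample(requests=100_000, errors=100)
    print(canary_decision(baseline, Sample(requests=1_000, errors=20)))
    print(canary_decision(baseline, Sample(requests=1_000, errors=2)))
```

An absolute difference is the simplest criterion to reason about. A statistical test or a latency comparison could be added where traffic volumes are small or errors are rare.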
Outputs
- SLO document — SLI definition, target, window, exclusions, owners
- Error budget report — burn %, policy actions (freeze, focus week)
- PRR checklist — pass/fail with required fixes before launch
- Reliability backlog — ranked items with estimated SLO impact
- Incident reliability summary — budget consumed, contributing factors, action items
- Capacity plan — headroom, scaling triggers, regional failover notes
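To make the error budget report concrete, here is a small sketch that turns request counts into a burn percentage and a policy action. The cutoffs (focus week at 75% consumed, freeze at 100%) and the field names are hypothetical. An actual error budget policy would set its own.

```python
def budget_report(slo_target: float, total: int, bad: int,
                  focus_at: float = 0.75, freeze_at: float = 1.0) -> dict:
    """Summarize budget consumption for one SLO window.

    `focus_at` and `freeze_at` are fractions of the budget consumed.
    Both defaults are assumptions for illustration only.
    """
    allowed_bad = (1.0 - slo_target) * total
    # Round to absorb floating-point noise, so that spending exactly the
    # budget compares as 1.0 rather than 0.999999...
    consumed = round(bad / allowed_bad, 6) if allowed_bad > 0 else 0.0
    if consumed >= freeze_at:
        action = "freeze"
    elif consumed >= focus_at:
        action = "focus week"
    else:
        action = "none"
    return {
        "slo_target": slo_target,
        "budget_consumed_pct": round(consumed * 100, 1),
        "action": action,
    }


if __name__ == "__main__":
    for bad in (100, 800, 1_200):
        print(budget_report(slo_target=0.999, total=1_000_000, bad=bad))
```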
Principles
- User-centric SLIs — measure what customers experience
- Error budgets drive decisions — balance velocity and reliability
- Automate toil — repetitive manual work is a reliability risk
- Blameless learning — fix systems, not people
- Progressive delivery — small releases with measurable rollback criteria