on-call-handoff-patterns

On-Call Handoff Patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

When to Use This Skill

  • Transitioning on-call responsibilities
  • Writing shift handoff summaries
  • Documenting ongoing investigations
  • Establishing on-call rotation procedures
  • Improving handoff quality
  • Onboarding new on-call engineers

Core Concepts

1. Handoff Components

| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
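
The five components above map naturally onto a small data structure. The following is a minimal sketch, not a prescribed format: the `HandoffDocument` class, its field names, and the plain-text layout produced by `render` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HandoffDocument:
    """Hypothetical container with one list per handoff component."""
    outgoing: str
    incoming: str
    active_incidents: List[str] = field(default_factory=list)
    ongoing_investigations: List[str] = field(default_factory=list)
    recent_changes: List[str] = field(default_factory=list)
    known_issues: List[str] = field(default_factory=list)
    upcoming_events: List[str] = field(default_factory=list)

    def sections(self) -> List[Tuple[str, List[str]]]:
        """Return (title, items) pairs in the order of the component table."""
        return [
            ("Active Incidents", self.active_incidents),
            ("Ongoing Investigations", self.ongoing_investigations),
            ("Recent Changes", self.recent_changes),
            ("Known Issues", self.known_issues),
            ("Upcoming Events", self.upcoming_events),
        ]

    def render(self) -> str:
        """Render a plain-text handoff with every section present."""
        lines = [f"On-Call Handoff: {self.outgoing} -> {self.incoming}", ""]
        for title, items in self.sections():
            lines.append(title)
            # An explicit "None" shows the section was considered, not forgotten.
            lines.extend(f"  - {item}" for item in (items or ["None"]))
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"
```

Rendering empty sections explicitly is a design choice in this sketch: the incoming engineer can tell "nothing to report" apart from "not filled in".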

2. Handoff Timing

Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup

Templates

Template 1: Shift Handoff Document

On-Call Handoff: Platform Team

Outgoing: @alice (2024-01-15 to 2024-01-22)
Incoming: @bob (2024-01-22 to 2024-01-29)
Handoff Time: 2024-01-22 09:00 UTC

🔴 Active Incidents

None currently active

No active incidents at handoff time.

🟡 Ongoing Investigations

1. Intermittent API Timeouts (ENG-1234)

Status: Investigating
Started: 2024-01-20
Impact: ~0.1% of requests timing out
Context:
  • Timeouts correlate with database backup window (02:00-03:00 UTC)
  • Suspect backup process causing lock contention
  • Added extra logging in PR #567 (deployed 01/21)
Next Steps:
  • Review new logs after tonight's backup
  • Consider moving backup window if confirmed
Resources:
  • Dashboard: API Latency
  • Thread: #platform-eng (01/20, 14:32)

2. Memory Growth in Auth Service (ENG-1235)

Status: Monitoring
Started: 2024-01-18
Impact: None yet (proactive)
Context:
  • Memory usage growing ~5% per day
  • No memory leak found in profiling
  • Suspect connection pool not releasing properly
Next Steps:
  • Review heap dump from 01/21
  • Consider restart if usage > 80%
Resources:

🟢 Resolved This Shift

Payment Service Outage (2024-01-19)

  • Duration: 23 minutes
  • Root Cause: Database connection exhaustion
  • Resolution: Rolled back v2.3.4, increased pool size
  • Postmortem: POSTMORTEM-89
  • Follow-up tickets: ENG-1230, ENG-1231

📋 Recent Changes

Deployments

| Service | Version | Time | Notes |
|---|---|---|---|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |

Configuration Changes

  • 01/21: Increased API rate limit from 1000 to 1500 RPS
  • 01/20: Updated database connection pool max from 50 to 75

Infrastructure

  • 01/20: Added 2 nodes to Kubernetes cluster
  • 01/19: Upgraded Redis from 6.2 to 7.0

⚠️ Known Issues & Workarounds

1. Slow Dashboard Loading

Issue: Grafana dashboards slow on Monday mornings
Workaround: Wait 5 min after 08:00 UTC for cache warm-up
Ticket: OPS-456 (P3)

2. Flaky Integration Test

Issue: `test_payment_flow` fails intermittently in CI
Workaround: Re-run failed job (usually passes on retry)
Ticket: ENG-1200 (P2)

📅 Upcoming Events

| Date | Event | Impact | Contact |
|---|---|---|---|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |

📞 Escalation Reminders

| Issue Type | First Escalation | Second Escalation |
|---|---|---|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |

🔧 Quick Reference

Common Commands

```bash
# Check service health (lists pods whose status is not Running)
kubectl get pods -A | grep -v Running

# Recent deployments (the 20 most recent events in the current namespace)
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only: deletes every key in the selected Redis database)
redis-cli FLUSHDB
```

Important Links

Handoff Checklist

Outgoing Engineer

  • Document active incidents
  • Document ongoing investigations
  • List recent changes
  • Note known issues
  • Add upcoming events
  • Sync with incoming engineer

Incoming Engineer

  • Read this document
  • Join sync call
  • Verify PagerDuty is routing to you
  • Verify Slack notifications working
  • Check VPN/access working
  • Review critical dashboards
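
The auth-service entry in this template (ENG-1235) pairs a growth rate (~5% per day) with a restart threshold (80%), which invites a quick estimate of how long the incoming engineer has before acting. The sketch below is illustrative only: the template does not state current memory usage, so any starting value is hypothetical, and whether the 5% is relative (compounding) or in percentage points (linear) is also an assumption, so both readings are supported.

```python
import math


def days_until_threshold(
    current_pct: float,
    threshold_pct: float,
    daily_growth_pct: float,
    compounding: bool = True,
) -> float:
    """Estimate days until usage reaches the threshold.

    compounding=True treats growth as a percentage of current usage;
    compounding=False treats it as percentage points added per day.
    """
    if current_pct <= 0:
        raise ValueError("current_pct must be positive")
    if current_pct >= threshold_pct:
        return 0.0  # already at or above the threshold
    if daily_growth_pct <= 0:
        return math.inf  # usage is flat or shrinking
    if compounding:
        # current * (1 + g)^d = threshold  =>  d = ln(threshold/current) / ln(1 + g)
        growth_factor = 1 + daily_growth_pct / 100
        return math.log(threshold_pct / current_pct) / math.log(growth_factor)
    return (threshold_pct - current_pct) / daily_growth_pct
```

With a hypothetical starting point of 60%, `days_until_threshold(60, 80, 5)` gives roughly 5.9 days under compounding growth, and `days_until_threshold(60, 80, 5, compounding=False)` gives 4.0 days under linear growth. Recording such an estimate in the handoff tells the next shift whether the restart is likely to fall on their watch.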

Template 2: Quick Handoff (Async)

Quick Handoff: @alice → @bob

TL;DR

  • No active incidents
  • 1 investigation ongoing (API timeouts, see ENG-1234)
  • Major release on 01/24 - be ready for issues

Watch List

  1. API latency around 02:00-03:00 UTC (backup window)
  2. Auth service memory (restart if > 80%)

Recent

  • Deployed api-gateway v3.2.1 yesterday (stable)
  • Increased rate limits to 1500 RPS

Coming Up

  • 01/23 02:00 - DB maintenance (5 min read-only)
  • 01/24 14:00 - v5.0 release

Questions?

I'll be available on Slack until 17:00 today.

Template 3: Incident Handoff (Mid-Incident)

INCIDENT HANDOFF: Payment Service Degradation

Incident Start: 2024-01-22 08:15 UTC
Current Status: Mitigating
Severity: SEV2

Current State

  • Error rate: 15% (down from 40%)
  • Mitigation in progress: scaling up pods
  • ETA to resolution: ~30 min

What We Know

  1. Root cause (suspected): Memory pressure on payment-service pods
  2. Triggered by: Unusual traffic spike (3x normal)
  3. Contributing: Inefficient query in checkout flow

What We've Done

  • Scaled payment-service from 5 → 15 pods
  • Enabled rate limiting on checkout endpoint
  • Disabled non-critical features

What Needs to Happen

  1. Monitor error rate - should reach <1% in ~15 min
  2. If not improving, escalate to @payments-manager
  3. Once stable, begin root cause investigation

Key People

  • Incident Commander: @alice (handing off)
  • Comms Lead: @charlie
  • Technical Lead: @bob (incoming)

Communication

  • Status page: Updated at 08:45
  • Customer support: Notified
  • Exec team: Aware

Resources


Incoming on-call (@bob) - Please confirm that you:
  • Have joined #inc-20240122-payment
  • Have access to dashboards
  • Understand the current state
  • Know the escalation path
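
The "What Needs to Happen" steps in this template amount to a small decision rule: keep watching the error rate, declare it stable once it drops below 1%, and escalate if it is not improving or the ~15 minute window passes. A hedged sketch of that rule follows. The `Sample` type, the `next_action` function, and the definition of "not improving" (the latest reading is no lower than the previous one) are assumptions for illustration, not part of the template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    minutes_since_handoff: float
    error_rate_pct: float


def next_action(
    samples: List[Sample],
    target_pct: float = 1.0,
    deadline_min: float = 15.0,
) -> str:
    """Return 'stable', 'keep-monitoring', or 'escalate' for the latest reading."""
    if not samples:
        return "keep-monitoring"  # nothing observed yet
    latest = samples[-1]
    if latest.error_rate_pct < target_pct:
        return "stable"  # safe to begin root cause investigation
    if latest.minutes_since_handoff >= deadline_min:
        return "escalate"  # recovery window passed without reaching the target
    # Assumed meaning of "not improving": no drop since the previous reading.
    if len(samples) >= 2 and latest.error_rate_pct >= samples[-2].error_rate_pct:
        return "escalate"
    return "keep-monitoring"
```

A single noisy reading can trip the "not improving" branch, so a real rule would likely smooth over several samples; the point here is only that writing the thresholds down removes guesswork for the incoming engineer.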

Handoff Sync Meeting

Agenda (15 minutes)

Handoff Sync: @alice → @bob

  1. Active Issues (5 min)
    • Walk through any ongoing incidents
    • Discuss investigation status
    • Transfer context and theories
  2. Recent Changes (3 min)
    • Deployments to watch
    • Config changes
    • Known regressions
  3. Upcoming Events (3 min)
    • Maintenance windows
    • Expected traffic changes
    • Releases planned
  4. Questions (4 min)
    • Clarify anything unclear
    • Confirm access and alerting
    • Exchange contact info

On-Call Best Practices

Before Your Shift

Pre-Shift Checklist

Access Verification

  • VPN working
  • kubectl access to all clusters
  • Database read access
  • Log aggregator access (Splunk/Datadog)
  • PagerDuty app installed and logged in

Alerting Setup

  • PagerDuty schedule shows you as primary
  • Phone notifications enabled
  • Slack notifications for incident channels
  • Test alert received and acknowledged

Knowledge Refresh

  • Review recent incidents (past 2 weeks)
  • Check service changelog
  • Skim critical runbooks
  • Know escalation contacts

Environment Ready

  • Laptop charged and accessible
  • Phone charged
  • Quiet space available for calls
  • Secondary contact identified (if traveling)
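
Parts of the pre-shift checklist can be automated so that failures surface before the shift rather than during an incident. The sketch below is one possible shape, under stated assumptions: `run_checklist` and `example_checks` are hypothetical names, and the two sample checks only confirm that the `kubectl` and `psql` binaries are on the PATH, which is a weak proxy for the cluster and database access the checklist actually asks for.

```python
import shutil
from typing import Callable, Dict, List

Check = Callable[[], bool]


def run_checklist(checks: Dict[str, Check]) -> List[str]:
    """Run every check and return the names of items that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as a failed item
        if not ok:
            failed.append(name)
    return failed


def example_checks() -> Dict[str, Check]:
    """Hypothetical checks; each only approximates its checklist item."""
    return {
        # Confirms the binary is on PATH, not that cluster access works.
        "kubectl installed": lambda: shutil.which("kubectl") is not None,
        # Confirms the client exists, not that database read access works.
        "psql installed": lambda: shutil.which("psql") is not None,
    }
```

Calling `run_checklist(example_checks())` returns the names of any missing tools. Items such as "Phone charged" or "Quiet space available" stay manual; the runner is only meant for checks that can be expressed as a function returning true or false.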

During Your Shift

Daily On-Call Routine

Morning (start of day)

  • Check overnight alerts
  • Review dashboards for anomalies
  • Check for any P0/P1 tickets created
  • Skim incident channels for context

Throughout Day

  • Respond to alerts within SLA
  • Document investigation progress
  • Update team on significant issues
  • Triage incoming pages

End of Day

  • Hand off any active issues
  • Update investigation docs
  • Note anything for next shift

After Your Shift

Post-Shift Checklist

  • Complete handoff document
  • Sync with incoming on-call
  • Verify PagerDuty routing changed
  • Close/update investigation tickets
  • File postmortems for any incidents
  • Take time off if shift was stressful

Escalation Guidelines

When to Escalate

Escalation Triggers

Immediate Escalation

  • SEV1 incident declared
  • Data breach suspected
  • Unable to diagnose within 30 min
  • Customer or legal escalation received

Consider Escalation

  • Issue spans multiple teams
  • Requires expertise you don't have
  • Business impact exceeds threshold
  • You're uncertain about next steps

How to Escalate

  1. Page the appropriate escalation path
  2. Provide brief context in Slack
  3. Stay engaged until escalation acknowledges
  4. Hand off cleanly, don't just disappear
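
The triggers above can be written down as a simple classification, which helps when deciding under pressure. This is a sketch under assumptions: the `IncidentState` fields and the function names are hypothetical, "business impact exceeds threshold" is reduced to a boolean because no threshold is defined here, and the contact handles are the sample values from the Escalation Reminders table in Template 1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class IncidentState:
    # Immediate-escalation triggers
    sev1_declared: bool = False
    data_breach_suspected: bool = False
    minutes_without_diagnosis: int = 0
    customer_or_legal_escalation: bool = False
    # Consider-escalation triggers
    spans_multiple_teams: bool = False
    needs_outside_expertise: bool = False
    business_impact_over_threshold: bool = False
    unsure_of_next_steps: bool = False


# Sample contacts from the Escalation Reminders table; replace with your own.
ESCALATION_PATHS = {
    "payment": ("@payments-oncall", "@payments-manager"),
    "auth": ("@auth-oncall", "@security-team"),
    "database": ("@dba-team", "@infra-manager"),
}
DEFAULT_PATH = ("@engineering-manager", "@vp-engineering")  # "Unknown/severe" row


def escalation_level(state: IncidentState) -> str:
    """Return 'immediate', 'consider', or 'none' based on the listed triggers."""
    if (
        state.sev1_declared
        or state.data_breach_suspected
        or state.minutes_without_diagnosis >= 30
        or state.customer_or_legal_escalation
    ):
        return "immediate"
    if (
        state.spans_multiple_teams
        or state.needs_outside_expertise
        or state.business_impact_over_threshold
        or state.unsure_of_next_steps
    ):
        return "consider"
    return "none"


def escalation_contacts(issue_type: str) -> Tuple[str, str]:
    """Return (first, second) escalation contacts, falling back to the default row."""
    return ESCALATION_PATHS.get(issue_type, DEFAULT_PATH)
```

For example, `escalation_level(IncidentState(minutes_without_diagnosis=30))` returns `"immediate"`, and `escalation_contacts("payment")` returns the payments on-call followed by the payments manager.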

Best Practices

Do's

  • Document everything - Future you will thank you
  • Escalate early - Better safe than sorry
  • Take breaks - Alert fatigue is real
  • Prefer synchronous handoffs - Async loses context
  • Test your setup - Before incidents, not during

Don'ts

  • Don't skip handoffs - Context loss causes incidents
  • Don't play the hero - Escalate when needed
  • Don't ignore alerts - Even if they seem minor
  • Don't work sick - Swap shifts instead
  • Don't disappear - Stay reachable during shift

Resources
