on-call-handoff-patterns

On-Call Handoff Patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

When to Use This Skill

Transitioning on-call responsibilities
Writing shift handoff summaries
Documenting ongoing investigations
Establishing on-call rotation procedures
Improving handoff quality
Onboarding new on-call engineers

Core Concepts

1. Handoff Components

Component	Purpose
Active Incidents	What's currently broken
Ongoing Investigations	Issues being debugged
Recent Changes	Deployments, configs
Known Issues	Workarounds in place
Upcoming Events	Maintenance, releases
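
The five components above can be treated as the fields of a single record. The following is a minimal sketch, assuming a hypothetical `HandoffDocument` class (the name and the `empty_sections` helper are illustrative, not part of any existing tool). It shows one way to notice sections that were left blank so the outgoing engineer can confirm they are empty on purpose.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffDocument:
    """Hypothetical container mirroring the five handoff components."""

    active_incidents: list[str] = field(default_factory=list)        # what's currently broken
    ongoing_investigations: list[str] = field(default_factory=list)  # issues being debugged
    recent_changes: list[str] = field(default_factory=list)          # deployments, configs
    known_issues: list[str] = field(default_factory=list)            # workarounds in place
    upcoming_events: list[str] = field(default_factory=list)         # maintenance, releases

    def empty_sections(self) -> list[str]:
        """Return the names of components that have no entries yet."""
        return [name for name, items in vars(self).items() if not items]


doc = HandoffDocument(
    ongoing_investigations=["ENG-1234: intermittent API timeouts"],
    upcoming_events=["01/23 02:00 database maintenance"],
)
print(doc.empty_sections())
```

An empty section is not an error; the point of the check is to make "nothing to report" an explicit statement rather than an omission.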

2. Handoff Timing

Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
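
Summing the durations in the outline above is a quick sanity check on the plan. The sketch below copies the minutes from the outline; the function name `minutes_past_overlap` is an illustrative assumption. It shows that the outgoing tasks fit the 30-minute overlap exactly, while the incoming tasks run 5 minutes longer, presumably because alert verification can happen once the sync call is over.

```python
OVERLAP_MIN = 30  # recommended overlap between shifts

# Durations in minutes, copied from the timing outline.
outgoing = {"write handoff document": 15, "sync call with incoming": 15}
incoming = {
    "review handoff document": 15,
    "sync call with outgoing": 15,
    "verify alerting setup": 5,
}


def minutes_past_overlap(tasks: dict[str, int], overlap: int = OVERLAP_MIN) -> int:
    """Return how many minutes a task list runs beyond the overlap (0 if it fits)."""
    return max(0, sum(tasks.values()) - overlap)


print(minutes_past_overlap(outgoing))  # 0
print(minutes_past_overlap(incoming))  # 5
```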

Templates

Template 1: Shift Handoff Document

On-Call Handoff: Platform Team

Outgoing: @alice (2024-01-15 to 2024-01-22)
Incoming: @bob (2024-01-22 to 2024-01-29)
Handoff Time: 2024-01-22 09:00 UTC

🔴 Active Incidents

None currently active

No active incidents at handoff time.

🟡 Ongoing Investigations

1. Intermittent API Timeouts (ENG-1234)

Status: Investigating
Started: 2024-01-20
Impact: ~0.1% of requests timing out

Context:

Timeouts correlate with database backup window (02:00-03:00 UTC)
Suspect backup process causing lock contention
Added extra logging in PR #567 (deployed 01/21)

Next Steps:

Review new logs after tonight's backup
Consider moving backup window if confirmed

Resources:

Dashboard: API Latency
Thread: #platform-eng (01/20, 14:32)

2. Memory Growth in Auth Service (ENG-1235)

Status: Monitoring
Started: 2024-01-18
Impact: None yet (proactive)

Context:

Memory usage growing ~5% per day
No memory leak found in profiling
Suspect connection pool not releasing properly

Next Steps:

Review heap dump from 01/21
Consider restart if usage > 80%

Resources:

Dashboard: Auth Service Memory
Analysis doc: Memory Investigation

🟢 Resolved This Shift

Payment Service Outage (2024-01-19)

Duration: 23 minutes
Root Cause: Database connection exhaustion
Resolution: Rolled back v2.3.4, increased pool size
Postmortem: POSTMORTEM-89
Follow-up tickets: ENG-1230, ENG-1231

📋 Recent Changes

Deployments

Service	Version	Time	Notes
api-gateway	v3.2.1	01/21 14:00	Bug fix for header parsing
user-service	v2.8.0	01/20 10:00	New profile features
auth-service	v4.1.2	01/19 16:00	Security patch

Configuration Changes

01/21: Increased API rate limit from 1000 to 1500 RPS
01/20: Updated database connection pool max from 50 to 75

Infrastructure

01/20: Added 2 nodes to Kubernetes cluster
01/19: Upgraded Redis from 6.2 to 7.0

⚠️ Known Issues & Workarounds

1. Slow Dashboard Loading

Issue: Grafana dashboards slow on Monday mornings
Workaround: Wait 5 min after 08:00 UTC for cache warm-up
Ticket: OPS-456 (P3)

2. Flaky Integration Test

Issue: test_payment_flow fails intermittently in CI
Workaround: Re-run failed job (usually passes on retry)
Ticket: ENG-1200 (P2)

📅 Upcoming Events

Date	Event	Impact	Contact
01/23 02:00	Database maintenance	5 min read-only	@dba-team
01/24 14:00	Major release v5.0	Monitor closely	@release-team
01/25	Marketing campaign	2x traffic expected	@platform

📞 Escalation Reminders

Issue Type	First Escalation	Second Escalation
Payment issues	@payments-oncall	@payments-manager
Auth issues	@auth-oncall	@security-team
Database issues	@dba-team	@infra-manager
Unknown/severe	@engineering-manager	@vp-engineering

🔧 Quick Reference

Common Commands

```bash
# Check service health (lists pods whose status is not Running)
kubectl get pods -A | grep -v Running

# Recent deployments (latest events in the current namespace)
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only; flushes the currently selected Redis database)
redis-cli FLUSHDB
```

Important Links

Handoff Checklist

Outgoing Engineer

Incoming Engineer

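Read this document
Join the sync call
Verify PagerDuty is routing alerts to you
Verify Slack notifications are working
Check that VPN/access is working
Review critical dashboards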
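
The Escalation Reminders table in this template is effectively a lookup from issue type to a first and second contact. As a rough sketch, assuming hypothetical names (`ESCALATION_PATHS`, `escalation_contact`) and short lowercase keys chosen here for illustration, the same table could be kept as data so that an unfamiliar issue type falls back to the "Unknown/severe" row:

```python
# Escalation paths copied from the Escalation Reminders table: (first, second).
ESCALATION_PATHS = {
    "payment": ("@payments-oncall", "@payments-manager"),
    "auth": ("@auth-oncall", "@security-team"),
    "database": ("@dba-team", "@infra-manager"),
}
# The "Unknown/severe" row doubles as the fallback for anything not listed.
DEFAULT_PATH = ("@engineering-manager", "@vp-engineering")


def escalation_contact(issue_type: str, level: int = 1) -> str:
    """Return the contact for a 1-based escalation level."""
    if level not in (1, 2):
        raise ValueError("the table defines only a first and second escalation")
    path = ESCALATION_PATHS.get(issue_type.strip().lower(), DEFAULT_PATH)
    return path[level - 1]


print(escalation_contact("payment"))      # @payments-oncall
print(escalation_contact("database", 2))  # @infra-manager
print(escalation_contact("dns"))          # @engineering-manager
```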

Template 2: Quick Handoff (Async)

Quick Handoff: @alice → @bob

TL;DR

No active incidents
1 investigation ongoing (API timeouts, see ENG-1234)
Major release tomorrow (01/24) - be ready for issues

Watch List

API latency around 02:00-03:00 UTC (backup window)
Auth service memory (restart if > 80%)

Recent

Deployed api-gateway v3.2.1 yesterday (stable)
Increased rate limits to 1500 RPS

Coming Up

01/23 02:00 - DB maintenance (5 min read-only)
01/24 14:00 - v5.0 release

Questions?

I'll be available on Slack until 17:00 today.
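
A watch-list item such as "Auth service memory (restart if > 80%)" is easier to act on when paired with a rough deadline. The sketch below combines that threshold with the ~5% per day growth noted under ENG-1235. It assumes the growth is linear and measured in percentage points per day, which the handoff notes do not state. The function name and the 60% starting value are made up for illustration.

```python
import math


def days_until_threshold(
    current_pct: float,
    growth_pct_per_day: float = 5.0,
    threshold_pct: float = 80.0,
) -> int:
    """Whole days until usage first exceeds the threshold, assuming linear growth.

    Returns 0 if usage is already above the threshold.
    """
    if growth_pct_per_day <= 0:
        raise ValueError("growth must be positive for a projection")
    if current_pct > threshold_pct:
        return 0
    # Smallest whole number of days d with current + d * growth > threshold.
    return math.floor((threshold_pct - current_pct) / growth_pct_per_day) + 1


# 60.0 is a made-up starting point; read the real value from the dashboard.
print(days_until_threshold(60.0))  # 5
```

At 60% and 5 points per day, usage reaches exactly 80% on day 4 and first exceeds it on day 5, so the restart decision would likely land on the next engineer's shift.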

Template 3: Incident Handoff (Mid-Incident)

INCIDENT HANDOFF: Payment Service Degradation

Incident Start: 2024-01-22 08:15 UTC
Current Status: Mitigating
Severity: SEV2

Current State

Error rate: 15% (down from 40%)
Mitigation in progress: scaling up pods
ETA to resolution: ~30 min

What We Know

Root cause: Memory pressure on payment-service pods
Triggered by: Unusual traffic spike (3x normal)
Contributing: Inefficient query in checkout flow

What We've Done

Scaled payment-service from 5 → 15 pods
Enabled rate limiting on checkout endpoint
Disabled non-critical features

What Needs to Happen

Monitor error rate - should reach <1% in ~15 min
If not improving, escalate to @payments-manager
Once stable, begin root cause investigation

Key People

Incident Commander: @alice (handing off)
Comms Lead: @charlie
Technical Lead: @bob (incoming)

Communication

Status page: Updated at 08:45
Customer support: Notified
Exec team: Aware

Resources

Incident channel: #inc-20240122-payment
Dashboard: Payment Service
Runbook: Payment Degradation

Incoming on-call (@bob) - Please confirm that you:

Have joined #inc-20240122-payment
Have access to dashboards
Understand the current state
Know the escalation path
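
The "What Needs to Happen" steps in this template form a small decision rule: keep watching while the error rate falls, escalate if it stops improving, and start the root cause investigation once it is below 1%. A minimal sketch of that rule follows, assuming a hypothetical `next_action` helper fed with error-rate readings in percent. It deliberately ignores the ~15 min deadline, which the person on call would still need to track.

```python
def next_action(error_rates: list[float], target_pct: float = 1.0) -> str:
    """Pick the next step from error-rate readings (percent), oldest first."""
    if not error_rates:
        raise ValueError("need at least one reading")
    latest = error_rates[-1]
    if latest < target_pct:
        return "stable: begin root cause investigation"
    # Flat or rising compared with the previous reading counts as not improving.
    if len(error_rates) >= 2 and latest >= error_rates[-2]:
        return "not improving: escalate to @payments-manager"
    return "keep monitoring"


print(next_action([40.0, 15.0]))       # keep monitoring
print(next_action([15.0, 15.0]))       # not improving: escalate to @payments-manager
print(next_action([15.0, 4.0, 0.5]))   # stable: begin root cause investigation
```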

Handoff Sync Meeting

Agenda (15 minutes)

Handoff Sync: @alice → @bob

Active Issues (5 min)
- Walk through any ongoing incidents
- Discuss investigation status
- Transfer context and theories
Recent Changes (3 min)
- Deployments to watch
- Config changes
- Known regressions
Upcoming Events (3 min)
- Maintenance windows
- Expected traffic changes
- Releases planned
Questions (4 min)
- Clarify anything unclear
- Confirm access and alerting
- Exchange contact info
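
The time boxes in the agenda above add up to the stated 15 minutes. A short sketch, with the hypothetical helper `agenda_offsets`, turns them into start offsets that a facilitator could use to keep the call on schedule:

```python
# (title, minutes) pairs copied from the agenda.
AGENDA = [
    ("Active Issues", 5),
    ("Recent Changes", 3),
    ("Upcoming Events", 3),
    ("Questions", 4),
]


def agenda_offsets(items: list[tuple[str, int]]) -> tuple[list[tuple[int, str]], int]:
    """Return (start_minute, title) pairs and the total length in minutes."""
    offsets = []
    elapsed = 0
    for title, minutes in items:
        offsets.append((elapsed, title))
        elapsed += minutes
    return offsets, elapsed


offsets, total = agenda_offsets(AGENDA)
for start, title in offsets:
    print(f"+{start:02d} min  {title}")
print(f"total: {total} min")
```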

On-Call Best Practices

Before Your Shift

Pre-Shift Checklist

Access Verification

Alerting Setup

PagerDuty schedule shows you as primary
Phone notifications enabled
Slack notifications for incident channels
Test alert received and acknowledged

Knowledge Refresh

Environment Ready

Laptop charged and accessible
Phone charged
Quiet space available for calls
Secondary contact identified (if traveling)

During Your Shift

Daily On-Call Routine

Morning (start of day)

Check overnight alerts
Review dashboards for anomalies
Check for any P0/P1 tickets created
Skim incident channels for context

Throughout Day

End of Day

Hand off any active issues
Update investigation docs
Note anything for next shift

After Your Shift

Post-Shift Checklist

Complete handoff document
Sync with incoming on-call
Verify PagerDuty routing changed
Close/update investigation tickets
File postmortems for any incidents
Take time off if shift was stressful

Escalation Guidelines

When to Escalate

Escalation Triggers

Immediate Escalation

SEV1 incident declared
Data breach suspected
Unable to diagnose within 30 min
Customer or legal escalation received

Consider Escalation

Issue spans multiple teams
Requires expertise you don't have
Business impact exceeds threshold
You're uncertain about next steps

How to Escalate

Page the appropriate escalation path
Provide brief context in Slack
Stay engaged until the escalation contact acknowledges
Hand off cleanly, don't just disappear
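
The two trigger lists above can be read as a simple classification: any "immediate" trigger wins, otherwise any "consider" trigger suggests weighing an escalation. The sketch below assumes hypothetical signal names invented here to stand for each bullet. It is one possible encoding, not a prescribed policy.

```python
# Signal names are illustrative stand-ins for the bullets in each list.
IMMEDIATE = {
    "sev1_declared",
    "data_breach_suspected",
    "undiagnosed_after_30_min",
    "customer_or_legal_escalation",
}
CONSIDER = {
    "spans_multiple_teams",
    "missing_expertise",
    "business_impact_over_threshold",
    "unsure_of_next_steps",
}


def escalation_level(signals: set[str]) -> str:
    """Return "immediate", "consider", or "none" for the observed signals."""
    unknown = signals - IMMEDIATE - CONSIDER
    if unknown:
        raise ValueError(f"unrecognized signals: {sorted(unknown)}")
    if signals & IMMEDIATE:
        return "immediate"
    if signals & CONSIDER:
        return "consider"
    return "none"


print(escalation_level({"spans_multiple_teams"}))                   # consider
print(escalation_level({"missing_expertise", "sev1_declared"}))     # immediate
print(escalation_level(set()))                                      # none
```

Rejecting unknown signals, rather than ignoring them, keeps a typo from silently downgrading an escalation.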

Best Practices

Do's

Document everything - Future you will thank you
Escalate early - Better safe than sorry
Take breaks - Alert fatigue is real
Keep handoffs synchronous - Async loses context
Test your setup - Before incidents, not during

Don'ts

Don't skip handoffs - Context loss causes incidents
Don't play the hero - Escalate when needed
Don't ignore alerts - Even if they seem minor
Don't work sick - Swap shifts instead
Don't disappear - Stay reachable during shift
