on-call-handoff-patterns
On-Call Handoff Patterns
Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
When to Use This Skill
- Transitioning on-call responsibilities
- Writing shift handoff summaries
- Documenting ongoing investigations
- Establishing on-call rotation procedures
- Improving handoff quality
- Onboarding new on-call engineers
Core Concepts
1. Handoff Components
| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
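The five components in the table can be treated as the fields of a structured record, so that an empty component is stated explicitly instead of being left out. The following is a minimal sketch, not a prescribed format: the class name `HandoffDocument`, its field names, and the `render` layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffDocument:
    """Hypothetical container for the five handoff components in the table."""
    outgoing: str
    incoming: str
    active_incidents: list[str] = field(default_factory=list)
    ongoing_investigations: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)
    known_issues: list[str] = field(default_factory=list)
    upcoming_events: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render every component, writing 'None' for empty ones."""
        sections = [
            ("Active Incidents", self.active_incidents),
            ("Ongoing Investigations", self.ongoing_investigations),
            ("Recent Changes", self.recent_changes),
            ("Known Issues", self.known_issues),
            ("Upcoming Events", self.upcoming_events),
        ]
        lines = [f"On-Call Handoff: {self.outgoing} -> {self.incoming}"]
        for title, items in sections:
            lines.append(title)
            # An empty component is still listed so the reader knows it was checked.
            lines.extend(f"- {item}" for item in (items or ["None"]))
        return "\n".join(lines)


doc = HandoffDocument(
    outgoing="@alice",
    incoming="@bob",
    ongoing_investigations=["Intermittent API timeouts (ENG-1234)"],
)
print(doc.render())
```

Listing "None" for an empty component is a design choice: it lets the incoming engineer tell "nothing to report" apart from "forgot to fill this in".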
2. Handoff Timing
Recommended: 30 min overlap between shifts
Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
Templates
Template 1: Shift Handoff Document
On-Call Handoff: Platform Team
Outgoing: @alice (2024-01-15 to 2024-01-22)
Incoming: @bob (2024-01-22 to 2024-01-29)
Handoff Time: 2024-01-22 09:00 UTC
🔴 Active Incidents
None currently active
No active incidents at handoff time.
🟡 Ongoing Investigations
1. Intermittent API Timeouts (ENG-1234)
Status: Investigating
Started: 2024-01-20
Impact: ~0.1% of requests timing out
Context:
- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)
Next Steps:
- Review new logs after tonight's backup
- Consider moving backup window if confirmed
Resources:
- Dashboard: API Latency
- Thread: #platform-eng (01/20, 14:32)
2. Memory Growth in Auth Service (ENG-1235)
Status: Monitoring
Started: 2024-01-18
Impact: None yet (proactive)
Context:
- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly
Next Steps:
- Review heap dump from 01/21
- Consider restart if usage > 80%
Resources:
- Dashboard: Auth Service Memory
- Analysis doc: Memory Investigation
🟢 Resolved This Shift
Payment Service Outage (2024-01-19)
- Duration: 23 minutes
- Root Cause: Database connection exhaustion
- Resolution: Rolled back v2.3.4, increased pool size
- Postmortem: POSTMORTEM-89
- Follow-up tickets: ENG-1230, ENG-1231
📋 Recent Changes
Deployments
| Service | Version | Time | Notes |
|---|---|---|---|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |
Configuration Changes
- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75
Infrastructure
- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0
⚠️ Known Issues & Workarounds
1. Slow Dashboard Loading
Issue: Grafana dashboards slow on Monday mornings
Workaround: Wait 5 min after 08:00 UTC for cache warm-up
Ticket: OPS-456 (P3)
2. Flaky Integration Test
Issue: test_payment_flow fails intermittently in CI
Workaround: Re-run failed job (usually passes on retry)
Ticket: ENG-1200 (P2)
📅 Upcoming Events
| Date | Event | Impact | Contact |
|---|---|---|---|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|---|---|---|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
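A table like this is easy to mirror as a lookup, which can help keep tooling or chat-bot replies consistent with the written escalation path. This is only a sketch: the contacts are copied from the table above, while the short issue-type keys (`"payment"`, `"auth"`, `"database"`) and the function name `escalation_contact` are assumptions chosen for illustration.

```python
# (first escalation, second escalation) per issue type, copied from the table.
ESCALATION_PATHS = {
    "payment": ("@payments-oncall", "@payments-manager"),
    "auth": ("@auth-oncall", "@security-team"),
    "database": ("@dba-team", "@infra-manager"),
}
# Fallback taken from the "Unknown/severe" row.
DEFAULT_PATH = ("@engineering-manager", "@vp-engineering")


def escalation_contact(issue_type: str, level: int = 1) -> str:
    """Return the contact for an issue type at escalation level 1 or 2."""
    if level not in (1, 2):
        raise ValueError("level must be 1 or 2")
    path = ESCALATION_PATHS.get(issue_type.lower(), DEFAULT_PATH)
    return path[level - 1]


print(escalation_contact("payment"))      # @payments-oncall
print(escalation_contact("database", 2))  # @infra-manager
print(escalation_contact("dns"))          # @engineering-manager
```

Unrecognized issue types fall through to the "Unknown/severe" row, so a lookup never comes back empty.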
🔧 Quick Reference
Common Commands
Check service health
kubectl get pods -A | grep -v Running
Recent deployments
kubectl get events --sort-by='.lastTimestamp' | tail -20
Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"
Clear cache (emergency only)
redis-cli FLUSHDB
Important Links
Handoff Checklist
Outgoing Engineer
- Document active incidents
- Document ongoing investigations
- List recent changes
- Note known issues
- Add upcoming events
- Sync with incoming engineer
Incoming Engineer
- Read this document
- Join sync call
- Verify PagerDuty is routing to you
- Verify Slack notifications working
- Check VPN/access working
- Review critical dashboards
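The outgoing engineer's checklist can be partly automated by checking that a drafted handoff document mentions each required section. The sketch below is one possible approach, not part of the template: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the check is a plain case-insensitive substring match, so it confirms a section title is present but not that the section has useful content.

```python
# Section titles taken from the handoff components described in this document.
REQUIRED_SECTIONS = [
    "Active Incidents",
    "Ongoing Investigations",
    "Recent Changes",
    "Known Issues",
    "Upcoming Events",
]


def missing_sections(handoff_text: str) -> list[str]:
    """Return required section titles that do not appear anywhere in the text."""
    lowered = handoff_text.lower()
    return [title for title in REQUIRED_SECTIONS if title.lower() not in lowered]


draft = (
    "🔴 Active Incidents\n"
    "No active incidents at handoff time.\n"
    "🟡 Ongoing Investigations\n"
    "📋 Recent Changes\n"
    "⚠️ Known Issues & Workarounds\n"
)
print(missing_sections(draft))  # ['Upcoming Events']
```

Running a check like this before the sync call gives the outgoing engineer a chance to fill gaps while the context is still fresh.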
Template 2: Quick Handoff (Async)
Quick Handoff: @alice → @bob
TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release in two days (01/24) - be ready for issues
Watch List
- API latency around 02:00-03:00 UTC (backup window)
- Auth service memory (restart if > 80%)
Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS
Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release
Questions?
I'll be available on Slack until 17:00 today.
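A watch-list item such as "restart if > 80%" is more useful when the incoming engineer knows roughly how much time is left before the threshold. The sketch below estimates that under two loud assumptions: growth is linear, and the "~5% per day" figure from the auth-service investigation means percentage points of total memory per day. The 60% starting value is hypothetical and is not taken from the handoff.

```python
import math


def days_until_threshold(
    current_pct: float,
    growth_pct_points_per_day: float,
    threshold_pct: float = 80.0,
) -> float:
    """Estimate days until usage reaches the threshold, assuming linear growth."""
    if current_pct >= threshold_pct:
        return 0.0  # already at or over the threshold
    if growth_pct_points_per_day <= 0:
        return math.inf  # not growing, so the threshold is never reached
    return (threshold_pct - current_pct) / growth_pct_points_per_day


# Hypothetical reading of 60% with growth of 5 percentage points per day.
print(days_until_threshold(60.0, 5.0))  # 4.0
```

If the growth turns out to be relative (5% of the current value per day), the estimate would need a compounding formula instead; the handoff note does not say which is meant.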
Template 3: Incident Handoff (Mid-Incident)
INCIDENT HANDOFF: Payment Service Degradation
Incident Start: 2024-01-22 08:15 UTC
Current Status: Mitigating
Severity: SEV2
Current State
- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min
What We Know
- Root cause: Memory pressure on payment-service pods
- Triggered by: Unusual traffic spike (3x normal)
- Contributing: Inefficient query in checkout flow
What We've Done
- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features
What Needs to Happen
- Monitor error rate - should reach <1% in ~15 min
- If not improving, escalate to @payments-manager
- Once stable, begin root cause investigation
Key People
- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)
Communication
- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
Resources
- Incident channel: #inc-20240122-payment
- Dashboard: Payment Service
- Runbook: Payment Degradation
Incoming on-call (@bob) - Please confirm that you:
- Have joined #inc-20240122-payment
- Have access to dashboards
- Understand the current state
- Know the escalation path
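The "What Needs to Happen" steps in this template amount to a small decision rule: keep monitoring while the error rate falls, start the root cause investigation once it drops below 1%, and escalate if it stops improving. The sketch below encodes that rule for a list of error-rate samples. The function name `next_action` and the sample values are assumptions for illustration, and the rule deliberately ignores the "~15 min" time budget, which a real check would also track.

```python
def next_action(error_rates: list[float], target: float = 1.0) -> str:
    """Decide the next step from error-rate samples (percent, oldest first)."""
    if not error_rates:
        raise ValueError("need at least one error-rate sample")
    latest = error_rates[-1]
    if latest < target:
        return "stable: begin root cause investigation"
    # Flat or rising compared with the previous sample counts as not improving.
    if len(error_rates) >= 2 and latest >= error_rates[-2]:
        return "not improving: escalate to @payments-manager"
    return "improving: keep monitoring"


print(next_action([40.0, 15.0]))        # improving: keep monitoring
print(next_action([40.0, 15.0, 15.0]))  # not improving: escalate to @payments-manager
print(next_action([15.0, 0.5]))         # stable: begin root cause investigation
```

Writing the rule down, even informally, gives the incoming engineer an unambiguous point at which to stop waiting and escalate.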
Handoff Sync Meeting
Agenda (15 minutes)
Handoff Sync: @alice → @bob
- Active Issues (5 min)
  - Walk through any ongoing incidents
  - Discuss investigation status
  - Transfer context and theories
- Recent Changes (3 min)
  - Deployments to watch
  - Config changes
  - Known regressions
- Upcoming Events (3 min)
  - Maintenance windows
  - Expected traffic changes
  - Releases planned
- Questions (4 min)
  - Clarify anything unclear
  - Confirm access and alerting
  - Exchange contact info
On-Call Best Practices
Before Your Shift
Pre-Shift Checklist
Access Verification
- VPN working
- kubectl access to all clusters
- Database read access
- Log aggregator access (Splunk/Datadog)
- PagerDuty app installed and logged in
Alerting Setup
- PagerDuty schedule shows you as primary
- Phone notifications enabled
- Slack notifications for incident channels
- Test alert received and acknowledged
Knowledge Refresh
- Review recent incidents (past 2 weeks)
- Check service changelog
- Skim critical runbooks
- Know escalation contacts
Environment Ready
- Laptop charged and accessible
- Phone charged
- Quiet space available for calls
- Secondary contact identified (if traveling)
During Your Shift
Daily On-Call Routine
Morning (start of day)
- Check overnight alerts
- Review dashboards for anomalies
- Check for any P0/P1 tickets created
- Skim incident channels for context
Throughout Day
- Respond to alerts within SLA
- Document investigation progress
- Update team on significant issues
- Triage incoming pages
End of Day
- Hand off any active issues
- Update investigation docs
- Note anything for next shift
After Your Shift
Post-Shift Checklist
- Complete handoff document
- Sync with incoming on-call
- Verify PagerDuty routing changed
- Close/update investigation tickets
- File postmortems for any incidents
- Take time off if the shift was stressful
Escalation Guidelines
When to Escalate
Escalation Triggers
Immediate Escalation
- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received
Consider Escalation
- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps
How to Escalate
- Page the appropriate escalation path
- Provide brief context in Slack
- Stay engaged until the escalation contact acknowledges
- Hand off cleanly, don't just disappear
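The two trigger lists above can be read as a two-tier check: any "immediate" trigger wins, otherwise any "consider" trigger prompts a judgment call. The following sketch shows one way to express that. The `IncidentState` class, its field names, and the returned labels are hypothetical, and the "business impact exceeds threshold" trigger is reduced to a boolean because the document does not define the threshold.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    """Hypothetical flags mirroring the escalation triggers listed above."""
    sev1_declared: bool = False
    data_breach_suspected: bool = False
    minutes_without_diagnosis: int = 0
    customer_or_legal_escalation: bool = False
    spans_multiple_teams: bool = False
    needs_outside_expertise: bool = False
    business_impact_over_threshold: bool = False
    unsure_of_next_steps: bool = False


def escalation_decision(state: IncidentState) -> str:
    """Return 'immediate', 'consider', or 'none' for the given incident state."""
    immediate = (
        state.sev1_declared
        or state.data_breach_suspected
        or state.minutes_without_diagnosis >= 30  # "unable to diagnose within 30 min"
        or state.customer_or_legal_escalation
    )
    if immediate:
        return "immediate"
    consider = (
        state.spans_multiple_teams
        or state.needs_outside_expertise
        or state.business_impact_over_threshold
        or state.unsure_of_next_steps
    )
    return "consider" if consider else "none"


print(escalation_decision(IncidentState(minutes_without_diagnosis=30)))  # immediate
print(escalation_decision(IncidentState(spans_multiple_teams=True)))     # consider
```

Checking the immediate triggers first means a SEV1 is never downgraded to "consider" just because softer triggers are also present.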
Best Practices
Do's
- Document everything - Future you will thank you
- Escalate early - Better safe than sorry
- Take breaks - Alert fatigue is real
- Prefer synchronous handoffs - Async handoffs tend to lose context
- Test your setup - Before incidents, not during
Don'ts
- Don't skip handoffs - Context loss causes incidents
- Don't play the hero - Escalate when needed
- Don't ignore alerts - Even if they seem minor
- Don't work sick - Swap shifts instead
- Don't disappear - Stay reachable during shift