Runbook Creation
Overview
Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.
When to Use
- Incident response procedures
- Standard operating procedures (SOPs)
- On-call playbooks
- System maintenance guides
- Disaster recovery procedures
- Deployment runbooks
- Escalation procedures
- Service restoration guides
Incident Response Runbook Template
Incident Response Runbook
Quick Reference
Severity Levels:
- P0 (Critical): Complete outage, data loss, security breach
- P1 (High): Major feature down, significant user impact
- P2 (Medium): Minor feature degradation, limited user impact
- P3 (Low): Cosmetic issues, minimal user impact
Response Times:
- P0: Immediate (24/7)
- P1: 15 minutes (business hours), 1 hour (after hours)
- P2: 4 hours (business hours)
- P3: Next business day
Escalation Contacts:
- On-call Engineer: PagerDuty rotation
- Engineering Manager: +1-555-0100
- VP Engineering: +1-555-0101
- CTO: +1-555-0102
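
The response-time table above can be encoded as a small lookup so that alerting scripts and humans read the same values. This is a minimal sketch, not part of the original runbook: the function name `response_target` and the `business`/`after` window labels are assumptions made for illustration.

```bash
# Hypothetical helper: look up the response target for a severity level.
# Usage: response_target <P0|P1|P2|P3> [business|after]
# The window defaults to "business".
response_target() {
  severity="$1"
  window="${2:-business}"
  case "$severity:$window" in
    P0:*)        echo "immediate" ;;
    P1:business) echo "15 minutes" ;;
    P1:after)    echo "1 hour" ;;
    P2:business) echo "4 hours" ;;
    P3:*)        echo "next business day" ;;
    # Any combination the table above does not list (e.g. P2 after hours)
    *)           echo "undefined"; return 1 ;;
  esac
}
```

For example, `response_target P1 after` prints `1 hour`. Combinations the table leaves open return a non-zero status instead of guessing a value.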
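Table of Contents
- Service Down
- Database Issues
- High CPU/Memory Usage
- Rollback Procedures
- Escalation Path
- Useful Commands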
Service Down
Symptoms
- Health check endpoint returning 500 errors
- Users unable to access application
- Load balancer showing all instances unhealthy
- Alerts: `service_down`, `health_check_failed`
Severity: P0 (Critical)
Initial Response (5 minutes)
1. Acknowledge the incident

```bash
# Acknowledge in PagerDuty
# Post in #incidents Slack channel
```

2. Create incident channel

Create Slack channel: #incident-YYYY-MM-DD-service-down
Post incident details and status updates

3. Assess impact

```bash
# Check service status
kubectl get pods -n production

# Check recent deployments
kubectl rollout history deployment/api -n production

# Check logs
kubectl logs -f deployment/api -n production --tail=100
```
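
The channel naming convention in step 2 is easy to get wrong under pressure, so it can help to generate the name. The sketch below is illustrative only: `incident_channel` is a hypothetical helper, and defaulting the date to today in UTC is an assumption (the runbook's example timestamps use UTC).

```bash
# Hypothetical helper: build the incident channel name from step 2.
# Usage: incident_channel <slug> [YYYY-MM-DD]
# The date defaults to today's date in UTC (an assumption of this sketch).
incident_channel() {
  slug="$1"
  day="${2:-$(date -u +%Y-%m-%d)}"
  printf '#incident-%s-%s\n' "$day" "$slug"
}
```

Calling `incident_channel service-down 2025-01-15` prints `#incident-2025-01-15-service-down`.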
Investigation Steps
Check Application Health
```bash
# 1. Check pod status
kubectl get pods -n production -l app=api
# Expected output: All pods Running
# NAME                   READY   STATUS    RESTARTS   AGE
# api-7d8c9f5b6d-4xk2p   1/1     Running   0          2h
# api-7d8c9f5b6d-7nm8r   1/1     Running   0          2h

# 2. Check pod logs for errors
kubectl logs -f deployment/api -n production --tail=100 | grep -i error

# 3. Check application endpoints
curl -v https://api.example.com/health
curl -v https://api.example.com/api/v1/status

# 4. Check database connectivity (run psql inside the pod shell)
kubectl exec -it deployment/api -n production -- sh
psql "$DATABASE_URL" -c "SELECT 1"
```
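
When there are many pods, comparing the output of step 1 against the expected output by eye is slow. A minimal sketch of a filter that lists only the pods that are not `Running` or not fully ready follows. The function name `unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods` shown above (NAME, READY, STATUS as the first three columns).

```bash
# Hypothetical filter: read `kubectl get pods` output on stdin and print
# the names of pods that are not Running or whose READY count is incomplete.
unhealthy_pods() {
  awk 'NR > 1 {                      # skip the header row
    split($2, ready, "/")            # READY column looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example usage:
#   kubectl get pods -n production -l app=api | unhealthy_pods
```

Empty output means every listed pod matches the expected state. Any name printed is a candidate for `kubectl describe pod` and `kubectl logs`.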
Check Infrastructure
```bash
# 1. Check load balancer (target groups belong to the elbv2 API)
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# 2. Check DNS resolution
dig api.example.com
nslookup api.example.com

# 3. Check SSL certificates
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# 4. Check network connectivity
kubectl exec -it deployment/api -n production -- \
  curl -v https://database.example.com:5432
```

Check Database
```bash
# 1. Check database connections
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"

# 2. Check for locks
psql $DATABASE_URL -c "
  SELECT pid, usename, pg_blocking_pids(pid) as blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
"

# 3. Check database size
psql $DATABASE_URL -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))
"

# 4. Check long-running queries
psql $DATABASE_URL -c "
  SELECT pid, now() - query_start as duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10
"
```

Resolution Steps
Option 1: Restart Pods (Quick Fix)
```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```

Option 2: Scale Up (If Overload)
```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```

Option 3: Rollback (If Bad Deploy)
```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to previous version
kubectl rollout undo deployment/api -n production

# Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```

Option 4: Database Connection Reset
```bash
# If database connection pool exhausted (run kill inside the pod shell)
kubectl exec -it deployment/api -n production -- sh
kill -HUP 1  # Reload process, reset connections

# Or restart database connection pool
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'api'
  AND state = 'idle'"
```

Verification
```bash
# 1. Check health endpoint
curl https://api.example.com/health
# Expected: {"status": "healthy"}

# 2. Check API endpoints
curl https://api.example.com/api/v1/status
# Expected: Valid JSON response

# 3. Check metrics
# Verify:
# - Error rate < 1%
# - Response time < 500ms
# - All pods healthy

# 4. Check logs for errors
kubectl logs deployment/api -n production --tail=100 | grep -i error
# Expected: No new errors
```
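
The metric thresholds in step 3 can be turned into a pass/fail check once the two numbers have been read off a dashboard. This is a sketch under stated assumptions: `metrics_ok` is a hypothetical helper, and how the error rate and response time are obtained is left to whatever monitoring tool is in use.

```bash
# Hypothetical helper: succeed only if both values are under the
# thresholds listed above (error rate < 1%, response time < 500ms).
# Usage: metrics_ok <error_rate_percent> <response_time_ms>
metrics_ok() {
  # awk handles the decimal comparison; exit status 0 means "within limits"
  awk -v err="$1" -v rt="$2" 'BEGIN { exit !(err < 1 && rt < 500) }'
}

# Example usage:
#   metrics_ok 0.4 320 && echo "metrics within thresholds"
```

The check covers only the two numeric thresholds. Pod health and log review from the other steps still need to be confirmed separately.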
Communication
Initial Update (within 5 minutes):

🚨 INCIDENT: Service Down
Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC
We are investigating reports of users unable to access the application.
Our team is working to identify the root cause.
Next update in 15 minutes.

Progress Update (every 15 minutes):

🔍 UPDATE: Service Down
Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes
We have identified the issue and are implementing a fix.

Resolution Update:

✅ RESOLVED: Service Down
Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes
The service is now fully operational. We are monitoring closely
and will conduct a post-mortem to prevent future occurrences.
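
Filling in the initial update from a template keeps the first message consistent across incidents. A minimal sketch follows, assuming a hypothetical `initial_update` function whose four arguments map onto the fields of the initial update above. Posting the result to Slack or a status page is outside its scope.

```bash
# Hypothetical helper: render the initial incident update.
# Usage: initial_update <title> <severity> <impact> <start-time>
initial_update() {
  title="$1"
  severity="$2"
  impact="$3"
  start="$4"
  cat <<EOF
🚨 INCIDENT: $title
Status: Investigating
Severity: $severity
Impact: $impact
Start Time: $start
Next update in 15 minutes.
EOF
}

# Example usage:
#   initial_update "Service Down" P0 \
#     "All users unable to access application" "2025-01-15 14:30 UTC"
```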
Post-Incident
1. Create post-mortem document
   - Timeline of events
   - Root cause analysis
   - Action items to prevent recurrence
2. Update monitoring
   - Add alerts for this scenario
   - Improve detection time
3. Update runbook
   - Document any new findings
   - Add shortcuts for faster resolution
Database Issues
High Connection Count
Symptoms:
- Database rejecting new connections
- Error: "too many connections"
- Alert: `db_connections_high`
Quick Fix:
```bash
# 1. Check connection count
psql $DATABASE_URL -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name
"

# 2. Kill idle connections
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes'
"

# 3. Restart connection pools
kubectl rollout restart deployment/api -n production
```
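
A raw connection count is hard to judge without the server's limit. The sketch below computes usage as a percentage of `max_connections`. `conn_usage_pct` is a hypothetical helper, and the commented `psql` calls assume the same `$DATABASE_URL` used throughout this runbook.

```bash
# Hypothetical helper: integer percentage of the connection limit in use.
# Usage: conn_usage_pct <current_connections> <max_connections>
conn_usage_pct() {
  current="$1"
  max="$2"
  echo $(( current * 100 / max ))   # integer division, rounds down
}

# Example usage (-At prints the bare value without headers or alignment):
#   current=$(psql "$DATABASE_URL" -At -c "SELECT count(*) FROM pg_stat_activity")
#   max=$(psql "$DATABASE_URL" -At -c "SHOW max_connections")
#   echo "connections: $(conn_usage_pct "$current" "$max")% of limit"
```

The point at which this percentage should trigger the `db_connections_high` alert is not specified in this runbook and depends on the deployment.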
Slow Queries
Symptoms:
- API response times > 5 seconds
- Database CPU at 100%
- Alert: `slow_query_detected`
Investigation:
```sql
-- Find slow queries
SELECT
  pid,
  now() - query_start as duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for missing indexes
-- (pg_stat_user_tables exposes the table name as relname)
SELECT
  schemaname,
  relname,
  seq_scan,
  seq_tup_read,
  idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill long-running query (if needed)
SELECT pg_terminate_backend(12345); -- Replace with actual PID
```

High CPU/Memory Usage
Symptoms
- Pods being OOMKilled
- Response times increasing
- Alerts: `high_memory_usage`, `high_cpu_usage`
Investigation
```bash
# 1. Check pod resources
kubectl top pods -n production

# 2. Check resource limits
kubectl describe pod <pod-name> -n production | grep -A 5 Limits

# 3. Check for memory leaks
kubectl logs deployment/api -n production | grep -i "out of memory"

# 4. Profile application (if needed)
kubectl exec -it <pod-name> -n production -- sh
# Run profiler: node --inspect, py-spy, etc.
```
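
To spot pods approaching their memory limit, the output of step 1 can be filtered by a threshold. This is a rough sketch: `pods_over_memory` is a hypothetical helper, it assumes the default three-column output of `kubectl top pods` (NAME, CPU, MEMORY), and it only understands values reported in `Mi` or `Gi`.

```bash
# Hypothetical filter: read `kubectl top pods` output on stdin and print
# pods using more memory than the given threshold (in Mi).
# Usage: kubectl top pods -n production | pods_over_memory <threshold_mi>
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 {                 # skip the header row
    mem = $3
    if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem = mem * 1024 }
    else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem + 0 }
    else next                                  # unit not handled by this sketch
    if (mem > limit) print $1, mem "Mi"
  }'
}
```

With a 4Gi limit, a threshold of around 3500 would surface pods that are close to being OOMKilled. The right threshold is a judgment call for each service.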
Resolution
```bash
# Option 1: Increase resources
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi

# Option 2: Scale horizontally
kubectl scale deployment/api -n production --replicas=6

# Option 3: Restart problematic pods
kubectl delete pod <pod-name> -n production
```

Rollback Procedures
Application Rollback
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n production

# 2. Check specific revision
kubectl rollout history deployment/api -n production --revision=5

# 3. Rollback to previous
kubectl rollout undo deployment/api -n production

# 4. Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# 5. Verify rollback
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```

Database Rollback
```bash
# 1. Check migration status
npm run db:migrate:status

# 2. Rollback last migration
npm run db:migrate:undo

# 3. Rollback to specific migration
# (the "--" separator makes npm pass --to through to the script)
npm run db:migrate:undo -- --to 20250115120000-migration-name

# 4. Verify database state
psql $DATABASE_URL -c "\dt"
```

Escalation Path
1. Level 1 - On-call Engineer (You)
   - Initial response and investigation
   - Attempt standard fixes from runbook
2. Level 2 - Senior Engineers
   - Escalate if not resolved in 30 minutes
   - Escalate if issue is complex/unclear
   - Contact via PagerDuty or Slack
3. Level 3 - Engineering Manager
   - Escalate if not resolved in 1 hour
   - Escalate if cross-team coordination needed
4. Level 4 - VP Engineering / CTO
   - Escalate for P0 incidents > 2 hours
   - Escalate for security breaches
   - Escalate for data loss
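
The time-based part of the escalation path can be expressed as a simple function, which is handy for reminder bots or incident timers. This is a sketch only: `escalation_level` is a hypothetical name, and it covers elapsed time alone. The other triggers listed above (complex or unclear issues, cross-team coordination, security breaches, data loss) still require human judgment.

```bash
# Hypothetical helper: escalation level implied by elapsed time alone.
# Usage: escalation_level <minutes_elapsed> [severity]
escalation_level() {
  minutes="$1"
  severity="${2:-}"
  if [ "$severity" = "P0" ] && [ "$minutes" -gt 120 ]; then
    echo 4    # VP Engineering / CTO: P0 incidents longer than 2 hours
  elif [ "$minutes" -ge 60 ]; then
    echo 3    # Engineering Manager: not resolved in 1 hour
  elif [ "$minutes" -ge 30 ]; then
    echo 2    # Senior Engineers: not resolved in 30 minutes
  else
    echo 1    # On-call Engineer
  fi
}
```

For example, `escalation_level 45` prints `2`, and `escalation_level 150 P0` prints `4`.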
Useful Commands
```bash
# Kubernetes
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
kubectl describe pod <pod-name> -n production
kubectl exec -it <pod-name> -n production -- sh
kubectl top pods -n production

# Database
psql $DATABASE_URL -c "SELECT version()"
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity"

# AWS
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...

# Monitoring URLs
# Grafana: https://grafana.example.com
# Datadog: https://app.datadoghq.com
# PagerDuty: https://example.pagerduty.com
# Status Page: https://status.example.com
```

Best Practices
✅ DO
- Include quick reference section at top
- Provide exact commands to run
- Document expected outputs
- Include verification steps
- Add communication templates
- Define severity levels clearly
- Document escalation paths
- Include useful links and contacts
- Keep runbooks up-to-date
- Test runbooks regularly
- Include screenshots/diagrams
- Document common gotchas
❌ DON'T
- Use vague instructions
- Skip verification steps
- Forget to document prerequisites
- Assume knowledge of tools
- Skip communication guidelines
- Forget to update after incidents