
Runbook Creation

Overview

Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.

When to Use

  • Incident response procedures
  • Standard operating procedures (SOPs)
  • On-call playbooks
  • System maintenance guides
  • Disaster recovery procedures
  • Deployment runbooks
  • Escalation procedures
  • Service restoration guides

Incident Response Runbook Template



Incident Response Runbook


Quick Reference


Severity Levels:
  • P0 (Critical): Complete outage, data loss, security breach
  • P1 (High): Major feature down, significant user impact
  • P2 (Medium): Minor feature degradation, limited user impact
  • P3 (Low): Cosmetic issues, minimal user impact
Response Times:
  • P0: Immediate (24/7)
  • P1: 15 minutes (business hours), 1 hour (after hours)
  • P2: 4 hours (business hours)
  • P3: Next business day
Escalation Contacts:
  • On-call Engineer: PagerDuty rotation
  • Engineering Manager: +1-555-0100
  • VP Engineering: +1-555-0101
  • CTO: +1-555-0102

Service Down


Symptoms


  • Health check endpoint returning 500 errors
  • Users unable to access application
  • Load balancer showing all instances unhealthy
  • Alerts: `service_down`, `health_check_failed`

Severity: P0 (Critical)


Initial Response (5 minutes)


  1. Acknowledge the incident

     ```bash
     # Acknowledge in PagerDuty
     # Post in #incidents Slack channel
     ```

  2. Create incident channel
     Create Slack channel: #incident-YYYY-MM-DD-service-down
     Post incident details and status updates there

  3. Assess impact

     ```bash
     # Check service status
     kubectl get pods -n production

     # Check recent deployments
     kubectl rollout history deployment/api -n production

     # Check logs
     kubectl logs -f deployment/api -n production --tail=100
     ```
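The assessment commands above can be bundled into a first-pass triage helper. A minimal sketch, assuming the `api` deployment and `production` namespace used throughout this runbook:

```shell
# Print pod status, rollout history, and recent logs in one pass.
section() { printf '=== %s ===\n' "$1"; }

triage() {
  # Namespace and deployment default to the values used in this runbook.
  local ns="${1:-production}" dep="${2:-api}"
  section "Pod status"
  kubectl get pods -n "$ns" -l "app=$dep"
  section "Rollout history"
  kubectl rollout history "deployment/$dep" -n "$ns"
  section "Last 100 log lines"
  kubectl logs "deployment/$dep" -n "$ns" --tail=100
}

# Usage: triage production api
```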

Investigation Steps


Check Application Health



1. Check pod status

```bash
kubectl get pods -n production -l app=api
```

Expected output: all pods `Running`

```
NAME                   READY   STATUS    RESTARTS   AGE
api-7d8c9f5b6d-4xk2p   1/1     Running   0          2h
api-7d8c9f5b6d-7nm8r   1/1     Running   0          2h
```

2. Check pod logs for errors

```bash
kubectl logs -f deployment/api -n production --tail=100 | grep -i error
```

3. Check application endpoints
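A quick way to check the endpoints is to probe their HTTP status codes directly. A sketch; the hostname and paths are illustrative placeholders, not from the original:

```shell
# Print the HTTP status code for a URL (000 means no connection at all).
check_endpoint() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Usage (URLs are illustrative):
# check_endpoint https://api.example.com/health   # expect 200
# check_endpoint https://api.example.com/api/...  # expect 2xx/4xx, not 5xx
```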

4. Check database connectivity

```bash
kubectl exec -it deployment/api -n production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```

Check Infrastructure



1. Check load balancer

```bash
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table
```

2. Check DNS resolution

```bash
dig api.example.com
nslookup api.example.com
```

3. Check SSL certificates

```bash
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates
```

4. Check network connectivity

```bash
# A successful TCP connect in the -v output confirms reachability;
# the TLS handshake against the database port is expected to fail.
kubectl exec -it deployment/api -n production -- \
  curl -v https://database.example.com:5432
```

Check Database



1. Check database connections

```bash
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"
```

2. Check for locks

```bash
psql "$DATABASE_URL" -c "
  SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0"
```

3. Check database size

```bash
psql "$DATABASE_URL" -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))"
```

4. Check long-running queries

```bash
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10"
```

Resolution Steps


Option 1: Restart Pods (Quick Fix)


```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```

Option 2: Scale Up (If Overload)


```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```

Option 3: Rollback (If Bad Deploy)


```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to the previous version
kubectl rollout undo deployment/api -n production

# Rollback to a specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```

Option 4: Database Connection Reset


```bash
# If the database connection pool is exhausted, reload the app process
kubectl exec -it deployment/api -n production -- kill -HUP 1

# Or terminate the app's idle database connections
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'api' AND state = 'idle'"
```

Verification



1. Check health endpoint

   Expected: `{"status": "healthy"}`

2. Check API endpoints

   Expected: valid JSON response
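The two checks above can be scripted. A sketch, with `api.example.com` and the `/api/v1/status` path standing in for real values, and `python3 -m json.tool` used as a dependency-free JSON validator:

```shell
# Return success iff stdin is valid JSON (Python stdlib, no jq needed).
is_json() { python3 -m json.tool >/dev/null 2>&1; }

verify_service() {
  local base="${1:-https://api.example.com}"   # illustrative hostname
  # 1. Health endpoint should report healthy
  curl -sf "$base/health" | grep -q '"status": *"healthy"' || return 1
  # 2. A representative API endpoint should return valid JSON
  curl -sf "$base/api/v1/status" | is_json || return 1   # illustrative path
  echo "verification passed"
}
```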

3. Check metrics

   Verify:
   - Error rate < 1%
   - Response time < 500ms
   - All pods healthy

4. Check logs for errors

```bash
kubectl logs deployment/api -n production --tail=100 | grep -i error
```

   Expected: no new errors

Communication


Initial Update (within 5 minutes):
🚨 INCIDENT: Service Down

Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application.
Our team is working to identify the root cause.

Next update in 15 minutes.
Progress Update (every 15 minutes):
🔍 UPDATE: Service Down

Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
Resolution Update:
✅ RESOLVED: Service Down

Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely
and will conduct a post-mortem to prevent future occurrences.
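Updates like these can be posted programmatically. A sketch using a Slack incoming webhook; `SLACK_WEBHOOK_URL` is a placeholder for your channel's webhook:

```shell
# Build the JSON payload for a status update (\n is a JSON-escaped newline).
build_payload() { printf '{"text":"*%s*\\n%s"}' "$1" "$2"; }

# Post an update to the incident channel via a Slack incoming webhook.
post_update() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "$2")" "$SLACK_WEBHOOK_URL"
}

# Usage:
# post_update "INCIDENT: Service Down" "Status: Investigating / Severity: P0"
```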

Post-Incident


  1. Create post-mortem document
    • Timeline of events
    • Root cause analysis
    • Action items to prevent recurrence
  2. Update monitoring
    • Add alerts for this scenario
    • Improve detection time
  3. Update runbook
    • Document any new findings
    • Add shortcuts for faster resolution


Database Issues


High Connection Count


Symptoms:
  • Database rejecting new connections
  • Error: "too many connections"
  • Alert: `db_connections_high`

Quick Fix:

```bash
# 1. Check connection count per application
psql "$DATABASE_URL" -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name"

# 2. Kill idle connections older than 10 minutes
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes'"

# 3. Restart connection pools (rolling restart of the app)
kubectl rollout restart deployment/api -n production
```

Slow Queries


Symptoms:
  • API response times > 5 seconds
  • Database CPU at 100%
  • Alert: `slow_query_detected`

Investigation:
```sql
-- Find slow queries
SELECT
  pid,
  now() - query_start AS duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for possible missing indexes (heavy sequential scans)
SELECT
  schemaname,
  tablename,
  seq_scan,
  seq_tup_read,
  idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill a long-running query (if needed)
SELECT pg_terminate_backend(12345);  -- Replace with the actual PID
```

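Before terminating a query, it is often worth confirming its plan. A sketch wrapping `EXPLAIN (ANALYZE, BUFFERS)` in psql; the example query is illustrative:

```shell
# Show the execution plan (with real timings) for a suspect query.
# Caution: EXPLAIN ANALYZE actually executes the query.
explain_query() {
  psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) $1"
}

# Usage (query is illustrative):
# explain_query "SELECT * FROM orders WHERE customer_id = 42"
# A Seq Scan on a large table usually corroborates the missing-index check.
```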

High CPU/Memory Usage


Symptoms

  • Pods being OOMKilled
  • Response times increasing
  • Alerts: `high_memory_usage`, `high_cpu_usage`

Investigation


```bash
# 1. Check pod resource usage
kubectl top pods -n production

# 2. Check resource limits
kubectl describe pod <pod-name> -n production | grep -A 5 Limits

# 3. Check for memory leaks
kubectl logs deployment/api -n production | grep -i "out of memory"

# 4. Profile the application (if needed)
kubectl exec -it <pod-name> -n production -- sh
# Then run a profiler: node --inspect, py-spy, etc.
```

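For Python services, the `py-spy` mention above can look like this in practice. A sketch; it assumes `py-spy` is installed in the image and the main process is PID 1:

```shell
# Attach py-spy to the main process (PID 1) inside a pod for a
# live, top-like view of the hottest functions.
profile_python_pod() {
  local pod="$1"
  kubectl exec -it "$pod" -n production -- py-spy top --pid 1
}

# One-off stack snapshot instead of a live view:
# kubectl exec -it <pod-name> -n production -- py-spy dump --pid 1
```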

Resolution


```bash
# Option 1: Increase resources
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi

# Option 2: Scale horizontally
kubectl scale deployment/api -n production --replicas=6

# Option 3: Restart problematic pods
kubectl delete pod <pod-name> -n production
```

---

Rollback Procedures


Application Rollback


```bash
# 1. List deployment history
kubectl rollout history deployment/api -n production

# 2. Check a specific revision
kubectl rollout history deployment/api -n production --revision=5

# 3. Rollback to the previous revision
kubectl rollout undo deployment/api -n production

# 4. Rollback to a specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# 5. Verify the rollback
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```

Database Rollback


```bash
# 1. Check migration status
npm run db:migrate:status

# 2. Rollback the last migration
npm run db:migrate:undo

# 3. Rollback to a specific migration (note the extra -- so npm
#    forwards the flag to the underlying script)
npm run db:migrate:undo -- --to 20250115120000-migration-name

# 4. Verify database state
psql "$DATABASE_URL" -c "\dt"
```

---

Escalation Path


  1. Level 1 - On-call Engineer (You)
    • Initial response and investigation
    • Attempt standard fixes from runbook
  2. Level 2 - Senior Engineers
    • Escalate if not resolved in 30 minutes
    • Escalate if issue is complex/unclear
    • Contact via PagerDuty or Slack
  3. Level 3 - Engineering Manager
    • Escalate if not resolved in 1 hour
    • Escalate if cross-team coordination needed
  4. Level 4 - VP Engineering / CTO
    • Escalate for P0 incidents > 2 hours
    • Escalate for security breaches
    • Escalate for data loss


Useful Commands


```bash
# Kubernetes
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
kubectl describe pod <pod-name> -n production
kubectl exec -it <pod-name> -n production -- sh
kubectl top pods -n production

# Database
psql "$DATABASE_URL" -c "SELECT version()"
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity"

# AWS
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...
```

Monitoring URLs


Best Practices


✅ DO


  • Include quick reference section at top
  • Provide exact commands to run
  • Document expected outputs
  • Include verification steps
  • Add communication templates
  • Define severity levels clearly
  • Document escalation paths
  • Include useful links and contacts
  • Keep runbooks up-to-date
  • Test runbooks regularly
  • Include screenshots/diagrams
  • Document common gotchas
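A lightweight way to enforce this structure is to start every new runbook entry from the same skeleton. A sketch; the section names follow this document, and the output path is up to you:

```shell
# Write a skeleton runbook entry that follows the recommended structure.
new_runbook() {
  cat > "$1" <<'EOF'
# <Scenario name>

## Symptoms
<alerts, error messages, user reports>

## Severity
<P0-P3, and why>

## Investigation
<exact commands, with expected output noted>

## Resolution
<options, quickest first>

## Verification
<how to confirm recovery>

## Communication
<templates and channels>
EOF
}

# Usage: new_runbook runbooks/service-down.md
```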

❌ DON'T


  • Use vague instructions
  • Skip verification steps
  • Forget to document prerequisites
  • Assume knowledge of tools
  • Skip communication guidelines
  • Forget to update after incidents

Resources
