backup-restore-runbook-generator


Backup/Restore Runbook Generator

Create reliable disaster recovery procedures for your databases.

Backup Strategy


Database Backup Strategy

Backup Types

1. Full Backup (Daily)

  • When: 2:00 AM UTC
  • Retention: 30 days
  • Storage: S3 s3://backups/full/
  • Size: ~50 GB
  • Duration: ~45 minutes

2. Incremental Backup (Hourly)

  • When: Every hour
  • Retention: 7 days
  • Storage: S3 s3://backups/incremental/
  • Size: ~500 MB
  • Duration: ~5 minutes

3. Transaction Log Backup (Every 15 min)

  • When: Every 15 minutes
  • Retention: 3 days
  • Storage: S3 s3://backups/wal/
  • Point-in-time recovery capability
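A minimal sketch of how the three schedules above could be wired up with cron. The script path, the `postgres` user, and the `full`/`incremental`/`wal` mode argument are assumptions for illustration, not part of this runbook:

```
# /etc/cron.d/db-backups -- hypothetical schedule matching the backup types above

# Full backup daily at 02:00 UTC
0 2 * * *    postgres  /opt/scripts/backup-postgres.sh full
# Incremental backup every hour
0 * * * *    postgres  /opt/scripts/backup-postgres.sh incremental
# Transaction log (WAL) push every 15 minutes
*/15 * * * * postgres  /opt/scripts/backup-postgres.sh wal
```

The server's timezone must be UTC (or the cron daemon configured for it) for the 02:00 UTC full backup to land as intended.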

Backup Automation

PostgreSQL

```bash
#!/bin/bash
# scripts/backup-postgres.sh

set -e

# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

# Create backup directory
mkdir -p "$BACKUP_DIR"

echo "🔄 Starting backup: $FILENAME"

# Full backup with pg_dump
pg_dump \
  --host="$DB_HOST" \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  --format=custom \
  --compress=9 \
  --file="$BACKUP_DIR/$FILENAME" \
  --verbose

# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
  SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
  echo "✅ Backup created: $SIZE"
else
  echo "❌ Backup failed"
  exit 1
fi

# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp "$BACKUP_DIR/$FILENAME" "$S3_BUCKET/" \
  --storage-class STANDARD_IA

# Verify upload
if aws s3 ls "$S3_BUCKET/$FILENAME"; then
  echo "✅ Uploaded to S3"
else
  echo "❌ S3 upload failed"
  exit 1
fi

# Cleanup old local backups (keep last 7 days)
find "$BACKUP_DIR" -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️ Cleaned up old local backups"

# Send notification
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"

echo "✅ Backup complete!"
```
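The script above deletes stale local copies itself; the 30-day retention on the S3 side can be enforced by the bucket rather than by cron jobs. A hedged sketch of an S3 lifecycle rule, assuming the `s3://my-backups/postgres/` layout used by the script:

```json
{
  "Rules": [
    {
      "ID": "expire-postgres-backups",
      "Filter": { "Prefix": "postgres/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Applied once with `aws s3api put-bucket-lifecycle-configuration --bucket my-backups --lifecycle-configuration file://lifecycle.json`; after that, expiration needs no scheduled job at all.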

MySQL

```bash
#!/bin/bash
# scripts/backup-mysql.sh

set -e

DB_NAME="production"
DB_USER="root"
DB_PASSWORD=$MYSQL_PASSWORD
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

echo "🔄 Starting MySQL backup..."

# Backup with mysqldump
mysqldump \
  --user="$DB_USER" \
  --password="$DB_PASSWORD" \
  --single-transaction \
  --quick \
  --lock-tables=false \
  --databases "$DB_NAME" \
  | gzip > "/var/backups/mysql/$FILENAME"

# Upload to S3
aws s3 cp "/var/backups/mysql/$FILENAME" s3://my-backups/mysql/
echo "✅ Backup complete!"
```

Restore Procedures

Full Restore

```bash
#!/bin/bash
# scripts/restore-postgres.sh

set -e

BACKUP_FILE=$1
RESTORE_DB="production_restored"

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore-postgres.sh <backup-file>"
  exit 1
fi

echo "🔄 Starting restore from: $BACKUP_FILE"

# 1. Download from S3
echo "📥 Downloading backup..."
aws s3 cp "s3://my-backups/postgres/$BACKUP_FILE" /tmp/

# 2. Create new database
echo "🗄️ Creating database..."
psql -h "$DB_HOST" -U postgres -c "CREATE DATABASE $RESTORE_DB;"

# 3. Restore backup
echo "🔄 Restoring data..."
pg_restore \
  --host="$DB_HOST" \
  --username=postgres \
  --dbname="$RESTORE_DB" \
  --verbose \
  "/tmp/$BACKUP_FILE"

# 4. Verify restore
echo "✅ Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U postgres -d "$RESTORE_DB" -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "  Tables restored: $TABLE_COUNT"

ROW_COUNT=$(psql -h "$DB_HOST" -U postgres -d "$RESTORE_DB" -t -c "SELECT COUNT(*) FROM users;")
echo "  User rows: $ROW_COUNT"

echo "✅ Restore complete!"
echo "  Database: $RESTORE_DB"
echo "  To use: update application config to point to $RESTORE_DB"
```

Point-in-Time Recovery (PITR)

```bash
#!/bin/bash
# scripts/pitr-restore.sh

set -e

TARGET_TIME=$1  # Format: 2024-01-15 14:30:00

echo "🔄 Point-in-Time Restore to: $TARGET_TIME"

# 1. Restore base backup
echo "📦 Restoring base backup..."
pg_basebackup -D /var/lib/postgresql/data -X stream

# 2. Configure recovery
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF

# 3. Start PostgreSQL
echo "🚀 Starting PostgreSQL in recovery mode..."
systemctl start postgresql

# 4. Wait for recovery
while ! pg_isready; do
  echo "  Waiting for recovery..."
  sleep 5
done

echo "✅ PITR complete!"
```
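One caveat: `recovery.conf` was removed in PostgreSQL 12. On PostgreSQL 12 and later, the same recovery settings go into `postgresql.conf`, and an empty `recovery.signal` file in the data directory triggers recovery mode:

```ini
# postgresql.conf additions (PostgreSQL 12+)
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00'   # substitute the $TARGET_TIME value
recovery_target_action = 'promote'
```

Then create the signal file before starting the server: `touch /var/lib/postgresql/data/recovery.signal`.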

Validation Checks

```bash
#!/bin/bash
# scripts/validate-restore.sh

set -e

DB=$1

echo "🔍 Validating restore..."

# 1. Check table count
TABLES=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "Tables: $TABLES"
if [ "$TABLES" -lt 10 ]; then
  echo "❌ Too few tables restored"
  exit 1
fi

# 2. Check row counts
for table in users products orders; do
  ROWS=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM $table;")
  echo "  $table: $ROWS rows"
  if [ "$ROWS" -lt 1 ]; then
    echo "❌ Table $table is empty"
    exit 1
  fi
done

# 3. Check constraints
CONSTRAINTS=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';")
echo "Foreign keys: $CONSTRAINTS"

# 4. Check indexes
INDEXES=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';")
echo "Indexes: $INDEXES"

# 5. Test query performance
START=$(date +%s%N)
psql -d "$DB" -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Query performance: ${DURATION}ms"
if [ "$DURATION" -gt 1000 ]; then
  echo "⚠️ Slow query - missing indexes?"
fi

echo "✅ Validation complete!"
```

Disaster Recovery Runbook

Incident Response

1. Assess Situation (5 minutes)

  • Identify incident severity (P0/P1/P2)
  • Determine data loss window
  • Notify stakeholders

Contacts:
  • DBA On-Call: [phone]
  • Engineering Lead: [phone]
  • CTO: [phone]

2. Stop the Bleeding (10 minutes)

  • Enable maintenance mode
  • Stop writes to corrupted database
  • Preserve evidence (logs, backups)

```bash
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
```

3. Identify Recovery Point (15 minutes)

  • Determine last good backup
  • Check backup integrity
  • Calculate data loss

```bash
# List available backups
aws s3 ls s3://my-backups/postgres/ | tail -20

# Check backup size
aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable
```
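Calculating the data loss from a chosen backup can be scripted, since the timestamp is embedded in the filename. A small sketch, assuming the `<db>_<YYYYMMDD>_<HHMMSS>.sql.gz` naming scheme used by the backup script (and a database name without underscores; GNU `date` assumed):

```shell
#!/bin/bash
# Extract the YYYYMMDD_HHMMSS timestamp from a backup filename.
backup_timestamp() {
  local name=${1%.sql.gz}   # drop the suffix: production_20240115_020000
  echo "${name#*_}"         # drop the db prefix: 20240115_020000
}

# Minutes elapsed since the backup was taken (an upper bound on the data-loss window).
loss_window_minutes() {
  local ts=$1
  local iso="${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:9:2}:${ts:11:2}:${ts:13:2}"
  echo $(( ( $(date -u +%s) - $(date -u -d "$iso" +%s) ) / 60 ))
}

backup_timestamp "production_20240115_020000.sql.gz"   # -> 20240115_020000
```

With hourly incrementals plus 15-minute WAL backups, the true window is the gap back to the last WAL segment, not to the full backup.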

4. Prepare Recovery Environment (30 minutes)

  • Spin up new database instance
  • Configure networking
  • Test connectivity

```bash
# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier production-recovery \
  --db-instance-class db.r6g.xlarge \
  --engine postgres \
  --master-username postgres \
  --master-user-password [secure-password]
```

5. Execute Restore (1-2 hours)

  • Download backup from S3
  • Run restore script
  • Apply transaction logs (if PITR)
  • Verify data integrity

```bash
# Run restore
./scripts/restore-postgres.sh production_20240115_020000.sql.gz

# Validate
./scripts/validate-restore.sh production_restored
```

6. Validate and Test (30 minutes)

  • Run validation scripts
  • Test critical queries
  • Verify row counts
  • Check data consistency

7. Cutover (15 minutes)

  • Update application config
  • Point DNS to new database
  • Disable maintenance mode
  • Monitor for errors

```bash
# Update connection string
kubectl set env deployment/api DATABASE_URL=postgresql://...

# Scale up
kubectl scale deployment/api --replicas=3
```

8. Post-Recovery (1 hour)

  • Monitor system health
  • Verify user reports
  • Document incident
  • Schedule postmortem

Recovery Time Objective (RTO)

| Scenario        | Target     | Actual     |
|-----------------|------------|------------|
| Full restore    | 2 hours    | [measured] |
| PITR restore    | 3 hours    | [measured] |
| Region failover | 15 minutes | [measured] |
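During quarterly DR drills, the `[measured]` column can be filled in by timing the restore end to end. A hypothetical helper (not part of the runbook scripts) to format a measured duration for the Actual column:

```shell
#!/bin/bash
# Format a drill duration in seconds as "Xh YYm" for the RTO table.
format_rto() {
  printf '%dh %02dm\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 ))
}

# Example drill timing (commented out; requires the restore environment):
# START=$(date +%s)
# ./scripts/restore-postgres.sh production_20240115_020000.sql.gz
# ./scripts/validate-restore.sh production_restored
# format_rto $(( $(date +%s) - START ))

format_rto 5400   # -> 1h 30m
```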

Recovery Point Objective (RPO)

| Backup Type      | Data Loss Window |
|------------------|------------------|
| Full backup      | 24 hours         |
| Incremental      | 1 hour           |
| Transaction logs | 15 minutes       |

Automated Backup Monitoring

```typescript
// scripts/monitor-backups.ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

// Minimal Slack alert helper (uses the global fetch available in Node 18+)
async function sendSlackAlert(message: string): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  });
}

async function checkBackupHealth() {
  const bucket = 'my-backups';
  const prefix = 'postgres/';

  // List recent backups
  const command = new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: 1000,
  });

  const response = await s3.send(command);
  const backups = response.Contents ?? [];

  if (backups.length === 0) {
    console.error('❌ No backups found!');
    await sendSlackAlert('No database backups found!');
    process.exit(1);
  }

  // S3 lists keys lexically, so pick the most recent by LastModified
  const latestBackup = backups.reduce((a, b) =>
    new Date(a.LastModified!) > new Date(b.LastModified!) ? a : b
  );

  // Check last backup age
  const age = Date.now() - new Date(latestBackup.LastModified!).getTime();
  const ageHours = age / (1000 * 60 * 60);

  if (ageHours > 25) {
    console.error('❌ No backup in last 24 hours!');
    await sendSlackAlert('No recent database backup!');
    process.exit(1);
  }

  // Check backup size
  const size = (latestBackup.Size ?? 0) / (1024 * 1024 * 1024); // GB
  if (size < 10) {
    console.error('⚠️  Backup size suspiciously small');
  }

  console.log('✅ Backup health check passed');
  console.log(`  Latest: ${latestBackup.Key}`);
  console.log(`  Age: ${ageHours.toFixed(1)} hours`);
  console.log(`  Size: ${size.toFixed(2)} GB`);
}

checkBackupHealth();
```
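The health check is only useful on a schedule. One hedged option is a cron entry; the install path and `tsx` runner here are assumptions, substitute however the project runs TypeScript:

```
# Hypothetical cron entry: run the backup health check every morning at 06:00 UTC
0 6 * * *  deploy  cd /opt/app && npx tsx scripts/monitor-backups.ts
```

Since the script exits nonzero on failure, it also drops into any CI scheduler (e.g. a nightly pipeline job) unchanged.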

Role Assignments

DR Team Roles

Database Administrator (Primary)

  • Execute restore procedures
  • Verify data integrity
  • Monitor recovery progress

Engineering Lead

  • Coordinate response
  • Communicate with stakeholders
  • Make cutover decisions

DevOps Engineer

  • Provision infrastructure
  • Update application configs
  • Monitor system health

Product Manager

  • Assess business impact
  • Prioritize recovery
  • Handle customer communication

Escalation Path

  1. DBA on-call →
  2. Engineering Lead →
  3. CTO →
  4. CEO (P0 incidents only)

Best Practices

  1. Test restores regularly: Quarterly DR drills
  2. Automate backups: Never rely on manual processes
  3. Multiple locations: Cross-region backup storage
  4. Monitor backup health: Alert on failures
  5. Document procedures: Keep runbook updated
  6. Encrypt backups: Protect sensitive data
  7. Version control: Track backup script changes

Output Checklist

  • Backup automation scripts
  • Restore procedures documented
  • Validation checks defined
  • PITR procedure (if applicable)
  • DR runbook created
  • Role assignments documented
  • RTO/RPO defined
  • Backup monitoring configured