backup-restore-runbook-generator


Backup/Restore Runbook Generator

Create reliable disaster recovery procedures for your databases.

Backup Strategy


Database Backup Strategy

Backup Types

1. Full Backup (Daily)

  • When: 2:00 AM UTC
  • Retention: 30 days
  • Storage: S3 s3://backups/full/
  • Size: ~50 GB
  • Duration: ~45 minutes

2. Incremental Backup (Hourly)

  • When: Every hour
  • Retention: 7 days
  • Storage: S3 s3://backups/incremental/
  • Size: ~500 MB
  • Duration: ~5 minutes

3. Transaction Log Backup (Every 15 min)

  • When: Every 15 minutes
  • Retention: 3 days
  • Storage: S3 s3://backups/wal/
  • Point-in-time recovery capability
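A minimal sketch of how the three schedules above could be wired up with cron. The script path, the `postgres` user, and the `full`/`incremental`/`wal` mode argument are assumptions for illustration, not part of this runbook:

```
# /etc/cron.d/db-backups -- hypothetical schedule matching the backup types above

# Full backup daily at 02:00 UTC
0 2 * * *    postgres  /opt/scripts/backup-postgres.sh full
# Incremental backup every hour
0 * * * *    postgres  /opt/scripts/backup-postgres.sh incremental
# Transaction log (WAL) push every 15 minutes
*/15 * * * * postgres  /opt/scripts/backup-postgres.sh wal
```

The server's timezone must be UTC (or the cron daemon configured for it) for the 02:00 UTC full backup to land as intended.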

Backup Automation

PostgreSQL

```bash
#!/bin/bash
# scripts/backup-postgres.sh

set -e

# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

# Create backup directory
mkdir -p "$BACKUP_DIR"

echo "🔄 Starting backup: $FILENAME"

# Full backup with pg_dump
pg_dump \
  --host="$DB_HOST" \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  --format=custom \
  --compress=9 \
  --file="$BACKUP_DIR/$FILENAME" \
  --verbose

# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
  SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
  echo "✅ Backup created: $SIZE"
else
  echo "❌ Backup failed"
  exit 1
fi

# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp "$BACKUP_DIR/$FILENAME" "$S3_BUCKET/" \
  --storage-class STANDARD_IA

# Verify upload
if aws s3 ls "$S3_BUCKET/$FILENAME"; then
  echo "✅ Uploaded to S3"
else
  echo "❌ S3 upload failed"
  exit 1
fi

# Cleanup old local backups (keep last 7 days)
find "$BACKUP_DIR" -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️ Cleaned up old local backups"

# Send notification
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"

echo "✅ Backup complete!"
```
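The script above deletes stale local copies itself; the 30-day retention on the S3 side can be enforced by the bucket rather than by cron jobs. A hedged sketch of an S3 lifecycle rule, assuming the `s3://my-backups/postgres/` layout used by the script:

```json
{
  "Rules": [
    {
      "ID": "expire-postgres-backups",
      "Filter": { "Prefix": "postgres/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Applied once with `aws s3api put-bucket-lifecycle-configuration --bucket my-backups --lifecycle-configuration file://lifecycle.json`; after that, expiration needs no scheduled job at all.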

MySQL

```bash
#!/bin/bash
# scripts/backup-mysql.sh

set -e

DB_NAME="production"
DB_USER="root"
DB_PASSWORD=$MYSQL_PASSWORD
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

echo "🔄 Starting MySQL backup..."

# Backup with mysqldump
mysqldump \
  --user="$DB_USER" \
  --password="$DB_PASSWORD" \
  --single-transaction \
  --quick \
  --lock-tables=false \
  --databases "$DB_NAME" \
  | gzip > "/var/backups/mysql/$FILENAME"

# Upload to S3
aws s3 cp "/var/backups/mysql/$FILENAME" s3://my-backups/mysql/
echo "✅ Backup complete!"
```

Restore Procedures

Full Restore

```bash
#!/bin/bash
# scripts/restore-postgres.sh

set -e

BACKUP_FILE=$1
RESTORE_DB="production_restored"

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore-postgres.sh <backup-file>"
  exit 1
fi

echo "🔄 Starting restore from: $BACKUP_FILE"

# 1. Download from S3
echo "📥 Downloading backup..."
aws s3 cp "s3://my-backups/postgres/$BACKUP_FILE" /tmp/

# 2. Create new database
echo "🗄️ Creating database..."
psql -h "$DB_HOST" -U postgres -c "CREATE DATABASE $RESTORE_DB;"

# 3. Restore backup
echo "🔄 Restoring data..."
pg_restore \
  --host="$DB_HOST" \
  --username=postgres \
  --dbname="$RESTORE_DB" \
  --verbose \
  "/tmp/$BACKUP_FILE"

# 4. Verify restore
echo "✅ Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U postgres -d "$RESTORE_DB" -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "  Tables restored: $TABLE_COUNT"

ROW_COUNT=$(psql -h "$DB_HOST" -U postgres -d "$RESTORE_DB" -t -c "SELECT COUNT(*) FROM users;")
echo "  User rows: $ROW_COUNT"

echo "✅ Restore complete!"
echo "  Database: $RESTORE_DB"
echo "  To use: update application config to point to $RESTORE_DB"
```

Point-in-Time Recovery (PITR)

```bash
#!/bin/bash
# scripts/pitr-restore.sh

set -e

TARGET_TIME=$1  # Format: 2024-01-15 14:30:00

echo "🔄 Point-in-Time Restore to: $TARGET_TIME"

# 1. Restore base backup
echo "📦 Restoring base backup..."
pg_basebackup -D /var/lib/postgresql/data -X stream

# 2. Configure recovery
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF

# 3. Start PostgreSQL
echo "🚀 Starting PostgreSQL in recovery mode..."
systemctl start postgresql

# 4. Wait for recovery
while ! pg_isready; do
  echo "  Waiting for recovery..."
  sleep 5
done

echo "✅ PITR complete!"
```
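One caveat: `recovery.conf` was removed in PostgreSQL 12. On PostgreSQL 12 and later, the same recovery settings go into `postgresql.conf`, and an empty `recovery.signal` file in the data directory triggers recovery mode:

```ini
# postgresql.conf additions (PostgreSQL 12+)
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00'   # substitute the $TARGET_TIME value
recovery_target_action = 'promote'
```

Then create the signal file before starting the server: `touch /var/lib/postgresql/data/recovery.signal`.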

Validation Checks

```bash
#!/bin/bash
# scripts/validate-restore.sh

set -e

DB=$1

echo "🔍 Validating restore..."

# 1. Check table count
TABLES=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "Tables: $TABLES"
if [ "$TABLES" -lt 10 ]; then
  echo "❌ Too few tables restored"
  exit 1
fi

# 2. Check row counts
for table in users products orders; do
  ROWS=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM $table;")
  echo "  $table: $ROWS rows"
  if [ "$ROWS" -lt 1 ]; then
    echo "❌ Table $table is empty"
    exit 1
  fi
done

# 3. Check constraints
CONSTRAINTS=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';")
echo "Foreign keys: $CONSTRAINTS"

# 4. Check indexes
INDEXES=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';")
echo "Indexes: $INDEXES"

# 5. Test query performance
START=$(date +%s%N)
psql -d "$DB" -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Query performance: ${DURATION}ms"
if [ "$DURATION" -gt 1000 ]; then
  echo "⚠️ Slow query - missing indexes?"
fi

echo "✅ Validation complete!"
```

Disaster Recovery Runbook

Incident Response

1. Assess Situation (5 minutes)

  • Identify incident severity (P0/P1/P2)
  • Determine data loss window
  • Notify stakeholders

Contacts:
  • DBA On-Call: [phone]
  • Engineering Lead: [phone]
  • CTO: [phone]

2. Stop the Bleeding (10 minutes)

  • Enable maintenance mode
  • Stop writes to corrupted database
  • Preserve evidence (logs, backups)

```bash
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
```

3. Identify Recovery Point (15 minutes)

  • Determine last good backup
  • Check backup integrity
  • Calculate data loss

```bash
# List available backups
aws s3 ls s3://my-backups/postgres/ | tail -20

# Check backup size
aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable
```
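Calculating the data loss from a chosen backup can be scripted, since the timestamp is embedded in the filename. A small sketch, assuming the `<db>_<YYYYMMDD>_<HHMMSS>.sql.gz` naming scheme used by the backup script (and a database name without underscores; GNU `date` assumed):

```shell
#!/bin/bash
# Extract the YYYYMMDD_HHMMSS timestamp from a backup filename.
backup_timestamp() {
  local name=${1%.sql.gz}   # drop the suffix: production_20240115_020000
  echo "${name#*_}"         # drop the db prefix: 20240115_020000
}

# Minutes elapsed since the backup was taken (an upper bound on the data-loss window).
loss_window_minutes() {
  local ts=$1
  local iso="${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:9:2}:${ts:11:2}:${ts:13:2}"
  echo $(( ( $(date -u +%s) - $(date -u -d "$iso" +%s) ) / 60 ))
}

backup_timestamp "production_20240115_020000.sql.gz"   # -> 20240115_020000
```

With hourly incrementals plus 15-minute WAL backups, the true window is the gap back to the last WAL segment, not to the full backup.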

4. Prepare Recovery Environment (30 minutes)

  • Spin up new database instance
  • Configure networking
  • Test connectivity

```bash
# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier production-recovery \
  --db-instance-class db.r6g.xlarge \
  --engine postgres \
  --master-username postgres \
  --master-user-password [secure-password]
```

5. Execute Restore (1-2 hours)

  • Download backup from S3
  • Run restore script
  • Apply transaction logs (if PITR)
  • Verify data integrity

```bash
# Run restore
./scripts/restore-postgres.sh production_20240115_020000.sql.gz

# Validate
./scripts/validate-restore.sh production_restored
```

6. Validate and Test (30 minutes)

  • Run validation scripts
  • Test critical queries
  • Verify row counts
  • Check data consistency

7. Cutover (15 minutes)

  • Update application config
  • Point DNS to new database
  • Disable maintenance mode
  • Monitor for errors

```bash
# Update connection string
kubectl set env deployment/api DATABASE_URL=postgresql://...

# Scale up
kubectl scale deployment/api --replicas=3
```

8. Post-Recovery (1 hour)

  • Monitor system health
  • Verify user reports
  • Document incident
  • Schedule postmortem

Recovery Time Objective (RTO)

| Scenario        | Target     | Actual     |
|-----------------|------------|------------|
| Full restore    | 2 hours    | [measured] |
| PITR restore    | 3 hours    | [measured] |
| Region failover | 15 minutes | [measured] |
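During quarterly DR drills, the `[measured]` column can be filled in by timing the restore end to end. A hypothetical helper (not part of the runbook scripts) to format a measured duration for the Actual column:

```shell
#!/bin/bash
# Format a drill duration in seconds as "Xh YYm" for the RTO table.
format_rto() {
  printf '%dh %02dm\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 ))
}

# Example drill timing (commented out; requires the restore environment):
# START=$(date +%s)
# ./scripts/restore-postgres.sh production_20240115_020000.sql.gz
# ./scripts/validate-restore.sh production_restored
# format_rto $(( $(date +%s) - START ))

format_rto 5400   # -> 1h 30m
```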

Recovery Point Objective (RPO)

| Backup Type      | Data Loss Window |
|------------------|------------------|
| Full backup      | 24 hours         |
| Incremental      | 1 hour           |
| Transaction logs | 15 minutes       |

Automated Backup Monitoring

```typescript
// scripts/monitor-backups.ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

// Minimal Slack alert helper (uses the global fetch available in Node 18+)
async function sendSlackAlert(message: string): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  });
}

async function checkBackupHealth() {
  const bucket = 'my-backups';
  const prefix = 'postgres/';

  // List recent backups
  const command = new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: 1000,
  });

  const response = await s3.send(command);
  const backups = response.Contents ?? [];

  if (backups.length === 0) {
    console.error('❌ No backups found!');
    await sendSlackAlert('No database backups found!');
    process.exit(1);
  }

  // S3 lists keys lexically, so pick the most recent by LastModified
  const latestBackup = backups.reduce((a, b) =>
    new Date(a.LastModified!) > new Date(b.LastModified!) ? a : b
  );

  // Check last backup age
  const age = Date.now() - new Date(latestBackup.LastModified!).getTime();
  const ageHours = age / (1000 * 60 * 60);

  if (ageHours > 25) {
    console.error('❌ No backup in last 24 hours!');
    await sendSlackAlert('No recent database backup!');
    process.exit(1);
  }

  // Check backup size
  const size = (latestBackup.Size ?? 0) / (1024 * 1024 * 1024); // GB
  if (size < 10) {
    console.error('⚠️  Backup size suspiciously small');
  }

  console.log('✅ Backup health check passed');
  console.log(`  Latest: ${latestBackup.Key}`);
  console.log(`  Age: ${ageHours.toFixed(1)} hours`);
  console.log(`  Size: ${size.toFixed(2)} GB`);
}

checkBackupHealth();
```
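The health check is only useful on a schedule. One hedged option is a cron entry; the install path and `tsx` runner here are assumptions, substitute however the project runs TypeScript:

```
# Hypothetical cron entry: run the backup health check every morning at 06:00 UTC
0 6 * * *  deploy  cd /opt/app && npx tsx scripts/monitor-backups.ts
```

Since the script exits nonzero on failure, it also drops into any CI scheduler (e.g. a nightly pipeline job) unchanged.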

Role Assignments

DR Team Roles

Database Administrator (Primary)

  • Execute restore procedures
  • Verify data integrity
  • Monitor recovery progress

Engineering Lead

  • Coordinate response
  • Communicate with stakeholders
  • Make cutover decisions

DevOps Engineer

  • Provision infrastructure
  • Update application configs
  • Monitor system health

Product Manager

  • Assess business impact
  • Prioritize recovery
  • Handle customer communication

Escalation Path

  1. DBA on-call →
  2. Engineering Lead →
  3. CTO →
  4. CEO (P0 incidents only)

Best Practices

  1. Test restores regularly: Quarterly DR drills
  2. Automate backups: Never rely on manual processes
  3. Multiple locations: Cross-region backup storage
  4. Monitor backup health: Alert on failures
  5. Document procedures: Keep runbook updated
  6. Encrypt backups: Protect sensitive data
  7. Version control: Track backup script changes

Output Checklist

  • Backup automation scripts
  • Restore procedures documented
  • Validation checks defined
  • PITR procedure (if applicable)
  • DR runbook created
  • Role assignments documented
  • RTO/RPO defined
  • Backup monitoring configured