# Backup/Restore Runbook Generator

Create reliable disaster recovery procedures for your databases.
## Backup Strategy

### Backup Types
#### 1. Full Backup (Daily)
- When: 2:00 AM UTC
- Retention: 30 days
- Storage: S3 `s3://backups/full/`
- Size: ~50 GB
- Duration: ~45 minutes
#### 2. Incremental Backup (Hourly)
- When: Every hour
- Retention: 7 days
- Storage: S3 `s3://backups/incremental/`
- Size: ~500 MB
- Duration: ~5 minutes
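The retention figures above imply a rough steady-state storage footprint, which is worth sanity-checking with quick shell arithmetic (the sizes are the estimates quoted above, not measurements):

```shell
# Approximate steady-state S3 footprint implied by the retention policy:
# 30 daily full backups at ~50 GB each, plus 7 days of hourly
# incrementals at ~500 MB each.
full_gb=$(( 30 * 50 ))
incr_gb=$(( 7 * 24 * 500 / 1000 ))
echo "full: ${full_gb} GB, incremental: ~${incr_gb} GB"
```

The full backups dominate the footprint, which is why the upload step later uses a cheaper storage class for them.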
#### 3. Transaction Log Backup (Every 15 min)
- When: Every 15 minutes
- Retention: 3 days
- Storage: S3 `s3://backups/wal/`
- Provides point-in-time recovery capability
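One way to wire up these three cadences is a system crontab. A sketch only: the paths, the `postgres` user, and the per-type script names are placeholders, not part of the runbook:

```shell
# /etc/cron.d/db-backups -- hypothetical schedule (paths/user are placeholders)

# Daily full backup at 2:00 AM UTC
0 2 * * *    postgres /opt/scripts/backup-postgres.sh

# Hourly incremental backup
0 * * * *    postgres /opt/scripts/backup-incremental.sh

# Transaction log (WAL) shipping every 15 minutes
*/15 * * * * postgres /opt/scripts/backup-wal.sh
```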
### Backup Automation

#### PostgreSQL

```bash
#!/bin/bash
# scripts/backup-postgres.sh

set -e

# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

# Create backup directory
mkdir -p "$BACKUP_DIR"

echo "🔄 Starting backup: $FILENAME"

# Full backup with pg_dump
pg_dump \
  --host="$DB_HOST" \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  --format=custom \
  --compress=9 \
  --file="$BACKUP_DIR/$FILENAME" \
  --verbose

# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
  SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
  echo "✅ Backup created: $SIZE"
else
  echo "❌ Backup failed"
  exit 1
fi

# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp "$BACKUP_DIR/$FILENAME" "$S3_BUCKET/" \
  --storage-class STANDARD_IA

# Verify upload
if aws s3 ls "$S3_BUCKET/$FILENAME"; then
  echo "✅ Uploaded to S3"
else
  echo "❌ S3 upload failed"
  exit 1
fi

# Cleanup old local backups (keep last 7 days)
find "$BACKUP_DIR" -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️ Cleaned up old local backups"

# Send notification
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"

echo "✅ Backup complete!"
```
#### MySQL

```bash
#!/bin/bash
# scripts/backup-mysql.sh

set -e

DB_NAME="production"
DB_USER="root"
DB_PASSWORD=$MYSQL_PASSWORD
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

echo "🔄 Starting MySQL backup..."

# Backup with mysqldump
mysqldump \
  --user="$DB_USER" \
  --password="$DB_PASSWORD" \
  --single-transaction \
  --quick \
  --lock-tables=false \
  --databases "$DB_NAME" \
  | gzip > "/var/backups/mysql/$FILENAME"

# Upload to S3
aws s3 cp "/var/backups/mysql/$FILENAME" s3://my-backups/mysql/

echo "✅ Backup complete!"
```
## Restore Procedures

### Full Restore

```bash
#!/bin/bash
# scripts/restore-postgres.sh

set -e

BACKUP_FILE=$1
RESTORE_DB="production_restored"

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore-postgres.sh <backup-file>"
  exit 1
fi

echo "🔄 Starting restore from: $BACKUP_FILE"

# 1. Download from S3
echo "📥 Downloading backup..."
aws s3 cp "s3://my-backups/postgres/$BACKUP_FILE" /tmp/

# 2. Create new database
echo "🗄️ Creating database..."
psql -h "$DB_HOST" -U postgres -c "CREATE DATABASE $RESTORE_DB;"

# 3. Restore backup
echo "🔄 Restoring data..."
pg_restore \
  --host="$DB_HOST" \
  --username=postgres \
  --dbname="$RESTORE_DB" \
  --verbose \
  "/tmp/$BACKUP_FILE"

# 4. Verify restore
echo "✅ Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U postgres -d "$RESTORE_DB" -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "   Tables restored: $TABLE_COUNT"

ROW_COUNT=$(psql -h "$DB_HOST" -U postgres -d "$RESTORE_DB" -t -c "SELECT COUNT(*) FROM users;")
echo "   User rows: $ROW_COUNT"

echo "✅ Restore complete!"
echo "   Database: $RESTORE_DB"
echo "   To use: update application config to point to $RESTORE_DB"
```
### Point-in-Time Recovery (PITR)

```bash
#!/bin/bash
# scripts/pitr-restore.sh

set -e

TARGET_TIME=$1  # Format: 2024-01-15 14:30:00

echo "🔄 Point-in-Time Restore to: $TARGET_TIME"

# 1. Restore base backup
echo "📦 Restoring base backup..."
pg_basebackup -D /var/lib/postgresql/data -X stream

# 2. Configure recovery
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF

# 3. Start PostgreSQL
echo "🚀 Starting PostgreSQL in recovery mode..."
systemctl start postgresql

# 4. Wait for recovery
while ! pg_isready; do
  echo "   Waiting for recovery..."
  sleep 5
done

echo "✅ PITR complete!"
```
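Note that `recovery.conf` only exists on PostgreSQL 11 and earlier; version 12 removed it. A sketch of the equivalent step on newer versions, assuming the same hypothetical S3 WAL layout as above:

```shell
# PostgreSQL 12+: the recovery settings live in postgresql.conf (or
# postgresql.auto.conf), and an empty recovery.signal file in the data
# directory tells the server to enter targeted recovery on startup.
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal
```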
### Validation Checks

```bash
#!/bin/bash
# scripts/validate-restore.sh

DB=$1

echo "🔍 Validating restore..."

# 1. Check table count
TABLES=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "Tables: $TABLES"
if [ "$TABLES" -lt 10 ]; then
  echo "❌ Too few tables restored"
  exit 1
fi

# 2. Check row counts
for table in users products orders; do
  ROWS=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM $table;")
  echo "   $table: $ROWS rows"
  if [ "$ROWS" -lt 1 ]; then
    echo "❌ Table $table is empty"
    exit 1
  fi
done

# 3. Check constraints
CONSTRAINTS=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';")
echo "Foreign keys: $CONSTRAINTS"

# 4. Check indexes
INDEXES=$(psql -d "$DB" -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';")
echo "Indexes: $INDEXES"

# 5. Test query performance
START=$(date +%s%N)
psql -d "$DB" -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Query performance: ${DURATION}ms"
if [ "$DURATION" -gt 1000 ]; then
  echo "⚠️ Slow query - missing indexes?"
fi

echo "✅ Validation complete!"
```
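The per-table check above is binary (empty or not). When restoring a snapshot of a live database, row counts rarely match the source exactly, so a tolerance-based comparison against known-good counts can be more informative. A minimal sketch; the helper name and the 5% default are illustrative, not part of the runbook:

```shell
# check_row_count: pass if ACTUAL is within TOLERANCE_PCT below EXPECTED.
check_row_count() {
  local expected=$1 actual=$2 tolerance_pct=${3:-5}
  local min=$(( expected * (100 - tolerance_pct) / 100 ))
  if [ "$actual" -ge "$min" ]; then
    echo "OK ($actual/$expected)"
  else
    echo "FAIL ($actual/$expected)"
    return 1
  fi
}

check_row_count 1000 980          # within 5% -> OK (980/1000)
check_row_count 1000 900 || true  # >5% short  -> FAIL (900/1000)
```

The expected counts would come from a recent snapshot of the source database (e.g. a nightly `SELECT COUNT(*)` report), which is itself worth automating.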
## Disaster Recovery Runbook

### Incident Response

#### 1. Assess Situation (5 minutes)
- Identify incident severity (P0/P1/P2)
- Determine data loss window
- Notify stakeholders

Contacts:
- DBA On-Call: [phone]
- Engineering Lead: [phone]
- CTO: [phone]
#### 2. Stop the Bleeding (10 minutes)
- Enable maintenance mode
- Stop writes to the corrupted database
- Preserve evidence (logs, backups)

```bash
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
```
#### 3. Identify Recovery Point (15 minutes)
- Determine the last good backup
- Check backup integrity
- Calculate data loss

```bash
# List available backups
aws s3 ls s3://my-backups/postgres/ | tail -20

# Check backup size
aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable
```
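Before committing to a multi-hour restore, it also pays to confirm the downloaded archive is intact. For gzip-compressed dumps (such as the MySQL backups above), `gzip -t` catches truncated or corrupted downloads; custom-format PostgreSQL dumps can be sanity-checked with `pg_restore --list` instead. A small sketch, demonstrated on a throwaway file so it is self-contained:

```shell
# check_gzip: fail fast on a truncated/corrupted .gz before restoring.
check_gzip() {
  if gzip -t "$1" 2>/dev/null; then
    echo "OK: $1"
  else
    echo "CORRUPT: $1"
    return 1
  fi
}

# Demonstration with a throwaway archive (stand-in for a real download):
echo "demo" | gzip > /tmp/demo.sql.gz
check_gzip /tmp/demo.sql.gz
```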
#### 4. Prepare Recovery Environment (30 minutes)
- Spin up a new database instance
- Configure networking
- Test connectivity

```bash
# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier production-recovery \
  --db-instance-class db.r6g.xlarge \
  --engine postgres \
  --master-username postgres \
  --master-user-password [secure-password]
```
#### 5. Execute Restore (1-2 hours)
- Download the backup from S3
- Run the restore script
- Apply transaction logs (if using PITR)
- Verify data integrity

```bash
# Run restore
./scripts/restore-postgres.sh production_20240115_020000.sql.gz

# Validate
./scripts/validate-restore.sh production_restored
```
#### 6. Validate and Test (30 minutes)
- Run validation scripts
- Test critical queries
- Verify row counts
- Check data consistency
#### 7. Cutover (15 minutes)
- Update application config
- Point DNS to the new database
- Disable maintenance mode
- Monitor for errors

```bash
# Update connection string
kubectl set env deployment/api DATABASE_URL=postgresql://...

# Scale up
kubectl scale deployment/api --replicas=3
```
#### 8. Post-Recovery (1 hour)
- Monitor system health
- Verify user reports
- Document the incident
- Schedule a postmortem
### Recovery Time Objective (RTO)

| Scenario | Target | Actual |
|---|---|---|
| Full restore | 2 hours | [measured] |
| PITR restore | 3 hours | [measured] |
| Region failover | 15 minutes | [measured] |
### Recovery Point Objective (RPO)

| Backup Type | Data Loss Window |
|---|---|
| Full backup | 24 hours |
| Incremental | 1 hour |
| Transaction logs | 15 minutes |
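The data-loss windows above are simply the backup intervals; in the worst case you also lose whatever changed while the last backup was being taken or shipped. A small helper to make that explicit when setting RPO targets (the function name and the shipping-time argument are illustrative):

```shell
# worst_case_rpo: backup interval plus the time to take/ship the backup,
# both in minutes -- the real data-loss ceiling for a given backup type.
worst_case_rpo() {
  local interval_min=$1 ship_min=$2
  echo $(( interval_min + ship_min ))
}

worst_case_rpo 15 1   # WAL shipped every 15 min, ~1 min to upload
worst_case_rpo 60 5   # hourly incremental, ~5 min to run
```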
## Automated Backup Monitoring

```typescript
// scripts/monitor-backups.ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

// Placeholder: wire this up to your real alerting channel.
async function sendSlackAlert(message: string): Promise<void> {
  console.error(`ALERT: ${message}`);
}

async function checkBackupHealth() {
  const bucket = 'my-backups';
  const prefix = 'postgres/';

  // List recent backups
  const command = new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: 10,
  });
  const response = await s3.send(command);

  // S3 returns keys in ascending lexicographic order, so sort by
  // LastModified (newest first) before picking the latest backup
  const backups = (response.Contents ?? []).sort(
    (a, b) =>
      new Date(b.LastModified!).getTime() - new Date(a.LastModified!).getTime()
  );

  if (backups.length === 0) {
    console.error('❌ No backups found!');
    process.exit(1);
  }

  // Check last backup age
  const latestBackup = backups[0];
  const age = Date.now() - new Date(latestBackup.LastModified!).getTime();
  const ageHours = age / (1000 * 60 * 60);

  if (ageHours > 25) {
    console.error('❌ No backup in last 24 hours!');
    await sendSlackAlert('No recent database backup!');
    process.exit(1);
  }

  // Check backup size
  const size = latestBackup.Size! / (1024 * 1024 * 1024); // GB
  if (size < 10) {
    console.error('⚠️ Backup size suspiciously small');
  }

  console.log('✅ Backup health check passed');
  console.log(`   Latest: ${latestBackup.Key}`);
  console.log(`   Age: ${ageHours.toFixed(1)} hours`);
  console.log(`   Size: ${size.toFixed(2)} GB`);
}

checkBackupHealth();
```

## Role Assignments

### DR Team Roles

#### Database Administrator (Primary)
- Execute restore procedures
- Verify data integrity
- Monitor recovery progress
#### Engineering Lead
- Coordinate response
- Communicate with stakeholders
- Make cutover decisions
#### DevOps Engineer
- Provision infrastructure
- Update application configs
- Monitor system health
#### Product Manager
- Assess business impact
- Prioritize recovery
- Handle customer communication
### Escalation Path

- DBA on-call →
- Engineering Lead →
- CTO →
- CEO (P0 incidents only)
## Best Practices

- Test restores regularly: quarterly DR drills
- Automate backups: never rely on manual processes
- Multiple locations: cross-region backup storage
- Monitor backup health: alert on failures
- Document procedures: keep the runbook updated
- Encrypt backups: protect sensitive data
- Version control: track backup script changes
## Output Checklist

- Backup automation scripts
- Restore procedures documented
- Validation checks defined
- PITR procedure (if applicable)
- DR runbook created
- Role assignments documented
- RTO/RPO defined
- Backup monitoring configured