Backup/Restore Runbook Generator

Create reliable disaster recovery procedures for your databases.

Backup Strategy

markdown

# Database Backup Strategy

## Backup Types

### 1. Full Backup (Daily)

- **When**: 2:00 AM UTC
- **Retention**: 30 days
- **Storage**: S3 `s3://backups/full/`
- **Size**: ~50 GB
- **Duration**: ~45 minutes

### 2. Incremental Backup (Hourly)

- **When**: Every hour
- **Retention**: 7 days
- **Storage**: S3 `s3://backups/incremental/`
- **Size**: ~500 MB
- **Duration**: ~5 minutes

### 3. Transaction Log Backup (Every 15 min)

- **When**: Every 15 minutes
- **Retention**: 3 days
- **Storage**: S3 `s3://backups/wal/`
- **Point-in-time recovery capability**

## Backup Automation

### PostgreSQL

```bash
#!/bin/bash
# scripts/backup-postgres.sh

set -e

# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

# Create backup directory
mkdir -p $BACKUP_DIR

echo "🔄 Starting backup: $FILENAME"

# Full backup with pg_dump
pg_dump \
  --host=$DB_HOST \
  --username=$DB_USER \
  --dbname=$DB_NAME \
  --format=custom \
  --compress=9 \
  --file=$BACKUP_DIR/$FILENAME \
  --verbose

# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
  SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
  echo "✅ Backup created: $SIZE"
else
  echo "❌ Backup failed"
  exit 1
fi

# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp $BACKUP_DIR/$FILENAME $S3_BUCKET/ \
  --storage-class STANDARD_IA

# Verify upload
if aws s3 ls $S3_BUCKET/$FILENAME; then
  echo "✅ Uploaded to S3"
else
  echo "❌ S3 upload failed"
  exit 1
fi

# Cleanup old local backups (keep last 7 days)
find $BACKUP_DIR -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️  Cleaned up old local backups"

# Send notification
curl -X POST $SLACK_WEBHOOK \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"

echo "✅ Backup complete!"
```

MySQL

bash

#!/bin/bash
# scripts/backup-mysql.sh

set -e

DB_NAME="production"
DB_USER="root"
DB_PASSWORD=$MYSQL_PASSWORD
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

echo "🔄 Starting MySQL backup..."

# Backup with mysqldump
mysqldump \
  --user=$DB_USER \
  --password=$DB_PASSWORD \
  --single-transaction \
  --quick \
  --lock-tables=false \
  --databases $DB_NAME \
  | gzip > /var/backups/mysql/$FILENAME

# Upload to S3
aws s3 cp /var/backups/mysql/$FILENAME s3://my-backups/mysql/

echo "✅ Backup complete!"

Restore Procedures

Full Restore

bash

#!/bin/bash
# scripts/restore-postgres.sh

set -e

BACKUP_FILE=$1
RESTORE_DB="production_restored"

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore-postgres.sh <backup-file>"
  exit 1
fi

echo "🔄 Starting restore from: $BACKUP_FILE"

# 1. Download from S3
echo "📥 Downloading backup..."
aws s3 cp s3://my-backups/postgres/$BACKUP_FILE /tmp/

# 2. Create new database
echo "🗄️  Creating database..."
psql -h $DB_HOST -U postgres -c "CREATE DATABASE $RESTORE_DB;"

# 3. Restore backup
echo "🔄 Restoring data..."
pg_restore \
  --host=$DB_HOST \
  --username=postgres \
  --dbname=$RESTORE_DB \
  --verbose \
  /tmp/$BACKUP_FILE

# 4. Verify restore
echo "✅ Verifying restore..."
TABLE_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "  Tables restored: $TABLE_COUNT"

ROW_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM users;")
echo "  User rows: $ROW_COUNT"

echo "✅ Restore complete!"
echo "  Database: $RESTORE_DB"
echo "  To use: UPDATE application config to point to $RESTORE_DB"

Point-in-Time Recovery (PITR)

bash

#!/bin/bash
# scripts/pitr-restore.sh

TARGET_TIME=$1  # Format: 2024-01-15 14:30:00

echo "🔄 Point-in-Time Restore to: $TARGET_TIME"

# 1. Restore base backup
echo "📦 Restoring base backup..."
pg_basebackup -D /var/lib/postgresql/data -X stream

# 2. Configure recovery
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF

# 3. Start PostgreSQL
echo "🚀 Starting PostgreSQL in recovery mode..."
systemctl start postgresql

# 4. Wait for recovery
while ! pg_isready; do
  echo "  Waiting for recovery..."
  sleep 5
done

echo "✅ PITR complete!"

Validation Checks

bash

#!/bin/bash
# scripts/validate-restore.sh

DB=$1

echo "🔍 Validating restore..."

# 1. Check table count
TABLES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "Tables: $TABLES"

if [ "$TABLES" -lt 10 ]; then
  echo "❌ Too few tables restored"
  exit 1
fi

# 2. Check row counts
for table in users products orders; do
  ROWS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM $table;")
  echo "  $table: $ROWS rows"

  if [ "$ROWS" -lt 1 ]; then
    echo "❌ Table $table is empty"
    exit 1
  fi
done

# 3. Check constraints
CONSTRAINTS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';")
echo "Foreign keys: $CONSTRAINTS"

# 4. Check indexes
INDEXES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';")
echo "Indexes: $INDEXES"

# 5. Test query performance
START=$(date +%s%N)
psql -d $DB -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Query performance: ${DURATION}ms"

if [ "$DURATION" -gt 1000 ]; then
  echo "⚠️  Slow query - missing indexes?"
fi

echo "✅ Validation complete!"

Disaster Recovery Runbook

markdown

# Disaster Recovery Runbook

## Incident Response

### 1. Assess Situation (5 minutes)

- [ ] Identify incident severity (P0/P1/P2)
- [ ] Determine data loss window
- [ ] Notify stakeholders

**Contacts:**

- DBA On-Call: [phone]
- Engineering Lead: [phone]
- CTO: [phone]

### 2. Stop the Bleeding (10 minutes)

- [ ] Enable maintenance mode
- [ ] Stop writes to corrupted database
- [ ] Preserve evidence (logs, backups)

```bash
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
```

3. Identify Recovery Point (15 minutes)

Determine last good backup
Check backup integrity
Calculate data loss

bash

# List available backups
aws s3 ls s3://my-backups/postgres/ | tail -20

# Check backup size
aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable

4. Prepare Recovery Environment (30 minutes)

Spin up new database instance
Configure networking
Test connectivity

bash

# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier production-recovery \
  --db-instance-class db.r6g.xlarge \
  --engine postgres \
  --master-username postgres \
  --master-user-password [secure-password]

5. Execute Restore (1-2 hours)

Download backup from S3
Run restore script
Apply transaction logs (if PITR)
Verify data integrity

bash

# Run restore
./scripts/restore-postgres.sh production_20240115_020000.sql.gz

# Validate
./scripts/validate-restore.sh production_restored

6. Validate and Test (30 minutes)

Run validation scripts
Test critical queries
Verify row counts
Check data consistency

7. Cutover (15 minutes)

Update application config
Point DNS to new database
Disable maintenance mode
Monitor for errors

bash

# Update connection string
kubectl set env deployment/api DATABASE_URL=postgresql://...

# Scale up
kubectl scale deployment/api --replicas=3

8. Post-Recovery (1 hour)

Monitor system health
Verify user reports
Document incident
Schedule postmortem

Recovery Time Objective (RTO)

Scenario	Target	Actual
Full restore	2 hours	[measured]
PITR restore	3 hours	[measured]
Region failover	15 minutes	[measured]

Recovery Point Objective (RPO)

Backup Type	Data Loss Window
Full backup	24 hours
Incremental	1 hour
Transaction logs	15 minutes


## Automated Backup Monitoring

```typescript
// scripts/monitor-backups.ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

async function checkBackupHealth() {
  const bucket = 'my-backups';
  const prefix = 'postgres/';

  // List recent backups
  const command = new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: 10,
  });

  const response = await s3.send(command);
  const backups = response.Contents || [];

  // Check last backup age
  const latestBackup = backups[0];
  const age = Date.now() - new Date(latestBackup.LastModified!).getTime();
  const ageHours = age / (1000 * 60 * 60);

  if (ageHours > 25) {
    console.error('❌ No backup in last 24 hours!');
    // Send alert
    await sendSlackAlert('No recent database backup!');
    process.exit(1);
  }

  // Check backup size
  const size = latestBackup.Size! / (1024 * 1024 * 1024); // GB
  if (size < 10) {
    console.error('⚠️  Backup size suspiciously small');
  }

  console.log('✅ Backup health check passed');
  console.log(`  Latest: ${latestBackup.Key}`);
  console.log(`  Age: ${ageHours.toFixed(1)} hours`);
  console.log(`  Size: ${size.toFixed(2)} GB`);
}

checkBackupHealth();

Role Assignments

markdown

## DR Team Roles

### Database Administrator (Primary)

- Execute restore procedures
- Verify data integrity
- Monitor recovery progress

### Engineering Lead

- Coordinate response
- Communicate with stakeholders
- Make cutover decisions

### DevOps Engineer

- Provision infrastructure
- Update application configs
- Monitor system health

### Product Manager

- Assess business impact
- Prioritize recovery
- Customer communication

## Escalation Path

1. DBA on-call →
2. Engineering Lead →
3. CTO →
4. CEO (P0 incidents only)

Best Practices

Test restores regularly: Quarterly DR drills
Automate backups: Never rely on manual processes
Multiple locations: Cross-region backup storage
Monitor backup health: Alert on failures
Document procedures: Keep runbook updated
Encrypt backups: Protect sensitive data
Version control: Track backup script changes

Output Checklist

Backup automation scripts
Restore procedures documented
Validation checks defined
PITR procedure (if applicable)
DR runbook created
Role assignments documented
RTO/RPO defined
Backup monitoring configured

backup-restore-runbook-generator

NPX Install

Tags

SKILL.md Content

Backup/Restore Runbook Generator

Backup Strategy

MySQL

Restore Procedures

Full Restore

Point-in-Time Recovery (PITR)

Validation Checks

Disaster Recovery Runbook

3. Identify Recovery Point (15 minutes)

4. Prepare Recovery Environment (30 minutes)

5. Execute Restore (1-2 hours)

6. Validate and Test (30 minutes)

7. Cutover (15 minutes)

8. Post-Recovery (1 hour)

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

Role Assignments

Best Practices

Output Checklist