disaster-recovery-testing

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Disaster Recovery Testing

灾难恢复测试

Overview

概述

Implement systematic disaster recovery testing to validate recovery procedures, measure RTO/RPO, identify gaps, and ensure team readiness for actual incidents.
实施系统化的灾难恢复测试,以验证恢复流程、衡量RTO/RPO、识别差距,并确保团队为实际事件做好准备。

When to Use

适用场景

  • Annual DR exercises
  • Infrastructure changes
  • New service deployments
  • Compliance requirements
  • Team training
  • Recovery procedure validation
  • Cross-region failover testing
  • 年度灾难恢复演练
  • 基础设施变更后
  • 新服务部署后
  • 合规要求
  • 团队培训
  • 恢复流程验证
  • 跨区域故障转移测试

Implementation Examples

实施示例

1. DR Test Plan and Execution

1. 灾难恢复测试计划与执行

yaml
undefined
yaml
undefined

dr-test-plan.yaml

dr-test-plan.yaml

apiVersion: v1 kind: ConfigMap metadata: name: dr-test-procedures namespace: operations data: dr-test-plan.md: | # Disaster Recovery Test Plan
## Test Objectives
- Validate backup restoration procedures
- Verify failover mechanisms
- Test DNS failover
- Validate data integrity post-recovery
- Measure RTO and RPO
- Train incident response team

## Pre-Test Checklist
- [ ] Notify stakeholders
- [ ] Schedule 4-6 hour window
- [ ] Disable alerting to prevent noise
- [ ] Backup production data
- [ ] Ensure DR environment is isolated
- [ ] Have rollback plan ready

## Test Scope
- Primary database failover to standby
- Application failover to DR site
- DNS resolution update
- Load balancer health checks
- Data synchronization verification

## Success Criteria
- RTO: < 1 hour
- RPO: < 15 minutes
- Zero data loss
- All services operational
- Alerts functional

## Post-Test Activities
- Document timeline
- Identify gaps
- Update procedures
- Schedule post-mortem
- Update team documentation

apiVersion: batch/v1 kind: Job metadata: name: dr-test-executor namespace: operations spec: template: spec: serviceAccountName: dr-test-sa containers: - name: executor image: alpine:latest env: - name: TEST_ID value: "dr-test-$(date +%s)" - name: BACKUP_BUCKET value: "s3://my-backups" - name: DR_NAMESPACE value: "dr-test" command: - sh - -c - | apk add --no-cache aws-cli kubectl jq postgresql-client mysql-client
          echo "Starting DR Test: $TEST_ID"

          # Step 1: Create test namespace
          echo "Creating isolated test environment..."
          kubectl create namespace "$DR_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

          # Step 2: Restore database from backup
          echo "Restoring database from latest backup..."
          LATEST_BACKUP=$(aws s3 ls "$BACKUP_BUCKET/databases/" | \
            sort | tail -n 1 | awk '{print $4}')

          aws s3 cp "$BACKUP_BUCKET/databases/$LATEST_BACKUP" - | \
            gunzip | psql postgres://user:pass@dr-db:5432/testdb

          # Step 3: Deploy application to DR namespace
          echo "Deploying application to DR environment..."
          kubectl set image deployment/myapp \
            myapp=myrepo/myapp:production \
            -n "$DR_NAMESPACE"

          # Step 4: Run health checks
          echo "Running health checks..."
          for i in {1..30}; do
            if curl -sf http://myapp-dr/health > /dev/null; then
              echo "Health check passed"
              break
            fi
            echo "Waiting for service to be healthy... ($i/30)"
            sleep 10
          done

          # Step 5: Run smoke tests
          echo "Running smoke tests..."
          kubectl exec -it deployment/myapp -n "$DR_NAMESPACE" -- \
            npm run test:smoke || exit 1

          # Step 6: Validate data integrity
          echo "Validating data integrity..."
          PROD_RECORD_COUNT=$(psql postgres://user:pass@prod-db:5432/mydb \
            -t -c "SELECT COUNT(*) FROM users;")
          DR_RECORD_COUNT=$(psql postgres://user:pass@dr-db:5432/testdb \
            -t -c "SELECT COUNT(*) FROM users;")

          if [ "$PROD_RECORD_COUNT" -eq "$DR_RECORD_COUNT" ]; then
            echo "Data integrity verified"
          else
            echo "Data integrity check failed"
            exit 1
          fi

          # Step 7: Record metrics
          echo "Recording DR test metrics..."
          kubectl logs deployment/myapp -n "$DR_NAMESPACE" | \
            grep "startup_time" | jq '.' > /tmp/dr-metrics-$TEST_ID.json

          echo "DR Test Complete: $TEST_ID"

      restartPolicy: Never

apiVersion: v1 kind: ServiceAccount metadata: name: dr-test-sa namespace: operations

apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: dr-test rules:
  • apiGroups: [""] resources: ["namespaces"] verbs: ["create", "get", "list"]
  • apiGroups: ["apps"] resources: ["deployments"] verbs: ["create", "get", "list", "patch", "set"]
  • apiGroups: [""] resources: ["pods", "pods/log"] verbs: ["get", "list"]
  • apiGroups: [""] resources: ["pods/exec"] verbs: ["create", "get"]

apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: dr-test roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: dr-test subjects:
  • kind: ServiceAccount name: dr-test-sa namespace: operations
undefined
apiVersion: v1 kind: ConfigMap metadata: name: dr-test-procedures namespace: operations data: dr-test-plan.md: | # Disaster Recovery Test Plan
## Test Objectives
- Validate backup restoration procedures
- Verify failover mechanisms
- Test DNS failover
- Validate data integrity post-recovery
- Measure RTO and RPO
- Train incident response team

## Pre-Test Checklist
- [ ] Notify stakeholders
- [ ] Schedule 4-6 hour window
- [ ] Disable alerting to prevent noise
- [ ] Backup production data
- [ ] Ensure DR environment is isolated
- [ ] Have rollback plan ready

## Test Scope
- Primary database failover to standby
- Application failover to DR site
- DNS resolution update
- Load balancer health checks
- Data synchronization verification

## Success Criteria
- RTO: < 1 hour
- RPO: < 15 minutes
- Zero data loss
- All services operational
- Alerts functional

## Post-Test Activities
- Document timeline
- Identify gaps
- Update procedures
- Schedule post-mortem
- Update team documentation

apiVersion: batch/v1 kind: Job metadata: name: dr-test-executor namespace: operations spec: template: spec: serviceAccountName: dr-test-sa containers: - name: executor image: alpine:latest env: - name: TEST_ID value: "dr-test-$(date +%s)" - name: BACKUP_BUCKET value: "s3://my-backups" - name: DR_NAMESPACE value: "dr-test" command: - sh - -c - | apk add --no-cache aws-cli kubectl jq postgresql-client mysql-client
          echo "Starting DR Test: $TEST_ID"

          # Step 1: Create test namespace
          echo "Creating isolated test environment..."
          kubectl create namespace "$DR_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

          # Step 2: Restore database from backup
          echo "Restoring database from latest backup..."
          LATEST_BACKUP=$(aws s3 ls "$BACKUP_BUCKET/databases/" | \
            sort | tail -n 1 | awk '{print $4}')

          aws s3 cp "$BACKUP_BUCKET/databases/$LATEST_BACKUP" - | \
            gunzip | psql postgres://user:pass@dr-db:5432/testdb

          # Step 3: Deploy application to DR namespace
          echo "Deploying application to DR environment..."
          kubectl set image deployment/myapp \
            myapp=myrepo/myapp:production \
            -n "$DR_NAMESPACE"

          # Step 4: Run health checks
          echo "Running health checks..."
          for i in {1..30}; do
            if curl -sf http://myapp-dr/health > /dev/null; then
              echo "Health check passed"
              break
            fi
            echo "Waiting for service to be healthy... ($i/30)"
            sleep 10
          done

          # Step 5: Run smoke tests
          echo "Running smoke tests..."
          kubectl exec -it deployment/myapp -n "$DR_NAMESPACE" -- \
            npm run test:smoke || exit 1

          # Step 6: Validate data integrity
          echo "Validating data integrity..."
          PROD_RECORD_COUNT=$(psql postgres://user:pass@prod-db:5432/mydb \
            -t -c "SELECT COUNT(*) FROM users;")
          DR_RECORD_COUNT=$(psql postgres://user:pass@dr-db:5432/testdb \
            -t -c "SELECT COUNT(*) FROM users;")

          if [ "$PROD_RECORD_COUNT" -eq "$DR_RECORD_COUNT" ]; then
            echo "Data integrity verified"
          else
            echo "Data integrity check failed"
            exit 1
          fi

          # Step 7: Record metrics
          echo "Recording DR test metrics..."
          kubectl logs deployment/myapp -n "$DR_NAMESPACE" | \
            grep "startup_time" | jq '.' > /tmp/dr-metrics-$TEST_ID.json

          echo "DR Test Complete: $TEST_ID"

      restartPolicy: Never

apiVersion: v1 kind: ServiceAccount metadata: name: dr-test-sa namespace: operations

apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: dr-test rules:
  • apiGroups: [""] resources: ["namespaces"] verbs: ["create", "get", "list"]
  • apiGroups: ["apps"] resources: ["deployments"] verbs: ["create", "get", "list", "patch", "set"]
  • apiGroups: [""] resources: ["pods", "pods/log"] verbs: ["get", "list"]
  • apiGroups: [""] resources: ["pods/exec"] verbs: ["create", "get"]

apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: dr-test roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: dr-test subjects:
  • kind: ServiceAccount name: dr-test-sa namespace: operations
undefined

2. DR Test Script

2. 灾难恢复测试脚本

bash
#!/bin/bash
bash
#!/bin/bash

execute-dr-test.sh - Comprehensive DR test execution

execute-dr-test.sh - Comprehensive DR test execution

set -euo pipefail
TEST_ID="dr-test-$(date +%Y%m%d-%H%M%S)" LOG_FILE="/tmp/dr-test-${TEST_ID}.log" METRICS_FILE="/tmp/dr-metrics-${TEST_ID}.json"
set -euo pipefail
TEST_ID="dr-test-$(date +%Y%m%d-%H%M%S)" LOG_FILE="/tmp/dr-test-${TEST_ID}.log" METRICS_FILE="/tmp/dr-metrics-${TEST_ID}.json"

Logging

Logging

exec 1> >(tee -a "$LOG_FILE") exec 2>&1
log_info() { echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') $1" }
log_error() { echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $1" }
exec 1> >(tee -a "$LOG_FILE") exec 2>&1
log_info() { echo "[INFO] $(date '+%Y-%m-%d %H:%M:%S') $1" }
log_error() { echo "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $1" }

Start time

Start time

START_TIME=$(date +%s)
log_info "Starting DR Test: $TEST_ID"
START_TIME=$(date +%s)
log_info "Starting DR Test: $TEST_ID"

Disable production monitoring

Disable production monitoring

log_info "Disabling production alerts..." aws sns set-topic-attributes
--topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts"
--attribute-name DisplayName
--attribute-value "DR Test - Alerts Disabled"
log_info "Disabling production alerts..." aws sns set-topic-attributes
--topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts"
--attribute-name DisplayName
--attribute-value "DR Test - Alerts Disabled"

Phase 1: Backup Validation

Phase 1: Backup Validation

log_info "Phase 1: Validating backups..." if ! aws s3 ls s3://my-backups/databases/ | grep -q "sql.gz"; then log_error "No valid backups found" exit 1 fi
log_info "Phase 1: Validating backups..." if ! aws s3 ls s3://my-backups/databases/ | grep -q "sql.gz"; then log_error "No valid backups found" exit 1 fi

Phase 2: Environment Setup

Phase 2: Environment Setup

log_info "Phase 2: Setting up DR environment..." LATEST_BACKUP=$(aws s3 ls s3://my-backups/databases/ |
sort | tail -n 1 | awk '{print $4}')
log_info "Using backup: $LATEST_BACKUP" aws s3 cp "s3://my-backups/databases/$LATEST_BACKUP" - | gunzip > /tmp/restore.sql
log_info "Phase 2: Setting up DR environment..." LATEST_BACKUP=$(aws s3 ls s3://my-backups/databases/ |
sort | tail -n 1 | awk '{print $4}')
log_info "Using backup: $LATEST_BACKUP" aws s3 cp "s3://my-backups/databases/$LATEST_BACKUP" - | gunzip > /tmp/restore.sql

Phase 3: Database Restoration

Phase 3: Database Restoration

log_info "Phase 3: Restoring database..." psql -h dr-db.internal -U postgres -d postgres -f /tmp/restore.sql > /dev/null 2>&1
log_info "Phase 3: Restoring database..." psql -h dr-db.internal -U postgres -d postgres -f /tmp/restore.sql > /dev/null 2>&1

Phase 4: Application Deployment

Phase 4: Application Deployment

log_info "Phase 4: Deploying application..." kubectl create namespace dr-test --dry-run=client -o yaml | kubectl apply -f - kubectl apply -f dr-deployment.yaml -n dr-test kubectl rollout status deployment/myapp -n dr-test --timeout=10m
log_info "Phase 4: Deploying application..." kubectl create namespace dr-test --dry-run=client -o yaml | kubectl apply -f - kubectl apply -f dr-deployment.yaml -n dr-test kubectl rollout status deployment/myapp -n dr-test --timeout=10m

Phase 5: Health Checks

Phase 5: Health Checks

log_info "Phase 5: Running health checks..." HEALTH_CHECK_START=$(date +%s)
for i in {1..60}; do if curl -sf --max-time 5 http://myapp-dr.internal/health > /dev/null 2>&1; then HEALTH_CHECK_TIME=$(($(date +%s) - HEALTH_CHECK_START)) log_info "Health check passed in ${HEALTH_CHECK_TIME}s" break fi if [ $i -eq 60 ]; then log_error "Health check timeout" exit 1 fi sleep 10 done
log_info "Phase 5: Running health checks..." HEALTH_CHECK_START=$(date +%s)
for i in {1..60}; do if curl -sf --max-time 5 http://myapp-dr.internal/health > /dev/null 2>&1; then HEALTH_CHECK_TIME=$(($(date +%s) - HEALTH_CHECK_START)) log_info "Health check passed in ${HEALTH_CHECK_TIME}s" break fi if [ $i -eq 60 ]; then log_error "Health check timeout" exit 1 fi sleep 10 done

Phase 6: Data Integrity

Phase 6: Data Integrity

log_info "Phase 6: Validating data integrity..." PROD_HASH=$(psql -h prod-db.internal -U postgres -d mydb -t -c
"SELECT md5(string_agg(CAST(id AS text), ',')) FROM users ORDER BY id;") DR_HASH=$(psql -h dr-db.internal -U postgres -d mydb -t -c
"SELECT md5(string_agg(CAST(id AS text), ',')) FROM users ORDER BY id;")
if [ "$PROD_HASH" = "$DR_HASH" ]; then log_info "Data integrity verified" else log_error "Data integrity check failed: $PROD_HASH != $DR_HASH" fi
log_info "Phase 6: Validating data integrity..." PROD_HASH=$(psql -h prod-db.internal -U postgres -d mydb -t -c
"SELECT md5(string_agg(CAST(id AS text), ',')) FROM users ORDER BY id;") DR_HASH=$(psql -h dr-db.internal -U postgres -d mydb -t -c
"SELECT md5(string_agg(CAST(id AS text), ',')) FROM users ORDER BY id;")
if [ "$PROD_HASH" = "$DR_HASH" ]; then log_info "Data integrity verified" else log_error "Data integrity check failed: $PROD_HASH != $DR_HASH" fi

Phase 7: Smoke Tests

Phase 7: Smoke Tests

log_info "Phase 7: Running smoke tests..." kubectl exec -it deployment/myapp -n dr-test -- npm run test:smoke ||
log_error "Smoke tests failed"
log_info "Phase 7: Running smoke tests..." kubectl exec -it deployment/myapp -n dr-test -- npm run test:smoke ||
log_error "Smoke tests failed"

Record metrics

Record metrics

END_TIME=$(date +%s) TOTAL_TIME=$((END_TIME - START_TIME)) RTO=$TOTAL_TIME RPO=$(date -d "$(aws s3api head-object --bucket my-backups --key databases/$LATEST_BACKUP --query 'LastModified' --output text)" +%s)
log_info "DR Test Complete" log_info "Total time: ${TOTAL_TIME}s" log_info "RTO: ${RTO}s (target: 3600s)" log_info "RPO: $(date -d @$RPO)"
END_TIME=$(date +%s) TOTAL_TIME=$((END_TIME - START_TIME)) RTO=$TOTAL_TIME RPO=$(date -d "$(aws s3api head-object --bucket my-backups --key databases/$LATEST_BACKUP --query 'LastModified' --output text)" +%s)
log_info "DR Test Complete" log_info "Total time: ${TOTAL_TIME}s" log_info "RTO: ${RTO}s (target: 3600s)" log_info "RPO: $(date -d @$RPO)"

Generate report

Generate report

cat > "$METRICS_FILE" <<EOF { "test_id": "$TEST_ID", "start_time": $START_TIME, "end_time": $END_TIME, "rto_seconds": $RTO, "rpo_timestamp": $RPO, "data_integrity": "PASS", "health_check": "PASS", "smoke_tests": "PASS" } EOF
log_info "Metrics saved to: $METRICS_FILE"
cat > "$METRICS_FILE" <<EOF { "test_id": "$TEST_ID", "start_time": $START_TIME, "end_time": $END_TIME, "rto_seconds": $RTO, "rpo_timestamp": $RPO, "data_integrity": "PASS", "health_check": "PASS", "smoke_tests": "PASS" } EOF
log_info "Metrics saved to: $METRICS_FILE"

Re-enable monitoring

Re-enable monitoring

log_info "Re-enabling production alerts..." aws sns set-topic-attributes
--topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts"
--attribute-name DisplayName
--attribute-value "Production Alerts"
log_info "Test artifacts: $LOG_FILE, $METRICS_FILE"
undefined
log_info "Re-enabling production alerts..." aws sns set-topic-attributes
--topic-arn "arn:aws:sns:us-east-1:123456789012:prod-alerts"
--attribute-name DisplayName
--attribute-value "Production Alerts"
log_info "Test artifacts: $LOG_FILE, $METRICS_FILE"
undefined

3. DR Test Automation

3. 灾难恢复测试自动化

yaml
undefined
yaml
undefined

scheduled-dr-tests.yaml

scheduled-dr-tests.yaml

apiVersion: batch/v1 kind: CronJob metadata: name: quarterly-dr-test namespace: operations spec:

Run quarterly on first Monday of each quarter at 2 AM

schedule: "0 2 2-8 1,4,7,10 MON" jobTemplate: spec: backoffLimit: 0 template: spec: serviceAccountName: dr-test-sa containers: - name: dr-test image: myrepo/dr-test:latest command: - /usr/local/bin/execute-dr-test.sh env: - name: SLACK_WEBHOOK valueFrom: secretKeyRef: name: dr-notifications key: slack-webhook - name: TEST_MODE value: "full" restartPolicy: Never
undefined
apiVersion: batch/v1 kind: CronJob metadata: name: quarterly-dr-test namespace: operations spec:

Run quarterly on first Monday of each quarter at 2 AM

schedule: "0 2 2-8 1,4,7,10 MON" jobTemplate: spec: backoffLimit: 0 template: spec: serviceAccountName: dr-test-sa containers: - name: dr-test image: myrepo/dr-test:latest command: - /usr/local/bin/execute-dr-test.sh env: - name: SLACK_WEBHOOK valueFrom: secretKeyRef: name: dr-notifications key: slack-webhook - name: TEST_MODE value: "full" restartPolicy: Never
undefined

Best Practices

最佳实践

✅ DO

✅ 建议

  • Schedule regular DR tests
  • Document procedures in advance
  • Test in isolated environments
  • Measure actual RTO/RPO
  • Involve all teams
  • Automate validation
  • Record findings
  • Update procedures based on results
  • 定期安排灾难恢复测试
  • 提前记录流程
  • 在隔离环境中测试
  • 衡量实际RTO/RPO
  • 让所有团队参与
  • 自动化验证流程
  • 记录测试结果
  • 根据测试结果更新流程

❌ DON'T

❌ 避免

  • Skip DR testing
  • Test during business hours
  • Test against production
  • Ignore test failures
  • Neglect post-test analysis
  • Forget to re-enable monitoring
  • Use stale backup processes
  • Test only once a year
  • 跳过灾难恢复测试
  • 在业务高峰期测试
  • 直接在生产环境测试
  • 忽略测试失败
  • 跳过测试后分析
  • 忘记重新启用监控
  • 使用过时的备份流程
  • 每年仅测试一次

DR Test Levels

灾难恢复测试级别

  • Tabletop: Documentation and discussion
  • Simulation: Controlled partial failover
  • Full DR: Complete system failover
  • Continuous: Ongoing shadow operations
  • 桌面演练:文档审查与讨论
  • 模拟演练:受控的部分故障转移
  • 完整灾难恢复:全系统故障转移
  • 持续演练:持续的影子操作

Key Metrics

关键指标

  • RTO: Recovery Time Objective
  • RPO: Recovery Point Objective
  • MTPD: Mean Time to Detect
  • MTTR: Mean Time to Recover
  • RTO:恢复时间目标
  • RPO:恢复点目标
  • MTPD:平均检测时间
  • MTTR:平均恢复时间

Resources

参考资源