planning-disaster-recovery

Disaster Recovery

Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

When to Use This Skill

Invoke this skill when:
  • Defining recovery time objectives (RTO) and recovery point objectives (RPO)
  • Implementing database backups with point-in-time recovery (PITR)
  • Setting up Kubernetes cluster backup and restore workflows
  • Configuring cross-region replication for high availability
  • Testing disaster recovery procedures through chaos experiments
  • Meeting compliance requirements (GDPR, SOC 2, HIPAA)
  • Automating backup monitoring and alerting
  • Designing multi-cloud disaster recovery architectures

Core Concepts

RTO and RPO Fundamentals

Recovery Time Objective (RTO): Maximum acceptable downtime after a disaster before business impact becomes unacceptable.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.
Criticality Tiers:
  • Tier 0 (Mission-Critical): RTO < 1 hour, RPO < 5 minutes
  • Tier 1 (Production): RTO 1-4 hours, RPO 15-60 minutes
  • Tier 2 (Important): RTO 4-24 hours, RPO 1-6 hours
  • Tier 3 (Standard): RTO > 24 hours, RPO > 6 hours
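As a rough illustration, the tiers above can be encoded as a lookup. This is a minimal sketch: the function name `classify_tier` is hypothetical, and assigning values that fall between the listed ranges (for example an RPO of 10 minutes) to the less strict neighbouring tier is an assumption, not something the tier list specifies.

```python
def classify_tier(rto_minutes: float, rpo_minutes: float) -> str:
    """Map RTO/RPO objectives (both in minutes) to a criticality tier.

    Both objectives must fit a tier's bounds; otherwise the next,
    less strict tier is tried.
    """
    if rto_minutes < 60 and rpo_minutes < 5:
        return "Tier 0"
    if rto_minutes <= 4 * 60 and rpo_minutes <= 60:
        return "Tier 1"
    if rto_minutes <= 24 * 60 and rpo_minutes <= 6 * 60:
        return "Tier 2"
    return "Tier 3"


print(classify_tier(30, 1))      # Tier 0
print(classify_tier(120, 30))    # Tier 1
print(classify_tier(600, 120))   # Tier 2
print(classify_tier(2000, 600))  # Tier 3
```

Requiring both objectives to fit means a system with a short RTO but a very loose RPO lands in the lower tier; teams that prefer to classify by the stricter of the two objectives would change the conditions accordingly.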

3-2-1 Backup Rule

Maintain 3 copies of data on 2 different media types with 1 copy offsite.
Example implementation:
  • Primary: Production database
  • Secondary: Local backup storage
  • Tertiary: Cloud backup (S3/GCS/Azure)
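The rule can be checked mechanically against an inventory of copies. The sketch below assumes a hypothetical `BackupCopy` record and made-up media labels; it only illustrates the three conditions, not any particular tool's inventory format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # illustrative label, e.g. "local-ssd", "nas", "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Return True if there are >= 3 copies, >= 2 media types, >= 1 offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    BackupCopy("production database", "local-ssd", offsite=False),
    BackupCopy("local backup storage", "nas", offsite=False),
    BackupCopy("cloud backup", "object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False
```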

Backup Types

Full Backup: Complete copy of all data. Slowest to create, fastest to restore.
Incremental Backup: Only changes since last backup. Fastest to create, requires full + all incrementals to restore.
Differential Backup: Changes since last full backup. Balance between storage and restore speed.
Continuous Backup: Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
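The restore-time difference between incremental and differential backups comes from how many backups must be replayed. The following sketch computes that chain for a chronological list of backups; the labels `"full"`, `"diff"`, and `"incr"` and the function name `restore_chain` are illustrative assumptions, not the vocabulary of any specific tool.

```python
from typing import List, Optional


def restore_chain(backups: List[str], target: Optional[int] = None) -> List[int]:
    """Return indices of the backups needed to restore backups[target].

    backups is in chronological order. An "incr" depends on the backup
    immediately before it; a "diff" depends only on the most recent "full".
    """
    i = len(backups) - 1 if target is None else target
    chain = []
    while i >= 0:
        kind = backups[i]
        chain.append(i)
        if kind == "full":
            return chain[::-1]  # oldest first, the order in which to apply them
        if kind == "diff":
            # Skip straight back to the most recent full backup.
            i -= 1
            while i >= 0 and backups[i] != "full":
                i -= 1
        elif kind == "incr":
            i -= 1
        else:
            raise ValueError(f"unknown backup kind: {kind}")
    raise ValueError("no full backup precedes the target")


print(restore_chain(["full", "incr", "incr", "incr"]))  # [0, 1, 2, 3]
print(restore_chain(["full", "diff", "diff", "diff"]))  # [0, 3]
```

With daily incrementals the chain grows by one backup per day until the next full, while with differentials it never exceeds two.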

Quick Decision Framework

Step 1: Map RTO/RPO to Strategy

RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest

RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High

RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium

RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low

Step 2: Select Backup Tools by Use Case

| Use Case | Primary Tool | Alternative | Key Feature |
|---|---|---|---|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Readable secondary regions (single writer) |

Database Backup Patterns

PostgreSQL with pgBackRest

Use Case: Production PostgreSQL with < 5 minute RPO
Quick Start: See `examples/postgresql/pgbackrest-config/`.
Configure continuous WAL archiving with full, differential, and incremental backups to S3, GCS, or Azure. Schedule a weekly full backup and daily differential backups. Restore with `pgbackrest --stanza=main --delta restore`; for PITR, add `--type=time` and a `--target` timestamp.
Detailed Guide: `references/database-backups.md#postgresql`

MySQL with Percona XtraBackup

Use Case: MySQL production requiring hot backups
Quick Start: See `examples/mysql/xtrabackup/`.
Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.
Detailed Guide: `references/database-backups.md#mysql`

MongoDB Backup

Quick Start: Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.
Detailed Guide: `references/database-backups.md#mongodb`

Kubernetes Disaster Recovery

Velero for Cluster Backups

Quick Start: `velero install --provider aws --bucket my-backups` (a real install also needs the provider plugin via `--plugins` and credentials, for example `--secret-file`).
Configure scheduled backups (daily for the full cluster, hourly for the production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Velero supports selective restore (namespace mappings, storage class remapping).
Examples: `examples/kubernetes/velero/`
Detailed Guide: `references/kubernetes-dr.md`

etcd Backup

Quick Start: `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`
Create periodic etcd snapshots for control plane recovery. Restore requires cluster recreation with snapshot data.
Examples: `examples/kubernetes/etcd/`

Cloud-Specific DR Patterns

AWS

Key Services:
  • RDS: Automated backups (up to 35-day retention), PITR, Multi-AZ
  • Aurora Global DB: Cross-region replication with a single writer region and readable secondaries; cross-region failover is a managed operation that must be initiated
  • S3 CRR: Cross-region replication with 15-min SLA (Replication Time Control)
Examples: `examples/cloud/aws/`
Detailed Guide: `references/cloud-dr-patterns.md#aws`

GCP

Key Services:
  • Cloud SQL: PITR with 7-day transaction logs, 30-day retention
  • GCS Multi-Regional: Automatic replication across 100+ mile separation
  • Regional HA: Synchronous replication within region
Detailed Guide: `references/cloud-dr-patterns.md#gcp`

Azure

Key Services:
  • Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
  • Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
  • Geo-Redundant Storage: Automatic replication to secondary region
Detailed Guide: `references/cloud-dr-patterns.md#azure`

Cross-Region Replication Patterns

| Pattern | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Active-Active | < 1 min | < 1 min | High | Both regions serve traffic |
| Active-Passive | 15-60 min | 5-15 min | Medium | Standby for failover |
| Pilot Light | 10-30 min | 5-15 min | Low | Minimal secondary infra |
| Warm Standby | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |
Implementation Examples:
  • PostgreSQL streaming replication (Active-Passive)
  • Aurora Global Database (secondary regions serve reads; writes go to a single primary region, so it is Active-Active for reads only)
  • ASG scale-up automation (Pilot Light)
Detailed Guide: `references/cross-region-replication.md`
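One way to use the pattern table is to pick the cheapest pattern whose worst-case figures still meet the objectives. The sketch below hard-codes the upper end of each RTO/RPO range from the table; treating "< 1 min" as a bound of 1 minute, the numeric cost ranking, and the names `Pattern` and `cheapest_pattern` are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Pattern(NamedTuple):
    name: str
    worst_rto_min: float  # upper end of the RTO range in the table
    worst_rpo_min: float  # upper end of the RPO range in the table
    cost_rank: int        # 1 = Low, 2 = Medium, 3 = Med-High, 4 = High


PATTERNS = [
    Pattern("Active-Active", 1, 1, 4),
    Pattern("Active-Passive", 60, 15, 2),
    Pattern("Pilot Light", 30, 15, 1),
    Pattern("Warm Standby", 15, 15, 3),
]


def cheapest_pattern(rto_min: float, rpo_min: float) -> Optional[Pattern]:
    """Return the lowest-cost pattern whose worst case meets both objectives."""
    candidates = [
        p for p in PATTERNS
        if p.worst_rto_min <= rto_min and p.worst_rpo_min <= rpo_min
    ]
    return min(candidates, key=lambda p: p.cost_rank, default=None)


print(cheapest_pattern(20, 15).name)  # Warm Standby
print(cheapest_pattern(60, 15).name)  # Pilot Light
print(cheapest_pattern(5, 5).name)    # Active-Active
```

A result of `None` means no pattern in the table guarantees the requested objectives, which is a signal to revisit the objectives or the architecture rather than an error.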

Testing Disaster Recovery

Chaos Engineering

Purpose: Validate DR procedures through controlled failure injection.
Test Scenarios:
  • Database failover (stop primary, measure promotion time)
  • Region failure (block network, trigger DNS failover)
  • Kubernetes recovery (delete namespace, restore from Velero)
Tools: Chaos Mesh, Gremlin, Litmus, Toxiproxy
Examples: `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
Detailed Guide: `references/chaos-engineering.md`
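Measuring promotion time in a failover test comes down to polling a health check after the fault is injected. The helper below is a minimal sketch, not the contents of the example scripts: `measure_recovery_seconds` and the `is_healthy` callback are hypothetical, and the clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll is_healthy() until it returns True; return the elapsed seconds.

    Call this right after injecting the failure. Raises TimeoutError if
    the service has not recovered within timeout_s.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"no recovery within {timeout_s} seconds")
        sleep(poll_s)


# Simulated run: a fake clock where the service recovers after 3 seconds.
fake_now = [0.0]
elapsed = measure_recovery_seconds(
    is_healthy=lambda: fake_now[0] >= 3.0,
    timeout_s=10.0,
    poll_s=1.0,
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(elapsed)  # 3.0
```

The measured value can then be compared with the RTO for the system's tier; the resolution of the measurement is bounded by `poll_s`.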

Automated DR Drills

Run Monthly Tests:
bash
./scripts/dr-drill.sh --environment staging --test-type full
./scripts/test-restore.sh --backup latest --target staging-db

Compliance and Retention

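| Regulation | Retention | Requirements |
|---|---|---|
| GDPR | 1-7 years (typical policy range; GDPR itself sets no fixed period) | EU data residency, right to erasure |
| SOC 2 | 1 year+ | Secure deletion, access controls |
| HIPAA | 6 years | Encryption, PHI protection |
| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |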
Implement with S3/GCS lifecycle policies: 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive (these are S3 storage class names; the closest GCS classes are Nearline, Coldline, and Archive).
Immutable backups: Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.
Detailed Guide: `references/compliance-retention.md`
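For S3, the tiering schedule above can be expressed as a lifecycle configuration document. The sketch below only builds and prints that document; the rule ID `backup-tiering` and the `backups/` prefix are placeholders, and the helper `lifecycle_rule` is hypothetical. The resulting dictionary has the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`, though applying it to a bucket is left out here.

```python
import json
from typing import Dict, List, Tuple


def lifecycle_rule(rule_id: str, prefix: str,
                   transitions: List[Tuple[int, str]]) -> Dict:
    """Build one S3 lifecycle rule from (days, storage_class) pairs."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in sorted(transitions)
        ],
    }


config = {
    "Rules": [
        lifecycle_rule(
            "backup-tiering",  # placeholder rule name
            "backups/",        # placeholder key prefix
            [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")],
        )
    ]
}
print(json.dumps(config, indent=2))
```

Retention limits from the compliance table would be added as an `Expiration` element on the same rule, taking care that it does not conflict with any Object Lock retention period.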

Monitoring and Alerting

Key Metrics: Backup success rate, duration, time since last backup, RPO breach, storage utilization
Prometheus Alerts: VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend
Validation Scripts:
bash
./scripts/validate-backup.sh --backup latest --verify-integrity
./scripts/check-retention.sh --report-violations
./scripts/generate-dr-report.sh --format pdf
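The "time since last backup" and "RPO breach" metrics reduce to a timestamp comparison. The following is a minimal sketch with hypothetical function names; in practice the last-backup timestamp would come from the backup tool or a metrics store rather than being hard-coded.

```python
from datetime import datetime, timedelta, timezone


def backup_age(last_backup: datetime, now: datetime) -> timedelta:
    """Time elapsed since the last successful backup."""
    return now - last_backup


def rpo_breached(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest backup is older than the RPO allows."""
    return backup_age(last_backup, now) > rpo


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)

print(backup_age(last, now))                           # 0:30:00
print(rpo_breached(last, now, timedelta(minutes=15)))  # True
print(rpo_breached(last, now, timedelta(hours=1)))     # False
```

Using timezone-aware timestamps throughout avoids false alarms when backup hosts and the monitoring host are configured with different local time zones.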

Automation and Runbooks

Automate Backup Schedules: Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)
DR Runbook Steps: Detect failure → Verify secondary → Promote → Update DNS → Notify → Document
Detailed Guide: `references/runbook-automation.md`
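The runbook sequence lends itself to a small ordered runner that halts at the first failed step, so that later steps such as the DNS update never run against an unverified secondary. This is a sketch under stated assumptions: `run_runbook` is a hypothetical helper, and the step actions are placeholders that always succeed, standing in for real monitoring, database, and DNS calls.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; raise at the first step that reports failure."""
    completed = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"Runbook halted at step: {name}")
        completed.append(name)
    return completed


# Placeholder actions; each would wrap real tooling in practice.
steps = [
    ("Detect failure", lambda: True),
    ("Verify secondary", lambda: True),
    ("Promote", lambda: True),
    ("Update DNS", lambda: True),
    ("Notify", lambda: True),
    ("Document", lambda: True),
]
print(run_runbook(steps))
```

Keeping each step as a named callable also makes it straightforward to log per-step durations during drills and compare the total against the RTO.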

Integration with Other Skills

Related Skills

Prerequisites:
  • `infrastructure-as-code`: Provision backup infrastructure, DR regions
  • `kubernetes-operations`: K8s cluster setup for Velero
  • `secret-management`: Backup encryption keys, credentials
Parallel Skills:
  • `databases-postgresql`: PostgreSQL configuration and operations
  • `databases-mysql`: MySQL configuration and operations
  • `observability`: Backup monitoring, alerting
  • `security-hardening`: Secure backup storage, access control
Consumer Skills:
  • `incident-management`: Invoke DR procedures during incidents
  • `compliance-frameworks`: Meet regulatory requirements

Skill Chaining Example

infrastructure-as-code → secret-management → disaster-recovery → observability
       ↓                        ↓                   ↓                ↓
  Create S3 buckets      Store encryption     Configure backups   Monitor jobs
  Provision databases    keys in Vault        Set up replication  Alert failures
  Setup VPCs             Manage credentials   Test DR drills      Track metrics

Best Practices

Do

✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD

Don't

✗ Assume backups work without testing
✗ Store all backups in a single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Run backups only against a primary database that is under high load (prefer a replica or an off-peak window)
✗ Store encryption keys with backups

Reference Documentation

  • RTO/RPO Planning: `references/rto-rpo-planning.md`
  • Database Backups: `references/database-backups.md`
  • Kubernetes DR: `references/kubernetes-dr.md`
  • Cloud DR Patterns: `references/cloud-dr-patterns.md`
  • Cross-Region Replication: `references/cross-region-replication.md`
  • Chaos Engineering: `references/chaos-engineering.md`
  • Compliance Requirements: `references/compliance-retention.md`
  • Runbook Automation: `references/runbook-automation.md`

Examples

  • Runbooks: `examples/runbooks/database-failover.md`, `examples/runbooks/region-failover.md`
  • PostgreSQL: `examples/postgresql/pgbackrest-config/`, `examples/postgresql/walg-config/`
  • MySQL: `examples/mysql/xtrabackup/`, `examples/mysql/walg/`
  • Kubernetes: `examples/kubernetes/velero/`, `examples/kubernetes/etcd/`
  • Cloud: `examples/cloud/aws/`, `examples/cloud/gcp/`, `examples/cloud/azure/`
  • Chaos: `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`

Scripts

  • `scripts/validate-backup.sh`: Verify backup integrity
  • `scripts/test-restore.sh`: Automated restore testing
  • `scripts/dr-drill.sh`: Run full DR drill
  • `scripts/check-retention.sh`: Verify retention policies
  • `scripts/generate-dr-report.sh`: Compliance reporting