planning-disaster-recovery
Disaster Recovery
Purpose
Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.
When to Use This Skill
Invoke this skill when:
- Defining recovery time objectives (RTO) and recovery point objectives (RPO)
- Implementing database backups with point-in-time recovery (PITR)
- Setting up Kubernetes cluster backup and restore workflows
- Configuring cross-region replication for high availability
- Testing disaster recovery procedures through chaos experiments
- Meeting compliance requirements (GDPR, SOC 2, HIPAA)
- Automating backup monitoring and alerting
- Designing multi-cloud disaster recovery architectures
Core Concepts
RTO and RPO Fundamentals
Recovery Time Objective (RTO): Maximum acceptable downtime after a disaster before business impact becomes unacceptable.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.
Criticality Tiers:
- Tier 0 (Mission-Critical): RTO < 1 hour, RPO < 5 minutes
- Tier 1 (Production): RTO 1-4 hours, RPO 15-60 minutes
- Tier 2 (Important): RTO 4-24 hours, RPO 1-6 hours
- Tier 3 (Standard): RTO > 24 hours, RPO > 6 hours
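The tier boundaries above can be read as a small lookup. The following is a minimal sketch, assuming RTO and RPO are given in minutes and that a workload takes the strictest tier either objective demands. The function name `classify_tier` and the handling of values that fall between the listed ranges (for example an RPO of 10 minutes) are illustrative assumptions, not part of the original guidance.

```python
def classify_tier(rto_minutes: float, rpo_minutes: float) -> int:
    """Return the criticality tier (0-3) implied by the stricter objective."""
    # Tier implied by RTO alone: < 1 h, up to 4 h, up to 24 h, beyond.
    if rto_minutes < 60:
        rto_tier = 0
    elif rto_minutes <= 240:
        rto_tier = 1
    elif rto_minutes <= 1440:
        rto_tier = 2
    else:
        rto_tier = 3

    # Tier implied by RPO alone: < 5 min, up to 60 min, up to 6 h, beyond.
    if rpo_minutes < 5:
        rpo_tier = 0
    elif rpo_minutes <= 60:
        rpo_tier = 1
    elif rpo_minutes <= 360:
        rpo_tier = 2
    else:
        rpo_tier = 3

    # The stricter (lower-numbered) requirement decides the tier.
    return min(rto_tier, rpo_tier)
```

For instance, a service that tolerates ten hours of downtime but only three minutes of data loss still lands in Tier 0, because the RPO is the binding constraint.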
3-2-1 Backup Rule
Maintain 3 copies of data on 2 different media types with 1 copy offsite.
Example implementation:
- Primary: Production database
- Secondary: Local backup storage
- Tertiary: Cloud backup (S3/GCS/Azure)
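A backup inventory can be checked against the 3-2-1 rule mechanically. The sketch below assumes each copy is described by a media type and an offsite flag; the `BackupCopy` class and the media labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # illustrative labels, e.g. "block-storage", "nas"
    offsite: bool   # True if stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check for 3 copies, 2 distinct media types, and 1 offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# The example implementation above, expressed as an inventory.
copies = [
    BackupCopy("production database", "block-storage", offsite=False),
    BackupCopy("local backup storage", "nas", offsite=False),
    BackupCopy("cloud backup", "object-storage", offsite=True),
]
```

Dropping the cloud copy from this inventory makes the check fail twice over: only two copies remain and none of them is offsite.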
Backup Types
Full Backup: Complete copy of all data. Slowest to create, fastest to restore.
Incremental Backup: Only changes since last backup. Fastest to create, requires full + all incrementals to restore.
Differential Backup: Changes since last full backup. Balance between storage and restore speed.
Continuous Backup: Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
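The restore-time difference between incremental and differential backups comes from how many backups must be replayed. The following sketch computes that chain, assuming a chronologically ordered list of `(label, kind)` pairs where `kind` is `"full"`, `"diff"`, or `"incr"`; the function name and labels are illustrative only.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """Return the backups needed to restore the newest one, oldest first."""
    chain: list[str] = []
    need_all = True  # False once a differential has been collected
    # Walk backwards from the newest backup until a full backup is reached.
    for label, kind in reversed(backups):
        if kind == "full":
            chain.append(label)
            break
        if need_all:
            chain.append(label)
            if kind == "diff":
                # A differential holds everything since the last full backup,
                # so older incrementals and differentials can be skipped.
                need_all = False
    else:
        raise ValueError("no full backup found")
    return chain[::-1]


incremental_week = [("sun-full", "full"), ("mon", "incr"), ("tue", "incr"), ("wed", "incr")]
differential_week = [("sun-full", "full"), ("mon", "diff"), ("tue", "diff"), ("wed", "diff")]
```

With these inputs the incremental schedule needs all four backups to restore Wednesday, while the differential schedule needs only the full backup plus Wednesday's differential.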
Quick Decision Framework
Step 1: Map RTO/RPO to Strategy
RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest
RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High
RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium
RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low
Step 2: Select Backup Tools by Use Case
| Use Case | Primary Tool | Alternative | Key Feature |
|---|---|---|---|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Readable secondary regions, single writer region |
Database Backup Patterns
PostgreSQL with pgBackRest
Use Case: Production PostgreSQL with < 5 minute RPO
Quick Start: See `examples/postgresql/pgbackrest-config/`
Configure continuous WAL archiving with full/differential/incremental backups to S3/GCS/Azure. Schedule weekly full and daily differential backups. Restore with `pgbackrest --stanza=main --delta restore`; for PITR, add `--type=time` and `--target=<timestamp>`.
Detailed Guide: `references/database-backups.md#postgresql`
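A point-in-time restore with pgBackRest combines the delta restore with a recovery target. The sketch below only builds the command line and does not run it; it assumes the stanza is named as in the quick start, that PostgreSQL is stopped before the restore, and that promoting after reaching the target is the desired behaviour. The helper name `pitr_restore_command` is hypothetical.

```python
from datetime import datetime, timezone


def pitr_restore_command(stanza: str, target: datetime) -> list[str]:
    """Build a pgBackRest argv for a delta restore to a point in time."""
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    # Normalise to UTC and use a timestamp format PostgreSQL accepts.
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return [
        "pgbackrest",
        f"--stanza={stanza}",
        "--delta",                  # reuse files already present in the data directory
        "--type=time",              # recover to a timestamp rather than end of WAL
        f"--target={stamp}",
        "--target-action=promote",  # assumption: promote once the target is reached
        "restore",
    ]


command = pitr_restore_command("main", datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on the database host once the target time has been confirmed.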
MySQL with Percona XtraBackup
Use Case: MySQL production requiring hot backups
Quick Start: See `examples/mysql/xtrabackup/`
Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply-incrementals, and copy-back steps.
Detailed Guide: `references/database-backups.md#mysql`
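The restore order matters: every prepare step except the last must leave the backup open for further incrementals. The following sketch lays out that sequence as a list of commands, assuming one full backup directory and zero or more incremental directories in chronological order. The function name and directory paths are illustrative, and nothing is executed.

```python
def xtrabackup_restore_plan(
    full_dir: str, incremental_dirs: list[str], compressed: bool = False
) -> list[list[str]]:
    """Return the ordered xtrabackup commands for a restore."""
    plan: list[list[str]] = []
    if compressed:
        # Compressed backups must be decompressed before they can be prepared.
        for directory in [full_dir, *incremental_dirs]:
            plan.append(["xtrabackup", "--decompress", f"--target-dir={directory}"])
    if not incremental_dirs:
        plan.append(["xtrabackup", "--prepare", f"--target-dir={full_dir}"])
    else:
        # Keep the full backup open for incrementals with --apply-log-only.
        plan.append(["xtrabackup", "--prepare", "--apply-log-only", f"--target-dir={full_dir}"])
        for index, inc_dir in enumerate(incremental_dirs):
            step = ["xtrabackup", "--prepare"]
            if index < len(incremental_dirs) - 1:
                step.append("--apply-log-only")  # omitted for the final incremental
            step += [f"--target-dir={full_dir}", f"--incremental-dir={inc_dir}"]
            plan.append(step)
    # Copy the prepared files back into the (empty) MySQL data directory.
    plan.append(["xtrabackup", "--copy-back", f"--target-dir={full_dir}"])
    return plan


plan = xtrabackup_restore_plan("/backups/full", ["/backups/inc1", "/backups/inc2"])
```

Replaying binary logs up to the desired point in time would follow the copy-back step and is not covered by this sketch.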
MongoDB Backup
Quick Start: Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.
Detailed Guide: `references/database-backups.md#mongodb`
Kubernetes Disaster Recovery
Velero for Cluster Backups
Quick Start: `velero install --provider aws --bucket my-backups`
Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Selective restore is supported (namespace mappings, storage class remapping).
Examples: `examples/kubernetes/velero/`
Detailed Guide: `references/kubernetes-dr.md`
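The schedule and restore commands described above can be assembled as follows. This is a sketch that builds `velero` command lines without running them; the schedule names, cron expressions, and the 720-hour retention are assumptions for illustration, not values from the original guidance.

```python
from typing import Optional


def velero_schedule_command(
    name: str, cron: str, namespaces: Optional[list[str]] = None, ttl_hours: int = 720
) -> list[str]:
    """Build a `velero schedule create` command."""
    cmd = ["velero", "schedule", "create", name, f"--schedule={cron}", f"--ttl={ttl_hours}h"]
    if namespaces:
        # Limit the backup to the listed namespaces.
        cmd.append("--include-namespaces=" + ",".join(namespaces))
    return cmd


def velero_restore_command(
    backup: str, namespace_mappings: Optional[dict[str, str]] = None
) -> list[str]:
    """Build a `velero restore create` command, optionally remapping namespaces."""
    cmd = ["velero", "restore", "create", f"--from-backup={backup}"]
    if namespace_mappings:
        pairs = ",".join(f"{src}:{dst}" for src, dst in namespace_mappings.items())
        cmd.append(f"--namespace-mappings={pairs}")
    return cmd


daily_full = velero_schedule_command("daily-full", "0 2 * * *")
hourly_prod = velero_schedule_command("hourly-production", "0 * * * *", ["production"], ttl_hours=72)
drill_restore = velero_restore_command("daily-full-example", {"production": "production-restore"})
```

Restoring into a mapped namespace, as in `drill_restore`, is a convenient way to rehearse a recovery without touching the live workload.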
etcd Backup
Quick Start: `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`
Create periodic etcd snapshots for control plane recovery. Restore requires cluster recreation with snapshot data.
Examples: `examples/kubernetes/etcd/`
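Periodic snapshots are easier to manage when each file carries a timestamp and is verified right after it is written. The sketch below builds the save, verify, and restore commands without executing them. The endpoint and certificate paths assume a kubeadm-style layout, and the function names are hypothetical; adjust both for the cluster at hand.

```python
from datetime import datetime, timezone


def etcd_snapshot_commands(
    now: datetime,
    backup_dir: str = "/backups/etcd",
    endpoint: str = "https://127.0.0.1:2379",   # assumption: local etcd member
    pki_dir: str = "/etc/kubernetes/pki/etcd",  # assumption: kubeadm certificate layout
) -> tuple[str, list[str], list[str]]:
    """Return (snapshot path, save command, verify command)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot = f"{backup_dir}/snapshot-{stamp}.db"
    save = [
        "etcdctl", "snapshot", "save", snapshot,
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
    ]
    # Inspect the snapshot (hash, revision, size) right after writing it.
    verify = ["etcdutl", "snapshot", "status", snapshot, "--write-out=table"]
    return snapshot, save, verify


def etcd_restore_command(snapshot: str, data_dir: str) -> list[str]:
    """Build the command that materialises a snapshot into a new data directory."""
    return ["etcdutl", "snapshot", "restore", snapshot, f"--data-dir={data_dir}"]


snapshot, save, verify = etcd_snapshot_commands(datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc))
```

The restore command writes a fresh data directory; the etcd member then has to be started against that directory, which is the cluster recreation step mentioned above.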
Cloud-Specific DR Patterns
AWS
Key Services:
- RDS: Automated backups (up to 35-day retention), PITR, Multi-AZ
- Aurora Global DB: Cross-region active-passive replication with managed failover to a secondary region
- S3 CRR: Cross-region replication; Replication Time Control (RTC) targets replication within 15 minutes
Examples: `examples/cloud/aws/`
Detailed Guide: `references/cloud-dr-patterns.md#aws`
GCP
Key Services:
- Cloud SQL: PITR with 7-day transaction logs, 30-day retention
- GCS Multi-Regional: Automatic replication across 100+ mile separation
- Regional HA: Synchronous replication within region
Detailed Guide: `references/cloud-dr-patterns.md#gcp`
Azure
Key Services:
- Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
- Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
- Geo-Redundant Storage: Automatic replication to secondary region
Detailed Guide: `references/cloud-dr-patterns.md#azure`
Cross-Region Replication Patterns
| Pattern | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Active-Active | < 1 min | < 1 min | High | Both regions serve traffic |
| Active-Passive | 15-60 min | 5-15 min | Medium | Standby for failover |
| Pilot Light | 10-30 min | 5-15 min | Low | Minimal secondary infra |
| Warm Standby | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |
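The table can be used as a selection rule: choose the least expensive pattern whose worst-case RTO and RPO still meet the targets. The sketch below encodes the upper bound of each range from the table in minutes; the numeric cost ranking (Low < Medium < Med-High < High) and the function name are assumptions made for illustration.

```python
from typing import Optional

# (pattern, worst-case RTO in minutes, worst-case RPO in minutes, cost rank)
PATTERNS = [
    ("Pilot Light", 30, 15, 1),
    ("Active-Passive", 60, 15, 2),
    ("Warm Standby", 15, 15, 3),
    ("Active-Active", 1, 1, 4),
]


def cheapest_pattern(rto_target: float, rpo_target: float) -> Optional[str]:
    """Return the lowest-cost pattern meeting both targets, or None."""
    candidates = [
        pattern for pattern in PATTERNS
        if pattern[1] <= rto_target and pattern[2] <= rpo_target
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pattern: pattern[3])[0]
```

Taken at face value, these figures never select Active-Passive, because Pilot Light is listed as both cheaper and faster. In practice the choice also depends on factors the table does not capture, such as operational complexity and how well the scale-up path has been rehearsed.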
Implementation Examples:
- PostgreSQL streaming replication (Active-Passive)
- Aurora Global Database (Active-Passive for writes; secondary regions serve reads)
- ASG scale-up automation (Pilot Light)
Detailed Guide: `references/cross-region-replication.md`
Testing Disaster Recovery
Chaos Engineering
Purpose: Validate DR procedures through controlled failure injection.
Test Scenarios:
- Database failover (stop primary, measure promotion time)
- Region failure (block network, trigger DNS failover)
- Kubernetes recovery (delete namespace, restore from Velero)
Tools: Chaos Mesh, Gremlin, Litmus, Toxiproxy
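Measuring promotion time in a failover test comes down to polling a health check after the fault is injected and recording how long it takes to pass. The following is a minimal sketch of that timing loop; `is_healthy` stands in for whatever probe the environment provides (a SQL ping, an HTTP check), and the injectable `clock` and `sleep` parameters exist only to make the loop testable.

```python
import time
from typing import Callable


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until is_healthy() returns True and return the elapsed seconds."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"service did not recover within {timeout_s} seconds")
        sleep(interval_s)
```

Comparing the returned value with the RTO for the service's tier turns the experiment into a pass/fail check rather than an observation.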
Examples: `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
Detailed Guide: `references/chaos-engineering.md`
Automated DR Drills
Run Monthly Tests:
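```bash
# Run a full DR drill against the staging environment
./scripts/dr-drill.sh --environment staging --test-type full
# Test restoring the latest backup into the staging database
./scripts/test-restore.sh --backup latest --target staging-db
```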
Compliance and Retention
| Regulation | Typical Retention | Requirements |
|---|---|---|
| GDPR | 1-7 years | EU data residency, right to erasure |
| SOC 2 | 1 year+ | Secure deletion, access controls |
| HIPAA | 6 years | Encryption, PHI protection |
| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |
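Implement with S3/GCS lifecycle policies, for example 30 days → Standard-IA, 90 days → Glacier, 365 days → Deep Archive (S3 storage classes; the GCS counterparts are Nearline, Coldline, and Archive).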
Immutable backups: Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.
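The lifecycle tiers above map directly onto an S3 lifecycle configuration. The sketch below builds that configuration as a plain dictionary; the rule ID, the `backups/` prefix, and the six-year expiration are assumptions for illustration, and the retention period should come from the applicable regulation.

```python
import json

# Transition ages (days) and S3 storage classes from the schedule above.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def lifecycle_configuration(prefix: str, expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration for objects under `prefix`."""
    last_transition = TRANSITIONS[-1][0]
    if expire_after_days <= last_transition:
        raise ValueError("expiration must come after the last transition")
    return {
        "Rules": [
            {
                "ID": "backup-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days, "StorageClass": storage_class}
                    for days, storage_class in TRANSITIONS
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = lifecycle_configuration("backups/", expire_after_days=6 * 365)
print(json.dumps(config, indent=2))
```

The dictionary has the shape that boto3's `put_bucket_lifecycle_configuration` expects for its `LifecycleConfiguration` argument, so it can be applied to a bucket or written to a JSON file for review first.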
Detailed Guide: `references/compliance-retention.md`
Monitoring and Alerting
Key Metrics: Backup success rate, duration, time since last backup, RPO breach, storage utilization
Prometheus Alerts: VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend
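Two of the metrics above, time since last backup and RPO breach, reduce to one comparison. The following is a minimal sketch, assuming the timestamp of the last successful backup is available as a timezone-aware `datetime`; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_age(last_backup_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return the time elapsed since the last successful backup."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup_at


def rpo_breached(last_backup_at: datetime, rpo: timedelta, now: Optional[datetime] = None) -> bool:
    """True when the newest backup is older than the RPO allows."""
    return backup_age(last_backup_at, now) > rpo
```

An alert built on this check fires as soon as a restore would lose more data than the objective permits, even if every backup job so far has reported success.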
Validation Scripts:
```bash
# Verify the integrity of the latest backup
./scripts/validate-backup.sh --backup latest --verify-integrity
# Report retention policy violations
./scripts/check-retention.sh --report-violations
# Generate the DR report as a PDF
./scripts/generate-dr-report.sh --format pdf
```
Automation and Runbooks
Automate Backup Schedules: Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)
DR Runbook Steps: Detect failure → Verify secondary → Promote → Update DNS → Notify → Document
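The runbook steps above form an ordered pipeline in which a failed step should stop everything after it. The sketch below shows one way to express that, with each step as a named callable returning `True` on success. The placeholder lambdas are assumptions; real actions would call health checks, promotion scripts, DNS APIs, and paging tools.

```python
from typing import Callable, Optional


def run_runbook(
    steps: list[tuple[str, Callable[[], bool]]]
) -> tuple[list[str], Optional[str]]:
    """Run steps in order; return (completed step names, failed step or None)."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            # Stop at the first failure so later steps never run on a bad state.
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions that always succeed, for illustration only.
steps = [
    ("Detect failure", lambda: True),
    ("Verify secondary", lambda: True),
    ("Promote", lambda: True),
    ("Update DNS", lambda: True),
    ("Notify", lambda: True),
    ("Document", lambda: True),
]
completed, failed_at = run_runbook(steps)
```

Returning the name of the failed step, rather than only a boolean, gives the on-call engineer the point at which to resume manually.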
Detailed Guide: `references/runbook-automation.md`
Integration with Other Skills
Related Skills
Prerequisites:
- `infrastructure-as-code`: Provision backup infrastructure, DR regions
- `kubernetes-operations`: K8s cluster setup for Velero
- `secret-management`: Backup encryption keys, credentials
Parallel Skills:
- `databases-postgresql`: PostgreSQL configuration and operations
- `databases-mysql`: MySQL configuration and operations
- `observability`: Backup monitoring, alerting
- `security-hardening`: Secure backup storage, access control
Consumer Skills:
- `incident-management`: Invoke DR procedures during incidents
- `compliance-frameworks`: Meet regulatory requirements
Skill Chaining Example
infrastructure-as-code → secret-management → disaster-recovery → observability
        ↓                       ↓                   ↓                 ↓
Create S3 buckets        Store encryption    Configure backups   Monitor jobs
Provision databases      keys in Vault       Set up replication  Alert on failures
Set up VPCs              Manage credentials  Test DR drills      Track metrics
Best Practices
Do
✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD
Don't
✗ Assume backups work without testing
✗ Store all backups in a single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Run backups against a heavily loaded primary database (use a replica where possible)
✗ Store encryption keys with backups
Reference Documentation
- RTO/RPO Planning: `references/rto-rpo-planning.md`
- Database Backups: `references/database-backups.md`
- Kubernetes DR: `references/kubernetes-dr.md`
- Cloud DR Patterns: `references/cloud-dr-patterns.md`
- Cross-Region Replication: `references/cross-region-replication.md`
- Chaos Engineering: `references/chaos-engineering.md`
- Compliance Requirements: `references/compliance-retention.md`
- Runbook Automation: `references/runbook-automation.md`
Examples
- Runbooks: `examples/runbooks/database-failover.md`, `examples/runbooks/region-failover.md`
- PostgreSQL: `examples/postgresql/pgbackrest-config/`, `examples/postgresql/walg-config/`
- MySQL: `examples/mysql/xtrabackup/`, `examples/mysql/walg/`
- Kubernetes: `examples/kubernetes/velero/`, `examples/kubernetes/etcd/`
- Cloud: `examples/cloud/aws/`, `examples/cloud/gcp/`, `examples/cloud/azure/`
- Chaos: `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
Scripts
- `scripts/validate-backup.sh`: Verify backup integrity
- `scripts/test-restore.sh`: Automated restore testing
- `scripts/dr-drill.sh`: Run full DR drill
- `scripts/check-retention.sh`: Verify retention policies
- `scripts/generate-dr-report.sh`: Compliance reporting