planning-disaster-recovery

Disaster Recovery

Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

When to Use This Skill

Invoke this skill when:

Defining recovery time objectives (RTO) and recovery point objectives (RPO)
Implementing database backups with point-in-time recovery (PITR)
Setting up Kubernetes cluster backup and restore workflows
Configuring cross-region replication for high availability
Testing disaster recovery procedures through chaos experiments
Meeting compliance requirements (GDPR, SOC 2, HIPAA)
Automating backup monitoring and alerting
Designing multi-cloud disaster recovery architectures

Core Concepts

RTO and RPO Fundamentals

Recovery Time Objective (RTO): Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

Criticality Tiers:

Tier 0 (Mission-Critical): RTO < 1 hour, RPO < 5 minutes
Tier 1 (Production): RTO 1-4 hours, RPO 15-60 minutes
Tier 2 (Important): RTO 4-24 hours, RPO 1-6 hours
Tier 3 (Standard): RTO > 24 hours, RPO > 6 hours
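
The tier boundaries above can be turned into a small lookup. The following is a minimal sketch, not a definitive policy: it assumes RTO and RPO are given in whole minutes, that the stricter of the two requirements decides the tier, and that values falling between the published ranges (for example an RPO of 10 minutes) go to the next less strict tier. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Map an RTO/RPO pair (in minutes) to a criticality tier from the list above.
# Assumption: values between the published ranges (for example an RPO of
# 10 minutes) are assigned to the next less strict tier.

tier_for_rto() {
  local rto=$1
  if   (( rto < 60 ));    then echo 0
  elif (( rto <= 240 ));  then echo 1
  elif (( rto <= 1440 )); then echo 2
  else                         echo 3
  fi
}

tier_for_rpo() {
  local rpo=$1
  if   (( rpo < 5 ));    then echo 0
  elif (( rpo <= 60 ));  then echo 1
  elif (( rpo <= 360 )); then echo 2
  else                        echo 3
  fi
}

# The stricter (lower-numbered) of the two tiers wins.
classify_tier() {
  local by_rto by_rpo
  by_rto=$(tier_for_rto "$1")
  by_rpo=$(tier_for_rpo "$2")
  if (( by_rto < by_rpo )); then echo "$by_rto"; else echo "$by_rpo"; fi
}

classify_tier 120 3   # RTO of 2 hours, RPO of 3 minutes: prints 0
```

Letting the stricter requirement win means a system with a relaxed RTO but a tight RPO still gets the continuous-archiving treatment its data loss budget needs.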

3-2-1 Backup Rule

Maintain 3 copies of data on 2 different media types with 1 copy offsite.

Example implementation:

Primary: Production database
Secondary: Local backup storage
Tertiary: Cloud backup (S3/GCS/Azure)
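
A backup inventory can be checked against the rule mechanically. This is a minimal sketch under stated assumptions: each copy is described by a hypothetical `name:media:location` record, and `location` is either `onsite` or `offsite`. Neither the record format nor the function name comes from the original.

```bash
#!/usr/bin/env bash
# Check a list of backup copies against the 3-2-1 rule.
# Each argument is "name:media:location"; location is "onsite" or "offsite".
# The record format is an assumption made for this sketch.

check_321() {
  local copies=$# media_count offsite=0 entry
  # Count distinct media types (second field).
  media_count=$(printf '%s\n' "$@" | cut -d: -f2 | sort -u | wc -l)
  for entry in "$@"; do
    [[ ${entry##*:} == offsite ]] && offsite=$((offsite + 1))
  done
  if (( copies >= 3 && media_count >= 2 && offsite >= 1 )); then
    echo "ok"
  else
    echo "violation: copies=$copies media=$((media_count)) offsite=$offsite"
    return 1
  fi
}

check_321 "production-db:ssd:onsite" \
          "local-backup:hdd:onsite" \
          "cloud-backup:object-storage:offsite"
```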

Backup Types

Full Backup: Complete copy of all data. Slowest to create, fastest to restore.

Incremental Backup: Only changes since last backup. Fastest to create, requires full + all incrementals to restore.

Differential Backup: Changes since last full backup. Balance between storage and restore speed.

Continuous Backup: Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
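
The restore-time difference between these types comes down to which earlier backups each one depends on. The sketch below resolves that dependency chain for the newest backup in a schedule. It assumes the definitions given above (incremental depends on the previous backup of any type, differential depends on the last full); the `label:type` input format and the function name are hypothetical.

```bash
#!/usr/bin/env bash
# Given backups in chronological order, print the ones needed to restore the
# newest backup. Types: full, diff (changes since last full), incr (changes
# since the previous backup of any type). Labels like "sun:full" are made up.

restore_chain() {
  local backups=("$@") chain=() i type
  i=$(( ${#backups[@]} - 1 ))
  while (( i >= 0 )); do
    type=${backups[i]##*:}
    chain=("${backups[i]}" "${chain[@]}")
    case $type in
      full) break ;;                       # a full backup ends the chain
      incr) i=$((i - 1)) ;;                # depends on the previous backup
      diff)                                # depends on the most recent full
        i=$((i - 1))
        while (( i >= 0 )) && [[ ${backups[i]##*:} != full ]]; do
          i=$((i - 1))
        done ;;
      *) echo "unknown backup type: $type" >&2; return 1 ;;
    esac
  done
  # A chain that does not start at a full backup cannot be restored.
  [[ ${chain[0]##*:} == full ]] || { echo "no full backup in chain" >&2; return 1; }
  printf '%s\n' "${chain[@]}"
}

restore_chain sun:full mon:incr tue:incr wed:diff thu:incr
```

For the example schedule the chain is `sun:full`, `wed:diff`, `thu:incr`: the differential lets the restore skip Monday's and Tuesday's incrementals.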

Quick Decision Framework

Step 1: Map RTO/RPO to Strategy

RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest

RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High

RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium

RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low

Step 2: Select Backup Tools by Use Case

Use Case	Primary Tool	Alternative	Key Feature
PostgreSQL production	pgBackRest	WAL-G	PITR, compression, multi-repo
MySQL production	Percona XtraBackup	WAL-G	Hot backups, incremental
MongoDB	Atlas Backup	mongodump	Continuous backup, PITR
Kubernetes cluster	Velero	ArgoCD + Git	PV snapshots, scheduling
File/object backup	Restic	Duplicity	Encryption, deduplication
Cross-region replication	Aurora Global DB	RDS Read Replica	Active-Active capable

Database Backup Patterns

PostgreSQL with pgBackRest

Use Case: Production PostgreSQL with < 5 minute RPO

Quick Start: See `examples/postgresql/pgbackrest-config/`

Configure continuous WAL archiving with full/differential/incremental backups to S3/GCS/Azure. Schedule weekly full and daily differential backups. Restore with `pgbackrest --stanza=main --delta restore`; for PITR, add `--type=time` and a `--target` timestamp.

Detailed Guide: `references/database-backups.md#postgresql`
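
A point-in-time restore needs a recovery target in addition to the plain restore command. The following sketch only prints the command so it can be reviewed before use. The stanza name `main` is taken from the text above; the timestamp, the choice of `--target-action=promote`, and the function name are assumptions for illustration. PostgreSQL must be stopped before the printed command is run.

```bash
#!/usr/bin/env bash
# Print (not run) a pgBackRest point-in-time restore command.
# The stanza name "main" comes from the text above; the target timestamp is
# a placeholder chosen for this example.

pitr_restore_cmd() {
  local stanza=$1 target=$2
  printf 'pgbackrest --stanza=%s --delta --type=time --target="%s" --target-action=promote restore\n' \
    "$stanza" "$target"
}

pitr_restore_cmd main "2025-01-15 09:30:00+00"
```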

MySQL with Percona XtraBackup

Use Case: MySQL production requiring hot backups

Quick Start: See `examples/mysql/xtrabackup/`

Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.

Detailed Guide: `references/database-backups.md#mysql`
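
The order of the restore steps matters, especially which prepare runs use `--apply-log-only`. The sketch below prints the sequence for one full backup plus any number of incrementals without executing anything. Directory paths and the function name are placeholders, and the decompress step only applies if the backups were taken with compression.

```bash
#!/usr/bin/env bash
# Print the restore sequence for one full backup plus zero or more
# incrementals. Commands are echoed, not executed; directory paths are
# placeholders for this sketch.

xtrabackup_restore_plan() {
  local full=$1; shift
  local incs=("$@") n=$# i dir

  # 1. Decompress every backup directory (only needed for compressed backups).
  for dir in "$full" "${incs[@]}"; do
    echo "xtrabackup --decompress --target-dir=$dir"
  done

  if (( n == 0 )); then
    # 2a. No incrementals: a plain prepare makes the full backup consistent.
    echo "xtrabackup --prepare --target-dir=$full"
  else
    # 2b. Prepare the base, then merge incrementals in order. Every step
    # except the last uses --apply-log-only so later increments still apply.
    echo "xtrabackup --prepare --apply-log-only --target-dir=$full"
    for (( i = 0; i < n; i++ )); do
      if (( i < n - 1 )); then
        echo "xtrabackup --prepare --apply-log-only --target-dir=$full --incremental-dir=${incs[i]}"
      else
        echo "xtrabackup --prepare --target-dir=$full --incremental-dir=${incs[i]}"
      fi
    done
  fi

  # 3. Copy the prepared files back into an empty MySQL data directory.
  echo "xtrabackup --copy-back --target-dir=$full"
}

xtrabackup_restore_plan /backups/full /backups/inc1 /backups/inc2
```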

MongoDB Backup

Quick Start: Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.

Detailed Guide: `references/database-backups.md#mongodb`

Kubernetes Disaster Recovery

Velero for Cluster Backups

Quick Start: `velero install --provider aws --bucket my-backups`

Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Velero supports selective restore (namespace mappings, storage class remapping).

Examples: `examples/kubernetes/velero/`

Detailed Guide: `references/kubernetes-dr.md`
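
The daily and hourly cadence described above maps onto two Velero schedules. The sketch below prints the corresponding commands for review rather than running them. The schedule names, the 02:00 start time, the TTL values, and the `production` namespace name are all assumptions for illustration.

```bash
#!/usr/bin/env bash
# Print Velero schedule commands for a daily cluster-wide backup and an
# hourly backup of one namespace. Names, times, and TTLs are placeholders.

velero_schedule_cmds() {
  local prod_ns=${1:-production}
  # Daily at 02:00, all namespaces, kept for 30 days.
  echo "velero schedule create daily-full --schedule=\"0 2 * * *\" --ttl 720h"
  # Hourly, production namespace only, kept for 7 days.
  echo "velero schedule create hourly-${prod_ns} --schedule=\"0 * * * *\" --include-namespaces ${prod_ns} --ttl 168h"
}

velero_schedule_cmds production
```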

etcd Backup

Quick Start: `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`

Create periodic etcd snapshots for control plane recovery. Restoring requires recreating the cluster from the snapshot data.

Examples: `examples/kubernetes/etcd/`
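
Periodic snapshots accumulate, so a rotation step usually accompanies the snapshot command. This is a minimal sketch that keeps the newest N snapshot files. It assumes snapshots are written with a sortable UTC timestamp in the file name (`snapshot-<timestamp>.db`), which is a naming convention introduced here, not one from the original.

```bash
#!/usr/bin/env bash
# Keep only the newest N etcd snapshots in a directory.
# Assumes snapshots are named snapshot-<UTC timestamp>.db so that a
# lexical sort is also a chronological sort; the naming scheme is an
# assumption of this sketch.

prune_snapshots() {
  local dir=$1 keep=$2 files=() excess i
  shopt -s nullglob
  files=("$dir"/snapshot-*.db)      # glob results are sorted lexically
  shopt -u nullglob
  excess=$(( ${#files[@]} - keep ))
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${files[i]}"
    echo "removed ${files[i]}"
  done
}

# A snapshot would be taken first, for example:
#   ETCDCTL_API=3 etcdctl snapshot save "/backups/etcd/snapshot-$(date -u +%Y%m%dT%H%M%SZ).db"
# followed by:
#   prune_snapshots /backups/etcd 7
```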

Cloud-Specific DR Patterns

AWS

Key Services:

RDS: Automated backups (retention configurable up to 35 days), PITR, Multi-AZ
Aurora Global DB: Cross-region active-passive with automatic failover
S3 CRR: Cross-region replication with 15-min SLA (Replication Time Control)

Examples: `examples/cloud/aws/`

Detailed Guide: `references/cloud-dr-patterns.md#aws`

GCP

Key Services:

Cloud SQL: PITR with 7-day transaction logs, 30-day retention
GCS Multi-Regional: Automatic replication across 100+ mile separation
Regional HA: Synchronous replication within region

Detailed Guide: `references/cloud-dr-patterns.md#gcp`

Azure

Key Services:

Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
Geo-Redundant Storage: Automatic replication to secondary region

Detailed Guide: `references/cloud-dr-patterns.md#azure`

Cross-Region Replication Patterns

Pattern	RTO	RPO	Cost	Use Case
Active-Active	< 1 min	< 1 min	High	Both regions serve traffic
Active-Passive	15-60 min	5-15 min	Medium	Standby for failover
Pilot Light	10-30 min	5-15 min	Low	Minimal secondary infra
Warm Standby	5-15 min	5-15 min	Med-High	Scaled-down secondary

Implementation Examples:

PostgreSQL streaming replication (Active-Passive)
Aurora Global Database (Active-Active)
ASG scale-up automation (Pilot Light)

Detailed Guide: `references/cross-region-replication.md`

Testing Disaster Recovery

Chaos Engineering

Purpose: Validate DR procedures through controlled failure injection.

Test Scenarios:

Database failover (stop primary, measure promotion time)
Region failure (block network, trigger DNS failover)
Kubernetes recovery (delete namespace, restore from Velero)

Tools: Chaos Mesh, Gremlin, Litmus, Toxiproxy

Examples: `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`

Detailed Guide: `references/chaos-engineering.md`
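
Each scenario above ends with a measurement: how long until the service is healthy again. A minimal sketch of that measurement follows. It polls a caller-supplied health check and reports the elapsed seconds, which can then be compared with the RTO. The function name and the example host are hypothetical, and this is not the content of the referenced example scripts.

```bash
#!/usr/bin/env bash
# Poll a health check until it passes and report how long recovery took.
# The check command, timeout, and polling interval are supplied by the
# caller; nothing here is specific to one database or cloud.

measure_recovery() {
  local timeout=$1 interval=$2; shift 2
  local start=$SECONDS elapsed
  while true; do
    elapsed=$(( SECONDS - start ))
    if "$@" >/dev/null 2>&1; then
      echo "recovered after ${elapsed}s"
      return 0
    fi
    if (( elapsed >= timeout )); then
      echo "not recovered within ${timeout}s"
      return 1
    fi
    sleep "$interval"
  done
}

# Example for a database failover test (host name is a placeholder):
#   measure_recovery 300 5 pg_isready -h db.example.internal
```

Setting the timeout to the RTO turns the function into a pass/fail check: a non-zero exit status means the drill exceeded the objective.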

Automated DR Drills

Run Monthly Tests:

```bash
./scripts/dr-drill.sh --environment staging --test-type full
./scripts/test-restore.sh --backup latest --target staging-db
```

Compliance and Retention

Regulation	Retention	Requirements
GDPR	No fixed period; retain only as long as necessary (often 1-7 years by policy)	EU data residency, right to erasure
SOC 2	1 year+	Secure deletion, access controls
HIPAA	6 years	Encryption, PHI protection
PCI DSS	3mo-1yr	Secure deletion, quarterly reviews

Implement with S3/GCS lifecycle policies: 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive
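
The age thresholds in the lifecycle line can be read as a simple step function. The sketch below expresses it using Amazon S3 storage class identifiers; the function name is hypothetical, and GCS uses different class names for the same idea.

```bash
#!/usr/bin/env bash
# Return the S3 storage class a backup object should be in, given its age
# in days, following the 30/90/365-day thresholds above.

storage_class_for_age() {
  local age_days=$1
  if   (( age_days >= 365 )); then echo "DEEP_ARCHIVE"
  elif (( age_days >= 90 ));  then echo "GLACIER"
  elif (( age_days >= 30 ));  then echo "STANDARD_IA"
  else                             echo "STANDARD"
  fi
}

storage_class_for_age 120   # prints GLACIER
```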

Immutable backups: Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.

Detailed Guide: `references/compliance-retention.md`

Monitoring and Alerting

Key Metrics: Backup success rate, duration, time since last backup, RPO breach, storage utilization

Prometheus Alerts: VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend

Validation Scripts:

```bash
./scripts/validate-backup.sh --backup latest --verify-integrity
./scripts/check-retention.sh --report-violations
./scripts/generate-dr-report.sh --format pdf
```
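
Of the metrics listed, "time since last backup" and "RPO breach" are directly related: a breach occurs when the newest backup is older than the RPO. A minimal sketch of that check follows, assuming timestamps are Unix epoch seconds. The function name is hypothetical and this is not the content of the scripts listed above.

```bash
#!/usr/bin/env bash
# Compare the age of the most recent backup with the RPO.
# Arguments: last backup time (epoch seconds), RPO (seconds), and an
# optional "now" (epoch seconds) that defaults to the current time.

rpo_status() {
  local last_backup_epoch=$1 rpo_seconds=$2 now=${3:-$(date +%s)}
  local age=$(( now - last_backup_epoch ))
  if (( age > rpo_seconds )); then
    echo "BREACH age=${age}s rpo=${rpo_seconds}s"
    return 1
  fi
  echo "OK age=${age}s rpo=${rpo_seconds}s"
}

# A backup taken 200 seconds ago against a 5-minute RPO:
rpo_status 1000 300 1200   # prints "OK age=200s rpo=300s"
```

The non-zero exit status on a breach lets the function be wired into a cron job or an alerting wrapper without parsing its output.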

Automation and Runbooks

Automate Backup Schedules: Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)

DR Runbook Steps: Detect failure → Verify secondary → Promote → Update DNS → Notify → Document

Detailed Guide: `references/runbook-automation.md`
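
The runbook sequence can be automated as an ordered list of steps that halts on the first failure, so that DNS is never updated after a failed promotion. The sketch below shows only that control flow. Every step body is a placeholder that prints a message, and all function names are hypothetical.

```bash
#!/usr/bin/env bash
# Run runbook steps in order and stop at the first failure.
# Each step is a shell function; the bodies below are placeholders that
# only print what a real step would do.

detect_failure()   { echo "primary is unreachable"; }
verify_secondary() { echo "secondary is healthy and caught up"; }
promote()          { echo "secondary promoted"; }
update_dns()       { echo "DNS now points at the new primary"; }
notify()           { echo "on-call and stakeholders notified"; }
document()         { echo "timeline recorded"; }

run_runbook() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "step failed: $step (stopping)" >&2
      return 1
    fi
  done
  echo "runbook complete"
}

run_runbook detect_failure verify_secondary promote update_dns notify document
```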

Integration with Other Skills

Related Skills

Skill Chaining Example

infrastructure-as-code → secret-management → disaster-recovery → observability
       ↓                        ↓                   ↓                ↓
  Create S3 buckets      Store encryption     Configure backups   Monitor jobs
  Provision databases    keys in Vault        Set up replication  Alert failures
  Setup VPCs             Manage credentials   Test DR drills      Track metrics

Best Practices

Do

✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD

Don't

✗ Assume backups work without testing
✗ Store all backups in a single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Back up only from the primary database when it is under high load
✗ Store encryption keys with backups

Reference Documentation

RTO/RPO Planning: `references/rto-rpo-planning.md`
Database Backups: `references/database-backups.md`
Kubernetes DR: `references/kubernetes-dr.md`
Cloud DR Patterns: `references/cloud-dr-patterns.md`
Cross-Region Replication: `references/cross-region-replication.md`
Chaos Engineering: `references/chaos-engineering.md`
Compliance Requirements: `references/compliance-retention.md`
Runbook Automation: `references/runbook-automation.md`

Examples

Runbooks: `examples/runbooks/database-failover.md`, `examples/runbooks/region-failover.md`
PostgreSQL: `examples/postgresql/pgbackrest-config/`, `examples/postgresql/walg-config/`
MySQL: `examples/mysql/xtrabackup/`, `examples/mysql/walg/`
Kubernetes: `examples/kubernetes/velero/`, `examples/kubernetes/etcd/`
Cloud: `examples/cloud/aws/`, `examples/cloud/gcp/`, `examples/cloud/azure/`
Chaos: `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`

Scripts

`scripts/validate-backup.sh`: Verify backup integrity
`scripts/test-restore.sh`: Automated restore testing
`scripts/dr-drill.sh`: Run full DR drill
`scripts/check-retention.sh`: Verify retention policies
`scripts/generate-dr-report.sh`: Compliance reporting
