Backup and Disaster Recovery

Plan for the worst case: the database is gone, the host is down for a week, the deploy was poisoned, ransomware encrypted everything. The skill is in advance preparation, not reaction.

When to use

  • Setting up backups for a new system
  • Reviewing and validating backup architecture
  • Defining RPO (recovery point objective) and RTO (recovery time objective)
  • Running a disaster recovery drill
  • Diagnosing gaps after an incident
  • Planning for ransomware, data corruption, or insider threats
  • Migrating to a new platform (DR planning belongs in the migration plan)

When NOT to use

  • Active incident response (use incident-response)
  • Routine deploy rollbacks (use launch-runbook)
  • Code or content versioning (covered by Git, CMS revision history)
  • Routine database snapshots (use this skill to set them up; routine review goes in monitoring)

Required inputs

  • The systems in scope (databases, file storage, code, configs, secrets)
  • The hosting platforms and providers
  • Existing backup tooling and what it covers
  • Tolerance for data loss (in time)
  • Tolerance for downtime (in time)
  • Compliance requirements (some regulations mandate specific backup standards)

The framework: 4 questions

Every disaster recovery plan answers four questions explicitly.

Question 1: What needs to be recoverable?

List every system that holds state. Categorize by criticality.
Tier 1: must recover. Without it, the business stops. (Customer database, transaction log, primary content store.)
Tier 2: should recover. Loss is painful but not fatal. (Analytics, logs, secondary services.)
Tier 3: nice to recover. Easy to rebuild. (Caches, derived data, temporary state.)
The tier drives RPO, RTO, backup frequency, and storage spend.

Question 2: How much data loss is acceptable? (RPO)

RPO is the maximum amount of data you can accept losing, measured in time: how old the most recent recoverable copy is allowed to be.
  • RPO = 1 hour: hourly backups or continuous replication needed
  • RPO = 1 day: daily backups acceptable
  • RPO = 1 week: weekly backups acceptable
For most production data, an RPO of 1 hour or less is the target. Critical financial systems need a near-zero RPO, which means continuous replication.
For derived or rebuildable data, RPO of 1 day or longer is fine.
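As a rough illustration of how backup frequency relates to RPO, the sketch below computes the worst-case data loss of a periodic schedule. The function names are hypothetical, and the model assumes each backup captures the state at the moment it starts and only becomes usable once it finishes.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Longest stretch of writes that can be lost under a periodic schedule.

    Assumption: a backup captures the state at its start time and is usable
    only after it completes. A failure just before the next backup completes
    therefore loses one full interval plus the backup's run time.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule's worst-case loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    # Hourly backups that complete instantly
    print(meets_rpo(timedelta(hours=1), rpo))
    # Hourly backups that take ten minutes to complete
    print(meets_rpo(timedelta(hours=1), rpo, timedelta(minutes=10)))
```

Under this model, an hourly backup that takes ten minutes to finish does not quite meet a 1-hour RPO, which is one reason to schedule more often than the target strictly suggests.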

Question 3: How much downtime is acceptable? (RTO)

RTO is the maximum time to restore service after a disaster.
RTO target | Implies
< 5 minutes | Hot standby with automatic failover
< 1 hour | Warm standby with manual failover or fast restore from recent snapshot
< 24 hours | Cold backup with documented restore process
Days to weeks | Best-effort, accept extended downtime
RTO drives architecture spend. Aggressive RTOs (< 1 hour) are expensive. Loose RTOs (days) are cheap.
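The RTO table can be expressed as a simple lookup. This is a minimal sketch with a hypothetical function name; the thresholds and descriptions are taken directly from the table, and anything at 24 hours or longer falls into the best-effort row.

```python
from datetime import timedelta


def recovery_approach(rto: timedelta) -> str:
    """Map an RTO target to the architecture it implies, per the table."""
    if rto < timedelta(minutes=5):
        return "Hot standby with automatic failover"
    if rto < timedelta(hours=1):
        return "Warm standby with manual failover or fast restore from recent snapshot"
    if rto < timedelta(hours=24):
        return "Cold backup with documented restore process"
    return "Best-effort, accept extended downtime"


if __name__ == "__main__":
    for target in (timedelta(minutes=2), timedelta(minutes=30),
                   timedelta(hours=8), timedelta(days=3)):
        print(target, "->", recovery_approach(target))
```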

Question 4: What's the disaster?

Plan for specific scenarios. Each has different implications.
Hardware failure. Disk dies. Standard backups solve this. Most modern hosts handle this automatically.
Provider outage. Region or vendor goes down. Cross-region or cross-provider redundancy needed for low RTO.
Data corruption. Bad migration, bug, accidental delete. Point-in-time restore needed. The latest backup might be corrupted; you need history.
Ransomware or compromise. Attacker encrypts or deletes. Backups must be immutable or air-gapped, otherwise the attacker takes them too.
Account compromise. Attacker has admin credentials, deletes everything. Same defense as ransomware: immutable backups, separate access control.
Vendor lock-out. Account suspended, billing dispute, vendor disappears. Backups outside the vendor needed.
Insider threat. Disgruntled employee deletes or exfiltrates. Audit logs, separation of duties, immutable backups.
A backup strategy that handles only hardware failure isn't a strategy. It's the easiest case.

Workflow

Step 1: Inventory state

Every system that holds state goes on a list:
System | Data type | Tier | Current backup | Tested?
If you can't list it, you can't protect it. Often the inventory itself reveals gaps (the "we forgot about that database" moment).
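One way to keep the inventory honest is to hold it as structured data and let a script flag the gaps. The sketch below mirrors the inventory columns; the class, function, and sample systems are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StatefulSystem:
    name: str
    data_type: str
    tier: int                      # 1 = must recover, 2 = should, 3 = nice to
    current_backup: Optional[str]  # None means no backup exists
    restore_tested: bool


def find_gaps(inventory: List[StatefulSystem]) -> List[str]:
    """Report systems with no backup or an untested one, most critical first."""
    gaps = []
    for system in sorted(inventory, key=lambda s: s.tier):
        if system.current_backup is None:
            gaps.append(f"Tier {system.tier} {system.name}: no backup")
        elif not system.restore_tested:
            gaps.append(f"Tier {system.tier} {system.name}: restore never tested")
    return gaps


if __name__ == "__main__":
    # Sample entries are invented for illustration.
    inventory = [
        StatefulSystem("analytics-db", "events", 2, None, False),
        StatefulSystem("customer-db", "relational", 1, "daily snapshot", False),
        StatefulSystem("page-cache", "derived", 3, "none needed", True),
    ]
    for gap in find_gaps(inventory):
        print(gap)
```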

Step 2: Set RPO and RTO per tier

For each tier, agree on RPO and RTO. Get sign-off from the people who'd be impacted by a disaster.
Push back on aspirational targets that aren't backed by infrastructure spend. RTO of 5 minutes for a system without a hot standby is not real.

Step 3: Verify or design backup architecture

For each system, ensure:
  • Frequency matches RPO.
  • Retention covers point-in-time recovery (typically 30+ days for production data).
  • Storage location is separate from the source. Same disk, same account, same region: not enough.
  • Immutability or write-once storage for at least some backup copies. Defends against ransomware.
  • Encryption at rest. Standard for compliance.
  • Tested restore procedure. Untested backups are not backups.
The "3-2-1 rule" is a useful starting point: 3 copies of data, 2 different storage types, 1 offsite (or off-account, off-platform).
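The 3-2-1 rule is mechanical enough to check in code. A minimal sketch follows, assuming the primary counts as one of the three copies and that "offsite" is recorded per copy; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataCopy:
    location: str      # free-form label, e.g. an account or region name
    storage_type: str  # e.g. "block volume", "object store", "tape"
    offsite: bool      # off-site, off-account, or off-platform


def satisfies_3_2_1(copies: List[DataCopy]) -> bool:
    """3 copies of the data, 2 different storage types, 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_types = len({c.storage_type for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_types and has_offsite


if __name__ == "__main__":
    copies = [
        DataCopy("primary", "block volume", False),
        DataCopy("same-account snapshot", "block volume", False),
        DataCopy("second provider", "object store", True),
    ]
    print(satisfies_3_2_1(copies))
```

A check like this covers only the copy count and placement. It says nothing about immutability, encryption, or whether a restore has been tested, which still need their own verification.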

Step 4: Document the restore runbook

For each system, write the runbook:
  1. How to detect the disaster (cross-reference monitoring)
  2. How to decide to restore (decision criteria, who authorizes)
  3. The exact restore steps (commands, screenshots, sequence)
  4. How to verify the restore worked
  5. How to switch traffic back
  6. Communication template (status page, customer notice)
The runbook is for the worst night of someone's career. Write it for tired, panicked you.

Step 5: Run a drill

The first restore should never be during a real disaster.
Drills can be:
  • Tabletop: walk through the runbook on paper. Useful for finding gaps in the plan.
  • Partial: restore to a non-production environment. Verify the data, validate the steps.
  • Full: simulate the disaster. Production failover or full restore. Maximum confidence, maximum risk.
For most teams: quarterly tabletop, annual partial drill, full drill before major launches or after major architecture changes.
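The cadence above can be tracked with a small overdue check. This sketch approximates "quarterly" as 91 days and "annual" as 365 days, which is an assumption; full drills are left out because they are tied to events (launches, architecture changes) rather than a calendar interval. All names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

# Approximate calendar intervals for the recurring drill types.
DRILL_INTERVALS = {
    "tabletop": timedelta(days=91),   # roughly quarterly
    "partial": timedelta(days=365),   # roughly annual
}


def overdue_drills(last_run: Dict[str, date], today: date) -> List[str]:
    """Return drill types that were never run or are past their interval."""
    overdue = []
    for kind, interval in DRILL_INTERVALS.items():
        last = last_run.get(kind)
        if last is None or today - last > interval:
            overdue.append(kind)
    return overdue


if __name__ == "__main__":
    last_run = {"tabletop": date(2024, 1, 1), "partial": date(2024, 1, 1)}
    print(overdue_drills(last_run, today=date(2024, 6, 1)))
```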

Step 6: Document drill results

After each drill, document:
  • What was tested
  • What worked
  • What broke
  • What the actual RPO and RTO were (vs. targets)
  • Action items
If the actual RTO was 6 hours when the target was 1 hour, the target is fiction. Either fix the gap or revise the target.
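Comparing measured values against targets is simple to automate once drill results are recorded. The sketch below uses a hypothetical record type and the 6-hour versus 1-hour example from the text; the system name is invented.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class DrillResult:
    system: str
    target_rpo: timedelta
    actual_rpo: timedelta  # age of the newest data recovered in the drill
    target_rto: timedelta
    actual_rto: timedelta  # time from declared disaster to restored service


def target_misses(result: DrillResult) -> List[str]:
    """List every target the drill failed to meet."""
    misses = []
    if result.actual_rpo > result.target_rpo:
        misses.append(f"RPO: target {result.target_rpo}, actual {result.actual_rpo}")
    if result.actual_rto > result.target_rto:
        misses.append(f"RTO: target {result.target_rto}, actual {result.actual_rto}")
    return misses


if __name__ == "__main__":
    drill = DrillResult(
        system="customer-db",
        target_rpo=timedelta(hours=1), actual_rpo=timedelta(minutes=20),
        target_rto=timedelta(hours=1), actual_rto=timedelta(hours=6),
    )
    for miss in target_misses(drill):
        print(f"{drill.system}: {miss}")
```

Each miss should become an action item: either close the gap or revise the target.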

Step 7: Schedule the next drill

Calendar it. Assign an owner. Backups that aren't drilled drift toward useless.

Special topics

Database point-in-time recovery

Many managed databases offer point-in-time recovery (PITR) within a retention window (often 7-35 days). This typically achieves RPO of seconds to minutes.
For longer retention, schedule periodic exports to immutable storage.
PITR alone isn't enough. If the database service itself is compromised, PITR is gone too. Always have at least one backup outside the source service.
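To see why the PITR window and the periodic exports complement each other, consider corruption that is discovered late. The sketch below picks a restore source from the time the corruption began; the function name, return labels, and dates are hypothetical, and it assumes exports taken before the corruption are clean.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def choose_restore_source(corrupted_at: datetime, now: datetime,
                          pitr_window: timedelta,
                          exports: List[datetime]) -> Tuple[str, Optional[datetime]]:
    """Pick where to restore from, given when the corruption began."""
    # Still inside the PITR window: restore to just before the corruption.
    if now - corrupted_at <= pitr_window:
        return ("pitr", corrupted_at)
    # Too old for PITR: fall back to the newest export taken before it.
    clean_exports = [e for e in exports if e < corrupted_at]
    if clean_exports:
        return ("export", max(clean_exports))
    # Nothing predates the corruption: the data is not recoverable this way.
    return ("none", None)


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    exports = [datetime(2024, 1, 1), datetime(2024, 2, 1)]
    window = timedelta(days=7)
    print(choose_restore_source(datetime(2024, 2, 27), now, window, exports))
    print(choose_restore_source(datetime(2024, 1, 15), now, window, exports))
```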

File storage backups

Object stores (S3, GCS, Azure Blob) usually offer:
  • Versioning (recover overwritten objects)
  • Replication (cross-region)
  • Object lock or immutability (defense against deletion)
Set all three for production-critical buckets. Don't rely on the storage provider's default retention.

Code and config backups

Code lives in Git. The Git host (GitHub, GitLab, etc.) is your backup, but a single host is a single point of failure.
For high-criticality code:
  • Mirror to a second host or your own server
  • Periodic offline exports
Configs and secrets need separate handling:
  • Infrastructure-as-code: in Git, mirrored
  • Runtime configs: backed up alongside the system
  • Secrets: in a secret manager with its own backup story

Backups of backups

The backup system itself can fail. Backup metadata, backup credentials, encryption keys: all must be backed up.
If your backup is encrypted with a key you've lost, the backup is useless.

Compliance backups

Some regulations require specific retention (e.g., 7 years for financial data). Comply with the highest applicable standard.
Don't conflate compliance retention with operational backup. Compliance often allows a much slower restore: you only need to be able to produce the data eventually.

Failure patterns

Untested backups. The single most common failure. Backups appear to work; restore fails. Test.
Backups in the same account or region as the source. Account compromise or region outage takes both.
No immutability. Ransomware encrypts the backups too. Use object lock or air-gapped storage.
RTO and RPO that aren't measured. Target says "1 hour" but no one has verified the actual RTO. Assume the actual is longer than the target until proven otherwise.
Restore runbook only in someone's head. Person leaves or is unavailable; runbook is gone. Document.
Backups but no DR plan. "We have backups" isn't a plan. The plan is the runbook plus the architecture plus the drilling.
Optimism bias. "It won't happen to us." It happens. Plan as if it will.
Backups too old or too new. A single fresh copy may already contain corruption that hasn't been discovered yet; a single old copy loses too much data. You want point-in-time history: daily snapshots with 30+ day retention, or continuous replication with separate periodic snapshots for history.
Skipping drills "because we're busy." Then you'll be busier during the disaster.
No communication plan. Restoring data is half the job. Telling customers, stakeholders, and internal teams what's happening is the other half.

Output format

A DR plan document includes:
  • Inventory: every stateful system
  • Tiering: criticality per system
  • Targets: RPO and RTO per tier
  • Architecture: backup tooling, frequency, storage, immutability
  • Runbooks: restore procedures per system
  • Drill schedule: what gets tested when
  • Drill log: results of past drills
  • Communication templates: what to say during a real DR event

Reference files

  • references/restore-runbook-template.md: Fillable template for a restore runbook, covering detection, authorization, steps, verification, and rollback.