Disaster Recovery Planning
Status: Complete
Category: Infrastructure
Default enforcement: Soft
Author: PushBackLog team
Tags
- Topic: infrastructure, reliability, operations
- Skillset: devops, engineering-management
- Technology: generic
- Stage: planning, operations
Summary
Disaster recovery (DR) is the capability to restore a system to an operational state after a significant failure event: a data centre outage, data corruption, a ransomware attack, accidental deletion, or infrastructure failure at scale. DR planning defines the Recovery Time Objective (RTO), the maximum tolerable downtime, and the Recovery Point Objective (RPO), the maximum acceptable data loss measured in time. Without a tested DR plan, the first time a team discovers its recovery capabilities is during a real disaster, which is the worst possible time.
Rationale
Untested DR plans are not DR plans
Many organisations have documented DR procedures that have never been executed. The documentation was written when the system was designed and has diverged from reality over time: scripts fail because dependencies have changed, backups exist but restore procedures are broken, and the team members named in the runbook have changed roles. A DR plan that has not been practised provides only a false sense of security.
RTO and RPO drive architecture decisions
The business needs to define how much downtime and how much data loss are acceptable for each system. These are business questions, not engineering questions, but engineering must quantify the cost of meeting each target. An RTO of 4 hours and an RPO of 24 hours is achievable with simple backup-and-restore procedures; an RTO of 15 minutes and an RPO of zero requires an active-active multi-region architecture. Designing to the wrong target either wastes money or fails the business.
Guidance
Core concepts
| Concept | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | "4 hours": the system must be restored within 4 hours of incident declaration |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | "1 hour": no more than 1 hour of transactions may be lost |
| MTTR (Mean Time to Recover) | Average time to recover from a failure | Historical measure of actual recovery performance |
| DR Tier | Criticality classification of the system | Tier 1 = mission-critical; Tier 3 = low-priority internal tools |
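The RPO concept above can be checked mechanically as part of routine monitoring. A minimal sketch in bash, assuming backups land as files in a single directory; the directory path, file layout, and GNU `stat` are assumptions for illustration, not details from this guidance:

```shell
#!/usr/bin/env bash
# Hedged sketch: fail if the newest backup is older than the RPO allows.
# The directory layout and GNU stat (-c %Y) are assumptions.
check_rpo() {
  local backup_dir="$1" rpo_seconds="$2"
  local newest age
  # Newest file in the backup directory (simple layout assumed)
  newest=$(ls -t "$backup_dir" | head -n 1)
  if [ -z "$newest" ]; then
    echo "FAIL: no backups found in $backup_dir"
    return 1
  fi
  # Age of the newest backup in seconds
  age=$(( $(date +%s) - $(stat -c %Y "$backup_dir/$newest") ))
  if [ "$age" -le "$rpo_seconds" ]; then
    echo "OK: newest backup is ${age}s old, within the ${rpo_seconds}s RPO"
  else
    echo "FAIL: newest backup is ${age}s old, exceeds the ${rpo_seconds}s RPO"
    return 1
  fi
}
```

Wired into an alerting cron job, a check like this turns the RPO from a document statement into something that pages someone when backups fall behind.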
DR strategies by RTO/RPO target
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup and restore | Hours to days | Hours to days | Low | Restore from backup to new infrastructure on failure |
| Pilot light | < 1 hour | Minutes | Medium | Minimal replica running; scale up on failover |
| Warm standby | Minutes | Seconds to minutes | Medium-high | Full but scaled-down replica always running |
| Active-active | Seconds | Near-zero | High | Multiple regions all serving traffic |
DR runbook structure
A DR runbook for each system should include:
```markdown
# [Service Name] Disaster Recovery Runbook

## RTO: 2 hours | RPO: 4 hours

## Failure Scenarios Covered
- [ ] Database failure (primary)
- [ ] Application server failure
- [ ] Region-wide outage
- [ ] Data corruption / accidental deletion

## Recovery Procedure

### Step 1: Declare a disaster
- Incident commander: [name or role]
- Notify: [stakeholder list]
- Communication channel: #incidents in Slack

### Step 2: Assess the failure
- Determine scope: single service, full region, data corruption?
- Check AWS Health Dashboard / provider status page
- Estimated recovery time: X hours

### Step 3: Initiate recovery
[Detailed, copy-pasteable commands for each scenario]

### Step 4: Validate recovery
- [ ] Application health check endpoint returns 200
- [ ] Smoke test suite passes against restored environment
- [ ] Data integrity check: [specific query or test]
- [ ] External integrations functional

### Step 5: Communicate restoration
- Notify stakeholders: system restored at [time]
- Log timeline of events for post-mortem
```
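The first validation check in Step 4 can be scripted so it runs identically in every drill. A minimal sketch; the health-check URL is whatever your service exposes, supplied by the caller:

```shell
#!/usr/bin/env bash
# Hedged sketch of the Step 4 health check; the URL is a placeholder
# supplied by the caller, not an endpoint defined in this guidance.
validate_health() {
  local url="$1" code
  # -s silent, -o discard body, -w print only the HTTP status code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$code" = "200" ]; then
    echo "PASS: health check returned 200"
  else
    echo "FAIL: health check returned $code"
    return 1
  fi
}
```

The same pattern extends to the other Step 4 items: each check prints PASS or FAIL and returns a non-zero exit code on failure, so the whole validation can run as one script with a clear overall verdict.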
Database failover
```bash
# AWS RDS: promote the read replica to a standalone primary
aws rds promote-read-replica \
  --db-instance-identifier myapp-prod-replica \
  --backup-retention-period 7

# Update the application's DB connection string to point to the promoted
# replica (this step should be automated via DNS failover or an AWS
# Route 53 health check)
```
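The DNS cutover itself can be scripted with a Route 53 change batch. A sketch only: the hosted zone ID, record name, and replica endpoint below are illustrative placeholders, not values tied to this runbook:

```shell
#!/usr/bin/env bash
# Hedged sketch: automate the DNS cutover with Route 53. The zone ID,
# record name, and replica endpoint are placeholders.
cat > failover.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "db.myapp.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [
        {"Value": "myapp-prod-replica.abc123.eu-west-1.rds.amazonaws.com"}
      ]
    }
  }]
}
EOF
# Apply the change (requires AWS credentials; shown for illustration only):
# aws route53 change-resource-record-sets \
#   --hosted-zone-id Z123EXAMPLE --change-batch file://failover.json
```

A low TTL (60 seconds here) keeps client caches short so the cutover propagates quickly; setting it ahead of time is part of DR preparation, not the recovery itself.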
Infrastructure as code accelerates recovery
When all infrastructure is defined in Terraform, provisioning a new environment in a different region takes only a few commands:
```bash
# In a disaster scenario: provision a fresh environment in the target region
cd infrastructure/
terraform workspace new dr-recovery
terraform apply \
  -var="region=eu-west-1" \
  -var="environment=dr" \
  -var="db_snapshot_id=rds:myapp-prod-2024-01-15-00-00"
```
Without IaC, manual re-provisioning can take days.
DR testing schedule
| Activity | Frequency | What to validate |
|---|---|---|
| Backup restore drill | Monthly | Backup is valid, restore procedure works |
| Runbook review | Quarterly | Steps are accurate, contacts are current |
| Full DR exercise | Annually (min) | Full recovery in a test environment |
| Game day (chaos event) | Annually | Team can execute under simulated stress |
During a DR exercise, attempt the actual recovery procedures in a non-production environment and time the recovery against the RTO. Treat broken steps as findings, not failures, and update the runbook accordingly.
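Timing the drill against the RTO is worth standardising so results are comparable across exercises. A small wrapper sketch; the recovery command and RTO value are whatever your runbook specifies:

```shell
#!/usr/bin/env bash
# Hedged sketch: run a recovery drill command and compare elapsed time
# against the RTO. The command and RTO value are placeholders.
time_drill() {
  local rto_seconds="$1"; shift
  local start end elapsed
  start=$(date +%s)
  "$@"                      # run the actual recovery procedure
  end=$(date +%s)
  elapsed=$(( end - start ))
  if [ "$elapsed" -le "$rto_seconds" ]; then
    echo "PASS: recovered in ${elapsed}s, within the ${rto_seconds}s RTO"
  else
    echo "FAIL: recovery took ${elapsed}s, RTO is ${rto_seconds}s"
    return 1
  fi
}
```

For example, `time_drill 7200 ./restore-from-backup.sh` (a hypothetical restore script) records whether the drill met a 2-hour RTO; keeping these results over time shows whether recovery is getting faster or slower.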
Review checklist
- RTO and RPO are formally documented and agreed with the business
- A written runbook exists for each system tier
- The most recent backup has been restored successfully (tested, not assumed)
- The runbook has been executed by the team at least once in the last 12 months
- Contacts in the runbook are current
- Infrastructure can be re-provisioned from IaC without manual steps
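The "tested, not assumed" checklist item can be partially automated with a cheap integrity gate that runs before the full restore drill. A minimal sketch, assuming gzip-compressed dump files; a real drill should still restore into an actual database:

```shell
#!/usr/bin/env bash
# Hedged sketch: cheap integrity check on a compressed backup before the
# full restore drill. gzip -t verifies the archive without unpacking it.
verify_backup() {
  local backup="$1"
  if gzip -t "$backup" 2>/dev/null; then
    echo "OK: $backup passes integrity check"
  else
    echo "FAIL: $backup is corrupt or unreadable"
    return 1
  fi
}
```

This catches truncated or corrupted archives early, but it proves only that the file is readable; the monthly restore drill remains the test that the backup actually contains a restorable database.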