Disaster Recovery Planning

Status: Complete
Category: Infrastructure
Default enforcement: Soft
Author: PushBackLog team


Tags

  • Topic: infrastructure, reliability, operations
  • Skillset: devops, engineering-management
  • Technology: generic
  • Stage: planning, operations

Summary

Disaster recovery (DR) is the capability to restore a system to an operational state following a significant failure event — data centre outage, data corruption, ransomware attack, accidental deletion, or infrastructure failure at scale. DR planning defines the Recovery Time Objective (RTO), the maximum tolerable downtime, and Recovery Point Objective (RPO), the maximum acceptable data loss in time. Without a tested DR plan, the first time a team discovers its recovery capabilities is during a real disaster — the worst possible time.


Rationale

Untested DR plans are not DR plans

Many organisations have documented DR procedures that have never been executed. The documentation was written when the system was designed and has diverged from reality over time. Scripts don’t work because dependencies have changed. Backups exist, but restore procedures are broken. The team members named in the runbook have changed roles. A DR plan that has not been practised provides only a false sense of security.

RTO and RPO drive architecture decisions

The business needs to define how much downtime and how much data loss are acceptable for each system. These are business questions, not engineering questions, but engineering must quantify the cost of meeting each target. RTO = 4 hours and RPO = 24 hours is achievable with simple backup-restore procedures. RTO = 15 minutes and RPO = 0 requires active-active multi-region architecture. Designing to the wrong target wastes money or fails the business.
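The RPO target also bounds the backup schedule: a backup taken every N hours implies up to N hours of data loss in the worst case. A trivial sanity check, with made-up numbers (the 4-hour RPO and 6-hour interval below are illustrative):

```shell
# Compare the backup interval against the RPO target (values are illustrative).
RPO_HOURS=4
BACKUP_INTERVAL_HOURS=6

if [ "$BACKUP_INTERVAL_HOURS" -gt "$RPO_HOURS" ]; then
  echo "Backup interval (${BACKUP_INTERVAL_HOURS}h) cannot meet RPO (${RPO_HOURS}h); tighten the schedule"
else
  echo "Backup interval is consistent with the RPO"
fi
```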


Guidance

Core concepts

| Concept | Definition | Example |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | "4 hours" — system must be restored within 4 hours of incident declaration |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | "1 hour" — no more than 1 hour of transactions may be lost |
| MTTR (Mean Time to Recover) | Average time to recover from a failure | Historical measure of actual recovery performance |
| DR Tier | Criticality classification of the system | Tier 1 = mission-critical; Tier 3 = low-priority internal tools |
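Of these, MTTR is the only measured quantity, and it is worth recomputing regularly so it can be compared against the RTO. A minimal sketch, assuming a list of historical recovery durations in minutes (the numbers are invented):

```shell
# Average historical recovery durations (minutes) to get MTTR.
durations="42 118 67 95"   # invented per-incident recovery times

mttr=$(echo "$durations" | awk '{ for (i = 1; i <= NF; i++) sum += $i; printf "%.1f", sum / NF }')
echo "MTTR: ${mttr} minutes"
```

If MTTR trends toward the RTO, the margin for error during a real incident is shrinking.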

DR strategies by RTO/RPO target

| Strategy | RTO | RPO | Cost | Description |
| --- | --- | --- | --- | --- |
| Backup and restore | Hours to days | Hours to days | Low | Restore from backup to new infrastructure on failure |
| Pilot light | < 1 hour | Minutes | Medium | Minimal replica running; scale up on failover |
| Warm standby | Minutes | Seconds to minutes | Medium-high | Full but scaled-down replica always running |
| Active-active | Seconds | Near-zero | High | Multiple regions all serving traffic |

DR runbook structure

A DR runbook for each system should include:

# [Service Name] Disaster Recovery Runbook

## RTO: 2 hours  |  RPO: 4 hours

## Failure Scenarios Covered
- [ ] Database failure (primary)
- [ ] Application server failure
- [ ] Region-wide outage
- [ ] Data corruption / accidental deletion

## Recovery Procedure

### Step 1: Declare a disaster
- Incident commander: [name or role]
- Notify: [stakeholder list]
- Communication channel: #incidents in Slack

### Step 2: Assess the failure
- Determine scope: single service, full region, data corruption?
- Check AWS Health Dashboard / provider status page
- Estimated recovery time: X hours

### Step 3: Initiate recovery
[Detailed, copy-pasteable commands for each scenario]

### Step 4: Validate recovery
- [ ] Application health check endpoint returns 200
- [ ] Smoke test suite passes against restored environment
- [ ] Data integrity check: [specific query or test]
- [ ] External integrations functional

### Step 5: Communicate restoration
- Notify stakeholders: system restored at [time]
- Log timeline of events for post-mortem
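Step 4 is easy to under-specify. A small runner that executes each validation check and counts failures keeps it honest; this is a sketch, and the two example probes (the health URL, the pg_isready host) are placeholders for your real checks:

```shell
# Post-recovery validation runner. The example probes are placeholders.
failures=0

run_check() {   # usage: run_check <name> <command...>
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    failures=$((failures + 1))
  fi
}

# Replace these with real probes for your service:
run_check "health endpoint returns 200" curl -fsS "https://myapp.example.com/healthz"
run_check "database reachable"          pg_isready -h db.myapp.internal -t 5

echo "${failures} smoke check(s) failed"
```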

Database failover

# AWS RDS — promote read replica to standalone primary
aws rds promote-read-replica \
  --db-instance-identifier myapp-prod-replica \
  --backup-retention-period 7

# Update application's DB connection string to point to promoted replica
# (This step should be automated via DNS failover or AWS Route 53 health check)
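The automation mentioned in the comment can be as simple as a Route 53 record change that repoints the application’s database hostname at the promoted replica. A hedged sketch; the hosted zone ID, record name, and replica endpoint are all placeholders:

```shell
# Repoint the DB CNAME at the promoted replica (all identifiers are placeholders).
cat > /tmp/failover-change.json <<'EOF'
{
  "Comment": "DR failover: point db CNAME at promoted replica",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "db.myapp.internal.",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "myapp-prod-replica.abc123xyz.eu-west-1.rds.amazonaws.com" }]
    }
  }]
}
EOF

# Requires AWS credentials and a real hosted zone ID:
# aws route53 change-resource-record-sets \
#   --hosted-zone-id Z0000000EXAMPLE \
#   --change-batch file:///tmp/failover-change.json
```

A 60-second TTL keeps clients from caching the old endpoint for long after failover.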

Infrastructure as code accelerates recovery

When all infrastructure is defined in Terraform, provisioning a fresh environment in a different region reduces to a few commands:

# In a disaster scenario — provision fresh environment in target region
cd infrastructure/
terraform workspace new dr-recovery
terraform apply \
  -var="region=eu-west-1" \
  -var="environment=dr" \
  -var="db_snapshot_id=rds:myapp-prod-2024-01-15-00-00"

Without IaC, manual re-provisioning can take days.

DR testing schedule

| Activity | Frequency | What to validate |
| --- | --- | --- |
| Backup restore drill | Monthly | Backup is valid, restore procedure works |
| Runbook review | Quarterly | Steps are accurate, contacts are current |
| Full DR exercise | Annually (minimum) | Full recovery in a test environment |
| Game day (chaos event) | Annually | Team can execute under simulated stress |
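The monthly restore drill is easier to keep up when it is scripted end to end. A sketch assuming AWS RDS; the instance names are illustrative and the data-integrity step is left as a placeholder:

```shell
# Monthly backup-restore drill (sketch; instance names are illustrative).
SOURCE_DB="myapp-prod"
DRILL_DB="${SOURCE_DB}-restore-drill-$(date +%Y%m%d)"

# Requires AWS credentials; commands shown for reference:
# snapshot=$(aws rds describe-db-snapshots \
#   --db-instance-identifier "$SOURCE_DB" \
#   --query 'max_by(DBSnapshots, &SnapshotCreateTime).DBSnapshotIdentifier' \
#   --output text)
# aws rds restore-db-instance-from-db-snapshot \
#   --db-instance-identifier "$DRILL_DB" \
#   --db-snapshot-identifier "$snapshot"
# aws rds wait db-instance-available --db-instance-identifier "$DRILL_DB"
# ...run data-integrity checks against "$DRILL_DB" here, then tear it down:
# aws rds delete-db-instance --db-instance-identifier "$DRILL_DB" --skip-final-snapshot

echo "Drill instance: ${DRILL_DB}"
```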

During a DR exercise, execute the actual recovery procedures in a non-production environment and time the recovery against the RTO. Treat anything that goes wrong as a finding, not a failure, and update the runbook accordingly.
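Timing the exercise against the RTO is worth automating too. A minimal sketch, with a made-up two-hour RTO and a sleep standing in for the real recovery steps:

```shell
# Time a recovery drill against the RTO (the 2-hour target is illustrative).
RTO_SECONDS=$((2 * 3600))

start=$(date +%s)
sleep 1   # stand-in for the actual recovery procedure
end=$(date +%s)

elapsed=$((end - start))
if [ "$elapsed" -le "$RTO_SECONDS" ]; then
  echo "Within RTO: recovered in ${elapsed}s (target ${RTO_SECONDS}s)"
else
  echo "RTO MISSED: recovery took ${elapsed}s (target ${RTO_SECONDS}s)"
fi
```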

Review checklist

  • RTO and RPO are formally documented and agreed with the business
  • A written runbook exists for each system tier
  • The most recent backup has been restored successfully (tested, not assumed)
  • Runbook has been executed by the team at least once in the last 12 months
  • Contacts in the runbook are current
  • Infrastructure can be re-provisioned from IaC without manual steps