Disaster Recovery Planning
Status: Complete
Category: Infrastructure
Default enforcement: Soft
Author: PushBackLog team
Tags
- Topic: infrastructure, reliability, operations
- Skillset: devops, engineering-management
- Technology: generic
- Stage: planning, operations
Summary
Disaster recovery (DR) is the capability to restore a system to an operational state after a significant failure event: a data centre outage, data corruption, a ransomware attack, accidental deletion, or infrastructure failure at scale. DR planning defines the Recovery Time Objective (RTO), the maximum tolerable downtime, and the Recovery Point Objective (RPO), the maximum acceptable data loss measured in time. Without a tested DR plan, the first time a team discovers its recovery capabilities is during a real disaster, which is the worst possible time.
Rationale
Untested DR plans are not DR plans
Many organisations have documented DR procedures that have never been executed. The documentation was written when the system was designed and has diverged from reality over time: scripts fail because dependencies have changed, backups exist but restore procedures are broken, and the team members named in the runbook have changed roles. A DR plan that has not been practised provides only a false sense of security.
RTO and RPO drive architecture decisions
The business needs to define how much downtime and how much data loss are acceptable for each system. These are business questions, not engineering questions, but engineering must quantify the cost of meeting each target. An RTO of 4 hours and an RPO of 24 hours is achievable with simple backup-and-restore procedures; an RTO of 15 minutes and an RPO of zero requires an active-active multi-region architecture. Designing to the wrong target either wastes money or fails the business.
Guidance
Core concepts
| Concept | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | "4 hours": the system must be restored within 4 hours of incident declaration |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | "1 hour": no more than 1 hour of transactions may be lost |
| MTTR (Mean Time to Recover) | Average time to recover from a failure | Historical measure of actual recovery performance |
| DR Tier | Criticality classification of the system | Tier 1 = mission-critical; Tier 3 = low-priority internal tools |
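The RPO concept above can be checked mechanically as part of routine monitoring. A minimal sketch in bash, assuming backups land as files in a single directory; the directory path, file layout, and GNU `stat` are assumptions for illustration, not details from this guidance:

```shell
#!/usr/bin/env bash
# Hedged sketch: fail if the newest backup is older than the RPO allows.
# The directory layout and GNU stat (-c %Y) are assumptions.
check_rpo() {
  local backup_dir="$1" rpo_seconds="$2"
  local newest age
  # Newest file in the backup directory (simple layout assumed)
  newest=$(ls -t "$backup_dir" | head -n 1)
  if [ -z "$newest" ]; then
    echo "FAIL: no backups found in $backup_dir"
    return 1
  fi
  # Age of the newest backup in seconds
  age=$(( $(date +%s) - $(stat -c %Y "$backup_dir/$newest") ))
  if [ "$age" -le "$rpo_seconds" ]; then
    echo "OK: newest backup is ${age}s old, within the ${rpo_seconds}s RPO"
  else
    echo "FAIL: newest backup is ${age}s old, exceeds the ${rpo_seconds}s RPO"
    return 1
  fi
}
```

Wired into an alerting cron job, a check like this turns the RPO from a document statement into something that pages someone when backups fall behind.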
DR strategies by RTO/RPO target
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup and restore | Hours to days | Hours to days | Low | Restore from backup to new infrastructure on failure |
| Pilot light | < 1 hour | Minutes | Medium | Minimal replica running; scale up on failover |
| Warm standby | Minutes | Seconds to minutes | Medium-high | Full but scaled-down replica always running |
| Active-active | Seconds | Near-zero | High | Multiple regions all serving traffic |
DR runbook structure
A DR runbook for each system should include:
```markdown
# [Service Name] Disaster Recovery Runbook

## RTO: 2 hours | RPO: 4 hours

## Failure Scenarios Covered
- [ ] Database failure (primary)
- [ ] Application server failure
- [ ] Region-wide outage
- [ ] Data corruption / accidental deletion

## Recovery Procedure

### Step 1: Declare a disaster
- Incident commander: [name or role]
- Notify: [stakeholder list]
- Communication channel: #incidents in Slack

### Step 2: Assess the failure
- Determine scope: single service, full region, data corruption?
- Check AWS Health Dashboard / provider status page
- Estimated recovery time: X hours

### Step 3: Initiate recovery
[Detailed, copy-pasteable commands for each scenario]

### Step 4: Validate recovery
- [ ] Application health check endpoint returns 200
- [ ] Smoke test suite passes against restored environment
- [ ] Data integrity check: [specific query or test]
- [ ] External integrations functional

### Step 5: Communicate restoration
- Notify stakeholders: system restored at [time]
- Log timeline of events for post-mortem
```
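The first validation check in Step 4 can be scripted so it runs identically in every drill. A minimal sketch; the health-check URL is whatever your service exposes, supplied by the caller:

```shell
#!/usr/bin/env bash
# Hedged sketch of the Step 4 health check; the URL is a placeholder
# supplied by the caller, not an endpoint defined in this guidance.
validate_health() {
  local url="$1" code
  # -s silent, -o discard body, -w print only the HTTP status code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$code" = "200" ]; then
    echo "PASS: health check returned 200"
  else
    echo "FAIL: health check returned $code"
    return 1
  fi
}
```

The same pattern extends to the other Step 4 items: each check prints PASS or FAIL and returns a non-zero exit code on failure, so the whole validation can run as one script with a clear overall verdict.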
Database failover
```bash
# AWS RDS: promote the read replica to a standalone primary
aws rds promote-read-replica \
  --db-instance-identifier myapp-prod-replica \
  --backup-retention-period 7

# Update the application's DB connection string to point to the promoted
# replica (this step should be automated via DNS failover or an AWS
# Route 53 health check)
```
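The DNS cutover itself can be scripted with a Route 53 change batch. A sketch only: the hosted zone ID, record name, and replica endpoint below are illustrative placeholders, not values tied to this runbook:

```shell
#!/usr/bin/env bash
# Hedged sketch: automate the DNS cutover with Route 53. The zone ID,
# record name, and replica endpoint are placeholders.
cat > failover.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "db.myapp.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [
        {"Value": "myapp-prod-replica.abc123.eu-west-1.rds.amazonaws.com"}
      ]
    }
  }]
}
EOF
# Apply the change (requires AWS credentials; shown for illustration only):
# aws route53 change-resource-record-sets \
#   --hosted-zone-id Z123EXAMPLE --change-batch file://failover.json
```

A low TTL (60 seconds here) keeps client caches short so the cutover propagates quickly; setting it ahead of time is part of DR preparation, not the recovery itself.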
Infrastructure as code accelerates recovery
When all infrastructure is defined in Terraform, provisioning a new environment in a different region takes only a few commands:
```bash
# In a disaster scenario: provision a fresh environment in the target region
cd infrastructure/
terraform workspace new dr-recovery
terraform apply \
  -var="region=eu-west-1" \
  -var="environment=dr" \
  -var="db_snapshot_id=rds:myapp-prod-2024-01-15-00-00"
```
Without IaC, manual re-provisioning can take days.
DR testing schedule
| Activity | Frequency | What to validate |
|---|---|---|
| Backup restore drill | Monthly | Backup is valid, restore procedure works |
| Runbook review | Quarterly | Steps are accurate, contacts are current |
| Full DR exercise | Annually (min) | Full recovery in a test environment |
| Game day (chaos event) | Annually | Team can execute under simulated stress |
During a DR exercise, attempt the actual recovery procedures in a non-production environment and time the recovery against the RTO. Treat broken steps as findings, not failures, and update the runbook accordingly.
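Timing the drill against the RTO is worth standardising so results are comparable across exercises. A small wrapper sketch; the recovery command and RTO value are whatever your runbook specifies:

```shell
#!/usr/bin/env bash
# Hedged sketch: run a recovery drill command and compare elapsed time
# against the RTO. The command and RTO value are placeholders.
time_drill() {
  local rto_seconds="$1"; shift
  local start end elapsed
  start=$(date +%s)
  "$@"                      # run the actual recovery procedure
  end=$(date +%s)
  elapsed=$(( end - start ))
  if [ "$elapsed" -le "$rto_seconds" ]; then
    echo "PASS: recovered in ${elapsed}s, within the ${rto_seconds}s RTO"
  else
    echo "FAIL: recovery took ${elapsed}s, RTO is ${rto_seconds}s"
    return 1
  fi
}
```

For example, `time_drill 7200 ./restore-from-backup.sh` (a hypothetical restore script) records whether the drill met a 2-hour RTO; keeping these results over time shows whether recovery is getting faster or slower.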
Review checklist
- RTO and RPO are formally documented and agreed with the business
- A written runbook exists for each system tier
- The most recent backup has been restored successfully (tested, not assumed)
- The runbook has been executed by the team at least once in the last 12 months
- Contacts in the runbook are current
- Infrastructure can be re-provisioned from IaC without manual steps
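The "tested, not assumed" checklist item can be partially automated with a cheap integrity gate that runs before the full restore drill. A minimal sketch, assuming gzip-compressed dump files; a real drill should still restore into an actual database:

```shell
#!/usr/bin/env bash
# Hedged sketch: cheap integrity check on a compressed backup before the
# full restore drill. gzip -t verifies the archive without unpacking it.
verify_backup() {
  local backup="$1"
  if gzip -t "$backup" 2>/dev/null; then
    echo "OK: $backup passes integrity check"
  else
    echo "FAIL: $backup is corrupt or unreadable"
    return 1
  fi
}
```

This catches truncated or corrupted archives early, but it proves only that the file is readable; the monthly restore drill remains the test that the backup actually contains a restorable database.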