Backup Strategies
Status: Complete
Category: Infrastructure
Default enforcement: Soft
Author: PushBackLog team
Tags
- Topic: infrastructure, reliability, data
- Skillset: devops, engineering-management
- Technology: generic
- Stage: operations, planning
Summary
A backup strategy defines how, how often, where, and for how long copies of critical data are retained. The goal is to ensure that, after a failure, data can be recovered with no more loss than the organisation's defined Recovery Point Objective (RPO) allows, and with high confidence. The most dangerous misconception about backups is that having a backup equals having recovery capability; the two are different things. A backup strategy is incomplete without a tested, documented restore procedure: an unverified backup is not a backup.
Rationale
An untested restore is not a backup
It is common for organisations to have automated backup processes running for years that have never been verified. The backup job succeeds nightly but produces corrupt files, uses the wrong format, or backs up an empty directory. The first time the restore is needed is during a crisis — the moment of maximum pressure and minimum time. Restore verification must be a regular, scheduled activity, not a paper assumption.
The 3-2-1 rule provides resilience by construction
The 3-2-1 rule is the simplest mental model for a resilient backup strategy: 3 copies of data, 2 different types of media, 1 off-site copy. The rationale is that any single failure mode — hardware failure, data centre loss, accidental deletion — cannot destroy all three copies simultaneously. A database backup written only to an S3 bucket in the same region as the database violates this principle.
Guidance
The 3-2-1 rule
| Factor | What it means | Example |
|---|---|---|
| 3 copies | Original + two backups | Production DB + S3 daily snapshot + S3 Glacier archive |
| 2 media types | Two different storage types, so a single failure mode cannot affect both | Database volume (block storage) + S3 object storage (or on-premises tape) |
| 1 off-site copy | At least one copy geographically separate | Cross-region S3 replication |
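The rule in the table above can be checked mechanically against an inventory of copies. The following is a minimal sketch, not a definitive implementation: the `Copy` type, the `check_3_2_1` function, and the medium and site labels are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str  # storage type, e.g. "ebs", "s3", "tape" (illustrative labels)
    site: str    # an AWS region or a physical location


def check_3_2_1(copies: list[Copy], primary_site: str) -> list[str]:
    """Return a list of 3-2-1 violations; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append(f"need 3 copies, found {len(copies)}")
    media = {c.medium for c in copies}
    if len(media) < 2:
        problems.append(f"need 2 media types, found {len(media)}")
    # "Off-site" here means any copy stored somewhere other than the primary site
    if not any(c.site != primary_site for c in copies):
        problems.append("need 1 off-site copy, found none")
    return problems
```

Calling `check_3_2_1` with a production database and a same-region S3 snapshot reports the missing third copy and the missing off-site copy, which is the situation the rationale section warns about.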
Backup types
| Type | Description | RPO potential | Storage impact |
|---|---|---|---|
| Full backup | Complete copy of all data | Up to 24 hours (if daily) | High |
| Incremental | Only changes since the last backup of any type | Up to 1 hour (if hourly) | Low |
| Differential | All changes since the last full backup | Up to the backup interval (typically hours) | Medium |
| Continuous / point-in-time | Transaction log replay to any moment | Seconds/minutes | Varies |
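The difference between incremental and differential backups shows up most clearly at restore time. The sketch below works out which backups must be applied, in order, to reach a target time. It assumes the definitions in the table above (a differential covers everything since the last full backup, an incremental covers changes since the previous backup of any type); the `Backup` type and `restore_chain` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str  # "full", "incremental", or "differential"
    taken_at: datetime


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Return the backups to apply, oldest first, to restore the latest state at or before target."""
    usable = sorted((b for b in backups if b.taken_at <= target), key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    later = [b for b in usable if b.taken_at > base.taken_at]
    chain = [base]
    # A differential holds every change since the full backup, so only the newest one is needed
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        chain.append(diffs[-1])
    # Each incremental holds changes since the previous backup, so every later one is needed
    start = chain[-1].taken_at
    chain.extend(b for b in later if b.kind == "incremental" and b.taken_at > start)
    return chain
```

A longer chain means more files that must all be intact for the restore to succeed, which is the trade-off behind the lower storage cost of incrementals.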
AWS RDS automated backups
# Terraform: configure RDS backup window and retention
resource "aws_db_instance" "primary" {
  identifier     = "myapp-prod"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = "db.t3.medium"

  # Backup configuration
  backup_retention_period  = 30            # Days to retain automated backups
  backup_window            = "03:00-04:00" # UTC; low-traffic window
  delete_automated_backups = false
  deletion_protection      = true          # Prevent accidental deletion
  # Point-in-time recovery enabled automatically when retention > 0
}
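Point-in-time recovery (PITR) allows restoration to any second within the retention window, up to the latest restorable time (typically within the last five minutes). This is critical for recovering from data corruption, especially corruption introduced gradually: you can restore to a moment just before the damage began rather than to the most recent snapshot, which may already contain it.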
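A point-in-time restore can be scripted with boto3. The sketch below is illustrative rather than definitive: `build_pitr_request` and `restore_before_corruption` are hypothetical helper names, the instance identifiers are placeholders, and `rds_client` is assumed to be a `boto3.client("rds")` created by the caller. The restore always creates a new instance; it does not overwrite the source.

```python
from datetime import datetime


def build_pitr_request(source_id: str, target_id: str,
                       restore_time: datetime, latest_restorable: datetime) -> dict:
    """Build keyword arguments for the RDS restore_db_instance_to_point_in_time call."""
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware (UTC recommended)")
    if restore_time > latest_restorable:
        raise ValueError(
            f"restore_time is later than the latest restorable time "
            f"({latest_restorable.isoformat()})"
        )
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


def restore_before_corruption(rds_client, source_id: str, target_id: str,
                              restore_time: datetime) -> dict:
    """Restore source_id into a new instance as of restore_time."""
    # LatestRestorableTime is reported by RDS for instances with automated backups enabled
    described = rds_client.describe_db_instances(DBInstanceIdentifier=source_id)
    latest = described["DBInstances"][0]["LatestRestorableTime"]
    request = build_pitr_request(source_id, target_id, restore_time, latest)
    return rds_client.restore_db_instance_to_point_in_time(**request)
```

Validating the requested time before calling the API gives a clearer error during an incident than waiting for the service to reject the request.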
Cross-region backup replication
# Replicate automated backups to another region for DR
resource "aws_db_instance_automated_backups_replication" "dr" {
  # This resource must be created in the destination region.
  # "aws.dr" is an assumed provider alias configured for that region.
  provider = aws.dr

  source_db_instance_arn = aws_db_instance.primary.arn
  retention_period       = 7

  # Required for encrypted instances; the KMS key must exist in the destination region
  kms_key_id = aws_kms_key.backup_dr.arn
}
Or schedule manual snapshot copies with a Lambda:
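# Lambda: copy latest RDS snapshot to DR region
import boto3

SOURCE_REGION = 'us-east-1'
DR_REGION = 'eu-west-1'


def handler(event, context):
    source = boto3.client('rds', region_name=SOURCE_REGION)
    dest = boto3.client('rds', region_name=DR_REGION)

    # Get the latest completed automated snapshot
    snaps = source.describe_db_snapshots(
        DBInstanceIdentifier='myapp-prod',
        SnapshotType='automated',
    )['DBSnapshots']
    available = [s for s in snaps if s['Status'] == 'available']
    if not available:
        raise RuntimeError('No available automated snapshots for myapp-prod')
    latest = max(available, key=lambda s: s['SnapshotCreateTime'])

    # Automated snapshot names start with "rds:", and ":" is not allowed in a
    # target identifier, so replace it before building the copy name
    target_id = f"dr-copy-{latest['DBSnapshotIdentifier'].replace(':', '-')}"

    # Copy to DR region (the call is made against the destination region;
    # encrypted snapshots also need KmsKeyId set to a key in that region)
    dest.copy_db_snapshot(
        SourceDBSnapshotIdentifier=latest['DBSnapshotArn'],
        TargetDBSnapshotIdentifier=target_id,
        SourceRegion=SOURCE_REGION,
    )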
S3 versioning and cross-region replication
For object storage:
resource "aws_s3_bucket_versioning" "primary" {
  bucket = aws_s3_bucket.primary.id

  versioning_configuration {
    status = "Enabled" # Retains all versions; enables recovery from accidental deletion
  }
}

resource "aws_s3_bucket_replication_configuration" "dr" {
  # Replication requires versioning on the source bucket first
  # (the destination bucket must also have versioning enabled)
  depends_on = [aws_s3_bucket_versioning.primary]

  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.dr_region.arn
      storage_class = "STANDARD_IA" # Cheaper for DR data rarely accessed
    }
  }
}
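With versioning enabled, deleting an object only adds a delete marker on top of the existing versions, so recovery from an accidental deletion amounts to removing that marker. The sketch below illustrates this with the S3 `list_object_versions` and `delete_object` calls. The `undelete_object` helper is a hypothetical name, `s3_client` is assumed to be a `boto3.client("s3")` supplied by the caller, and pagination is ignored, so a key with a very long version history may need a paginator instead.

```python
def undelete_object(s3_client, bucket: str, key: str) -> bool:
    """Restore a deleted object by removing its current delete marker.

    Returns True if a marker was removed, False if the object was not deleted.
    """
    # Prefix matching can return other keys that start with `key`, so compare exactly below
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in response.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting the marker by version ID makes the previous version current again
            s3_client.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
            return True
    return False
```

Because removing a delete marker requires permission to delete specific object versions, this is also a reminder to restrict that permission: anyone who has it can permanently remove backup versions as well.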
Retention policy
Define retention tiers based on compliance and operational requirements:
| Tier | Retention | Storage class |
|---|---|---|
| Hourly backups | 48 hours | S3 Standard |
| Daily backups | 30 days | S3 Standard-IA |
| Weekly backups | 12 weeks | S3 Glacier Instant Retrieval |
| Monthly backups | 1 year | S3 Glacier Flexible Retrieval |
| Yearly backups | 7 years (compliance) | S3 Glacier Deep Archive |
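Define lifecycle policies in S3 / AWS Backup to automatically transition and expire backups. Check each storage class's minimum storage duration against the tier's retention: S3 Glacier Instant Retrieval, for example, bills for a minimum of 90 days, slightly longer than the 12-week weekly tier, so early deletion charges apply.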
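Where backups are managed by a custom job rather than by lifecycle rules, the tiers in the table can be applied as a selection over backup timestamps. The following is a minimal sketch under stated assumptions: `backups_to_keep` is a hypothetical function, every backup within 48 hours is kept, and for the longer tiers the newest backup in each day, ISO week, month, and year is promoted to that tier. A year is approximated as 365 days.

```python
from datetime import datetime, timedelta


def backups_to_keep(timestamps: list[datetime], now: datetime) -> set[datetime]:
    """Apply the tiered retention table and return the backups to retain."""
    tiers = [
        # (tier name, maximum age, function mapping a timestamp to its bucket)
        ("daily", timedelta(days=30), lambda t: t.date()),
        ("weekly", timedelta(weeks=12), lambda t: tuple(t.isocalendar())[:2]),
        ("monthly", timedelta(days=365), lambda t: (t.year, t.month)),
        ("yearly", timedelta(days=7 * 365), lambda t: t.year),
    ]
    keep: set[datetime] = set()
    newest_per_bucket: dict[tuple, datetime] = {}
    for ts in timestamps:
        age = now - ts
        if age <= timedelta(hours=48):
            keep.add(ts)  # hourly tier: keep every backup
        for name, max_age, bucket_of in tiers:
            if age <= max_age:
                bucket = (name, bucket_of(ts))
                # Keep only the newest backup in each day, week, month, or year
                if bucket not in newest_per_bucket or ts > newest_per_bucket[bucket]:
                    newest_per_bucket[bucket] = ts
    return keep | set(newest_per_bucket.values())
```

Anything not in the returned set is a candidate for deletion. Computing the keep set rather than the delete set makes the job fail safe: a bug that returns too little is easier to spot in review than one that silently deletes too much.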
Backup verification schedule
| Test | Frequency | Procedure |
|---|---|---|
| Automated restore test | Weekly | Spin up a test DB from the latest snapshot; run integrity queries |
| Full recovery drill | Monthly | Restore to a separate environment; verify application starts and data is correct |
| Cross-region restore | Quarterly | Verify the DR region backup can be restored successfully |
An automated weekly restore test is the minimum bar:
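# CI/CD backup verification job (weekly cron)
VERIFY_ID="backup-verify-$(date +%Y%m%d)"
# Most recent completed automated snapshot of the production instance
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier myapp-prod --snapshot-type automated \
  --query "sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime)[-1].DBSnapshotIdentifier" --output text)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$VERIFY_ID" \
  --db-snapshot-identifier "$LATEST_SNAPSHOT" \
  --db-instance-class db.t3.medium
aws rds wait db-instance-available --db-instance-identifier "$VERIFY_ID"
# Run integrity queries, then delete the instance (aws rds delete-db-instance --skip-final-snapshot)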
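The "run integrity queries" step needs a pass/fail rule, otherwise the job can succeed against an empty or stale restore. The sketch below evaluates per-table results gathered from the restored instance, for example from a query such as `SELECT count(*), max(updated_at) FROM orders`. The `TableCheck` type, the `verify_restore` function, the table names, and the `updated_at` column are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class TableCheck:
    table: str
    row_count: int
    newest_row_at: Optional[datetime]  # None when the table is empty


def verify_restore(checks: list[TableCheck], snapshot_time: datetime,
                   max_staleness: timedelta,
                   min_rows: Optional[dict[str, int]] = None) -> list[str]:
    """Return a list of failures; an empty list means the restore passed."""
    min_rows = min_rows or {}
    failures = []
    for check in checks:
        required = min_rows.get(check.table, 1)
        if check.row_count < required:
            failures.append(
                f"{check.table}: {check.row_count} rows, expected at least {required}"
            )
        elif check.newest_row_at is None or snapshot_time - check.newest_row_at > max_staleness:
            # Data that stops well before the snapshot time suggests a stale or partial backup
            failures.append(
                f"{check.table}: newest row is more than {max_staleness} older than the snapshot"
            )
    return failures
```

The verification job should exit non-zero and alert when the returned list is not empty, so that a failed restore test is treated like any other failed build.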
Review checklist
- 3-2-1 rule satisfied: 3 copies, 2 media types, 1 off-site
- Point-in-time recovery enabled for all production databases
- Backup frequency meets RPO requirements, and the backup retention period meets or exceeds compliance requirements
- Automated restore verification is scheduled and running
- Cross-region backup replication is active for DR
- Retention lifecycle policies are configured to manage storage cost
- Access to backup data is restricted (least privilege — backups are highly sensitive)