PushBackLog

Backup Strategies

Status: Complete
Category: Infrastructure
Default enforcement: Soft
Author: PushBackLog team


Tags

  • Topic: infrastructure, reliability, data
  • Skillset: devops, engineering-management
  • Technology: generic
  • Stage: operations, planning

Summary

A backup strategy defines how, how often, where, and for how long copies of critical data are retained. The goal is to ensure that data can be recovered with high confidence, losing no more than the organisation's defined Recovery Point Objective (RPO) allows. The most dangerous misconception about backups is that having a backup equals having recovery capability; the two are different things. A backup strategy is incomplete without a tested, documented restore procedure: an unverified backup is not a backup.


Rationale

An untested backup is not a backup

It is common for organisations to have automated backup processes that have run for years without ever being verified. The backup job succeeds nightly but produces corrupt files, uses the wrong format, or backs up an empty directory. The first time the restore is needed is then during a crisis, the moment of maximum pressure and minimum time. Restore verification must be a regular, scheduled activity, not something that is assumed to work.

The 3-2-1 rule provides resilience by construction

The 3-2-1 rule is the simplest mental model for a resilient backup strategy: 3 copies of data, 2 different types of media, 1 off-site copy. The rationale is that any single failure mode — hardware failure, data centre loss, accidental deletion — cannot destroy all three copies simultaneously. A database backup written only to an S3 bucket in the same region as the database violates this principle.


Guidance

The 3-2-1 rule

| Factor | What it means | Example |
| --- | --- | --- |
| 3 copies | Original + two backups | Production DB + S3 daily snapshot + S3 Glacier archive |
| 2 media types | Two different storage types/locations | AWS region 1 + AWS region 2 (or on-premises tape) |
| 1 off-site copy | At least one copy geographically separate | Cross-region S3 replication |
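
The rule can also be checked mechanically against an inventory of copies. The following is a minimal sketch, assuming a hypothetical `DataCopy` record and illustrative media and location labels; none of these names come from a real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # illustrative labels, e.g. "ebs", "s3", "tape"
    location: str   # e.g. an AWS region or a site name
    is_primary: bool = False


def check_321(copies: list[DataCopy]) -> list[str]:
    """Return a list of 3-2-1 violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    if len({c.media for c in copies}) < 2:
        problems.append("all copies share one media type, need 2")
    # A backup counts as off-site only if it lives somewhere the primary does not.
    primary_locations = {c.location for c in copies if c.is_primary}
    off_site = [
        c for c in copies
        if not c.is_primary and c.location not in primary_locations
    ]
    if not off_site:
        problems.append("no backup copy outside the primary location")
    return problems


copies = [
    DataCopy("production-db", media="ebs", location="us-east-1", is_primary=True),
    DataCopy("daily-snapshot", media="s3", location="us-east-1"),
    DataCopy("dr-replica", media="s3", location="eu-west-1"),
]
print(check_321(copies))      # []
print(check_321(copies[:2]))  # reports the copy count and the missing off-site copy
```

Dropping the cross-region copy reproduces the same-region failure described in the rationale: the inventory no longer has three copies and nothing survives the loss of the primary region.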

Backup types

| Type | Description | RPO potential | Storage impact |
| --- | --- | --- | --- |
| Full backup | Complete copy of all data | Up to 24 hours (if daily) | High |
| Incremental | Only changes since last backup | Up to 1 hour (if hourly) | Low |
| Differential | Changes since last full backup | Hours | Medium |
| Continuous / point-in-time | Transaction log replay to any moment | Seconds/minutes | Varies |
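
The storage savings of incremental backups come at a cost during restore, because every incremental since the last full backup has to be replayed in order. The sketch below illustrates that trade-off; the `Backup` record and `restore_chain` helper are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    label: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: list[Backup]) -> list[Backup]:
    """Return the backups, oldest first, needed to restore to the newest
    backup in history. Raises ValueError if history has no full backup."""
    # Every restore starts from the most recent full backup.
    last_full = max(i for i, b in enumerate(history) if b.kind == "full")
    chain = [history[last_full]]
    later = history[last_full + 1:]
    # A differential holds all changes since the full backup, so only the
    # newest one is needed, plus any incrementals taken after it.
    diff_positions = [i for i, b in enumerate(later) if b.kind == "differential"]
    if diff_positions:
        newest_diff = diff_positions[-1]
        chain.append(later[newest_diff])
        later = later[newest_diff + 1:]
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


incremental_week = [
    Backup("sun-full", "full"),
    Backup("mon-inc", "incremental"),
    Backup("tue-inc", "incremental"),
    Backup("wed-inc", "incremental"),
]
differential_week = [
    Backup("sun-full", "full"),
    Backup("mon-diff", "differential"),
    Backup("tue-diff", "differential"),
    Backup("wed-diff", "differential"),
]
print([b.label for b in restore_chain(incremental_week)])
# ['sun-full', 'mon-inc', 'tue-inc', 'wed-inc']
print([b.label for b in restore_chain(differential_week)])
# ['sun-full', 'wed-diff']
```

A longer chain means more files that must all be intact for the restore to succeed, which is one reason to schedule full backups regularly even when incrementals are cheap.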

AWS RDS automated backups

# Terraform: configure RDS backup window and retention
resource "aws_db_instance" "primary" {
  identifier     = "myapp-prod"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = "db.t3.medium"
  # Storage, credentials, and networking arguments omitted for brevity

  # Backup configuration
  backup_retention_period  = 30            # Days to retain automated backups (RDS maximum is 35)
  backup_window            = "03:00-04:00" # UTC; low-traffic window
  delete_automated_backups = false         # Keep automated backups if the instance is deleted
  deletion_protection      = true          # Prevent accidental deletion of the instance

  # Point-in-time recovery is enabled automatically when retention > 0
}

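Point-in-time recovery (PITR) allows restoration to any second within the retention window, up to the latest restorable time (for RDS, typically within the last five minutes). This is critical for recovering from data corruption or accidental deletion: the database can be restored to the moment just before the damage occurred instead of to the last daily snapshot.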
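
As an illustration, the request for a point-in-time restore could be assembled as follows. This is a sketch under stated assumptions: `pitr_request`, the one-minute safety margin, and the target identifier `myapp-prod-pitr` are hypothetical, while the dictionary keys are parameters of the boto3 RDS call `restore_db_instance_to_point_in_time`. RDS restores into a new instance, so the source database is left untouched.

```python
from datetime import datetime, timedelta, timezone


def pitr_request(source_id: str, target_id: str, incident_time: datetime,
                 safety_margin: timedelta = timedelta(minutes=1)) -> dict:
    """Build keyword arguments for a restore to just before incident_time."""
    if incident_time.tzinfo is None:
        raise ValueError("incident_time must be timezone-aware")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # a new instance is created
        "RestoreTime": (incident_time - safety_margin).astimezone(timezone.utc),
    }


incident = datetime(2024, 5, 14, 9, 30, tzinfo=timezone.utc)
params = pitr_request("myapp-prod", "myapp-prod-pitr", incident)
print(params["RestoreTime"].isoformat())  # 2024-05-14T09:29:00+00:00

# With AWS credentials configured, the request would be sent as:
#   boto3.client("rds").restore_db_instance_to_point_in_time(**params)
```

Keeping the request construction separate from the API call makes the restore time easy to review before anything is provisioned.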

Cross-region backup replication

# Replicate automated backups to another region for DR
resource "aws_db_instance_automated_backups_replication" "dr" {
  # This resource must be created in the destination region;
  # aws.dr is an assumed provider alias configured for that region
  provider = aws.dr

  source_db_instance_arn = aws_db_instance.primary.arn
  retention_period       = 7

  # Needed when the source instance is encrypted;
  # the KMS key must exist in the destination region
  kms_key_id = aws_kms_key.backup_dr.arn
}

Or schedule manual snapshot copies with a Lambda:

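# Lambda: copy latest RDS snapshot to DR region
import boto3

def handler(event, context):
    source = boto3.client('rds', region_name='us-east-1')
    dest   = boto3.client('rds', region_name='eu-west-1')

    # Get the latest completed automated snapshot
    # (snapshots still being created may not have a SnapshotCreateTime yet)
    snaps = source.describe_db_snapshots(
        DBInstanceIdentifier='myapp-prod',
        SnapshotType='automated',
    )['DBSnapshots']
    available = [s for s in snaps if s['Status'] == 'available']
    if not available:
        raise RuntimeError('No available automated snapshots for myapp-prod')
    latest = max(available, key=lambda s: s['SnapshotCreateTime'])

    # Automated snapshot identifiers start with "rds:", and a colon is not
    # allowed in a target identifier, so replace it with a hyphen
    target_id = f"dr-copy-{latest['DBSnapshotIdentifier'].replace(':', '-')}"

    # Copy to DR region; an encrypted snapshot also needs KmsKeyId set to a
    # key in the destination region
    dest.copy_db_snapshot(
        SourceDBSnapshotIdentifier=latest['DBSnapshotArn'],
        TargetDBSnapshotIdentifier=target_id,
        SourceRegion='us-east-1',
    )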

S3 versioning and cross-region replication

For object storage:

resource "aws_s3_bucket_versioning" "primary" {
  bucket = aws_s3_bucket.primary.id

  versioning_configuration {
    status = "Enabled"  # Retains all versions; enables recovery from accidental deletion
  }
}

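resource "aws_s3_bucket_replication_configuration" "dr" {
  # Replication requires versioning on both the source and destination buckets
  depends_on = [aws_s3_bucket_versioning.primary]

  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn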

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.dr_region.arn
      storage_class = "STANDARD_IA"  # Cheaper for DR data rarely accessed
    }
  }
}

Retention policy

Define retention tiers based on compliance and operational requirements:

| Tier | Retention | Storage class |
| --- | --- | --- |
| Hourly backups | 48 hours | S3 Standard |
| Daily backups | 30 days | S3 Standard-IA |
| Weekly backups | 12 weeks | S3 Glacier Instant Retrieval |
| Monthly backups | 1 year | S3 Glacier Flexible Retrieval |
| Yearly backups | 7 years (compliance) | S3 Glacier Deep Archive |

Define lifecycle policies in S3 / AWS Backup to automatically transition and expire backups.
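
Where backups are managed by a custom job instead of a lifecycle policy, the retention tiers can be expressed directly in code. The following is a minimal sketch, assuming the retention periods listed above, treating a year as 365 days, and using a hypothetical `is_expired` helper.

```python
from datetime import datetime, timedelta, timezone

# Retention per tier; a year is approximated as 365 days.
RETENTION = {
    "hourly": timedelta(hours=48),
    "daily": timedelta(days=30),
    "weekly": timedelta(weeks=12),
    "monthly": timedelta(days=365),
    "yearly": timedelta(days=7 * 365),
}


def is_expired(tier: str, created_at: datetime, now: datetime) -> bool:
    """True when a backup of the given tier has outlived its retention period."""
    return now - created_at > RETENTION[tier]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_expired("hourly", now - timedelta(hours=49), now))  # True
print(is_expired("daily", now - timedelta(days=29), now))    # False
```

Passing `now` explicitly keeps the decision deterministic, which makes the expiry logic straightforward to test before it is allowed to delete anything.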

Backup verification schedule

| Test | Frequency | Procedure |
| --- | --- | --- |
| Automated restore test | Weekly | Spin up a test DB from the latest snapshot; run integrity queries |
| Full recovery drill | Monthly | Restore to a separate environment; verify application starts and data is correct |
| Cross-region restore | Quarterly | Verify the DR region backup can be restored successfully |

An automated weekly restore test is the minimum bar:

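# CI/CD backup verification job (weekly cron)
VERIFY_ID="backup-verify-$(date +%Y%m%d)"
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier myapp-prod --snapshot-type automated \
  --query 'sort_by(DBSnapshots[?Status==`available`], &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$VERIFY_ID" \
  --db-snapshot-identifier "$LATEST_SNAPSHOT" \
  --db-instance-class db.t3.medium

# Wait until available, run integrity queries against the restored instance, then delete
aws rds wait db-instance-available --db-instance-identifier "$VERIFY_ID"
aws rds delete-db-instance --db-instance-identifier "$VERIFY_ID" --skip-final-snapshot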
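
The integrity queries themselves can be small. The sketch below shows one possible shape: each check is a query that returns a single number, compared against an expected minimum. The `run_integrity_checks` helper, the `orders` table, and the thresholds are all hypothetical, and an in-memory SQLite database stands in for the restored instance so the example runs anywhere; against PostgreSQL the same pattern would work with any DB-API connection.

```python
import sqlite3


def run_integrity_checks(conn, checks: dict[str, tuple[str, int]]) -> list[str]:
    """Run each check query and return the names of the checks whose first
    result column is missing or below the expected minimum."""
    failures = []
    for name, (sql, minimum) in checks.items():
        cur = conn.cursor()
        cur.execute(sql)
        (value,) = cur.fetchone()
        if value is None or value < minimum:
            failures.append(name)
    return failures


# Stand-in for the restored database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders (created_at) VALUES (?)",
    [("2024-05-01",), ("2024-05-02",)],
)

checks = {
    "orders_not_empty": ("SELECT COUNT(*) FROM orders", 1),
    "orders_minimum_volume": ("SELECT COUNT(*) FROM orders", 1000),
}
print(run_integrity_checks(conn, checks))  # ['orders_minimum_volume']
```

A minimum row count is a crude check, but it catches the failure mode where the backup job succeeds while capturing an empty or truncated data set.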

Review checklist

  • 3-2-1 rule satisfied: 3 copies, 2 media types, 1 off-site
  • Point-in-time recovery enabled for all production databases
  • Backup frequency meets RPO requirements; backup retention period meets or exceeds compliance requirements
  • Automated restore verification is scheduled and running
  • Cross-region backup replication is active for DR
  • Retention lifecycle policies are configured to manage storage cost
  • Access to backup data is restricted (least privilege — backups are highly sensitive)