CloudTadaInsights

Lesson 16: Backup and Point-in-Time Recovery (PITR)

Objectives

After this lesson, you will be able to:

  • Set up WAL archiving
  • Perform backups with pg_basebackup
  • Configure continuous archiving
  • Restore the database to a specific point in time
  • Automate backup strategies
  • Implement a disaster recovery plan

1. Backup Strategies Overview

1.1. Types of backups

A. Logical Backup

BASH
# pg_dump / pg_dumpall
pg_dump -h localhost -U postgres mydb > mydb.sql
pg_dumpall -h localhost -U postgres > cluster.sql

# Pros:
# ✅ Easy to restore specific tables
# ✅ Portable across PostgreSQL versions
# ✅ Human-readable (text)

# Cons:
# ❌ Slow for large databases
# ❌ Not suitable for PITR
# ❌ Slow to restore (data reloaded, indexes rebuilt)

B. Physical Backup

BASH
# pg_basebackup / File system snapshot
pg_basebackup -D /backup/base -Ft -z -P

# Pros:
# ✅ Fast backup and restore
# ✅ Enables PITR with WAL archiving
# ✅ Consistent snapshot

# Cons:
# ❌ Cannot restore individual tables
# ❌ Must match PostgreSQL version
# ❌ Larger backup size

C. Continuous Archiving (WAL Archiving)

TEXT
WAL files archived continuously
+ Base backup
= Point-in-Time Recovery capability

# Pros:
# ✅ Can restore to ANY point in time
# ✅ Minimal data loss (RPO: seconds)
# ✅ Online backup (no downtime)

# Cons:
# ❌ More complex setup
# ❌ Requires storage for WAL archives
# ❌ More moving parts

1.2. RTO and RPO

RTO (Recovery Time Objective) = How long to restore service?

TEXT
pg_basebackup only: ~30min - 2hrs (depends on size)
PITR: Same + WAL replay time (~10-30min)
Streaming replication: ~30-60s (failover)

RPO (Recovery Point Objective) = How much data loss is acceptable?

TEXT
Daily backup: Up to 24 hours data loss ❌
WAL archiving: Up to last archived segment (~16MB)
Synchronous replication: Zero data loss ✅
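
The RPO figures above follow from simple arithmetic: with WAL archiving, a segment reaches the archive when it fills (16 MB by default) or when archive_timeout fires, whichever comes first. A minimal shell sketch of that bound (pure arithmetic, no PostgreSQL needed; the write rates are assumed examples):

```shell
# Worst-case RPO under WAL archiving: min(time to fill a segment, archive_timeout).
# Assumes default 16 MB segments and archive_timeout = 300 s; rates are examples.
seg_mb=16
timeout_s=300

rpo_seconds() {                 # $1 = sustained WAL write rate in MB/s (integer)
  local rate=$1
  if [ "$rate" -le 0 ]; then echo "$timeout_s"; return; fi    # idle: timeout wins
  local fill=$(( (seg_mb + rate - 1) / rate ))                # ceil(16 / rate)
  if [ "$fill" -lt "$timeout_s" ]; then echo "$fill"; else echo "$timeout_s"; fi
}

echo "busy (2 MB/s of WAL): worst-case RPO $(rpo_seconds 2) s"   # segment fills fast
echo "idle (no writes):     worst-case RPO $(rpo_seconds 0) s"   # archive_timeout
```

On a busy server the segment fills in seconds, so the effective RPO is far below the timeout; on an idle one, archive_timeout is the bound.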

1.3. Backup strategy decision matrix

Requirement                Solution
Zero data loss             Synchronous replication + PITR
Fast recovery (<1hr)       Streaming replication
PITR capability            WAL archiving + pg_basebackup
Long-term retention        Periodic pg_basebackup
Disaster recovery          Off-site backups + PITR

2. WAL Archiving Setup

2.1. Understanding WAL archiving

WAL (Write-Ahead Log) = Transaction log files

TEXT
Normal operation:
  Transaction → WAL file → Data files
  
WAL archiving:
  Transaction → WAL file → Data files
                     ↓
                  Archive location (safe storage)

WAL segments:

BASH
# Default: 16MB per segment
# Location: $PGDATA/pg_wal/

ls -lh /var/lib/postgresql/18/data/pg_wal/
# 000000010000000000000001  (16MB)
# 000000010000000000000002  (16MB)
# 000000010000000000000003  (16MB)
# ...
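
The 24-character names above encode three 8-digit hex fields: timeline, "log" number, and segment number. A small bash sketch (no server needed) that unpacks one, assuming the default 16 MB segment size:

```shell
# Decode a WAL segment filename: 8 hex digits each for timeline, log, segment.
# With 16 MB segments, the segment's start LSN is log*4GB + seg*16MB.
decode_wal() {                         # $1 = 24-char WAL filename
  local name=$1
  local tli=$((16#${name:0:8}))
  local log=$((16#${name:8:8}))
  local seg=$((16#${name:16:8}))
  printf 'timeline=%d log=%d seg=%d start_lsn=%X/%X\n' \
    "$tli" "$log" "$seg" "$log" $((seg * 16 * 1024 * 1024))
}

decode_wal 000000010000000000000003
# → timeline=1 log=0 seg=3 start_lsn=0/3000000
```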

2.2. Configure WAL archiving

PostgreSQL configuration

BASH
# Edit postgresql.conf or use ALTER SYSTEM
sudo -u postgres psql -c "
ALTER SYSTEM SET wal_level = 'replica';  -- or 'logical'
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';
ALTER SYSTEM SET archive_timeout = 300;  -- Force archive every 5 min
"

# Restart PostgreSQL
sudo systemctl restart postgresql

Parameters explained:

TEXT
wal_level: 'replica'
  # 'minimal': No archiving possible
  # 'replica': Required for archiving and replication
  # 'logical': For logical replication

archive_mode: 'on'
  # Enable archiving

archive_command: 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
  # %f = WAL filename (e.g., 000000010000000000000001)
  # %p = WAL full path (e.g., /var/lib/postgresql/18/data/pg_wal/000000010000000000000001)
  # test ! -f = Don't overwrite existing files
  # cp = Copy to archive location

archive_timeout: 300
  # Force WAL switch every 5 minutes (even if not full)
  # Ensures RPO <= 5 minutes
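
The `test ! -f … && cp` idiom above can be exercised on throwaway files to see why the guard matters: archiving must never overwrite a segment that is already in the archive. A self-contained sketch:

```shell
# Simulate archive_command semantics on temp files: the guard refuses to
# overwrite an already-archived segment, so the first copy always survives.
pg_wal=$(mktemp -d)      # stands in for $PGDATA/pg_wal
archive=$(mktemp -d)     # stands in for /mnt/wal_archive

echo "wal-contents-v1" > "$pg_wal/000000010000000000000001"

archive_one() {          # $1 = segment name (plays the %f/%p roles)
  test ! -f "$archive/$1" && cp "$pg_wal/$1" "$archive/$1"
}

archive_one 000000010000000000000001 && echo "first copy: archived"
echo "wal-contents-v2" > "$pg_wal/000000010000000000000001"
archive_one 000000010000000000000001 || echo "second copy: refused (already archived)"
cat "$archive/000000010000000000000001"    # still v1
```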

Create archive directory

BASH
# On primary server
sudo mkdir -p /mnt/wal_archive
sudo chown postgres:postgres /mnt/wal_archive
sudo chmod 700 /mnt/wal_archive

# Verify archiving working
sudo -u postgres psql -c "SELECT pg_switch_wal();"
# Forces current WAL file to be archived

# Check archive
ls -lh /mnt/wal_archive/
# Should see WAL files appearing

2.3. Advanced archive commands

A. Archive to remote server (rsync)

INI
# archive_command using rsync
archive_command = 'rsync -a %p backup-server:/mnt/wal_archive/%f'

B. Archive to S3 (wal-g)

BASH
# Install wal-g
wget https://github.com/wal-g/wal-g/releases/download/v2.0.1/wal-g-pg-ubuntu-20.04-amd64.tar.gz
tar -xzf wal-g-pg-ubuntu-20.04-amd64.tar.gz
sudo mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
sudo chmod +x /usr/local/bin/wal-g

# Configure (wal-g reads these as environment variables; export this file
# into the postgres environment, e.g. via a wrapper script)
sudo -u postgres tee /var/lib/postgresql/.walrc <<EOF
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
WALG_S3_PREFIX=s3://my-bucket/postgres-wal
EOF

# Set archive_command
archive_command = '/usr/local/bin/wal-g wal-push %p'

C. Archive with compression

BASH
# Compress before archiving
archive_command = 'gzip < %p > /mnt/wal_archive/%f.gz'

# Or with pigz (parallel gzip)
archive_command = 'pigz < %p > /mnt/wal_archive/%f.gz'

# The matching restore_command must then decompress:
restore_command = 'gunzip < /mnt/wal_archive/%f.gz > %p'

2.4. Monitor archiving

SQL
-- Check archiving status
SELECT archived_count, 
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;

-- Example output:
--  archived_count | failed_count | last_archived_wal        | last_archived_time          
-- ----------------+--------------+--------------------------+-----------------------------
--            1234 |            0 | 000000010000000000000056 | 2024-11-25 10:30:15.123456

-- If failed_count > 0, check logs!

BASH
# Check PostgreSQL logs for archive errors
sudo journalctl -u postgresql | grep -i archive

# Common errors:
# - Permission denied on archive directory
# - Archive directory full
# - Network timeout (for remote archiving)
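
Those three local failure modes (missing directory, bad permissions, full disk) can be caught before PostgreSQL trips over them. A preflight sketch, demonstrated against a temp directory; point it at the real archive path in production:

```shell
# Preflight check for the archive destination: exists, writable, enough free space.
check_archive_dir() {
  local dir=$1 min_free_kb=${2:-1048576}        # default: want >= 1 GB free
  [ -d "$dir" ] || { echo "FAIL: $dir does not exist"; return 1; }
  [ -w "$dir" ] || { echo "FAIL: $dir not writable"; return 1; }
  local free_kb
  free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge "$min_free_kb" ] || { echo "FAIL: only ${free_kb} kB free"; return 1; }
  echo "OK: $dir writable with ${free_kb} kB free"
}

demo=$(mktemp -d)
check_archive_dir "$demo" 1
check_archive_dir /no/such/archive 1 || true
```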

3. Base Backup with pg_basebackup

3.1. Basic pg_basebackup

BASH
# Full backup to directory
sudo -u postgres pg_basebackup \
  -D /backup/base/$(date +%Y%m%d_%H%M%S) \
  -Fp \
  -Xs \
  -P \
  -v

# Flags:
# -D: Destination directory
# -Fp: Plain format (directory)
# -Xs: Stream WAL during backup (ensures consistency)
# -P: Show progress
# -v: Verbose

Output:

TEXT
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/6000000 on timeline 3
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_12345"
245678/245678 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/6000168
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed

3.2. Compressed tar backup

BASH
# Backup as compressed tar
sudo -u postgres pg_basebackup \
  -D /backup/tar \
  -Ft \
  -z \
  -P \
  -v

# Flags:
# -Ft: Tar format
# -z: Gzip compression

# Result:
ls -lh /backup/tar/
# base.tar.gz         (main data)
# pg_wal.tar.gz       (WAL files)
# backup_manifest     (verification)

3.3. Backup to remote server

BASH
# Stream directly to remote server
# (-D - writes the tar to stdout, which requires -Xfetch or -Xnone;
#  WAL streaming would need a second tar stream)
sudo -u postgres pg_basebackup \
  -D - \
  -Ft \
  -Xfetch \
  -z \
  | ssh backup-server "cat > /backup/postgres-$(date +%Y%m%d).tar.gz"

3.4. Backup with replication slot

BASH
# Create replication slot first
sudo -u postgres psql -c "
  SELECT pg_create_physical_replication_slot('backup_slot');
"

# Backup using slot
sudo -u postgres pg_basebackup \
  -D /backup/base/$(date +%Y%m%d) \
  -Fp \
  -Xs \
  -P \
  -S backup_slot

# Slot ensures WAL files aren't removed during backup

3.5. Verify backup

BASH
# Check backup_manifest
cat /backup/base/20241125/backup_manifest | jq

# Verify checksum
sudo -u postgres pg_verifybackup /backup/base/20241125

# Output:
# backup successfully verified
# ✅

4. Point-in-Time Recovery (PITR)

4.1. PITR concepts

PITR = Restore the database to any point in time after the base backup (not just the backup moment)

TEXT
Timeline:

T0: Base backup taken
  ↓
T1: Transaction A committed
  ↓
T2: Transaction B committed
  ↓
T3: Transaction C committed (ERROR! Want to undo)
  ↓
T4: Now

With PITR, can restore to T2 (before Transaction C)

Requirements:

  1. Base backup (pg_basebackup)
  2. WAL archive from backup until target time
  3. Recovery target specification
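
Requirement 2 means the archive must hold an unbroken chain of segments from the backup's start point to the target. Since segment names increment in hex (the last field wraps at 0xFF into the next "log"), a gap check can be sketched in pure bash against a mock archive:

```shell
# Check that the archive holds every segment between two names (inclusive),
# on one timeline. The segment field wraps at 0xFF into the next log file.
next_wal() {                           # $1 = segment name -> next segment
  local tli=${1:0:8} log=$((16#${1:8:8})) seg=$((16#${1:16:8}))
  seg=$((seg + 1))
  if [ "$seg" -eq 256 ]; then seg=0; log=$((log + 1)); fi
  printf '%s%08X%08X\n' "$tli" "$log" "$seg"
}

range_complete() {                     # $1 = first, $2 = last, $3 = archive dir
  local cur=$1 stop=$2 dir=$3
  while :; do
    [ -f "$dir/$cur" ] || { echo "gap: $cur missing"; return 1; }
    [ "$cur" = "$stop" ] && { echo "range complete"; return 0; }
    cur=$(next_wal "$cur")
  done
}

mockarc=$(mktemp -d)
touch "$mockarc"/0000000100000000000000{04,05,06}
range_complete 000000010000000000000004 000000010000000000000006 "$mockarc"
rm "$mockarc/000000010000000000000005"
range_complete 000000010000000000000004 000000010000000000000006 "$mockarc" || true
```

The first call reports a complete range; after removing a segment, the second pinpoints the gap, which is exactly the situation that makes a PITR attempt fail mid-replay.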

4.2. Prepare for PITR

Create recovery directory:

BASH
# Stop PostgreSQL on target server
sudo systemctl stop postgresql

# Backup current data (safety)
sudo mv /var/lib/postgresql/18/data /var/lib/postgresql/18/data.old

# Create new data directory
sudo mkdir -p /var/lib/postgresql/18/data
sudo chown postgres:postgres /var/lib/postgresql/18/data

Restore base backup:

BASH
# From plain directory backup
sudo cp -a /backup/base/20241125/* /var/lib/postgresql/18/data/

# Or from tar backup
cd /var/lib/postgresql/18/data
sudo -u postgres tar -xzf /backup/tar/base.tar.gz
sudo -u postgres tar -xzf /backup/tar/pg_wal.tar.gz

4.3. Configure recovery

Create recovery configuration:

BASH
# PostgreSQL 12+: Use recovery.signal + postgresql.conf

# Step 1: Create recovery.signal
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal

# Step 2: Configure recovery in postgresql.conf
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<EOF
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-11-25 10:30:00'
recovery_target_action = 'promote'
EOF

Recovery parameters:

TEXT
restore_command: 'cp /mnt/wal_archive/%f %p'
  # How to fetch archived WAL files
  # %f = WAL filename
  # %p = Destination path

recovery_target_time: '2024-11-25 10:30:00'
  # Restore to this timestamp

recovery_target_action: 'promote'
  # After reaching target: promote to normal operation
  # Options: 'pause', 'promote', 'shutdown'

4.4. Recovery target options

A. Recover to specific time

INI
# In postgresql.auto.conf
recovery_target_time = '2024-11-25 10:30:00'

B. Recover to specific transaction

SQL
-- Find a transaction ID before the bad transaction
SELECT txid_current();

INI
# In postgresql.auto.conf
recovery_target_xid = '12345678'

C. Recover to specific LSN

INI
# In postgresql.auto.conf
recovery_target_lsn = '0/6000000'

D. Recover to latest

INI
# In postgresql.auto.conf
# No recovery_target_* parameter
# Will replay all available WAL

E. Recovery target inclusive/exclusive

INI
# Default: inclusive (stop just AFTER the target)
recovery_target_inclusive = 'on'

# Exclusive: stop just BEFORE the target
recovery_target_inclusive = 'off'

4.5. Perform recovery

BASH
# Start PostgreSQL
sudo systemctl start postgresql

# Monitor logs
sudo journalctl -u postgresql -f

Log output:

TEXT
2024-11-25 11:00:00 LOG: starting PostgreSQL 18.0
2024-11-25 11:00:01 LOG: entering standby mode
2024-11-25 11:00:02 LOG: redo starts at 0/6000000
2024-11-25 11:00:03 LOG: restored log file "000000010000000000000006" from archive
2024-11-25 11:00:05 LOG: restored log file "000000010000000000000007" from archive
2024-11-25 11:00:08 LOG: recovery stopping before commit of transaction 12345678, time 2024-11-25 10:30:00
2024-11-25 11:00:09 LOG: pausing at the end of recovery
2024-11-25 11:00:09 HINT: Execute pg_wal_replay_resume() to continue.
(Recovery pauses like this when recovery_target_action = 'pause', the
default; with 'promote', as configured above, it continues automatically.)

If paused, resume:

SQL
-- Check status
SELECT pg_is_in_recovery();
-- true

-- Resume (will promote if action = promote)
SELECT pg_wal_replay_resume();

-- Or promote manually
SELECT pg_promote();

Verify recovery:

SQL
-- Check database state
SELECT pg_is_in_recovery();
-- false (if promoted)

-- Verify data
SELECT * FROM critical_table WHERE created_at >= '2024-11-25 10:25:00';
-- Should see data up to recovery target time

4.6. Timeline after PITR

TEXT
After PITR, timeline increments:

Original timeline: 3
After PITR: 4

This prevents accidentally replaying WAL from a "future" timeline.

BASH
# Check new timeline
sudo -u postgres psql -c "
  SELECT timeline_id FROM pg_control_checkpoint();
"
# timeline_id
# ------------
#           4
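
Each timeline also leaves a small `.history` file in pg_wal/ (and the archive), named by the timeline ID in 8-digit hex — handy to recognize when browsing an archive after PITR. A trivial sketch of the naming:

```shell
# Timeline history files are named by the timeline ID in 8-digit uppercase hex.
history_file() { printf '%08X.history\n' "$1"; }

history_file 4      # the post-PITR timeline from the example above
# → 00000004.history
```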

5. Automation with Patroni

5.1. Patroni WAL archiving

Configure in patroni.yml:

YAML
postgresql:
  parameters:
    wal_level: replica
    archive_mode: 'on'
    archive_command: 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
    archive_timeout: 300
  
  # Restore command for replicas
  recovery_conf:
    restore_command: 'cp /mnt/wal_archive/%f %p'

Patroni automatically:

  • Configures archiving on primary
  • Configures restore on replicas
  • Handles timeline changes

5.2. Backup script

BASH
#!/bin/bash
# backup.sh - Automated PostgreSQL backup

set -e

BACKUP_DIR="/backup/postgres"
RETENTION_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory (owned by postgres so pg_basebackup can write)
install -d -o postgres -g postgres "$BACKUP_DIR/$DATE"

# Run pg_basebackup
sudo -u postgres pg_basebackup \
  -D "$BACKUP_DIR/$DATE" \
  -Fp \
  -Xs \
  -P \
  -v \
  -c fast

# Verify backup
sudo -u postgres pg_verifybackup "$BACKUP_DIR/$DATE"

# Create metadata
cat > "$BACKUP_DIR/$DATE/backup_info.txt" <<EOF
Backup Date: $(date)
Hostname: $(hostname)
PostgreSQL Version: $(sudo -u postgres psql -t -c "SELECT version();")
Database Size: $(du -sh "$BACKUP_DIR/$DATE" | awk '{print $1}')
EOF

# Remove old backups
find "$BACKUP_DIR" -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;

# Log
echo "$(date): Backup completed successfully: $BACKUP_DIR/$DATE" | tee -a /var/log/postgres-backup.log

# Optional: Upload to S3
# aws s3 sync "$BACKUP_DIR/$DATE" "s3://my-bucket/postgres-backups/$DATE/"

# Send notification
# curl -X POST https://hooks.slack.com/... -d '{"text":"Backup completed"}'

Schedule with cron:

BASH
# Run daily at 2 AM (root's crontab, since the script calls sudo)
sudo crontab -e

# Add:
0 2 * * * /usr/local/bin/backup.sh >> /var/log/postgres-backup.log 2>&1

5.3. WAL-G integration

Install WAL-G:

BASH
# Download
wget https://github.com/wal-g/wal-g/releases/download/v2.0.1/wal-g-pg-ubuntu-20.04-amd64.tar.gz
tar -xzf wal-g-pg-ubuntu-20.04-amd64.tar.gz
sudo mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
sudo chmod +x /usr/local/bin/wal-g

Configure:

BASH
# Create config (wal-g reads these as environment variables; export this
# file into the postgres environment, e.g. via a wrapper script)
sudo -u postgres tee /var/lib/postgresql/.walrc <<EOF
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION=us-east-1
WALG_S3_PREFIX=s3://my-bucket/postgres
WALG_COMPRESSION_METHOD=lz4
WALG_DELTA_MAX_STEPS=6
EOF

# Update postgresql.conf
archive_command = '/usr/local/bin/wal-g wal-push %p'
restore_command = '/usr/local/bin/wal-g wal-fetch %f %p'

Take backup with WAL-G:

BASH
# Full backup
sudo -u postgres wal-g backup-push /var/lib/postgresql/18/data

# List backups
sudo -u postgres wal-g backup-list

# name                          modified             wal_segment_backup_start
# base_000000010000000000000004 2024-11-25T10:00:00Z 000000010000000000000004

Restore with WAL-G:

BASH
# Stop PostgreSQL
sudo systemctl stop postgresql

# Clear data directory
sudo rm -rf /var/lib/postgresql/18/data/*

# Restore latest backup
sudo -u postgres wal-g backup-fetch /var/lib/postgresql/18/data LATEST

# Or restore specific backup
sudo -u postgres wal-g backup-fetch /var/lib/postgresql/18/data base_000000010000000000000004

# Configure PITR (if needed)
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
echo "recovery_target_time = '2024-11-25 10:30:00'" | \
  sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf

# Start PostgreSQL
sudo systemctl start postgresql

6. Disaster Recovery Planning

6.1. DR strategy

3-2-1 Rule:

TEXT
3: Keep 3 copies of data
2: Store on 2 different media types
1: Keep 1 copy off-site

Example:
- Production database (live)
- Local backup (same datacenter)
- S3 backup (cloud, different region)

6.2. DR checklist

Preparation:

TEXT
✅ WAL archiving enabled and tested
✅ Regular base backups (daily/weekly)
✅ Off-site backup storage (S3, remote server)
✅ Backup verification automated
✅ Restore procedures documented
✅ DR drills scheduled (quarterly)
✅ Monitoring and alerting configured
✅ Backup retention policy defined
✅ Encryption for backups (at rest and in transit)
✅ Access controls (who can restore)

6.3. DR scenarios and procedures

Scenario 1: Database corruption

BASH
# 1. Identify corruption
sudo -u postgres psql -c "SELECT * FROM corrupt_table;"  # ERROR

# 2. Stop PostgreSQL
sudo systemctl stop postgresql

# 3. Restore from last good backup
sudo rm -rf /var/lib/postgresql/18/data
sudo cp -a /backup/base/20241125 /var/lib/postgresql/18/data
sudo chown -R postgres:postgres /var/lib/postgresql/18/data

# 4. Configure PITR to just before corruption
echo "recovery_target_time = '2024-11-25 09:55:00'" | \
  sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal

# 5. Start and verify
sudo systemctl start postgresql

Scenario 2: Datacenter failure

BASH
# 1. Provision new servers in different region

# 2. Install PostgreSQL + Patroni

# 3. Restore from off-site backup (S3)
aws s3 sync s3://my-bucket/postgres-backups/20241125 /backup/restore/
sudo cp -a /backup/restore/. /var/lib/postgresql/18/data/
sudo chown -R postgres:postgres /var/lib/postgresql/18/data

# 4. Restore WAL from S3
# (configure restore_command with wal-g or S3)

# 5. Start cluster
sudo systemctl start patroni

# 6. Update DNS/Load balancer to new region

# RTO: ~1-2 hours (depends on backup size and network)

Scenario 3: Accidental data deletion

SQL
-- Oops: DELETE FROM users;  (forgot the WHERE clause)

-- Option A: PITR to before deletion
-- (See section 4.4)

-- Option B: Restore to a separate instance and copy data back
-- sudo -u postgres pg_basebackup -D /tmp/restore ...
-- Start on a different port
-- Copy the missing rows back to production

6.4. Recovery metrics

RTO (Recovery Time Objective):

TEXT
Target: < 2 hours for full DR

Breakdown:
- Detection: 5-10 min
- Decision: 10-15 min
- Restore base backup: 30-60 min
- Replay WAL: 10-30 min
- Verification: 10-20 min
- DNS/Traffic switch: 5-10 min

Total: ~70-145 min

RPO (Recovery Point Objective):

TEXT
Target: < 5 minutes data loss

With synchronous replication: 0 (zero data loss)
With WAL archiving: < archive_timeout (e.g., 5 min)
With daily backup only: Up to 24 hours ❌

7. Monitoring and Alerting

7.1. Key backup metrics

SQL
-- Archive status
SELECT archived_count, 
       failed_count,
       EXTRACT(EPOCH FROM (now() - last_archived_time)) AS seconds_since_last_archive
FROM pg_stat_archiver;

-- Alert if seconds_since_last_archive > 600 (10 min)

BASH
# Backup age
find /backup/base -maxdepth 1 -type d -name "202*" -mtime -1 | wc -l
# Should be >= 1 (at least one backup in last 24h)
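
That freshness check can be rehearsed on mock backup directories before wiring it into monitoring (assuming GNU `touch -d` for back-dating mtimes):

```shell
# Rehearse the backup-age check: two mock backup dirs, one stale, one fresh.
# Assumes GNU touch (-d "... ago").
backups=$(mktemp -d)
mkdir "$backups/20241120_020000" "$backups/20241125_020000"
touch -d "5 days ago"  "$backups/20241120_020000"
touch -d "2 hours ago" "$backups/20241125_020000"

recent=$(find "$backups" -maxdepth 1 -type d -name "202*" -mtime -1 | wc -l | tr -d ' ')
echo "backups in last 24h: $recent"
if [ "$recent" -ge 1 ]; then echo "OK"; else echo "ALERT: no recent backup"; fi
```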

7.2. Prometheus metrics

YAML
# Alert rules
groups:
  - name: backup_alerts
    rules:
      - alert: PostgreSQLArchivingFailed
        expr: pg_stat_archiver_failed_count > 0
        labels:
          severity: critical
        annotations:
          summary: "WAL archiving failures detected"
      
      - alert: PostgreSQLNoRecentBackup
        # pg_backup_last_success_timestamp is a custom metric your backup
        # job must export (e.g. via the node_exporter textfile collector)
        expr: time() - pg_backup_last_success_timestamp > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No backup in last 24 hours"
      
      - alert: PostgreSQLArchiveDelayHigh
        expr: |
          time() - pg_stat_archiver_last_archived_time > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WAL archiving delayed > 10 min"

7.3. Backup verification

BASH
#!/bin/bash
# verify-backup.sh

BACKUP_DIR="/backup/base"
LATEST_BACKUP=$(ls -td $BACKUP_DIR/*/ | head -1)

echo "Verifying: $LATEST_BACKUP"

# 1. Check backup_manifest exists
if [ ! -f "$LATEST_BACKUP/backup_manifest" ]; then
  echo "❌ backup_manifest missing"
  exit 1
fi

# 2. Run pg_verifybackup
if sudo -u postgres pg_verifybackup "$LATEST_BACKUP" > /dev/null 2>&1; then
  echo "✅ Backup verified successfully"
else
  echo "❌ Backup verification failed"
  exit 1
fi

# 3. Check size (should be reasonable)
SIZE=$(du -sb "$LATEST_BACKUP" | awk '{print $1}')
MIN_SIZE=1000000000  # 1GB minimum
if [ "$SIZE" -lt "$MIN_SIZE" ]; then
  echo "⚠️  Backup size suspicious: $(du -sh "$LATEST_BACKUP" | awk '{print $1}')"
  exit 1
fi

# 4. Test restore to temp location (optional, resource-intensive)
# ...

echo "✅ All backup checks passed"

8. Best Practices

✅ DO

  1. Enable WAL archiving: Required for PITR.
  2. Automate backups: Daily pg_basebackup via cron/systemd timer.
  3. Test restores regularly: Monthly DR drills.
  4. Monitor archiving: Alert on failures.
  5. Keep multiple backup generations: 7 daily + 4 weekly + 12 monthly.
  6. Off-site backups: S3, different region/datacenter.
  7. Encrypt backups: At rest and in transit.
  8. Document procedures: Runbooks for restoration.
  9. Verify backups: pg_verifybackup after each backup.
  10. Calculate RTO/RPO: Know your limits.

❌ DON'T

  1. Don't skip testing: Untested backup = no backup.
  2. Don't store only locally: Datacenter failure = data loss.
  3. Don't ignore archive failures: Silent data loss risk.
  4. Don't delete WAL too early: Need for PITR.
  5. Don't forget retention: Storage costs vs recovery needs.
  6. Don't backup to same disk: Disk failure = everything lost.
  7. Don't skip encryption: Security/compliance risk.
  8. Don't assume it works: Verify, verify, verify.

9. Lab Exercises

Lab 1: Setup WAL archiving

Tasks:

  1. Configure archive_mode and archive_command.
  2. Create archive directory.
  3. Force WAL switch: SELECT pg_switch_wal();.
  4. Verify WAL files in archive directory.
  5. Monitor pg_stat_archiver.

Lab 2: Take base backup

Tasks:

  1. Use pg_basebackup to create backup.
  2. Verify with pg_verifybackup.
  3. Calculate backup size and time.
  4. Compress backup and compare size.
  5. Document backup metadata.

Lab 3: Perform PITR

Tasks:

  1. Create test table with timestamps.
  2. Take base backup.
  3. Insert more data.
  4. Note specific timestamp.
  5. Insert "bad" data after timestamp.
  6. Restore to timestamp (before bad data).
  7. Verify recovery point is correct.

Lab 4: Automate backup

Tasks:

  1. Write backup script with retention.
  2. Add error handling and notifications.
  3. Schedule with cron.
  4. Test script execution.
  5. Monitor backup logs.

Lab 5: DR drill

Tasks:

  1. Simulate total database loss (remove data directory).
  2. Restore from backup.
  3. Replay WAL to latest point.
  4. Measure RTO (time to restore).
  5. Verify data integrity.
  6. Document lessons learned.

10. Troubleshooting

Issue: Archiving not working

Symptoms: failed_count > 0 in pg_stat_archiver.

Check:

BASH
# Check archive_command manually
sudo -u postgres bash -c 'f=000000010000000000000001; p=/var/lib/postgresql/18/data/pg_wal/$f; test ! -f /mnt/wal_archive/$f && cp $p /mnt/wal_archive/$f'

echo $?  # Should be 0

Common causes:

  • Permission denied on archive directory.
  • Archive directory doesn't exist.
  • Disk full.
  • Network issue (if remote archiving).

Issue: pg_basebackup slow

Optimize:

BASH
# Use compression
pg_basebackup -z ...

# Server-side compression (PostgreSQL 15+) reduces network transfer
pg_basebackup --compress=server-zstd ...

# pg_basebackup has no parallel mode; for parallel backups consider
# pgBackRest or pg_probackup

# Adjust checkpoint_timeout
ALTER SYSTEM SET checkpoint_timeout = '15min';

# Use faster storage for backup destination

Issue: PITR fails - WAL not found

Error: could not open file "000000010000000000000007": No such file or directory

Cause: WAL file not in archive or restore_command wrong.

Fix:

BASH
# Check WAL exists in archive
ls -lh /mnt/wal_archive/000000010000000000000007

# Test restore_command manually
sudo -u postgres cp /mnt/wal_archive/000000010000000000000007 /tmp/test_wal

# If missing, cannot recover to that point
# Restore to earlier point or use latest available WAL
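
Because WAL filenames on a single timeline sort lexicographically in LSN order, the newest segment actually present in the archive — and therefore the furthest recoverable point — falls out of a plain sort. Sketched on a mock archive:

```shell
# Find the newest archived segment: on one timeline, WAL names sort
# lexicographically in LSN order, so the last name is the recovery horizon.
mockwal=$(mktemp -d)
touch "$mockwal"/0000000100000000000000{04,05,06}
latest=$(ls "$mockwal" | grep -E '^[0-9A-F]{24}$' | sort | tail -1)
echo "latest archived segment: $latest"
```

The grep also filters out `.history` and `.backup` files, which would otherwise pollute the sort.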

11. Summary

Backup Strategy Summary

Method          RPO            RTO          Complexity  Use Case
pg_dump         Hours/Days     Hours        Low         Small DBs, migration
pg_basebackup   Last backup    30-120 min   Medium      Regular backups
WAL archiving   Minutes        30-120 min   Medium      PITR capability
Replication     Seconds/0      30-60 sec    High        HA, zero data loss

Key Concepts

WAL Archiving: Continuous backup of transaction logs.

pg_basebackup: Physical backup of entire cluster.

PITR: Restore to any point in time using base backup + WAL.

RTO: How fast can you recover.

RPO: How much data loss is acceptable.

3-2-1 Rule: 3 copies, 2 media types, 1 off-site.

Recovery Checklist

  •  Stop PostgreSQL
  •  Restore base backup
  •  Create recovery.signal
  •  Configure restore_command
  •  Set recovery target (time/xid/lsn)
  •  Start PostgreSQL
  •  Monitor recovery logs
  •  Verify recovery point
  •  Promote if satisfied
  •  Update replication if needed

Next Steps

Lesson 17 will cover Monitoring and Observability:

  • Prometheus + Grafana setup
  • Key PostgreSQL metrics
  • Patroni monitoring
  • Log aggregation
  • Alerting strategies
  • Performance dashboards
