Backup and Point-in-Time Recovery PITR
Objectives
After this lesson, you will be able to:
- Set up WAL archiving
- Perform backups with
pg_basebackup - Configure continuous archiving
- Restore the database to a specific point in time
- Automate backup strategies
- Implement a disaster recovery plan
1. Backup Strategies Overview
1.1. Types of backups
A. Logical Backup
B. Physical Backup
C. Continuous Archiving (WAL Archiving)
1.2. RTO and RPO
RTO (Recovery Time Objective) = How long to restore service?
RPO (Recovery Point Objective) = How much data loss acceptable?
1.3. Backup strategy decision matrix
| Requirement | Solution |
|---|---|
| Zero data loss | Synchronous replication + PITR |
Fast recovery (<1hr) | Streaming replication |
| PITR capability | WAL archiving + pg_basebackup |
| Long-term retention | Periodic pg_basebackup |
| Disaster recovery | Off-site backups + PITR |
2. WAL Archiving Setup
2.1. Understanding WAL archiving
WAL (Write-Ahead Log) = Transaction log files
WAL segments:
2.2. Configure WAL archiving
PostgreSQL configuration
Parameters explained:
Create archive directory
2.3. Advanced archive commands
A. Archive to remote server (rsync)
B. Archive to S3 (wal-g)
C. Archive with compression
2.4. Monitor archiving
3. Base Backup with pg_basebackup
3.1. Basic pg_basebackup
Output:
3.2. Compressed tar backup
3.3. Backup to remote server
3.4. Backup with replication slot
3.5. Verify backup
4. Point-in-Time Recovery (PITR)
4.1. PITR concepts
PITR = Restore database to any point in time (not just backup time)
Requirements:
- Base backup (pg_basebackup)
- WAL archive from backup until target time
- Recovery target specification
4.2. Prepare for PITR
Create recovery directory:
Restore base backup:
4.3. Configure recovery
Create recovery configuration:
Recovery parameters:
4.4. Recovery target options
A. Recover to specific time
B. Recover to specific transaction
C. Recover to specific LSN
D. Recover to latest
E. Recovery target inclusive/exclusive
4.5. Perform recovery
Log output:
If paused, resume:
Verify recovery:
4.6. Timeline after PITR
5. Automation with Patroni
5.1. Patroni WAL archiving
Configure in patroni.yml:
Patroni automatically:
- Configures archiving on primary
- Configures restore on replicas
- Handles timeline changes
5.2. Backup script
Schedule with cron:
5.3. WAL-G integration
Install WAL-G:
Configure:
Take backup with WAL-G:
Restore with WAL-G:
6. Disaster Recovery Planning
6.1. DR strategy
3-2-1 Rule:
6.2. DR checklist
Preparation:
6.3. DR scenarios and procedures
Scenario 1: Database corruption
Scenario 2: Datacenter failure
Scenario 3: Accidental data deletion
6.4. Recovery metrics
RTO (Recovery Time Objective):
RPO (Recovery Point Objective):
7. Monitoring and Alerting
7.1. Key backup metrics
7.2. Prometheus metrics
7.3. Backup verification
8. Best Practices
✅ DO
- Enable WAL archiving: Required for PITR.
- Automate backups: Daily
pg_basebackupvia cron/systemd timer. - Test restores regularly: Monthly DR drills.
- Monitor archiving: Alert on failures.
- Keep multiple backup generations: 7 daily + 4 weekly + 12 monthly.
- Off-site backups: S3, different region/datacenter.
- Encrypt backups: At rest and in transit.
- Document procedures: Runbooks for restoration.
- Verify backups:
pg_verifybackupafter each backup. - Calculate RTO/RPO: Know your limits.
❌ DON'T
- Don't skip testing: Untested backup = no backup.
- Don't store only locally: Datacenter failure = data loss.
- Don't ignore archive failures: Silent data loss risk.
- Don't delete WAL too early: Need for PITR.
- Don't forget retention: Storage costs vs recovery needs.
- Don't backup to same disk: Disk failure = everything lost.
- Don't skip encryption: Security/compliance risk.
- Don't assume it works: Verify, verify, verify.
9. Lab Exercises
Lab 1: Setup WAL archiving
Tasks:
- Configure
archive_modeandarchive_command. - Create archive directory.
- Force WAL switch:
SELECT pg_switch_wal();. - Verify WAL files in archive directory.
- Monitor
pg_stat_archiver.
Lab 2: Take base backup
Tasks:
- Use
pg_basebackupto create backup. - Verify with
pg_verifybackup. - Calculate backup size and time.
- Compress backup and compare size.
- Document backup metadata.
Lab 3: Perform PITR
Tasks:
- Create test table with timestamps.
- Take base backup.
- Insert more data.
- Note specific timestamp.
- Insert "bad" data after timestamp.
- Restore to timestamp (before bad data).
- Verify recovery point is correct.
Lab 4: Automate backup
Tasks:
- Write backup script with retention.
- Add error handling and notifications.
- Schedule with cron.
- Test script execution.
- Monitor backup logs.
Lab 5: DR drill
Tasks:
- Simulate total database loss (remove data directory).
- Restore from backup.
- Replay WAL to latest point.
- Measure RTO (time to restore).
- Verify data integrity.
- Document lessons learned.
10. Troubleshooting
Issue: Archiving not working
Symptoms: failed_count > 0 in pg_stat_archiver.
Check:
Common causes:
- Permission denied on archive directory.
- Archive directory doesn't exist.
- Disk full.
- Network issue (if remote archiving).
Issue: pg_basebackup slow
Optimize:
Issue: PITR fails - WAL not found
Error: could not open file "000000010000000000000007": No such file or directory
Cause: WAL file not in archive or restore_command wrong.
Fix:
11. Summary
Backup Strategy Summary
| Method | RPO | RTO | Complexity | Use Case |
|---|---|---|---|---|
| pg_dump | Hours/Days | Hours | Low | Small DBs, migration |
| pg_basebackup | Last backup | 30-120 min | Medium | Regular backups |
| WAL archiving | Minutes | 30-120 min | Medium | PITR capability |
| Replication | Seconds/0 | 30-60 sec | High | HA, zero data loss |
Key Concepts
✅ WAL Archiving: Continuous backup of transaction logs.
✅ pg_basebackup: Physical backup of entire cluster.
✅ PITR: Restore to any point in time using base backup + WAL.
✅ RTO: How fast can you recover.
✅ RPO: How much data loss is acceptable.
✅ 3-2-1 Rule: 3 copies, 2 media types, 1 off-site.
Recovery Checklist
- Stop PostgreSQL
- Restore base backup
- Create recovery.signal
- Configure restore_command
- Set recovery target (time/xid/lsn)
- Start PostgreSQL
- Monitor recovery logs
- Verify recovery point
- Promote if satisfied
- Update replication if needed
Next Steps
Lesson 17 will cover Monitoring and Observability:
- Prometheus + Grafana setup
- Key PostgreSQL metrics
- Patroni monitoring
- Log aggregation
- Alerting strategies
- Performance dashboards