Lesson 27: Disaster Recovery Drills

After this lesson, you will be able to:

  • Plan comprehensive disaster recovery procedures.
  • Execute DR drills systematically.
  • Measure and optimize RTO/RPO.
  • Conduct incident response exercises.
  • Document and improve DR processes.

1. DR Planning Foundation

1.1. Key DR metrics

TEXT
RTO (Recovery Time Objective):
- Maximum acceptable downtime
- Example: 15 minutes

RPO (Recovery Point Objective):
- Maximum acceptable data loss
- Example: 5 minutes

RTA (Recovery Time Actual):
- Actual time taken in drill
- Goal: RTA < RTO

RPD (Recovery Point Detected):
- Actual data loss in drill
- Goal: RPD < RPO
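
During a drill, RTA and the achieved recovery point are easiest to derive from timestamps recorded as the drill unfolds. A minimal sketch of the arithmetic, assuming the failure-injection and service-restored times are noted by the observer (the values below are illustrative):

BASH
# Compare actual recovery time (RTA) against the RTO target.
FAILURE_INJECTED="2024-11-25 10:00:05"
SERVICE_RESTORED="2024-11-25 10:00:40"
RTO_TARGET_SECONDS=120

RTA_SECONDS=$(( $(date -d "$SERVICE_RESTORED" +%s) - $(date -d "$FAILURE_INJECTED" +%s) ))
echo "RTA: ${RTA_SECONDS}s (target: ${RTO_TARGET_SECONDS}s)"
[ "$RTA_SECONDS" -le "$RTO_TARGET_SECONDS" ] && echo "PASS: RTA within RTO" || echo "FAIL: RTA exceeds RTO"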

1.2. DR scenarios to test

  1. Single node failure
    • Impact: Low (automatic failover)
    • RTO: < 1 minute
    • RPO: 0 (synchronous replication)
  2. Leader node failure
    • Impact: Medium (brief disruption)
    • RTO: < 2 minutes
    • RPO: 0
  3. Complete datacenter failure
    • Impact: High (manual intervention)
    • RTO: < 15 minutes
    • RPO: < 5 minutes
  4. Data corruption
    • Impact: High (PITR required)
    • RTO: 1-4 hours
    • RPO: Last valid backup
  5. Human error (DROP TABLE)
    • Impact: Medium-High
    • RTO: 30 minutes - 2 hours
    • RPO: Point-in-time before error

2. DR Drill Preparation

2.1. Pre-drill checklist

TEXT
☐ Review DR documentation
☐ Verify all backups are current
☐ Test backup restoration (dry run; see the sketch after this checklist)
☐ Confirm monitoring/alerting works
☐ Notify stakeholders of drill
☐ Schedule during low-traffic period
☐ Prepare rollback procedure
☐ Assemble response team
☐ Set up communication channels (Slack, Zoom)
☐ Document drill objectives
☐ Prepare stopwatch for timing
☐ Set up screen recording (for post-mortem)
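
For the backup dry-run item above, one lightweight check is to validate the most recent base backup against its manifest with pg_verifybackup (available since PostgreSQL 13). A minimal sketch, assuming base backups are stored one directory per backup under /var/backups/postgres (adjust to your repository layout):

BASH
# Verify the newest base backup against its backup_manifest (file list, sizes, checksums)
LATEST_BACKUP=$(ls -td /var/backups/postgres/*/ | head -1)
echo "Verifying: $LATEST_BACKUP"
sudo -u postgres pg_verifybackup "$LATEST_BACKUP" \
  && echo "Backup verification OK" \
  || echo "Backup verification FAILED - fix before running the drill"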

2.2. DR team roles

  • Incident Commander: Owns overall response, makes final decisions, coordinates teams.
  • Database Admin: Executes PostgreSQL recovery, manages Patroni cluster, validates data integrity.
  • System Admin: Manages infrastructure, network connectivity, firewall rules.
  • Application Owner: Tests application functionality, validates business logic, user acceptance testing.
  • Communications Lead: Updates stakeholders, documents the timeline, facilitates the post-mortem.
  • Observer (optional): Takes notes, times each step, identifies improvements.

3. Scenario 1: Single Replica Failure

3.1. Drill procedure

BASH
# Step 1: Simulate replica failure (10:00:00)
ssh node2 "sudo systemctl stop patroni"

# Step 2: Monitor automatic recovery (10:00:15)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'

# Expected output after 30 seconds:
# + Cluster: postgres-cluster -------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Leader  | running |  5 |           |
# | node2  | 10.0.1.12  | Replica | STOPPED |    |           |  ← Down
# | node3  | 10.0.1.13  | Replica | running |  5 |         0 |
# +--------+------------+---------+---------+----+-----------+

# Step 3: Verify read traffic routes to remaining replica (10:01:00)
psql -h haproxy-vip -U postgres -c "SELECT inet_server_addr();"
# Should return 10.0.1.11 (node1) or 10.0.1.13 (node3), NOT 10.0.1.12 (node2)

# Step 4: Restore failed replica (10:05:00)
ssh node2 "sudo systemctl start patroni"

# Step 5: Wait for replication catchup (10:05:30)
patronictl -c /etc/patroni/patroni.yml list
# node2 should show "streaming" state

# Step 6: Verify replication lag is minimal (10:06:00)
psql -h node2 -U postgres -c "
  SELECT pg_wal_lsn_diff(
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn()
  ) AS lag_bytes;
"
# lag_bytes should be < 1MB
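
For extra confidence in Step 3, you can sample the read endpoint repeatedly and list which backends actually served the connections. A minimal sketch, reusing the haproxy-vip endpoint from above:

BASH
# Open 20 connections through HAProxy and count answers per backend address;
# 10.0.1.12 (node2) should not appear while it is down.
for i in $(seq 1 20); do
  psql -h haproxy-vip -U postgres -At -c "SELECT inet_server_addr();"
done | sort | uniq -c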

3.2. Expected results

  • Timeline:
    • 10:00:00: Failure injected
    • 10:00:30: Failure detected by Patroni
    • 10:01:00: Traffic automatically rerouted
    • 10:05:00: Recovery initiated
    • 10:06:00: Full recovery complete
  • RTO: 1 minute (time until traffic rerouted)
  • RPO: 0 bytes (no data loss)
  • Impact:
    • No application downtime
    • Slightly increased load on remaining replica
    • Monitoring alerts triggered (expected)

4. Scenario 2: Leader Failover

4.1. Drill procedure

BASH
# Step 1: Record current leader (10:00:00)
CURRENT_LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
echo "Current leader: $CURRENT_LEADER"

# Step 2: Simulate leader failure (10:00:05)
ssh $CURRENT_LEADER "sudo systemctl stop patroni"

# Step 3: Monitor automatic failover (10:00:10)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'

# Expected: New leader elected in 15-30 seconds
# + Cluster: postgres-cluster -------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Replica | STOPPED |    |           |  ← Old leader
# | node2  | 10.0.1.12  | Leader  | running |  6 |           |  ← NEW leader
# | node3  | 10.0.1.13  | Replica | running |  6 |         0 |
# +--------+------------+---------+---------+----+-----------+

# Step 4: Test write operations (10:00:45)
TS=$(date +%s)   # capture once so all three statements reference the same table
psql -h haproxy-vip -U postgres <<EOF
CREATE TABLE drill_test_${TS} (id serial primary key, data text);
INSERT INTO drill_test_${TS} (data) VALUES ('DR drill success');
SELECT * FROM drill_test_${TS};
EOF

# Step 5: Verify application connectivity (10:01:00)
# Run application health checks
curl -f http://app-server/health || echo "Application DOWN"

# Step 6: Restore old leader as replica (10:03:00)
ssh $CURRENT_LEADER "sudo systemctl start patroni"

# Step 7: Wait for reintegration (10:03:30)
patronictl -c /etc/patroni/patroni.yml list
# node1 should rejoin as replica

# Step 8: Validate replication (10:04:00)
psql -h $CURRENT_LEADER -U postgres -c "SELECT pg_is_in_recovery();"
# Should return 't' (true = replica)

4.2. Expected results

  • Timeline:
    • 10:00:05: Leader failure injected
    • 10:00:20: Failure detected (TTL expired)
    • 10:00:35: New leader elected
    • 10:00:45: Write operations succeed
    • 10:01:00: Application fully functional
    • 10:04:00: Old leader rejoins as replica
  • RTO: 30 seconds (leader election time)
  • RPO: 0 bytes (with synchronous replication)
  • Impact:
    • 30 seconds of write unavailability
    • Read operations continue on replicas
    • ~10-20 failed write requests, depending on traffic (the write-probe sketch below shows one way to measure this)
    • Monitoring alerts triggered
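
To measure the write gap directly, run a simple write probe from a separate terminal for the duration of the drill. A minimal sketch, assuming haproxy-vip routes writes to the current leader and that a throwaway drill_probe table is acceptable:

BASH
# One INSERT per second; the gap between the last success before failover and
# the first success after it approximates the observed write RTO.
psql -h haproxy-vip -U postgres -c "CREATE TABLE IF NOT EXISTS drill_probe (ts timestamptz DEFAULT now());"
while true; do
  if psql -h haproxy-vip -U postgres -c "INSERT INTO drill_probe DEFAULT VALUES;" >/dev/null 2>&1; then
    echo "$(date +%T) write OK"
  else
    echo "$(date +%T) write FAILED"
  fi
  sleep 1
done | tee /tmp/write_probe.log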

5. Scenario 3: Complete Datacenter Failure

5.1. Drill procedure

BASH
# Setup: Assume 2 datacenters
# DC1: node1 (leader), node2 (replica)
# DC2: node3 (replica)

# Step 1: Simulate DC1 total failure (10:00:00)
for node in node1 node2; do
  ssh $node "sudo systemctl stop patroni"
  ssh $node "sudo systemctl stop etcd"  # Simulate network partition
done

# Step 2: Monitor DC2 status (10:00:15)
ssh node3 "patronictl -c /etc/patroni/patroni.yml list"
# Expected: No leader (quorum lost)
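
# Optional: estimate potential data loss before promoting node3.
# pg_last_xact_replay_timestamp() reports the commit time of the last transaction
# replayed on the surviving replica; anything committed on DC1 after that moment
# is potentially lost, which bounds the real RPO for this failover.
ssh node3 "psql -U postgres -At -c 'SELECT pg_last_xact_replay_timestamp();'"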

# Step 3: Manual intervention - promote DC2 replica (10:02:00)
# First, verify DC1 is truly down (not network glitch)
ping -c 3 node1 && echo "WARNING: DC1 still reachable!"

# Remove DC1 members from the etcd cluster so node3 can form a working DCS again.
# Note: with 2 of 3 etcd members down, quorum is lost and membership changes will
# be rejected; you may first need to restart node3's etcd with --force-new-cluster
# (or restore from an etcd snapshot) before these commands succeed.
ssh node3 "etcdctl member list"
ssh node3 "etcdctl member remove <node1_member_id>"
ssh node3 "etcdctl member remove <node2_member_id>"

# Step 4: Promote node3 to leader (10:03:00)
ssh node3 "patronictl -c /etc/patroni/patroni.yml failover postgres-cluster --candidate node3 --force"

# Step 5: Update application connection strings (10:04:00)
# Point to DC2: node3 (now leader)
# This may require DNS update or load balancer reconfiguration

# Step 6: Verify write operations (10:05:00)
psql -h node3 -U postgres <<EOF
CREATE TABLE dc_failover_test (id serial primary key, recovered_at timestamp default now());
INSERT INTO dc_failover_test VALUES (DEFAULT);
SELECT * FROM dc_failover_test;
EOF

# Step 7: When DC1 recovers, reintegrate (later, during maintenance)
# Bring up DC1 nodes as replicas of DC2
ssh node1 "sudo systemctl start etcd"
ssh node1 "sudo systemctl start patroni"
# Wait for replication catchup
patronictl -c /etc/patroni/patroni.yml list

5.2. Expected results

  • Timeline:
    • 10:00:00: DC1 failure
    • 10:02:00: Decision to failover to DC2
    • 10:03:00: Manual promotion of DC2 leader
    • 10:04:00: Application reconfiguration
    • 10:05:00: Service fully restored
  • RTO: 5 minutes (includes decision time)
  • RPO: 0-5 minutes (depends on replication lag at failure time)
  • Impact:
    • 5 minutes of complete outage
    • Possible data loss if async replication
    • Manual intervention required
    • Requires application update

6. Scenario 4: Point-in-Time Recovery (Data Corruption)

6.1. Drill procedure

BASH
# Setup: Simulate accidental table drop at 10:30:00
psql -h leader -U postgres <<EOF
CREATE TABLE important_data (id serial, data text);
INSERT INTO important_data (data) SELECT 'Record ' || generate_series(1, 1000);
SELECT count(*) FROM important_data;  -- 1000 rows
EOF

# Record current time before corruption
BEFORE_CORRUPTION=$(date -u +"%Y-%m-%d %H:%M:%S+00")   # explicit UTC offset so recovery_target_time is unambiguous
echo "Before corruption: $BEFORE_CORRUPTION"

# Simulate data corruption at 10:30:00
psql -h leader -U postgres -c "DROP TABLE important_data;"
echo "Table dropped (simulating accident) at $(date)"

# Step 1: Detect data loss (10:30:30)
psql -h leader -U postgres -c "SELECT * FROM important_data;"
# ERROR: relation "important_data" does not exist

# Step 2: Identify PITR target time (10:31:00)
PITR_TARGET=$BEFORE_CORRUPTION
echo "Will recover to: $PITR_TARGET"

# Step 3: Setup recovery environment (10:32:00)
# Create separate recovery instance (don't disturb production!)
sudo mkdir -p /var/lib/postgresql/18/pitr_recovery
sudo chown postgres:postgres /var/lib/postgresql/18/pitr_recovery

# Step 4: Restore the most recent base backup taken BEFORE the corruption (10:33:00)
# A backup taken now (after the DROP) cannot be rewound; PITR needs a pre-corruption
# base backup plus the archived WAL leading up to the target time.
# (The repository path below is illustrative - use your own backup location.)
sudo -u postgres cp -a /var/backups/postgres/latest/. /var/lib/postgresql/18/pitr_recovery/

# Step 5: Configure recovery (10:35:00)
# recovery.signal only needs to exist; its contents are ignored
sudo -u postgres touch /var/lib/postgresql/18/pitr_recovery/recovery.signal

sudo -u postgres tee /var/lib/postgresql/18/pitr_recovery/postgresql.auto.conf <<EOF
port = 5433                     # run beside production, which keeps 5432
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '$PITR_TARGET'
recovery_target_action = 'promote'
EOF

# Step 6: Start recovery instance (10:36:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
  -D /var/lib/postgresql/18/pitr_recovery \
  -l /tmp/pitr_recovery.log \
  start

# Step 7: Wait for recovery completion (10:40:00)
tail -f /tmp/pitr_recovery.log
# Look for: "database system is ready to accept connections"

# Step 8: Verify recovered data (10:41:00)
psql -h localhost -p 5433 -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows

# Step 9: Export recovered data (10:42:00)
pg_dump -h localhost -p 5433 -U postgres -t important_data > recovered_data.sql

# Step 10: Import to production (10:43:00)
psql -h leader -U postgres < recovered_data.sql

# Step 11: Verify production (10:44:00)
psql -h leader -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows ✅

# Step 12: Cleanup recovery instance (10:45:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
  -D /var/lib/postgresql/18/pitr_recovery stop
sudo rm -rf /var/lib/postgresql/18/pitr_recovery

6.2. Expected results

  • Timeline:
    • 10:30:00: Data corruption detected
    • 10:31:00: PITR target time identified
    • 10:33:00: Base backup restoration started
    • 10:36:00: PITR recovery initiated
    • 10:41:00: Data recovery complete
    • 10:44:00: Data restored to production
    • 10:45:00: Cleanup complete
  • RTO: 15 minutes (data restoration)
  • RPO: 0 (recovered to exact point before corruption)
  • Impact:
    • Temporary read-only mode during restoration
    • Requires manual data export/import
    • No service downtime (recovery on separate instance)

7. DR Drill Metrics and Reporting

7.1. Drill scorecard

TEXT
Scenario: Leader Failover Drill
Date: 2024-11-25
Duration: 30 minutes
Participants: 5 team members

Metrics:
☑ RTO Target: 2 minutes
  RTO Actual: 35 seconds ✅ (Better than target)

☑ RPO Target: 0 bytes
  RPO Actual: 0 bytes ✅

☑ Detection Time: 15 seconds ✅
☑ Failover Time: 20 seconds ✅
☑ Validation Time: 5 minutes ⚠️ (Could be faster)

Issues Found:
1. Monitoring alert delayed by 10 seconds (configuration issue)
2. Runbook step 3 outdated (missing new command)
3. Team member unfamiliar with patronictl commands

Action Items:
☐ Fix monitoring alert configuration
☐ Update runbook documentation
☐ Schedule training session for new commands
☐ Re-test in 2 weeks

7.2. Post-drill analysis

MARKDOWN
# DR Drill Post-Mortem: Leader Failover

## Summary
Successfully executed the planned leader failover drill. RTO and RPO both came in well under target. Identified 3 areas for improvement.

## Timeline
| Time | Event | Owner |
|------|-------|-------|
| 10:00:00 | Drill initiated | DBA |
| 10:00:15 | Leader stopped | DBA |
| 10:00:30 | Failure detected | Monitoring |
| 10:00:35 | New leader elected | Patroni |
| 10:00:50 | Write operations tested | DBA |
| 10:01:00 | Application health check | App Owner |
| 10:05:00 | Old leader rejoined | DBA |

## What Went Well
✅ Automatic failover worked flawlessly
✅ Zero data loss confirmed
✅ Team communication effective
✅ Documentation mostly accurate

## What Could Be Improved
⚠️ Monitoring alert configuration needs tuning
⚠️ Runbook has outdated commands
⚠️ One team member needs additional training

## Action Items
1. [ ] Update Prometheus alert rules (@sre-team, due: 2024-11-30)
2. [ ] Revise DR runbook (@dba-team, due: 2024-11-28)
3. [ ] Conduct patronictl training (@dba-lead, due: 2024-12-05)
4. [ ] Schedule next drill (@incident-commander, due: 2025-01-15)

## Recommendations
- Continue quarterly DR drills
- Rotate incident commander role
- Add chaos engineering (random failures)

8. Chaos Engineering for HA

8.1. Chaos Monkey for PostgreSQL

BASH
#!/bin/bash
# chaos-monkey.sh - Randomly kill PostgreSQL nodes

NODES=("node1" "node2" "node3")
INTERVAL=3600  # 1 hour between failures

while true; do
  # Random node
  NODE=${NODES[$RANDOM % ${#NODES[@]}]}
  
  # Random failure type
  FAILURE_TYPE=$((RANDOM % 3))
  
  case $FAILURE_TYPE in
    0)
      echo "$(date): Stopping Patroni on $NODE"
      ssh $NODE "sudo systemctl stop patroni"
      ;;
    1)
      echo "$(date): Simulating network partition on $NODE"
      ssh $NODE "sudo iptables -A INPUT -p tcp --dport 5432 -j DROP"
      sleep 300
      ssh $NODE "sudo iptables -D INPUT -p tcp --dport 5432 -j DROP"
      ;;
    2)
      echo "$(date): Stopping etcd on $NODE"
      ssh $NODE "sudo systemctl stop etcd"
      ;;
  esac
  
  # Wait for recovery
  sleep 300
  
  # Restore if not auto-recovered
  ssh $NODE "sudo systemctl start patroni"
  ssh $NODE "sudo systemctl start etcd"
  
  # Wait before next chaos
  sleep $INTERVAL
done

8.2. Automated DR testing

YAML
# automated_dr_test.yml
---
- name: Automated DR Drill
  hosts: postgres_cluster
  vars:
    drill_start_time: "{{ ansible_date_time.iso8601 }}"
  tasks:
    - name: Record baseline metrics
      shell: patronictl -c /etc/patroni/patroni.yml list
      register: baseline
      
    - name: Inject failure on leader
      shell: |
        LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
        ssh $LEADER "sudo systemctl stop patroni"
      delegate_to: localhost
      run_once: true   # inject the failure once, not once per cluster host
      
    - name: Wait for failover
      wait_for:
        timeout: 60
        
    - name: Verify new leader elected
      shell: patronictl -c /etc/patroni/patroni.yml list | grep Leader | wc -l
      register: leader_count
      failed_when: leader_count.stdout != "1"
      
    - name: Measure RTO
      shell: |
        echo "RTO: $(( $(date +%s) - $(date -d '{{ drill_start_time }}' +%s) )) seconds"
      register: rto_result
      
    - name: Generate drill report
      template:
        src: drill_report.j2
        dest: /tmp/drill_report_{{ drill_start_time }}.txt
      
    - name: Send report to Slack
      uri:
        url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        method: POST
        body_format: json
        body:
          text: "DR Drill completed. RTO: {{ rto_result.stdout }}"

9. Best Practices

✅ DO

  1. Schedule regular drills: Quarterly minimum.
  2. Test all scenarios: Not just easy ones.
  3. Rotate roles: Everyone should be IC once.
  4. Document everything: Timestamped notes.
  5. Measure RTO/RPO: Track improvements.
  6. Post-mortem every drill: Learn and improve.
  7. Update runbooks: Keep documentation current.
  8. Involve all teams: Cross-functional practice.
  9. Test backups: Restore verification essential.
  10. Automate where possible: Reduce human error.

❌ DON'T

  1. Don't skip drills: "Too busy" is not an excuse.
  2. Don't test only easy scenarios: Hard ones matter most.
  3. Don't ignore action items: Follow up on improvements.
  4. Don't reuse same scenario: Vary the drills.
  5. Don't rely on one person: Bus factor = 1 is dangerous.
  6. Don't rush: Proper testing takes time.
  7. Don't skip post-mortems: Learning opportunity.

10. Lab Exercises

Lab 1: Execute failover drill

Tasks:

  1. Plan and schedule drill.
  2. Assign team roles.
  3. Execute leader failover.
  4. Document timeline.
  5. Calculate RTO/RPO.
  6. Write post-mortem.

Lab 2: PITR recovery drill

Tasks:

  1. Create test data.
  2. Simulate data corruption.
  3. Identify PITR target time.
  4. Restore to separate instance.
  5. Verify recovered data.
  6. Document procedure.

Lab 3: Multi-DC failover

Tasks:

  1. Setup 2-DC cluster.
  2. Simulate DC1 total failure.
  3. Manually promote DC2.
  4. Update application config.
  5. Measure downtime.
  6. Document lessons learned.

Lab 4: Chaos engineering

Tasks:

  1. Implement chaos monkey script.
  2. Run for 24 hours.
  3. Monitor cluster behavior.
  4. Document failures and recoveries.
  5. Identify weak points.
  6. Improve HA configuration.

11. Summary

DR Drill Frequency

TEXT
Scenario Frequency:
- Single node failure: Monthly (automated)
- Leader failover: Quarterly
- DC failure: Semi-annually
- PITR recovery: Quarterly
- Full DR: Annually
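
The monthly automated single-node drill can be driven from cron. A minimal sketch, assuming the Ansible playbook from section 8.2 is saved as /opt/dr/automated_dr_test.yml on a control host with access to the cluster:

BASH
# crontab entry on the control host: run the automated drill at 03:00 on the
# first day of each month and append the output to a log for review.
0 3 1 * * ansible-playbook /opt/dr/automated_dr_test.yml >> /var/log/dr_drill.log 2>&1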

Success Criteria

A successful DR drill has:

  • ✅ Met RTO/RPO targets.
  • ✅ Zero data loss (or within RPO).
  • ✅ All team members participated.
  • ✅ Documentation updated.
  • ✅ Action items identified.
  • ✅ Post-mortem completed.
  • ✅ Next drill scheduled.

Key Metrics to Track

  • Detection time: How fast we notice.
  • Response time: How fast we act.
  • Recovery time: How fast we restore.
  • Data loss: How much data we lost.
  • Team coordination: How well we work together.

Next Steps

Lesson 28 will cover HA Architecture Design:

  • Requirements gathering
  • Architecture design documents
  • Capacity planning
  • Cost estimation
  • Design review process
