Lesson 27: Disaster Recovery Drills

After this lesson, you will be able to:

  • Plan comprehensive disaster recovery procedures.
  • Execute DR drills systematically.
  • Measure and optimize RTO/RPO.
  • Conduct incident response exercises.
  • Document and improve DR processes.

1. DR Planning Foundation

1.1. Key DR metrics

TEXT
RTO (Recovery Time Objective):
- Maximum acceptable downtime
- Example: 15 minutes

RPO (Recovery Point Objective):
- Maximum acceptable data loss
- Example: 5 minutes

RTA (Recovery Time Actual):
- Actual time taken in drill
- Goal: RTA < RTO

RPD (Recovery Point Detected):
- Actual data loss in drill
- Goal: RPD < RPO
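
During a drill, RTA and the achieved recovery point are easiest to derive from timestamps recorded as the drill unfolds. A minimal sketch of the arithmetic, assuming the failure-injection and service-restored times are noted by the observer (the values below are illustrative):

BASH
# Compare actual recovery time (RTA) against the RTO target.
FAILURE_INJECTED="2024-11-25 10:00:05"
SERVICE_RESTORED="2024-11-25 10:00:40"
RTO_TARGET_SECONDS=120

RTA_SECONDS=$(( $(date -d "$SERVICE_RESTORED" +%s) - $(date -d "$FAILURE_INJECTED" +%s) ))
echo "RTA: ${RTA_SECONDS}s (target: ${RTO_TARGET_SECONDS}s)"
[ "$RTA_SECONDS" -le "$RTO_TARGET_SECONDS" ] && echo "PASS: RTA within RTO" || echo "FAIL: RTA exceeds RTO"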

1.2. DR scenarios to test

  1. Single node failure
    • Impact: Low (automatic failover)
    • RTO: < 1 minute
    • RPO: 0 (synchronous replication)
  2. Leader node failure
    • Impact: Medium (brief disruption)
    • RTO: < 2 minutes
    • RPO: 0
  3. Complete datacenter failure
    • Impact: High (manual intervention)
    • RTO: < 15 minutes
    • RPO: < 5 minutes
  4. Data corruption
    • Impact: High (PITR required)
    • RTO: 1-4 hours
    • RPO: Last valid backup
  5. Human error (DROP TABLE)
    • Impact: Medium-High
    • RTO: 30 minutes - 2 hours
    • RPO: Point-in-time before error

2. DR Drill Preparation

2.1. Pre-drill checklist

TEXT
☐ Review DR documentation
☐ Verify all backups are current
☐ Test backup restoration (dry run; see the sketch after this checklist)
☐ Confirm monitoring/alerting works
☐ Notify stakeholders of drill
☐ Schedule during low-traffic period
☐ Prepare rollback procedure
☐ Assemble response team
☐ Set up communication channels (Slack, Zoom)
☐ Document drill objectives
☐ Prepare stopwatch for timing
☐ Set up screen recording (for post-mortem)
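
For the backup dry-run item above, one lightweight check is to validate the most recent base backup against its manifest with pg_verifybackup (available since PostgreSQL 13). A minimal sketch, assuming base backups are stored one directory per backup under /var/backups/postgres (adjust to your repository layout):

BASH
# Verify the newest base backup against its backup_manifest (file list, sizes, checksums)
LATEST_BACKUP=$(ls -td /var/backups/postgres/*/ | head -1)
echo "Verifying: $LATEST_BACKUP"
sudo -u postgres pg_verifybackup "$LATEST_BACKUP" \
  && echo "Backup verification OK" \
  || echo "Backup verification FAILED - fix before running the drill"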

2.2. DR team roles

  • Incident Commander: Owns overall response, makes final decisions, coordinates teams.
  • Database Admin: Executes PostgreSQL recovery, manages Patroni cluster, validates data integrity.
  • System Admin: Manages infrastructure, network connectivity, firewall rules.
  • Application Owner: Tests application functionality, validates business logic, user acceptance testing.
  • Communications Lead: Updates stakeholders, documents the timeline, facilitates the post-mortem.
  • Observer (optional): Takes notes, times each step, identifies improvements.

3. Scenario 1: Single Replica Failure

3.1. Drill procedure

BASH
# Step 1: Simulate replica failure (10:00:00)
ssh node2 "sudo systemctl stop patroni"

# Step 2: Monitor automatic recovery (10:00:15)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'

# Expected output after 30 seconds:
# + Cluster: postgres-cluster -------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Leader  | running |  5 |           |
# | node2  | 10.0.1.12  | Replica | STOPPED |    |           |  ← Down
# | node3  | 10.0.1.13  | Replica | running |  5 |         0 |
# +--------+------------+---------+---------+----+-----------+

# Step 3: Verify read traffic routes to remaining replica (10:01:00)
psql -h haproxy-vip -U postgres -c "SELECT inet_server_addr();"
# Should return 10.0.1.11 (node1) or 10.0.1.13 (node3), NOT 10.0.1.12 (node2)

# Step 4: Restore failed replica (10:05:00)
ssh node2 "sudo systemctl start patroni"

# Step 5: Wait for replication catchup (10:05:30)
patronictl -c /etc/patroni/patroni.yml list
# node2 should show "streaming" state

# Step 6: Verify replication lag is minimal (10:06:00)
psql -h node2 -U postgres -c "
  SELECT pg_wal_lsn_diff(
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn()
  ) AS lag_bytes;
"
# lag_bytes should be < 1MB
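
For extra confidence in Step 3, you can sample the read endpoint repeatedly and list which backends actually served the connections. A minimal sketch, reusing the haproxy-vip endpoint from above:

BASH
# Open 20 connections through HAProxy and count answers per backend address;
# 10.0.1.12 (node2) should not appear while it is down.
for i in $(seq 1 20); do
  psql -h haproxy-vip -U postgres -At -c "SELECT inet_server_addr();"
done | sort | uniq -c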

3.2. Expected results

  • Timeline:
    • 10:00:00: Failure injected
    • 10:00:30: Failure detected by Patroni
    • 10:01:00: Traffic automatically rerouted
    • 10:05:00: Recovery initiated
    • 10:06:00: Full recovery complete
  • RTO: 1 minute (time until traffic rerouted)
  • RPO: 0 bytes (no data loss)
  • Impact:
    • No application downtime
    • Slightly increased load on remaining replica
    • Monitoring alerts triggered (expected)

4. Scenario 2: Leader Failover

4.1. Drill procedure

BASH
# Step 1: Record current leader (10:00:00)
CURRENT_LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
echo "Current leader: $CURRENT_LEADER"

# Step 2: Simulate leader failure (10:00:05)
ssh $CURRENT_LEADER "sudo systemctl stop patroni"

# Step 3: Monitor automatic failover (10:00:10)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'

# Expected: New leader elected in 15-30 seconds
# + Cluster: postgres-cluster -------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Replica | STOPPED |    |           |  ← Old leader
# | node2  | 10.0.1.12  | Leader  | running |  6 |           |  ← NEW leader
# | node3  | 10.0.1.13  | Replica | running |  6 |         0 |
# +--------+------------+---------+---------+----+-----------+

# Step 4: Test write operations (10:00:45)
TS=$(date +%s)   # capture once so all three statements reference the same table
psql -h haproxy-vip -U postgres <<EOF
CREATE TABLE drill_test_${TS} (id serial primary key, data text);
INSERT INTO drill_test_${TS} (data) VALUES ('DR drill success');
SELECT * FROM drill_test_${TS};
EOF

# Step 5: Verify application connectivity (10:01:00)
# Run application health checks
curl -f http://app-server/health || echo "Application DOWN"

# Step 6: Restore old leader as replica (10:03:00)
ssh $CURRENT_LEADER "sudo systemctl start patroni"

# Step 7: Wait for reintegration (10:03:30)
patronictl -c /etc/patroni/patroni.yml list
# node1 should rejoin as replica

# Step 8: Validate replication (10:04:00)
psql -h $CURRENT_LEADER -U postgres -c "SELECT pg_is_in_recovery();"
# Should return 't' (true = replica)

4.2. Expected results

  • Timeline:
    • 10:00:05: Leader failure injected
    • 10:00:20: Failure detected (TTL expired)
    • 10:00:35: New leader elected
    • 10:00:45: Write operations succeed
    • 10:01:00: Application fully functional
    • 10:04:00: Old leader rejoins as replica
  • RTO: 30 seconds (leader election time)
  • RPO: 0 bytes (with synchronous replication)
  • Impact:
    • 30 seconds of write unavailability
    • Read operations continue on replicas
    • ~10-20 failed write requests, depending on traffic (the write-probe sketch below shows one way to measure this)
    • Monitoring alerts triggered
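
To measure the write gap directly, run a simple write probe from a separate terminal for the duration of the drill. A minimal sketch, assuming haproxy-vip routes writes to the current leader and that a throwaway drill_probe table is acceptable:

BASH
# One INSERT per second; the gap between the last success before failover and
# the first success after it approximates the observed write RTO.
psql -h haproxy-vip -U postgres -c "CREATE TABLE IF NOT EXISTS drill_probe (ts timestamptz DEFAULT now());"
while true; do
  if psql -h haproxy-vip -U postgres -c "INSERT INTO drill_probe DEFAULT VALUES;" >/dev/null 2>&1; then
    echo "$(date +%T) write OK"
  else
    echo "$(date +%T) write FAILED"
  fi
  sleep 1
done | tee /tmp/write_probe.log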

5. Scenario 3: Complete Datacenter Failure

5.1. Drill procedure

BASH
# Setup: Assume 2 datacenters
# DC1: node1 (leader), node2 (replica)
# DC2: node3 (replica)

# Step 1: Simulate DC1 total failure (10:00:00)
for node in node1 node2; do
  ssh $node "sudo systemctl stop patroni"
  ssh $node "sudo systemctl stop etcd"  # Simulate network partition
done

# Step 2: Monitor DC2 status (10:00:15)
ssh node3 "patronictl -c /etc/patroni/patroni.yml list"
# Expected: No leader (quorum lost)
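
# Optional: estimate potential data loss before promoting node3.
# pg_last_xact_replay_timestamp() reports the commit time of the last transaction
# replayed on the surviving replica; anything committed on DC1 after that moment
# is potentially lost, which bounds the real RPO for this failover.
ssh node3 "psql -U postgres -At -c 'SELECT pg_last_xact_replay_timestamp();'"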

# Step 3: Manual intervention - promote DC2 replica (10:02:00)
# First, verify DC1 is truly down (not network glitch)
ping -c 3 node1 && echo "WARNING: DC1 still reachable!"

# Remove DC1 members from the etcd cluster so node3 can form a working DCS again.
# Note: with 2 of 3 etcd members down, quorum is lost and membership changes will
# be rejected; you may first need to restart node3's etcd with --force-new-cluster
# (or restore from an etcd snapshot) before these commands succeed.
ssh node3 "etcdctl member list"
ssh node3 "etcdctl member remove <node1_member_id>"
ssh node3 "etcdctl member remove <node2_member_id>"

# Step 4: Promote node3 to leader (10:03:00)
ssh node3 "patronictl -c /etc/patroni/patroni.yml failover postgres-cluster --candidate node3 --force"

# Step 5: Update application connection strings (10:04:00)
# Point to DC2: node3 (now leader)
# This may require DNS update or load balancer reconfiguration

# Step 6: Verify write operations (10:05:00)
psql -h node3 -U postgres <<EOF
CREATE TABLE dc_failover_test (id serial primary key, recovered_at timestamp default now());
INSERT INTO dc_failover_test VALUES (DEFAULT);
SELECT * FROM dc_failover_test;
EOF

# Step 7: When DC1 recovers, reintegrate (later, during maintenance)
# Bring up DC1 nodes as replicas of DC2
ssh node1 "sudo systemctl start etcd"
ssh node1 "sudo systemctl start patroni"
# Wait for replication catchup
patronictl -c /etc/patroni/patroni.yml list

5.2. Expected results

  • Timeline:
    • 10:00:00: DC1 failure
    • 10:02:00: Decision to failover to DC2
    • 10:03:00: Manual promotion of DC2 leader
    • 10:04:00: Application reconfiguration
    • 10:05:00: Service fully restored
  • RTO: 5 minutes (includes decision time)
  • RPO: 0-5 minutes (depends on replication lag at failure time)
  • Impact:
    • 5 minutes of complete outage
    • Possible data loss if async replication
    • Manual intervention required
    • Requires application update

6. Scenario 4: Point-in-Time Recovery (Data Corruption)

6.1. Drill procedure

BASH
# Setup: Simulate accidental table drop at 10:30:00
psql -h leader -U postgres <<EOF
CREATE TABLE important_data (id serial, data text);
INSERT INTO important_data (data) SELECT 'Record ' || generate_series(1, 1000);
SELECT count(*) FROM important_data;  -- 1000 rows
EOF

# Record current time before corruption
BEFORE_CORRUPTION=$(date -u +"%Y-%m-%d %H:%M:%S+00")   # explicit UTC offset so recovery_target_time is unambiguous
echo "Before corruption: $BEFORE_CORRUPTION"

# Simulate data corruption at 10:30:00
psql -h leader -U postgres -c "DROP TABLE important_data;"
echo "Table dropped (simulating accident) at $(date)"

# Step 1: Detect data loss (10:30:30)
psql -h leader -U postgres -c "SELECT * FROM important_data;"
# ERROR: relation "important_data" does not exist

# Step 2: Identify PITR target time (10:31:00)
PITR_TARGET=$BEFORE_CORRUPTION
echo "Will recover to: $PITR_TARGET"

# Step 3: Setup recovery environment (10:32:00)
# Create separate recovery instance (don't disturb production!)
sudo mkdir -p /var/lib/postgresql/18/pitr_recovery
sudo chown postgres:postgres /var/lib/postgresql/18/pitr_recovery

# Step 4: Restore the most recent base backup taken BEFORE the corruption (10:33:00)
# A backup taken now (after the DROP) cannot be rewound; PITR needs a pre-corruption
# base backup plus the archived WAL leading up to the target time.
# (The repository path below is illustrative - use your own backup location.)
sudo -u postgres cp -a /var/backups/postgres/latest/. /var/lib/postgresql/18/pitr_recovery/

# Step 5: Configure recovery (10:35:00)
# recovery.signal only needs to exist; its contents are ignored
sudo -u postgres touch /var/lib/postgresql/18/pitr_recovery/recovery.signal

sudo -u postgres tee /var/lib/postgresql/18/pitr_recovery/postgresql.auto.conf <<EOF
port = 5433                     # run beside production, which keeps 5432
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '$PITR_TARGET'
recovery_target_action = 'promote'
EOF

# Step 6: Start recovery instance (10:36:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
  -D /var/lib/postgresql/18/pitr_recovery \
  -l /tmp/pitr_recovery.log \
  start

# Step 7: Wait for recovery completion (10:40:00)
tail -f /tmp/pitr_recovery.log
# Look for: "database system is ready to accept connections"

# Step 8: Verify recovered data (10:41:00)
psql -h localhost -p 5433 -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows

# Step 9: Export recovered data (10:42:00)
pg_dump -h localhost -p 5433 -U postgres -t important_data > recovered_data.sql

# Step 10: Import to production (10:43:00)
psql -h leader -U postgres < recovered_data.sql

# Step 11: Verify production (10:44:00)
psql -h leader -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows ✅

# Step 12: Cleanup recovery instance (10:45:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
  -D /var/lib/postgresql/18/pitr_recovery stop
sudo rm -rf /var/lib/postgresql/18/pitr_recovery

6.2. Expected results

  • Timeline:
    • 10:30:00: Data corruption detected
    • 10:31:00: PITR target time identified
    • 10:33:00: Base backup restoration started
    • 10:36:00: PITR recovery initiated
    • 10:41:00: Data recovery complete
    • 10:44:00: Data restored to production
    • 10:45:00: Cleanup complete
  • RTO: 15 minutes (data restoration)
  • RPO: 0 (recovered to exact point before corruption)
  • Impact:
    • Temporary read-only mode during restoration
    • Requires manual data export/import
    • No service downtime (recovery on separate instance)

7. DR Drill Metrics and Reporting

7.1. Drill scorecard

TEXT
Scenario: Leader Failover Drill
Date: 2024-11-25
Duration: 30 minutes
Participants: 5 team members

Metrics:
☑ RTO Target: 2 minutes
  RTO Actual: 35 seconds ✅ (Better than target)

☑ RPO Target: 0 bytes
  RPO Actual: 0 bytes ✅

☑ Detection Time: 15 seconds ✅
☑ Failover Time: 20 seconds ✅
☑ Validation Time: 5 minutes ⚠️ (Could be faster)

Issues Found:
1. Monitoring alert delayed by 10 seconds (configuration issue)
2. Runbook step 3 outdated (missing new command)
3. Team member unfamiliar with patronictl commands

Action Items:
☐ Fix monitoring alert configuration
☐ Update runbook documentation
☐ Schedule training session for new commands
☐ Re-test in 2 weeks

7.2. Post-drill analysis

MARKDOWN
# DR Drill Post-Mortem: Leader Failover

## Summary
Successfully executed the planned leader failover drill. RTO and RPO both came in well under target. Identified 3 areas for improvement.

## Timeline
| Time | Event | Owner |
|------|-------|-------|
| 10:00:00 | Drill initiated | DBA |
| 10:00:15 | Leader stopped | DBA |
| 10:00:30 | Failure detected | Monitoring |
| 10:00:35 | New leader elected | Patroni |
| 10:00:50 | Write operations tested | DBA |
| 10:01:00 | Application health check | App Owner |
| 10:05:00 | Old leader rejoined | DBA |

## What Went Well
✅ Automatic failover worked flawlessly
✅ Zero data loss confirmed
✅ Team communication effective
✅ Documentation mostly accurate

## What Could Be Improved
⚠️ Monitoring alert configuration needs tuning
⚠️ Runbook has outdated commands
⚠️ One team member needs additional training

## Action Items
1. [ ] Update Prometheus alert rules (@sre-team, due: 2024-11-30)
2. [ ] Revise DR runbook (@dba-team, due: 2024-11-28)
3. [ ] Conduct patronictl training (@dba-lead, due: 2024-12-05)
4. [ ] Schedule next drill (@incident-commander, due: 2025-01-15)

## Recommendations
- Continue quarterly DR drills
- Rotate incident commander role
- Add chaos engineering (random failures)

8. Chaos Engineering for HA

8.1. Chaos Monkey for PostgreSQL

BASH
#!/bin/bash
# chaos-monkey.sh - Randomly kill PostgreSQL nodes

NODES=("node1" "node2" "node3")
INTERVAL=3600  # 1 hour between failures

while true; do
  # Random node
  NODE=${NODES[$RANDOM % ${#NODES[@]}]}
  
  # Random failure type
  FAILURE_TYPE=$((RANDOM % 3))
  
  case $FAILURE_TYPE in
    0)
      echo "$(date): Stopping Patroni on $NODE"
      ssh $NODE "sudo systemctl stop patroni"
      ;;
    1)
      echo "$(date): Simulating network partition on $NODE"
      ssh $NODE "sudo iptables -A INPUT -p tcp --dport 5432 -j DROP"
      sleep 300
      ssh $NODE "sudo iptables -D INPUT -p tcp --dport 5432 -j DROP"
      ;;
    2)
      echo "$(date): Stopping etcd on $NODE"
      ssh $NODE "sudo systemctl stop etcd"
      ;;
  esac
  
  # Wait for recovery
  sleep 300
  
  # Restore if not auto-recovered
  ssh $NODE "sudo systemctl start patroni"
  ssh $NODE "sudo systemctl start etcd"
  
  # Wait before next chaos
  sleep $INTERVAL
done

8.2. Automated DR testing

YAML
# automated_dr_test.yml
---
- name: Automated DR Drill
  hosts: postgres_cluster
  vars:
    drill_start_time: "{{ ansible_date_time.iso8601 }}"
  tasks:
    - name: Record baseline metrics
      shell: patronictl -c /etc/patroni/patroni.yml list
      register: baseline
      
    - name: Inject failure on leader
      shell: |
        LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
        ssh $LEADER "sudo systemctl stop patroni"
      delegate_to: localhost
      run_once: true   # inject the failure once, not once per cluster host
      
    - name: Wait for failover
      wait_for:
        timeout: 60
        
    - name: Verify new leader elected
      shell: patronictl -c /etc/patroni/patroni.yml list | grep Leader | wc -l
      register: leader_count
      failed_when: leader_count.stdout != "1"
      
    - name: Measure RTO
      shell: |
        echo "RTO: $(( $(date +%s) - $(date -d '{{ drill_start_time }}' +%s) )) seconds"
      register: rto_result
      
    - name: Generate drill report
      template:
        src: drill_report.j2
        dest: /tmp/drill_report_{{ drill_start_time }}.txt
      
    - name: Send report to Slack
      uri:
        url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        method: POST
        body_format: json
        body:
          text: "DR Drill completed. RTO: {{ rto_result.stdout }}"

9. Best Practices

✅ DO

  1. Schedule regular drills: Quarterly minimum.
  2. Test all scenarios: Not just easy ones.
  3. Rotate roles: Everyone should be IC once.
  4. Document everything: Timestamped notes.
  5. Measure RTO/RPO: Track improvements.
  6. Post-mortem every drill: Learn and improve.
  7. Update runbooks: Keep documentation current.
  8. Involve all teams: Cross-functional practice.
  9. Test backups: Restore verification essential.
  10. Automate where possible: Reduce human error.

❌ DON'T

  1. Don't skip drills: "Too busy" is not an excuse.
  2. Don't test only easy scenarios: Hard ones matter most.
  3. Don't ignore action items: Follow up on improvements.
  4. Don't reuse same scenario: Vary the drills.
  5. Don't rely on one person: Bus factor = 1 is dangerous.
  6. Don't rush: Proper testing takes time.
  7. Don't skip post-mortems: Learning opportunity.

10. Lab Exercises

Lab 1: Execute failover drill

Tasks:

  1. Plan and schedule drill.
  2. Assign team roles.
  3. Execute leader failover.
  4. Document timeline.
  5. Calculate RTO/RPO.
  6. Write post-mortem.

Lab 2: PITR recovery drill

Tasks:

  1. Create test data.
  2. Simulate data corruption.
  3. Identify PITR target time.
  4. Restore to separate instance.
  5. Verify recovered data.
  6. Document procedure.

Lab 3: Multi-DC failover

Tasks:

  1. Setup 2-DC cluster.
  2. Simulate DC1 total failure.
  3. Manually promote DC2.
  4. Update application config.
  5. Measure downtime.
  6. Document lessons learned.

Lab 4: Chaos engineering

Tasks:

  1. Implement chaos monkey script.
  2. Run for 24 hours.
  3. Monitor cluster behavior.
  4. Document failures and recoveries.
  5. Identify weak points.
  6. Improve HA configuration.

11. Summary

DR Drill Frequency

TEXT
Scenario Frequency:
- Single node failure: Monthly (automated)
- Leader failover: Quarterly
- DC failure: Semi-annually
- PITR recovery: Quarterly
- Full DR: Annually
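
The monthly automated single-node drill can be driven from cron. A minimal sketch, assuming the Ansible playbook from section 8.2 is saved as /opt/dr/automated_dr_test.yml on a control host with access to the cluster:

BASH
# crontab entry on the control host: run the automated drill at 03:00 on the
# first day of each month and append the output to a log for review.
0 3 1 * * ansible-playbook /opt/dr/automated_dr_test.yml >> /var/log/dr_drill.log 2>&1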

Success Criteria

A successful DR drill has:

  • ✅ Met RTO/RPO targets.
  • ✅ Zero data loss (or within RPO).
  • ✅ All team members participated.
  • ✅ Documentation updated.
  • ✅ Action items identified.
  • ✅ Post-mortem completed.
  • ✅ Next drill scheduled.

Key Metrics to Track

  • Detection time: How fast we notice.
  • Response time: How fast we act.
  • Recovery time: How fast we restore.
  • Data loss: How much data we lost.
  • Team coordination: How well we work together.

Next Steps

Lesson 28 will cover HA Architecture Design:

  • Requirements gathering
  • Architecture design documents
  • Capacity planning
  • Cost estimation
  • Design review process
