Disaster Recovery Drills
After this lesson, you will be able to:
- Plan comprehensive disaster recovery procedures.
- Execute DR drills systematically.
- Measure and optimize RTO/RPO.
- Conduct incident response exercises.
- Document and improve DR processes.
1. DR Planning Foundation
1.1. Key DR metrics
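The two metrics used throughout this lesson are RTO (Recovery Time Objective: how long the service may be unavailable) and RPO (Recovery Point Objective: how much recently committed data may be lost). During a drill, the achieved values are computed from timestamps in the drill log; a minimal sketch (the timestamps are illustrative) looks like this:

```python
from datetime import datetime

def achieved_rto(failure_at: datetime, service_restored_at: datetime) -> float:
    """Seconds from failure injection until the service was usable again."""
    return (service_restored_at - failure_at).total_seconds()

def achieved_rpo(last_recoverable_write_at: datetime, failure_at: datetime) -> float:
    """Seconds of committed writes that could not be recovered (0 = no loss)."""
    return (failure_at - last_recoverable_write_at).total_seconds()

# Illustrative timestamps from a hypothetical drill log
failure = datetime(2025, 1, 15, 10, 0, 0)
restored = datetime(2025, 1, 15, 10, 1, 0)
last_write = failure   # synchronous replication: nothing was lost

print(f"RTO: {achieved_rto(failure, restored):.0f} s")
print(f"RPO: {achieved_rpo(last_write, failure):.0f} s")
```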
1.2. DR scenarios to test
- Single node failure
  - Impact: Low (automatic failover)
  - RTO: < 1 minute
  - RPO: 0 (synchronous replication)
- Leader node failure
  - Impact: Medium (brief disruption)
  - RTO: < 2 minutes
  - RPO: 0
- Complete datacenter failure
  - Impact: High (manual intervention)
  - RTO: < 15 minutes
  - RPO: < 5 minutes
- Data corruption
  - Impact: High (PITR required)
  - RTO: 1-4 hours
  - RPO: Last valid backup
- Human error (DROP TABLE)
  - Impact: Medium-High
  - RTO: 30 minutes - 2 hours
  - RPO: Point-in-time before error
2. DR Drill Preparation
2.1. Pre-drill checklist
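Beyond the organizational checklist, it helps to verify cluster health programmatically before injecting any failure. The sketch below is one possible check against Patroni's REST API; the endpoint URL, lag threshold, and exact field values are assumptions to adapt to your environment.

```python
import json
from urllib.request import urlopen

# Placeholders: any reachable Patroni member and a drill-specific lag threshold
# (units are whatever the REST API reports for "lag").
PATRONI_URL = "http://patroni-node1:8008/cluster"
MAX_LAG = 10 * 1024 * 1024

def pre_drill_check() -> bool:
    """Return True if the cluster looks healthy enough to start a drill."""
    cluster = json.load(urlopen(PATRONI_URL, timeout=5))
    ok = True
    leaders = [m for m in cluster["members"] if m["role"] == "leader"]
    if len(leaders) != 1:
        print(f"FAIL: expected exactly one leader, found {len(leaders)}")
        ok = False
    for m in cluster["members"]:
        if m["state"] not in ("running", "streaming"):
            print(f"FAIL: {m['name']} is {m['state']}")
            ok = False
        lag = m.get("lag", 0)
        if isinstance(lag, int) and lag > MAX_LAG:
            print(f"FAIL: {m['name']} replication lag {lag} exceeds threshold")
            ok = False
    return ok

if __name__ == "__main__":
    print("Cluster ready for drill" if pre_drill_check() else "Do NOT start the drill")
```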
2.2. DR team roles
- Incident Commander: Owns overall response, makes final decisions, coordinates teams.
- Database Admin: Executes PostgreSQL recovery, manages Patroni cluster, validates data integrity.
- System Admin: Manages infrastructure, network connectivity, firewall rules.
- Application Owner: Tests application functionality, validates business logic, performs user acceptance testing.
- Communications Lead: Updates stakeholders, documents the timeline, facilitates the post-mortem.
- Observer (optional): Takes notes, times each step, identifies improvements.
3. Scenario 1: Single Replica Failure
3.1. Drill procedure
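The failure itself is injected out of band (for example, by stopping the patroni service on the target replica). A small watcher such as the sketch below, with placeholder endpoint and member names, can timestamp each state transition for the drill log:

```python
import json
import time
from datetime import datetime
from urllib.request import urlopen

PATRONI_URL = "http://patroni-node1:8008/cluster"   # placeholder endpoint
TARGET = "pg-node2"                                 # placeholder: replica being failed

def member_state(name: str) -> str:
    cluster = json.load(urlopen(PATRONI_URL, timeout=5))
    for m in cluster["members"]:
        if m["name"] == name:
            return m["state"]
    return "missing"   # Patroni no longer lists the member at all

# Poll once per second and log every state transition with a timestamp.
last = None
while True:
    state = member_state(TARGET)
    if state != last:
        print(f"{datetime.now().isoformat()} {TARGET}: {last} -> {state}")
        last = state
    time.sleep(1)
```

Stop the watcher with Ctrl-C once the member has rejoined; the logged timestamps feed directly into the expected-results timeline below.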
3.2. Expected results
- Timeline:
  - 10:00:00: Failure injected
  - 10:00:30: Failure detected by Patroni
  - 10:01:00: Traffic automatically rerouted
  - 10:05:00: Recovery initiated
  - 10:06:00: Full recovery complete
- RTO: 1 minute (time until traffic rerouted)
- RPO: 0 bytes (no data loss)
- Impact:
  - No application downtime
  - Slightly increased load on remaining replica
  - Monitoring alerts triggered (expected)
4. Scenario 2: Leader Failover
4.1. Drill procedure
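For the leader failover drill, the failure is injected with patronictl switchover (or by killing the leader's Patroni process); what matters for RTO is how long writes fail from the application's point of view. The sketch below measures that window; the DSN, the load-balanced write endpoint, and the drill_heartbeat table are placeholders you would prepare beforehand.

```python
import time
from datetime import datetime
import psycopg2

# Placeholder DSN: a write endpoint that always points at the current leader
# (for example HAProxy in front of the Patroni cluster). The drill_heartbeat
# table is assumed to exist: CREATE TABLE drill_heartbeat(ts timestamptz).
DSN = "host=haproxy port=5000 dbname=app user=app password=secret connect_timeout=2"

def try_write() -> bool:
    """Attempt one tiny write; return True on success."""
    try:
        conn = psycopg2.connect(DSN)
        try:
            with conn, conn.cursor() as cur:
                cur.execute("INSERT INTO drill_heartbeat(ts) VALUES (now())")
        finally:
            conn.close()
        return True
    except psycopg2.Error:
        return False

outage_start = None
while True:
    ok = try_write()
    now = datetime.now()
    if not ok and outage_start is None:
        outage_start = now
        print(f"{now.isoformat()} writes FAILING")
    elif ok and outage_start is not None:
        downtime = (now - outage_start).total_seconds()
        print(f"{now.isoformat()} writes restored; unavailability {downtime:.1f} s")
        break
    time.sleep(1)
```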
4.2. Expected results
- Timeline:
  - 10:00:05: Leader failure injected
  - 10:00:20: Failure detected (TTL expired)
  - 10:00:35: New leader elected
  - 10:00:45: Write operations succeed
  - 10:01:00: Application fully functional
  - 10:04:00: Old leader rejoins as replica
- RTO: 30 seconds (leader election time)
- RPO: 0 bytes (with synchronous replication)
- Impact:
  - ~40 seconds of write unavailability (failure to first successful write)
  - Read operations continue on replicas
  - ~10-20 failed write requests (depending on traffic)
  - Monitoring alerts triggered
5. Scenario 3: Complete Datacenter Failure
5.1. Drill procedure
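Because a full datacenter failover involves a human decision, a quick reachability report per DC helps the Incident Commander decide whether to promote the surviving site. A minimal sketch, assuming each member exposes the standard Patroni REST API on port 8008 (host names are placeholders):

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Placeholder inventory of Patroni REST endpoints per datacenter.
DATACENTERS = {
    "dc1": ["http://dc1-pg1:8008", "http://dc1-pg2:8008"],
    "dc2": ["http://dc2-pg1:8008", "http://dc2-pg2:8008"],
}

def reachable(url: str) -> bool:
    """A member counts as reachable if its /patroni status endpoint answers."""
    try:
        json.load(urlopen(f"{url}/patroni", timeout=3))
        return True
    except (URLError, OSError, ValueError):
        return False

for dc, members in DATACENTERS.items():
    up = sum(reachable(m) for m in members)
    print(f"{dc}: {up}/{len(members)} members reachable")
    if up == 0:
        print(f"  -> {dc} appears completely down; consider promoting the surviving DC")
```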
5.2. Expected results
- Timeline:
  - 10:00:00: DC1 failure
  - 10:02:00: Decision to fail over to DC2
  - 10:03:00: Manual promotion of DC2 leader
  - 10:04:00: Application reconfiguration
  - 10:05:00: Service fully restored
- RTO: 5 minutes (includes decision time)
- RPO: 0-5 minutes (depends on replication lag at failure time)
- Impact:
  - 5 minutes of complete outage
  - Possible data loss if cross-DC replication is asynchronous
  - Manual intervention required
  - Application connection configuration must be updated
6. Scenario 4: Point-in-Time Recovery (Data Corruption)
6.1. Drill procedure
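On PostgreSQL 12 and later, PITR is driven by recovery parameters plus a recovery.signal file in the restored data directory. The sketch below prepares both after the base backup has been restored; the paths and target timestamp are placeholders:

```python
from pathlib import Path

# Placeholders: data directory restored from the base backup, WAL archive
# location, and the target time identified just before the corrupting statement.
DATA_DIR = Path("/var/lib/postgresql/restore")
WAL_ARCHIVE = "/backup/wal_archive"
TARGET_TIME = "2025-01-15 10:29:00+00"

recovery_settings = (
    f"restore_command = 'cp {WAL_ARCHIVE}/%f %p'\n"
    f"recovery_target_time = '{TARGET_TIME}'\n"
    f"recovery_target_action = 'promote'\n"
)

# Append the recovery parameters and create the signal file that puts the
# instance into targeted recovery on next startup.
with open(DATA_DIR / "postgresql.auto.conf", "a") as f:
    f.write(recovery_settings)
(DATA_DIR / "recovery.signal").touch()

print("Recovery configured; start PostgreSQL on this data directory to begin PITR.")
```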
6.2. Expected results
- Timeline:
  - 10:30:00: Data corruption detected
  - 10:31:00: PITR target time identified
  - 10:33:00: Base backup restoration started
  - 10:36:00: PITR recovery initiated
  - 10:41:00: Data recovery complete
  - 10:44:00: Data restored to production
  - 10:45:00: Cleanup complete
- RTO: 15 minutes (data restoration)
- RPO: 0 (recovered to the exact point before corruption)
- Impact:
  - Temporary read-only mode during restoration
  - Requires manual data export/import
  - No service downtime (recovery on a separate instance)
7. DR Drill Metrics and Reporting
7.1. Drill scorecard
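Keeping the scorecard machine-readable makes it easy to track trends across drills. A minimal sketch: the targets mirror the scenario list in section 1.2, and the achieved values here are illustrative numbers that would normally come from the drill timeline.

```python
# Targets mirror the scenario list in section 1.2; achieved values are
# illustrative and would normally be taken from the drill timeline.
SCORECARD = [
    # (scenario, target_rto_s, achieved_rto_s, target_rpo_s, achieved_rpo_s)
    ("single replica failure",  60,  60,   0,   0),
    ("leader failover",        120,  40,   0,   0),
    ("datacenter failure",     900, 300, 300, 120),
]

print(f"{'scenario':<26}{'RTO (ach/target)':>20}{'RPO (ach/target)':>20}  result")
for name, t_rto, a_rto, t_rpo, a_rpo in SCORECARD:
    passed = a_rto <= t_rto and a_rpo <= t_rpo
    rto_col = f"{a_rto}/{t_rto} s"
    rpo_col = f"{a_rpo}/{t_rpo} s"
    print(f"{name:<26}{rto_col:>20}{rpo_col:>20}  {'PASS' if passed else 'FAIL'}")
```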
7.2. Post-drill analysis
8. Chaos Engineering for HA
8.1. Chaos Monkey for PostgreSQL
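A chaos monkey for a Patroni cluster can be as simple as restarting a random member at random intervals and watching how the cluster reacts. The sketch below shells out to patronictl; the config path, cluster name, and member names are placeholders, and the flag names follow recent Patroni releases.

```python
import random
import subprocess
import time
from datetime import datetime

# Placeholders: Patroni config path, cluster name, and member names.
PATRONI_CONFIG = "/etc/patroni/patroni.yml"
CLUSTER = "pg-ha"
MEMBERS = ["pg-node1", "pg-node2", "pg-node3"]
MIN_WAIT_S, MAX_WAIT_S = 300, 1800   # 5-30 minutes between injected faults

while True:
    time.sleep(random.randint(MIN_WAIT_S, MAX_WAIT_S))
    victim = random.choice(MEMBERS)
    print(f"{datetime.now().isoformat()} chaos: restarting {victim}")
    # --force skips the interactive confirmation prompt.
    subprocess.run(
        ["patronictl", "-c", PATRONI_CONFIG, "restart", CLUSTER, victim, "--force"],
        check=False,
    )
```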
8.2. Automated DR testing
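Automated DR testing goes one step further: trigger a controlled switchover on a schedule, measure how long it takes for a new leader to appear, and compare against the RTO target. The sketch below is one possible shape; the cluster name, endpoint, config path, and RTO target are placeholders, and the patronictl flags follow recent releases.

```python
import json
import subprocess
import time
from datetime import datetime
from urllib.request import urlopen

# Placeholders: cluster name, Patroni config path and REST endpoint, RTO target.
CLUSTER = "pg-ha"
PATRONI_CONFIG = "/etc/patroni/patroni.yml"
PATRONI_URL = "http://patroni-node1:8008/cluster"
RTO_TARGET_S = 120

def members():
    return json.load(urlopen(PATRONI_URL, timeout=5))["members"]

def automated_failover_test() -> None:
    old_leader = next(m["name"] for m in members() if m["role"] == "leader")
    candidate = next(m["name"] for m in members() if m["role"] != "leader")

    started = time.monotonic()
    subprocess.run(
        ["patronictl", "-c", PATRONI_CONFIG, "switchover", CLUSTER,
         "--candidate", candidate, "--force"],
        check=True,
    )
    # Wait until a different member is reported as leader.
    while True:
        leaders = [m["name"] for m in members() if m["role"] == "leader"]
        if leaders and leaders[0] != old_leader:
            break
        time.sleep(1)

    elapsed = time.monotonic() - started
    verdict = "PASS" if elapsed <= RTO_TARGET_S else "FAIL"
    print(f"{datetime.now().isoformat()} switchover took {elapsed:.1f} s -> {verdict}")

if __name__ == "__main__":
    automated_failover_test()
```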
9. Best Practices
✅ DO
- Schedule regular drills: Quarterly minimum.
- Test all scenarios: Not just easy ones.
- Rotate roles: Everyone should serve as Incident Commander at least once.
- Document everything: Timestamped notes.
- Measure RTO/RPO: Track improvements.
- Post-mortem every drill: Learn and improve.
- Update runbooks: Keep documentation current.
- Involve all teams: Cross-functional practice.
- Test backups: Restore verification essential.
- Automate where possible: Reduce human error.
❌ DON'T
- Don't skip drills: "Too busy" is not an excuse.
- Don't test only easy scenarios: Hard ones matter most.
- Don't ignore action items: Follow up on improvements.
- Don't reuse the same scenario: Vary the drills.
- Don't rely on one person: Bus factor = 1 is dangerous.
- Don't rush: Proper testing takes time.
- Don't skip post-mortems: Learning opportunity.
10. Lab Exercises
Lab 1: Execute failover drill
Tasks:
- Plan and schedule drill.
- Assign team roles.
- Execute leader failover.
- Document timeline.
- Calculate RTO/RPO.
- Write post-mortem.
Lab 2: PITR recovery drill
Tasks:
- Create test data.
- Simulate data corruption.
- Identify PITR target time.
- Restore to separate instance.
- Verify recovered data.
- Document procedure.
Lab 3: Multi-DC failover
Tasks:
- Set up a two-datacenter cluster.
- Simulate DC1 total failure.
- Manually promote DC2.
- Update application config.
- Measure downtime.
- Document lessons learned.
Lab 4: Chaos engineering
Tasks:
- Implement a chaos monkey script.
- Run for 24 hours.
- Monitor cluster behavior.
- Document failures and recoveries.
- Identify weak points.
- Improve HA configuration.
11. Summary
DR Drill Frequency: at least quarterly, as noted in Best Practices.
Success Criteria
A successful DR drill has:
- ✅ Met RTO/RPO targets.
- ✅ Zero data loss (or within RPO).
- ✅ All team members participated.
- ✅ Documentation updated.
- ✅ Action items identified.
- ✅ Post-mortem completed.
- ✅ Next drill scheduled.
Key Metrics to Track
- Detection time: How fast we notice.
- Response time: How fast we act.
- Recovery time: How fast we restore.
- Data loss: How much data was lost.
- Team coordination: How well we work together.
Next Steps
Lesson 28 will cover HA Architecture Design:
- Requirements gathering
- Architecture design documents
- Capacity planning
- Cost estimation
- Design review process