Multi-datacenter Setup
After this lesson, you will be able to:
- Design cross-datacenter replication architecture.
- Implement cascading replication topology.
- Handle network latency and failures.
- Configure disaster recovery for multiple sites.
- Load balance across geographic locations.
1. Multi-DC Architecture Patterns
1.1. Active-Passive (DR standby)
1.2. Active-Active (Multi-master)
1.3. Hub-and-Spoke (Cascading)
2. Cascading Replication Setup
2.1. Architecture
2.2. Configure cascading node (DC1)
2.3. Configure downstream replica (DC2)
2.4. Create replication slot on cascade node
2.5. Start DC2 replica
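The steps in 2.2–2.5 can be sketched as follows. This is a hedged sketch, not Patroni's generated configuration: the hostnames (`dc1-cascade`), user (`replicator`), slot name (`dc2_slot`), and data directory are placeholder assumptions.

```shell
# 2.4: create a physical replication slot ON the DC1 cascade node
# (the cascade node is itself a standby; slots on standbys require PG 9.6+)
psql -h dc1-cascade -U postgres \
  -c "SELECT pg_create_physical_replication_slot('dc2_slot');"

# 2.3/2.5: clone the DC2 replica FROM the cascade node, so DC2 never
# streams WAL across the WAN from the primary directly.
# -S uses the slot above, -R writes standby connection settings,
# -X stream ships WAL during the copy.
pg_basebackup -h dc1-cascade -U replicator -D /var/lib/postgresql/data \
  -S dc2_slot -R -X stream --checkpoint=fast

# start the DC2 replica; it connects upstream to dc1-cascade
pg_ctl -D /var/lib/postgresql/data start
```

Pointing DC2 at the cascade node rather than the primary is what makes the topology hub-and-spoke: only one replication stream crosses the WAN per datacenter.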
3. Network Latency Handling
3.1. Measure inter-DC latency
3.2. Optimize for high latency
3.3. Use WAL compression
3.4. Limit replication bandwidth
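A sketch of steps 3.1–3.4, assuming direct `ALTER SYSTEM` access; the hostname and all values are illustrative, not recommendations, and should be tuned to your measured latency.

```shell
# 3.1: first-pass round-trip time between DCs
ping -c 10 dc2-replica.example.com

# 3.2/3.3: settings commonly tuned for high-latency WAN links
psql -c "ALTER SYSTEM SET wal_compression = on;"           # compress full-page WAL images
psql -c "ALTER SYSTEM SET wal_sender_timeout = '2min';"    # tolerate WAN hiccups
psql -c "ALTER SYSTEM SET wal_receiver_timeout = '2min';"
psql -c "SELECT pg_reload_conf();"

# 3.4: cap base-backup bandwidth so seeding a DC2 replica
# does not saturate the inter-DC link
pg_basebackup -h dc1-cascade -D /restore --max-rate=50M
```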
4. Disaster Recovery Scenarios
4.1. DC1 total failure
4.2. DC1 recovery after failure
4.3. Split-brain prevention
Note: For true split-brain prevention, consider:
- Odd number of sites (3+ DCs) with witness node.
- Fencing mechanisms (STONITH).
- Quorum-based decisions.
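A hypothetical runbook sketch for 4.1–4.2. The cluster name `pg-cluster`, node names, and config path are placeholders; the promotion is deliberately manual, per the split-brain note above.

```shell
# 4.1: DC1 total failure -- promote DC2 only after a human confirms
# DC1 is really down (not just partitioned), to avoid split-brain
patronictl -c /etc/patroni/patroni.yml list pg-cluster    # confirm DC1 members unreachable
patronictl -c /etc/patroni/patroni.yml failover pg-cluster \
  --candidate dc2-node1 --force

# 4.2: when DC1 returns, reconcile the old primary's diverged timeline
# against the new DC2 primary before rejoining it as a standby
pg_rewind --target-pgdata=/var/lib/postgresql/data \
  --source-server="host=dc2-node1 user=postgres"
```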
5. Geographic Load Balancing
5.1. HAProxy with geo-awareness
5.2. DNS-based routing
5.3. Application-level routing
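An illustrative HAProxy fragment for the DC1 tier, written to a scratch path. It routes writes using Patroni's REST API (`GET /primary` returns 200 only on the leader, `GET /replica` only on replicas); node names, ports, and the file path are placeholder assumptions.

```shell
cat > /tmp/haproxy-dc1.cfg <<'EOF'
listen postgres_write
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server dc1-node1 dc1-node1:5432 check port 8008
    server dc1-node2 dc1-node2:5432 check port 8008 backup
    # cross-DC nodes only as last-resort backups for local clients
    server dc2-node1 dc2-node1:5432 check port 8008 backup

listen postgres_read
    bind *:5001
    option httpchk GET /replica
    http-check expect status 200
    balance leastconn
    server dc1-node2 dc1-node2:5432 check port 8008
EOF
```

The DC2 instance would mirror this with the preference reversed, so each region's clients land on local replicas for reads while writes always reach the single primary.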
6. Cross-DC Monitoring
6.1. Monitor replication lag
6.2. Prometheus exporters
6.3. Alert rules for cross-DC
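A sketch covering 6.1 and 6.3. The lag query uses standard `pg_stat_replication` columns; the alert rule is hypothetical, assuming a postgres_exporter-style metric name and a per-DC label added at scrape time.

```shell
# 6.1: per-standby lag as seen from the current primary
# (application_name comes from each standby's primary_conninfo)
psql -x -c "SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
                   replay_lag
            FROM pg_stat_replication;"

# 6.3: illustrative Prometheus alert rule for cross-DC lag
cat > /tmp/cross_dc_alerts.yml <<'EOF'
groups:
  - name: cross_dc_replication
    rules:
      - alert: CrossDCReplicationLagHigh
        expr: pg_replication_lag_seconds{dc="dc2"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DC2 replica lag above 30s for 5 minutes"
EOF
```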
7. Backup Strategy for Multi-DC
7.1. Per-DC backups
7.2. Geo-replicated backup storage
7.3. Backup verification
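One way to realize 7.1–7.3 with pgBackRest, sketched below; the stanza name, paths, bucket, and region are placeholders, and the geo-replication itself (7.2) is assumed to be configured on the bucket, not in pgBackRest.

```shell
cat > /tmp/pgbackrest.conf <<'EOF'
[global]
repo1-type=posix
repo1-path=/backup/local            # 7.1: local repo for fast in-DC restores
repo2-type=s3
repo2-s3-bucket=pg-backups-geo      # 7.2: bucket with cross-region replication enabled
repo2-s3-region=eu-west-1
repo2-s3-endpoint=s3.eu-west-1.amazonaws.com

[main]
pg1-path=/var/lib/postgresql/data
EOF

# 7.3: verify that backup files and archived WAL are intact,
# not merely that they exist
pgbackrest --stanza=main verify
```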
8. Best Practices
✅ DO
- Use cascading replication: Reduces load on primary.
- Separate etcd clusters: Per-DC for independence.
- Monitor replication lag: Alert on high lag.
- Test failover regularly: Quarterly DR drills.
- Use replication slots: Prevent WAL deletion.
- Compress WAL: Reduce WAN bandwidth.
- Limit base backup rate: Avoid WAN saturation.
- Implement geo-routing: Reduce latency for users.
- Document topology: Clear architecture diagrams.
- Automate failover: Within a DC; require human approval for cross-DC DR promotion.
❌ DON'T
- Don't use sync replication cross-DC: Every commit would wait on a WAN round trip.
- Don't share etcd across WAN: Split-brain risk.
- Don't ignore network latency: Tune timeouts.
- Don't forget about WAL retention: Use slots.
- Don't skip DR testing: Must validate regularly.
- Don't use single DC for backups: Geo-replicate.
- Don't over-complicate: Start simple, add complexity as needed.

9. Lab Exercises
Lab 1: Set up cascading replication
Tasks:
- Configure cascade node in DC1.
- Set up a downstream replica in DC2.
- Create replication slot.
- Verify replication lag.
- Monitor with Prometheus.
Lab 2: Test DR failover
Tasks:
- Simulate DC1 failure (stop all nodes).
- Promote DC2 to primary.
- Verify application connectivity.
- Document RTO/RPO.
- Plan failback procedure.
Lab 3: Geo-aware load balancing
Tasks:
- Set up HAProxy in each DC.
- Configure geo-based routing.
- Test read/write routing.
- Measure latency improvement.
- Implement health checks.
Lab 4: Cross-DC monitoring
Tasks:
- Configure Prometheus multi-DC scraping.
- Create Grafana dashboard with DC labels.
- Set up alert rules for cross-DC lag.
- Test alerting on simulated failure.
- Document runbook for alerts.
10. Advanced Topics
10.1. Three-datacenter setup
10.2. Active-active with logical replication
10.3. Quorum-based commit
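Topic 10.2 can be sketched as bidirectional logical replication between two writable DCs. Host, database, user, and publication names are placeholders, and note the hard caveat: PostgreSQL does not resolve write conflicts here, so the application must partition writes between DCs.

```shell
# each DC publishes its own changes
psql -h dc1-primary -d app -c "CREATE PUBLICATION dc1_pub FOR ALL TABLES;"
psql -h dc2-primary -d app -c "CREATE PUBLICATION dc2_pub FOR ALL TABLES;"

# each DC subscribes to the other; origin = none (PG 16+) prevents
# changes from echoing back in a loop
psql -h dc2-primary -d app -c "CREATE SUBSCRIPTION from_dc1
  CONNECTION 'host=dc1-primary dbname=app user=replicator'
  PUBLICATION dc1_pub WITH (origin = none);"
psql -h dc1-primary -d app -c "CREATE SUBSCRIPTION from_dc2
  CONNECTION 'host=dc2-primary dbname=app user=replicator'
  PUBLICATION dc2_pub WITH (origin = none);"
```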
11. Summary
Multi-DC Strategies
| Pattern | RPO | RTO | Complexity | Cost |
|---|---|---|---|---|
| Active-Passive (DR) | Minutes | Minutes | Low | Low |
| Cascading Replicas | Seconds | Seconds | Medium | Medium |
| Active-Active | Near-zero | Near-zero | High | High |
| Hub-and-Spoke | Seconds | Minutes | Medium | Medium |
Key Metrics
Checklist
- Cascading replication configured
- Separate etcd per DC
- Replication slots created
- WAL compression enabled
- Timeouts tuned for WAN
- Geo-aware load balancing
- Cross-DC monitoring
- DR failover tested
- Backup geo-replication
- Documentation complete
Next Steps
Lesson 22 will cover Patroni on Kubernetes:
- StatefulSets configuration
- Patroni Kubernetes operator
- PersistentVolumes setup
- Helm chart usage
- K8s-specific considerations