Lesson 21: Multi-datacenter Setup

After this lesson, you will be able to:

  • Design cross-datacenter replication architecture.
  • Implement cascading replication topology.
  • Handle network latency and failures.
  • Configure disaster recovery for multiple sites.
  • Load balance across geographic locations.

1. Multi-DC Architecture Patterns

1.1. Active-Passive (DR standby)

TEXT
Primary DC (Active):
  ├─ node1 (Leader)
  ├─ node2 (Replica)
  └─ node3 (Replica)
    ↓ Async replication
DR DC (Passive):
  ├─ node4 (Standby)
  └─ node5 (Standby)

Use case: Disaster recovery
RPO: Minutes to hours
RTO: Minutes to hours
Cost: Lower (minimal resources in DR)

1.2. Active-Active (Multi-master)

TEXT
DC1 (Active):
  ├─ node1 (Leader)
  └─ node2 (Replica)
    ↕ Bi-directional logical replication
DC2 (Active):
  ├─ node3 (Leader)
  └─ node4 (Replica)

Use case: Global applications with regional traffic
RPO: Near-zero
RTO: Near-zero
Cost: Higher (full resources in both DCs)
Note: Requires conflict resolution

1.3. Hub-and-Spoke (Cascading)

TEXT
Primary DC (Hub):
  └─ node1 (Leader)
      ├─ node2 (Replica) ← DC1 local
      ├─ node3 (Cascade) → DC2
      └─ node4 (Cascade) → DC3

DC2 (Spoke):
  └─ node5 (Replica, receives from node3)
      └─ node6 (Replica) ← DC2 local

DC3 (Spoke):
  └─ node7 (Replica, receives from node4)
      └─ node8 (Replica) ← DC3 local

Use case: Multiple regional read replicas
RPO: Seconds to minutes
Cost: Medium

2. Cascading Replication Setup

2.1. Architecture

TEXT
DC1 (us-east):
  ├─ pg-us-east-1 (Leader) - 10.1.1.11
  ├─ pg-us-east-2 (Replica) - 10.1.1.12
  └─ pg-us-east-3 (Cascade) - 10.1.1.13
        ↓ WAN replication
DC2 (us-west):
  └─ pg-us-west-1 (Standby Leader) - 10.2.1.11
      ↑ receives from pg-us-east-3 over the WAN
      └─ pg-us-west-2 (Replica) - 10.2.1.12

2.2. Configure cascading node (DC1)

YAML
# /etc/patroni/patroni.yml on pg-us-east-3 (cascade node)
scope: postgres-cluster
name: pg-us-east-3

restapi:
  listen: 10.1.1.13:8008
  connect_address: 10.1.1.13:8008

etcd:
  hosts: 10.1.1.11:2379,10.1.1.12:2379,10.1.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.1.1.13:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin

  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password

  parameters:
    # Enable cascading replication
    hot_standby: on
    wal_level: replica
    max_wal_senders: 10  # Extra slots for downstream replicas
    max_replication_slots: 10
    hot_standby_feedback: on
    
    # Performance tuning for WAN
    wal_sender_timeout: 60s
    wal_receiver_timeout: 60s

  # Allow downstream replicas to connect
  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256  # DC2 subnet

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true  # Can be used as clone source
  nosync: false

2.3. Configure downstream replica (DC2)

YAML
# /etc/patroni/patroni.yml on pg-us-west-1
scope: postgres-cluster-dc2  # Different scope!
name: pg-us-west-1

restapi:
  listen: 10.2.1.11:8008
  connect_address: 10.2.1.11:8008

etcd:
  # Separate etcd cluster for DC2
  hosts: 10.2.1.11:2379,10.2.1.12:2379,10.2.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    standby_cluster:
      # Point to DC1 cascade node
      host: 10.1.1.13  # pg-us-east-3
      port: 5432
      primary_slot_name: pg_us_west_1_slot
      create_replica_methods:
        - basebackup

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.2.1.11:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin

  # Options passed to pg_basebackup when (re)creating this replica
  basebackup:
    max-rate: '100M'
    checkpoint: 'fast'
    waldir: /var/lib/postgresql/18/data/pg_wal

  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password

  parameters:
    hot_standby: on
    wal_level: replica
    max_wal_senders: 5
    max_replication_slots: 5
    hot_standby_feedback: on
    
    # WAN-optimized settings
    wal_sender_timeout: 120s  # Higher for WAN
    wal_receiver_timeout: 120s
    wal_retrieve_retry_interval: 10s

  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256
    - host all all 10.2.1.0/24 scram-sha-256

tags:
  nofailover: false  # Can become leader in DC2
  noloadbalance: false
  clonefrom: false

2.4. Create replication slot on cascade node

BASH
# On pg-us-east-3 (cascade node)
sudo -u postgres psql -c "
  SELECT pg_create_physical_replication_slot('pg_us_west_1_slot');
"

2.5. Start DC2 replica

BASH
# On pg-us-west-1
sudo systemctl start patroni
sudo systemctl status patroni

# Check replication status
patronictl -c /etc/patroni/patroni.yml list

# Verify it's receiving from the cascade node
# (pg_stat_wal_receiver shows the upstream; pg_stat_replication on this node
#  would only list its own downstream replica, pg-us-west-2)
sudo -u postgres psql -c "
  SELECT status, sender_host, sender_port, slot_name
  FROM pg_stat_wal_receiver;
"

3. Network Latency Handling

3.1. Measure inter-DC latency

BASH
# Ping test
ping -c 10 10.2.1.11

# TCP port reachability
nc -vz 10.2.1.11 5432

# PostgreSQL connection + query round-trip time
time psql "host=10.2.1.11 user=postgres" -c "SELECT 1;"

# iPerf bandwidth test
# On DC2:
iperf3 -s
# On DC1:
iperf3 -c 10.2.1.11 -t 30

3.2. Optimize for high latency

SQL
-- Increase timeouts for WAN
ALTER SYSTEM SET wal_sender_timeout = '120s';  -- Default 60s
ALTER SYSTEM SET wal_receiver_timeout = '120s';
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';

-- TCP keepalive settings
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;

-- Reload
SELECT pg_reload_conf();

3.3. Use WAL compression

SQL
-- Enable WAL compression (boolean form since 9.5; lz4/zstd methods since PostgreSQL 15)
ALTER SYSTEM SET wal_compression = on;

-- Compresses full-page images in WAL; the WAN savings depend on the workload
-- but are often substantial for write-heavy clusters
SELECT pg_reload_conf();
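
To judge how much WAL actually crosses the WAN before and after enabling compression, you can sample the pg_stat_wal view (PostgreSQL 14+); a rough sketch:

BASH
# Measure WAL bytes generated over a 60-second window on the primary
sudo -u postgres psql -Atc "SELECT wal_bytes FROM pg_stat_wal;"
sleep 60
sudo -u postgres psql -Atc "SELECT wal_bytes FROM pg_stat_wal;"
# The difference between the two readings is the WAL generated in that minute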

3.4. Limit replication bandwidth

YAML
# In patroni.yml (pg_basebackup options live under the postgresql section)
postgresql:
  basebackup:
    max-rate: '50M'  # Limit to 50 MB/s to avoid saturating the WAN link
    checkpoint: 'fast'
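
For reference, the same rate cap applies when seeding a replica by hand; a sketch of the equivalent pg_basebackup call (host, path, and the 50M value are simply the examples used above):

BASH
# Manually seed a DC2 replica from the cascade node with a WAN-friendly rate cap
pg_basebackup -h 10.1.1.13 -U replicator \
  -D /var/lib/postgresql/18/data \
  --max-rate=50M --checkpoint=fast --wal-method=stream --progress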

4. Disaster Recovery Scenarios

4.1. DC1 total failure

BASH
# Promote DC2 to primary
# On pg-us-west-1

# Remove the standby_cluster section from the DC2 cluster config;
# Patroni then promotes the standby leader automatically
patronictl -c /etc/patroni/patroni.yml edit-config postgres-cluster-dc2 \
  --set standby_cluster=null --force

# Optional: move leadership to a specific node afterwards
# patronictl failover postgres-cluster-dc2 --candidate pg-us-west-1 --force

# Verify
patronictl -c /etc/patroni/patroni.yml list
# + Cluster: postgres-cluster-dc2 +---------+-----------+-----------+
# | Member       | Host      | Role    | State     | Lag in MB |
# +--------------+-----------+---------+-----------+-----------+
# | pg-us-west-1 | 10.2.1.11 | Leader  | running   |         0 |
# | pg-us-west-2 | 10.2.1.12 | Replica | streaming |         0 |
# +--------------+-----------+---------+-----------+-----------+

4.2. DC1 recovery after failure

BASH
# When DC1 comes back online, reintegrate it

# Option 1: Make DC1 follow DC2 (temporary)
# Add a standby_cluster section (pointing to the DC2 leader) to the DC1
# cluster's DCS config; Patroni demotes the DC1 cluster to a standby cluster
# and resynchronizes it from DC2.

# Option 2: Failback to DC1
# 1. Apply Option 1 and wait until DC1 has fully caught up with DC2.
# 2. During a maintenance window, reverse the roles:
#    - add standby_cluster (pointing to DC1) to postgres-cluster-dc2, and
#    - remove standby_cluster from postgres-cluster so DC1 promotes again.
# Note: a plain patronictl switchover cannot move the leader between the two
# clusters, because they use different scopes and separate etcd clusters.
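
A sketch of how Option 1 looks in practice (the host could also be a DC2 HAProxy address or VIP):

BASH
# On any DC1 node: open the DC1 cluster's DCS config and add a standby_cluster section
patronictl -c /etc/patroni/patroni.yml edit-config postgres-cluster
# Add at the top level of the configuration:
#   standby_cluster:
#     host: 10.2.1.11   # current DC2 leader
#     port: 5432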

4.3. Split-brain prevention

YAML
# Each DC normally runs its own etcd/Consul cluster (as in section 2)
# If a single DCS must span DCs, use an odd number of members spread across
# at least three sites, so losing any one site cannot cost you quorum

etcd:
  hosts:
    - 10.1.1.11:2379  # DC1
    - 10.1.1.12:2379  # DC1
    - 10.2.1.11:2379  # DC2
    - 10.2.1.12:2379  # DC2
    - 10.3.1.11:2379  # Witness site

Note: For true split-brain prevention, consider:

  1. Odd number of sites (3+ DCs) with witness node.
  2. Fencing mechanisms (STONITH).
  3. Quorum-based decisions.

5. Geographic Load Balancing

5.1. HAProxy with geo-awareness

TEXT
Architecture:

Users (us-east) → HAProxy-DC1 → PG-DC1 (primary)
Users (us-west) → HAProxy-DC2 → PG-DC2 (replica) - read-only
                             ↳ PG-DC1 (primary) - writes
TEXT
# /etc/haproxy/haproxy.cfg on HAProxy-DC1 (us-east)
frontend postgres_front
  bind *:5432
  mode tcp
  default_backend postgres_master

backend postgres_master
  mode tcp
  option tcp-check
  tcp-check connect
  tcp-check send-binary 0000000804d2162f  # full 8-byte SSLRequest packet
  tcp-check expect binary 4e              # 'N' (server answers without SSL)
  server pg-us-east-1 10.1.1.11:5432 check inter 3000
  server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
TEXT
# /etc/haproxy/haproxy.cfg on HAProxy-DC2 (us-west)
frontend postgres_front_read
  bind *:5432
  mode tcp
  default_backend postgres_replicas

frontend postgres_front_write
  bind *:5433
  mode tcp
  default_backend postgres_master_remote

backend postgres_replicas
  # Local read replicas
  mode tcp
  balance roundrobin
  option tcp-check
  server pg-us-west-1 10.2.1.11:5432 check inter 3000
  server pg-us-west-2 10.2.1.12:5432 check inter 3000

backend postgres_master_remote
  # Write to primary in DC1
  mode tcp
  option tcp-check
  server pg-us-east-1 10.1.1.11:5432 check inter 3000
  server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
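
Once both HAProxy instances are up, the routing can be verified from a DC2 client with pg_is_in_recovery(); a sketch using the HAProxy-DC2 address from the DNS example below (10.2.1.100):

BASH
# Port 5432 should land on a local DC2 replica (pg_is_in_recovery() returns t)
psql "host=10.2.1.100 port=5432 user=app dbname=mydb" -c "SELECT pg_is_in_recovery();"

# Port 5433 should reach the DC1 primary (pg_is_in_recovery() returns f)
psql "host=10.2.1.100 port=5433 user=app dbname=mydb" -c "SELECT pg_is_in_recovery();"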

5.2. DNS-based routing

TEXT
# Use DNS with geo-location
# GeoDNS service (Route53, Cloudflare, etc.)

# US-East users resolve to:
postgres.example.com → 10.1.1.100 (HAProxy-DC1)

# US-West users resolve to:
postgres.example.com → 10.2.1.100 (HAProxy-DC2)

# Configure health checks to failover on DC failure

5.3. Application-level routing

PYTHON
# Python example with psycopg2
import time

import psycopg2

def get_postgres_endpoint():
    """Get optimal PostgreSQL endpoint based on measured connection latency"""
    # Check connection latency to each DC
    latencies = {}
    for dc, host in [('dc1', '10.1.1.11'), ('dc2', '10.2.1.11')]:
        try:
            start = time.time()
            conn = psycopg2.connect(
                host=host, user='app', password='pass',
                dbname='mydb', connect_timeout=3
            )
            conn.close()
            latencies[dc] = time.time() - start
        except psycopg2.OperationalError:
            latencies[dc] = float('inf')
    
    # Return DC with lowest latency
    best_dc = min(latencies, key=latencies.get)
    return '10.1.1.11' if best_dc == 'dc1' else '10.2.1.11'

# Use it
conn = psycopg2.connect(
    host=get_postgres_endpoint(),
    user='app', password='pass', dbname='mydb'
)

6. Cross-DC Monitoring

6.1. Monitor replication lag

SQL
-- On cascade node (DC1)
SELECT client_addr, application_name,
       state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sending_lag,
       pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag,
       pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag
FROM pg_stat_replication
WHERE application_name LIKE '%west%';
TEXT
# Expected replication lag for cross-DC:
# - Low latency WAN (< 10ms): 0-10 MB lag
# - Medium latency WAN (10-50ms): 10-50 MB lag
# - High latency WAN (> 50ms): 50-200 MB lag
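
Byte lag is only half the picture for cross-DC links; on the receiving side, a time-based measure is often easier to alert on:

BASH
# On pg-us-west-1: how far behind (in wall-clock time) is WAL replay?
sudo -u postgres psql -c "
  SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
"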

6.2. Prometheus exporters

YAML
# prometheus.yml
scrape_configs:
  - job_name: 'postgres-dc1'
    static_configs:
      - targets:
          - '10.1.1.11:9187'
          - '10.1.1.12:9187'
          - '10.1.1.13:9187'
        labels:
          datacenter: 'us-east'

  - job_name: 'postgres-dc2'
    static_configs:
      - targets:
          - '10.2.1.11:9187'
          - '10.2.1.12:9187'
        labels:
          datacenter: 'us-west'

6.3. Alert rules for cross-DC

YAML
# /etc/prometheus/alerts/multi-dc.yml
# Metric names assume postgres_exporter custom queries that export replication
# lag in bytes (pg_replication_lag) and an up/down flag (pg_replication_status)
groups:
  - name: multi-dc
    rules:
      - alert: CrossDCReplicationLag
        expr: |
          pg_replication_lag{datacenter="us-west"} > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag to DC2"
          description: "Replication lag to {{ $labels.instance }} is {{ $value | humanize }}B"
      
      - alert: CrossDCReplicationBroken
        expr: |
          pg_replication_status{datacenter="us-west"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication to DC2 is broken"
      
      - alert: CrossDCLatency
        expr: |
          probe_duration_seconds{job="blackbox-dc2"} > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network latency to DC2"

7. Backup Strategy for Multi-DC

7.1. Per-DC backups

BASH
# DC1 backup
pgbackrest --stanza=main --type=full backup

# DC2 can share DC1's backup repository over the WAN (repo1-host in
# pgbackrest.conf); with a dedicated repository host, run the backup there
pgbackrest --stanza=main --type=diff backup
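
Either way, pgbackrest info confirms which full and differential backups exist for the stanza and whether WAL archiving is current:

BASH
# List backups and archive status for the stanza
pgbackrest --stanza=main info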

7.2. Geo-replicated backup storage

INI
# pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=my-postgres-backups-us-east
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com

# S3 cross-region replication enabled:
# us-east-1 → us-west-2
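
The cross-region copy itself is an S3 bucket setting; assuming AWS CLI access to the bucket, a quick check that it is in place:

BASH
# Show the replication configuration on the primary backup bucket
aws s3api get-bucket-replication --bucket my-postgres-backups-us-east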

7.3. Backup verification

BASH
# Restore test in DR site
pgbackrest --stanza=main \
  --type=time \
  --target="2024-11-25 10:00:00" \
  restore \
  --repo1-host=backup-server \
  --pg1-path=/var/lib/postgresql/18/restore_test
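
After the restore finishes, a minimal sanity check is to start the restored instance on a spare port and query it; a sketch (port 5440 is arbitrary, and the pg_ctl path matches the bin_dir used earlier):

BASH
# Start the restored cluster on a throwaway port, confirm it answers, then stop it
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
  -D /var/lib/postgresql/18/restore_test -o "-p 5440" -w start
sudo -u postgres psql -p 5440 -c "SELECT pg_is_in_recovery(), now();"
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
  -D /var/lib/postgresql/18/restore_test -m fast stop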

8. Best Practices

✅ DO

  1. Use cascading replication: Reduces load on primary.
  2. Separate etcd clusters: Per-DC for independence.
  3. Monitor replication lag: Alert on high lag.
  4. Test failover regularly: Quarterly DR drills.
  5. Use replication slots: Prevent WAL deletion.
  6. Compress WAL: Reduce WAN bandwidth.
  7. Limit base backup rate: Avoid WAN saturation.
  8. Implement geo-routing: Reduce latency for users.
  9. Document topology: Clear architecture diagrams.
  10. Automate failover: But with human approval for DR.

❌ DON'T

  1. Don't use sync replication cross-DC: Too slow.
  2. Don't share etcd across WAN: Split-brain risk.
  3. Don't ignore network latency: Tune timeouts.
  4. Don't forget about WAL retention: Use slots.
  5. Don't skip DR testing: Must validate regularly.
  6. Don't use single DC for backups: Geo-replicate.
  7. Don't over-complicate: Start simple, add complexity as needed.

9. Lab Exercises

Lab 1: Setup cascading replication

Tasks:

  1. Configure cascade node in DC1.
  2. Setup downstream replica in DC2.
  3. Create replication slot.
  4. Verify replication lag.
  5. Monitor with Prometheus.

Lab 2: Test DR failover

Tasks:

  1. Simulate DC1 failure (stop all nodes).
  2. Promote DC2 to primary.
  3. Verify application connectivity.
  4. Document RTO/RPO.
  5. Plan failback procedure.

Lab 3: Geo-aware load balancing

Tasks:

  1. Setup HAProxy in each DC.
  2. Configure geo-based routing.
  3. Test read/write routing.
  4. Measure latency improvement.
  5. Implement health checks.

Lab 4: Cross-DC monitoring

Tasks:

  1. Configure Prometheus multi-DC scraping.
  2. Create Grafana dashboard with DC labels.
  3. Setup alert rules for cross-DC lag.
  4. Test alerting on simulated failure.
  5. Document runbook for alerts.

10. Advanced Topics

10.1. Three-datacenter setup

TEXT
DC1 (us-east):
  └─ pg1 (Leader)
      ├─ pg2 (Replica)
      └─ pg3 (Cascade) → DC2

DC2 (us-west):
  └─ pg4 (Replica from pg3)
      ├─ pg5 (Replica)
      └─ pg6 (Cascade) → DC3

DC3 (eu-central):
  └─ pg7 (Replica from pg6)
      └─ pg8 (Replica)

Use case: Global application with regional reads

10.2. Active-active with logical replication

SQL
-- DC1 publication
CREATE PUBLICATION dc1_pub FOR ALL TABLES;

-- DC2 subscription
CREATE SUBSCRIPTION dc2_sub
CONNECTION 'host=10.1.1.11 dbname=mydb user=replicator'
PUBLICATION dc1_pub
WITH (copy_data = true, origin = none);   -- origin = none (PG16+) stops changes looping back

-- DC2 publication (for bi-directional)
CREATE PUBLICATION dc2_pub FOR ALL TABLES;

-- DC1 subscription
CREATE SUBSCRIPTION dc1_sub
CONNECTION 'host=10.2.1.11 dbname=mydb user=replicator'
PUBLICATION dc2_pub
WITH (copy_data = false, origin = none);  -- already has the data; avoid loops

-- Conflict resolution is still required for concurrent writes to the same rows!
-- See: https://www.postgresql.org/docs/current/logical-replication-conflicts.html
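
Subscription health on either side can be checked with the standard catalog and statistics views:

BASH
# On DC2: confirm the subscription is enabled and how far the apply worker has progressed
sudo -u postgres psql -d mydb -c "SELECT subname, subenabled FROM pg_subscription;"
sudo -u postgres psql -d mydb -c "
  SELECT subname, received_lsn, latest_end_lsn
  FROM pg_stat_subscription;
"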

10.3. Quorum-based commit

YAML
# For strong consistency across DCs
postgresql:
  parameters:
    # Names containing dashes must be double-quoted inside the setting
    synchronous_standby_names: 'ANY 2 ("pg-us-east-2", "pg-us-west-1", "pg-eu-central-1")'
    synchronous_commit: 'remote_apply'

# Requires 2 of the 3 listed standbys (one per DC) to acknowledge each commit
# Provides strong durability but adds WAN round-trip latency to every write
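
Whether the quorum is currently satisfiable can be seen on the primary, where the standbys listed in the ANY clause report sync_state = 'quorum':

BASH
# On the primary: list standbys participating in quorum commit
sudo -u postgres psql -c "
  SELECT application_name, sync_state FROM pg_stat_replication;
"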

11. Summary

Multi-DC Strategies

Pattern             | RPO       | RTO       | Complexity | Cost
--------------------|-----------|-----------|------------|-------
Active-Passive (DR) | Minutes   | Minutes   | Low        | Low
Cascading Replicas  | Seconds   | Seconds   | Medium     | Medium
Active-Active       | Near-zero | Near-zero | High       | High
Hub-and-Spoke       | Seconds   | Minutes   | Medium     | Medium

Key Metrics

TEXT
Replication Lag: < 50 MB for WAN
Network Latency: < 100 ms acceptable
Throughput: 50-100 MB/s typical for WAN
RPO Target: < 5 minutes
RTO Target: < 15 minutes

Checklist

  • Cascading replication configured
  • Separate etcd per DC
  • Replication slots created
  • WAL compression enabled
  • Timeouts tuned for WAN
  • Geo-aware load balancing
  • Cross-DC monitoring
  • DR failover tested
  • Backup geo-replication
  • Documentation complete

Next Steps

Lesson 22 will cover Patroni on Kubernetes:

  • StatefulSets configuration
  • Patroni Kubernetes operator
  • PersistentVolumes setup
  • Helm charts usage
  • K8s-specific considerations
