Lesson 10: Replication Management

Objectives

After this lesson, you will:

  • Understand synchronous vs asynchronous replication modes
  • Configure and manage replication settings in Patroni
  • Monitor replication lag and performance
  • Troubleshoot common replication issues
  • Handle replication failures and recovery

1. Understanding Replication Modes

1.1. Asynchronous replication (default)

In asynchronous mode, the primary server does not wait for confirmation from replicas before acknowledging a transaction as committed.

Characteristics:

  • High performance: Minimal impact on write performance
  • High availability: No blocking if replicas are down
  • Potential data loss: Transactions committed on the primary but not yet replicated may be lost if the primary fails
  • No guarantee: Replicas may lag behind the primary

Configuration:

TEXT
bootstrap:
  dcs:
    synchronous_mode: false
    synchronous_mode_strict: false

When to use:

  • Applications that can tolerate some data loss
  • High-throughput environments where performance is critical
  • Cross-datacenter replication where network latency is high

1.2. Synchronous replication

In synchronous mode, the primary server waits for confirmation from at least one synchronous replica before acknowledging a transaction.

Characteristics:

  • Zero data loss: Confirmed transactions are guaranteed to exist on at least one replica
  • Strong consistency: Data is guaranteed to be on multiple nodes
  • Performance impact: Higher transaction latency
  • Availability risk: Writes block if no synchronous replica is available

Configuration:

TEXT
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false
    synchronous_node_count: 1  # Number of synchronous standbys required

Synchronous mode variants:

  • synchronous_mode_strict: false: Degrades to async if no sync standby available
  • synchronous_mode_strict: true: Refuses writes if no sync standby available

1.3. Quorum-based synchronous replication

PostgreSQL supports multiple synchronous standbys with quorum-based confirmation:

TEXT
# In PostgreSQL configuration (managed by Patroni)
synchronous_standby_names = 'FIRST 2 (node2,node3,node4)'
# or
synchronous_standby_names = 'ANY 2 (node2,node3,node4)'
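
With Patroni you normally do not edit synchronous_standby_names by hand; instead you raise synchronous_node_count and let Patroni generate the standby list. A minimal sketch, assuming the cluster is named postgres and that your patronictl version supports the --set option:

TEXT
# Ask Patroni to maintain two synchronous standbys; Patroni rewrites synchronous_standby_names itself
patronictl edit-config postgres --set synchronous_mode=true --set synchronous_node_count=2

# Inspect what Patroni generated
sudo -u postgres psql -c "SHOW synchronous_standby_names;"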

2. Configuring Synchronous Replication

2.1. Enable synchronous mode in Patroni

Edit cluster configuration:

TEXT
patronictl edit-config postgres
# This opens an editor to modify the cluster-wide configuration

Configuration example:

TEXT
loop_wait: 10
ttl: 30
retry_timeout: 10
synchronous_mode: true
synchronous_mode_strict: false
synchronous_node_count: 1

postgresql:
  parameters:
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10
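
After saving, the new values are written to the DCS and picked up by every member. A quick way to confirm the change took effect (assuming the cluster is named postgres):

TEXT
# Show the cluster-wide configuration currently stored in the DCS
patronictl show-config postgres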

2.2. Designate synchronous replicas

Patroni selects synchronous standbys automatically from the healthy replicas. To prevent a node from ever being chosen as a synchronous standby, set the nosync tag in that node's patroni.yml:

TEXT
# On a node that should never be picked as a synchronous standby
tags:
  nosync: true

2.3. Verify synchronous replication

Check synchronous standby status:

TEXT
# On primary
sudo -u postgres psql -c "
SELECT application_name, 
       sync_state, 
       sync_priority,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn
FROM pg_stat_replication;
"

Output:

TEXT
 application_name | sync_state | sync_priority |   state   | sent_lsn  | write_lsn | flush_lsn 
------------------+------------+---------------+-----------+-----------+-----------+-----------
 node2            | sync       |             1 | streaming | 0/2000000 | 0/2000000 | 0/2000000
 node3            | async      |             0 | streaming | 0/2000000 | 0/2000000 | 0/2000000
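
The same status can be cross-checked from Patroni's side; with synchronous mode enabled, patronictl reports the chosen replica with the Sync Standby role (cluster name postgres assumed, as in the rest of this lesson):

TEXT
patronictl list postgres
# One replica should appear with Role = Sync Standby; the others remain plain replicas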

3. Monitoring Replication Lag

3.1. Using pg_stat_replication

On primary node:

TEXT
# Check replication lag for all replicas
sudo -u postgres psql -c "
SELECT 
  application_name,
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
"

3.2. Using pg_stat_wal_receiver (on replicas)

TEXT
# On each replica
sudo -u postgres psql -c "
SELECT 
  status,
  written_lsn,
  flushed_lsn,
  last_msg_receipt_time,
  latest_end_lsn,
  latest_end_time
FROM pg_stat_wal_receiver;
"

# Byte difference between WAL received and WAL replayed (pg_current_wal_lsn() cannot be called on a standby)
sudo -u postgres psql -c "
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes;
"

3.3. Real-time lag monitoring

Continuous monitoring script:

TEXT
#!/bin/bash
# monitor_lag.sh

while true; do
  echo "=== $(date) ==="
  sudo -u postgres psql -c "
  SELECT 
    application_name,
    client_addr,
    state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag,
    replay_lag
  FROM pg_stat_replication;
  " | grep -v "rows)"
  sleep 5
done

3.4. Setting up lag alerts

Create monitoring function:

TEXT
-- Create this function in the database used for monitoring
CREATE OR REPLACE FUNCTION check_replication_lag(max_lag_mb INTEGER DEFAULT 10)
RETURNS TABLE(
  application_name TEXT,
  client_addr INET,
  lag_mb NUMERIC,
  status TEXT
) AS $$
BEGIN
  RETURN QUERY
  SELECT 
    pr.application_name,
    pr.client_addr,
    pg_wal_lsn_diff(pg_current_wal_lsn(), pr.replay_lsn)::NUMERIC / 1024^2 AS lag_mb,
    CASE 
      WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), pr.replay_lsn) > (max_lag_mb * 1024^2) 
      THEN 'ALERT' 
      ELSE 'OK' 
    END AS status
  FROM pg_stat_replication pr;
END;
$$ LANGUAGE plpgsql;
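
A minimal usage example, run on the primary, flagging any replica that is more than 50 MB behind:

TEXT
sudo -u postgres psql -c "SELECT * FROM check_replication_lag(50);"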

4. Managing Replication Slots

4.1. Understanding replication slots

Replication slots ensure that WAL files are not removed from the primary until they have been received by the replica.

Types of slots:

  • Physical slots: Used by physical replicas
  • Logical slots: Used by logical replication

4.2. Check replication slots

TEXT
# On primary
sudo -u postgres psql -c "SELECT * FROM pg_replication_slots;"

4.3. Managing slots with Patroni

Automatic slot management:

Patroni can automatically create and manage replication slots:

TEXT
bootstrap:
  dcs:
    postgresql:
      use_slots: true  # Enable automatic slot management

Manual slot management:

TEXT
-- Create a physical replication slot
SELECT pg_create_physical_replication_slot('node4_slot');

-- Drop a slot
SELECT pg_drop_replication_slot('node4_slot');

-- List slots
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
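
Inactive or forgotten slots hold WAL on disk indefinitely, so it is worth checking how much WAL each slot is retaining. A small sketch, run on the primary:

TEXT
sudo -u postgres psql -c "
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
"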

5. Handling Replication Issues

5.1. High replication lag

Symptoms:

  • Large values in pg_stat_replication lag columns
  • Replicas falling behind primary significantly
  • Potential timeout issues

Causes and solutions:

A. Network issues

TEXT
# Check network performance
iperf3 -c 10.0.1.12  # Test from primary to replica

# Check for packet loss
ping -c 100 10.0.1.12

B. Replica performance issues

TEXT
# Check replica system resources
top
iostat -x 1

# Check PostgreSQL activity on replica
sudo -u postgres psql -c "SELECT * FROM pg_stat_activity;"

C. Large transactions or bulk operations

TEXT
# Monitor for large transactions
sudo -u postgres psql -c "
SELECT pid, 
       now() - pg_stat_activity.query_start AS duration, 
       query,
       state
FROM pg_stat_activity 
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
"

5.2. Replication stopped

Check replica status:

TEXT
# Check if replica is in recovery
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"

# Check recovery status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"

# Check for errors in PostgreSQL logs
sudo tail -f /var/lib/postgresql/18/data/log/postgresql-*.log
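
If the replica has fallen so far behind that the WAL it needs is gone, one recovery option is to rebuild it from the primary; a sketch assuming the cluster is named postgres and the broken member is node3:

TEXT
# Reinitialize the replica: Patroni wipes its data directory and re-clones it from the primary
patronictl reinit postgres node3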

5.3. WAL file issues

Check for missing WAL files:

TEXT
# On replica - check WAL receiver status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"

# Check WAL files on primary
ls -la /var/lib/postgresql/18/data/pg_wal/ | head -10

5.4. Authentication issues

Verify replication user:

TEXT
# Check replication user exists on primary
sudo -u postgres psql -c "\du" | grep replicator

# Check pg_hba.conf allows replication
sudo -u postgres psql -c "SHOW hba_file;"
# Then check the file content
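
For reference, a replication connection needs a matching pg_hba.conf entry. Patroni normally writes this file itself, but the entry looks roughly like the sketch below (subnet and auth method are assumptions based on this lesson's lab network):

TEXT
# Allow the replicator role to open replication connections from the cluster subnet
host  replication  replicator  10.0.1.0/24  scram-sha-256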

6. Performance Tuning for Replication

6.1. WAL configuration for replication

TEXT
postgresql:
  parameters:
    # WAL settings for replication
    wal_level: replica
    max_wal_senders: 10
    max_replication_slots: 10
    wal_keep_size: "512MB"  # Keep WAL files longer for replicas
    
    # Archive settings (optional)
    archive_mode: "on"
    archive_command: "cp %p /var/lib/postgresql/18/archive/%f"
    archive_timeout: 300

6.2. Replication-specific parameters

TEXT
postgresql:
  parameters:
    # Replica settings
    hot_standby: "on"
    max_standby_archive_delay: 30000ms
    max_standby_streaming_delay: 30000ms
    wal_receiver_status_interval: 10s
    hot_standby_feedback: "on"
    
    # Performance settings
    wal_sender_timeout: 60s
    wal_receiver_timeout: 120s
    tcp_keepalives_idle: 600
    tcp_keepalives_interval: 30
    tcp_keepalives_count: 3

6.3. Network optimization

For high-latency networks:

TEXT
# In PostgreSQL configuration
tcp_keepalives_idle = 600
tcp_keepalives_interval = 30
tcp_keepalives_count = 3

# These help maintain connections over unstable networks

7. Advanced Replication Features

7.1. Cascading replication

Setting up replicas of replicas to reduce load on primary:

TEXT
# On the intermediate replica: allow it to act as a WAL sender for downstream nodes
postgresql:
  parameters:
    max_wal_senders: 5
    max_replication_slots: 5

# On the downstream replica: stream from the intermediate replica instead of the primary
# by pointing the replicatefrom tag at that member's Patroni name in patroni.yml
tags:
  replicatefrom: node2
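
To confirm the cascade works, query pg_stat_replication on the intermediate replica; the downstream member should appear there rather than on the primary:

TEXT
# Run on the intermediate replica (node2 in the sketch above)
sudo -u postgres psql -c "SELECT application_name, state, sent_lsn FROM pg_stat_replication;"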

7.2. Logical replication

For partial replication of specific tables:

TEXT
# Requires wal_level = logical
postgresql:
  parameters:
    wal_level: logical

# Create publication on primary
CREATE PUBLICATION my_publication FOR TABLE users, orders;

# Create subscription on replica
CREATE SUBSCRIPTION my_subscription 
CONNECTION 'host=primary port=5432 dbname=mydb' 
PUBLICATION my_publication;
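
Once the subscription exists, its progress can be checked on the subscriber:

TEXT
# On the subscriber
sudo -u postgres psql -d mydb -c "SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;"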

8. Replication Monitoring and Alerting

8.1. Key metrics to monitor

TEXT
# Replication lag (bytes)
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)

# Replication status
SELECT status FROM pg_stat_wal_receiver;  -- Should be 'streaming'

# WAL sender processes
SELECT count(*) FROM pg_stat_replication;

# Replication slots
SELECT count(*) FROM pg_replication_slots WHERE active = true;

8.2. Prometheus metrics for replication

TEXT
# Sample Prometheus rules
groups:
  - name: replication
    rules:
      - alert: HighReplicationLag
        expr: pg_replication_lag_bytes > 104857600  # 100MB
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag is high on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} bytes on {{ $labels.instance }}"
      
      - alert: ReplicationStopped
        expr: pg_replication_is_replicating == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication has stopped on {{ $labels.instance }}"
          description: "Replication has stopped on {{ $labels.instance }}"

8.3. Log-based monitoring

Important log patterns to watch:

TEXT
# Check for replication errors
grep -i "replication\|wal receiver\|startup process" /var/lib/postgresql/18/data/log/postgresql-*.log

# Common error patterns
tail -f /var/lib/postgresql/18/data/log/postgresql-*.log | grep -E "FATAL|ERROR.*replication|could not connect"

9. Troubleshooting Common Issues

9.1. Replica not connecting to primary

Check connectivity:

TEXT
# From replica to primary
telnet 10.0.1.11 5432

# Check firewall rules
sudo ufw status
sudo iptables -L

Check authentication:

TEXT
# Test connection
psql "host=10.0.1.11 port=5432 user=replicator replication=1" -c "IDENTIFY_SYSTEM;"

9.2. High memory usage on replicas

TEXT
# Check resource-related settings on the replica
sudo -u postgres psql -c "SHOW shared_buffers;"
sudo -u postgres psql -c "SHOW max_worker_processes;"

# Consider reducing maintenance_work_mem on replicas
postgresql:
  parameters:
    maintenance_work_mem: "256MB"  # Lower than primary

9.3. Disk space issues with WAL files

TEXT
# Check WAL directory size
du -sh /var/lib/postgresql/18/data/pg_wal/

# Clean up old WAL files if needed
# Usually handled by replication slots and archive settings

10. Lab Exercises

Lab 1: Configure synchronous replication

Tasks:

  1. Switch cluster to synchronous mode
  2. Verify one replica becomes synchronous
  3. Test the write performance difference (see the pgbench sketch below)
  4. Simulate synchronous replica failure
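
For task 3, pgbench gives a quick before/after comparison; a minimal sketch, assuming a scratch database named benchdb exists:

TEXT
# Initialize pgbench tables, then run a short write-heavy benchmark in each replication mode
sudo -u postgres pgbench -i -s 10 benchdb
sudo -u postgres pgbench -c 8 -j 2 -T 60 benchdb
# Compare the reported TPS with synchronous_mode off and on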

Lab 2: Monitor replication lag

Tasks:

  1. Set up continuous lag monitoring
  2. Generate load on primary
  3. Monitor lag increase and recovery
  4. Create alerting rules

Lab 3: Handle replication failure

Tasks:

  1. Simulate a network interruption (see the iptables sketch below)
  2. Monitor replication status
  3. Restore replication
  4. Verify data consistency
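
For task 1, dropping traffic between a replica and the primary with iptables is a simple way to simulate an interruption (primary at 10.0.1.11, as elsewhere in this lesson):

TEXT
# On a replica: block traffic to the primary, observe the effect, then remove the rule
sudo iptables -A OUTPUT -p tcp -d 10.0.1.11 --dport 5432 -j DROP
# ...watch pg_stat_wal_receiver and patronictl list while replication is blocked...
sudo iptables -D OUTPUT -p tcp -d 10.0.1.11 --dport 5432 -j DROP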

11. Summary

Key Takeaways

  • Asynchronous replication: better performance, potential data loss
  • Synchronous replication: zero data loss, some performance impact
  • Monitoring: essential for detecting replication issues early
  • Replication slots: prevent WAL cleanup before replicas have received the files
  • Performance tuning: balance safety against performance
  • Troubleshooting: detect and resolve replication issues quickly

Best Practices

  •  Monitor replication lag continuously
  •  Set up alerts for high lag or stopped replication
  •  Configure appropriate WAL retention settings
  •  Use replication slots to prevent WAL file loss
  •  Test failover procedures regularly
  •  Plan for network interruptions and performance issues
  •  Document replication topology and procedures

Preparation for Lesson 11

Lesson 11 will cover Patroni callbacks:

  • Understanding callback mechanisms
  • Implementing custom scripts
  • Using callbacks for automation
  • Monitoring and alerting with callbacks
