Lesson 10: Replication Management

Objectives

After this lesson, you will:

  • Understand synchronous vs asynchronous replication modes
  • Configure and manage replication settings in Patroni
  • Monitor replication lag and performance
  • Troubleshoot common replication issues
  • Handle replication failures and recovery

1. Understanding Replication Modes

1.1. Asynchronous replication (default)

In asynchronous mode, the primary server does not wait for confirmation from replicas before acknowledging a transaction as committed.

Characteristics:

  • High performance: Minimal impact on write performance
  • High availability: No blocking if replicas are down
  • Potential data loss: Transactions committed on the primary but not yet replicated may be lost if the primary fails
  • No guarantee: Replicas may lag behind the primary

Configuration:

TEXT
bootstrap:
  dcs:
    synchronous_mode: false
    synchronous_mode_strict: false

When to use:

  • Applications that can tolerate some data loss
  • High-throughput environments where performance is critical
  • Cross-datacenter replication where network latency is high

1.2. Synchronous replication

In synchronous mode, the primary server waits for confirmation from at least one synchronous replica before acknowledging a transaction.

Characteristics:

  • Zero data loss: Confirmed transactions are guaranteed to exist on at least one replica
  • Strong consistency: Data is guaranteed to be on multiple nodes
  • Performance impact: Higher transaction latency
  • Availability risk: Writes block if no synchronous replica is available

Configuration:

TEXT
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false
    synchronous_node_count: 1  # Number of synchronous standbys required

Synchronous mode variants:

  • synchronous_mode_strict: false: Degrades to async if no sync standby available
  • synchronous_mode_strict: true: Refuses writes if no sync standby available

1.3. Quorum-based synchronous replication

PostgreSQL supports multiple synchronous standbys with quorum-based confirmation:

TEXT
# In PostgreSQL configuration (managed by Patroni)
synchronous_standby_names = 'FIRST 2 (node2,node3,node4)'
# or
synchronous_standby_names = 'ANY 2 (node2,node3,node4)'
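
With Patroni you normally do not edit synchronous_standby_names by hand; instead you raise synchronous_node_count and let Patroni generate the standby list. A minimal sketch, assuming the cluster is named postgres and that your patronictl version supports the --set option:

TEXT
# Ask Patroni to maintain two synchronous standbys; Patroni rewrites synchronous_standby_names itself
patronictl edit-config postgres --set synchronous_mode=true --set synchronous_node_count=2

# Inspect what Patroni generated
sudo -u postgres psql -c "SHOW synchronous_standby_names;"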

2. Configuring Synchronous Replication

2.1. Enable synchronous mode in Patroni

Edit cluster configuration:

TEXT
patronictl edit-config postgres
# This opens an editor to modify the cluster-wide configuration

Configuration example:

TEXT
loop_wait: 10
ttl: 30
retry_timeout: 10
synchronous_mode: true
synchronous_mode_strict: false
synchronous_node_count: 1

postgresql:
  parameters:
    wal_level: replica
    hot_standby: "on"
    max_wal_senders: 10
    max_replication_slots: 10
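
After saving, the new values are written to the DCS and picked up by every member. A quick way to confirm the change took effect (assuming the cluster is named postgres):

TEXT
# Show the cluster-wide configuration currently stored in the DCS
patronictl show-config postgres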

2.2. Designate synchronous replicas

Patroni selects synchronous standbys automatically from the healthy replicas. To prevent a node from ever being chosen as a synchronous standby, set the nosync tag in that node's patroni.yml:

TEXT
# On a node that should never be picked as a synchronous standby
tags:
  nosync: true

2.3. Verify synchronous replication

Check synchronous standby status:

TEXT
# On primary
sudo -u postgres psql -c "
SELECT application_name, 
       sync_state, 
       sync_priority,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn
FROM pg_stat_replication;
"

Output:

TEXT
 application_name | sync_state | sync_priority |   state   | sent_lsn  | write_lsn | flush_lsn 
------------------+------------+---------------+-----------+-----------+-----------+-----------
 node2            | sync       |             1 | streaming | 0/2000000 | 0/2000000 | 0/2000000
 node3            | async      |             0 | streaming | 0/2000000 | 0/2000000 | 0/2000000
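
The same status can be cross-checked from Patroni's side; with synchronous mode enabled, patronictl reports the chosen replica with the Sync Standby role (cluster name postgres assumed, as in the rest of this lesson):

TEXT
patronictl list postgres
# One replica should appear with Role = Sync Standby; the others remain plain replicas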

3. Monitoring Replication Lag

3.1. Using pg_stat_replication

On primary node:

TEXT
# Check replication lag for all replicas
sudo -u postgres psql -c "
SELECT 
  application_name,
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
"

3.2. Using pg_stat_wal_receiver (on replicas)

TEXT
# On each replica
sudo -u postgres psql -c "
SELECT 
  status,
  written_lsn,
  flushed_lsn,
  last_msg_receipt_time,
  latest_end_lsn,
  latest_end_time
FROM pg_stat_wal_receiver;
"

# Byte difference between WAL received and WAL replayed (pg_current_wal_lsn() cannot be called on a standby)
sudo -u postgres psql -c "
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes;
"

3.3. Real-time lag monitoring

Continuous monitoring script:

TEXT
#!/bin/bash
# monitor_lag.sh

while true; do
  echo "=== $(date) ==="
  sudo -u postgres psql -c "
  SELECT 
    application_name,
    client_addr,
    state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag,
    replay_lag
  FROM pg_stat_replication;
  " | grep -v "rows)"
  sleep 5
done

3.4. Setting up lag alerts

Create monitoring function:

TEXT
-- Create this function in the database used for monitoring
CREATE OR REPLACE FUNCTION check_replication_lag(max_lag_mb INTEGER DEFAULT 10)
RETURNS TABLE(
  application_name TEXT,
  client_addr INET,
  lag_mb NUMERIC,
  status TEXT
) AS $$
BEGIN
  RETURN QUERY
  SELECT 
    pr.application_name,
    pr.client_addr,
    pg_wal_lsn_diff(pg_current_wal_lsn(), pr.replay_lsn)::NUMERIC / 1024^2 AS lag_mb,
    CASE 
      WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), pr.replay_lsn) > (max_lag_mb * 1024^2) 
      THEN 'ALERT' 
      ELSE 'OK' 
    END AS status
  FROM pg_stat_replication pr;
END;
$$ LANGUAGE plpgsql;
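
A minimal usage example, run on the primary, flagging any replica that is more than 50 MB behind:

TEXT
sudo -u postgres psql -c "SELECT * FROM check_replication_lag(50);"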

4. Managing Replication Slots

4.1. Understanding replication slots

Replication slots ensure that WAL files are not removed from the primary until they have been received by the replica.

Types of slots:

  • Physical slots: Used by physical replicas
  • Logical slots: Used by logical replication

4.2. Check replication slots

TEXT
# On primary
sudo -u postgres psql -c "SELECT * FROM pg_replication_slots;"

4.3. Managing slots with Patroni

Automatic slot management:

Patroni can automatically create and manage replication slots:

TEXT
bootstrap:
  dcs:
    postgresql:
      use_slots: true  # Enable automatic slot management

Manual slot management:

TEXT
-- Create a physical replication slot
SELECT pg_create_physical_replication_slot('node4_slot');

-- Drop a slot
SELECT pg_drop_replication_slot('node4_slot');

-- List slots
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;
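
Inactive or forgotten slots hold WAL on disk indefinitely, so it is worth checking how much WAL each slot is retaining. A small sketch, run on the primary:

TEXT
sudo -u postgres psql -c "
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
"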

5. Handling Replication Issues

5.1. High replication lag

Symptoms:

  • Large values in pg_stat_replication lag columns
  • Replicas falling behind primary significantly
  • Potential timeout issues

Causes and solutions:

A. Network issues

TEXT
# Check network performance
iperf3 -c 10.0.1.12  # Test from primary to replica

# Check for packet loss
ping -c 100 10.0.1.12

B. Replica performance issues

TEXT
# Check replica system resources
top
iostat -x 1

# Check PostgreSQL activity on replica
sudo -u postgres psql -c "SELECT * FROM pg_stat_activity;"

C. Large transactions or bulk operations

TEXT
# Monitor for large transactions
sudo -u postgres psql -c "
SELECT pid, 
       now() - pg_stat_activity.query_start AS duration, 
       query,
       state
FROM pg_stat_activity 
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
"

5.2. Replication stopped

Check replica status:

TEXT
# Check if replica is in recovery
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"

# Check recovery status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"

# Check for errors in PostgreSQL logs
sudo tail -f /var/lib/postgresql/18/data/log/postgresql-*.log
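
If the replica has fallen so far behind that the WAL it needs is gone, one recovery option is to rebuild it from the primary; a sketch assuming the cluster is named postgres and the broken member is node3:

TEXT
# Reinitialize the replica: Patroni wipes its data directory and re-clones it from the primary
patronictl reinit postgres node3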

5.3. WAL file issues

Check for missing WAL files:

TEXT
# On replica - check WAL receiver status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"

# Check WAL files on primary
ls -la /var/lib/postgresql/18/data/pg_wal/ | head -10

5.4. Authentication issues

Verify replication user:

TEXT
# Check replication user exists on primary
sudo -u postgres psql -c "\du" | grep replicator

# Check pg_hba.conf allows replication
sudo -u postgres psql -c "SHOW hba_file;"
# Then check the file content
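
For reference, a replication connection needs a matching pg_hba.conf entry. Patroni normally writes this file itself, but the entry looks roughly like the sketch below (subnet and auth method are assumptions based on this lesson's lab network):

TEXT
# Allow the replicator role to open replication connections from the cluster subnet
host  replication  replicator  10.0.1.0/24  scram-sha-256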

6. Performance Tuning for Replication

6.1. WAL configuration for replication

TEXT
postgresql:
  parameters:
    # WAL settings for replication
    wal_level: replica
    max_wal_senders: 10
    max_replication_slots: 10
    wal_keep_size: "512MB"  # Keep WAL files longer for replicas
    
    # Archive settings (optional)
    archive_mode: "on"
    archive_command: "cp %p /var/lib/postgresql/18/archive/%f"
    archive_timeout: 300

6.2. Replication-specific parameters

TEXT
postgresql:
  parameters:
    # Replica settings
    hot_standby: "on"
    max_standby_archive_delay: 30000ms
    max_standby_streaming_delay: 30000ms
    wal_receiver_status_interval: 10s
    hot_standby_feedback: "on"
    
    # Performance settings
    wal_sender_timeout: 60s
    wal_receiver_timeout: 120s
    tcp_keepalives_idle: 600
    tcp_keepalives_interval: 30
    tcp_keepalives_count: 3

6.3. Network optimization

For high-latency networks:

TEXT
# In PostgreSQL configuration
tcp_keepalives_idle = 600
tcp_keepalives_interval = 30
tcp_keepalives_count = 3

# These help maintain connections over unstable networks

7. Advanced Replication Features

7.1. Cascading replication

Setting up replicas of replicas to reduce load on primary:

TEXT
# On the intermediate replica: allow it to act as a WAL sender for downstream nodes
postgresql:
  parameters:
    max_wal_senders: 5
    max_replication_slots: 5

# On the downstream replica: stream from the intermediate replica instead of the primary
# by pointing the replicatefrom tag at that member's Patroni name in patroni.yml
tags:
  replicatefrom: node2
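
To confirm the cascade works, query pg_stat_replication on the intermediate replica; the downstream member should appear there rather than on the primary:

TEXT
# Run on the intermediate replica (node2 in the sketch above)
sudo -u postgres psql -c "SELECT application_name, state, sent_lsn FROM pg_stat_replication;"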

7.2. Logical replication

For partial replication of specific tables:

TEXT
# Requires wal_level = logical
postgresql:
  parameters:
    wal_level: logical

# Create publication on primary
CREATE PUBLICATION my_publication FOR TABLE users, orders;

# Create subscription on replica
CREATE SUBSCRIPTION my_subscription 
CONNECTION 'host=primary port=5432 dbname=mydb' 
PUBLICATION my_publication;
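
Once the subscription exists, its progress can be checked on the subscriber:

TEXT
# On the subscriber
sudo -u postgres psql -d mydb -c "SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;"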

8. Replication Monitoring and Alerting

8.1. Key metrics to monitor

TEXT
# Replication lag (bytes)
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)

# Replication status
SELECT status FROM pg_stat_wal_receiver;  -- Should be 'streaming'

# WAL sender processes
SELECT count(*) FROM pg_stat_replication;

# Replication slots
SELECT count(*) FROM pg_replication_slots WHERE active = true;

8.2. Prometheus metrics for replication

TEXT
# Sample Prometheus rules
groups:
  - name: replication
    rules:
      - alert: HighReplicationLag
        expr: pg_replication_lag_bytes > 104857600  # 100MB
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag is high on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} bytes on {{ $labels.instance }}"
      
      - alert: ReplicationStopped
        expr: pg_replication_is_replicating == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication has stopped on {{ $labels.instance }}"
          description: "Replication has stopped on {{ $labels.instance }}"

8.3. Log-based monitoring

Important log patterns to watch:

TEXT
# Check for replication errors
grep -i "replication\|wal receiver\|startup process" /var/lib/postgresql/18/data/log/postgresql-*.log

# Common error patterns
tail -f /var/lib/postgresql/18/data/log/postgresql-*.log | grep -E "FATAL|ERROR.*replication|could not connect"

9. Troubleshooting Common Issues

9.1. Replica not connecting to primary

Check connectivity:

TEXT
# From replica to primary
telnet 10.0.1.11 5432

# Check firewall rules
sudo ufw status
sudo iptables -L

Check authentication:

TEXT
# Test connection
psql "host=10.0.1.11 port=5432 user=replicator replication=1" -c "IDENTIFY_SYSTEM;"

9.2. High memory usage on replicas

TEXT
# Check resource-related settings on the replica
sudo -u postgres psql -c "SHOW shared_buffers;"
sudo -u postgres psql -c "SHOW max_worker_processes;"

# Consider reducing maintenance_work_mem on replicas
postgresql:
  parameters:
    maintenance_work_mem: "256MB"  # Lower than primary

9.3. Disk space issues with WAL files

TEXT
# Check WAL directory size
du -sh /var/lib/postgresql/18/data/pg_wal/

# Clean up old WAL files if needed
# Usually handled by replication slots and archive settings

10. Lab Exercises

Lab 1: Configure synchronous replication

Tasks:

  1. Switch cluster to synchronous mode
  2. Verify one replica becomes synchronous
  3. Test the write performance difference (see the pgbench sketch below)
  4. Simulate synchronous replica failure
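
For task 3, pgbench gives a quick before/after comparison; a minimal sketch, assuming a scratch database named benchdb exists:

TEXT
# Initialize pgbench tables, then run a short write-heavy benchmark in each replication mode
sudo -u postgres pgbench -i -s 10 benchdb
sudo -u postgres pgbench -c 8 -j 2 -T 60 benchdb
# Compare the reported TPS with synchronous_mode off and on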

Lab 2: Monitor replication lag

Tasks:

  1. Set up continuous lag monitoring
  2. Generate load on primary
  3. Monitor lag increase and recovery
  4. Create alerting rules

Lab 3: Handle replication failure

Tasks:

  1. Simulate a network interruption (see the iptables sketch below)
  2. Monitor replication status
  3. Restore replication
  4. Verify data consistency
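
For task 1, dropping traffic between a replica and the primary with iptables is a simple way to simulate an interruption (primary at 10.0.1.11, as elsewhere in this lesson):

TEXT
# On a replica: block traffic to the primary, observe the effect, then remove the rule
sudo iptables -A OUTPUT -p tcp -d 10.0.1.11 --dport 5432 -j DROP
# ...watch pg_stat_wal_receiver and patronictl list while replication is blocked...
sudo iptables -D OUTPUT -p tcp -d 10.0.1.11 --dport 5432 -j DROP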

11. Summary

Key Takeaways

  • Asynchronous replication: better performance, potential data loss
  • Synchronous replication: zero data loss, some performance impact
  • Monitoring: essential for detecting replication issues early
  • Replication slots: prevent WAL cleanup before replicas have received the files
  • Performance tuning: balance safety against performance
  • Troubleshooting: detect and resolve replication issues quickly

Best Practices

  •  Monitor replication lag continuously
  •  Set up alerts for high lag or stopped replication
  •  Configure appropriate WAL retention settings
  •  Use replication slots to prevent WAL file loss
  •  Test failover procedures regularly
  •  Plan for network interruptions and performance issues
  •  Document replication topology and procedures

Preparation for Lesson 11

Lesson 11 will cover Patroni callbacks:

  • Understanding callback mechanisms
  • Implementing custom scripts
  • Using callbacks for automation
  • Monitoring and alerting with callbacks
