Replication Management
Objectives
After this lesson, you will:
- Understand synchronous vs asynchronous replication modes
- Configure and manage replication settings in Patroni
- Monitor replication lag and performance
- Troubleshoot common replication issues
- Handle replication failures and recovery
1. Understanding Replication Modes
1.1. Asynchronous replication (default)
In asynchronous mode, the primary server does not wait for confirmation from replicas before acknowledging a transaction as committed.
Characteristics:
- ✅ High performance: Minimal impact on write performance
- ✅ High availability: No blocking if replicas are down
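- ❌ Potential data loss: Transactions committed on the primary but not yet received by any replica may be lost on primary failure
- ❌ No freshness guarantee: Replicas may lag behind the primary, so reads from them can return stale data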
Configuration:
When to use:
- Applications that can tolerate some data loss
- High-throughput environments where performance is critical
- Cross-datacenter replication where network latency is high
1.2. Synchronous replication
In synchronous mode, the primary server waits for confirmation from at least one synchronous replica before acknowledging a transaction.
Characteristics:
- ✅ Zero data loss: Confirmed transactions are guaranteed to be on at least one synchronous replica
- ✅ Strong consistency: Data is guaranteed to be on multiple nodes
- ❌ Performance impact: Higher transaction latency
- ❌ Availability risk: Commits can block while no synchronous replica is available
Configuration:
Synchronous mode variants:
- synchronous_mode_strict: false: Degrades to async if no sync standby is available
- synchronous_mode_strict: true: Refuses writes if no sync standby is available
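The difference between the strict and non-strict variants can be sketched as a small decision function. This is a simplified model for illustration only, not Patroni's implementation; the function name `write_outcome` and its return labels are hypothetical.

```python
def write_outcome(synchronous_mode: bool, strict: bool, sync_standbys_available: int) -> str:
    """Describe how a commit behaves under the given settings (simplified model)."""
    if not synchronous_mode:
        # Asynchronous replication: the primary never waits for a replica.
        return "async-commit"
    if sync_standbys_available > 0:
        # At least one synchronous standby can confirm the commit.
        return "sync-commit"
    # Synchronous mode is on, but no synchronous standby is available.
    return "blocked" if strict else "async-commit"


if __name__ == "__main__":
    for strict in (False, True):
        print(f"strict={strict}: {write_outcome(True, strict, 0)}")
```

The model shows the trade-off: non-strict mode favours availability when the last synchronous standby disappears, strict mode favours durability.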
1.3. Quorum-based synchronous replication
PostgreSQL supports multiple synchronous standbys with quorum-based confirmation:
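As an illustration, the value of PostgreSQL's `synchronous_standby_names` parameter for quorum (`ANY`) or priority (`FIRST`) confirmation can be assembled as below. This is a minimal sketch: the helper names and the node names are hypothetical, and when Patroni's synchronous mode is enabled Patroni normally manages this parameter itself.

```python
def build_sync_standby_names(method: str, num_sync: int, standbys: list[str]) -> str:
    """Build a synchronous_standby_names value such as: ANY 2 ("node1", "node2", "node3")."""
    method = method.upper()
    if method not in ("ANY", "FIRST"):
        raise ValueError("method must be ANY (quorum) or FIRST (priority)")
    # Sanity check chosen for this sketch: asking for more confirmations
    # than there are listed standbys could never be satisfied.
    if not 1 <= num_sync <= len(standbys):
        raise ValueError("num_sync must be between 1 and the number of standbys")
    quoted = ", ".join(f'"{name}"' for name in standbys)
    return f"{method} {num_sync} ({quoted})"


def quorum_satisfied(num_sync: int, standbys: list[str], confirmed: set[str]) -> bool:
    """With ANY, a commit completes once num_sync listed standbys have confirmed."""
    return len(confirmed & set(standbys)) >= num_sync


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]  # hypothetical member names
    print(build_sync_standby_names("any", 2, nodes))
    print(quorum_satisfied(2, nodes, {"node1", "node3"}))
```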
2. Configuring Synchronous Replication
2.1. Enable synchronous mode in Patroni
Edit cluster configuration:
Configuration example:
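One way to picture the change is as a small set of dynamic configuration keys. The sketch below builds them as JSON, which is the form accepted by Patroni's REST API (`PATCH /config`); the same keys can be entered through `patronictl edit-config`. The helper name `sync_mode_patch` is hypothetical, and the default values are assumptions for illustration.

```python
import json


def sync_mode_patch(enabled: bool, strict: bool = False, node_count: int = 1) -> str:
    """Return a JSON document with Patroni's synchronous replication settings."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    payload = {
        "synchronous_mode": enabled,
        "synchronous_mode_strict": strict,
        "synchronous_node_count": node_count,
    }
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # Enable synchronous mode with one synchronous standby, non-strict.
    print(sync_mode_patch(True))
```

Building the payload separately from sending it makes it easy to review the exact settings before they are applied to a live cluster.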
2.2. Designate synchronous replicas
Using tags to designate which nodes should be synchronous:
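Patroni's `nosync` tag excludes a member from being chosen as a synchronous standby. The effect of that tag can be sketched as a simple filter; this models only the tag check (Patroni also considers whether a replica is streaming), and the function and member names are hypothetical.

```python
def sync_candidates(member_tags: dict[str, dict]) -> list[str]:
    """Return the members that are still eligible to become synchronous standbys."""
    return sorted(
        name
        for name, tags in member_tags.items()
        if not tags.get("nosync", False)  # nosync: true removes the member
    )


if __name__ == "__main__":
    members = {
        "node2": {},                 # eligible
        "node3": {"nosync": True},   # e.g. a remote replica kept asynchronous
        "node4": {"nosync": False},  # eligible
    }
    print(sync_candidates(members))
```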
2.3. Verify synchronous replication
Check synchronous standby status:
Output:
3. Monitoring Replication Lag
3.1. Using pg_stat_replication
On primary node:
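The LSN columns in `pg_stat_replication` (for example `sent_lsn` and `replay_lsn`) are strings such as `0/3000060`: two hexadecimal numbers holding the high and low 32 bits of a byte position in the WAL. On the server, `pg_wal_lsn_diff()` computes the difference; the following sketch does the same arithmetic client-side. The function names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN like '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to process (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


if __name__ == "__main__":
    print(lag_bytes("0/3000060", "0/3000000"))
```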
3.2. Using pg_stat_wal_receiver (on replicas)
3.3. Real-time lag monitoring
Continuous monitoring script:
3.4. Setting up lag alerts
Create monitoring function:
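A lag alert ultimately maps a measured lag onto a severity level. The sketch below shows that mapping as a client-side helper; the threshold values (16 MiB and 256 MiB) and all names are assumptions for illustration and should be tuned to the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagThresholds:
    # Hypothetical defaults: one WAL segment for warning, sixteen for critical.
    warning_bytes: int = 16 * 1024 * 1024
    critical_bytes: int = 256 * 1024 * 1024


def classify_lag(lag_bytes: int, thresholds: LagThresholds = LagThresholds()) -> str:
    """Map a replication lag in bytes to 'ok', 'warning' or 'critical'."""
    if lag_bytes >= thresholds.critical_bytes:
        return "critical"
    if lag_bytes >= thresholds.warning_bytes:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for sample in (1024, 32 * 1024 * 1024, 512 * 1024 * 1024):
        print(sample, classify_lag(sample))
```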
4. Managing Replication Slots
4.1. Understanding replication slots
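Replication slots ensure that WAL files are not removed from the primary until they have been received by the replica. The trade-off is that a slot whose replica stays disconnected keeps retaining WAL and can fill the primary's disk unless retention is capped (for example with max_slot_wal_keep_size on PostgreSQL 13 and later).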
Types of slots:
- Physical slots: Used by physical replicas
- Logical slots: Used by logical replication
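To make the retention behaviour concrete, the following sketch estimates how much WAL each slot is holding back, given rows shaped like `pg_replication_slots` (`slot_name`, `active`, `restart_lsn`) and the primary's current WAL position. The function names and the sample values are hypothetical.

```python
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN like '0/3000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_report(current_lsn: str, slots: list[dict]) -> list[dict]:
    """Return slots ordered by retained WAL, flagging inactive slots that hold WAL."""
    now = lsn_to_int(current_lsn)
    report = []
    for slot in slots:
        restart: Optional[str] = slot["restart_lsn"]  # NULL for a slot that was never used
        retained = 0 if restart is None else max(0, now - lsn_to_int(restart))
        report.append({
            "slot_name": slot["slot_name"],
            "retained_bytes": retained,
            "at_risk": (not slot["active"]) and retained > 0,
        })
    return sorted(report, key=lambda row: row["retained_bytes"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"slot_name": "node2", "active": True, "restart_lsn": "0/3000000"},
        {"slot_name": "node3", "active": False, "restart_lsn": "0/1000000"},
    ]
    for row in slot_report("0/3000060", sample):
        print(row)
```

An inactive slot with a growing `retained_bytes` value is the usual precursor to disk space problems on the primary.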
4.2. Check replication slots
4.3. Managing slots with Patroni
Automatic slot management:
Patroni can automatically create and manage replication slots:
Manual slot management:
5. Handling Replication Issues
5.1. High replication lag
Symptoms:
- Large values in pg_stat_replication lag columns
- Replicas falling behind primary significantly
- Potential timeout issues
Causes and solutions:
A. Network issues
B. Replica performance issues
C. Large transactions or bulk operations
5.2. Replication stopped
Check replica status:
5.3. WAL file issues
Check for missing WAL files:
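When checking for a missing file, it helps to know which WAL segment contains a given LSN. WAL file names are 24 hexadecimal digits: 8 for the timeline and 16 for the segment position. The sketch below derives the name under the assumption of the default 16 MiB segment size; the function name is hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size; an assumption here


def wal_segment_name(timeline: int, lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file containing the byte at the given LSN."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)
    segment_number = position // segment_size
    # Number of segments that fit into one 4 GiB "log file" unit.
    segments_per_log = 0x100000000 // segment_size
    log_id = segment_number // segments_per_log
    segment_in_log = segment_number % segments_per_log
    return f"{timeline:08X}{log_id:08X}{segment_in_log:08X}"


if __name__ == "__main__":
    print(wal_segment_name(1, "0/3000028"))
```

The resulting name can be compared with the contents of the `pg_wal` directory or the WAL archive to see whether the segment a replica is asking for still exists.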
5.4. Authentication issues
Verify replication user:
6. Performance Tuning for Replication
6.1. WAL configuration for replication
6.2. Replication-specific parameters
6.3. Network optimization
For high-latency networks:
7. Advanced Replication Features
7.1. Cascading replication
Setting up replicas of replicas to reduce load on primary:
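In Patroni, a member's `replicatefrom` tag names the member it should stream from. The resulting topology can be sketched as a walk from a replica up to the primary. This is an illustrative model only: members without the tag are assumed to stream directly from the primary, and the function and node names are hypothetical.

```python
def replication_chain(member: str, replicatefrom: dict[str, str], primary: str) -> list[str]:
    """Return the path [member, upstream, ..., primary] implied by replicatefrom tags."""
    chain = [member]
    seen = {member}
    current = member
    while current != primary:
        # Members without a replicatefrom tag stream from the primary.
        current = replicatefrom.get(current, primary)
        if current in seen:
            raise ValueError(f"replicatefrom tags form a cycle at {current!r}")
        seen.add(current)
        chain.append(current)
    return chain


if __name__ == "__main__":
    tags = {"node3": "node2"}  # node3 cascades from node2
    print(replication_chain("node3", tags, primary="node1"))
```

Each extra hop in the chain removes one streaming connection from the primary but adds to the lag seen by the downstream replica.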
7.2. Logical replication
For partial replication of specific tables:
8. Replication Monitoring and Alerting
8.1. Key metrics to monitor
8.2. Prometheus metrics for replication
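As a minimal sketch of what a replication metric looks like to Prometheus, the function below renders per-replica lag in the text exposition format. The metric name `pg_replica_lag_bytes` and the `replica` label are hypothetical; an existing exporter will use its own names.

```python
def render_lag_metrics(lag_by_replica: dict[str, int]) -> str:
    """Render per-replica lag as Prometheus text exposition lines."""
    lines = [
        "# HELP pg_replica_lag_bytes Replication lag per replica in bytes.",
        "# TYPE pg_replica_lag_bytes gauge",
    ]
    for name in sorted(lag_by_replica):
        lines.append(f'pg_replica_lag_bytes{{replica="{name}"}} {lag_by_replica[name]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_lag_metrics({"node2": 96, "node3": 33554528}), end="")
```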
8.3. Log-based monitoring
Important log patterns to watch:
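A few PostgreSQL messages that commonly accompany replication problems can be matched with a small scanner like the one below. The list is illustrative rather than exhaustive, exact wording can vary between PostgreSQL versions, and the category labels are hypothetical.

```python
import re
from typing import Optional

# (label, pattern) pairs; the labels are made up for this sketch.
REPLICATION_LOG_PATTERNS = [
    ("wal_segment_removed", re.compile(r"requested WAL segment \S+ has already been removed")),
    ("walsender_timeout", re.compile(r"terminating walsender process due to replication timeout")),
    ("primary_unreachable", re.compile(r"could not connect to the primary server")),
    ("stream_interrupted", re.compile(r"could not receive data from WAL stream")),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return the label of the first matching pattern, or None if nothing matches."""
    for label, pattern in REPLICATION_LOG_PATTERNS:
        if pattern.search(line):
            return label
    return None


if __name__ == "__main__":
    sample = "FATAL:  requested WAL segment 000000010000000000000003 has already been removed"
    print(classify_log_line(sample))
```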
9. Troubleshooting Common Issues
9.1. Replica not connecting to primary
Check connectivity:
Check authentication:
9.2. High memory usage on replicas
9.3. Disk space issues with WAL files
10. Lab Exercises
Lab 1: Configure synchronous replication
Tasks:
- Switch cluster to synchronous mode
- Verify one replica becomes synchronous
- Test write performance difference
- Simulate synchronous replica failure
Lab 2: Monitor replication lag
Tasks:
- Set up continuous lag monitoring
- Generate load on primary
- Monitor lag increase and recovery
- Create alerting rules
Lab 3: Handle replication failure
Tasks:
- Simulate network interruption
- Monitor replication status
- Restore replication
- Verify data consistency
11. Summary
Key Takeaways
- ✅ Asynchronous replication: Better performance, potential data loss
- ✅ Synchronous replication: Zero data loss, performance impact
- ✅ Monitoring: Essential for detecting replication issues early
- ✅ Replication slots: Prevent WAL file cleanup before replicas receive them
- ✅ Performance tuning: Balance between safety and performance
- ✅ Troubleshooting: Quick detection and resolution of replication issues
Best Practices
- Monitor replication lag continuously
- Set up alerts for high lag or stopped replication
- Configure appropriate WAL retention settings
- Use replication slots to prevent WAL file loss
- Test failover procedures regularly
- Plan for network interruptions and performance issues
- Document replication topology and procedures
Preparation for Lesson 11
Lesson 11 will cover Patroni callbacks:
- Understanding callback mechanisms
- Implementing custom scripts
- Using callbacks for automation
- Monitoring and alerting with callbacks