Automatic Failover
Learning Objectives
After this lesson, you will:
- Understand failure detection mechanisms in Patroni
- Understand the leader election process
- Track failover timeline in detail
- Test automatic failover in multiple scenarios
- Troubleshoot failover issues
- Optimize failover speed
1. Automatic Failover Overview
1.1. What is Failover?
Automatic Failover = the process of automatically promoting a replica to primary when the current primary fails.
Characteristics:
- ⚡ Automatic: no manual intervention required
- 🚨 Unplanned: triggered by a failure, not scheduled
- ⏱️ Fast: 30-60 seconds (configurable)
- 🎯 Goal: minimize downtime
When does failover occur?
- Primary server crashes
- PostgreSQL process dies
- Network partition
- Hardware failure
- DCS connection lost
- Disk full
1.2. Failover vs Replication
2. Failure Detection Mechanism
2.1. Health Check Loop
Patroni health check components: on every HA-loop iteration (every loop_wait seconds, 10 by default) each agent checks its local PostgreSQL instance (process and connectivity), its replication state, and its connection to the DCS, then updates its member key; the current leader additionally refreshes the leader lock.
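A conceptual sketch of one loop iteration is shown below; it is illustrative only (not Patroni's actual code), and the socket directory and etcd address are assumptions:
```bash
# Illustrative only: roughly what each HA-loop iteration evaluates before updating the DCS.
while true; do
  # Is the local PostgreSQL accepting connections?
  pg_isready -q -h /var/run/postgresql && pg_ok=yes || pg_ok=no
  # Is the DCS (etcd here) reachable and healthy?
  curl -fsS http://127.0.0.1:2379/health >/dev/null && dcs_ok=yes || dcs_ok=no
  echo "$(date -Is) postgres=${pg_ok} dcs=${dcs_ok}"
  sleep 10   # loop_wait
done
```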
2.2. PostgreSQL Health Checks
Patroni performs multiple checks:
A. Process check
B. Connection check
C. Replication check (on replicas)
D. Timeline check
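The same signals can be checked by hand. The commands below approximate what Patroni evaluates rather than reproduce its implementation; paths and the database user are assumptions:
```bash
# A. Process check: is the postmaster alive?
pg_ctl status -D /var/lib/postgresql/data

# B. Connection check: does the server answer a trivial query?
psql -U postgres -c "SELECT 1;"

# C. Replication check (run on a replica): is WAL streaming in?
psql -U postgres -c "SELECT status, sender_host FROM pg_stat_wal_receiver;"

# D. Timeline check: which timeline is this node on?
psql -U postgres -c "SELECT timeline_id FROM pg_control_checkpoint();"
```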
2.3. DCS Connectivity Check
Why DCS connectivity matters: the leader lock lives in the DCS. A primary that cannot reach the DCS cannot renew its lock and must demote itself to avoid split-brain, and a replica that cannot reach the DCS cannot take part in an election.
DCS check example:
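A sketch of such a check; the etcd endpoint, the /service namespace, and the cluster name demo-cluster are assumptions, and the /cluster REST endpoint assumes a reasonably recent Patroni:
```bash
# Is etcd healthy from this node's point of view?
etcdctl --endpoints=http://10.0.0.21:2379 endpoint health

# What does the DCS currently hold for the cluster? (default namespace /service/)
etcdctl --endpoints=http://10.0.0.21:2379 get --prefix /service/demo-cluster/

# Patroni's own view of the cluster, via its REST API:
curl -s http://127.0.0.1:8008/cluster | jq '.'
```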
2.4. Leader Lock TTL
TTL (Time-To-Live) mechanism: the leader key is written to the DCS with a TTL (30 seconds by default). The primary refreshes it on every HA-loop iteration; if the key is not refreshed before the TTL expires, the DCS removes it and the replicas start a leader election.
Timeline (with ttl=30, loop_wait=10): the primary refreshes the key at t=0, t=10, t=20; if it crashes at t=21, the last refresh was at t=20, so the key expires around t=50 and the election starts, typically completing within a few seconds.
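To see the values that drive this behavior on a running cluster (assuming patronictl can locate its configuration):
```bash
# ttl, loop_wait and retry_timeout live in the dynamic (DCS-level) configuration:
patronictl show-config | grep -E "ttl|loop_wait|retry_timeout"
```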
3. Leader Election Process
3.1. Election Trigger
Leader election starts when:
- The leader key in the DCS expires (the primary failed to refresh it within the TTL)
- The leader key is released (the primary demoted itself, or a manual failover/switchover was requested)
3.2. Candidate Selection Criteria
Patroni selects best replica based on:
Priority 1: Replication State - the replica must be up and actively streaming
Priority 2: Replication Lag - lag must not exceed maximum_lag_on_failover (1 MB by default)
Priority 3: Timeline - the replica should be on the latest timeline
Priority 4: Tags - nodes tagged nofailover: true are never promoted
Example: see the candidate-inspection sketch after this list.
Priority 5: Synchronous State - with synchronous_mode enabled, only synchronous standbys are eligible
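A rough way to inspect the same signals Patroni weighs when choosing a candidate; the node address is a placeholder and the exact JSON fields vary by version:
```bash
# Role, state, timeline and lag for every member:
patronictl list

# Per-node detail (state, timeline, WAL positions) via the REST API:
curl -s http://10.0.0.12:8008/patroni | jq '.'

# To exclude a node from promotion, tag it in that node's patroni.yml and reload:
#   tags:
#     nofailover: true
```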
3.3. Race Condition and Lock Acquisition
Multiple replicas compete: when the leader key disappears, every eligible replica immediately tries to create it. Only the first create succeeds; the others see that the key already exists and remain replicas.
DCS guarantees:
- Atomicity: Only one node gets the lock
- Consistency: All nodes see same leader
- Isolation: No split-brain possible
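Conceptually, acquiring the lock is an atomic "create the key only if it does not exist" operation. A hand-rolled etcd illustration follows; it is not what Patroni literally runs (Patroni also attaches a TTL lease), and the key path and member name are invented for the example:
```bash
# Put-if-absent against etcd: the put succeeds only if the leader key does not
# exist yet (create revision 0); otherwise the get shows who already holds it.
etcdctl txn <<'EOF'
create("/service/demo-cluster/leader") = "0"

put /service/demo-cluster/leader "pg-node2"

get /service/demo-cluster/leader

EOF
```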
3.4. Promotion Process
Winner node executes: it promotes its local PostgreSQL (pg_promote() or pg_ctl promote), which ends recovery and switches to a new timeline; Patroni then updates the leader and member keys in the DCS and runs the on_role_change callback if one is configured.
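Under the hood this is ordinary PostgreSQL machinery; a simplified sketch (Patroni drives these steps itself, and the data directory path is an assumption):
```bash
# End recovery and become writable (PostgreSQL 12+):
psql -U postgres -c "SELECT pg_promote();"
# Equivalent low-level command:
#   pg_ctl promote -D /var/lib/postgresql/data

# Afterwards the node is out of recovery and on a new timeline:
psql -U postgres -c "SELECT pg_is_in_recovery();"
psql -U postgres -c "SELECT timeline_id FROM pg_control_checkpoint();"
```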
4. Failover Timeline Detailed
4.1. Complete Failover Flow
Primary fails → health checks fail → leader lock expires in the DCS → eligible replicas start an election → the best candidate acquires the lock → it promotes PostgreSQL onto a new timeline → callbacks run → the remaining replicas reconfigure to stream from the new primary → the cluster is healthy again.
4.2. Factors Affecting Failover Speed
Configuration parameters: ttl, loop_wait, retry_timeout, and maximum_lag_on_failover, all stored in the dynamic (DCS-level) configuration and changeable at runtime with patronictl edit-config.
Trade-offs:
| Parameter | Lower Value | Higher Value |
|---|---|---|
| TTL | Faster failover, more false positives | More stable, slower failover |
| loop_wait | Faster detection, more CPU/network and DCS traffic | Less DCS traffic, slower reaction |
Typical configurations:
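As a sketch only; the numbers below are common starting points rather than recommendations, and assume the patronictl edit-config --set syntax:
```bash
# Balanced starting point (failover typically completes in 30-45 s):
patronictl edit-config -s ttl=30 -s loop_wait=10 -s retry_timeout=10

# More aggressive (faster failover, more sensitive to short network blips):
patronictl edit-config -s ttl=20 -s loop_wait=5 -s retry_timeout=5
```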
5. Testing Automatic Failover
5.1. Test Scenario 1: PostgreSQL Process Kill
Simulate PostgreSQL crash:
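One way to simulate the crash (the PGDATA path is an assumption; note that Patroni may simply restart PostgreSQL locally rather than fail over, depending on master_start_timeout):
```bash
# Kill the postmaster abruptly on the current primary; the first line of
# postmaster.pid holds the postmaster PID.
sudo kill -9 "$(sudo head -1 /var/lib/postgresql/data/postmaster.pid)"
```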
Monitor failover:
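For example, from another node (run these in separate terminals):
```bash
# Watch roles change in real time:
watch -n 1 patronictl list
# Follow the Patroni log:
sudo journalctl -u patroni -f
```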
Expected timeline: detection within one loop_wait, leader-lock expiry after the TTL (about 30 s with defaults), promotion a few seconds after that; total write downtime is typically 30-60 seconds.
5.2. Test Scenario 2: Network Partition
Simulate network partition:
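For example, on the primary, drop traffic to the other members and to etcd; addresses and ports are placeholders, lab use only:
```bash
# Block the other Patroni members:
sudo iptables -A INPUT  -s 10.0.0.12 -j DROP
sudo iptables -A INPUT  -s 10.0.0.13 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.12 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.13 -j DROP
# Also cut the etcd client port so this node loses the DCS as well:
sudo iptables -A OUTPUT -p tcp --dport 2379 -j DROP
```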
Observe: the isolated primary loses its DCS connection and demotes itself to read-only; on the healthy side of the partition a replica is promoted, so only one writable primary exists at any time.
Recovery:
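Remove the rules added above and confirm the old primary rejoins as a replica:
```bash
sudo iptables -D INPUT  -s 10.0.0.12 -j DROP
sudo iptables -D INPUT  -s 10.0.0.13 -j DROP
sudo iptables -D OUTPUT -d 10.0.0.12 -j DROP
sudo iptables -D OUTPUT -d 10.0.0.13 -j DROP
sudo iptables -D OUTPUT -p tcp --dport 2379 -j DROP
patronictl list
```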
5.3. Test Scenario 3: Server Reboot
Simulate server crash:
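For example (the sysrq method needs kernel.sysrq enabled):
```bash
# Hard reset without a clean shutdown:
echo b | sudo tee /proc/sysrq-trigger
# A slightly gentler alternative:
#   sudo systemctl reboot --force
```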
Expected behavior: Same as Scenario 1, but node completely unavailable.
5.4. Test Scenario 4: Disk Full
Simulate disk full:
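For example (lab only; the mount point holding the data directory is an assumption):
```bash
# Fill the volume that holds the data directory:
sudo dd if=/dev/zero of=/var/lib/postgresql/fillup.tmp bs=1M status=progress
# Clean up afterwards:
sudo rm /var/lib/postgresql/fillup.tmp
```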
Patroni will detect PostgreSQL unhealthy → trigger failover.
5.5. Test Scenario 5: DCS Failure
Stop etcd on all nodes:
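For example (the service unit name may differ in your installation):
```bash
# Run on every etcd member:
sudo systemctl stop etcd
```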
Expected behavior: no failover happens, because no node can acquire the leader lock. Once the leader key's TTL expires, the primary demotes itself to read-only to avoid split-brain (unless Patroni's DCS failsafe mode is enabled); the cluster recovers automatically once etcd is back.
6. Verify Failover Success
6.1. Check cluster status
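For example:
```bash
patronictl list
# Expect exactly one member in the Leader role and the others running as replicas.
```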
6.2. Verify new primary
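Two quick checks (the host name pg-node2 is a placeholder):
```bash
# The Patroni REST API returns 200 on /primary only on the leader:
curl -s -o /dev/null -w '%{http_code}\n' http://pg-node2:8008/primary

# PostgreSQL on the new primary must no longer be in recovery (expect: f):
psql -h pg-node2 -U postgres -c "SELECT pg_is_in_recovery();"
```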
6.3. Test write operations
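For example, against the new primary or your usual client entry point (host name is a placeholder):
```bash
psql -h pg-node2 -U postgres <<'SQL'
CREATE TABLE IF NOT EXISTS failover_test (id serial PRIMARY KEY, noted_at timestamptz DEFAULT now());
INSERT INTO failover_test DEFAULT VALUES;
SELECT count(*) FROM failover_test;
SQL
```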
6.4. Check failover history
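For example (pass the cluster name to patronictl if it cannot infer it from its configuration):
```bash
# Timeline history recorded by Patroni (one new timeline per promotion):
patronictl history

# Cross-check the current timeline on the new primary:
psql -h pg-node2 -U postgres -c "SELECT timeline_id FROM pg_control_checkpoint();"
```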
7. Troubleshooting Failover Issues
7.1. Issue: Failover not happening
Symptoms: Primary down but no promotion.
Possible causes:
A. All replicas tagged nofailover
B. Replication lag too high
C. No quorum in DCS
D. synchronous_mode_strict enabled
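Quick checks for each of these causes; endpoints and paths are assumptions:
```bash
# A. Are replicas excluded by tags?
curl -s http://10.0.0.12:8008/patroni | jq '.'
grep -A3 "^tags:" /etc/patroni/patroni.yml

# B. Is replication lag above maximum_lag_on_failover?
patronictl list                                   # see the "Lag in MB" column
patronictl show-config | grep maximum_lag_on_failover

# C. Does the DCS still have quorum?
etcdctl --endpoints=http://10.0.0.21:2379 endpoint status -w table

# D. Is strict synchronous mode blocking promotion of asynchronous replicas?
patronictl show-config | grep synchronous_mode
```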
7.2. Issue: Multiple failovers (flapping)
Symptoms: Cluster keeps failing over repeatedly.
Possible causes:
A. Network instability
B. TTL too aggressive
C. Resource exhaustion
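While investigating, it can help to put the cluster into maintenance mode so Patroni stops initiating failovers; remember to resume afterwards:
```bash
patronictl pause     # maintenance mode: Patroni stops initiating failovers
# ...investigate network stability, TTL settings, CPU/IO pressure...
patronictl resume    # re-enable automatic failover
```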
7.3. Issue: Slow failover
Symptoms: Takes >60 seconds to failover.
Diagnosis:
Optimization:
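A rough diagnosis-and-tuning sketch; the service unit name and the values are assumptions, not recommendations:
```bash
# Diagnosis: correlate timestamps around the incident in the Patroni journal:
sudo journalctl -u patroni --since "30 min ago" | grep -Ei "leader|promot|demot|failover"

# Optimization: tighten the detection window; keep ttl >= loop_wait + 2 * retry_timeout:
patronictl edit-config -s ttl=20 -s loop_wait=5 -s retry_timeout=5
```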
7.4. Issue: Data loss after failover
Symptoms: Some recent transactions missing.
Cause: Asynchronous replication + replica lag.
Verification:
Prevention:
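A sketch for after-the-fact verification and for prevention; the data directory path is an assumption, and synchronous replication trades write latency for durability:
```bash
# Verification: where did the new timeline branch off, and how far had the old primary written?
patronictl history                               # branch-point LSN for each timeline switch
sudo -u postgres pg_controldata /var/lib/postgresql/data | grep -i "checkpoint location"

# Prevention: require a synchronous standby via Patroni:
patronictl edit-config -s synchronous_mode=true
```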
8. Metrics and Monitoring
8.1. Key failover metrics
- Time to detect the failure and time to promote (total write downtime)
- Number of failovers over time (a flapping indicator)
- Replication lag at the moment of failover (potential data loss)
- Time the cluster spends without a leader
8.2. Alerting rules
Prometheus alert examples:
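A minimal sketch, written to a rules file from the shell. The metric names are assumptions: patroni_master should come from Patroni's /metrics endpoint (verify the name for your version), and pg_replication_lag_bytes is a hypothetical exporter metric you would have to define:
```bash
cat > /etc/prometheus/rules/patroni-failover.yml <<'EOF'
groups:
  - name: patroni-failover
    rules:
      - alert: PatroniClusterHasNoLeader
        # patroni_master is 1 on the leader; verify the metric name for your Patroni version.
        expr: max(patroni_master) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Patroni cluster has had no leader for more than 1 minute"
      - alert: ReplicationLagHigh
        # pg_replication_lag_bytes is hypothetical: export
        # pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) from pg_stat_replication.
        expr: max(pg_replication_lag_bytes) > 1048576
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag exceeds 1 MB"
EOF
```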
9. Best Practices
✅ DO
- Test failover regularly - Monthly in staging, quarterly in production
- Monitor replication lag - Alert if lag > 1MB
- Use synchronous replication for zero data loss
- Set synchronous_mode_strict: false - Allow degradation
- Configure proper TTL - Balance speed vs stability (20-30s)
- Have >= 2 replicas - Allow failover even if one replica down
- Monitor DCS health - etcd cluster must be healthy
- Document runbooks - Procedures for manual intervention
- Log failover events - Track patterns and issues
- Capacity planning - Replicas should handle primary load
❌ DON'T
- Don't use single replica - No failover option
- Don't ignore lag - High lag = data loss risk
- Don't set TTL too low (<15s) - False positives
- Don't skip testing - Untested failover = downtime risk
- Don't manually promote during automatic failover - Let Patroni handle it
- Don't forget about old primary - Needs rejoin/rebuild
- Don't run without monitoring - Must know when failover happens
- Don't overload DCS - Separate etcd cluster recommended
10. Lab Exercises
Lab 1: Basic failover test
Tasks:
1. Record baseline: patronictl list
2. Stop primary: sudo systemctl stop patroni
3. Time the failover with watch -n 1 patronictl list
4. Document downtime duration
5. Verify new primary accepts writes
6. Restart old primary and verify rejoin
Lab 2: Network partition test
Tasks:
1. Use iptables to partition primary from cluster
2. Observe DCS behavior
3. Verify only one primary exists after partition
4. Restore network and verify automatic recovery
Lab 3: Optimize failover speed
Tasks:
1. Baseline test with default settings (TTL=30)
2. Reduce TTL to 20, test again
3. Reduce to 15, test again
4. Compare failover times
5. Evaluate trade-offs (speed vs false positives)
Lab 4: Failover under load
Tasks:
1. Generate load with pgbench: pgbench -c 10 -T 300
2. During load, stop primary
3. Count connection errors in pgbench output
4. Calculate availability percentage
5. Document user impact
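One possible way to run the load and count errors; the host, database, and user are placeholders, and pgbench must be initialized once beforehand:
```bash
# One-time initialization of the pgbench tables:
pgbench -i -s 10 -h haproxy.example.local -U postgres postgres

# 5-minute load; keep the output for analysis:
pgbench -c 10 -T 300 -h haproxy.example.local -U postgres postgres 2>&1 | tee pgbench.log

# Count connection/transaction errors after the failover:
grep -c -i -E "error|fatal|could not connect" pgbench.log
```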
11. Summary
Key Concepts
✅ Automatic Failover = Self-healing without manual intervention
✅ Detection = Health checks + DCS connectivity + TTL expiration
✅ Election = Best replica based on lag, timeline, tags
✅ Promotion = pg_promote() + timeline increment + role change
✅ Timeline = incremented on every promotion, preventing WAL divergence between old and new primary
✅ TTL = Trade-off between speed and stability
Failover Checklist
- Primary failure detected
- Leader lock expired in DCS
- Best replica identified
- Leader lock acquired
- PostgreSQL promoted successfully
- Timeline incremented
- Callbacks executed
- Other replicas reconfigured
- Replication restored
- Cluster operational
Next Steps
Lesson 14 will cover planned switchover:
- Planned maintenance scenarios
- Zero-downtime switchover process
- Graceful vs immediate switchover
- Best practices for planned failover