Lesson 13: Automatic Failover

Learning Objectives

After this lesson, you will:

  • Understand failure detection mechanisms in Patroni
  • Understand the leader election process
  • Track failover timeline in detail
  • Test automatic failover in multiple scenarios
  • Troubleshoot failover issues
  • Optimize failover speed

1. Automatic Failover Overview

1.1. What is Automatic Failover?

Automatic Failover = the process of automatically promoting a replica to primary when the current primary fails.

Key characteristics:

  • ⚡ Automatic: No manual intervention required
  • 🚨 Unplanned: Triggered by a failure
  • ⏱️ Fast: 30-60 seconds (configurable)
  • 🎯 Goal: Minimize downtime

When does a failover occur?

  • Primary server crashes
  • PostgreSQL process dies
  • Network partition
  • Hardware failure
  • DCS connection lost
  • Disk full

1.2. Manual vs Automatic Failover

TEXT
WITHOUT Patroni (Manual Failover):
1. Primary fails
2. DBA gets paged
3. DBA investigates (10-30 mins)
4. DBA manually promotes replica
5. DBA updates application config
6. Service restored
Total downtime: 30+ minutes ❌

WITH Patroni (Automatic Failover):
1. Primary fails
2. Patroni detects (10 seconds)
3. Patroni promotes best replica (20 seconds)
4. Service restored automatically
Total downtime: 30-60 seconds ✅

2. Failure Detection Mechanism

2.1. Health Check Loop

Patroni health check components:

TEXT
# Pseudo-code of Patroni's main loop
while True:
    # 1. Check PostgreSQL health
    if not check_postgresql_running():
        log.error("PostgreSQL is down!")
        handle_postgres_failure()
    
    # 2. Check DCS connectivity
    if not can_connect_to_dcs():
        log.error("Lost DCS connection!")
        demote_if_leader()
    
    # 3. Update status in DCS
    update_member_status_in_dcs()
    
    # 4. Check leader lock (if I'm leader)
    if is_leader:
        renew_leader_lock()
    
    # 5. Sleep until next check
    sleep(loop_wait)  # Default: 10 seconds
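
The same checks are visible from the outside through Patroni's REST API (Lesson 12): the health endpoints answer with HTTP 200 or 503 depending on the node's state, which is what load balancers and scripts typically poll. The address below follows the node IPs used in this series:

TEXT
# Is PostgreSQL up and running on this node?
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8008/health

# Is this node currently the primary? (200 = yes, 503 = no)
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8008/primary

# Is this node a healthy replica?
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8008/replica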

2.2. PostgreSQL Health Checks

Patroni performs multiple checks:

A. Process check

TEXT
# Check if postgres process exists
ps aux | grep postgres

# Check if accepting connections
pg_isready -h localhost -p 5432

B. Connection check

TEXT
# Try to connect to PostgreSQL (simplified)
import psycopg2

try:
    conn = psycopg2.connect("host=localhost port=5432 dbname=postgres connect_timeout=3")
    conn.close()
except psycopg2.OperationalError:
    # Connection failed!
    mark_unhealthy()

C. Replication check (on replicas)

TEXT
-- Check if replication is active (run on the replica)
SELECT status, written_lsn, flushed_lsn
FROM pg_stat_wal_receiver;

-- Replay position on the replica
SELECT pg_last_wal_replay_lsn();

-- If no row or status != 'streaming' → Problem!

D. Timeline check

TEXT
-- Ensure timeline matches cluster
SELECT timeline_id FROM pg_control_checkpoint();
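
Checks A-D can be run by hand with a small shell probe; this is only a convenience sketch following the hosts and queries above, not something Patroni itself executes:

TEXT
#!/bin/bash
# Manual health probe: process, connection, replication, timeline
HOST=localhost
PORT=5432

# A. Process / connection acceptance
pg_isready -h "$HOST" -p "$PORT" || echo "Not accepting connections"

# B. Connection + role
sudo -u postgres psql -h "$HOST" -p "$PORT" -Atc "SELECT pg_is_in_recovery();" \
  || echo "Connection check failed"

# C. Replication (meaningful on replicas only)
sudo -u postgres psql -h "$HOST" -p "$PORT" -Atc \
  "SELECT coalesce((SELECT status FROM pg_stat_wal_receiver), 'no wal receiver');"

# D. Timeline
sudo -u postgres psql -h "$HOST" -p "$PORT" -Atc \
  "SELECT timeline_id FROM pg_control_checkpoint();"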

2.3. DCS Connectivity Check

Why DCS connectivity matters:

TEXT
If node loses DCS connection:
- Cannot renew leader lock
- Cannot read cluster state
- MUST demote to avoid split-brain

Even if PostgreSQL is healthy!

DCS check example:

TEXT
# Check etcd health
etcdctl endpoint health

# Try to read/write
etcdctl get /service/postgres/leader
etcdctl put /service/postgres/members/node1 "healthy"

2.4. Leader Lock TTL

TTL (Time-To-Live) mechanism:

TEXT
# In patroni.yml
bootstrap:
  dcs:
    ttl: 30  # Leader lock expires after 30 seconds
    loop_wait: 10  # Check every 10 seconds

Timeline:

TEXT
T+0s:  Leader acquires lock (TTL=30s, expires at T+30s)
T+10s: Leader renews lock (expiry extended to T+40s)
T+15s: Leader crashes
T+20s: Scheduled renewal never happens
T+30s: Another renewal is missed
T+40s: Lock expires in DCS
T+41s: Replicas detect the missing leader key
T+42s: Leader election begins
T+45s: New leader elected

Total time from failure to new leader: ~30 seconds here (worst case ≈ TTL + loop_wait ≈ 40 seconds)
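
In etcd v3 the TTL is implemented as a lease attached to the leader key, so you can watch it count down between renewals. A sketch assuming the /service/postgres scope used in earlier lessons (the lease ID below is a placeholder; copy the one from your own output):

TEXT
# List active leases (Patroni's leader lease is among them)
etcdctl lease list

# Show remaining TTL and the keys attached to a lease
etcdctl lease timetolive 694d77aabcdef01 --keys
# Example output (format may vary by etcd version):
# lease 694d77aabcdef01 granted with TTL(30s), remaining(22s),
#   attached keys([/service/postgres/leader])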

3. Leader Election Process

3.1. Election Trigger

Leader election starts when:

TEXT
Condition 1: Leader lock expired in DCS
  /service/postgres/leader → key not found

Condition 2: No active leader for > loop_wait
  All replicas see: no leader heartbeat

Condition 3: Explicit failover
  patronictl failover command
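
Condition 3 can be triggered from the command line; a hedged example (flag names may vary slightly between patronictl versions):

TEXT
# Manually trigger a failover, promoting node2
patronictl failover postgres --candidate node2 --force

# Without --force, patronictl asks for interactive confirmation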

3.2. Candidate Selection Criteria

Patroni selects best replica based on:

Priority 1: Replication State

TEXT
-- Prefer streaming over archive recovery
SELECT status FROM pg_stat_wal_receiver;

streaming > in archive recovery > stopped

Priority 2: Replication Lag

TEXT
-- Replica with lowest lag wins
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes;

-- Example:
-- node2: lag = 0 bytes      ← BEST
-- node3: lag = 1048576 bytes (1MB)

Priority 3: Timeline

TEXT
-- Higher timeline = more recent
SELECT timeline_id FROM pg_control_checkpoint();

-- node2: timeline = 3  ← BEST
-- node3: timeline = 2

Priority 4: Tags

TEXT
# In patroni.yml
tags:
  nofailover: false  # true = never promote this node
  noloadbalance: false
  failover_priority: 100  # Higher = preferred; 0 = never promote (Patroni 3.2+)

Example:

TEXT
# node2 - Preferred candidate
tags:
  nofailover: false
  failover_priority: 200

# node3 - Lower priority
tags:
  nofailover: false
  failover_priority: 100

# node4 - Never promote
tags:
  nofailover: true

Priority 5: Synchronous State

TEXT
-- Synchronous replica preferred over async
SELECT sync_state FROM pg_stat_replication;

sync > potential > async
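
You can check these criteria yourself before (or during) an election; a sketch using patronictl's extended listing and the REST API lag check (the 1 MB threshold is just an example value):

TEXT
# Show lag, timeline and tags for every member
patronictl list postgres --extended

# Ask a replica directly whether its lag is below a threshold
# (HTTP 200 = eligible under that limit, 503 = not)
curl -s -o /dev/null -w "%{http_code}\n" "http://10.0.1.13:8008/replica?lag=1048576"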

3.3. Race Condition and Lock Acquisition

Multiple replicas compete:

TEXT
Scenario: Primary fails, 2 replicas compete

T+0s: node2 and node3 both detect no leader
T+0.1s: Both try to acquire lock simultaneously

In etcd (atomic operation):
  node2 tries: PUT /service/postgres/leader "node2" if_not_exists
  node3 tries: PUT /service/postgres/leader "node3" if_not_exists

Result: Only ONE succeeds (etcd atomic guarantee)
  node2: SUCCESS → becomes leader
  node3: FAILED → remains replica

DCS guarantees:

  • Atomicity: Only one node gets the lock
  • Consistency: All nodes see same leader
  • Isolation: No split-brain possible
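
The "only one PUT wins" behaviour above can be reproduced by hand with an etcd transaction: the put runs only if the key does not exist yet (create revision 0). A sketch of etcdctl's non-interactive txn form (compares, success requests, and failure requests separated by blank lines); the key path assumes the /service/postgres scope:

TEXT
etcdctl txn <<'EOF'
create("/service/postgres/leader") = "0"

put /service/postgres/leader "node2"

get /service/postgres/leader

EOF
# If the key already exists, the compare fails and only the failure branch
# (the get) runs - so at most one contender can create the key.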

3.4. Promotion Process

Winner node executes:

TEXT
Step 1: Acquire leader lock in DCS
  etcdctl put /service/postgres/leader '{"node": "node2", ...}'

Step 2: Run pre_promote callback (if configured)
  /var/lib/postgresql/callbacks/pre_promote.sh

Step 3: Promote PostgreSQL
  Method A: pg_ctl promote -D /var/lib/postgresql/18/data
  Method B: SELECT pg_promote();
  Method C: Create trigger file (old method)

Step 4: Wait for promotion complete
  Check: SELECT pg_is_in_recovery();
  Should return: false (not in recovery = primary)

Step 5: Update timeline
  Timeline increments: 1 → 2

Step 6: Run post_promote callback
  Update DNS, load balancer, send notifications

Step 7: Run on_role_change callback
  /var/lib/postgresql/callbacks/on_role_change.sh master

Step 8: Update DCS with new primary info
  xlog_location, timeline, conn_url

Step 9: Start accepting writes
  PostgreSQL now in read-write mode
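
Patroni invokes callback scripts with three arguments: the action (e.g. on_role_change), the new role, and the cluster name. A minimal sketch of an on_role_change callback that logs the event and pings a webhook (the webhook URL is a placeholder; wire the script up under postgresql.callbacks.on_role_change in patroni.yml and make it executable by the postgres user):

TEXT
#!/bin/bash
# /var/lib/postgresql/callbacks/on_role_change.sh
# Called by Patroni as: on_role_change.sh <action> <role> <cluster-name>
ACTION="$1"
ROLE="$2"
CLUSTER="$3"

logger -t patroni-callback "action=$ACTION role=$ROLE cluster=$CLUSTER host=$(hostname)"

if [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; then
  # Placeholder: update DNS / load balancer or send a notification here
  curl -s -X POST -d "New primary: $(hostname)" https://example.com/notify || true
fi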

4. Detailed Failover Timeline

4.1. Complete Failover Flow

TEXT
Timeline of Automatic Failover

T+0s: NORMAL OPERATION
  Primary (node1): Healthy, serving requests
  Replica (node2): Streaming from node1, lag=0
  Replica (node3): Streaming from node1, lag=0

T+1s: PRIMARY FAILS
  node1: PostgreSQL crashes / server dies
  node2: Still streaming (buffered data)
  node3: Still streaming (buffered data)

T+5s: REPLICATION BROKEN
  node2: WAL receiver error "connection lost"
  node3: WAL receiver error "connection lost"
  node1: Still holds leader lock (TTL not expired yet)

T+10s: HEALTH CHECK CYCLE 1
  node2: Check replication → FAILED, wait...
  node3: Check replication → FAILED, wait...
  node1: Cannot renew lock (crashed)

T+20s: HEALTH CHECK CYCLE 2
  node2: Still cannot connect to node1
  node3: Still cannot connect to node1

T+30s: LEADER LOCK EXPIRES
  DCS: /service/postgres/leader TTL expired → key deleted
  node2: Detects no leader key
  node3: Detects no leader key

T+31s: CANDIDATE ELECTION BEGINS
  node2: Check eligibility → YES (lag=0, priority=100)
  node3: Check eligibility → YES (lag=1MB, priority=100)

T+32s: RACE FOR LOCK
  node2: PUT /service/postgres/leader "node2" → SUCCESS
  node3: PUT /service/postgres/leader "node3" → FAILED

T+33s: NODE2 PROMOTES
  node2: Run pre_promote callback
  node2: pg_promote() executed
  node2: Timeline: 1 → 2

T+35s: PROMOTION COMPLETE
  node2: pg_is_in_recovery() → false
  node2: Now accepting writes
  node2: Run post_promote & on_role_change callbacks

T+36s: NODE3 RECONFIGURES
  node3: Detects new leader = node2
  node3: Update primary_conninfo → node2:5432
  node3: Restart WAL receiver

T+38s: REPLICATION RESTORED
  node3: Connected to node2
  node3: Streaming at timeline 2

T+40s: CLUSTER OPERATIONAL
  Primary: node2 (was replica)
  Replica: node3 (following node2)
  Failed: node1 (needs manual intervention)

Total Downtime: ~35-40 seconds ✅

4.2. Factors Affecting Failover Speed

Configuration parameters:

TEXT
# Fast failover configuration
bootstrap:
  dcs:
    ttl: 20  # Faster detection (default: 30)
    loop_wait: 5  # More frequent checks (default: 10)
    retry_timeout: 5  # Quick retries (default: 10)

Trade-offs:

| Parameter | Lower value                            | Higher value                      |
|-----------|----------------------------------------|-----------------------------------|
| ttl       | Faster failover, more false positives  | More stable, slower failover      |
| loop_wait | Faster detection, more CPU/DCS traffic | Less DCS traffic, slower reaction |

Typical configurations:

TEXT
# Conservative (stable, slower)
ttl: 30
loop_wait: 10
→ Failover: ~40-50s

# Balanced (recommended)
ttl: 20
loop_wait: 10
→ Failover: ~30-40s

# Aggressive (fast, sensitive)
ttl: 15
loop_wait: 5
→ Failover: ~20-30s
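
Note that bootstrap.dcs only applies when the cluster is first initialized; on a running cluster these values live in the DCS and are changed with patronictl edit-config. A sketch using the non-interactive -s/--set form (flag availability depends on your patronictl version):

TEXT
# Apply the balanced profile to a running cluster
patronictl edit-config postgres -s ttl=20 -s loop_wait=10 -s retry_timeout=10 --force

# Verify
patronictl show-config postgres | grep -E "ttl|loop_wait|retry_timeout"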

5. Testing Automatic Failover

5.1. Test Scenario 1: PostgreSQL Process Kill

Simulate PostgreSQL crash:

TEXT
# On current primary (node1), find the postmaster PID
# (pg_backend_pid() would only return the PID of your own psql session)
sudo head -1 /var/lib/postgresql/18/data/postmaster.pid
# Returns: 12345

sudo kill -9 12345  # Kill the postmaster

# Or kill all postgres processes
sudo pkill -9 postgres

Note: if Patroni is still running on node1, it may simply restart PostgreSQL locally instead of failing over (governed by master_start_timeout, default 300s). To force a failover in this scenario, stop Patroni as well: sudo systemctl stop patroni.

Monitor failover:

TEXT
# Terminal 1: Watch cluster status
watch -n 1 "patronictl list postgres"

# Terminal 2: Monitor logs
sudo journalctl -u patroni -f

# Terminal 3: Test connectivity (point at your HAProxy/VIP if you have one;
# a direct node IP will keep reporting DOWN after the primary moves)
while true; do
  if psql -h 10.0.1.11 -U app_user -d myapp -c "SELECT 1;" >/dev/null 2>&1; then
    echo "$(date): UP"
  else
    echo "$(date): DOWN"
  fi
  sleep 1
done

Expected timeline:

TEXT
00:00 - Cluster healthy
00:01 - Kill postgres on node1
00:02-00:30 - Patroni detecting failure
00:31 - node2 elected as new primary
00:35 - Cluster operational (node2 = primary)
00:36+ - Connections working again
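
To turn the UP/DOWN log into an actual downtime number, a small wrapper can record when writes first fail and when they succeed again. A sketch; 10.0.1.10 is an assumed HAProxy/VIP address, replace it with your own entry point:

TEXT
#!/bin/bash
# Measure write downtime during failover
ENDPOINT=10.0.1.10   # assumed HAProxy/VIP address
DOWN_AT=""

while true; do
  if psql -h "$ENDPOINT" -U app_user -d myapp -Atc "SELECT 1;" >/dev/null 2>&1; then
    if [ -n "$DOWN_AT" ]; then
      echo "Downtime: $(( $(date +%s) - DOWN_AT )) seconds"
      DOWN_AT=""
    fi
  else
    [ -z "$DOWN_AT" ] && DOWN_AT=$(date +%s) && echo "$(date): first failure detected"
  fi
  sleep 1
done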

5.2. Test Scenario 2: Network Partition

Simulate network partition:

TEXT
# On primary node, block traffic to other nodes
sudo iptables -A INPUT -s 10.0.1.12 -j DROP
sudo iptables -A INPUT -s 10.0.1.13 -j DROP
sudo iptables -A OUTPUT -d 10.0.1.12 -j DROP
sudo iptables -A OUTPUT -d 10.0.1.13 -j DROP

# Or block etcd access specifically
sudo iptables -A OUTPUT -p tcp --dport 2379 -j DROP

Observe:

TEXT
# On node1 (isolated)
patronictl list postgres
# Will show errors / cannot connect to cluster

# On node2/node3
patronictl list postgres
# Will show node1 as unavailable
# After TTL: node2 or node3 becomes leader

Recovery:

TEXT
# Restore network on node1
sudo iptables -F

# node1 should automatically rejoin as replica
patronictl list postgres

5.3. Test Scenario 3: Server Reboot

Simulate server crash:

TEXT
# On primary node
sudo reboot

# Or immediate crash
echo c | sudo tee /proc/sysrq-trigger

Expected behavior: Same as Scenario 1, but the node is completely unavailable, so Patroni cannot restart PostgreSQL locally and a failover is certain.

5.4. Test Scenario 4: Disk Full

Simulate disk full:

TEXT
# Fill up disk on primary
dd if=/dev/zero of=/var/lib/postgresql/bigfile bs=1M count=10000

# PostgreSQL will fail once it can no longer write WAL

Patroni will detect that PostgreSQL is unhealthy and trigger a failover.

5.5. Test Scenario 5: DCS Failure

Stop etcd on all nodes:

TEXT
# On all 3 etcd nodes
sudo systemctl stop etcd

Expected behavior:

TEXT
- All Patroni nodes lose DCS connection
- Current primary DEMOTES (safety mechanism)
- Cluster enters "read-only" state
- NO failover possible (no DCS consensus)

Recovery:
- Restart etcd cluster
- Patroni auto-recovers
- Leader election happens
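
After restarting etcd, confirm that quorum is back and that Patroni has re-acquired a leader:

TEXT
# etcd member health and status
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table

# Leader key should reappear once Patroni re-elects
etcdctl get /service/postgres/leader

# Patroni view
patronictl list postgres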

6. Verify Failover Success

6.1. Check cluster status

TEXT
# List cluster members
patronictl list postgres

# Expected after failover:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host          | Role    | State   | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11:5432| Replica | stopped |  1 |           | ← Old primary
# | node2  | 10.0.1.12:5432| Leader  | running |  2 |           | ← NEW primary
# | node3  | 10.0.1.13:5432| Replica | running |  2 |         0 |
# +--------+---------------+---------+---------+----+-----------+

# Note timeline changed: 1 → 2

6.2. Verify new primary

TEXT
# Check primary role
sudo -u postgres psql -h 10.0.1.12 -c "SELECT pg_is_in_recovery();"
# pg_is_in_recovery
# ------------------
#  f                  ← false = PRIMARY

# Check timeline
sudo -u postgres psql -h 10.0.1.12 -c "SELECT timeline_id FROM pg_control_checkpoint();"
# timeline_id
# ------------
#           2

# Check replication from new primary
sudo -u postgres psql -h 10.0.1.12 -c "SELECT * FROM pg_stat_replication;"
# Should show node3 replicating from node2

6.3. Test write operations

TEXT
# Insert data on new primary
sudo -u postgres psql -h 10.0.1.12 -d testdb -c "
INSERT INTO test_table (data) VALUES ('After failover at ' || NOW());
"

# Verify on replica
sudo -u postgres psql -h 10.0.1.13 -d testdb -c "
SELECT * FROM test_table ORDER BY id DESC LIMIT 5;
"
# Should see new data replicated

6.4. Check failover history

TEXT
# View history via REST API
curl -s http://10.0.1.12:8008/history | jq

# Output:
# [
#   [1, 67108864, "no recovery target specified", "2024-11-25T10:00:00+00:00"],
#   [2, 134217728, "no recovery target specified", "2024-11-25T11:30:15+00:00"]
# ]
#   ↑ Timeline 2 = Failover event

# Check Patroni logs
sudo journalctl -u patroni --since "30 minutes ago" | grep -i "promote\|failover\|leader"

7. Troubleshooting Failover Issues

7.1. Issue: Failover not happening

Symptoms: Primary down but no promotion.

Possible causes:

A. All replicas tagged nofailover

TEXT
# Check tags (tags are node-local settings, visible in extended output)
patronictl list postgres --extended

# Or inspect each node's local config file
grep -A5 "tags:" /etc/patroni/patroni.yml   # path depends on your setup

# If all replicas have nofailover: true
# Solution: set nofailover: false in at least one replica's patroni.yml,
# then reload that node
sudo systemctl reload patroni

B. Replication lag too high

TEXT
# Check maximum_lag_on_failover
patronictl show-config postgres | grep maximum_lag_on_failover

# If replica lag > threshold, won't promote
# Solution: Increase threshold or wait for lag to decrease
patronictl edit-config postgres
# Set: maximum_lag_on_failover: 10485760  # 10MB

C. No quorum in DCS

TEXT
# Check etcd health
etcdctl endpoint health --cluster

# If etcd cluster has no quorum (< 2 of 3 healthy)
# Solution: Fix etcd cluster first
sudo systemctl restart etcd

D. synchronous_mode_strict enabled

TEXT
# If enabled and no sync replica available
synchronous_mode: true
synchronous_mode_strict: true  # ← Problem!

# Primary cannot be demoted, replicas cannot be promoted
# Solution: Disable strict mode
patronictl edit-config postgres
# Set: synchronous_mode_strict: false

7.2. Issue: Multiple failovers (flapping)

Symptoms: Cluster keeps failing over repeatedly.

Possible causes:

A. Network instability

TEXT
# Check network between nodes
ping -c 100 10.0.1.12
# High packet loss → false failovers

# Solution: Fix network or increase TTL
patronictl edit-config postgres
# Set: ttl: 40  # More tolerant

B. TTL too aggressive

TEXT
# ttl: 10  ← Too low!
# Every small network blip causes failover

# Solution: Increase TTL
ttl: 30  # More stable

C. Resource exhaustion

TEXT
# Check CPU/Memory on nodes
top
free -h

# If resources exhausted, health checks timeout
# Solution: Scale up resources or reduce load

7.3. Issue: Slow failover

Symptoms: Takes >60 seconds to failover.

Diagnosis:

TEXT
# Check TTL and loop_wait
patronictl show-config postgres | grep -E "ttl|loop_wait"

# Estimate worst-case failover time:
# Worst case ≈ TTL + (loop_wait × 2) + promotion_time
# Example: 30 + (10 × 2) + 5 = 55 seconds

Optimization:

TEXT
# Reduce TTL and loop_wait
bootstrap:
  dcs:
    ttl: 20  # Was 30
    loop_wait: 5  # Was 10

# Expected failover: ~30-35 seconds
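
Rather than relying on the formula alone, you can measure the real "no primary" window by polling the /primary endpoint of every node once per second; a sketch using the node addresses from this series:

TEXT
#!/bin/bash
# Print how long the cluster had no responding primary
NODES="10.0.1.11 10.0.1.12 10.0.1.13"
START=""

while true; do
  FOUND=""
  for n in $NODES; do
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 1 "http://$n:8008/primary")
    [ "$code" = "200" ] && FOUND=$n
  done
  if [ -z "$FOUND" ]; then
    [ -z "$START" ] && START=$(date +%s) && echo "$(date): no primary"
  elif [ -n "$START" ]; then
    echo "Primary back on $FOUND after $(( $(date +%s) - START )) seconds"
    START=""
  fi
  sleep 1
done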

7.4. Issue: Data loss after failover

Symptoms: Some recent transactions missing.

Cause: Asynchronous replication + replica lag.

Verification:

TEXT
# Check replication mode
patronictl show-config postgres | grep synchronous_mode

# Check lag before failover
# (check logs for lag_in_mb at failover time)
sudo journalctl -u patroni | grep "lag_in_mb"

Prevention:

TEXT
# Enable synchronous replication
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false  # Allow degradation
    
    postgresql:
      parameters:
        synchronous_commit: 'on'
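
After enabling synchronous_mode, verify from the primary that one replica is actually reported as synchronous:

TEXT
# Which standby is synchronous right now?
sudo -u postgres psql -h 10.0.1.12 -c \
  "SELECT application_name, sync_state FROM pg_stat_replication;"

# Patroni manages this parameter when synchronous_mode is on
sudo -u postgres psql -h 10.0.1.12 -c "SHOW synchronous_standby_names;"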

8. Metrics and Monitoring

8.1. Key failover metrics

TEXT
-- Time since last failover
SELECT timeline_id, 
       pg_postmaster_start_time(),
       now() - pg_postmaster_start_time() AS uptime
FROM pg_control_checkpoint();

-- Replication lag (pre-failover indicator)
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- Database activity (a drop in commits/active backends indicates downtime)
SELECT datname, numbackends, xact_commit, xact_rollback
FROM pg_stat_database;

8.2. Alerting rules

Prometheus alert examples (metric names depend on your exporter and Patroni version; verify them against your /metrics output, shown below the rules):

TEXT
groups:
  - name: patroni_failover
    rules:
      - alert: PatroniFailoverDetected
        expr: increase(patroni_timeline[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Patroni failover detected"
          description: "Timeline changed, indicating failover"
      
      - alert: PatroniNoLeader
        expr: count(patroni_patroni_info{role="master"}) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No Patroni leader"
          description: "Cluster has no primary"
      
      - alert: PatroniHighReplicationLag
        expr: patroni_replication_lag_bytes > 10485760  # 10MB
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag"
          description: "Replica lag > 10MB, risk of data loss on failover"

9. Best Practices

✅ DO

  1. Test failover regularly - Monthly in staging, quarterly in production
  2. Monitor replication lag - Alert if lag > 1MB
  3. Use synchronous replication for zero data loss
  4. Set synchronous_mode_strict: false - Allow degradation
  5. Configure proper TTL - Balance speed vs stability (20-30s)
  6. Have >= 2 replicas - Allow failover even if one replica down
  7. Monitor DCS health - etcd cluster must be healthy
  8. Document runbooks - Procedures for manual intervention
  9. Log failover events - Track patterns and issues
  10. Capacity planning - Replicas should handle primary load

❌ DON'T

  1. Don't use single replica - No failover option
  2. Don't ignore lag - High lag = data loss risk
  3. Don't set TTL too low (<15s) - False positives
  4. Don't skip testing - Untested failover = downtime risk
  5. Don't manually promote during automatic failover - Let Patroni handle it
  6. Don't forget about old primary - Needs rejoin/rebuild
  7. Don't run without monitoring - Must know when failover happens
  8. Don't overload DCS - Separate etcd cluster recommended

10. Lab Exercises

Lab 1: Basic failover test

Tasks:

  1. Record the baseline: patronictl list
  2. Stop the primary: sudo systemctl stop patroni
  3. Time the failover with watch -n 1 patronictl list
  4. Document the downtime duration
  5. Verify the new primary accepts writes
  6. Restart the old primary and verify it rejoins

Lab 2: Network partition test

Tasks:

  1. Use iptables to partition the primary from the cluster
  2. Observe DCS behavior
  3. Verify only one primary exists after the partition
  4. Restore the network and verify automatic recovery

Lab 3: Optimize failover speed

Tasks:

  1. Run a baseline test with default settings (ttl: 30)
  2. Reduce ttl to 20 and test again
  3. Reduce ttl to 15 and test again
  4. Compare failover times
  5. Evaluate the trade-offs (speed vs false positives)

Lab 4: Failover under load

Tasks:

  1. Generate load with pgbench: pgbench -c 10 -T 300
  2. While the load is running, stop the primary
  3. Count connection errors in the pgbench output
  4. Calculate the availability percentage
  5. Document the user impact

11. Summary

Key Concepts

✅ Automatic Failover = Self-healing without manual intervention

✅ Detection = Health checks + DCS connectivity + TTL expiration

✅ Election = Best replica based on lag, timeline, tags

✅ Promotion = pg_promote() + timeline increment + role change

✅ Timeline = Incremented on every promotion, prevents WAL divergence

✅ TTL = Trade-off between speed and stability

Failover Checklist

  •  Primary failure detected
  •  Leader lock expired in DCS
  •  Best replica identified
  •  Leader lock acquired
  •  PostgreSQL promoted successfully
  •  Timeline incremented
  •  Callbacks executed
  •  Other replicas reconfigured
  •  Replication restored
  •  Cluster operational

Next Steps

Lesson 14 will cover planned switchover:

  • Planned maintenance scenarios
  • Zero-downtime switchover process
  • Graceful vs immediate switchover
  • Best practices for planned failover
