Lesson 13: Automatic Failover

Learning Objectives

After this lesson, you will:

  • Understand failure detection mechanisms in Patroni
  • Understand the leader election process
  • Track failover timeline in detail
  • Test automatic failover in multiple scenarios
  • Troubleshoot failover issues
  • Optimize failover speed

1. Automatic Failover Overview

1.1. What is Automatic Failover?

Automatic Failover = the process of automatically promoting a replica to primary when the current primary fails.

Key characteristics:

  • ⚡ Automatic: No manual intervention required
  • 🚨 Unplanned: Triggered by a failure
  • ⏱️ Fast: 30-60 seconds (configurable)
  • 🎯 Goal: Minimize downtime

When does a failover occur?

  • Primary server crashes
  • PostgreSQL process dies
  • Network partition
  • Hardware failure
  • DCS connection lost
  • Disk full

1.2. Manual vs Automatic Failover

TEXT
WITHOUT Patroni (Manual Failover):
1. Primary fails
2. DBA gets paged
3. DBA investigates (10-30 mins)
4. DBA manually promotes replica
5. DBA updates application config
6. Service restored
Total downtime: 30+ minutes ❌

WITH Patroni (Automatic Failover):
1. Primary fails
2. Patroni detects (10 seconds)
3. Patroni promotes best replica (20 seconds)
4. Service restored automatically
Total downtime: 30-60 seconds ✅

2. Failure Detection Mechanism

2.1. Health Check Loop

Patroni health check components:

TEXT
# Pseudo-code of Patroni's main loop
while True:
    # 1. Check PostgreSQL health
    if not check_postgresql_running():
        log.error("PostgreSQL is down!")
        handle_postgres_failure()
    
    # 2. Check DCS connectivity
    if not can_connect_to_dcs():
        log.error("Lost DCS connection!")
        demote_if_leader()
    
    # 3. Update status in DCS
    update_member_status_in_dcs()
    
    # 4. Check leader lock (if I'm leader)
    if is_leader:
        renew_leader_lock()
    
    # 5. Sleep until next check
    sleep(loop_wait)  # Default: 10 seconds
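
The same checks are visible from the outside through Patroni's REST API (Lesson 12): the health endpoints answer with HTTP 200 or 503 depending on the node's state, which is what load balancers and scripts typically poll. The address below follows the node IPs used in this series:

TEXT
# Is PostgreSQL up and running on this node?
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8008/health

# Is this node currently the primary? (200 = yes, 503 = no)
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8008/primary

# Is this node a healthy replica?
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.1.11:8008/replica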

2.2. PostgreSQL Health Checks

Patroni performs multiple checks:

A. Process check

TEXT
# Check if postgres process exists
ps aux | grep postgres

# Check if accepting connections
pg_isready -h localhost -p 5432

B. Connection check

TEXT
# Try to connect to PostgreSQL (simplified)
import psycopg2

try:
    conn = psycopg2.connect("host=localhost port=5432 dbname=postgres connect_timeout=3")
    conn.close()
except psycopg2.OperationalError:
    # Connection failed!
    mark_unhealthy()

C. Replication check (on replicas)

TEXT
-- Check if replication is active (run on the replica)
SELECT status, written_lsn, flushed_lsn
FROM pg_stat_wal_receiver;

-- Replay position on the replica
SELECT pg_last_wal_replay_lsn();

-- If no row or status != 'streaming' → Problem!

D. Timeline check

TEXT
-- Ensure timeline matches cluster
SELECT timeline_id FROM pg_control_checkpoint();
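
Checks A-D can be run by hand with a small shell probe; this is only a convenience sketch following the hosts and queries above, not something Patroni itself executes:

TEXT
#!/bin/bash
# Manual health probe: process, connection, replication, timeline
HOST=localhost
PORT=5432

# A. Process / connection acceptance
pg_isready -h "$HOST" -p "$PORT" || echo "Not accepting connections"

# B. Connection + role
sudo -u postgres psql -h "$HOST" -p "$PORT" -Atc "SELECT pg_is_in_recovery();" \
  || echo "Connection check failed"

# C. Replication (meaningful on replicas only)
sudo -u postgres psql -h "$HOST" -p "$PORT" -Atc \
  "SELECT coalesce((SELECT status FROM pg_stat_wal_receiver), 'no wal receiver');"

# D. Timeline
sudo -u postgres psql -h "$HOST" -p "$PORT" -Atc \
  "SELECT timeline_id FROM pg_control_checkpoint();"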

2.3. DCS Connectivity Check

Why DCS connectivity matters:

TEXT
If node loses DCS connection:
- Cannot renew leader lock
- Cannot read cluster state
- MUST demote to avoid split-brain

Even if PostgreSQL is healthy!

DCS check example:

TEXT
# Check etcd health
etcdctl endpoint health

# Try to read/write
etcdctl get /service/postgres/leader
etcdctl put /service/postgres/members/node1 "healthy"

2.4. Leader Lock TTL

TTL (Time-To-Live) mechanism:

TEXT
# In patroni.yml
bootstrap:
  dcs:
    ttl: 30  # Leader lock expires after 30 seconds
    loop_wait: 10  # Check every 10 seconds

Timeline:

TEXT
T+0s:  Leader acquires lock (TTL=30s, expires at T+30s)
T+10s: Leader renews lock (expiry extended to T+40s)
T+15s: Leader crashes
T+20s: Scheduled renewal never happens
T+30s: Another renewal is missed
T+40s: Lock expires in DCS
T+41s: Replicas detect the missing leader key
T+42s: Leader election begins
T+45s: New leader elected

Total time from failure to new leader: ~30 seconds here (worst case ≈ TTL + loop_wait ≈ 40 seconds)
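
In etcd v3 the TTL is implemented as a lease attached to the leader key, so you can watch it count down between renewals. A sketch assuming the /service/postgres scope used in earlier lessons (the lease ID below is a placeholder; copy the one from your own output):

TEXT
# List active leases (Patroni's leader lease is among them)
etcdctl lease list

# Show remaining TTL and the keys attached to a lease
etcdctl lease timetolive 694d77aabcdef01 --keys
# Example output (format may vary by etcd version):
# lease 694d77aabcdef01 granted with TTL(30s), remaining(22s),
#   attached keys([/service/postgres/leader])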

3. Leader Election Process

3.1. Election Trigger

Leader election starts when:

TEXT
Condition 1: Leader lock expired in DCS
  /service/postgres/leader → key not found

Condition 2: No active leader for > loop_wait
  All replicas see: no leader heartbeat

Condition 3: Explicit failover
  patronictl failover command
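
Condition 3 can be triggered from the command line; a hedged example (flag names may vary slightly between patronictl versions):

TEXT
# Manually trigger a failover, promoting node2
patronictl failover postgres --candidate node2 --force

# Without --force, patronictl asks for interactive confirmation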

3.2. Candidate Selection Criteria

Patroni selects best replica based on:

Priority 1: Replication State

TEXT
-- Prefer streaming over archive recovery
SELECT status FROM pg_stat_wal_receiver;

streaming > in archive recovery > stopped

Priority 2: Replication Lag

TEXT
-- Replica with lowest lag wins
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes;

-- Example:
-- node2: lag = 0 bytes      ← BEST
-- node3: lag = 1048576 bytes (1MB)

Priority 3: Timeline

TEXT
-- Higher timeline = more recent
SELECT timeline_id FROM pg_control_checkpoint();

-- node2: timeline = 3  ← BEST
-- node3: timeline = 2

Priority 4: Tags

TEXT
# In patroni.yml
tags:
  nofailover: false  # true = never promote this node
  noloadbalance: false
  failover_priority: 100  # Higher = preferred; 0 = never promote (Patroni 3.2+)

Example:

TEXT
# node2 - Preferred candidate
tags:
  nofailover: false
  failover_priority: 200

# node3 - Lower priority
tags:
  nofailover: false
  failover_priority: 100

# node4 - Never promote
tags:
  nofailover: true

Priority 5: Synchronous State

TEXT
-- Synchronous replica preferred over async
SELECT sync_state FROM pg_stat_replication;

sync > potential > async
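
You can check these criteria yourself before (or during) an election; a sketch using patronictl's extended listing and the REST API lag check (the 1 MB threshold is just an example value):

TEXT
# Show lag, timeline and tags for every member
patronictl list postgres --extended

# Ask a replica directly whether its lag is below a threshold
# (HTTP 200 = eligible under that limit, 503 = not)
curl -s -o /dev/null -w "%{http_code}\n" "http://10.0.1.13:8008/replica?lag=1048576"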

3.3. Race Condition and Lock Acquisition

Multiple replicas compete:

TEXT
Scenario: Primary fails, 2 replicas compete

T+0s: node2 and node3 both detect no leader
T+0.1s: Both try to acquire lock simultaneously

In etcd (atomic operation):
  node2 tries: PUT /service/postgres/leader "node2" if_not_exists
  node3 tries: PUT /service/postgres/leader "node3" if_not_exists

Result: Only ONE succeeds (etcd atomic guarantee)
  node2: SUCCESS → becomes leader
  node3: FAILED → remains replica

DCS guarantees:

  • Atomicity: Only one node gets the lock
  • Consistency: All nodes see same leader
  • Isolation: No split-brain possible
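
The "only one PUT wins" behaviour above can be reproduced by hand with an etcd transaction: the put runs only if the key does not exist yet (create revision 0). A sketch of etcdctl's non-interactive txn form (compares, success requests, and failure requests separated by blank lines); the key path assumes the /service/postgres scope:

TEXT
etcdctl txn <<'EOF'
create("/service/postgres/leader") = "0"

put /service/postgres/leader "node2"

get /service/postgres/leader

EOF
# If the key already exists, the compare fails and only the failure branch
# (the get) runs - so at most one contender can create the key.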

3.4. Promotion Process

Winner node executes:

TEXT
Step 1: Acquire leader lock in DCS
  etcdctl put /service/postgres/leader '{"node": "node2", ...}'

Step 2: Run pre_promote callback (if configured)
  /var/lib/postgresql/callbacks/pre_promote.sh

Step 3: Promote PostgreSQL
  Method A: pg_ctl promote -D /var/lib/postgresql/18/data
  Method B: SELECT pg_promote();
  Method C: Create trigger file (old method)

Step 4: Wait for promotion complete
  Check: SELECT pg_is_in_recovery();
  Should return: false (not in recovery = primary)

Step 5: Update timeline
  Timeline increments: 1 → 2

Step 6: Run post_promote callback
  Update DNS, load balancer, send notifications

Step 7: Run on_role_change callback
  /var/lib/postgresql/callbacks/on_role_change.sh master

Step 8: Update DCS with new primary info
  xlog_location, timeline, conn_url

Step 9: Start accepting writes
  PostgreSQL now in read-write mode
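
Patroni invokes callback scripts with three arguments: the action (e.g. on_role_change), the new role, and the cluster name. A minimal sketch of an on_role_change callback that logs the event and pings a webhook (the webhook URL is a placeholder; wire the script up under postgresql.callbacks.on_role_change in patroni.yml and make it executable by the postgres user):

TEXT
#!/bin/bash
# /var/lib/postgresql/callbacks/on_role_change.sh
# Called by Patroni as: on_role_change.sh <action> <role> <cluster-name>
ACTION="$1"
ROLE="$2"
CLUSTER="$3"

logger -t patroni-callback "action=$ACTION role=$ROLE cluster=$CLUSTER host=$(hostname)"

if [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; then
  # Placeholder: update DNS / load balancer or send a notification here
  curl -s -X POST -d "New primary: $(hostname)" https://example.com/notify || true
fi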

4. Detailed Failover Timeline

4.1. Complete Failover Flow

TEXT
Timeline of Automatic Failover

T+0s: NORMAL OPERATION
  Primary (node1): Healthy, serving requests
  Replica (node2): Streaming from node1, lag=0
  Replica (node3): Streaming from node1, lag=0

T+1s: PRIMARY FAILS
  node1: PostgreSQL crashes / server dies
  node2: Still streaming (buffered data)
  node3: Still streaming (buffered data)

T+5s: REPLICATION BROKEN
  node2: WAL receiver error "connection lost"
  node3: WAL receiver error "connection lost"
  node1: Still holds leader lock (TTL not expired yet)

T+10s: HEALTH CHECK CYCLE 1
  node2: Check replication → FAILED, wait...
  node3: Check replication → FAILED, wait...
  node1: Cannot renew lock (crashed)

T+20s: HEALTH CHECK CYCLE 2
  node2: Still cannot connect to node1
  node3: Still cannot connect to node1

T+30s: LEADER LOCK EXPIRES
  DCS: /service/postgres/leader TTL expired → key deleted
  node2: Detects no leader key
  node3: Detects no leader key

T+31s: CANDIDATE ELECTION BEGINS
  node2: Check eligibility → YES (lag=0, priority=100)
  node3: Check eligibility → YES (lag=1MB, priority=100)

T+32s: RACE FOR LOCK
  node2: PUT /service/postgres/leader "node2" → SUCCESS
  node3: PUT /service/postgres/leader "node3" → FAILED

T+33s: NODE2 PROMOTES
  node2: Run pre_promote callback
  node2: pg_promote() executed
  node2: Timeline: 1 → 2

T+35s: PROMOTION COMPLETE
  node2: pg_is_in_recovery() → false
  node2: Now accepting writes
  node2: Run post_promote & on_role_change callbacks

T+36s: NODE3 RECONFIGURES
  node3: Detects new leader = node2
  node3: Update primary_conninfo → node2:5432
  node3: Restart WAL receiver

T+38s: REPLICATION RESTORED
  node3: Connected to node2
  node3: Streaming at timeline 2

T+40s: CLUSTER OPERATIONAL
  Primary: node2 (was replica)
  Replica: node3 (following node2)
  Failed: node1 (needs manual intervention)

Total Downtime: ~35-40 seconds ✅

4.2. Factors Affecting Failover Speed

Configuration parameters:

TEXT
# Fast failover configuration
bootstrap:
  dcs:
    ttl: 20  # Faster detection (default: 30)
    loop_wait: 5  # More frequent checks (default: 10)
    retry_timeout: 5  # Quick retries (default: 10)

Trade-offs:

| Parameter | Lower value                            | Higher value                      |
|-----------|----------------------------------------|-----------------------------------|
| ttl       | Faster failover, more false positives  | More stable, slower failover      |
| loop_wait | Faster detection, more CPU/DCS traffic | Less DCS traffic, slower reaction |

Typical configurations:

TEXT
# Conservative (stable, slower)
ttl: 30
loop_wait: 10
→ Failover: ~40-50s

# Balanced (recommended)
ttl: 20
loop_wait: 10
→ Failover: ~30-40s

# Aggressive (fast, sensitive)
ttl: 15
loop_wait: 5
→ Failover: ~20-30s
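
Note that bootstrap.dcs only applies when the cluster is first initialized; on a running cluster these values live in the DCS and are changed with patronictl edit-config. A sketch using the non-interactive -s/--set form (flag availability depends on your patronictl version):

TEXT
# Apply the balanced profile to a running cluster
patronictl edit-config postgres -s ttl=20 -s loop_wait=10 -s retry_timeout=10 --force

# Verify
patronictl show-config postgres | grep -E "ttl|loop_wait|retry_timeout"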

5. Testing Automatic Failover

5.1. Test Scenario 1: PostgreSQL Process Kill

Simulate PostgreSQL crash:

TEXT
# On current primary (node1), find the postmaster PID
# (pg_backend_pid() would only return the PID of your own psql session)
sudo head -1 /var/lib/postgresql/18/data/postmaster.pid
# Returns: 12345

sudo kill -9 12345  # Kill the postmaster

# Or kill all postgres processes
sudo pkill -9 postgres

Note: if Patroni is still running on node1, it may simply restart PostgreSQL locally instead of failing over (governed by master_start_timeout, default 300s). To force a failover in this scenario, stop Patroni as well: sudo systemctl stop patroni.

Monitor failover:

TEXT
# Terminal 1: Watch cluster status
watch -n 1 "patronictl list postgres"

# Terminal 2: Monitor logs
sudo journalctl -u patroni -f

# Terminal 3: Test connectivity (point at your HAProxy/VIP if you have one;
# a direct node IP will keep reporting DOWN after the primary moves)
while true; do
  if psql -h 10.0.1.11 -U app_user -d myapp -c "SELECT 1;" >/dev/null 2>&1; then
    echo "$(date): UP"
  else
    echo "$(date): DOWN"
  fi
  sleep 1
done

Expected timeline:

TEXT
00:00 - Cluster healthy
00:01 - Kill postgres on node1
00:02-00:30 - Patroni detecting failure
00:31 - node2 elected as new primary
00:35 - Cluster operational (node2 = primary)
00:36+ - Connections working again
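
To turn the UP/DOWN log into an actual downtime number, a small wrapper can record when writes first fail and when they succeed again. A sketch; 10.0.1.10 is an assumed HAProxy/VIP address, replace it with your own entry point:

TEXT
#!/bin/bash
# Measure write downtime during failover
ENDPOINT=10.0.1.10   # assumed HAProxy/VIP address
DOWN_AT=""

while true; do
  if psql -h "$ENDPOINT" -U app_user -d myapp -Atc "SELECT 1;" >/dev/null 2>&1; then
    if [ -n "$DOWN_AT" ]; then
      echo "Downtime: $(( $(date +%s) - DOWN_AT )) seconds"
      DOWN_AT=""
    fi
  else
    [ -z "$DOWN_AT" ] && DOWN_AT=$(date +%s) && echo "$(date): first failure detected"
  fi
  sleep 1
done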

5.2. Test Scenario 2: Network Partition

Simulate network partition:

TEXT
# On primary node, block traffic to other nodes
sudo iptables -A INPUT -s 10.0.1.12 -j DROP
sudo iptables -A INPUT -s 10.0.1.13 -j DROP
sudo iptables -A OUTPUT -d 10.0.1.12 -j DROP
sudo iptables -A OUTPUT -d 10.0.1.13 -j DROP

# Or block etcd access specifically
sudo iptables -A OUTPUT -p tcp --dport 2379 -j DROP

Observe:

TEXT
# On node1 (isolated)
patronictl list postgres
# Will show errors / cannot connect to cluster

# On node2/node3
patronictl list postgres
# Will show node1 as unavailable
# After TTL: node2 or node3 becomes leader

Recovery:

TEXT
# Restore network on node1
sudo iptables -F

# node1 should automatically rejoin as replica
patronictl list postgres

5.3. Test Scenario 3: Server Reboot

Simulate server crash:

TEXT
# On primary node
sudo reboot

# Or immediate crash
echo c | sudo tee /proc/sysrq-trigger

Expected behavior: Same as Scenario 1, but the node is completely unavailable, so Patroni cannot restart PostgreSQL locally and a failover is certain.

5.4. Test Scenario 4: Disk Full

Simulate disk full:

TEXT
# Fill up disk on primary
dd if=/dev/zero of=/var/lib/postgresql/bigfile bs=1M count=10000

# PostgreSQL will fail once it can no longer write WAL

Patroni will detect that PostgreSQL is unhealthy and trigger a failover.

5.5. Test Scenario 5: DCS Failure

Stop etcd on all nodes:

TEXT
# On all 3 etcd nodes
sudo systemctl stop etcd

Expected behavior:

TEXT
- All Patroni nodes lose DCS connection
- Current primary DEMOTES (safety mechanism)
- Cluster enters "read-only" state
- NO failover possible (no DCS consensus)

Recovery:
- Restart etcd cluster
- Patroni auto-recovers
- Leader election happens
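
After restarting etcd, confirm that quorum is back and that Patroni has re-acquired a leader:

TEXT
# etcd member health and status
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table

# Leader key should reappear once Patroni re-elects
etcdctl get /service/postgres/leader

# Patroni view
patronictl list postgres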

6. Verify Failover Success

6.1. Check cluster status

TEXT
# List cluster members
patronictl list postgres

# Expected after failover:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host          | Role    | State   | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11:5432| Replica | stopped |  1 |           | ← Old primary
# | node2  | 10.0.1.12:5432| Leader  | running |  2 |           | ← NEW primary
# | node3  | 10.0.1.13:5432| Replica | running |  2 |         0 |
# +--------+---------------+---------+---------+----+-----------+

# Note timeline changed: 1 → 2

6.2. Verify new primary

TEXT
# Check primary role
sudo -u postgres psql -h 10.0.1.12 -c "SELECT pg_is_in_recovery();"
# pg_is_in_recovery
# ------------------
#  f                  ← false = PRIMARY

# Check timeline
sudo -u postgres psql -h 10.0.1.12 -c "SELECT timeline_id FROM pg_control_checkpoint();"
# timeline_id
# ------------
#           2

# Check replication from new primary
sudo -u postgres psql -h 10.0.1.12 -c "SELECT * FROM pg_stat_replication;"
# Should show node3 replicating from node2

6.3. Test write operations

TEXT
# Insert data on new primary
sudo -u postgres psql -h 10.0.1.12 -d testdb -c "
INSERT INTO test_table (data) VALUES ('After failover at ' || NOW());
"

# Verify on replica
sudo -u postgres psql -h 10.0.1.13 -d testdb -c "
SELECT * FROM test_table ORDER BY id DESC LIMIT 5;
"
# Should see new data replicated

6.4. Check failover history

TEXT
# View history via REST API
curl -s http://10.0.1.12:8008/history | jq

# Output:
# [
#   [1, 67108864, "no recovery target specified", "2024-11-25T10:00:00+00:00"],
#   [2, 134217728, "no recovery target specified", "2024-11-25T11:30:15+00:00"]
# ]
#   ↑ Timeline 2 = Failover event

# Check Patroni logs
sudo journalctl -u patroni --since "30 minutes ago" | grep -i "promote\|failover\|leader"

7. Troubleshooting Failover Issues

7.1. Issue: Failover not happening

Symptoms: Primary down but no promotion.

Possible causes:

A. All replicas tagged nofailover

TEXT
# Check tags (tags are node-local settings, visible in extended output)
patronictl list postgres --extended

# Or inspect each node's local config file
grep -A5 "tags:" /etc/patroni/patroni.yml   # path depends on your setup

# If all replicas have nofailover: true
# Solution: set nofailover: false in at least one replica's patroni.yml,
# then reload that node
sudo systemctl reload patroni

B. Replication lag too high

TEXT
# Check maximum_lag_on_failover
patronictl show-config postgres | grep maximum_lag_on_failover

# If replica lag > threshold, won't promote
# Solution: Increase threshold or wait for lag to decrease
patronictl edit-config postgres
# Set: maximum_lag_on_failover: 10485760  # 10MB

C. No quorum in DCS

TEXT
# Check etcd health
etcdctl endpoint health --cluster

# If etcd cluster has no quorum (< 2 of 3 healthy)
# Solution: Fix etcd cluster first
sudo systemctl restart etcd

D. synchronous_mode_strict enabled

TEXT
# If enabled and no sync replica available
synchronous_mode: true
synchronous_mode_strict: true  # ← Problem!

# Primary cannot be demoted, replicas cannot be promoted
# Solution: Disable strict mode
patronictl edit-config postgres
# Set: synchronous_mode_strict: false

7.2. Issue: Multiple failovers (flapping)

Symptoms: Cluster keeps failing over repeatedly.

Possible causes:

A. Network instability

TEXT
# Check network between nodes
ping -c 100 10.0.1.12
# High packet loss → false failovers

# Solution: Fix network or increase TTL
patronictl edit-config postgres
# Set: ttl: 40  # More tolerant

B. TTL too aggressive

TEXT
# ttl: 10  ← Too low!
# Every small network blip causes failover

# Solution: Increase TTL
ttl: 30  # More stable

C. Resource exhaustion

TEXT
# Check CPU/Memory on nodes
top
free -h

# If resources exhausted, health checks timeout
# Solution: Scale up resources or reduce load

7.3. Issue: Slow failover

Symptoms: Takes >60 seconds to failover.

Diagnosis:

TEXT
# Check TTL and loop_wait
patronictl show-config postgres | grep -E "ttl|loop_wait"

# Estimate worst-case failover time:
# Worst case ≈ TTL + (loop_wait × 2) + promotion_time
# Example: 30 + (10 × 2) + 5 = 55 seconds

Optimization:

TEXT
# Reduce TTL and loop_wait
bootstrap:
  dcs:
    ttl: 20  # Was 30
    loop_wait: 5  # Was 10

# Expected failover: ~30-35 seconds
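
Rather than relying on the formula alone, you can measure the real "no primary" window by polling the /primary endpoint of every node once per second; a sketch using the node addresses from this series:

TEXT
#!/bin/bash
# Print how long the cluster had no responding primary
NODES="10.0.1.11 10.0.1.12 10.0.1.13"
START=""

while true; do
  FOUND=""
  for n in $NODES; do
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 1 "http://$n:8008/primary")
    [ "$code" = "200" ] && FOUND=$n
  done
  if [ -z "$FOUND" ]; then
    [ -z "$START" ] && START=$(date +%s) && echo "$(date): no primary"
  elif [ -n "$START" ]; then
    echo "Primary back on $FOUND after $(( $(date +%s) - START )) seconds"
    START=""
  fi
  sleep 1
done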

7.4. Issue: Data loss after failover

Symptoms: Some recent transactions missing.

Cause: Asynchronous replication + replica lag.

Verification:

TEXT
# Check replication mode
patronictl show-config postgres | grep synchronous_mode

# Check lag before failover
# (check logs for lag_in_mb at failover time)
sudo journalctl -u patroni | grep "lag_in_mb"

Prevention:

TEXT
# Enable synchronous replication
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false  # Allow degradation
    
    postgresql:
      parameters:
        synchronous_commit: 'on'
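
After enabling synchronous_mode, verify from the primary that one replica is actually reported as synchronous:

TEXT
# Which standby is synchronous right now?
sudo -u postgres psql -h 10.0.1.12 -c \
  "SELECT application_name, sync_state FROM pg_stat_replication;"

# Patroni manages this parameter when synchronous_mode is on
sudo -u postgres psql -h 10.0.1.12 -c "SHOW synchronous_standby_names;"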

8. Metrics and Monitoring

8.1. Key failover metrics

TEXT
-- Time since last failover
SELECT timeline_id, 
       pg_postmaster_start_time(),
       now() - pg_postmaster_start_time() AS uptime
FROM pg_control_checkpoint();

-- Replication lag (pre-failover indicator)
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- Database activity (a drop in commits/active backends indicates downtime)
SELECT datname, numbackends, xact_commit, xact_rollback
FROM pg_stat_database;

8.2. Alerting rules

Prometheus alert examples (metric names depend on your exporter and Patroni version; verify them against your /metrics output, shown below the rules):

TEXT
groups:
  - name: patroni_failover
    rules:
      - alert: PatroniFailoverDetected
        expr: increase(patroni_timeline[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Patroni failover detected"
          description: "Timeline changed, indicating failover"
      
      - alert: PatroniNoLeader
        expr: count(patroni_patroni_info{role="master"}) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No Patroni leader"
          description: "Cluster has no primary"
      
      - alert: PatroniHighReplicationLag
        expr: patroni_replication_lag_bytes > 10485760  # 10MB
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag"
          description: "Replica lag > 10MB, risk of data loss on failover"

9. Best Practices

✅ DO

  1. Test failover regularly - Monthly in staging, quarterly in production
  2. Monitor replication lag - Alert if lag > 1MB
  3. Use synchronous replication for zero data loss
  4. Set synchronous_mode_strict: false - Allow degradation
  5. Configure proper TTL - Balance speed vs stability (20-30s)
  6. Have >= 2 replicas - Allow failover even if one replica down
  7. Monitor DCS health - etcd cluster must be healthy
  8. Document runbooks - Procedures for manual intervention
  9. Log failover events - Track patterns and issues
  10. Capacity planning - Replicas should handle primary load

❌ DON'T

  1. Don't use single replica - No failover option
  2. Don't ignore lag - High lag = data loss risk
  3. Don't set TTL too low (<15s) - False positives
  4. Don't skip testing - Untested failover = downtime risk
  5. Don't manually promote during automatic failover - Let Patroni handle it
  6. Don't forget about old primary - Needs rejoin/rebuild
  7. Don't run without monitoring - Must know when failover happens
  8. Don't overload DCS - Separate etcd cluster recommended

10. Lab Exercises

Lab 1: Basic failover test

Tasks:

  1. Record the baseline: patronictl list
  2. Stop the primary: sudo systemctl stop patroni
  3. Time the failover with watch -n 1 patronictl list
  4. Document the downtime duration
  5. Verify the new primary accepts writes
  6. Restart the old primary and verify it rejoins

Lab 2: Network partition test

Tasks:

  1. Use iptables to partition the primary from the cluster
  2. Observe DCS behavior
  3. Verify only one primary exists after the partition
  4. Restore the network and verify automatic recovery

Lab 3: Optimize failover speed

Tasks:

  1. Run a baseline test with default settings (ttl: 30)
  2. Reduce ttl to 20 and test again
  3. Reduce ttl to 15 and test again
  4. Compare failover times
  5. Evaluate the trade-offs (speed vs false positives)

Lab 4: Failover under load

Tasks:

  1. Generate load with pgbench: pgbench -c 10 -T 300
  2. While the load is running, stop the primary
  3. Count connection errors in the pgbench output
  4. Calculate the availability percentage
  5. Document the user impact

11. Summary

Key Concepts

✅ Automatic Failover = Self-healing without manual intervention

✅ Detection = Health checks + DCS connectivity + TTL expiration

✅ Election = Best replica based on lag, timeline, tags

✅ Promotion = pg_promote() + timeline increment + role change

✅ Timeline = Incremented on every promotion, prevents WAL divergence

✅ TTL = Trade-off between speed and stability

Failover Checklist

  •  Primary failure detected
  •  Leader lock expired in DCS
  •  Best replica identified
  •  Leader lock acquired
  •  PostgreSQL promoted successfully
  •  Timeline incremented
  •  Callbacks executed
  •  Other replicas reconfigured
  •  Replication restored
  •  Cluster operational

Next Steps

Lesson 14 will cover planned switchover:

  • Planned maintenance scenarios
  • Zero-downtime switchover process
  • Graceful vs immediate switchover
  • Best practices for planned failover
