Lesson 3: Introduction to Patroni and etcd

Objectives

After this lesson, you will understand:

  • What Patroni is and how it works
  • DCS (Distributed Configuration Store) - etcd/Consul/ZooKeeper
  • Consensus algorithm (Raft)
  • Leader election & Failover mechanism
  • Split-brain problem and how to solve it

1. What is Patroni?

Introduction

Patroni is an open-source HA (High Availability) template for PostgreSQL, developed by Zalando. It automates PostgreSQL cluster management, including:

  • Leader election: Automatically selects the primary node
  • Automatic failover: Promotes a replica when the primary fails
  • Configuration management: Keeps cluster configuration centralized in one place
  • Health checking: Continuous monitoring of node health

Patroni Architecture

In a typical Patroni setup, a Patroni agent runs alongside PostgreSQL on every node, and all agents coordinate through a shared DCS (such as etcd) to decide which node is the primary.

How Patroni works

  1. Startup: Each Patroni instance connects to DCS (etcd)
  2. Leader election: Nodes compete to become leader in DCS
  3. Role assignment: Node that wins leader lock promotes PostgreSQL to primary
  4. Health monitoring: Patroni continuously checks:
    • PostgreSQL process health
    • Replication status
    • DCS connectivity
  5. Auto failover: If leader fails, Patroni automatically:
    • Detects the issue
    • Selects the most suitable replica
    • Promotes new replica to primary
    • Updates remaining replicas
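
The flow above can be condensed into a single control loop. The following is a simplified sketch, not Patroni's actual code; dcs, promote_postgresql, and follow_leader are hypothetical stand-ins for the real components:

TEXT
# Simplified sketch of the Patroni control loop (hypothetical helpers).
from time import sleep

def run_ha_loop(dcs, my_name, promote_postgresql, follow_leader, ttl=30, loop_wait=10):
    while True:
        leader = dcs.get_leader()                    # read /service/postgres/leader

        if leader is None:
            # No leader key: race to create it; only one node can win.
            if dcs.attempt_to_acquire_leader(my_name, ttl=ttl):
                promote_postgresql()                 # this node becomes primary
        elif leader == my_name:
            dcs.update_leader(ttl=ttl)               # renew the lock before it expires
        else:
            follow_leader(leader)                    # replicate from the current leader

        sleep(loop_wait)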

Main components

Patroni daemon

  • Runs on each PostgreSQL node
  • Manages PostgreSQL lifecycle
  • Performs health checks
  • Interacts with DCS

REST API

  • Health check endpoint: http://node:8008/health
  • Read-only endpoint: http://node:8008/read-only
  • Primary endpoint: http://node:8008/master (deprecated) or /primary
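
A quick way to check these endpoints from a script, assuming the default port 8008 and placeholder hostnames; the /patroni endpoint returns the node's status as JSON:

TEXT
# Probe each node's role via the Patroni REST API (port and hostnames assumed).
import requests

for node in ("node1", "node2", "node3"):
    try:
        status = requests.get(f"http://{node}:8008/patroni", timeout=2).json()
        print(node, status.get("role"), status.get("state"))
    except requests.RequestException as exc:
        print(node, "unreachable:", exc)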

patronictl

  • CLI tool to manage cluster
  • Commands: list, switchover, failover, reinit, restart, reload

2. DCS - Distributed Configuration Store

Role of DCS

DCS is the coordination center for Patroni cluster, storing:

  • Leader key: Information about which node is leader (TTL-based)
  • Configuration: PostgreSQL and Patroni configuration
  • Member information: List of nodes in cluster
  • Failover/Switchover state: Pending failover or switchover requests

Comparison of DCS options:
TEXT
Feature        | etcd                | Consul            | ZooKeeper
---------------|---------------------|-------------------|------------------
Language       | Go                  | Go                | Java
Consensus      | Raft                | Raft              | ZAB (Paxos-like)
API            | gRPC, HTTP          | HTTP, DNS         | Custom protocol
Setup          | Simple              | Medium            | Complex
Performance    | High                | High              | Medium
Documentation  | Good                | Very Good         | Medium
Usage          | Kubernetes, Patroni | Service mesh, HA  | Hadoop, Kafka

Recommendation: etcd for most cases because of simplicity and high performance.

etcd - Distributed Key-Value Store

Main characteristics:

  • Strongly consistent (CAP theorem: CP)
  • Distributed and highly available
  • Fast (sub-millisecond latency)
  • Simple API
  • Watch mechanism for real-time updates

Data structure in etcd for Patroni:

TEXT
/service/postgres/
├── config          # Cluster configuration
├── initialize      # Bootstrap token
├── leader          # Leader lock (TTL: 30s)
├── members/
│   ├── node1      # Node1 information
│   ├── node2      # Node2 information
│   └── node3      # Node3 information
├── optime/
│   └── leader     # LSN of leader
└── failover       # Failover/switchover instructions
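
The watch mechanism mentioned above is what lets replicas react to leader changes without polling. A minimal sketch using the python-etcd3 client (package choice and the key path matching the layout above are assumptions):

TEXT
# Read and watch the Patroni leader key (sketch using python-etcd3).
import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)

# One-off read of the current leader key
value, _meta = client.get("/service/postgres/leader")
print("current leader key:", value.decode() if value else "none")

# Watch for changes: a put event fires on every renewal,
# a delete event when the TTL expires and the key disappears.
events, cancel = client.watch("/service/postgres/leader")
for event in events:
    print("leader key event:", event)
# cancel() stops the watch when it is no longer needed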

3. Consensus Algorithm - Raft

What is Raft?

Raft is a consensus algorithm designed to be easier to understand than Paxos, ensuring:

  • Safety: Never returns incorrect results
  • Liveness: Always makes progress (as long as a majority of nodes are operational)
  • Consistency: All nodes see the same state

Roles in Raft

  1. Leader:
    • Handles all client requests
    • Replicates log entries to followers
    • At most one leader per term
  2. Follower:
    • Passive, only receives requests from leader
    • If no heartbeat received, becomes candidate
  3. Candidate:
    • A follower whose election timeout expires becomes a candidate
    • Requests votes from other nodes
    • If it wins the election → becomes Leader

Leader Election Process

Leader election guarantees that the cluster has exactly one leader per term, which is essential for both consistency and availability.

Detailed election:

  1. Follower doesn't receive heartbeat within election timeout (150-300ms random)
  2. Becomes Candidate, increases term number
  3. Votes for itself
  4. Sends RequestVote RPC to all nodes
  5. If receives majority votes (n/2 + 1):
    • Becomes Leader
    • Sends heartbeat immediately
  6. If timeout or loses election:
    • Returns to Follower or starts new election
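
The majority rule in step 5 can be illustrated with a few lines of code. This is a toy sketch of the candidate side only (vote gathering is simulated; real Raft sends RequestVote RPCs over the network):

TEXT
# Toy illustration of Raft's majority rule for a candidate (not real RPCs).
def run_election(my_id, current_term, peers, request_vote):
    term = current_term + 1          # step 2: increment term
    votes = 1                        # step 3: vote for self
    for peer in peers:               # step 4: ask every other node
        if request_vote(peer, candidate=my_id, term=term):
            votes += 1
    majority = (len(peers) + 1) // 2 + 1
    return ("leader", term) if votes >= majority else ("follower", term)

# Example: 3-node cluster, one peer grants its vote -> 2 of 3 votes, election won.
print(run_election("node1", 4, ["node2", "node3"],
                   lambda peer, candidate, term: peer == "node2"))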

Quorum and Majority

Quorum: Minimum number of nodes needed for system to operate

TEXT
Cluster size | Quorum | Tolerated failures
-------------|--------|-------------------
     1       |   1    |        0
     3       |   2    |        1
     5       |   3    |        2
     7       |   4    |        3

Formula: Quorum = floor(n/2) + 1

Example with 3 nodes:

  • ✅ 3 nodes active: Cluster healthy
  • ✅ 2 nodes active: Cluster works (quorum met)
  • ❌ 1 node active: Cluster stops (no quorum)

Recommendation: Always use an odd number of nodes (3, 5, 7); adding an even-numbered node raises the quorum without tolerating any additional failures.
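
A short check that reproduces the table above from the formula:

TEXT
# Quorum = floor(n/2) + 1, tolerated failures = n - quorum.
for n in (1, 3, 5, 7):
    quorum = n // 2 + 1
    print(f"nodes={n}  quorum={quorum}  tolerated_failures={n - quorum}")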

4. Leader Election in Patroni

Leader Lock Mechanism

Patroni uses DCS to implement distributed lock:

Leader Lock Properties:

TEXT
Key: /service/postgres/leader
Value: 
  {
    "version": "3.0.2",
    "conn_url": "postgres://node1:5432/postgres",
    "api_url": "http://node1:8008/patroni",
    "xlog_location": 123456789,
    "timeline": 2
  }
TTL: 30 seconds
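
Only one node can create the leader key while it is absent, which is what makes the race in the next section safe. A sketch of the acquire step, reusing the same hypothetical dcs interface as the pseudocode later in this lesson (create_if_absent stands in for an atomic create-if-not-exists with a TTL, which etcd provides via transactions plus leases):

TEXT
# Sketch: acquiring the leader lock with an atomic create-if-absent + TTL.
import json

def try_become_leader(dcs, my_name, conn_url, api_url, ttl=30):
    payload = json.dumps({"conn_url": conn_url, "api_url": api_url})
    # Exactly one node can succeed while the key does not exist;
    # everyone else sees "key exists" and stays a replica.
    acquired = dcs.create_if_absent("/service/postgres/leader", payload, ttl=ttl)
    return acquired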

Leader Election Process

Step 1: Race Condition

TEXT
Time: T0 - Leader crashes
Node1: Check DCS → No leader key exists
Node2: Check DCS → No leader key exists  
Node3: Check DCS → No leader key exists

Step 2: Acquire Lock Attempt

TEXT
Time: T0 + 100ms
Node1: Try acquire lock → SUCCESS (first to write)
Node2: Try acquire lock → FAILED (key exists)
Node3: Try acquire lock → FAILED (key exists)

Step 3: Role Assignment

TEXT
Node1: Promote PostgreSQL to Primary
Node2: Configure as Replica, point to Node1
Node3: Configure as Replica, point to Node1

Step 4: Maintenance

TEXT
Every 10 seconds:
Node1 (Leader): 
  - Renew lock (TTL extension)
  - Update xlog_location
  - Send heartbeat

Node2/3 (Followers):
  - Monitor leader key
  - Check replication lag
  - Ready to take over

Best Replica Selection Criteria

When failover occurs, Patroni selects replica based on:

  1. Replication state:
    • streaming > in archive recovery
  2. Timeline: Higher timeline preferred
  3. XLog position:
    • Replica with LSN closest to primary
    • Least data loss
  4. No replication lag:
    • pg_stat_replication.replay_lag = 0
  5. Explicit candidate: Set in configuration

Priority tag:

TEXT
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

Example:

TEXT
Primary fails at LSN: 0/3000000

Replica1: LSN=0/3000000, lag=0s     ← BEST CHOICE
Replica2: LSN=0/2FFFFFF, lag=1s
Replica3: LSN=0/2FFFFFE, lag=2s

→ Patroni promotes Replica1
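
The choice in the example can be expressed as a ranking over the criteria above. This is a simplification of Patroni's actual logic; the field names and the nofailover handling are illustrative:

TEXT
# Sketch: rank candidates by highest timeline, then highest LSN, then lowest lag,
# skipping members tagged nofailover.
def pick_failover_candidate(replicas):
    eligible = [r for r in replicas if not r.get("nofailover", False)]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r["timeline"], r["lsn"], -r["lag"]))

replicas = [
    {"name": "replica1", "timeline": 2, "lsn": 0x3000000, "lag": 0},
    {"name": "replica2", "timeline": 2, "lsn": 0x2FFFFFF, "lag": 1},
    {"name": "replica3", "timeline": 2, "lsn": 0x2FFFFFE, "lag": 2},
]
print(pick_failover_candidate(replicas)["name"])   # -> replica1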

5. Failover Mechanism

Automatic Failover Process

Detailed failover steps

Step 1: Detect failure

TEXT
# Patroni health check loop
while True:
    if not check_postgresql_health():
        log.error("PostgreSQL unhealthy")
        stop_renewing_leader_lock()
    
    if not check_dcs_connectivity():
        log.error("Lost connection to DCS")
        demote_if_leader()
    
    sleep(10)

Step 2: Leader lock expires

TEXT
# In etcd
$ etcdctl get /service/postgres/leader
# After TTL: Key not found

# Patroni logs on former leader
WARN: Could not renew leader lock
INFO: Demoting PostgreSQL to standby

Step 3: Replica promotion

TEXT
# Patroni on promoted replica
INFO: No leader found
INFO: Attempting to acquire leader lock
INFO: Lock acquired successfully
INFO: Promoting PostgreSQL instance
INFO: Updating configuration
INFO: Notifying other members

Step 4: Reconfiguration

TEXT
-- On promoted replica
SELECT pg_promote();

-- The standby exits recovery and begins accepting writes on a new timeline;
-- Patroni then clears primary_conninfo so this node no longer follows anyone

Step 5: Followers repoint

TEXT
# Other replicas
INFO: New leader detected: node2
INFO: Updating primary_conninfo
INFO: Restarting replication

Monitor Failover

Important metrics:

  • patroni_primary_timeline: Detect timeline changes
  • patroni_xlog_location: Track WAL position
  • patroni_replication_lag: Lag before failover
  • patroni_failover_count: Count number of failovers

6. Split-Brain Problem

What is Split-Brain?

Definition: A situation where two or more nodes each believe they are the primary and accept writes, causing their data to diverge.

Causes

  1. Network partition: Nodes are split into groups that cannot reach each other
  2. DCS partition: The etcd cluster itself is split
  3. Slow network: Heartbeats time out even though the node is still alive

Consequences of Split-Brain

Two primaries accepting writes at the same time produce divergent histories: conflicting transactions, writes that are lost when one side is discarded, and a recovery that requires pg_rewind or a full rebuild of the diverged node.

Patroni's Split-Brain Prevention

Mechanism 1: DCS-based Lock (Primary)

TEXT
def maintain_leader_lock():
    while is_leader:
        # Must renew within TTL
        success = dcs.renew_lock(ttl=30)
        
        if not success:
            log.critical("Lost leader lock!")
            # Immediate demotion
            demote_to_standby()
            stop_accepting_writes()
            break
        
        sleep(10)

Mechanism 2: Leader Key Verification

TEXT
def before_handle_write():
    leader_key = dcs.get("/service/postgres/leader")
    
    if leader_key.owner != my_node_name:
        # I'm not the real leader!
        demote_immediately()
        raise Exception("Not leader anymore")

Mechanism 3: Timeline Divergence Detection

TEXT
-- PostgreSQL timeline
SELECT timeline_id FROM pg_control_checkpoint();

-- If timelines diverge:
-- Node1: timeline=5
-- Node2: timeline=6
-- → Data inconsistency detected
-- → Requires pg_rewind or rebuild
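
One way to spot divergence from a script is to compare the timeline each member reports, for example via the REST API status endpoint (hostnames, the default port, and the "timeline" field in the /patroni output are assumptions):

TEXT
# Compare reported timelines; more than one distinct value suggests divergence.
import requests

timelines = {}
for node in ("node1", "node2", "node3"):
    status = requests.get(f"http://{node}:8008/patroni", timeout=2).json()
    timelines[node] = status.get("timeline")

print(timelines)
if len(set(timelines.values())) > 1:
    print("WARNING: timeline divergence detected, investigate before failback")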

Quorum requirement

etcd with 3 nodes:

TEXT
Scenario 1: Network partition 1-2 split
  Partition A: Node1 (1 node)
    - Cannot get quorum (1 < 2)
    - Cannot write to etcd
    - Demotes to standby ✓
  
  Partition B: Node2, Node3 (2 nodes)
    - Has quorum (2 ≥ 2)
    - Can elect leader
    - Node2 becomes primary ✓
  
Result: Only 1 primary exists ✓

Scenario 2: Complete isolation

TEXT
Node1: Isolated, loses DCS
  - Tries to renew lock → FAIL
  - Demotes PostgreSQL immediately
  - Stops accepting connections
  
Node2/3: See Node1 gone
  - Elect new leader
  - Only 1 primary in cluster ✓

Watchdog Timer (Advanced Protection)

Hardware watchdog:

TEXT
# patroni.yml
watchdog:
  mode: required  # or automatic, off
  device: /dev/watchdog
  safety_margin: 5

Operation:

  1. Patroni kicks watchdog device every 10s
  2. If Patroni hangs or loses DCS → stops kicking
  3. After timeout → Watchdog reboots entire node
  4. Prevents "zombie primary" scenario
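
The kicking itself is just a periodic write to the watchdog device; when the writes stop, the kernel driver resets the machine. A minimal illustration of the pattern (Patroni does this internally; only run something like this on a test machine, because opening /dev/watchdog arms the timer):

TEXT
# Illustration of the watchdog-kick pattern (Patroni handles this itself).
import time

with open("/dev/watchdog", "wb", buffering=0) as wd:
    try:
        for _ in range(6):          # pretend this is the healthy-leader loop
            wd.write(b"\0")         # each write resets the hardware timer
            time.sleep(10)
    finally:
        wd.write(b"V")              # "magic close": disarm before exiting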

Best Practices to Avoid Split-Brain

  1. Deploy DCS separately: etcd cluster in different AZs
  2. Monitor DCS health: Alert when etcd is unhealthy
  3. Network redundancy: Multiple network paths between nodes
  4. Proper timeouts:
TEXT
patroni:
  ttl: 30              # Leader lock TTL
  loop_wait: 10        # Check interval
  retry_timeout: 10    # DCS operation timeout
  5. Enable watchdog: Hardware protection layer
  6. Monitoring:
TEXT
# Check for timeline divergence
patronictl list

# Expected: All nodes same timeline
+ Cluster: postgres (7001234567890123456) ----+----+-----------+
| Member | Host         | Role    | State   | TL | Lag in MB |
+--------+--------------+---------+---------+----+-----------+
| node1  | 10.0.1.1:5432| Leader  | running | 5  |           |
| node2  | 10.0.1.2:5432| Replica | running | 5  |         0 |
| node3  | 10.0.1.3:5432| Replica | running | 5  |         0 |
+--------+--------------+---------+---------+----+-----------+

Recovery from Split-Brain

If split-brain occurs:

Step 1: Identify

TEXT
# Check timeline
patronictl list
# node1: timeline=5
# node2: timeline=6  ← DIVERGED!

Step 2: Choose primary

  • Usually the node with the more important or more complete data
  • Often, but not always, the node with the higher timeline

Step 3: Rebuild diverged replica

TEXT
# Option 1: Reinitialize the diverged member from the current leader
patronictl reinit postgres node2
# (Patroni can also run pg_rewind automatically when a former primary
#  rejoins, if postgresql.use_pg_rewind is enabled)

# Option 2: Full rebuild
# Stop Patroni on node2, wipe its data directory,
# then start Patroni again so it re-bootstraps from the new primary

Step 4: Verify

TEXT
patronictl list
# All nodes same timeline ✓

7. Summary

Key Takeaways

Patroni: HA template that automates PostgreSQL cluster management

DCS (etcd): Distributed coordination; stores configuration and the leader lock

Raft consensus: Ensures consistency and leader election in etcd

Leader election: Automatic, fast (~30-40s), based on TTL locks

Failover: Automatically promotes best replica when primary fails

Split-brain prevention: DCS quorum + TTL locks + watchdog

Combined Architecture

Putting the pieces together: a Patroni agent runs next to PostgreSQL on each database node, a 3-node etcd cluster acts as the DCS that holds the leader lock and configuration, and clients connect to whichever node currently holds the leader lock.

Review Questions

  1. How is Patroni different from pure Streaming Replication?
  2. Why do we need DCS? Can't we use the database to store state?
  3. What is the quorum in a 5-node cluster?
  4. Which replica does Patroni choose to promote during failover?
  5. How does split-brain occur and how does Patroni prevent it?
  6. What is the meaning of timeline in PostgreSQL?
  7. What does TTL 30 seconds mean? Why not set TTL = 5 seconds?

Preparation for next lesson

Lesson 4 will guide infrastructure preparation:

  • Setup 3 VMs/Servers
  • Network, firewall configuration
  • SSH keys, time sync
  • Required dependencies
