Lesson 3: Introduction to Patroni and etcd

Objectives

After this lesson, you will understand:

  • What Patroni is and how it works
  • DCS (Distributed Configuration Store) - etcd/Consul/ZooKeeper
  • Consensus algorithm (Raft)
  • Leader election & Failover mechanism
  • Split-brain problem and how to solve it

1. What is Patroni?

Introduction

Patroni is an open-source HA (High Availability) template for PostgreSQL, developed by Zalando. It automates PostgreSQL cluster management, including:

  • Leader election: Automatically selects the primary node
  • Automatic failover: Promotes a replica when the primary fails
  • Configuration management: Keeps cluster configuration centralized in one place
  • Health checking: Continuous monitoring of node health

Patroni Architecture

In a typical Patroni setup, a Patroni agent runs alongside PostgreSQL on every node, and all agents coordinate through a shared DCS (such as etcd) to decide which node is the primary.

How Patroni works

  1. Startup: Each Patroni instance connects to DCS (etcd)
  2. Leader election: Nodes compete to become leader in DCS
  3. Role assignment: Node that wins leader lock promotes PostgreSQL to primary
  4. Health monitoring: Patroni continuously checks:
    • PostgreSQL process health
    • Replication status
    • DCS connectivity
  5. Auto failover: If leader fails, Patroni automatically:
    • Detects the issue
    • Selects the most suitable replica
    • Promotes new replica to primary
    • Updates remaining replicas
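
The flow above can be condensed into a single control loop. The following is a simplified sketch, not Patroni's actual code; dcs, promote_postgresql, and follow_leader are hypothetical stand-ins for the real components:

TEXT
# Simplified sketch of the Patroni control loop (hypothetical helpers).
from time import sleep

def run_ha_loop(dcs, my_name, promote_postgresql, follow_leader, ttl=30, loop_wait=10):
    while True:
        leader = dcs.get_leader()                    # read /service/postgres/leader

        if leader is None:
            # No leader key: race to create it; only one node can win.
            if dcs.attempt_to_acquire_leader(my_name, ttl=ttl):
                promote_postgresql()                 # this node becomes primary
        elif leader == my_name:
            dcs.update_leader(ttl=ttl)               # renew the lock before it expires
        else:
            follow_leader(leader)                    # replicate from the current leader

        sleep(loop_wait)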

Main components

Patroni daemon

  • Runs on each PostgreSQL node
  • Manages PostgreSQL lifecycle
  • Performs health checks
  • Interacts with DCS

REST API

  • Health check endpoint: http://node:8008/health
  • Read-only endpoint: http://node:8008/read-only
  • Primary endpoint: http://node:8008/master (deprecated) or /primary
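
A quick way to check these endpoints from a script, assuming the default port 8008 and placeholder hostnames; the /patroni endpoint returns the node's status as JSON:

TEXT
# Probe each node's role via the Patroni REST API (port and hostnames assumed).
import requests

for node in ("node1", "node2", "node3"):
    try:
        status = requests.get(f"http://{node}:8008/patroni", timeout=2).json()
        print(node, status.get("role"), status.get("state"))
    except requests.RequestException as exc:
        print(node, "unreachable:", exc)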

patronictl

  • CLI tool to manage cluster
  • Commands: list, switchover, failover, reinit, restart, reload

2. DCS - Distributed Configuration Store

Role of DCS

DCS is the coordination center for Patroni cluster, storing:

  • Leader key: Information about which node is leader (TTL-based)
  • Configuration: PostgreSQL and Patroni configuration
  • Member information: List of nodes in cluster
  • Failover/Switchover state: Pending failover or switchover requests

Comparison of DCS options:
TEXT
Feature        | etcd                | Consul            | ZooKeeper
---------------|---------------------|-------------------|------------------
Language       | Go                  | Go                | Java
Consensus      | Raft                | Raft              | ZAB (Paxos-like)
API            | gRPC, HTTP          | HTTP, DNS         | Custom protocol
Setup          | Simple              | Medium            | Complex
Performance    | High                | High              | Medium
Documentation  | Good                | Very Good         | Medium
Usage          | Kubernetes, Patroni | Service mesh, HA  | Hadoop, Kafka

Recommendation: etcd for most cases because of simplicity and high performance.

etcd - Distributed Key-Value Store

Main characteristics:

  • Strongly consistent (CAP theorem: CP)
  • Distributed and highly available
  • Fast (sub-millisecond latency)
  • Simple API
  • Watch mechanism for real-time updates

Data structure in etcd for Patroni:

TEXT
/service/postgres/
├── config          # Cluster configuration
├── initialize      # Bootstrap token
├── leader          # Leader lock (TTL: 30s)
├── members/
│   ├── node1      # Node1 information
│   ├── node2      # Node2 information
│   └── node3      # Node3 information
├── optime/
│   └── leader     # LSN of leader
└── failover       # Failover/switchover instructions
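
The watch mechanism mentioned above is what lets replicas react to leader changes without polling. A minimal sketch using the python-etcd3 client (package choice and the key path matching the layout above are assumptions):

TEXT
# Read and watch the Patroni leader key (sketch using python-etcd3).
import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)

# One-off read of the current leader key
value, _meta = client.get("/service/postgres/leader")
print("current leader key:", value.decode() if value else "none")

# Watch for changes: a put event fires on every renewal,
# a delete event when the TTL expires and the key disappears.
events, cancel = client.watch("/service/postgres/leader")
for event in events:
    print("leader key event:", event)
# cancel() stops the watch when it is no longer needed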

3. Consensus Algorithm - Raft

What is Raft?

Raft is a consensus algorithm designed to be easier to understand than Paxos, ensuring:

  • Safety: Never returns incorrect results
  • Liveness: Always makes progress (as long as a majority of nodes are operational)
  • Consistency: All nodes see the same state

Roles in Raft

  1. Leader:
    • Handles all client requests
    • Replicates log entries to followers
    • At most one leader per term
  2. Follower:
    • Passive, only receives requests from leader
    • If no heartbeat received, becomes candidate
  3. Candidate:
    • A follower whose election timeout expires becomes a candidate
    • Requests votes from other nodes
    • If it wins the election → becomes Leader

Leader Election Process

Leader election guarantees that the cluster has exactly one leader per term, which is essential for both consistency and availability.

Detailed election:

  1. Follower doesn't receive heartbeat within election timeout (150-300ms random)
  2. Becomes Candidate, increases term number
  3. Votes for itself
  4. Sends RequestVote RPC to all nodes
  5. If receives majority votes (n/2 + 1):
    • Becomes Leader
    • Sends heartbeat immediately
  6. If timeout or loses election:
    • Returns to Follower or starts new election
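
The majority rule in step 5 can be illustrated with a few lines of code. This is a toy sketch of the candidate side only (vote gathering is simulated; real Raft sends RequestVote RPCs over the network):

TEXT
# Toy illustration of Raft's majority rule for a candidate (not real RPCs).
def run_election(my_id, current_term, peers, request_vote):
    term = current_term + 1          # step 2: increment term
    votes = 1                        # step 3: vote for self
    for peer in peers:               # step 4: ask every other node
        if request_vote(peer, candidate=my_id, term=term):
            votes += 1
    majority = (len(peers) + 1) // 2 + 1
    return ("leader", term) if votes >= majority else ("follower", term)

# Example: 3-node cluster, one peer grants its vote -> 2 of 3 votes, election won.
print(run_election("node1", 4, ["node2", "node3"],
                   lambda peer, candidate, term: peer == "node2"))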

Quorum and Majority

Quorum: Minimum number of nodes needed for system to operate

TEXT
Cluster size | Quorum | Tolerated failures
-------------|--------|-------------------
     1       |   1    |        0
     3       |   2    |        1
     5       |   3    |        2
     7       |   4    |        3

Formula: Quorum = floor(n/2) + 1

Example with 3 nodes:

  • ✅ 3 nodes active: Cluster healthy
  • ✅ 2 nodes active: Cluster works (quorum met)
  • ❌ 1 node active: Cluster stops (no quorum)

Recommendation: Always use an odd number of nodes (3, 5, 7); adding an even-numbered node raises the quorum without tolerating any additional failures.
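
A short check that reproduces the table above from the formula:

TEXT
# Quorum = floor(n/2) + 1, tolerated failures = n - quorum.
for n in (1, 3, 5, 7):
    quorum = n // 2 + 1
    print(f"nodes={n}  quorum={quorum}  tolerated_failures={n - quorum}")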

4. Leader Election in Patroni

Leader Lock Mechanism

Patroni uses DCS to implement distributed lock:

Leader Lock Properties:

TEXT
Key: /service/postgres/leader
Value: 
  {
    "version": "3.0.2",
    "conn_url": "postgres://node1:5432/postgres",
    "api_url": "http://node1:8008/patroni",
    "xlog_location": 123456789,
    "timeline": 2
  }
TTL: 30 seconds
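
Only one node can create the leader key while it is absent, which is what makes the race in the next section safe. A sketch of the acquire step, reusing the same hypothetical dcs interface as the pseudocode later in this lesson (create_if_absent stands in for an atomic create-if-not-exists with a TTL, which etcd provides via transactions plus leases):

TEXT
# Sketch: acquiring the leader lock with an atomic create-if-absent + TTL.
import json

def try_become_leader(dcs, my_name, conn_url, api_url, ttl=30):
    payload = json.dumps({"conn_url": conn_url, "api_url": api_url})
    # Exactly one node can succeed while the key does not exist;
    # everyone else sees "key exists" and stays a replica.
    acquired = dcs.create_if_absent("/service/postgres/leader", payload, ttl=ttl)
    return acquired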

Leader Election Process

Step 1: Race Condition

TEXT
Time: T0 - Leader crashes
Node1: Check DCS → No leader key exists
Node2: Check DCS → No leader key exists  
Node3: Check DCS → No leader key exists

Step 2: Acquire Lock Attempt

TEXT
Time: T0 + 100ms
Node1: Try acquire lock → SUCCESS (first to write)
Node2: Try acquire lock → FAILED (key exists)
Node3: Try acquire lock → FAILED (key exists)

Step 3: Role Assignment

TEXT
Node1: Promote PostgreSQL to Primary
Node2: Configure as Replica, point to Node1
Node3: Configure as Replica, point to Node1

Step 4: Maintenance

TEXT
Every 10 seconds:
Node1 (Leader): 
  - Renew lock (TTL extension)
  - Update xlog_location
  - Send heartbeat

Node2/3 (Followers):
  - Monitor leader key
  - Check replication lag
  - Ready to take over

Best Replica Selection Criteria

When failover occurs, Patroni selects replica based on:

  1. Replication state:
    • streaming > in archive recovery
  2. Timeline: Higher timeline preferred
  3. XLog position:
    • Replica with LSN closest to primary
    • Least data loss
  4. No replication lag:
    • pg_stat_replication.replay_lag = 0
  5. Explicit candidate: Set in configuration

Priority tag:

TEXT
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

Example:

TEXT
Primary fails at LSN: 0/3000000

Replica1: LSN=0/3000000, lag=0s     ← BEST CHOICE
Replica2: LSN=0/2FFFFFF, lag=1s
Replica3: LSN=0/2FFFFFE, lag=2s

→ Patroni promotes Replica1
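
The choice in the example can be expressed as a ranking over the criteria above. This is a simplification of Patroni's actual logic; the field names and the nofailover handling are illustrative:

TEXT
# Sketch: rank candidates by highest timeline, then highest LSN, then lowest lag,
# skipping members tagged nofailover.
def pick_failover_candidate(replicas):
    eligible = [r for r in replicas if not r.get("nofailover", False)]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r["timeline"], r["lsn"], -r["lag"]))

replicas = [
    {"name": "replica1", "timeline": 2, "lsn": 0x3000000, "lag": 0},
    {"name": "replica2", "timeline": 2, "lsn": 0x2FFFFFF, "lag": 1},
    {"name": "replica3", "timeline": 2, "lsn": 0x2FFFFFE, "lag": 2},
]
print(pick_failover_candidate(replicas)["name"])   # -> replica1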

5. Failover Mechanism

Automatic Failover Process

Detailed failover steps

Step 1: Detect failure

TEXT
# Patroni health check loop
while True:
    if not check_postgresql_health():
        log.error("PostgreSQL unhealthy")
        stop_renewing_leader_lock()
    
    if not check_dcs_connectivity():
        log.error("Lost connection to DCS")
        demote_if_leader()
    
    sleep(10)

Step 2: Leader lock expires

TEXT
# In etcd
$ etcdctl get /service/postgres/leader
# After TTL: Key not found

# Patroni logs on former leader
WARN: Could not renew leader lock
INFO: Demoting PostgreSQL to standby

Step 3: Replica promotion

TEXT
# Patroni on promoted replica
INFO: No leader found
INFO: Attempting to acquire leader lock
INFO: Lock acquired successfully
INFO: Promoting PostgreSQL instance
INFO: Updating configuration
INFO: Notifying other members

Step 4: Reconfiguration

TEXT
-- On promoted replica
SELECT pg_promote();

-- The standby exits recovery and begins accepting writes on a new timeline;
-- Patroni then clears primary_conninfo so this node no longer follows anyone

Step 5: Followers repoint

TEXT
# Other replicas
INFO: New leader detected: node2
INFO: Updating primary_conninfo
INFO: Restarting replication

Monitor Failover

Important metrics:

  • patroni_primary_timeline: Detect timeline changes
  • patroni_xlog_location: Track WAL position
  • patroni_replication_lag: Lag before failover
  • patroni_failover_count: Count number of failovers

6. Split-Brain Problem

What is Split-Brain?

Definition: A situation where two or more nodes each believe they are the primary and accept writes, causing their data to diverge.

Causes

  1. Network partition: Nodes are split into groups that cannot reach each other
  2. DCS partition: The etcd cluster itself is split
  3. Slow network: Heartbeats time out even though the node is still alive

Consequences of Split-Brain

Two primaries accepting writes at the same time produce divergent histories: conflicting transactions, writes that are lost when one side is discarded, and a recovery that requires pg_rewind or a full rebuild of the diverged node.

Patroni's Split-Brain Prevention

Mechanism 1: DCS-based Lock (Primary)

TEXT
def maintain_leader_lock():
    while is_leader:
        # Must renew within TTL
        success = dcs.renew_lock(ttl=30)
        
        if not success:
            log.critical("Lost leader lock!")
            # Immediate demotion
            demote_to_standby()
            stop_accepting_writes()
            break
        
        sleep(10)

Mechanism 2: Leader Key Verification

TEXT
def before_handle_write():
    leader_key = dcs.get("/service/postgres/leader")
    
    if leader_key.owner != my_node_name:
        # I'm not the real leader!
        demote_immediately()
        raise Exception("Not leader anymore")

Mechanism 3: Timeline Divergence Detection

TEXT
-- PostgreSQL timeline
SELECT timeline_id FROM pg_control_checkpoint();

-- If timelines diverge:
-- Node1: timeline=5
-- Node2: timeline=6
-- → Data inconsistency detected
-- → Requires pg_rewind or rebuild
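
One way to spot divergence from a script is to compare the timeline each member reports, for example via the REST API status endpoint (hostnames, the default port, and the "timeline" field in the /patroni output are assumptions):

TEXT
# Compare reported timelines; more than one distinct value suggests divergence.
import requests

timelines = {}
for node in ("node1", "node2", "node3"):
    status = requests.get(f"http://{node}:8008/patroni", timeout=2).json()
    timelines[node] = status.get("timeline")

print(timelines)
if len(set(timelines.values())) > 1:
    print("WARNING: timeline divergence detected, investigate before failback")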

Quorum requirement

etcd with 3 nodes:

TEXT
Scenario 1: Network partition 1-2 split
  Partition A: Node1 (1 node)
    - Cannot get quorum (1 < 2)
    - Cannot write to etcd
    - Demotes to standby ✓
  
  Partition B: Node2, Node3 (2 nodes)
    - Has quorum (2 ≥ 2)
    - Can elect leader
    - Node2 becomes primary ✓
  
Result: Only 1 primary exists ✓

Scenario 2: Complete isolation

TEXT
Node1: Isolated, loses DCS
  - Tries to renew lock → FAIL
  - Demotes PostgreSQL immediately
  - Stops accepting connections
  
Node2/3: See Node1 gone
  - Elect new leader
  - Only 1 primary in cluster ✓

Watchdog Timer (Advanced Protection)

Hardware watchdog:

TEXT
# patroni.yml
watchdog:
  mode: required  # or automatic, off
  device: /dev/watchdog
  safety_margin: 5

Operation:

  1. Patroni kicks watchdog device every 10s
  2. If Patroni hangs or loses DCS → stops kicking
  3. After timeout → Watchdog reboots entire node
  4. Prevents "zombie primary" scenario
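
The kicking itself is just a periodic write to the watchdog device; when the writes stop, the kernel driver resets the machine. A minimal illustration of the pattern (Patroni does this internally; only run something like this on a test machine, because opening /dev/watchdog arms the timer):

TEXT
# Illustration of the watchdog-kick pattern (Patroni handles this itself).
import time

with open("/dev/watchdog", "wb", buffering=0) as wd:
    try:
        for _ in range(6):          # pretend this is the healthy-leader loop
            wd.write(b"\0")         # each write resets the hardware timer
            time.sleep(10)
    finally:
        wd.write(b"V")              # "magic close": disarm before exiting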

Best Practices to Avoid Split-Brain

  1. Deploy DCS separately: etcd cluster in different AZs
  2. Monitor DCS health: Alert when etcd is unhealthy
  3. Network redundancy: Multiple network paths between nodes
  4. Proper timeouts:
TEXT
patroni:
  ttl: 30              # Leader lock TTL
  loop_wait: 10        # Check interval
  retry_timeout: 10    # DCS operation timeout
  5. Enable watchdog: Hardware protection layer
  6. Monitoring:
TEXT
# Check for timeline divergence
patronictl list

# Expected: All nodes same timeline
+ Cluster: postgres (7001234567890123456) ----+----+-----------+
| Member | Host         | Role    | State   | TL | Lag in MB |
+--------+--------------+---------+---------+----+-----------+
| node1  | 10.0.1.1:5432| Leader  | running | 5  |           |
| node2  | 10.0.1.2:5432| Replica | running | 5  |         0 |
| node3  | 10.0.1.3:5432| Replica | running | 5  |         0 |
+--------+--------------+---------+---------+----+-----------+

Recovery from Split-Brain

If split-brain occurs:

Step 1: Identify

TEXT
# Check timeline
patronictl list
# node1: timeline=5
# node2: timeline=6  ← DIVERGED!

Step 2: Choose primary

  • Usually the node with the more important or more complete data
  • Often, but not always, the node with the higher timeline

Step 3: Rebuild diverged replica

TEXT
# Option 1: Reinitialize the diverged member from the current leader
patronictl reinit postgres node2
# (Patroni can also run pg_rewind automatically when a former primary
#  rejoins, if postgresql.use_pg_rewind is enabled)

# Option 2: Full rebuild
# Stop Patroni on node2, wipe its data directory,
# then start Patroni again so it re-bootstraps from the new primary

Step 4: Verify

TEXT
patronictl list
# All nodes same timeline ✓

7. Summary

Key Takeaways

Patroni: HA template that automates PostgreSQL cluster management

DCS (etcd): Distributed coordination; stores configuration and the leader lock

Raft consensus: Ensures consistency and leader election in etcd

Leader election: Automatic, fast (~30-40s), based on TTL locks

Failover: Automatically promotes best replica when primary fails

Split-brain prevention: DCS quorum + TTL locks + watchdog

Combined Architecture

Putting the pieces together: a Patroni agent runs next to PostgreSQL on each database node, a 3-node etcd cluster acts as the DCS that holds the leader lock and configuration, and clients connect to whichever node currently holds the leader lock.

Review Questions

  1. How is Patroni different from pure Streaming Replication?
  2. Why do we need DCS? Can't we use the database to store state?
  3. What is the quorum in a 5-node cluster?
  4. Which replica does Patroni choose to promote during failover?
  5. How does split-brain occur and how does Patroni prevent it?
  6. What is the meaning of timeline in PostgreSQL?
  7. What does TTL 30 seconds mean? Why not set TTL = 5 seconds?

Preparation for next lesson

Lesson 4 will guide infrastructure preparation:

  • Setup 3 VMs/Servers
  • Network, firewall configuration
  • SSH keys, time sync
  • Required dependencies
