Introduction to Patroni and etcd
Objectives
After this lesson, you will understand:
- What Patroni is and how it works
- DCS (Distributed Configuration Store) - etcd/Consul/ZooKeeper
- Consensus algorithm (Raft)
- Leader election & Failover mechanism
- Split-brain problem and how to solve it
1. What is Patroni?
Introduction
Patroni is an open-source HA (High Availability) template for PostgreSQL, developed by Zalando. It automates PostgreSQL cluster management, including:
- Leader election: Automatically selects which node becomes the primary
- Automatic failover: Promotes a replica when the primary fails
- Configuration management: Centralized configuration for the whole cluster
- Health checking: Continuous monitoring of node health
Patroni Architecture
In a typical deployment, a Patroni daemon runs alongside PostgreSQL on every database node, all daemons coordinate through a separate DCS (usually a 3-node etcd cluster), and clients reach the current primary through a routing layer such as HAProxy.
How Patroni works
- Startup: Each Patroni instance connects to DCS (etcd)
- Leader election: Nodes compete to become leader in DCS
- Role assignment: Node that wins leader lock promotes PostgreSQL to primary
- Health monitoring: Patroni continuously checks:
- PostgreSQL process health
- Replication status
- DCS connectivity
- Auto failover: If the leader fails, Patroni automatically:
- Detects the issue
- Selects the most suitable replica
- Promotes it to primary
- Repoints the remaining replicas (this cycle is sketched in code right after this list)
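The loop below is a minimal sketch of this cycle in Python, assuming hypothetical dcs and postgres helper objects (acquire_or_refresh_leader_lock, promote, follow and so on are illustrative names, not Patroni's real internals):

```python
import time

LOOP_WAIT = 10  # seconds between HA-loop iterations (Patroni's loop_wait default)

def ha_loop(dcs, postgres, my_name):
    """Simplified sketch of the decision cycle each Patroni daemon runs."""
    while True:
        if not postgres.is_healthy():
            # An unhealthy node must not hold (or keep) the leader lock.
            dcs.release_leader_lock(my_name)
        elif dcs.acquire_or_refresh_leader_lock(my_name):
            # We own the leader key: make sure PostgreSQL runs as primary.
            if postgres.role() != "primary":
                postgres.promote()
        else:
            # Someone else holds the lock: follow whoever the DCS says is leader.
            leader = dcs.get_leader()
            if postgres.role() != "replica" or postgres.upstream() != leader:
                postgres.follow(leader)
        time.sleep(LOOP_WAIT)
```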
Main components
Patroni daemon
- Runs on each PostgreSQL node
- Manages PostgreSQL lifecycle
- Performs health checks
- Interacts with DCS
REST API
- Health check endpoint: http://node:8008/health
- Read-only endpoint: http://node:8008/read-only
- Primary endpoint: http://node:8008/primary (the older /master alias is deprecated)
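A quick way to exercise these endpoints is a few lines of Python with requests; node1 is an assumed hostname and 8008 is Patroni's default REST port:

```python
import requests

NODE = "http://node1:8008"  # hypothetical node address; 8008 is Patroni's default REST port

# /health returns 200 while the local PostgreSQL is up, regardless of its role.
print("/health ->", requests.get(f"{NODE}/health", timeout=2).status_code)

# /primary answers 200 only on the leader; /replica and /read-only answer 200 only on
# replicas, which is what load balancers such as HAProxy use for routing decisions.
for endpoint in ("/primary", "/replica", "/read-only"):
    print(endpoint, "->", requests.get(NODE + endpoint, timeout=2).status_code)
```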
patronictl
- CLI tool to manage cluster
- Commands: list, switchover, failover, reinit, restart, reload
2. DCS - Distributed Configuration Store
Role of DCS
DCS is the coordination center of a Patroni cluster, storing:
- Leader key: which node is currently the leader (TTL-based)
- Configuration: dynamic PostgreSQL and Patroni configuration
- Member information: the list of nodes in the cluster and their state
- Failover/Switchover state: pending manual failover or switchover requests
Comparison of popular DCS
| Feature | etcd | Consul | ZooKeeper |
|---|---|---|---|
| Language | Go | Go | Java |
| Consensus | Raft | Raft | ZAB (Paxos-like) |
| API | gRPC, HTTP | HTTP, DNS | Custom protocol |
| Setup | Simple | Medium | Complex |
| Performance | High | High | Medium |
| Documentation | Good | Very Good | Medium |
| Usage | Kubernetes, Patroni | Service mesh, HA | Hadoop, Kafka |
Recommendation: etcd for most cases because of simplicity and high performance.
etcd - Distributed Key-Value Store
Main characteristics:
- Strongly consistent (CAP theorem: CP)
- Distributed and highly available
- Fast (sub-millisecond latency)
- Simple API
- Watch mechanism for real-time updates
Data structure in etcd for Patroni:
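As an illustration, the sketch below lists the keys Patroni typically keeps under its namespace (by default /service/<scope>), read with the python-etcd3 client; the scope name postgres-cluster and the local etcd address are assumptions:

```python
import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)  # 2379 is etcd's default client port

# Patroni stores its state under /<namespace>/<scope>/..., typically:
#   /service/postgres-cluster/config      - dynamic PostgreSQL/Patroni configuration
#   /service/postgres-cluster/leader      - name of the current leader (key carries a TTL)
#   /service/postgres-cluster/members/... - one key per member with its reported state
#   /service/postgres-cluster/initialize  - system identifier of the initialized cluster
#   /service/postgres-cluster/history     - timeline (failover) history
for value, meta in client.get_prefix("/service/postgres-cluster/"):
    print(meta.key.decode(), "=", value.decode())
```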
3. Consensus Algorithm - Raft
What is Raft?
Raft is a consensus algorithm designed to be easier to understand than Paxos, ensuring:
- Safety: Never returns incorrect results
- Liveness: Always makes progress (as long as a majority of nodes is operational)
- Consistency: All nodes see the same state
Roles in Raft
- Leader:
- Handles all client requests
- Replicates log entries to followers
- At most one leader exists per term
- Follower:
- Passive, only receives requests from leader
- If no heartbeat received, becomes candidate
- Candidate:
- A follower whose election timeout expires becomes a candidate
- Requests votes from the other nodes
- If it wins the election → Leader
Leader Election Process
When the current leader disappears, the remaining nodes elect a new one as follows.
Detailed election:
- Follower doesn't receive heartbeat within election timeout (150-300ms random)
- Becomes Candidate, increases term number
- Votes for itself
- Sends RequestVote RPC to all nodes
- If it receives a majority of votes (floor(n/2) + 1):
- Becomes Leader
- Sends heartbeat immediately
- If it times out or loses the election:
- Returns to Follower or starts a new election (see the sketch below)
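The fragment below sketches only the timeout-and-vote part of this sequence, with hypothetical node and cluster objects; real Raft additionally compares log freshness inside RequestVote and checks terms on every message:

```python
import random

def election_timeout_ms():
    # Randomized per node (and per election) so that two followers rarely time out together.
    return random.uniform(150, 300)

def start_election(node, cluster):
    """Sketch of what a follower does when its election timeout fires."""
    node.term += 1                 # move to a new term
    node.role = "candidate"
    node.voted_for = node.name     # vote for itself
    votes = 1

    for peer in cluster.peers(node):
        # RequestVote RPC: a peer grants its vote if it has not yet voted in this term
        # and the candidate's log is at least as up to date as its own.
        if peer.request_vote(term=node.term, candidate=node.name):
            votes += 1

    if votes >= cluster.size() // 2 + 1:   # majority = floor(n/2) + 1
        node.role = "leader"
        cluster.broadcast_heartbeat(node)  # assert leadership immediately
    else:
        node.role = "follower"             # wait for the next timeout / a new election
```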
Quorum and Majority
Quorum: Minimum number of nodes needed for system to operate
Formula: Quorum = floor(n/2) + 1
Example with 3 nodes:
- ✅ 3 nodes active: Cluster healthy
- ✅ 2 nodes active: Cluster works (quorum met)
- ❌ 1 node active: Cluster stops (no quorum)
Recommendation: Always use an odd number of nodes (3, 5, 7) to optimize fault tolerance; the small calculation below shows why.
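A few lines of Python make the formula and the odd-number recommendation concrete:

```python
def quorum(n: int) -> int:
    """Minimum number of nodes that must agree for the cluster to make progress."""
    return n // 2 + 1

for n in (1, 2, 3, 4, 5, 7):
    # Fault tolerance = how many nodes may fail while a quorum is still reachable.
    print(f"{n} nodes -> quorum {quorum(n)}, tolerates {n - quorum(n)} failure(s)")

# Note that 4 nodes tolerate no more failures than 3, which is why odd sizes are preferred.
```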
4. Leader Election in Patroni
Leader Lock Mechanism
Patroni uses the DCS to implement a distributed lock:
Leader Lock Properties:
- The leader key names the current leader and is written with a TTL (30 seconds by default)
- Only one member can hold it at a time, because the DCS creates the key atomically
- The holder must keep refreshing it; when refreshes stop, the key expires and the lock is released
Leader Election Process
Step 1: Race condition - every healthy node tries to create the leader key at roughly the same time
Step 2: Acquire lock attempt - the DCS guarantees that exactly one create succeeds; the others fail cleanly and remain replicas
Step 3: Role assignment - the winner promotes its PostgreSQL to primary; the losers configure themselves to follow it
Step 4: Maintenance - the leader refreshes the key on every HA loop; if it cannot (crash, hang, lost DCS connection), the key expires and a new election starts (a TTL-lock sketch against etcd follows below)
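A minimal sketch of such a TTL-based lock against etcd v3, using python-etcd3; the key path mirrors Patroni's default namespace and the scope name is an assumption:

```python
import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)
LEADER_KEY = "/service/postgres-cluster/leader"   # assumed cluster scope
TTL = 30                                          # Patroni's default leader TTL in seconds

def try_acquire_leader(my_name: str):
    """Create the leader key only if it does not exist yet, bound to a TTL lease."""
    lease = client.lease(TTL)
    acquired, _ = client.transaction(
        compare=[client.transactions.version(LEADER_KEY) == 0],    # key must not exist
        success=[client.transactions.put(LEADER_KEY, my_name, lease)],
        failure=[],
    )
    return lease if acquired else None

def keep_leader(lease):
    """The leader must refresh the lease before it expires, or the lock is silently lost."""
    lease.refresh()
```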
Best Replica Selection Criteria
When failover occurs, Patroni selects the replica to promote based on:
- Replication state: streaming replicas are preferred over replicas still in archive recovery
- Timeline: Higher timeline preferred
- XLog position:
- Replica with LSN closest to primary
- Least data loss
- No replication lag: ideally pg_stat_replication.replay_lag = 0; replicas lagging beyond maximum_lag_on_failover are disqualified
- Explicit candidate: Set in configuration
Priority tags: nodes can also be tagged (for example nofailover: true) to exclude them from promotion or to bias the choice; a selection sketch follows below.
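The selection can be pictured roughly as the Python sketch below; the member fields and the priority tag are illustrative, not Patroni's exact internals:

```python
def pick_failover_candidate(replicas, max_lag_bytes=1_048_576):
    """Pick the replica that would lose the least data, mirroring the criteria above."""
    eligible = [
        r for r in replicas
        if not r["tags"].get("nofailover", False)   # explicitly excluded nodes never win
        and r["lag_bytes"] <= max_lag_bytes         # too far behind -> not a candidate
    ]
    if not eligible:
        return None
    # Prefer the higher timeline, then the highest replayed LSN (closest to the old
    # primary), then any explicit priority tag as a tie-breaker.
    return max(eligible, key=lambda r: (r["timeline"],
                                        r["replayed_lsn"],
                                        r["tags"].get("priority", 0)))

replicas = [
    {"name": "node2", "timeline": 5, "replayed_lsn": 0xA0000100, "lag_bytes": 0, "tags": {}},
    {"name": "node3", "timeline": 5, "replayed_lsn": 0xA0000000, "lag_bytes": 512,
     "tags": {"nofailover": True}},
]
print(pick_failover_candidate(replicas)["name"])  # -> node2
```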
5. Failover Mechanism
Automatic Failover Process
Detailed timeline: with default settings the replicas notice the expired leader key within ttl + loop_wait (roughly 30-40 seconds), and promotion itself adds only a few more seconds.
Detailed failover steps
Step 1: Detect failure - the failed primary stops refreshing its leader key; replicas notice on their next HA loop
Step 2: Leader lock expires - after the TTL (30 seconds by default) the key disappears from the DCS
Step 3: Replica promotion - the best remaining replica (see the criteria above) acquires the lock and promotes its PostgreSQL
Step 4: Reconfiguration - the new primary starts a new timeline and publishes its connection info in the DCS
Step 5: Followers repoint - the other replicas update primary_conninfo to stream from the new leader, using pg_rewind or a re-clone if their history diverged (a timing sketch follows below)
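To get a feel for the timing, the sketch below derives a rough worst-case window from the settings that drive it (the values are Patroni defaults; the formula is an approximation, not an exact model):

```python
# Patroni's default timing parameters, in seconds
ttl = 30                 # how long the leader key lives without being refreshed
loop_wait = 10           # how often each Patroni daemon runs its HA loop
promotion_allowance = 5  # rough allowance for promotion + repointing (assumption)

# Worst case: the primary dies right after refreshing its key, so replicas only see
# the key disappear once the TTL runs out, and then act on their next loop iteration.
detection = ttl + loop_wait
print(f"approximate failover window: up to ~{detection + promotion_allowance}s "
      f"(often 30-40s in practice)")
```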
Monitor Failover
Important metrics:
- patroni_primary_timeline: detect timeline changes
- patroni_xlog_location: track WAL position
- patroni_replication_lag: lag before failover
- patroni_failover_count: count the number of failovers
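One simple way to watch these values between scrapes is to poll Patroni's REST status endpoint; the sketch below reads GET /patroni with requests (the node address is an assumption):

```python
import requests

# The /patroni endpoint returns the node's status as JSON, including (among other
# fields) its role, state, current timeline and WAL position.
status = requests.get("http://node1:8008/patroni", timeout=2).json()
print(status.get("role"), status.get("state"), status.get("timeline"), status.get("xlog"))
```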
6. Split-Brain Problem
What is Split-Brain?
Definition: Situation where ≥2 nodes think they are Primary, writing different data → Data divergence.
Causes
- Network partition: the primary is cut off from the other nodes but keeps running
- DCS partition: the etcd cluster itself splits
- Slow network: heartbeats time out even though the node is still alive
Consequences of Split-Brain
- Two primaries accept writes on diverging timelines
- Applications see inconsistent data depending on which node they reach
- One side's writes must eventually be discarded or merged by hand → data loss and extra downtime
Patroni's Split-Brain Prevention
Mechanism 1: DCS-based Lock (Primary) - only the member holding the leader key may run PostgreSQL as primary; a node that cannot reach the DCS to refresh the key demotes itself
Mechanism 2: Leader Key Verification - on every HA loop the primary re-checks that it still owns the leader key before continuing as primary
Mechanism 3: Timeline Divergence Detection - when a former primary rejoins, Patroni compares timelines; a diverged node is not allowed back as primary and must be rewound with pg_rewind or re-initialized (see the sketch below)
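The core safety rule behind these mechanisms fits in a few lines, sketched here with hypothetical dcs and postgres helpers: a primary that cannot prove it still owns the leader key demotes itself rather than keep accepting writes.

```python
def enforce_leader_lock(dcs, postgres, my_name):
    """Run on every HA loop: a primary without the leader key must stop being primary."""
    try:
        holder = dcs.get_leader()      # read the leader key from etcd
    except ConnectionError:
        holder = None                  # an unreachable DCS counts as "not proven leader"

    if postgres.role() == "primary" and holder != my_name:
        # Better to be briefly unavailable than to accept writes that will diverge.
        postgres.demote()
```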
Quorum requirement
etcd with 3 nodes: the cluster stays writable as long as 2 of the 3 members can reach each other.
Scenario 1: Minority partition - the node cut off from the other two loses quorum; the majority side keeps serving Patroni, so only one leader key can exist.
Scenario 2: Complete isolation - a primary whose Patroni can reach no etcd member cannot refresh the leader key, so it demotes PostgreSQL instead of risking a second primary.
Watchdog Timer (Advanced Protection)
Hardware watchdog: Patroni can open a watchdog device (for example /dev/watchdog, or the softdog kernel module) that forcibly resets the node if Patroni stops responding; a configuration sketch follows after the list below.
Operation:
- Patroni kicks watchdog device every 10s
- If Patroni hangs or loses DCS → stops kicking
- After timeout → Watchdog reboots entire node
- Prevents "zombie primary" scenario
Best Practices to Avoid Split-Brain
- Deploy DCS separately: etcd cluster in different AZs
- Monitor DCS health: Alert when etcd is unhealthy
- Network redundancy: Multiple network paths between nodes
- Proper timeouts: keep ttl, loop_wait and retry_timeout consistent (a common guideline is ttl ≥ loop_wait + 2 × retry_timeout)
- Enable watchdog: Hardware protection layer
- Monitoring: Alert on leader changes, rising replication lag and DCS errors
Recovery from Split-Brain
If split-brain occurs:
Step 1: Identify - use patronictl list plus each node's timeline and LSN to determine which nodes diverged and when
Step 2: Choose primary
- Choose node with more important data
- Or node with higher timeline
Step 3: Rebuild diverged replica - wipe its data directory and re-clone it from the chosen primary (patronictl reinit, pg_rewind or a fresh base backup; see the sketch below)
Step 4: Verify - confirm there is exactly one leader, all replicas are streaming from it, and replication lag returns to normal
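Rebuilding a diverged node is typically done with patronictl reinit, sketched here via subprocess; the config path, cluster and member names are placeholders:

```python
import subprocess

# Wipe the diverged member's data directory and re-clone it from the current leader.
# --force skips the interactive confirmation; use it deliberately.
subprocess.run(
    ["patronictl", "-c", "/etc/patroni.yml", "reinit", "postgres-cluster", "node2", "--force"],
    check=True,
)
```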
7. Summary
Key Takeaways
✅ Patroni: HA template that automates PostgreSQL cluster management
✅ DCS (etcd): Distributed coordination, store configuration and leader lock
✅ Raft consensus: Ensures consistency and leader election in etcd
✅ Leader election: Automatic, fast (~30-40s), based on TTL locks
✅ Failover: Automatically promotes best replica when primary fails
✅ Split-brain prevention: DCS quorum + TTL locks + watchdog
Combined Architecture
A production setup combines a 3-node etcd cluster (quorum and leader lock), PostgreSQL nodes each running a Patroni daemon, and a routing layer (HAProxy or similar) that sends writes to whichever node currently holds the leader key.
Review Questions
- How is Patroni different from pure Streaming Replication?
- Why do we need DCS? Can't we use the database to store state?
- What is the quorum in a 5-node cluster?
- Which replica does Patroni choose to promote during failover?
- How does split-brain occur and how does Patroni prevent it?
- What is the meaning of timeline in PostgreSQL?
- What does TTL 30 seconds mean? Why not set TTL = 5 seconds?
Preparation for next lesson
Lesson 4 will guide infrastructure preparation:
- Setup 3 VMs/Servers
- Network, firewall configuration
- SSH keys, time sync
- Required dependencies