Recovering Failed Nodes
Learning Objectives
After this lesson, you will be able to:
- Rejoin an old primary after failover
- Resynchronize a diverged node with pg_rewind
- Rebuild a replica with pg_basebackup
- Handle timeline divergence
- Recover from split-brain scenarios
- Automate recovery with Patroni
1. Node Recovery Overview
1.1. Recovery Scenarios
When do you need to recover a node?
Scenario 1: Old primary after failover
Scenario 2: Replica disconnected
Scenario 3: Hardware replacement
Scenario 4: Timeline divergence
1.2. Recovery Methods
| Method | When to use | Time | Data loss |
|---|---|---|---|
| Auto-rejoin | Node had a clean shutdown | ~10s | None |
| pg_rewind | Timeline divergence | ~1-5min | None |
| pg_basebackup | Major corruption / Full rebuild | ~30min+ | None |
| Manual recovery | Complex split-brain scenarios | Varies | Possible |
2. Auto-Rejoin (Patroni Default)
2.1. How auto-rejoin works
When a node comes back online, Patroni checks the cluster state in the DCS; if the node shut down cleanly and its timeline matches the current leader's, Patroni simply starts PostgreSQL as a replica pointing at the leader and streaming replication catches it up.
2.2. Example: Clean rejoin
Setup: a healthy cluster in which node3 is running as a replica.
Simulate node3 failure:
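A minimal way to simulate the failure, assuming Patroni runs on node3 as a systemd service:
```bash
# On node3: stopping Patroni also stops the PostgreSQL instance it manages
sudo systemctl stop patroni
```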
Recovery:
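Recovery is just starting the service again and letting Patroni rejoin the node:
```bash
# On node3
sudo systemctl start patroni
```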
Log output: follow the Patroni log on node3 (for example with `journalctl -u patroni -f`) to watch the rejoin happen.
Verify:
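A quick check, using the cluster name `postgres` as elsewhere in this lesson:
```bash
# node3 should be listed as a running replica with near-zero lag
patronictl list postgres

# On node3: confirm it is in standby mode
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
```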
Time: ~10 seconds ✅
2.3. Configuration for auto-rejoin
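A sketch of the relevant patroni.yml fragment (values are illustrative; `bootstrap.dcs` settings only apply at cluster bootstrap and are changed later with `patronictl edit-config`):
```yaml
postgresql:
  use_pg_rewind: true        # allow Patroni to rewind a diverged former primary
  parameters:
    wal_log_hints: "on"      # prerequisite for pg_rewind
bootstrap:
  dcs:
    ttl: 30                  # leader-lock TTL in the DCS
    loop_wait: 10            # seconds between HA-loop iterations
    retry_timeout: 10
```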
3. Using pg_rewind
3.1. What is pg_rewind?
pg_rewind = Tool to resync a PostgreSQL instance that diverged from the current timeline.
When needed: after a failover, when the old primary holds WAL that the new primary never received, so its history has branched away from the cluster's current timeline.
How it works: pg_rewind locates the last checkpoint common to both data directories, scans the diverged node's WAL to find every block changed after that point, copies the current versions of those blocks (plus configuration and other non-relation files) from the source, and leaves the node ready to start as a standby and replay WAL on the new timeline.
3.2. Prerequisites for pg_rewind
Requirements:
- The target instance must have been shut down cleanly
- The target must have `wal_log_hints = on` or data checksums enabled (chosen at initdb time)
- A connection to the source server (or direct access to its data directory) with sufficient privileges
Why wal_log_hints? Hint-bit updates normally dirty pages without generating WAL, so pg_rewind would have no record that those pages changed. With `wal_log_hints = on` (or data checksums), the first such modification after a checkpoint is WAL-logged, allowing pg_rewind to identify every block that differs.
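A quick way to check both prerequisites on the node you plan to rewind (paths as used elsewhere in this lesson):
```bash
sudo -u postgres psql -c "SHOW wal_log_hints;"
sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep -i checksum
```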
3.3. Manual pg_rewind
Scenario: node1 (old primary) needs resync after failover.
Step 1: Stop PostgreSQL on node1
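Assuming Patroni manages the instance as a systemd service; stopping Patroni shuts PostgreSQL down, and pg_rewind needs a cleanly shut down target:
```bash
# On node1
sudo systemctl stop patroni

# If this does not report "shut down", start and stop PostgreSQL once
# so crash recovery can complete before rewinding
sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep "cluster state"
```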
Step 2: Run pg_rewind
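A sketch, assuming node2 is the new primary, the data directory used in this lesson, and a superuser connection to the source:
```bash
sudo -u postgres pg_rewind \
  --target-pgdata=/var/lib/postgresql/18/data \
  --source-server="host=node2 port=5432 user=postgres dbname=postgres" \
  --progress
```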
Step 3: Create standby.signal
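Marking the node as a standby (data directory path as used in this lesson):
```bash
sudo -u postgres touch /var/lib/postgresql/18/data/standby.signal
```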
Step 4: Update primary_conninfo
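One way to set the connection string, assuming node2 is the new primary and `replicator` is the replication role; alternatively, pass `-R` to pg_rewind in step 2 and it writes both standby.signal and primary_conninfo for you:
```bash
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<'EOF'
primary_conninfo = 'host=node2 port=5432 user=replicator password=CHANGE_ME'
EOF
```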
Step 5: Start PostgreSQL
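If the node is managed by Patroni, starting Patroni is enough (it will overwrite the replication settings it owns); otherwise start PostgreSQL directly:
```bash
sudo systemctl start patroni
# or, without Patroni:
# sudo -u postgres pg_ctl -D /var/lib/postgresql/18/data start
```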
Step 6: Verify
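Sanity checks after the rewind:
```bash
# node1 should appear as a replica on the cluster's current timeline
patronictl list postgres

# On node1: confirm standby mode and that replay is advancing
sudo -u postgres psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
```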
Time: ~1-5 minutes (depends on divergence size)
3.4. Automatic pg_rewind (Patroni)
Enable in patroni.yml:
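The keys below are Patroni settings; the two `remove_data_directory_*` fallbacks are optional and destructive, so enable them deliberately:
```yaml
postgresql:
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: true       # rebuild the replica if rewind fails
  remove_data_directory_on_diverged_timelines: false  # rebuild when the node cannot stream due to diverged timelines
```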
Behavior: when a rejoining node's timeline is behind the cluster's, Patroni runs pg_rewind against the current leader before starting the node as a replica; if the rewind fails and the `remove_data_directory_on_*` options are enabled, Patroni falls back to a full re-initialization.
Example log: follow `journalctl -u patroni -f` on the rejoining node to see the rewind being triggered and completing.
4. Full Rebuild with pg_basebackup
4.1. When to use pg_basebackup
Use cases:
- pg_rewind failed - Data too diverged
- Corruption detected - Data integrity issues
- Major version upgrade - Different PostgreSQL versions
- New node - Adding fresh replica to cluster
- Disk replaced - Empty data directory
- Paranoid safety - Want guaranteed clean state
Trade-off: Slower (~30min-2hrs for large DB) but guaranteed clean.
4.2. Manual pg_basebackup
Step 1: Stop and clean node
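Assuming the data directory used throughout this lesson; the `rm` is destructive, so double-check which host you are on:
```bash
# On the node being rebuilt
sudo systemctl stop patroni
sudo rm -rf /var/lib/postgresql/18/data/*
```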
Step 2: Take base backup from primary
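A sketch, assuming node1 is the current primary and `replicator` is the replication role; `-X stream` streams WAL during the copy, `-R` writes standby.signal and primary_conninfo, and `-P` reports progress:
```bash
sudo -u postgres pg_basebackup \
  -h node1 -p 5432 -U replicator \
  -D /var/lib/postgresql/18/data \
  -X stream -R --checkpoint=fast -P
```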
Output: with `-P`, pg_basebackup reports the amount copied and the percentage of the estimated total as it runs.
Step 3: Verify configuration
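The `-R` flag should have produced both pieces of standby configuration:
```bash
ls -l /var/lib/postgresql/18/data/standby.signal
grep primary_conninfo /var/lib/postgresql/18/data/postgresql.auto.conf
```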
Step 4: Start node
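With Patroni managing the node, starting the service is enough; Patroni adjusts the replication settings it owns on startup:
```bash
sudo systemctl start patroni
```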
Step 5: Verify
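Verification, using the cluster name assumed elsewhere in this lesson:
```bash
patronictl list postgres

# On the primary: the rebuilt node should show up as a streaming client
sudo -u postgres psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
```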
Time: ~30min-2hrs (depends on database size)
4.3. Patroni automatic reinit
Enable auto-reinit:
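Patroni has no single auto-reinit switch; the closest equivalents are the fallback options below, which tell Patroni to discard the data directory and rebuild the replica from the leader when a rewind is not possible (use with care):
```yaml
postgresql:
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: true
  remove_data_directory_on_diverged_timelines: true
```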
Behavior: when a member cannot be rewound (or its timeline has diverged and the corresponding option is set), Patroni removes its data directory and re-creates the replica from the current leader, so the node rejoins without manual intervention.
4.4. Patroni reinit command
Manual trigger:
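Using the cluster and member names from this lesson:
```bash
# Re-create node3 from the current leader of cluster "postgres"
patronictl reinit postgres node3    # add --force to skip the confirmation prompt
```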
Monitor progress:
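The member typically shows a transitional state (such as creating replica) before returning to running:
```bash
watch -n 2 patronictl list postgres
journalctl -u patroni -f            # on node3
```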
5. Timeline Divergence Resolution
5.1. Understanding timelines
Timeline = a counter identifying one branch of the cluster's WAL history; it starts at 1, increments every time a standby is promoted, and is embedded in WAL segment file names.
Why timelines exist: after a promotion, WAL written on the new branch must not be confused with WAL the old primary might still produce; the timeline ID keeps the two histories apart and is what lets pg_rewind and point-in-time recovery reason about where they diverged.
5.2. Detecting timeline divergence
Check local timeline:
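Two equivalent checks; the first also works while the server is stopped:
```bash
sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep -i timeline
sudo -u postgres psql -c "SELECT timeline_id FROM pg_control_checkpoint();"
```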
Check cluster timeline:
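patronictl reports each member's timeline in its TL column, and the REST API (default port 8008) exposes it as JSON:
```bash
patronictl list postgres
curl -s http://node1:8008/patroni
```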
Compare: the local TimeLineID should match the TL column shown by `patronictl list`; a node reporting a lower timeline has missed a failover and needs pg_rewind or a rebuild before it can rejoin.
5.3. Scenario: Timeline divergence after split-brain
Setup: a network partition isolates the primary; Patroni promotes a replica (the cluster timeline increments), while the isolated node keeps the old timeline and, if it is not fenced, may continue accepting writes.
Resolution:
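A hedged outline, assuming node1 is the node stuck on the stale timeline:
```bash
# 1. Fence the stale node so it stops accepting writes (on node1)
sudo systemctl stop patroni

# 2. Preserve the diverged data before touching anything
sudo tar czf /tmp/node1-diverged-$(date +%F).tar.gz /var/lib/postgresql/18/data

# 3. Rejoin node1 on the cluster's current timeline
patronictl reinit postgres node1    # or a manual pg_rewind as in section 3.3
```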
Prevention: rely on Patroni's DCS leader lock and watchdog, never promote a node manually while Patroni is running, and route client writes through the leader endpoint rather than to fixed hosts.
6. Split-Brain Prevention and Recovery
6.1. How Patroni prevents split-brain
Mechanism: DCS Leader Lock - only the member holding the leader key in the DCS (etcd) may run as primary; the key carries a TTL, and a primary that cannot renew it demotes itself before the TTL expires, so at most one writable node exists at a time.
Code flow (pseudo):
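A simplified sketch of the HA loop, not Patroni's actual code:
```text
every loop_wait seconds, on every member:
    if I hold the leader key in the DCS:
        renew the key (TTL = ttl)
        if renewal keeps failing:
            demote myself before the TTL expires
    else:
        if the leader key has expired:
            try to create it (atomic compare-and-set in the DCS)
            if I won the race and I am healthy and not too far behind:
                promote myself (timeline increments)
        else:
            keep following the current leader as a replica
```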
6.2. Fencing mechanisms
PostgreSQL-level fencing: a primary that loses the leader lock is demoted by Patroni (restarted as a read-only standby); callbacks such as `on_role_change` can additionally update load balancers or connection routing so clients stop writing to it.
OS-level fencing (advanced): a hardware or software watchdog resets the whole node if the Patroni process hangs and can no longer demote PostgreSQL in time.
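A sketch of the related patroni.yml settings; the `on_role_change` script path is hypothetical, while the `watchdog` block is a real Patroni feature that lets the kernel reset a node whose Patroni process stops refreshing /dev/watchdog:
```yaml
postgresql:
  callbacks:
    on_role_change: /usr/local/bin/update_routing.sh  # hypothetical routing/fencing hook
watchdog:
  mode: automatic       # "required" refuses to run as leader without a working watchdog
  device: /dev/watchdog
  safety_margin: 5
```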
6.3. Scenario: Recover from split-brain
Detection:
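One hint and one definitive check, assuming superuser access to each node:
```bash
# Two members claiming the leader role is the red flag
patronictl list postgres

# Exactly one node should answer false here
psql -h node1 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h node2 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h node3 -U postgres -c "SELECT pg_is_in_recovery();"
```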
Recovery steps:
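A hedged outline; the member name is a placeholder:
```bash
# 1. Keep the DCS-recognized leader; fence the rogue primary (on that node)
sudo systemctl stop patroni

# 2. Back up the rogue node's diverged data before rejoining it
sudo tar czf /tmp/rogue-$(date +%F).tar.gz /var/lib/postgresql/18/data

# 3. Rejoin it as a replica
patronictl reinit postgres <rogue_member>

# 4. Reconcile any writes that now exist only in the backup
```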
7. Monitoring Node Recovery
7.1. Key metrics
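Replication lag and timeline agreement are the signals worth watching; on the primary, a lag query might look like this:
```bash
sudo -u postgres psql -x -c "
  SELECT application_name, state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
         replay_lag
  FROM pg_stat_replication;"
```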
7.2. Patroni REST API monitoring
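Port 8008 is Patroni's default REST API port; the role endpoints return HTTP 200 only when the node currently has that role, which makes them convenient health checks:
```bash
curl -s -o /dev/null -w "%{http_code}\n" http://node1:8008/leader
curl -s -o /dev/null -w "%{http_code}\n" http://node1:8008/replica

# Full node status as JSON (role, state, timeline, ...)
curl -s http://node1:8008/patroni
```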
7.3. Alerting on recovery issues
8. Best Practices
✅ DO
- Enable wal_log_hints - Required for pg_rewind
- Test recovery regularly - Monthly drills
- Monitor timelines - Alert on divergence
- Have backups - Before risky operations
- Document procedures - Recovery runbooks
- Use Patroni auto-recovery - Less manual intervention
- Verify after recovery - Test replication, queries
- Keep DCS healthy - etcd cluster critical
- Log everything - Audit trail for incidents
- Practice split-brain recovery - Hopefully never needed, but be ready
❌ DON'T
- Don't skip wal_log_hints - pg_rewind will fail
- Don't assume auto-recovery works - Test it!
- Don't ignore timeline mismatches - Critical issue
- Don't manually promote during recovery - Let Patroni handle it
- Don't delete data without backup - Diverged data may be important
- Don't run split-brain clusters - Fix immediately
- Don't forget callbacks - Fencing prevents split-brain
- Don't over-automate reinit - Risk data loss
9. Lab Exercises
Lab 1: Auto-rejoin after clean shutdown
Tasks:
- Stop one replica: `sudo systemctl stop patroni`
- Make changes on the primary
- Start the replica: `sudo systemctl start patroni`
- Verify auto-rejoin and lag catch-up
- Time the recovery
Lab 2: pg_rewind after simulated failover
Tasks:
- Record current primary
- Manually stop the primary: `sudo systemctl stop patroni`
- Wait for failover to complete
- Start old primary (should auto-rewind)
- Verify old primary rejoined as replica
- Check timeline increment
Lab 3: Full rebuild with pg_basebackup
Tasks:
- Stop a replica
- Delete the data directory: `sudo rm -rf /var/lib/postgresql/18/data/*`
- Manually run pg_basebackup from the primary
- Start replica
- Verify replication restored
- Measure rebuild time
Lab 4: Patroni reinit command
Tasks:
- Use `patronictl reinit postgres node3`
- Monitor logs during the process
- Verify automated rebuild
- Compare time vs manual pg_basebackup
Lab 5: Timeline divergence simulation
Tasks:
- Create network partition (iptables)
- Wait for failover
- Manually promote old primary (force split-brain)
- Write different data to both "primaries"
- Restore network
- Observe conflict detection
- Practice recovery procedure
10. Troubleshooting
Issue: pg_rewind fails
Error: pg_rewind: fatal: could not find common ancestor
Cause: wal_log_hints not enabled or data too diverged.
Solution:
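A hedged recovery path: put the prerequisite in place for next time, then rebuild the node that cannot be rewound now (cluster name as elsewhere in this lesson):
```bash
# 1. Enable wal_log_hints cluster-wide for future rewinds (needs a restart)
patronictl edit-config postgres --pg wal_log_hints=on
patronictl restart postgres

# 2. Rebuild the failed node from the leader
patronictl reinit postgres <member_name>
```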
Issue: Replica stuck in recovery
Symptoms: Replica shows "running" but high lag.
Diagnosis:
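A few checks on the stuck replica (paths as elsewhere in this lesson):
```bash
# Is the WAL receiver alive and recently active?
sudo -u postgres psql -c "SELECT status, sender_host, last_msg_receipt_time FROM pg_stat_wal_receiver;"

# How far behind is replay?
sudo -u postgres psql -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Disk space and recent Patroni/PostgreSQL errors
df -h /var/lib/postgresql
sudo journalctl -u patroni --since "10 minutes ago"
```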
Common causes:
- WAL receiver crashed
- Network issues
- Disk full on replica
- Archive restore errors
Solution: address the underlying cause (restart Patroni on the replica to restart the WAL receiver, fix network problems, free disk space), then confirm the lag is shrinking. If the replica can no longer catch up because the required WAL has been removed, rebuild it with `patronictl reinit` or pg_basebackup.
Issue: Cannot connect after recovery
Error: FATAL: the database system is starting up
Cause: PostgreSQL still replaying WAL.
Solution: Wait for recovery to complete, or check logs for errors.
11. Summary
Recovery Methods Summary
| Method | Speed | Data Loss | Use Case |
|---|---|---|---|
| Auto-rejoin | Fastest | None | Clean shutdown/restart |
| pg_rewind | Fast | None | Timeline divergence |
| pg_basebackup | Slow | None | Corruption, major divergence |
| Manual recovery | Varies | Possible | Split-brain, complex issues |
Key Concepts
✅ Auto-rejoin - Patroni handles clean recovery automatically
✅ pg_rewind - Resync after timeline divergence (requires wal_log_hints)
✅ pg_basebackup - Full rebuild from primary (slow but safe)
✅ Timeline - History branch, increments on failover
✅ Split-brain - Multiple primaries (prevented by DCS leader lock)
Recovery Checklist
- Node failure detected
- Determine recovery method needed
- Backup diverged data (if any)
- Execute recovery (auto or manual)
- Verify timeline matches cluster
- Verify replication streaming
- Test read/write operations
- Check replication lag
- Update monitoring/documentation
Next Steps
Lesson 16 will cover Backup and Point-in-Time Recovery:
- pg_basebackup strategies
- WAL archiving configuration
- Point-in-Time Recovery (PITR) procedures
- Backup automation and scheduling
- Disaster recovery planning