Recovering Failed Nodes
Learning Objectives
After this lesson, you will be able to:
- Rejoin an old primary after failover
- Resynchronize a diverged node with pg_rewind
- Rebuild a replica with pg_basebackup
- Handle timeline divergence
- Recover from split-brain scenarios
- Automate recovery with Patroni
1. Node Recovery Overview
1.1. Recovery Scenarios
When do you need to recover a node?
Scenario 1: Old primary after failover
Scenario 2: Replica disconnected
Scenario 3: Hardware replacement
Scenario 4: Timeline divergence
1.2. Recovery Methods
| Method | When to use | Time | Data loss |
|---|---|---|---|
| Auto-rejoin | Node had a clean shutdown | ~10s | None |
| pg_rewind | Timeline divergence | ~1-5min | None |
| pg_basebackup | Major corruption / Full rebuild | ~30min+ | None |
| Manual recovery | Complex split-brain scenarios | Varies | Possible |
2. Auto-Rejoin (Patroni Default)
2.1. How auto-rejoin works
When a node comes back online, Patroni checks the cluster state in the DCS; if the node shut down cleanly and its timeline matches the current leader's, Patroni simply starts PostgreSQL as a replica pointing at the leader and streaming replication catches it up.
2.2. Example: Clean rejoin
Setup: a healthy cluster in which node3 is running as a replica.
Simulate node3 failure:
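A minimal way to simulate the failure, assuming Patroni runs on node3 as a systemd service:
```bash
# On node3: stopping Patroni also stops the PostgreSQL instance it manages
sudo systemctl stop patroni
```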
Recovery:
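Recovery is just starting the service again and letting Patroni rejoin the node:
```bash
# On node3
sudo systemctl start patroni
```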
Log output: follow the Patroni log on node3 (for example with `journalctl -u patroni -f`) to watch the rejoin happen.
Verify:
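A quick check, using the cluster name `postgres` as elsewhere in this lesson:
```bash
# node3 should be listed as a running replica with near-zero lag
patronictl list postgres

# On node3: confirm it is in standby mode
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
```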
Time: ~10 seconds ✅
2.3. Configuration for auto-rejoin
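A sketch of the relevant patroni.yml fragment (values are illustrative; `bootstrap.dcs` settings only apply at cluster bootstrap and are changed later with `patronictl edit-config`):
```yaml
postgresql:
  use_pg_rewind: true        # allow Patroni to rewind a diverged former primary
  parameters:
    wal_log_hints: "on"      # prerequisite for pg_rewind
bootstrap:
  dcs:
    ttl: 30                  # leader-lock TTL in the DCS
    loop_wait: 10            # seconds between HA-loop iterations
    retry_timeout: 10
```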
3. Using pg_rewind
3.1. What is pg_rewind?
pg_rewind = Tool to resync a PostgreSQL instance that diverged from the current timeline.
When needed: after a failover, when the old primary holds WAL that the new primary never received, so its history has branched away from the cluster's current timeline.
How it works: pg_rewind locates the last checkpoint common to both data directories, scans the diverged node's WAL to find every block changed after that point, copies the current versions of those blocks (plus configuration and other non-relation files) from the source, and leaves the node ready to start as a standby and replay WAL on the new timeline.
3.2. Prerequisites for pg_rewind
Requirements:
- The target instance must have been shut down cleanly
- The target must have `wal_log_hints = on` or data checksums enabled (chosen at initdb time)
- A connection to the source server (or direct access to its data directory) with sufficient privileges
Why wal_log_hints? Hint-bit updates normally dirty pages without generating WAL, so pg_rewind would have no record that those pages changed. With `wal_log_hints = on` (or data checksums), the first such modification after a checkpoint is WAL-logged, allowing pg_rewind to identify every block that differs.
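A quick way to check both prerequisites on the node you plan to rewind (paths as used elsewhere in this lesson):
```bash
sudo -u postgres psql -c "SHOW wal_log_hints;"
sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep -i checksum
```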
3.3. Manual pg_rewind
Scenario: node1 (old primary) needs resync after failover.
Step 1: Stop PostgreSQL on node1
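Assuming Patroni manages the instance as a systemd service; stopping Patroni shuts PostgreSQL down, and pg_rewind needs a cleanly shut down target:
```bash
# On node1
sudo systemctl stop patroni

# If this does not report "shut down", start and stop PostgreSQL once
# so crash recovery can complete before rewinding
sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep "cluster state"
```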
Step 2: Run pg_rewind
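A sketch, assuming node2 is the new primary, the data directory used in this lesson, and a superuser connection to the source:
```bash
sudo -u postgres pg_rewind \
  --target-pgdata=/var/lib/postgresql/18/data \
  --source-server="host=node2 port=5432 user=postgres dbname=postgres" \
  --progress
```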
Step 3: Create standby.signal
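Marking the node as a standby (data directory path as used in this lesson):
```bash
sudo -u postgres touch /var/lib/postgresql/18/data/standby.signal
```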
Step 4: Update primary_conninfo
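One way to set the connection string, assuming node2 is the new primary and `replicator` is the replication role; alternatively, pass `-R` to pg_rewind in step 2 and it writes both standby.signal and primary_conninfo for you:
```bash
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<'EOF'
primary_conninfo = 'host=node2 port=5432 user=replicator password=CHANGE_ME'
EOF
```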
Step 5: Start PostgreSQL
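If the node is managed by Patroni, starting Patroni is enough (it will overwrite the replication settings it owns); otherwise start PostgreSQL directly:
```bash
sudo systemctl start patroni
# or, without Patroni:
# sudo -u postgres pg_ctl -D /var/lib/postgresql/18/data start
```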
Step 6: Verify
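Sanity checks after the rewind:
```bash
# node1 should appear as a replica on the cluster's current timeline
patronictl list postgres

# On node1: confirm standby mode and that replay is advancing
sudo -u postgres psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
```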
Time: ~1-5 minutes (depends on divergence size)
3.4. Automatic pg_rewind (Patroni)
Enable in patroni.yml:
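The keys below are Patroni settings; the two `remove_data_directory_*` fallbacks are optional and destructive, so enable them deliberately:
```yaml
postgresql:
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: true       # rebuild the replica if rewind fails
  remove_data_directory_on_diverged_timelines: false  # rebuild when the node cannot stream due to diverged timelines
```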
Behavior: when a rejoining node's timeline is behind the cluster's, Patroni runs pg_rewind against the current leader before starting the node as a replica; if the rewind fails and the `remove_data_directory_on_*` options are enabled, Patroni falls back to a full re-initialization.
Example log: follow `journalctl -u patroni -f` on the rejoining node to see the rewind being triggered and completing.
4. Full Rebuild with pg_basebackup
4.1. When to use pg_basebackup
Use cases:
- pg_rewind failed - Data too diverged
- Corruption detected - Data integrity issues
- Major version upgrade - Different PostgreSQL versions
- New node - Adding fresh replica to cluster
- Disk replaced - Empty data directory
- Paranoid safety - Want guaranteed clean state
Trade-off: Slower (~30min-2hrs for large DB) but guaranteed clean.
4.2. Manual pg_basebackup
Step 1: Stop and clean node
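Assuming the data directory used throughout this lesson; the `rm` is destructive, so double-check which host you are on:
```bash
# On the node being rebuilt
sudo systemctl stop patroni
sudo rm -rf /var/lib/postgresql/18/data/*
```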
Step 2: Take base backup from primary
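A sketch, assuming node1 is the current primary and `replicator` is the replication role; `-X stream` streams WAL during the copy, `-R` writes standby.signal and primary_conninfo, and `-P` reports progress:
```bash
sudo -u postgres pg_basebackup \
  -h node1 -p 5432 -U replicator \
  -D /var/lib/postgresql/18/data \
  -X stream -R --checkpoint=fast -P
```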
Output: with `-P`, pg_basebackup reports the amount copied and the percentage of the estimated total as it runs.
Step 3: Verify configuration
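The `-R` flag should have produced both pieces of standby configuration:
```bash
ls -l /var/lib/postgresql/18/data/standby.signal
grep primary_conninfo /var/lib/postgresql/18/data/postgresql.auto.conf
```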
Step 4: Start node
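With Patroni managing the node, starting the service is enough; Patroni adjusts the replication settings it owns on startup:
```bash
sudo systemctl start patroni
```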
Step 5: Verify
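Verification, using the cluster name assumed elsewhere in this lesson:
```bash
patronictl list postgres

# On the primary: the rebuilt node should show up as a streaming client
sudo -u postgres psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
```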
Time: ~30min-2hrs (depends on database size)
4.3. Patroni automatic reinit
Enable auto-reinit:
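Patroni has no single auto-reinit switch; the closest equivalents are the fallback options below, which tell Patroni to discard the data directory and rebuild the replica from the leader when a rewind is not possible (use with care):
```yaml
postgresql:
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: true
  remove_data_directory_on_diverged_timelines: true
```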
Behavior: when a member cannot be rewound (or its timeline has diverged and the corresponding option is set), Patroni removes its data directory and re-creates the replica from the current leader, so the node rejoins without manual intervention.
4.4. Patroni reinit command
Manual trigger:
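Using the cluster and member names from this lesson:
```bash
# Re-create node3 from the current leader of cluster "postgres"
patronictl reinit postgres node3    # add --force to skip the confirmation prompt
```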
Monitor progress:
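The member typically shows a transitional state (such as creating replica) before returning to running:
```bash
watch -n 2 patronictl list postgres
journalctl -u patroni -f            # on node3
```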
5. Timeline Divergence Resolution
5.1. Understanding timelines
Timeline = a counter identifying one branch of the cluster's WAL history; it starts at 1, increments every time a standby is promoted, and is embedded in WAL segment file names.
Why timelines exist: after a promotion, WAL written on the new branch must not be confused with WAL the old primary might still produce; the timeline ID keeps the two histories apart and is what lets pg_rewind and point-in-time recovery reason about where they diverged.
5.2. Detecting timeline divergence
Check local timeline:
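Two equivalent checks; the first also works while the server is stopped:
```bash
sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep -i timeline
sudo -u postgres psql -c "SELECT timeline_id FROM pg_control_checkpoint();"
```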
Check cluster timeline:
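patronictl reports each member's timeline in its TL column, and the REST API (default port 8008) exposes it as JSON:
```bash
patronictl list postgres
curl -s http://node1:8008/patroni
```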
Compare: the local TimeLineID should match the TL column shown by `patronictl list`; a node reporting a lower timeline has missed a failover and needs pg_rewind or a rebuild before it can rejoin.
5.3. Scenario: Timeline divergence after split-brain
Setup: a network partition isolates the primary; Patroni promotes a replica (the cluster timeline increments), while the isolated node keeps the old timeline and, if it is not fenced, may continue accepting writes.
Resolution:
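A hedged outline, assuming node1 is the node stuck on the stale timeline:
```bash
# 1. Fence the stale node so it stops accepting writes (on node1)
sudo systemctl stop patroni

# 2. Preserve the diverged data before touching anything
sudo tar czf /tmp/node1-diverged-$(date +%F).tar.gz /var/lib/postgresql/18/data

# 3. Rejoin node1 on the cluster's current timeline
patronictl reinit postgres node1    # or a manual pg_rewind as in section 3.3
```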
Prevention: rely on Patroni's DCS leader lock and watchdog, never promote a node manually while Patroni is running, and route client writes through the leader endpoint rather than to fixed hosts.
6. Split-Brain Prevention and Recovery
6.1. How Patroni prevents split-brain
Mechanism: DCS Leader Lock - only the member holding the leader key in the DCS (etcd) may run as primary; the key carries a TTL, and a primary that cannot renew it demotes itself before the TTL expires, so at most one writable node exists at a time.
Code flow (pseudo):
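A simplified sketch of the HA loop, not Patroni's actual code:
```text
every loop_wait seconds, on every member:
    if I hold the leader key in the DCS:
        renew the key (TTL = ttl)
        if renewal keeps failing:
            demote myself before the TTL expires
    else:
        if the leader key has expired:
            try to create it (atomic compare-and-set in the DCS)
            if I won the race and I am healthy and not too far behind:
                promote myself (timeline increments)
        else:
            keep following the current leader as a replica
```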
6.2. Fencing mechanisms
PostgreSQL-level fencing: a primary that loses the leader lock is demoted by Patroni (restarted as a read-only standby); callbacks such as `on_role_change` can additionally update load balancers or connection routing so clients stop writing to it.
OS-level fencing (advanced): a hardware or software watchdog resets the whole node if the Patroni process hangs and can no longer demote PostgreSQL in time.
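A sketch of the related patroni.yml settings; the `on_role_change` script path is hypothetical, while the `watchdog` block is a real Patroni feature that lets the kernel reset a node whose Patroni process stops refreshing /dev/watchdog:
```yaml
postgresql:
  callbacks:
    on_role_change: /usr/local/bin/update_routing.sh  # hypothetical routing/fencing hook
watchdog:
  mode: automatic       # "required" refuses to run as leader without a working watchdog
  device: /dev/watchdog
  safety_margin: 5
```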
6.3. Scenario: Recover from split-brain
Detection:
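One hint and one definitive check, assuming superuser access to each node:
```bash
# Two members claiming the leader role is the red flag
patronictl list postgres

# Exactly one node should answer false here
psql -h node1 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h node2 -U postgres -c "SELECT pg_is_in_recovery();"
psql -h node3 -U postgres -c "SELECT pg_is_in_recovery();"
```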
Recovery steps:
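A hedged outline; the member name is a placeholder:
```bash
# 1. Keep the DCS-recognized leader; fence the rogue primary (on that node)
sudo systemctl stop patroni

# 2. Back up the rogue node's diverged data before rejoining it
sudo tar czf /tmp/rogue-$(date +%F).tar.gz /var/lib/postgresql/18/data

# 3. Rejoin it as a replica
patronictl reinit postgres <rogue_member>

# 4. Reconcile any writes that now exist only in the backup
```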
7. Monitoring Node Recovery
7.1. Key metrics
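Replication lag and timeline agreement are the signals worth watching; on the primary, a lag query might look like this:
```bash
sudo -u postgres psql -x -c "
  SELECT application_name, state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
         replay_lag
  FROM pg_stat_replication;"
```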
7.2. Patroni REST API monitoring
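Port 8008 is Patroni's default REST API port; the role endpoints return HTTP 200 only when the node currently has that role, which makes them convenient health checks:
```bash
curl -s -o /dev/null -w "%{http_code}\n" http://node1:8008/leader
curl -s -o /dev/null -w "%{http_code}\n" http://node1:8008/replica

# Full node status as JSON (role, state, timeline, ...)
curl -s http://node1:8008/patroni
```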
7.3. Alerting on recovery issues
8. Best Practices
✅ DO
- Enable wal_log_hints - Required for pg_rewind
- Test recovery regularly - Monthly drills
- Monitor timelines - Alert on divergence
- Have backups - Before risky operations
- Document procedures - Recovery runbooks
- Use Patroni auto-recovery - Less manual intervention
- Verify after recovery - Test replication, queries
- Keep DCS healthy - etcd cluster critical
- Log everything - Audit trail for incidents
- Practice split-brain recovery - Hopefully never needed, but be ready
❌ DON'T
- Don't skip wal_log_hints - pg_rewind will fail
- Don't assume auto-recovery works - Test it!
- Don't ignore timeline mismatches - Critical issue
- Don't manually promote during recovery - Let Patroni handle it
- Don't delete data without backup - Diverged data may be important
- Don't run split-brain clusters - Fix immediately
- Don't forget callbacks - Fencing prevents split-brain
- Don't over-automate reinit - Risk data loss
9. Lab Exercises
Lab 1: Auto-rejoin after clean shutdown
Tasks:
- Stop one replica: `sudo systemctl stop patroni`
- Make changes on the primary
- Start the replica: `sudo systemctl start patroni`
- Verify auto-rejoin and lag catch-up
- Time the recovery
Lab 2: pg_rewind after simulated failover
Tasks:
- Record current primary
- Manually stop the primary: `sudo systemctl stop patroni`
- Wait for failover to complete
- Start old primary (should auto-rewind)
- Verify old primary rejoined as replica
- Check timeline increment
Lab 3: Full rebuild with pg_basebackup
Tasks:
- Stop a replica
- Delete the data directory: `sudo rm -rf /var/lib/postgresql/18/data/*`
- Manually run pg_basebackup from the primary
- Start replica
- Verify replication restored
- Measure rebuild time
Lab 4: Patroni reinit command
Tasks:
- Use `patronictl reinit postgres node3`
- Monitor logs during the process
- Verify automated rebuild
- Compare time vs manual pg_basebackup
Lab 5: Timeline divergence simulation
Tasks:
- Create network partition (iptables)
- Wait for failover
- Manually promote old primary (force split-brain)
- Write different data to both "primaries"
- Restore network
- Observe conflict detection
- Practice recovery procedure
10. Troubleshooting
Issue: pg_rewind fails
Error: pg_rewind: fatal: could not find common ancestor
Cause: wal_log_hints not enabled or data too diverged.
Solution:
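A hedged recovery path: put the prerequisite in place for next time, then rebuild the node that cannot be rewound now (cluster name as elsewhere in this lesson):
```bash
# 1. Enable wal_log_hints cluster-wide for future rewinds (needs a restart)
patronictl edit-config postgres --pg wal_log_hints=on
patronictl restart postgres

# 2. Rebuild the failed node from the leader
patronictl reinit postgres <member_name>
```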
Issue: Replica stuck in recovery
Symptoms: Replica shows "running" but high lag.
Diagnosis:
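A few checks on the stuck replica (paths as elsewhere in this lesson):
```bash
# Is the WAL receiver alive and recently active?
sudo -u postgres psql -c "SELECT status, sender_host, last_msg_receipt_time FROM pg_stat_wal_receiver;"

# How far behind is replay?
sudo -u postgres psql -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# Disk space and recent Patroni/PostgreSQL errors
df -h /var/lib/postgresql
sudo journalctl -u patroni --since "10 minutes ago"
```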
Common causes:
- WAL receiver crashed
- Network issues
- Disk full on replica
- Archive restore errors
Solution: address the underlying cause (restart Patroni on the replica to restart the WAL receiver, fix network problems, free disk space), then confirm the lag is shrinking. If the replica can no longer catch up because the required WAL has been removed, rebuild it with `patronictl reinit` or pg_basebackup.
Issue: Cannot connect after recovery
Error: FATAL: the database system is starting up
Cause: PostgreSQL still replaying WAL.
Solution: Wait for recovery to complete, or check logs for errors.
11. Summary
Recovery Methods Summary
| Method | Speed | Data Loss | Use Case |
|---|---|---|---|
| Auto-rejoin | Fastest | None | Clean shutdown/restart |
| pg_rewind | Fast | None | Timeline divergence |
| pg_basebackup | Slow | None | Corruption, major divergence |
| Manual recovery | Varies | Possible | Split-brain, complex issues |
Key Concepts
✅ Auto-rejoin - Patroni handles clean recovery automatically
✅ pg_rewind - Resync after timeline divergence (requires wal_log_hints)
✅ pg_basebackup - Full rebuild from primary (slow but safe)
✅ Timeline - History branch, increments on failover
✅ Split-brain - Multiple primaries (prevented by DCS leader lock)
Recovery Checklist
- Node failure detected
- Determine recovery method needed
- Backup diverged data (if any)
- Execute recovery (auto or manual)
- Verify timeline matches cluster
- Verify replication streaming
- Test read/write operations
- Check replication lag
- Update monitoring/documentation
Next Steps
Lesson 16 will cover Backup and Point-in-Time Recovery:
- pg_basebackup strategies
- WAL archiving configuration
- Point-in-Time Recovery (PITR) procedures
- Backup automation and scheduling
- Disaster recovery planning