Lesson 14: Planned Switchover

Learning Objectives

After this lesson, you will:

  • Distinguish between switchover and failover
  • Perform planned switchover safely
  • Understand graceful vs immediate switchover
  • Minimize downtime during maintenance
  • Automate switchover for rolling updates
  • Handle switchover in production

1. Switchover Overview

1.1. What is a switchover?

Switchover = a planned, controlled promotion of a replica to primary.

Comparison with failover:

Aspect     | Failover                    | Switchover
-----------|-----------------------------|----------------------------
Trigger    | Primary failure (unplanned) | Manual/scheduled (planned)
Downtime   | 30-60 seconds               | 0-10 seconds
Data loss  | Possible (if async)         | Zero (controlled)
Control    | Automatic                   | Manual/scripted
Timing     | Unpredictable               | Scheduled

1.2. When do you need a switchover?

Common scenarios:

A. Hardware maintenance

TEXT
Scenario: Need to replace failing disk on primary server
  → Switchover to replica
  → Perform maintenance on old primary
  → Keep as replica or switchover back

B. Software upgrades

TEXT
Scenario: OS kernel update requires reboot
  → Switchover to replica
  → Update & reboot old primary
  → Verify, then switchover back (optional)

C. Database migration

TEXT
Scenario: Move database to larger server
  → Add new server as replica
  → Switchover to new server
  → Remove old server

D. Datacenter migration

TEXT
Scenario: Move from DC1 to DC2
  → Setup replicas in DC2
  → Switchover primary to DC2
  → Decommission DC1 nodes

E. Testing

TEXT
Scenario: Test HA readiness before production
  → Perform switchover in staging
  → Validate application behavior
  → Measure downtime

1.3. Switchover Benefits

✅ Zero data loss - All transactions committed before switch

✅ Controlled timing - During maintenance window

✅ Lower risk - Coordinated, tested process

✅ Minimal downtime - 0-10 seconds vs 30-60 for failover

✅ Reversible - Can switchover back if issues

2. Types of Switchover

2.1. Graceful Switchover (Default)

Process:

TEXT
1. Verify cluster healthy
2. Wait for replication lag = 0
3. Stop new connections to old primary
4. Wait for active transactions to complete
5. Promote new primary
6. Reconfigure old primary as replica

Downtime: ~5-10 seconds ✅
Data loss: None ✅

Command:

TEXT
patronictl switchover postgres

2.2. Immediate Switchover

Process:

TEXT
1. Immediately promote replica
2. Kill active connections on old primary
3. Demote old primary (force if needed)

Downtime: ~2-5 seconds ✅
Data loss: Possible if transactions in-flight ⚠️

Command:

TEXT
patronictl switchover postgres --force

# Note: --force only skips patronictl's confirmation prompts; Patroni itself
# still verifies the candidate is healthy and caught up before promoting.

2.3. Scheduled Switchover

Process:

TEXT
1. Schedule switchover at specific time
2. Patroni waits until scheduled time
3. Performs graceful switchover automatically

Downtime: ~5-10 seconds ✅
Automation: Full ✅

Command:

TEXT
patronictl switchover postgres --scheduled 2024-11-25T02:00:00

3. Switchover Prerequisites

3.1. Cluster health check

TEXT
# 1. Verify all nodes running
patronictl list postgres

# Expected:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host          | Role    | State   | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11:5432| Leader  | running |  2 |           |
# | node2  | 10.0.1.12:5432| Replica | running |  2 |         0 | ✅
# | node3  | 10.0.1.13:5432| Replica | running |  2 |         0 | ✅
# +--------+---------------+---------+---------+----+-----------+

# All nodes must be:
# - State: running ✅
# - Lag: 0 or very low ✅
# - Same timeline ✅

3.2. Replication lag check

TEXT
# Check lag on all replicas
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication
ORDER BY lag_bytes DESC;
"

# Desired:
# application_name | client_addr | state     | lag_bytes | replay_lag
# -----------------+-------------+-----------+-----------+------------
# node2            | 10.0.1.12   | streaming |         0 | 00:00:00   ✅
# node3            | 10.0.1.13   | streaming |         0 | 00:00:00   ✅
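
If lag is not yet zero, you can simply block until the replicas catch up before initiating the switchover. A minimal sketch, assuming the same primary host (10.0.1.11) used above:

TEXT
# Poll until every replica has replayed all WAL (max lag = 0), then proceed
while true; do
  lag=$(sudo -u postgres psql -h 10.0.1.11 -t -A -c "
    SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
    FROM pg_stat_replication;")
  if [ "$lag" -eq 0 ]; then
    echo "All replicas caught up (lag = 0)"
    break
  fi
  echo "Max lag: $lag bytes, waiting..."
  sleep 2
done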

3.3. Target candidate check

TEXT
# Tags are node-local settings, so check the candidate's own Patroni config
# (on the target node, commonly /etc/patroni/patroni.yml):
grep -A5 "^tags:" /etc/patroni/patroni.yml

# Target node should have:
tags:
  nofailover: false        # ✅ Can be promoted
  failover_priority: 100   # Patroni 3.0+: higher = preferred, 0 = never promote

# NOT:
tags:
  nofailover: true   # ❌ Cannot be promoted

3.4. Connection availability

TEXT
# Test connection to target
psql -h 10.0.1.12 -U postgres -c "SELECT 1;"

# Test application user
psql -h 10.0.1.12 -U app_user -d myapp -c "SELECT 1;"

4. Performing Switchover

4.1. Interactive Switchover

Step-by-step:

TEXT
# 1. Initiate switchover
patronictl switchover postgres

Patroni prompts:

TEXT
Master [node1]:  ← Current primary (press Enter to accept)
Candidate ['node2', 'node3'] []:  ← Type target, e.g., "node2"
When should the switchover take place (e.g. 2024-11-25T10:00 )  [now]:  ← Press Enter for immediate
Are you sure you want to switchover cluster postgres, demoting current master node1? [y/N]: y

Output:

TEXT
2024-11-25 10:30:00.123 UTC [INFO]: Switching over from node1 to node2
2024-11-25 10:30:02.456 UTC [INFO]: Waiting for replica node2 to catch up...
2024-11-25 10:30:02.789 UTC [INFO]: Replica node2 lag: 0 bytes ✅
2024-11-25 10:30:03.012 UTC [INFO]: Promoting node2...
2024-11-25 10:30:05.234 UTC [INFO]: node2 promoted successfully
2024-11-25 10:30:06.567 UTC [INFO]: Demoting node1...
2024-11-25 10:30:08.890 UTC [INFO]: node1 reconfigured as replica
2024-11-25 10:30:10.123 UTC [INFO]: Switchover completed ✅

Total time: 10 seconds
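
To measure the downtime yourself, run a write-availability probe in a second terminal while the switchover executes. A minimal sketch; DB_ENDPOINT is a placeholder for whatever your clients actually connect to (ideally a VIP or HAProxy address, see section 7.2), and it assumes the testdb database used in the lab environment:

TEXT
# Probe write availability every 0.5s; the run of NOT WRITABLE lines is your downtime
DB_ENDPOINT=10.0.1.11   # placeholder: point this at your VIP / proxy

while true; do
  ts=$(date '+%H:%M:%S.%3N')
  state=$(sudo -u postgres psql -h $DB_ENDPOINT -d testdb -t -A \
            -c "SELECT pg_is_in_recovery();" 2>/dev/null)
  if [ "$state" = "f" ]; then
    echo "$ts WRITABLE"
  else
    echo "$ts NOT WRITABLE"
  fi
  sleep 0.5
done

Note that a demoted node keeps answering reads, which is why the probe checks pg_is_in_recovery() rather than a plain SELECT 1.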

4.2. Non-interactive Switchover

Direct command:

TEXT
# Specify master and candidate explicitly
patronictl switchover postgres \
  --master node1 \
  --candidate node2 \
  --force

# --force: Skip confirmation prompt

4.3. Scheduled Switchover

Schedule for maintenance window:

TEXT
# Schedule switchover at 2 AM
patronictl switchover postgres \
  --master node1 \
  --candidate node2 \
  --scheduled "2024-11-25T02:00:00"

# Patroni will automatically execute at scheduled time

Verify scheduled switchover:

TEXT
# Check pending actions
curl -s http://10.0.1.11:8008/cluster | jq '.scheduled_switchover'

# Output:
# {
#   "at": "2024-11-25T02:00:00+00:00",
#   "from": "node1",
#   "to": "node2"
# }

Cancel scheduled switchover:

TEXT
# If plans change
patronictl flush postgres switchover

4.4. Switchover with REST API

Trigger via API:

TEXT
# POST to current leader
curl -X POST http://10.0.1.11:8008/switchover \
  -H "Content-Type: application/json" \
  -d '{
    "leader": "node1",
    "candidate": "node2"
  }'

# Patroni replies with HTTP 200 and a plain-text message, for example:
# Successfully switched over to "node2"
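
You can confirm the result through the REST API as well. A minimal check using the GET /cluster endpoint (any member can answer it); the jq filter assumes the member objects expose name and role fields, as recent Patroni versions do:

TEXT
# Ask any member for the cluster view and extract the current leader
curl -s http://10.0.1.11:8008/cluster | jq -r '.members[] | select(.role == "leader") | .name'

# Expected after the switchover completes:
# node2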

5. Switchover Timeline

5.1. Detailed flow

TEXT
T+0s: INITIATE SWITCHOVER
  Command: patronictl switchover postgres --master node1 --candidate node2

T+0.5s: PRE-CHECKS
  ✓ node1 is current leader
  ✓ node2 is healthy replica
  ✓ node2 replication lag: 0 bytes
  ✓ node2 timeline matches: 2

T+1s: PREPARE OLD PRIMARY (node1)
  - Checkpoint: CHECKPOINT;
  - Flush WAL so the replicas can receive everything written so far

T+2s: WAIT FOR LAG = 0
  - Monitor: pg_stat_replication.replay_lag
  - node2 lag: 0 bytes ✅
  - All WAL replayed by the candidate

T+4s: DEMOTE OLD PRIMARY (node1)
  - Release the leader key in the DCS
  - Shut down PostgreSQL on node1: remaining sessions are disconnected,
    uncommitted transactions roll back (all committed WAL is already on node2)

T+5s: PROMOTE NEW PRIMARY (node2)
  - Acquire leader lock in DCS
  - Execute: SELECT pg_promote();
  - Timeline: 2 → 3
  - Run callbacks: on_role_change, post_promote

T+7s: VERIFY NEW PRIMARY
  - pg_is_in_recovery() → false ✅
  - Accepting connections
  - Timeline = 3

T+8s: RECONFIGURE OLD PRIMARY (node1)
  - Update primary_conninfo → node2:5432
  - Create standby.signal (PostgreSQL 12+)
  - Restart PostgreSQL in recovery mode
  - Timeline: 2 → 3

T+10s: REPLICATION RESTORED
  - node1 now streaming from node2
  - node3 updated to stream from node2
  - All replicas timeline = 3

T+10s: SWITCHOVER COMPLETE ✅
  Primary: node2 (was replica)
  Replica: node1 (was primary)
  Replica: node3

Total downtime: ~5-10 seconds
Data loss: None ✅

5.2. What happens to active connections?

During switchover:

TEXT
Client connections to old primary (node1):

Graceful switchover (default):
  - New connections: REFUSED once the demote begins
  - Active sessions: DISCONNECTED when PostgreSQL shuts down for the demote;
    in-flight (uncommitted) transactions roll back
  - Committed transactions: PRESERVED (the candidate has replayed all WAL
    before it is promoted)

Note: patronictl's --force flag only skips the confirmation prompt; it does
not change how connections are handled during the demote.

Application behavior:

TEXT
# Well-written application with retry logic
import time
import psycopg2

def execute_query():
    retries = 3
    for i in range(retries):
        try:
            conn = psycopg2.connect("host=10.0.1.11 ...")
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM users;")
            return cursor.fetchall()
        except psycopg2.OperationalError as e:
            if i < retries - 1:
                time.sleep(1)  # Wait and retry
                continue
            raise

6. Verification After Switchover

6.1. Cluster status

TEXT
patronictl list postgres

# Expected:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host          | Role    | State   | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11:5432| Replica | running |  3 |         0 | ← Was Leader
# | node2  | 10.0.1.12:5432| Leader  | running |  3 |           | ← Was Replica
# | node3  | 10.0.1.13:5432| Replica | running |  3 |         0 |
# +--------+---------------+---------+---------+----+-----------+

# Check:
# ✅ node2 is now Leader
# ✅ Timeline changed: 2 → 3
# ✅ All nodes running
# ✅ Replication lag = 0

6.2. Replication status

TEXT
# On new primary (node2)
sudo -u postgres psql -h 10.0.1.12 -c "
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;
"

# Expected:
# application_name | client_addr | state     | sync_state
# -----------------+-------------+-----------+------------
# node1            | 10.0.1.11   | streaming | async
# node3            | 10.0.1.13   | streaming | async

# Both replicas should be streaming from node2 ✅

6.3. Write test

TEXT
# Insert on new primary
sudo -u postgres psql -h 10.0.1.12 -d testdb -c "
INSERT INTO test_table (data, created_at) 
VALUES ('After switchover', NOW())
RETURNING *;
"

# Verify on replicas
sudo -u postgres psql -h 10.0.1.11 -d testdb -c "
SELECT * FROM test_table ORDER BY created_at DESC LIMIT 1;
"

sudo -u postgres psql -h 10.0.1.13 -d testdb -c "
SELECT * FROM test_table ORDER BY created_at DESC LIMIT 1;
"

# Should see the new row on both replicas ✅

6.4. Timeline verification

TEXT
# Check timeline on all nodes
for node in 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "=== $node ==="
  sudo -u postgres psql -h $node -c "
    SELECT timeline_id, pg_is_in_recovery() AS is_replica
    FROM pg_control_checkpoint();
  "
done

# All nodes should report timeline_id = 3:
# timeline_id | is_replica
# ------------+------------
#           3 | f   ← on the new primary
#           3 | t   ← on the replicas
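
The checks in 6.1-6.4 can be bundled into a single script (section 9.3 later calls a post-switchover-verify.sh). A minimal sketch under the same assumptions as above (hosts 10.0.1.11-13, expected new leader node2, the testdb/test_table objects from 6.3):

TEXT
#!/bin/bash
# post-switchover-verify.sh - basic health gate after a switchover
set -e

EXPECTED_LEADER="node2"

# 1. The expected node must hold the Leader role
leader=$(patronictl list postgres | grep Leader | awk '{print $2}')
[ "$leader" == "$EXPECTED_LEADER" ] || { echo "❌ Leader is $leader, expected $EXPECTED_LEADER"; exit 1; }
echo "✅ Leader: $leader"

# 2. Both replicas must be streaming from the new primary
streams=$(sudo -u postgres psql -h 10.0.1.12 -t -A -c \
  "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';")
[ "$streams" -ge 2 ] || { echo "❌ Only $streams replicas streaming"; exit 1; }
echo "✅ $streams replicas streaming"

# 3. The new primary must accept writes
sudo -u postgres psql -h 10.0.1.12 -d testdb -c \
  "INSERT INTO test_table (data, created_at) VALUES ('post-switchover check', NOW());" >/dev/null
echo "✅ Write test passed"

echo "=== Post-switchover verification passed ==="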

7. Switchover Best Practices

7.1. Pre-switchover checklist

TEXT
#!/bin/bash
# pre-switchover-check.sh

echo "=== Pre-Switchover Checks ==="

# 1. Cluster health
echo "1. Checking cluster health..."
# fail if any member row is not in a running/streaming state
if patronictl list postgres | grep -E "node[0-9]" | grep -vqE "running|streaming"; then
  echo "❌ Not all nodes running"; exit 1
fi
echo "✅ All nodes running"

# 2. Replication lag
echo "2. Checking replication lag..."
lag=$(sudo -u postgres psql -h 10.0.1.11 -t -A -c "
  SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
  FROM pg_stat_replication;
")
if [ "$lag" -gt 1048576 ]; then  # 1MB
  echo "❌ Lag too high: $lag bytes"
  exit 1
fi
echo "✅ Lag acceptable: $lag bytes"

# 3. Target candidate available
echo "3. Checking target candidate..."
patronictl list postgres | grep node2 | grep -q "running" || { echo "❌ node2 not available"; exit 1; }
echo "✅ Target candidate available"

# 4. No scheduled maintenance
echo "4. Checking scheduled actions..."
curl -s http://10.0.1.11:8008/cluster | jq -e '.scheduled_switchover == null' > /dev/null || {
  echo "⚠️  Another switchover already scheduled"
}

echo ""
echo "✅ All pre-checks passed. Safe to proceed."

7.2. Minimize downtime strategies

A. Connection pooler

TEXT
Use PgBouncer/HAProxy between app and database:

App → PgBouncer → Primary
              ↓
            Replicas

During switchover:
1. PgBouncer detects primary change
2. Reconnects to new primary automatically
3. Application sees minimal disruption
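
The mechanism behind this is Patroni's REST health checks: GET /primary returns HTTP 200 only on the current leader (503 elsewhere), and GET /replica does the opposite, so a proxy health check pointed at these endpoints follows the leader automatically. A quick way to see it from the shell, using the hosts from the earlier examples:

TEXT
# 200 = this node is the primary, 503 = it is not
for node in 10.0.1.11 10.0.1.12 10.0.1.13; do
  code=$(curl -s -o /dev/null -w "%{http_code}" http://$node:8008/primary)
  echo "$node /primary -> HTTP $code"
done

# Re-run after the switchover: the 200 moves to the new primary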

B. Read-replica routing

TEXT
Route read queries to replicas during switchover:

- Write queries: Wait for new primary
- Read queries: Continue on replicas (may be slightly stale)

Result: Partial availability during switchover

C. Application-level retry

TEXT
# Implement exponential backoff (execute_query is your own query helper)
import time
from psycopg2 import OperationalError

def execute_with_retry(query, max_retries=3):
    for i in range(max_retries):
        try:
            return execute_query(query)
        except OperationalError:
            if i == max_retries - 1:
                raise
            time.sleep(2 ** i)  # 1s, 2s, 4s

7.3. Communication plan

Before switchover:

TEXT
T-24h: Announce maintenance window
  - Email: ops@, dev@, stakeholders
  - Slack: #incidents, #ops
  - Status page: Update with scheduled maintenance

T-1h: Reminder notification
  - Final checks
  - Confirm go/no-go

T-5min: Begin maintenance
  - Start switchover
  - Monitor dashboards

During switchover:

TEXT
- Real-time updates in ops channel
- Monitor metrics (latency, error rate)
- Have rollback plan ready

After switchover:

TEXT
- Verify all systems operational
- Post-switchover validation
- Update documentation
- Send completion notification

8. Troubleshooting Switchover

8.1. Issue: Switchover command hangs

Symptoms: patronictl switchover never completes.

Diagnosis:

TEXT
# Check what Patroni is waiting for
sudo journalctl -u patroni -f

# Common causes:

# A. High replication lag
sudo -u postgres psql -h 10.0.1.11 -c "
  SELECT application_name, 
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
  FROM pg_stat_replication;
"
# If lag > 0, Patroni waits for lag = 0

# B. Active long-running queries
sudo -u postgres psql -h 10.0.1.11 -c "
  SELECT pid, usename, state, query_start, query
  FROM pg_stat_activity
  WHERE state = 'active' AND query_start < now() - interval '5 minutes';
"
# Kill blocking queries:
# SELECT pg_terminate_backend(pid);

Solution:

TEXT
# Option 1: Wait for lag to catch up (recommended)
# Option 2: Use --force to skip wait (risk data loss)
# Option 3: Cancel and reschedule
Ctrl+C  # Cancel current switchover attempt

8.2. Issue: Candidate not eligible

Symptoms: Error "candidate is not eligible".

Diagnosis:

TEXT
# Check the nofailover tag in the candidate's local Patroni config
# (on node2, commonly /etc/patroni/patroni.yml):
grep -A5 "^tags:" /etc/patroni/patroni.yml

# If the output shows:
# tags:
#   nofailover: true  ← Problem!

Solution:

TEXT
# Edit the tags section of node2's local Patroni config
sudo vi /etc/patroni/patroni.yml

# Change:
tags:
  nofailover: false  # was true

# Then restart Patroni on node2 to apply
sudo systemctl restart patroni

8.3. Issue: Old primary won't demote

Symptoms: Switchover fails, old primary still leader.

Diagnosis:

TEXT
# Check Patroni logs on old primary
sudo journalctl -u patroni -n 100 | grep -i "demote\|error"

# Possible causes:
# - PostgreSQL won't stop
# - Active transactions won't terminate
# - File permission issues

Solution:

TEXT
# Restart PostgreSQL on node1 via the REST API so Patroni re-evaluates its role
curl -X POST http://10.0.1.11:8008/restart

# Or manually:
sudo -u postgres psql -h 10.0.1.11 -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE pid != pg_backend_pid();
"

sudo systemctl restart patroni

8.4. Issue: Replication broken after switchover

Symptoms: Old primary not replicating from new primary.

Diagnosis:

TEXT
# Check replication status
patronictl list postgres

# If node1 is not in a "running"/"streaming" state

# Check logs
sudo journalctl -u patroni -u postgresql -n 100

Solution:

TEXT
# A. Restart Patroni (usually auto-fixes)
sudo systemctl restart patroni

# B. Manual reinit if needed
patronictl reinit postgres node1

# Patroni will:
# 1. Stop PostgreSQL on node1
# 2. Remove data directory
# 3. pg_basebackup from node2
# 4. Start as replica

9. Switchover Automation

9.1. Scripted switchover

TEXT
#!/bin/bash
# automated-switchover.sh

set -e

CLUSTER="postgres"
OLD_PRIMARY="node1"
NEW_PRIMARY="node2"

echo "=== Starting Automated Switchover ==="
echo "From: $OLD_PRIMARY → To: $NEW_PRIMARY"

# Pre-checks
echo "Running pre-checks..."
./pre-switchover-check.sh || exit 1

# Perform switchover
echo "Executing switchover..."
patronictl switchover $CLUSTER \
  --master $OLD_PRIMARY \
  --candidate $NEW_PRIMARY \
  --force

# Wait for completion
echo "Waiting for switchover to complete..."
sleep 15

# Post-checks
echo "Running post-checks..."
new_leader=$(patronictl list $CLUSTER | grep Leader | awk '{print $2}')
if [ "$new_leader" == "$NEW_PRIMARY" ]; then
  echo "✅ Switchover successful!"
  echo "New leader: $new_leader"
else
  echo "❌ Switchover failed!"
  echo "Current leader: $new_leader"
  exit 1
fi

# Verify replication
echo "Verifying replication..."
patronictl list $CLUSTER

echo "=== Switchover Complete ==="

9.2. Ansible playbook

TEXT
# switchover.yml
---
- name: Perform Patroni switchover
  hosts: localhost
  gather_facts: no
  vars:
    cluster_name: postgres
    old_primary: node1
    new_primary: node2
  
  tasks:
    - name: Pre-check cluster health
      command: patronictl list {{ cluster_name }}
      register: cluster_status
      changed_when: false
    
    - name: Verify all nodes running
      assert:
        that:
          - "'running' in cluster_status.stdout"
        fail_msg: "Not all nodes are running"
    
    - name: Execute switchover
      command: >
        patronictl switchover {{ cluster_name }}
        --master {{ old_primary }}
        --candidate {{ new_primary }}
        --force
      register: switchover_result
    
    - name: Wait for switchover completion
      pause:
        seconds: 15
    
    - name: Verify new leader
      command: patronictl list {{ cluster_name }}
      register: final_status
      changed_when: false
    
    - name: Display result
      debug:
        msg: "{{ final_status.stdout_lines }}"
    
    - name: Verify leadership
      assert:
        that:
          - "'{{ new_primary }}' in final_status.stdout"
          - "'Leader' in final_status.stdout"
        fail_msg: "Switchover failed"
        success_msg: "Switchover successful"

Run:

TEXT
ansible-playbook switchover.yml

9.3. CI/CD integration

TEXT
# .github/workflows/db-maintenance.yml
name: Database Maintenance Switchover

on:
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2 AM
  workflow_dispatch:  # Manual trigger

jobs:
  switchover:
    runs-on: self-hosted
    steps:
      - name: Notify start
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "Starting scheduled database switchover"}'
      
      - name: Pre-checks
        run: ./scripts/pre-switchover-check.sh
      
      - name: Execute switchover
        run: |
          patronictl switchover postgres \
            --master node1 \
            --candidate node2 \
            --force
      
      - name: Verify
        run: ./scripts/post-switchover-verify.sh
      
      - name: Notify completion
        if: always()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "Switchover completed: ${{ job.status }}"}'

10. Rolling Updates with Switchover

10.1. Update strategy

Scenario: Apply a PostgreSQL minor-version update (e.g., 17.0 → 17.2). This rolling pattern only works within a major version, because streaming replicas must run the same major version as the primary; a major-version upgrade (e.g., 17 → 18) needs pg_upgrade or logical replication instead.

Steps:

TEXT
1. Update replica node3 (least critical)
   - Stop Patroni
   - Upgrade PostgreSQL
   - Start Patroni
   - Verify replication

2. Update replica node2
   - Stop Patroni
   - Upgrade PostgreSQL
   - Start Patroni
   - Verify replication

3. Switchover to node2 (now updated)
   - patronictl switchover --master node1 --candidate node2

4. Update old primary node1
   - Stop Patroni
   - Upgrade PostgreSQL
   - Start Patroni (now replica)
   - Verify replication

5. Optionally switchover back to node1
   - patronictl switchover --master node2 --candidate node1

Result: Zero-downtime upgrade ✅
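
Steps 1-2 above repeat the same per-replica procedure. A minimal sketch for one replica, assuming a RHEL-family host with PGDG packages (adjust the package manager and package name to your platform):

TEXT
# Run on the replica being updated (e.g., node3)
sudo systemctl stop patroni          # Patroni stops PostgreSQL with it
sudo yum update -y 'postgresql17*'   # minor-version package update (assumed package name)
sudo systemctl start patroni

# Wait until the node reports a healthy state before moving to the next one
until patronictl list postgres | grep node3 | grep -qE "running|streaming"; do
  sleep 5
done
echo "✅ node3 back in the cluster"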

10.2. Kernel update example

TEXT
#!/bin/bash
# rolling-kernel-update.sh

NODES=("node1" "node2" "node3")
PRIMARY=$(patronictl list postgres | grep Leader | awk '{print $2}')

echo "Current primary: $PRIMARY"

# Update replicas first
for node in "${NODES[@]}"; do
  if [ "$node" == "$PRIMARY" ]; then
    continue  # Skip primary for now
  fi
  
  echo "=== Updating $node ==="
  ssh $node 'sudo yum update -y kernel && sudo reboot'
  
  echo "Waiting for $node to come back..."
  sleep 60
  
  # Wait for node to rejoin
  until patronictl list postgres | grep $node | grep -q "running"; do
    echo "Waiting for $node..."
    sleep 10
  done
  
  echo "✅ $node updated and rejoined"
done

# Now switchover from primary
NEW_PRIMARY=${NODES[1]}  # Pick a replica
if [ "$NEW_PRIMARY" == "$PRIMARY" ]; then
  NEW_PRIMARY=${NODES[2]}
fi

echo "=== Switching over from $PRIMARY to $NEW_PRIMARY ==="
patronictl switchover postgres \
  --master $PRIMARY \
  --candidate $NEW_PRIMARY \
  --force

sleep 15

# Update old primary
echo "=== Updating $PRIMARY ==="
ssh $PRIMARY 'sudo yum update -y kernel && sudo reboot'

echo "Waiting for $PRIMARY to rejoin as replica..."
sleep 60

until patronictl list postgres | grep $PRIMARY | grep -q "running"; do
  echo "Waiting for $PRIMARY..."
  sleep 10
done

echo "✅ All nodes updated!"
patronictl list postgres

11. Lab Exercises

Lab 1: Basic switchover

Tasks:

  1. Check current primary: patronictl list
  2. Perform switchover: patronictl switchover postgres
  3. Measure downtime with continuous query loop
  4. Verify new topology
  5. Document observations

Lab 2: Scheduled switchover

Tasks:

  1. Schedule switchover for 2 minutes from now
  2. Monitor logs during wait period
  3. Observe automatic execution
  4. Cancel a scheduled switchover (repeat and test cancel)

Lab 3: Forced vs graceful

Tasks:

  1. Create long-running query: SELECT pg_sleep(300);
  2. Attempt graceful switchover (observe wait)
  3. Cancel and retry with --force
  4. Compare behavior and downtime

Lab 4: Rolling update simulation

Tasks:

  1. Start with 3-node cluster
  2. "Update" node3 (simulate by restarting)
  3. "Update" node2
  4. Switchover to node2
  5. "Update" node1
  6. Verify all nodes operational

Lab 5: Switchover under load

Tasks:

  1. Start pgbench: pgbench -c 10 -T 300
  2. During load, perform switchover
  3. Analyze pgbench output for errors (see the sketch below)
  4. Calculate success rate
  5. Test with connection pooler (PgBouncer)
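
A minimal sketch for tasks 1-3, assuming testdb has already been initialized with pgbench -i and that clients connect via 10.0.1.11:

TEXT
# Terminal 1: generate load for 5 minutes and keep the output
pgbench -h 10.0.1.11 -U postgres -c 10 -T 300 testdb 2>&1 | tee pgbench.log

# Terminal 2: perform the switchover while the load is running
patronictl switchover postgres --master node1 --candidate node2 --force

# Afterwards: count connection errors and read the summary pgbench prints
grep -ciE "error|failed" pgbench.log
tail -n 20 pgbench.log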

12. Summary

Key Concepts

✅ Switchover = Planned, controlled role change

✅ Graceful = Wait for transactions (slower, safer)

✅ Immediate = Force termination (faster, riskier)

✅ Scheduled = Automated at specific time

✅ Zero downtime = Achievable with proper architecture

Switchover vs Failover

Aspect     | Switchover | Failover
-----------|------------|----------
Planning   | Scheduled  | Unplanned
Control    | Manual     | Automatic
Downtime   | 0-10s      | 30-60s
Data loss  | None       | Possible
Reversible | Yes        | No

Best Practices

  • ✅ Test in staging first
  • ✅ Schedule during low-traffic windows
  • ✅ Use graceful mode (default)
  • ✅ Verify lag = 0 before switchover
  • ✅ Monitor during process
  • ✅ Have rollback plan
  • ✅ Communicate with stakeholders
  • ✅ Document procedure

Next Steps

Lesson 15 will cover Recovering Failed Nodes:

  • Rejoin old primary after failover
  • pg_rewind usage and scenarios
  • Full rebuild with pg_basebackup
  • Timeline divergence resolution
  • Split-brain recovery
