Lesson 14: Planned Switchover

Learning Objectives

After this lesson, you will:

  • Distinguish between switchover and failover
  • Perform planned switchover safely
  • Understand graceful vs immediate switchover
  • Minimize downtime during maintenance
  • Automate switchover for rolling updates
  • Handle switchover in production

1. Switchover Overview

1.1. What is a switchover?

Switchover = a planned, controlled promotion of a replica to primary.

Comparison with failover:

Aspect     | Failover                    | Switchover
-----------|-----------------------------|----------------------------
Trigger    | Primary failure (unplanned) | Manual/scheduled (planned)
Downtime   | 30-60 seconds               | 0-10 seconds
Data loss  | Possible (if async)         | Zero (controlled)
Control    | Automatic                   | Manual/scripted
Timing     | Unpredictable               | Scheduled

1.2. When do you need a switchover?

Common scenarios:

A. Hardware maintenance

TEXT
Scenario: Need to replace failing disk on primary server
  → Switchover to replica
  → Perform maintenance on old primary
  → Keep as replica or switchover back

B. Software upgrades

TEXT
Scenario: OS kernel update requires reboot
  → Switchover to replica
  → Update & reboot old primary
  → Verify, then switchover back (optional)

C. Database migration

TEXT
Scenario: Move database to larger server
  → Add new server as replica
  → Switchover to new server
  → Remove old server

D. Datacenter migration

TEXT
Scenario: Move from DC1 to DC2
  → Setup replicas in DC2
  → Switchover primary to DC2
  → Decommission DC1 nodes

E. Testing

TEXT
Scenario: Test HA readiness before production
  → Perform switchover in staging
  → Validate application behavior
  → Measure downtime

1.3. Switchover Benefits

✅ Zero data loss - All transactions committed before switch

✅ Controlled timing - During maintenance window

✅ Lower risk - Coordinated, tested process

✅ Minimal downtime - 0-10 seconds vs 30-60 for failover

✅ Reversible - Can switchover back if issues

2. Types of Switchover

2.1. Graceful Switchover (Default)

Process:

TEXT
1. Verify cluster healthy
2. Wait for replication lag = 0
3. Stop new connections to old primary
4. Wait for active transactions to complete
5. Promote new primary
6. Reconfigure old primary as replica

Downtime: ~5-10 seconds ✅
Data loss: None ✅

Command:

TEXT
patronictl switchover postgres

2.2. Immediate Switchover

Process:

TEXT
1. Immediately promote replica
2. Kill active connections on old primary
3. Demote old primary (force if needed)

Downtime: ~2-5 seconds ✅
Data loss: Possible if transactions in-flight ⚠️

Command:

TEXT
patronictl switchover postgres --force

# Note: --force only skips patronictl's confirmation prompts; Patroni itself
# still verifies the candidate is healthy and caught up before promoting.

2.3. Scheduled Switchover

Process:

TEXT
1. Schedule switchover at specific time
2. Patroni waits until scheduled time
3. Performs graceful switchover automatically

Downtime: ~5-10 seconds ✅
Automation: Full ✅

Command:

TEXT
patronictl switchover postgres --scheduled 2024-11-25T02:00:00

3. Switchover Prerequisites

3.1. Cluster health check

TEXT
# 1. Verify all nodes running
patronictl list postgres

# Expected:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host          | Role    | State   | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11:5432| Leader  | running |  2 |           |
# | node2  | 10.0.1.12:5432| Replica | running |  2 |         0 | ✅
# | node3  | 10.0.1.13:5432| Replica | running |  2 |         0 | ✅
# +--------+---------------+---------+---------+----+-----------+

# All nodes must be:
# - State: running ✅
# - Lag: 0 or very low ✅
# - Same timeline ✅

3.2. Replication lag check

TEXT
# Check lag on all replicas
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication
ORDER BY lag_bytes DESC;
"

# Desired:
# application_name | client_addr | state     | lag_bytes | replay_lag
# -----------------+-------------+-----------+-----------+------------
# node2            | 10.0.1.12   | streaming |         0 | 00:00:00   ✅
# node3            | 10.0.1.13   | streaming |         0 | 00:00:00   ✅
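
If lag is not yet zero, you can simply block until the replicas catch up before initiating the switchover. A minimal sketch, assuming the same primary host (10.0.1.11) used above:

TEXT
# Poll until every replica has replayed all WAL (max lag = 0), then proceed
while true; do
  lag=$(sudo -u postgres psql -h 10.0.1.11 -t -A -c "
    SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
    FROM pg_stat_replication;")
  if [ "$lag" -eq 0 ]; then
    echo "All replicas caught up (lag = 0)"
    break
  fi
  echo "Max lag: $lag bytes, waiting..."
  sleep 2
done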

3.3. Target candidate check

TEXT
# Tags are node-local settings, so check the candidate's own Patroni config
# (on the target node, commonly /etc/patroni/patroni.yml):
grep -A5 "^tags:" /etc/patroni/patroni.yml

# Target node should have:
tags:
  nofailover: false        # ✅ Can be promoted
  failover_priority: 100   # Patroni 3.0+: higher = preferred, 0 = never promote

# NOT:
tags:
  nofailover: true   # ❌ Cannot be promoted

3.4. Connection availability

TEXT
# Test connection to target
psql -h 10.0.1.12 -U postgres -c "SELECT 1;"

# Test application user
psql -h 10.0.1.12 -U app_user -d myapp -c "SELECT 1;"

4. Performing Switchover

4.1. Interactive Switchover

Step-by-step:

TEXT
# 1. Initiate switchover
patronictl switchover postgres

Patroni prompts:

TEXT
Master [node1]:  ← Current primary (press Enter to accept)
Candidate ['node2', 'node3'] []:  ← Type target, e.g., "node2"
When should the switchover take place (e.g. 2024-11-25T10:00 )  [now]:  ← Press Enter for immediate
Are you sure you want to switchover cluster postgres, demoting current master node1? [y/N]: y

Output:

TEXT
2024-11-25 10:30:00.123 UTC [INFO]: Switching over from node1 to node2
2024-11-25 10:30:02.456 UTC [INFO]: Waiting for replica node2 to catch up...
2024-11-25 10:30:02.789 UTC [INFO]: Replica node2 lag: 0 bytes ✅
2024-11-25 10:30:03.012 UTC [INFO]: Promoting node2...
2024-11-25 10:30:05.234 UTC [INFO]: node2 promoted successfully
2024-11-25 10:30:06.567 UTC [INFO]: Demoting node1...
2024-11-25 10:30:08.890 UTC [INFO]: node1 reconfigured as replica
2024-11-25 10:30:10.123 UTC [INFO]: Switchover completed ✅

Total time: 10 seconds
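
To measure the downtime yourself, run a write-availability probe in a second terminal while the switchover executes. A minimal sketch; DB_ENDPOINT is a placeholder for whatever your clients actually connect to (ideally a VIP or HAProxy address, see section 7.2), and it assumes the testdb database used in the lab environment:

TEXT
# Probe write availability every 0.5s; the run of NOT WRITABLE lines is your downtime
DB_ENDPOINT=10.0.1.11   # placeholder: point this at your VIP / proxy

while true; do
  ts=$(date '+%H:%M:%S.%3N')
  state=$(sudo -u postgres psql -h $DB_ENDPOINT -d testdb -t -A \
            -c "SELECT pg_is_in_recovery();" 2>/dev/null)
  if [ "$state" = "f" ]; then
    echo "$ts WRITABLE"
  else
    echo "$ts NOT WRITABLE"
  fi
  sleep 0.5
done

Note that a demoted node keeps answering reads, which is why the probe checks pg_is_in_recovery() rather than a plain SELECT 1.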

4.2. Non-interactive Switchover

Direct command:

TEXT
# Specify master and candidate explicitly
patronictl switchover postgres \
  --master node1 \
  --candidate node2 \
  --force

# --force: Skip confirmation prompt

4.3. Scheduled Switchover

Schedule for maintenance window:

TEXT
# Schedule switchover at 2 AM
patronictl switchover postgres \
  --master node1 \
  --candidate node2 \
  --scheduled "2024-11-25T02:00:00"

# Patroni will automatically execute at scheduled time

Verify scheduled switchover:

TEXT
# Check pending actions
curl -s http://10.0.1.11:8008/cluster | jq '.scheduled_switchover'

# Output:
# {
#   "at": "2024-11-25T02:00:00+00:00",
#   "from": "node1",
#   "to": "node2"
# }

Cancel scheduled switchover:

TEXT
# If plans change
patronictl flush postgres switchover

4.4. Switchover with REST API

Trigger via API:

TEXT
# POST to current leader
curl -X POST http://10.0.1.11:8008/switchover \
  -H "Content-Type: application/json" \
  -d '{
    "leader": "node1",
    "candidate": "node2"
  }'

# Patroni replies with HTTP 200 and a plain-text message, for example:
# Successfully switched over to "node2"
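
You can confirm the result through the REST API as well. A minimal check using the GET /cluster endpoint (any member can answer it); the jq filter assumes the member objects expose name and role fields, as recent Patroni versions do:

TEXT
# Ask any member for the cluster view and extract the current leader
curl -s http://10.0.1.11:8008/cluster | jq -r '.members[] | select(.role == "leader") | .name'

# Expected after the switchover completes:
# node2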

5. Switchover Timeline

5.1. Detailed flow

TEXT
T+0s: INITIATE SWITCHOVER
  Command: patronictl switchover postgres --master node1 --candidate node2

T+0.5s: PRE-CHECKS
  ✓ node1 is current leader
  ✓ node2 is healthy replica
  ✓ node2 replication lag: 0 bytes
  ✓ node2 timeline matches: 2

T+1s: PREPARE OLD PRIMARY (node1)
  - Checkpoint: CHECKPOINT;
  - Flush WAL so the replicas can receive everything written so far

T+2s: WAIT FOR LAG = 0
  - Monitor: pg_stat_replication.replay_lag
  - node2 lag: 0 bytes ✅
  - All WAL replayed by the candidate

T+4s: DEMOTE OLD PRIMARY (node1)
  - Release the leader key in the DCS
  - Shut down PostgreSQL on node1: remaining sessions are disconnected,
    uncommitted transactions roll back (all committed WAL is already on node2)

T+5s: PROMOTE NEW PRIMARY (node2)
  - Acquire leader lock in DCS
  - Execute: SELECT pg_promote();
  - Timeline: 2 → 3
  - Run callbacks: on_role_change, post_promote

T+7s: VERIFY NEW PRIMARY
  - pg_is_in_recovery() → false ✅
  - Accepting connections
  - Timeline = 3

T+8s: RECONFIGURE OLD PRIMARY (node1)
  - Update primary_conninfo → node2:5432
  - Create standby.signal (PostgreSQL 12+)
  - Restart PostgreSQL in recovery mode
  - Timeline: 2 → 3

T+10s: REPLICATION RESTORED
  - node1 now streaming from node2
  - node3 updated to stream from node2
  - All replicas timeline = 3

T+10s: SWITCHOVER COMPLETE ✅
  Primary: node2 (was replica)
  Replica: node1 (was primary)
  Replica: node3

Total downtime: ~5-10 seconds
Data loss: None ✅

5.2. What happens to active connections?

During switchover:

TEXT
Client connections to old primary (node1):

Graceful switchover (default):
  - New connections: REFUSED once the demote begins
  - Active sessions: DISCONNECTED when PostgreSQL shuts down for the demote;
    in-flight (uncommitted) transactions roll back
  - Committed transactions: PRESERVED (the candidate has replayed all WAL
    before it is promoted)

Note: patronictl's --force flag only skips the confirmation prompt; it does
not change how connections are handled during the demote.

Application behavior:

TEXT
# Well-written application with retry logic
import time
import psycopg2

def execute_query():
    retries = 3
    for i in range(retries):
        try:
            conn = psycopg2.connect("host=10.0.1.11 ...")
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM users;")
            return cursor.fetchall()
        except psycopg2.OperationalError as e:
            if i < retries - 1:
                time.sleep(1)  # Wait and retry
                continue
            raise

6. Verification After Switchover

6.1. Cluster status

TEXT
patronictl list postgres

# Expected:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host          | Role    | State   | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11:5432| Replica | running |  3 |         0 | ← Was Leader
# | node2  | 10.0.1.12:5432| Leader  | running |  3 |           | ← Was Replica
# | node3  | 10.0.1.13:5432| Replica | running |  3 |         0 |
# +--------+---------------+---------+---------+----+-----------+

# Check:
# ✅ node2 is now Leader
# ✅ Timeline changed: 2 → 3
# ✅ All nodes running
# ✅ Replication lag = 0

6.2. Replication status

TEXT
# On new primary (node2)
sudo -u postgres psql -h 10.0.1.12 -c "
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;
"

# Expected:
# application_name | client_addr | state     | sync_state
# -----------------+-------------+-----------+------------
# node1            | 10.0.1.11   | streaming | async
# node3            | 10.0.1.13   | streaming | async

# Both replicas should be streaming from node2 ✅

6.3. Write test

TEXT
# Insert on new primary
sudo -u postgres psql -h 10.0.1.12 -d testdb -c "
INSERT INTO test_table (data, created_at) 
VALUES ('After switchover', NOW())
RETURNING *;
"

# Verify on replicas
sudo -u postgres psql -h 10.0.1.11 -d testdb -c "
SELECT * FROM test_table ORDER BY created_at DESC LIMIT 1;
"

sudo -u postgres psql -h 10.0.1.13 -d testdb -c "
SELECT * FROM test_table ORDER BY created_at DESC LIMIT 1;
"

# Should see the new row on both replicas ✅

6.4. Timeline verification

TEXT
# Check timeline on all nodes
for node in 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "=== $node ==="
  sudo -u postgres psql -h $node -c "
    SELECT timeline_id, pg_is_in_recovery() AS is_replica
    FROM pg_control_checkpoint();
  "
done

# All nodes should report timeline_id = 3:
# timeline_id | is_replica
# ------------+------------
#           3 | f   ← on the new primary
#           3 | t   ← on the replicas
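
The checks in 6.1-6.4 can be bundled into a single script (section 9.3 later calls a post-switchover-verify.sh). A minimal sketch under the same assumptions as above (hosts 10.0.1.11-13, expected new leader node2, the testdb/test_table objects from 6.3):

TEXT
#!/bin/bash
# post-switchover-verify.sh - basic health gate after a switchover
set -e

EXPECTED_LEADER="node2"

# 1. The expected node must hold the Leader role
leader=$(patronictl list postgres | grep Leader | awk '{print $2}')
[ "$leader" == "$EXPECTED_LEADER" ] || { echo "❌ Leader is $leader, expected $EXPECTED_LEADER"; exit 1; }
echo "✅ Leader: $leader"

# 2. Both replicas must be streaming from the new primary
streams=$(sudo -u postgres psql -h 10.0.1.12 -t -A -c \
  "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';")
[ "$streams" -ge 2 ] || { echo "❌ Only $streams replicas streaming"; exit 1; }
echo "✅ $streams replicas streaming"

# 3. The new primary must accept writes
sudo -u postgres psql -h 10.0.1.12 -d testdb -c \
  "INSERT INTO test_table (data, created_at) VALUES ('post-switchover check', NOW());" >/dev/null
echo "✅ Write test passed"

echo "=== Post-switchover verification passed ==="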

7. Switchover Best Practices

7.1. Pre-switchover checklist

TEXT
#!/bin/bash
# pre-switchover-check.sh

echo "=== Pre-Switchover Checks ==="

# 1. Cluster health
echo "1. Checking cluster health..."
# fail if any member row is not in a running/streaming state
if patronictl list postgres | grep -E "node[0-9]" | grep -vqE "running|streaming"; then
  echo "❌ Not all nodes running"; exit 1
fi
echo "✅ All nodes running"

# 2. Replication lag
echo "2. Checking replication lag..."
lag=$(sudo -u postgres psql -h 10.0.1.11 -t -A -c "
  SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
  FROM pg_stat_replication;
")
if [ "$lag" -gt 1048576 ]; then  # 1MB
  echo "❌ Lag too high: $lag bytes"
  exit 1
fi
echo "✅ Lag acceptable: $lag bytes"

# 3. Target candidate available
echo "3. Checking target candidate..."
patronictl list postgres | grep node2 | grep -q "running" || { echo "❌ node2 not available"; exit 1; }
echo "✅ Target candidate available"

# 4. No scheduled maintenance
echo "4. Checking scheduled actions..."
curl -s http://10.0.1.11:8008/cluster | jq -e '.scheduled_switchover == null' > /dev/null || {
  echo "⚠️  Another switchover already scheduled"
}

echo ""
echo "✅ All pre-checks passed. Safe to proceed."

7.2. Minimize downtime strategies

A. Connection pooler

TEXT
Use PgBouncer/HAProxy between app and database:

App → PgBouncer → Primary
              ↓
            Replicas

During switchover:
1. PgBouncer detects primary change
2. Reconnects to new primary automatically
3. Application sees minimal disruption
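
The mechanism behind this is Patroni's REST health checks: GET /primary returns HTTP 200 only on the current leader (503 elsewhere), and GET /replica does the opposite, so a proxy health check pointed at these endpoints follows the leader automatically. A quick way to see it from the shell, using the hosts from the earlier examples:

TEXT
# 200 = this node is the primary, 503 = it is not
for node in 10.0.1.11 10.0.1.12 10.0.1.13; do
  code=$(curl -s -o /dev/null -w "%{http_code}" http://$node:8008/primary)
  echo "$node /primary -> HTTP $code"
done

# Re-run after the switchover: the 200 moves to the new primary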

B. Read-replica routing

TEXT
Route read queries to replicas during switchover:

- Write queries: Wait for new primary
- Read queries: Continue on replicas (may be slightly stale)

Result: Partial availability during switchover

C. Application-level retry

TEXT
# Implement exponential backoff (execute_query is your own query helper)
import time
from psycopg2 import OperationalError

def execute_with_retry(query, max_retries=3):
    for i in range(max_retries):
        try:
            return execute_query(query)
        except OperationalError:
            if i == max_retries - 1:
                raise
            time.sleep(2 ** i)  # 1s, 2s, 4s

7.3. Communication plan

Before switchover:

TEXT
T-24h: Announce maintenance window
  - Email: ops@, dev@, stakeholders
  - Slack: #incidents, #ops
  - Status page: Update with scheduled maintenance

T-1h: Reminder notification
  - Final checks
  - Confirm go/no-go

T-5min: Begin maintenance
  - Start switchover
  - Monitor dashboards

During switchover:

TEXT
- Real-time updates in ops channel
- Monitor metrics (latency, error rate)
- Have rollback plan ready

After switchover:

TEXT
- Verify all systems operational
- Post-switchover validation
- Update documentation
- Send completion notification

8. Troubleshooting Switchover

8.1. Issue: Switchover command hangs

Symptoms: patronictl switchover never completes.

Diagnosis:

TEXT
# Check what Patroni is waiting for
sudo journalctl -u patroni -f

# Common causes:

# A. High replication lag
sudo -u postgres psql -h 10.0.1.11 -c "
  SELECT application_name, 
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
  FROM pg_stat_replication;
"
# If lag > 0, Patroni waits for lag = 0

# B. Active long-running queries
sudo -u postgres psql -h 10.0.1.11 -c "
  SELECT pid, usename, state, query_start, query
  FROM pg_stat_activity
  WHERE state = 'active' AND query_start < now() - interval '5 minutes';
"
# Kill blocking queries:
# SELECT pg_terminate_backend(pid);

Solution:

TEXT
# Option 1: Wait for lag to catch up (recommended)
# Option 2: Use --force to skip wait (risk data loss)
# Option 3: Cancel and reschedule
Ctrl+C  # Cancel current switchover attempt

8.2. Issue: Candidate not eligible

Symptoms: Error "candidate is not eligible".

Diagnosis:

TEXT
# Check the nofailover tag in the candidate's local Patroni config
# (on node2, commonly /etc/patroni/patroni.yml):
grep -A5 "^tags:" /etc/patroni/patroni.yml

# If the output shows:
# tags:
#   nofailover: true  ← Problem!

Solution:

TEXT
# Edit the tags section of node2's local Patroni config
sudo vi /etc/patroni/patroni.yml

# Change:
tags:
  nofailover: false  # was true

# Then restart Patroni on node2 to apply
sudo systemctl restart patroni

8.3. Issue: Old primary won't demote

Symptoms: Switchover fails, old primary still leader.

Diagnosis:

TEXT
# Check Patroni logs on old primary
sudo journalctl -u patroni -n 100 | grep -i "demote\|error"

# Possible causes:
# - PostgreSQL won't stop
# - Active transactions won't terminate
# - File permission issues

Solution:

TEXT
# Restart PostgreSQL on node1 via the REST API so Patroni re-evaluates its role
curl -X POST http://10.0.1.11:8008/restart

# Or manually:
sudo -u postgres psql -h 10.0.1.11 -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE pid != pg_backend_pid();
"

sudo systemctl restart patroni

8.4. Issue: Replication broken after switchover

Symptoms: Old primary not replicating from new primary.

Diagnosis:

TEXT
# Check replication status
patronictl list postgres

# If node1 is not in a "running"/"streaming" state

# Check logs
sudo journalctl -u patroni -u postgresql -n 100

Solution:

TEXT
# A. Restart Patroni (usually auto-fixes)
sudo systemctl restart patroni

# B. Manual reinit if needed
patronictl reinit postgres node1

# Patroni will:
# 1. Stop PostgreSQL on node1
# 2. Remove data directory
# 3. pg_basebackup from node2
# 4. Start as replica

9. Switchover Automation

9.1. Scripted switchover

TEXT
#!/bin/bash
# automated-switchover.sh

set -e

CLUSTER="postgres"
OLD_PRIMARY="node1"
NEW_PRIMARY="node2"

echo "=== Starting Automated Switchover ==="
echo "From: $OLD_PRIMARY → To: $NEW_PRIMARY"

# Pre-checks
echo "Running pre-checks..."
./pre-switchover-check.sh || exit 1

# Perform switchover
echo "Executing switchover..."
patronictl switchover $CLUSTER \
  --master $OLD_PRIMARY \
  --candidate $NEW_PRIMARY \
  --force

# Wait for completion
echo "Waiting for switchover to complete..."
sleep 15

# Post-checks
echo "Running post-checks..."
new_leader=$(patronictl list $CLUSTER | grep Leader | awk '{print $2}')
if [ "$new_leader" == "$NEW_PRIMARY" ]; then
  echo "✅ Switchover successful!"
  echo "New leader: $new_leader"
else
  echo "❌ Switchover failed!"
  echo "Current leader: $new_leader"
  exit 1
fi

# Verify replication
echo "Verifying replication..."
patronictl list $CLUSTER

echo "=== Switchover Complete ==="

9.2. Ansible playbook

TEXT
# switchover.yml
---
- name: Perform Patroni switchover
  hosts: localhost
  gather_facts: no
  vars:
    cluster_name: postgres
    old_primary: node1
    new_primary: node2
  
  tasks:
    - name: Pre-check cluster health
      command: patronictl list {{ cluster_name }}
      register: cluster_status
      changed_when: false
    
    - name: Verify all nodes running
      assert:
        that:
          - "'running' in cluster_status.stdout"
        fail_msg: "Not all nodes are running"
    
    - name: Execute switchover
      command: >
        patronictl switchover {{ cluster_name }}
        --master {{ old_primary }}
        --candidate {{ new_primary }}
        --force
      register: switchover_result
    
    - name: Wait for switchover completion
      pause:
        seconds: 15
    
    - name: Verify new leader
      command: patronictl list {{ cluster_name }}
      register: final_status
      changed_when: false
    
    - name: Display result
      debug:
        msg: "{{ final_status.stdout_lines }}"
    
    - name: Verify leadership
      assert:
        that:
          - "'{{ new_primary }}' in final_status.stdout"
          - "'Leader' in final_status.stdout"
        fail_msg: "Switchover failed"
        success_msg: "Switchover successful"

Run:

TEXT
ansible-playbook switchover.yml

9.3. CI/CD integration

TEXT
# .github/workflows/db-maintenance.yml
name: Database Maintenance Switchover

on:
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2 AM
  workflow_dispatch:  # Manual trigger

jobs:
  switchover:
    runs-on: self-hosted
    steps:
      - name: Notify start
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "Starting scheduled database switchover"}'
      
      - name: Pre-checks
        run: ./scripts/pre-switchover-check.sh
      
      - name: Execute switchover
        run: |
          patronictl switchover postgres \
            --master node1 \
            --candidate node2 \
            --force
      
      - name: Verify
        run: ./scripts/post-switchover-verify.sh
      
      - name: Notify completion
        if: always()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "Switchover completed: ${{ job.status }}"}'

10. Rolling Updates with Switchover

10.1. Update strategy

Scenario: Apply a PostgreSQL minor-version update (e.g., 17.0 → 17.2). This rolling pattern only works within a major version, because streaming replicas must run the same major version as the primary; a major-version upgrade (e.g., 17 → 18) needs pg_upgrade or logical replication instead.

Steps:

TEXT
1. Update replica node3 (least critical)
   - Stop Patroni
   - Upgrade PostgreSQL
   - Start Patroni
   - Verify replication

2. Update replica node2
   - Stop Patroni
   - Upgrade PostgreSQL
   - Start Patroni
   - Verify replication

3. Switchover to node2 (now updated)
   - patronictl switchover --master node1 --candidate node2

4. Update old primary node1
   - Stop Patroni
   - Upgrade PostgreSQL
   - Start Patroni (now replica)
   - Verify replication

5. Optionally switchover back to node1
   - patronictl switchover --master node2 --candidate node1

Result: Zero-downtime upgrade ✅
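
Steps 1-2 above repeat the same per-replica procedure. A minimal sketch for one replica, assuming a RHEL-family host with PGDG packages (adjust the package manager and package name to your platform):

TEXT
# Run on the replica being updated (e.g., node3)
sudo systemctl stop patroni          # Patroni stops PostgreSQL with it
sudo yum update -y 'postgresql17*'   # minor-version package update (assumed package name)
sudo systemctl start patroni

# Wait until the node reports a healthy state before moving to the next one
until patronictl list postgres | grep node3 | grep -qE "running|streaming"; do
  sleep 5
done
echo "✅ node3 back in the cluster"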

10.2. Kernel update example

TEXT
#!/bin/bash
# rolling-kernel-update.sh

NODES=("node1" "node2" "node3")
PRIMARY=$(patronictl list postgres | grep Leader | awk '{print $2}')

echo "Current primary: $PRIMARY"

# Update replicas first
for node in "${NODES[@]}"; do
  if [ "$node" == "$PRIMARY" ]; then
    continue  # Skip primary for now
  fi
  
  echo "=== Updating $node ==="
  ssh $node 'sudo yum update -y kernel && sudo reboot'
  
  echo "Waiting for $node to come back..."
  sleep 60
  
  # Wait for node to rejoin
  until patronictl list postgres | grep $node | grep -q "running"; do
    echo "Waiting for $node..."
    sleep 10
  done
  
  echo "✅ $node updated and rejoined"
done

# Now switchover from primary
NEW_PRIMARY=${NODES[1]}  # Pick a replica
if [ "$NEW_PRIMARY" == "$PRIMARY" ]; then
  NEW_PRIMARY=${NODES[2]}
fi

echo "=== Switching over from $PRIMARY to $NEW_PRIMARY ==="
patronictl switchover postgres \
  --master $PRIMARY \
  --candidate $NEW_PRIMARY \
  --force

sleep 15

# Update old primary
echo "=== Updating $PRIMARY ==="
ssh $PRIMARY 'sudo yum update -y kernel && sudo reboot'

echo "Waiting for $PRIMARY to rejoin as replica..."
sleep 60

until patronictl list postgres | grep $PRIMARY | grep -q "running"; do
  echo "Waiting for $PRIMARY..."
  sleep 10
done

echo "✅ All nodes updated!"
patronictl list postgres

11. Lab Exercises

Lab 1: Basic switchover

Tasks:

  1. Check current primary: patronictl list
  2. Perform switchover: patronictl switchover postgres
  3. Measure downtime with continuous query loop
  4. Verify new topology
  5. Document observations

Lab 2: Scheduled switchover

Tasks:

  1. Schedule switchover for 2 minutes from now
  2. Monitor logs during wait period
  3. Observe automatic execution
  4. Cancel a scheduled switchover (repeat and test cancel)

Lab 3: Forced vs graceful

Tasks:

  1. Create long-running query: SELECT pg_sleep(300);
  2. Attempt graceful switchover (observe wait)
  3. Cancel and retry with --force
  4. Compare behavior and downtime

Lab 4: Rolling update simulation

Tasks:

  1. Start with 3-node cluster
  2. "Update" node3 (simulate by restarting)
  3. "Update" node2
  4. Switchover to node2
  5. "Update" node1
  6. Verify all nodes operational

Lab 5: Switchover under load

Tasks:

  1. Start pgbench: pgbench -c 10 -T 300
  2. During load, perform switchover
  3. Analyze pgbench output for errors (see the sketch below)
  4. Calculate success rate
  5. Test with connection pooler (PgBouncer)
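
A minimal sketch for tasks 1-3, assuming testdb has already been initialized with pgbench -i and that clients connect via 10.0.1.11:

TEXT
# Terminal 1: generate load for 5 minutes and keep the output
pgbench -h 10.0.1.11 -U postgres -c 10 -T 300 testdb 2>&1 | tee pgbench.log

# Terminal 2: perform the switchover while the load is running
patronictl switchover postgres --master node1 --candidate node2 --force

# Afterwards: count connection errors and read the summary pgbench prints
grep -ciE "error|failed" pgbench.log
tail -n 20 pgbench.log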

12. Summary

Key Concepts

✅ Switchover = Planned, controlled role change

✅ Graceful = Wait for transactions (slower, safer)

✅ Immediate = Force termination (faster, riskier)

✅ Scheduled = Automated at specific time

✅ Zero downtime = Achievable with proper architecture

Switchover vs Failover

Aspect     | Switchover | Failover
-----------|------------|----------
Planning   | Scheduled  | Unplanned
Control    | Manual     | Automatic
Downtime   | 0-10s      | 30-60s
Data loss  | None       | Possible
Reversible | Yes        | No

Best Practices

  • ✅ Test in staging first
  • ✅ Schedule during low-traffic windows
  • ✅ Use graceful mode (default)
  • ✅ Verify lag = 0 before switchover
  • ✅ Monitor during process
  • ✅ Have rollback plan
  • ✅ Communicate with stakeholders
  • ✅ Document procedure

Next Steps

Lesson 15 will cover Recovering Failed Nodes:

  • Rejoin old primary after failover
  • pg_rewind usage and scenarios
  • Full rebuild with pg_basebackup
  • Timeline divergence resolution
  • Split-brain recovery
