CloudTadaInsights

Lesson 11: Patroni Callbacks

Learning Objectives

After this lesson, you will:

  • Understand what Patroni callbacks are and when they are triggered
  • Implement custom scripts for lifecycle events
  • Configure callbacks for automation tasks
  • Handle role changes (primary ↔ replica)
  • Set up notifications and monitoring hooks
  • Troubleshoot callback failures

1. Callbacks Overview

1.1. What are Callbacks?

Callbacks = Custom scripts executed by Patroni at cluster lifecycle events.

Use cases:

  • 🔔 Notifications: Alert team when failover occurs
  • 🔧 Automation: Update DNS, load balancer configs
  • 📊 Monitoring: Push metrics to monitoring system
  • 🚦 Traffic management: Redirect application traffic
  • 🔐 Security: Rotate credentials, update firewall rules
  • 📝 Logging: Custom audit logs

1.2. Available callbacks

Patroni provides the following callback events:

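| Callback | Trigger | Use Case |
|---|---|---|
| on_start | When PostgreSQL starts | Startup checks, mount verification |
| on_stop | When PostgreSQL stops | Cleanup, notify applications |
| on_restart | When PostgreSQL restarts | Log restart event |
| on_reload | When a configuration reload is triggered | Verify config changes |
| on_role_change | When the node is promoted or demoted (primary ↔ replica) | Most important: update DNS, LB |
| pre_promote | After the leader lock is acquired, before the replica is promoted | Final checks before promotion |

The five on_* scripts are configured under postgresql.callbacks. pre_promote is a separate postgresql.pre_promote setting, and a non-zero exit code from it aborts the promotion. Patroni's configuration reference does not list a post_promote hook; use on_role_change for work that must follow a promotion.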

1.3. Callback execution flow

TEXT
Example: Failover scenario

Old Primary crashes
       ↓
Patroni detects failure (after TTL expires)
       ↓
Patroni selects best replica (node2)
       ↓
pre_promote script runs on node2 (if postgresql.pre_promote is set)
       ↓
PostgreSQL promoted to primary (pg_ctl promote)
       ↓
on_role_change callback runs on node2 (role=primary; master before Patroni 4.0)
       ↓
Other replicas detect the new leader and start following it
(their role does not change, so on_role_change does not fire on them)
       ↓
Failover complete

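1.4. Callback arguments

Patroni does not export per-event environment variables to callbacks. It runs each script with three positional arguments:

| Argument | Description | Example |
|---|---|---|
| $1 | Action (the callback name) | on_role_change |
| $2 | Role of the node after the event | primary (master before Patroni 4.0), replica |
| $3 | Cluster name (scope) | postgres |

The previous role is not passed. Callbacks inherit the environment of the Patroni process, so PATRONI_SCOPE and PATRONI_NAME are visible only when Patroni itself is configured through those variables. A script that refers to names such as PATRONI_ROLE has to assign them from the arguments first, for example PATRONI_ROLE="$2".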
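
Patroni invokes a callback with three positional arguments: the action, the role, and the cluster name. A minimal sketch of how a script might capture them is shown here. The function name parse_callback_args and the CB_* variable names are illustrative assumptions, not part of Patroni.

```shell
#!/bin/bash
# Map Patroni's positional callback arguments onto named variables.
# parse_callback_args and the CB_* names are hypothetical helpers.
parse_callback_args() {
    CB_ACTION="${1:-unknown}"   # e.g. on_role_change
    CB_ROLE="${2:-unknown}"     # e.g. primary, master, replica
    CB_SCOPE="${3:-unknown}"    # cluster name (scope)
    # Treat both spellings of the leader role the same way
    case "$CB_ROLE" in
        master|primary) CB_IS_PRIMARY=1 ;;
        *)              CB_IS_PRIMARY=0 ;;
    esac
}

# Example: the arguments of a promotion in cluster "postgres"
parse_callback_args on_role_change primary postgres
echo "action=$CB_ACTION role=$CB_ROLE scope=$CB_SCOPE primary=$CB_IS_PRIMARY"
# prints: action=on_role_change role=primary scope=postgres primary=1
```

Defaulting every argument to "unknown" keeps the script usable when it is run by hand without arguments.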

2. Configure Callbacks in Patroni

2.1. Basic configuration

In patroni.yml:

TEXT
scope: postgres
name: node1

postgresql:
  callbacks:
    on_start: /var/lib/postgresql/callbacks/on_start.sh
    on_stop: /var/lib/postgresql/callbacks/on_stop.sh
    on_restart: /var/lib/postgresql/callbacks/on_restart.sh
    on_reload: /var/lib/postgresql/callbacks/on_reload.sh
    on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh

Key points:

  • Paths must be absolute
  • Scripts must be executable (chmod +x)
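  • Scripts must be readable and executable by the user that runs Patroni (postgres in this course)
  • Should complete quickly (<30 seconds)
  • Patroni starts callbacks asynchronously and does not act on their exit code: a failure is logged but doesn't block the operation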
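
The first two key points can be checked mechanically before Patroni is reloaded. The sketch below assumes the callback directory used in this lesson; check_callback is a hypothetical helper name.

```shell
#!/bin/bash
# check_callback is a hypothetical helper: succeeds if the path is absolute,
# a regular file, and executable; otherwise prints the reason and returns 1.
check_callback() {
    local path="$1"
    case "$path" in
        /*) ;;
        *) echo "not absolute: $path"; return 1 ;;
    esac
    [ -f "$path" ] || { echo "missing: $path"; return 1; }
    [ -x "$path" ] || { echo "not executable: $path"; return 1; }
    echo "ok: $path"
}

# Check every callback configured in this lesson; keep going on failures
for cb in on_start on_stop on_restart on_reload on_role_change; do
    check_callback "/var/lib/postgresql/callbacks/$cb.sh" || true
done
```

Run it as the postgres user so the executable test reflects the permissions Patroni will actually have.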

2.2. Create callback directory

TEXT
# On all 3 nodes
sudo mkdir -p /var/lib/postgresql/callbacks
sudo chown postgres:postgres /var/lib/postgresql/callbacks
sudo chmod 750 /var/lib/postgresql/callbacks

3. Implement Callback Scripts

3.1. on_start callback

Use case: Pre-start validation, mount checks.

Script: /var/lib/postgresql/callbacks/on_start.sh

TEXT
#!/bin/bash
# on_start.sh - Runs when PostgreSQL starts

set -e

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

# Logging function
log() {
    echo "[$TIMESTAMP] [ON_START] $1" | tee -a "$LOG_FILE"
}

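# Patroni passes three positional arguments: action, role, cluster name
PATRONI_ROLE="${2:-unknown}"
PATRONI_SCOPE="${3:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"

log "Starting PostgreSQL on $PATRONI_NAME"
log "Role: $PATRONI_ROLE"
log "Cluster: $PATRONI_SCOPE"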

# Check disk space
DISK_USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
    log "ERROR: Disk usage is ${DISK_USAGE}% - critically high!"
    exit 1
fi
log "Disk usage: ${DISK_USAGE}%"

# Check if data directory is mounted
if ! mountpoint -q /var/lib/postgresql/18/data; then
    log "WARNING: Data directory is not a mount point"
fi

# Check network connectivity to etcd
for ETCD_HOST in 10.0.1.11 10.0.1.12 10.0.1.13; do
    if ! nc -zw3 "$ETCD_HOST" 2379 2>/dev/null; then
        log "ERROR: Cannot reach etcd at $ETCD_HOST:2379"
        exit 1
    fi
done
log "etcd connectivity verified"

log "Pre-start checks passed"
exit 0

Create script:

TEXT
# On all nodes
sudo tee /var/lib/postgresql/callbacks/on_start.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_START] $1" | tee -a "$LOG_FILE"; }

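# Patroni passes: $1=action, $2=role, $3=cluster name
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"
log "Starting PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"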

# Disk space check
DISK_USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
    log "ERROR: Disk usage ${DISK_USAGE}% too high"
    exit 1
fi
log "Disk usage: ${DISK_USAGE}%"

log "Pre-start checks passed"
exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_start.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_start.sh

3.2. on_stop callback

Use case: Graceful shutdown notifications.

Script: /var/lib/postgresql/callbacks/on_stop.sh

TEXT
#!/bin/bash
# on_stop.sh - Runs when PostgreSQL stops

set -e

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

log() {
    echo "[$TIMESTAMP] [ON_STOP] $1" | tee -a "$LOG_FILE"
}

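# Patroni passes three positional arguments: action, role, cluster name
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"

log "Stopping PostgreSQL on $PATRONI_NAME"
log "Role: $PATRONI_ROLE"

# Notify monitoring system (bounded so a dead endpoint cannot stall the callback)
if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 -X POST http://monitoring.example.com/api/events \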
        -H "Content-Type: application/json" \
        -d "{
            \"event\": \"postgresql_stop\",
            \"node\": \"$PATRONI_NAME\",
            \"role\": \"$PATRONI_ROLE\",
            \"timestamp\": \"$TIMESTAMP\"
        }" || log "WARNING: Failed to notify monitoring"
fi

log "PostgreSQL stop initiated"
exit 0

Create script:

TEXT
sudo tee /var/lib/postgresql/callbacks/on_stop.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_STOP] $1" | tee -a "$LOG_FILE"; }

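# Patroni passes: $1=action, $2=role, $3=cluster name
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"
log "Stopping PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"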
exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_stop.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_stop.sh

3.3. on_role_change callback (Most Important!)

Use case: Update DNS, load balancers, send notifications.

Script: /var/lib/postgresql/callbacks/on_role_change.sh

TEXT
#!/bin/bash
# on_role_change.sh - Runs when role changes (master ↔ replica)

set -e

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

log() {
    echo "[$TIMESTAMP] [ROLE_CHANGE] $1" | tee -a "$LOG_FILE"
}

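# Patroni passes three positional arguments: action, role, cluster name.
# The previous role is not passed, so PATRONI_OLD_ROLE stays unset here.
PATRONI_ROLE="${2:-unknown}"
PATRONI_SCOPE="${3:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"

log "=========================================="
log "Role change detected on $PATRONI_NAME"
log "Cluster: $PATRONI_SCOPE"
log "New role: $PATRONI_ROLE"
log "=========================================="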

# Function: Update DNS
update_dns() {
    local NEW_PRIMARY_IP="$1"
    
    log "Updating DNS record for primary.postgres.local -> $NEW_PRIMARY_IP"
    
    # Example using nsupdate (BIND DNS)
    # nsupdate -k /etc/dns/Kpostgres.+157+12345.key << EOF
    # server dns-server.local
    # zone postgres.local
    # update delete primary.postgres.local A
    # update add primary.postgres.local 60 A $NEW_PRIMARY_IP
    # send
    # EOF
    
    # Or using API (e.g., Route53, Cloudflare)
    # aws route53 change-resource-record-sets --hosted-zone-id Z1234 ...
    
    log "DNS update completed"
}

# Function: Update HAProxy
update_haproxy() {
    local NEW_PRIMARY_IP="$1"
    
    log "Notifying HAProxy about new primary: $NEW_PRIMARY_IP"
    
    # Use HAProxy stats socket
    # echo "set server postgres/primary addr $NEW_PRIMARY_IP" | \
    #     socat stdio /var/run/haproxy.sock
    
    log "HAProxy updated"
}

# Function: Send Slack notification
send_notification() {
    local MESSAGE="$1"
    local WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    
    log "Sending notification: $MESSAGE"
    
    curl -s --max-time 10 -X POST "$WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{
            \"text\": \"🔄 PostgreSQL Role Change\",
            \"attachments\": [{
                \"color\": \"warning\",
                \"fields\": [
                    {\"title\": \"Node\", \"value\": \"$PATRONI_NAME\", \"short\": true},
                    {\"title\": \"Cluster\", \"value\": \"$PATRONI_SCOPE\", \"short\": true},
                    {\"title\": \"Old Role\", \"value\": \"${PATRONI_OLD_ROLE:-N/A}\", \"short\": true},
                    {\"title\": \"New Role\", \"value\": \"$PATRONI_ROLE\", \"short\": true},
                    {\"title\": \"Time\", \"value\": \"$TIMESTAMP\", \"short\": false}
                ]
            }]
        }" || log "WARNING: Notification failed"
}

# Main logic
case "$PATRONI_ROLE" in
    master|primary)   # Patroni 4.0+ passes "primary", older releases "master"
        log "This node is now PRIMARY"
        
        # Get this node's IP
        NODE_IP=$(hostname -I | awk '{print $1}')
        log "Node IP: $NODE_IP"
        
        # Update DNS to point to new primary
        update_dns "$NODE_IP"
        
        # Update load balancer
        update_haproxy "$NODE_IP"
        
        # Send notification
        send_notification "Node $PATRONI_NAME promoted to PRIMARY"
        
        # Set marker file for applications
        touch /tmp/postgres_is_primary
        rm -f /tmp/postgres_is_replica
        
        log "Primary promotion tasks completed"
        ;;
        
    replica)
        log "This node is now REPLICA"
        
        # Patroni does not pass the previous role. A primary marker left by
        # an earlier promotion means this node has just been demoted.
        if [ -e /tmp/postgres_is_primary ]; then
            send_notification "Node $PATRONI_NAME demoted to REPLICA"
        fi
        
        # Swap marker files
        rm -f /tmp/postgres_is_primary
        touch /tmp/postgres_is_replica
        
        log "Replica role tasks completed"
        ;;
        
    *)
        log "Unknown role: $PATRONI_ROLE"
        exit 1
        ;;
esac

log "Role change handling completed successfully"
exit 0

Create production-ready script:

TEXT
sudo tee /var/lib/postgresql/callbacks/on_role_change.sh > /dev/null << 'EOF'
#!/bin/bash
set -e

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ROLE_CHANGE] $1" | tee -a "$LOG_FILE"; }

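# Patroni passes: $1=action, $2=role, $3=cluster name
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"

log "=========================================="
log "Role change: $PATRONI_NAME"
log "New role: $PATRONI_ROLE"
log "=========================================="

case "$PATRONI_ROLE" in
    master|primary)   # Patroni 4.0+ passes "primary", older releases "master"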
        log "This node is now PRIMARY"
        NODE_IP=$(hostname -I | awk '{print $1}')
        log "Node IP: $NODE_IP"
        
        # TODO: Update DNS, load balancer, etc.
        # update_dns "$NODE_IP"
        
        touch /tmp/postgres_is_primary
        rm -f /tmp/postgres_is_replica
        ;;
        
    replica)
        log "This node is now REPLICA"
        rm -f /tmp/postgres_is_primary
        touch /tmp/postgres_is_replica
        ;;
        
    *)
        log "Unknown role: $PATRONI_ROLE"
        exit 1
        ;;
esac

log "Role change completed"
exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_role_change.sh

3.4. on_restart callback

Use case: Log restarts, notify about planned maintenance.

TEXT
sudo tee /var/lib/postgresql/callbacks/on_restart.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_RESTART] $1" | tee -a "$LOG_FILE"; }

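# Patroni passes: $1=action, $2=role, $3=cluster name
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"
log "Restarting PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"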
exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_restart.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_restart.sh

3.5. on_reload callback

Use case: Verify configuration changes were applied.

TEXT
sudo tee /var/lib/postgresql/callbacks/on_reload.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_RELOAD] $1" | tee -a "$LOG_FILE"; }

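PATRONI_NAME="${PATRONI_NAME:-$(hostname)}"
log "Configuration reloaded on $PATRONI_NAME"

# Verify critical settings (the callback already runs as postgres, so no sudo;
# a failed query must not abort the callback under set -e)
MAX_CONN=$(psql -X -At -c "SHOW max_connections;" 2>/dev/null || echo "unavailable")
log "max_connections = $MAX_CONN"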

exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_reload.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_reload.sh

3.6. Create log directory

TEXT
# On all nodes
sudo mkdir -p /var/log/patroni
sudo chown postgres:postgres /var/log/patroni
sudo chmod 750 /var/log/patroni

4. Update Patroni Configuration

4.1. Add callbacks to patroni.yml

On all 3 nodes, edit /etc/patroni/patroni.yml:

TEXT
scope: postgres
namespace: /service/
name: node1  # node2, node3 for other nodes

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008  # Change per node

etcd3:
  hosts: 10.0.1.11:2379,10.0.1.12:2379,10.0.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true
    synchronous_mode_strict: false
    
    postgresql:
      parameters:
        max_connections: 100
        shared_buffers: 256MB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432  # Change per node
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password
  
  parameters:
    unix_socket_directories: '/var/run/postgresql'
  
  # ✅ Add callbacks section
  callbacks:
    on_start: /var/lib/postgresql/callbacks/on_start.sh
    on_stop: /var/lib/postgresql/callbacks/on_stop.sh
    on_restart: /var/lib/postgresql/callbacks/on_restart.sh
    on_reload: /var/lib/postgresql/callbacks/on_reload.sh
    on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

4.2. Reload Patroni configuration

TEXT
# On all 3 nodes
sudo systemctl reload patroni

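# Verify callbacks configured
# (callbacks are local settings; patronictl show-config prints only the
# dynamic configuration stored in the DCS, so check the local file instead)
grep -A6 "callbacks:" /etc/patroni/patroni.yml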

5. Test Callbacks

5.1. Test on_restart

TEXT
# Restart a node
patronictl restart postgres node2

# Check logs
sudo tail -f /var/log/patroni/callbacks.log

# Expected output:
# [2024-11-25 10:30:15] [ON_RESTART] Restarting PostgreSQL on node2 (Role: replica)

5.2. Test on_reload

TEXT
# Reload configuration
patronictl reload postgres node2

# Check logs
sudo tail /var/log/patroni/callbacks.log

# Expected:
# [2024-11-25 10:32:45] [ON_RELOAD] Configuration reloaded on node2

5.3. Test on_role_change (Failover)

⚠️ IMPORTANT: Test in non-production!

TEXT
# 1. Check current primary
patronictl list postgres
# node1 is Leader

# 2. Stop primary
sudo systemctl stop patroni  # On node1

# 3. Watch logs on node2 (will become new primary)
sudo tail -f /var/log/patroni/callbacks.log

# Expected output:
# [2024-11-25 10:35:10] [ROLE_CHANGE] ==========================================
# [2024-11-25 10:35:10] [ROLE_CHANGE] Role change: node2
# [2024-11-25 10:35:10] [ROLE_CHANGE] New role: primary
# [2024-11-25 10:35:10] [ROLE_CHANGE] This node is now PRIMARY
# [2024-11-25 10:35:10] [ROLE_CHANGE] Node IP: 10.0.1.12
# [2024-11-25 10:35:10] [ROLE_CHANGE] Role change completed

# 4. Verify marker file
ls -la /tmp/postgres_is_*
# -rw-r--r-- 1 postgres postgres 0 Nov 25 10:35 /tmp/postgres_is_primary

# 5. Start Patroni on node1 again (it will rejoin as replica)
sudo systemctl start patroni  # On node1

# 6. Check node1 logs
sudo tail /var/log/patroni/callbacks.log
# [2024-11-25 10:36:30] [ROLE_CHANGE] New role: replica
# [2024-11-25 10:36:30] [ROLE_CHANGE] This node is now REPLICA

6. Advanced Callback Examples

6.1. DNS update using nsupdate

Prerequisites: BIND DNS server with DDNS enabled.

TEXT
#!/bin/bash
# Update DNS via nsupdate

update_dns() {
    local NEW_PRIMARY_IP="$1"
    local DNS_KEY="/etc/dns/Kpostgres.+157+12345.key"
    local DNS_SERVER="dns.example.com"
    local ZONE="postgres.local"
    local RECORD="primary.postgres.local"
    
    log "Updating DNS: $RECORD -> $NEW_PRIMARY_IP"
    
    nsupdate -k "$DNS_KEY" << EOF
server $DNS_SERVER
zone $ZONE
update delete $RECORD A
update add $RECORD 60 A $NEW_PRIMARY_IP
send
EOF
    
    if [ $? -eq 0 ]; then
        log "DNS updated successfully"
    else
        log "ERROR: DNS update failed"
        return 1
    fi
}

# In on_role_change.sh
if [ "$PATRONI_ROLE" = "master" ] || [ "$PATRONI_ROLE" = "primary" ]; then
    NODE_IP=$(hostname -I | awk '{print $1}')
    update_dns "$NODE_IP"
fi

6.2. HAProxy backend update

Via stats socket:

TEXT
update_haproxy() {
    local NEW_PRIMARY_IP="$1"
    local HAPROXY_SOCKET="/var/run/haproxy.sock"
    
    log "Updating HAProxy: primary backend -> $NEW_PRIMARY_IP"
    
    echo "set server postgres-primary/node addr $NEW_PRIMARY_IP port 5432" | \
        socat stdio "$HAPROXY_SOCKET"
    
    echo "set server postgres-primary/node state ready" | \
        socat stdio "$HAPROXY_SOCKET"
    
    log "HAProxy backend updated"
}

6.3. Consul service registration

TEXT
register_in_consul() {
    local ROLE="$1"
    local NODE_IP="$2"
    
    log "Registering in Consul: $PATRONI_NAME as $ROLE"
    
    curl -s -X PUT "http://consul.local:8500/v1/agent/service/register" \
        -H "Content-Type: application/json" \
        -d "{
            \"Name\": \"postgres-$ROLE\",
            \"ID\": \"postgres-$PATRONI_NAME\",
            \"Address\": \"$NODE_IP\",
            \"Port\": 5432,
            \"Tags\": [\"$ROLE\", \"patroni\"],
            \"Check\": {
                \"TCP\": \"$NODE_IP:5432\",
                \"Interval\": \"10s\",
                \"Timeout\": \"2s\"
            }
        }"
    
    log "Consul registration completed"
}

# Usage
NODE_IP=$(hostname -I | awk '{print $1}')
register_in_consul "$PATRONI_ROLE" "$NODE_IP"

6.4. Email notification

TEXT
send_email_alert() {
    local SUBJECT="$1"
    local BODY="$2"
    local RECIPIENT="ops-team@example.com"
    
    log "Sending email alert: $SUBJECT"
    
    echo "$BODY" | mail -s "$SUBJECT" "$RECIPIENT"
    
    log "Email sent to $RECIPIENT"
}

# In on_role_change.sh
if [ "$PATRONI_ROLE" = "master" ] || [ "$PATRONI_ROLE" = "primary" ]; then
    send_email_alert \
        "[ALERT] PostgreSQL Failover: $PATRONI_NAME promoted to PRIMARY" \
        "Cluster: $PATRONI_SCOPE
Node: $PATRONI_NAME
Old Role: ${PATRONI_OLD_ROLE}
New Role: $PATRONI_ROLE
Time: $TIMESTAMP

Action required: Verify cluster health"
fi

6.5. Slack/Teams webhook

Detailed Slack notification:

TEXT
send_slack_alert() {
    local WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    local COLOR="$1"  # good, warning, danger
    local TITLE="$2"
    local MESSAGE="$3"
    
    curl -s -X POST "$WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{
            \"username\": \"Patroni Monitor\",
            \"icon_emoji\": \":database:\",
            \"attachments\": [{
                \"color\": \"$COLOR\",
                \"title\": \"$TITLE\",
                \"text\": \"$MESSAGE\",
                \"fields\": [
                    {\"title\": \"Cluster\", \"value\": \"$PATRONI_SCOPE\", \"short\": true},
                    {\"title\": \"Node\", \"value\": \"$PATRONI_NAME\", \"short\": true},
                    {\"title\": \"Old Role\", \"value\": \"${PATRONI_OLD_ROLE:-N/A}\", \"short\": true},
                    {\"title\": \"New Role\", \"value\": \"$PATRONI_ROLE\", \"short\": true},
                    {\"title\": \"Timestamp\", \"value\": \"$TIMESTAMP\", \"short\": false}
                ],
                \"footer\": \"PostgreSQL HA\",
                \"footer_icon\": \"https://www.postgresql.org/media/img/about/press/elephant.png\"
            }]
        }"
}

# Usage
if [ "$PATRONI_ROLE" = "master" ] || [ "$PATRONI_ROLE" = "primary" ]; then
    send_slack_alert "warning" \
        "🚨 Failover Event" \
        "Node $PATRONI_NAME has been promoted to PRIMARY"
fi

6.6. Metrics push to monitoring

Push to Prometheus Pushgateway:

TEXT
push_metrics() {
    local PUSHGATEWAY="http://pushgateway.local:9091"
    local JOB="patroni_callbacks"
    
    log "Pushing metrics to Prometheus"
    
    cat << EOF | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/$JOB/instance/$PATRONI_NAME"
# TYPE patroni_role_change counter
# HELP patroni_role_change Number of role changes
patroni_role_change{cluster="$PATRONI_SCOPE",node="$PATRONI_NAME",new_role="$PATRONI_ROLE"} 1

# TYPE patroni_role_change_timestamp gauge
# HELP patroni_role_change_timestamp Timestamp of last role change
patroni_role_change_timestamp{cluster="$PATRONI_SCOPE",node="$PATRONI_NAME"} $(date +%s)
EOF
    
    log "Metrics pushed"
}

7. Callback Best Practices

✅ DO

  1. Keep callbacks fast
    • Complete within 10-30 seconds
    • Long tasks → background jobs
  2. Use proper logging
    • Log all actions
    • Include timestamps
    • Rotate logs
  3. Handle errors gracefully
    • Use set -e carefully
    • Catch errors, log, continue
    • Non-zero exit = warning, not failure
  4. Test thoroughly
    • Test in staging
    • Simulate all scenarios
    • Verify idempotency
  5. Make scripts idempotent
    • Can run multiple times safely
    • Check before modify
  6. Use absolute paths
    • Don't rely on PATH
    • Specify full paths
  7. Secure credentials
    • Don't hardcode passwords
    • Use environment variables or secrets manager
  8. Monitor callback execution
    • Alert on failures
    • Track execution time
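
Item 5 can be illustrated with the marker files used in this lesson. The sketch below is one possible shape, not the only one. set_role_marker is a hypothetical helper, and the optional second argument (the directory) exists only so the function can be tried outside /tmp.

```shell
#!/bin/bash
# set_role_marker is a hypothetical helper: make the marker files match the
# given role. Calling it again with the same role changes nothing.
set_role_marker() {
    local role="$1" dir="${2:-/tmp}"
    case "$role" in
        master|primary)
            rm -f "$dir/postgres_is_replica"
            [ -e "$dir/postgres_is_primary" ] || touch "$dir/postgres_is_primary"
            ;;
        replica)
            rm -f "$dir/postgres_is_primary"
            [ -e "$dir/postgres_is_replica" ] || touch "$dir/postgres_is_replica"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}

# Demonstration in a scratch directory
demo_dir=$(mktemp -d)
set_role_marker primary "$demo_dir"
set_role_marker primary "$demo_dir"   # second run is a no-op
ls "$demo_dir"
rm -rf "$demo_dir"
```

Because the function converges on the desired state instead of toggling it, a callback that runs twice for the same event leaves the node in the same condition.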

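❌ DON'T

  1. Don't block for long time
    • Patroni does not wait for callbacks, so a slow script does not hold up the promotion itself
    • It does delay the DNS, load balancer, or alert update the script performs, and depending on the Patroni version a still-running callback may be killed when the next event fires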
  2. Don't rely on network during failover
    • Network may be partitioned
    • Have fallback mechanisms
  3. Don't fail the callback unnecessarily
    • Exit 0 even if notification fails
    • Log errors but continue
  4. Don't run database queries in callbacks
    • PostgreSQL may not be ready
    • Can cause deadlocks
  5. Don't modify PostgreSQL configuration
    • Let Patroni manage config
    • Use Patroni's parameters
  6. Don't use interactive commands
    • No user input
    • Must run unattended
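
Items 1 and 3 can be combined in a small wrapper: bound each external step with a time limit, and downgrade its failure to a warning. This is a sketch under the assumption that GNU coreutils timeout is installed; best_effort is a hypothetical name.

```shell
#!/bin/bash
# best_effort is a hypothetical wrapper: run a command with a time limit,
# report a failure on stderr, and always return 0 so the callback continues.
best_effort() {
    local limit="$1"; shift
    if timeout "$limit" "$@"; then
        return 0
    fi
    echo "WARNING: '$*' failed or exceeded ${limit}s" >&2
    return 0
}

best_effort 2 true        # succeeds quietly
best_effort 1 sleep 5     # cut off after 1 second, warning only
best_effort 2 false       # fails, warning only
echo "callback continues"
```

In a callback, a notification step would then read best_effort 10 curl ... rather than a bare curl call, which keeps set -e from aborting the script when the endpoint is down.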

8. Troubleshoot Callback Issues

8.1. Callback not executing

Check:

TEXT
# 1. Verify script exists
ls -la /var/lib/postgresql/callbacks/on_role_change.sh

# 2. Check executable permissions
# Should be: -rwxr-xr-x postgres postgres
sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh

# 3. Check ownership
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_role_change.sh

# 4. Verify path in patroni.yml
grep -A5 "callbacks:" /etc/patroni/patroni.yml

# 5. Check Patroni logs
sudo journalctl -u patroni -n 100 | grep -i callback

8.2. Callback failing

Check logs:

TEXT
# Patroni logs
sudo journalctl -u patroni | grep "callback.*failed"

# Callback logs
sudo tail -f /var/log/patroni/callbacks.log

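# Test script manually with the same arguments Patroni passes
# (action, role, cluster name)
sudo -u postgres /var/lib/postgresql/callbacks/on_role_change.sh on_role_change replica postgres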

Common issues:

  • Syntax error: Run bash -n script.sh to check
  • Missing dependency: Install required tools (curl, nc, etc.)
  • Permission denied: Check file/directory permissions
  • Timeout: Script taking too long
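
For the missing-dependency case, a quick preflight check can list what is absent before a failover exposes it. The tool list below is taken from the examples in this lesson; missing_tools is a hypothetical helper name.

```shell
#!/bin/bash
# missing_tools is a hypothetical helper: print each command from the
# argument list that is not found on PATH; return 1 if any is missing.
missing_tools() {
    local tool rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Tools used by the callback examples in this lesson
if missing=$(missing_tools curl nc socat nsupdate mail); then
    echo "all callback dependencies found"
else
    echo "missing: $(echo "$missing" | tr '\n' ' ')"
fi
```

Run it as the postgres user, since PATH for that account can differ from an interactive admin shell.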

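8.3. Slow callbacks

Patroni does not wait for a callback before continuing, but a slow callback delays whatever it automates (DNS, load balancer, alerts). Measure callback execution time: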

TEXT
# Add timing to script
START_TIME=$(date +%s)

# ... your callback logic ...

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
log "Callback completed in ${DURATION} seconds"

# If DURATION > 30, investigate and optimize
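
To find which step is slow rather than only the total, the same timing idea can wrap individual commands. A minimal sketch, with timed as a hypothetical helper and whole-second resolution from date +%s:

```shell
#!/bin/bash
# timed is a hypothetical helper: run a command, print how long it took,
# and preserve the command's exit status.
timed() {
    local label="$1"; shift
    local start end rc=0
    start=$(date +%s)
    "$@" || rc=$?
    end=$(date +%s)
    echo "[$label] took $((end - start))s (exit $rc)"
    return "$rc"
}

# Stand-ins for real steps such as a DNS update or a webhook call
timed "dns-update" sleep 1
timed "notify" true
```

Sending these lines to the callback log gives a per-step breakdown that can be compared across failover tests.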

9. Production Callback Template

Complete production-ready template:

TEXT
#!/bin/bash
# Patroni callback template
# File: /var/lib/postgresql/callbacks/on_role_change.sh
# Invoked by Patroni as: on_role_change.sh <action> <role> <cluster-name>

set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration
readonly LOG_FILE="/var/log/patroni/callbacks.log"
readonly LOCK_DIR="/tmp/callback_role_change.lock"
readonly TIMEOUT=30
readonly SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:-}"

# Positional arguments passed by Patroni. The node name is only available
# if PATRONI_NAME is set in Patroni's environment, so fall back to hostname.
readonly ROLE="${2:-unknown}"
readonly SCOPE="${3:-unknown}"
readonly NODE="${PATRONI_NAME:-$(hostname)}"

# Logging function
log() {
    local LEVEL="$1"
    shift
    local MESSAGE="$*"
    local TIMESTAMP
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$TIMESTAMP] [$LEVEL] [ROLE_CHANGE] $MESSAGE" | tee -a "$LOG_FILE"
}

# Cleanup function (the lock is a directory because mkdir is atomic)
cleanup() {
    rmdir "$LOCK_DIR" 2>/dev/null || true
}

# Error handler
error_exit() {
    log "ERROR" "$1"
    exit 1
}

# Re-run this script under timeout(1) so a hung step cannot run forever.
# The inner run sees CALLBACK_UNDER_TIMEOUT and skips this block.
if [ -z "${CALLBACK_UNDER_TIMEOUT:-}" ]; then
    export CALLBACK_UNDER_TIMEOUT=1
    timeout "$TIMEOUT" "$0" "$@" \
        || error_exit "Callback failed or timed out after ${TIMEOUT}s"
    exit 0
fi

# Ensure only one instance runs
if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    log "WARN" "Another callback instance is running, exiting"
    exit 0
fi

trap cleanup EXIT
trap 'exit 143' TERM   # make sure the EXIT trap runs when timeout sends SIGTERM

log "INFO" "=========================================="
log "INFO" "Role change detected"
log "INFO" "Cluster: $SCOPE"
log "INFO" "Node: $NODE"
log "INFO" "New role: $ROLE"
log "INFO" "=========================================="

# Main logic
case "$ROLE" in
    master|primary)   # Patroni 4.0+ passes "primary", older releases "master"
        log "INFO" "Handling promotion to PRIMARY"
        
        # Get node IP
        NODE_IP=$(hostname -I | awk '{print $1}')
        log "INFO" "Node IP: $NODE_IP"
        
        # Update DNS (implement your logic)
        # update_dns "$NODE_IP" || log "WARN" "DNS update failed"
        
        # Update load balancer (implement your logic)
        # update_load_balancer "$NODE_IP" || log "WARN" "LB update failed"
        
        # Send notification
        if [ -n "$SLACK_WEBHOOK" ]; then
            curl -s --max-time 10 -X POST "$SLACK_WEBHOOK" \
                -H "Content-Type: application/json" \
                -d "{\"text\": \"🚨 Failover: $NODE promoted to PRIMARY\"}" \
                || log "WARN" "Slack notification failed"
        fi
        
        # Set marker files
        touch /tmp/postgres_is_primary
        rm -f /tmp/postgres_is_replica
        
        log "INFO" "PRIMARY promotion tasks completed"
        ;;
        
    replica)
        log "INFO" "Handling REPLICA role"
        
        # Patroni does not pass the previous role. A primary marker left by
        # an earlier promotion means this node has just been demoted.
        if [ -e /tmp/postgres_is_primary ]; then
            log "WARN" "Node demoted from PRIMARY to REPLICA"
            # Send alert
        fi
        
        # Swap marker files
        rm -f /tmp/postgres_is_primary
        touch /tmp/postgres_is_replica
        
        log "INFO" "REPLICA tasks completed"
        ;;
        
    *)
        error_exit "Unknown role: $ROLE"
        ;;
esac

log "INFO" "Callback completed successfully"
exit 0

10. Lab Exercises

Lab 1: Setup basic callbacks

Tasks:

  1. Create callback directory and scripts
  2. Add callbacks to patroni.yml
  3. Reload Patroni
  4. Test with patronictl restart

Lab 2: Test failover callbacks

Tasks:

  1. Monitor callback logs: tail -f /var/log/patroni/callbacks.log
  2. Stop primary: sudo systemctl stop patroni
  3. Verify on_role_change executed on new primary
  4. Check marker files: /tmp/postgres_is_*
  5. Restart old primary, verify it rejoins as replica

Lab 3: Implement Slack notifications

Tasks:

  1. Get Slack webhook URL
  2. Add notification to on_role_change.sh
  3. Test by triggering failover
  4. Verify message received in Slack

Lab 4: Measure callback performance

Tasks:

  1. Add timing to all callbacks
  2. Trigger various events (restart, reload, failover)
  3. Analyze callback execution times
  4. Optimize slow callbacks

11. Summary

Key Takeaways

✅ Callbacks = Custom automation at lifecycle events

✅ on_role_change = Most critical callback for failover automation

✅ Keep callbacks fast (<30s) so DNS, load balancer, and alert updates land soon after a failover

✅ Log everything for debugging

✅ Test thoroughly before production

✅ Handle errors gracefully - don't block operations

Common Use Cases

| Callback | Common Actions |
|---|---|
| on_start | Pre-flight checks, mount verification |
| on_stop | Cleanup, notifications |
| on_role_change | Update DNS, LB, send alerts |
| on_restart | Log maintenance events |
| on_reload | Verify config changes |

Current Architecture

TEXT
✅ 3 VMs prepared (Lesson 4)
✅ PostgreSQL 18 installed (Lesson 5)
✅ etcd cluster running (Lesson 6)
✅ Patroni installed (Lesson 7)
✅ Patroni configured (Lesson 8)
✅ Cluster bootstrapped (Lesson 9)
✅ Replication configured (Lesson 10)
✅ Callbacks implemented (Lesson 11)

Next: REST API usage

Preparing for Lesson 12

Lesson 12 will cover Patroni REST API:

  • Health check endpoints
  • Cluster status queries
  • Configuration management via API
  • Integration with load balancers
  • Monitoring and metrics
