Lesson 17: Monitoring Patroni Cluster

After this lesson, you will be able to:

  • Understand critical metrics of a PostgreSQL HA cluster.
  • Set up Prometheus + Grafana for monitoring.
  • Configure postgres_exporter and patroni_exporter.
  • Create dashboards and alerting rules.
  • Monitor etcd cluster health.
  • Implement best practices for observability.

1. Why Monitoring Matters

1.1. Monitoring goals

Visibility:

TEXT
✅ Know cluster health in real-time
✅ Detect issues before users notice
✅ Track performance trends
✅ Capacity planning data
✅ Audit trail for incidents

Key questions to answer:

  • Is the cluster healthy?
  • Is replication working?
  • What's the lag?
  • Has a failover occurred?
  • Are connections saturated?
  • What's the query performance?
  • Is etcd healthy?
  • Are backups running?

1.2. The four golden signals

Latency: How long do requests take?

SQL
-- Query execution time
SELECT query, mean_exec_time, calls 
FROM pg_stat_statements 
ORDER BY mean_exec_time DESC 
LIMIT 10;

Traffic: How many requests?

SQL
-- Connection count
SELECT count(*) FROM pg_stat_activity;

-- Cumulative transaction count (sample twice and divide the delta by elapsed seconds for TPS)
SELECT xact_commit + xact_rollback AS total_xacts
FROM pg_stat_database 
WHERE datname = 'mydb';

Errors: What's failing?

SQL
-- Rollbacks and deadlocks (pg_stat_statements has no per-statement error counter;
-- detailed error tracking requires log analysis)
SELECT datname, xact_rollback, deadlocks
FROM pg_stat_database
WHERE datname = 'mydb';

Saturation: How full are resources?

SQL
-- Connection usage
SELECT count(*) AS connections, s.max_connections
FROM pg_stat_activity,
     (SELECT setting::int AS max_connections FROM pg_settings WHERE name = 'max_connections') s
GROUP BY s.max_connections;

2. Metrics to Monitor

2.1. Cluster-level metrics

Cluster health

  • ✅ Number of nodes up/down
  • ✅ Current leader
  • ✅ Failover count
  • ✅ Timeline number
  • ✅ Cluster configuration version
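
Most of these values are visible at a glance with patronictl (a quick manual check; the config path follows the setup used earlier in this series):

BASH
# Members, current leader, state, timeline (TL), and lag for every node
patronictl -c /etc/patroni/patroni.yml list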

Replication health

  • ✅ Replication lag (bytes and time)
  • ✅ WAL sender/receiver status
  • ✅ Sync vs async replica count
  • ✅ Replication slot usage
  • ✅ WAL segment generation rate
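
On the primary, the replication-health numbers above come straight from pg_stat_replication; a typical check looks like this:

SQL
-- Per-replica replication lag in bytes and time (run on the primary)
SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication;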

2.2. PostgreSQL metrics

Connection metrics

SQL
-- Active connections by state
SELECT state, count(*) 
FROM pg_stat_activity 
GROUP BY state;

-- state              | count
-- -------------------+-------
-- active             |    15
-- idle               |    50
-- idle in transaction|     2

Database size and growth

SQL
-- Database sizes
SELECT datname, 
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Growth rate (need historical data)
SELECT now(), 
       pg_database_size('mydb') AS size_bytes;

Transaction rate

SQL
-- Cumulative transaction counters (diff two samples over time to get TPS)
SELECT datname,
       xact_commit + xact_rollback AS total_xacts,
       xact_commit,
       xact_rollback
FROM pg_stat_database
WHERE datname = 'mydb';

Cache hit ratio

SQL
-- Buffer cache hit ratio (should be > 95%)
SELECT 
  sum(heap_blks_read) AS heap_read,
  sum(heap_blks_hit) AS heap_hit,
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100 AS cache_hit_ratio
FROM pg_statio_user_tables;

Index usage

SQL
-- Tables with missing indexes (high seq scans)
SELECT schemaname, relname, seq_scan, seq_tup_read,
       idx_scan, seq_tup_read / nullif(seq_scan, 0) AS avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;

Vacuum and autovacuum

SQL
-- Last vacuum/analyze
SELECT schemaname, relname,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Locks

SQL
-- Current locks
SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;  -- Waiting locks

Long-running queries

SQL
-- Queries running > 5 minutes
SELECT pid, 
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;

2.3. Patroni metrics

Via REST API (http://node:8008/metrics):

TEXT
# Patroni metrics
patroni_patroni_info{scope="postgres",version="3.2.0"}
patroni_postgres_running{scope="postgres"} 1
patroni_postmaster_start_time{scope="postgres"} 1732531200
patroni_timeline{scope="postgres"} 3
patroni_cluster_unlocked{scope="postgres"} 0

# Replication metrics
patroni_replication_lag_bytes{application_name="node2"} 0
patroni_xlog_location{scope="postgres"} 100663296
patroni_xlog_replayed_location{scope="postgres"} 100663296
patroni_is_leader{scope="postgres"} 1

2.4. etcd metrics

Via etcd metrics endpoint (http://node:2379/metrics):

TEXT
# etcd health
etcd_server_has_leader 1
etcd_server_is_leader 0
etcd_server_leader_changes_seen_total 2

# Performance
etcd_disk_backend_commit_duration_seconds_bucket
etcd_network_peer_round_trip_time_seconds_bucket

# Cluster size
etcd_cluster_version{cluster_version="3.5"}
etcd_server_id{server_id="node1"}
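
The same health information is available on the command line via etcdctl, which is convenient for quick manual checks (endpoints match the cluster used throughout this series):

BASH
# etcd health and per-member status, including which member is the leader
export ETCDCTL_API=3
etcdctl --endpoints=http://10.0.1.11:2379,http://10.0.1.12:2379,http://10.0.1.13:2379 endpoint health
etcdctl --endpoints=http://10.0.1.11:2379,http://10.0.1.12:2379,http://10.0.1.13:2379 endpoint status --write-out=table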

2.5. System metrics

BASH
# CPU usage
top
htop

# Memory
free -h

# Disk I/O
iostat -x 1

# Disk space
df -h

# Network
netstat -s
ss -s

3. Prometheus Setup

3.1. Install Prometheus

BASH
# Download
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar -xzf prometheus-2.48.0.linux-amd64.tar.gz
sudo mv prometheus-2.48.0.linux-amd64 /opt/prometheus

# Create user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

3.2. Configure Prometheus

YAML
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'postgres-ha'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules
rule_files:
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # PostgreSQL (via postgres_exporter)
  - job_name: 'postgresql'
    static_configs:
      - targets:
          - 10.0.1.11:9187  # node1
          - 10.0.1.12:9187  # node2
          - 10.0.1.13:9187  # node3
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance

  # Patroni (via REST API)
  - job_name: 'patroni'
    static_configs:
      - targets:
          - 10.0.1.11:8008
          - 10.0.1.12:8008
          - 10.0.1.13:8008
    metrics_path: /metrics

  # etcd
  - job_name: 'etcd'
    static_configs:
      - targets:
          - 10.0.1.11:2379
          - 10.0.1.12:2379
          - 10.0.1.13:2379
    scheme: http

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 10.0.1.11:9100
          - 10.0.1.12:9100
          - 10.0.1.13:9100
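
Before starting Prometheus, validate the file with promtool, which ships in the same tarball:

BASH
# Validate the configuration (run again after every change)
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml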

3.3. Create systemd service

BASH
# /etc/systemd/system/prometheus.service
sudo tee /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --web.console.templates=/opt/prometheus/consoles \
  --web.console.libraries=/opt/prometheus/console_libraries

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# Set permissions
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

# Start
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus

# Verify
sudo systemctl status prometheus
curl http://localhost:9090/metrics

4. Exporters Setup

4.1. postgres_exporter

BASH
# Install on each PostgreSQL node
cd /tmp
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar -xzf postgres_exporter-0.15.0.linux-amd64.tar.gz
sudo mv postgres_exporter-0.15.0.linux-amd64/postgres_exporter /usr/local/bin/

# Create monitoring user in PostgreSQL
sudo -u postgres psql -c "
CREATE USER postgres_exporter WITH PASSWORD 'exporter_password';
GRANT pg_monitor TO postgres_exporter;
"

# Create connection file
sudo tee /etc/postgres_exporter.env <<EOF
DATA_SOURCE_NAME=postgresql://postgres_exporter:exporter_password@localhost:5432/postgres?sslmode=disable
EOF

sudo chmod 600 /etc/postgres_exporter.env

Custom queries (optional):

YAML
# /etc/postgres_exporter/queries.yml
pg_replication:
  query: |
    SELECT 
      application_name,
      client_addr,
      state,
      COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), 0) AS lag_bytes,
      EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
    FROM pg_stat_replication
  metrics:
    - application_name:
        usage: "LABEL"
        description: "Application name"
    - client_addr:
        usage: "LABEL"
        description: "Client address"
    - state:
        usage: "LABEL"
        description: "Replication state"
    - lag_bytes:
        usage: "GAUGE"
        description: "Replication lag in bytes"
    - replay_lag_seconds:
        usage: "GAUGE"
        description: "Replay lag in seconds"

Systemd service:

BASH
# /etc/systemd/system/postgres_exporter.service
sudo tee /etc/systemd/system/postgres_exporter.service <<EOF
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network.target

[Service]
Type=simple
User=postgres
EnvironmentFile=/etc/postgres_exporter.env
ExecStart=/usr/local/bin/postgres_exporter \
  --web.listen-address=:9187 \
  --extend.query-path=/etc/postgres_exporter/queries.yml

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# Start
sudo systemctl daemon-reload
sudo systemctl start postgres_exporter
sudo systemctl enable postgres_exporter

# Verify
curl http://localhost:9187/metrics | grep pg_

4.2. node_exporter

BASH
# Install on each node
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# Start
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Verify
curl http://localhost:9100/metrics | head -20

4.3. Patroni metrics endpoint

No separate exporter is needed; Patroni exposes Prometheus-format metrics out of the box:

BASH
# Check Patroni metrics
curl http://localhost:8008/metrics

# Sample output:
# patroni_postgres_running 1
# patroni_postmaster_start_time 1732531200
# patroni_timeline 3
# patroni_cluster_unlocked 0
# patroni_is_leader 1

5. Grafana Setup

5.1. Install Grafana

BASH
# Add repository
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

# Install
sudo apt-get update
sudo apt-get install -y grafana

# Start
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

# Access: http://your-server:3000
# Default credentials: admin / admin

5.2. Add Prometheus data source

  1. Login to Grafana (http://localhost:3000)
  2. Go to Configuration → Data Sources
  3. Click "Add data source"
  4. Select "Prometheus"
  5. URL: http://localhost:9090
  6. Click "Save & Test"
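
If you prefer configuration as code, the same data source can be provisioned from a file instead of the UI (a minimal sketch; the path assumes a default package install of Grafana):

YAML
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true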

5.3. Import dashboards

PostgreSQL dashboard:

  1. Go to Dashboards → Import
  2. Enter dashboard ID: 9628 (PostgreSQL Database)
  3. Select Prometheus data source
  4. Click Import

Patroni dashboard (custom):

JSON
{
  "dashboard": {
    "title": "Patroni Cluster Overview",
    "panels": [
      {
        "title": "Cluster Status",
        "targets": [
          {
            "expr": "patroni_postgres_running"
          }
        ]
      },
      {
        "title": "Leader",
        "targets": [
          {
            "expr": "patroni_is_leader"
          }
        ]
      },
      {
        "title": "Timeline",
        "targets": [
          {
            "expr": "patroni_timeline"
          }
        ]
      },
      {
        "title": "Replication Lag",
        "targets": [
          {
            "expr": "patroni_replication_lag_bytes"
          }
        ]
      }
    ]
  }
}

etcd dashboard:

  • Dashboard ID: 3070 (etcd by Prometheus)

Node exporter dashboard:

  • Dashboard ID: 1860 (Node Exporter Full)

6. Alerting Rules

6.1. PostgreSQL alerts

YAML
# /etc/prometheus/alerts/postgresql.yml
groups:
  - name: postgresql
    interval: 30s
    rules:
      # PostgreSQL down
      - alert: PostgreSQLDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL instance {{ $labels.instance }} is down"
          description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute"

      # High replication lag
      - alert: PostgreSQLReplicationLag
        expr: pg_replication_lag_bytes > 104857600  # 100MB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} bytes (>100MB)"

      # Too many connections
      - alert: PostgreSQLTooManyConnections
        expr: |
          sum by (instance) (pg_stat_activity_count)
          / on (instance) group_left() pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many connections on {{ $labels.instance }}"
          description: "Connection usage is {{ $value }}% (>80%)"

      # Replication slot lag
      - alert: PostgreSQLReplicationSlotLag
        expr: pg_replication_slots_lag_bytes > 1073741824  # 1GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Replication slot {{ $labels.slot_name }} lag high"
          description: "Slot lag is {{ $value }} bytes (>1GB)"

      # Long-running queries
      - alert: PostgreSQLLongRunningQueries
        expr: pg_stat_activity_max_tx_duration > 3600  # 1 hour
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Long-running query on {{ $labels.instance }}"
          description: "Query running for {{ $value }} seconds (>1h)"

      # Dead tuples
      - alert: PostgreSQLDeadTuples
        expr: pg_stat_user_tables_n_dead_tup > 10000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "High dead tuples on table {{ $labels.table }}"
          description: "Table has {{ $value }} dead tuples"

      # Cache hit ratio low
      - alert: PostgreSQLCacheHitRatio
        expr: |
          sum by (instance) (rate(pg_stat_database_blks_hit[5m]))
          / (sum by (instance) (rate(pg_stat_database_blks_hit[5m]))
             + sum by (instance) (rate(pg_stat_database_blks_read[5m]))) * 100 < 95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit ratio on {{ $labels.instance }}"
          description: "Cache hit ratio is {{ $value }}% (<95%)"

6.2. Patroni alerts

YAML
# /etc/prometheus/alerts/patroni.yml
groups:
  - name: patroni
    interval: 30s
    rules:
      # Patroni down
      - alert: PatroniDown
        expr: up{job="patroni"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Patroni on {{ $labels.instance }} is down"

      # No leader
      - alert: PatroniNoLeader
        expr: sum(patroni_is_leader) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Patroni cluster has no leader"
          description: "No node is acting as leader in the cluster"

      # Multiple leaders (split-brain)
      - alert: PatroniMultipleLeaders
        expr: sum(patroni_is_leader) > 1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Multiple Patroni leaders detected (split-brain)"
          description: "{{ $value }} nodes claim to be leader"

      # Timeline mismatch
      - alert: PatroniTimelineMismatch
        expr: count(count_values("timeline", patroni_timeline)) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Patroni timeline mismatch"
          description: "Nodes are on different timelines"

      # Failover detected
      - alert: PatroniFailover
        expr: changes(patroni_timeline[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Patroni failover detected"
          description: "Timeline changed, indicating a failover occurred"

      # Cluster unlocked
      - alert: PatroniClusterUnlocked
        expr: patroni_cluster_unlocked == 1
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Patroni cluster is unlocked"

6.3. etcd alerts

YAML
# /etc/prometheus/alerts/etcd.yml
groups:
  - name: etcd
    interval: 30s
    rules:
      # etcd down
      - alert: EtcdDown
        expr: up{job="etcd"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd node {{ $labels.instance }} is down"

      # No leader
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster has no leader"

      # High leader changes
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "etcd frequent leader changes"
          description: "{{ $value }} leader changes in last hour"

      # High commit latency
      - alert: EtcdHighCommitLatency
        expr: |
          histogram_quantile(0.99, 
            rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
          ) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd high commit latency"
          description: "99th percentile commit latency is {{ $value }}s"
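
Whichever rule files you use, validate them before reloading Prometheus; a syntax error prevents the whole file from loading:

BASH
# Check all alert rule files, then restart Prometheus to pick them up
/opt/prometheus/promtool check rules /etc/prometheus/alerts/*.yml
sudo systemctl restart prometheus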

7. Alertmanager Setup

7.1. Install Alertmanager

BASH
# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager

7.2. Configure Alertmanager

YAML
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'PostgreSQL HA Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical'
    slack_configs:
      - channel: '#alerts-critical'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'password'

  - name: 'warning'
    slack_configs:
      - channel: '#alerts'
        title: '⚠️  Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
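
amtool, bundled in the Alertmanager tarball, can verify this file before the service is started:

BASH
# Validate the Alertmanager configuration
/opt/alertmanager/amtool check-config /etc/alertmanager/alertmanager.yml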

7.3. Start Alertmanager

BASH
# Systemd service
sudo tee /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/ \
  --web.listen-address=:9093

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# Start
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager

# Access: http://localhost:9093
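
To confirm routing end to end, push a synthetic alert into Alertmanager through its v2 API; it should show up in the web UI and, with the configuration above, in the #alerts Slack channel:

BASH
# Fire a test alert to verify notification delivery
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning", "instance": "manual-test"}}]'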

8. Best Practices

✅ DO

  1. Monitor proactively: Don't wait for users to report issues.
  2. Set meaningful thresholds: Based on your workload.
  3. Test alerts: Ensure notifications work.
  4. Document runbooks: Link alerts to resolution steps.
  5. Keep metrics retention: 30 days minimum, 1 year recommended.
  6. Use labels wisely: For filtering and grouping.
  7. Monitor the monitors: Alert if Prometheus/Grafana goes down (see the sketch after this list).
  8. Regular dashboard reviews: Update as needs change.
  9. Track SLOs/SLIs: Define and measure service levels.
  10. Correlate metrics: CPU + disk + query time together.
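
For point 7, the simplest self-monitoring rule alerts on the up metric of every scrape job; Prometheus itself is best watched by an external check or a second Prometheus instance. A minimal sketch (the file name is just an example):

YAML
# /etc/prometheus/alerts/meta.yml
groups:
  - name: meta
    rules:
      - alert: ExporterDown
        expr: up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.instance }} ({{ $labels.job }}) is down"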

❌ DON'T

  1. Don't over-alert: Alert fatigue is real.
  2. Don't ignore warnings: They become criticals.
  3. Don't forget to update: Dashboards and alerts evolve.
  4. Don't expose metrics publicly: Security risk.
  5. Don't rely on single monitoring: Have backups.
  6. Don't collect everything: Focus on what matters.
  7. Don't ignore baselines: Know your normal.
  8. Don't skip testing: Test failover detection.

9. Lab Exercises

Lab 1: Setup monitoring stack

Tasks:

  1. Install Prometheus on monitoring server.
  2. Install postgres_exporter on all nodes.
  3. Install node_exporter on all nodes.
  4. Configure scrape targets.
  5. Verify metrics collection.
  6. Install Grafana.
  7. Add Prometheus data source.
  8. Import PostgreSQL dashboard.

Lab 2: Create custom dashboard

Tasks:

  1. Create new dashboard in Grafana.
  2. Add panel for replication lag.
  3. Add panel for connection count.
  4. Add panel for TPS.
  5. Add panel for cache hit ratio.
  6. Create variables for node selection.
  7. Save and share dashboard.

Lab 3: Configure alerting

Tasks:

  1. Install Alertmanager.
  2. Create alert rules for PostgreSQL.
  3. Create alert rules for Patroni.
  4. Configure Slack notifications.
  5. Test alerts by triggering conditions.
  6. Verify notification delivery.

Lab 4: Simulate and monitor failover

Tasks:

  1. Open Grafana dashboard.
  2. Stop primary node.
  3. Watch metrics during failover.
  4. Verify alerts triggered.
  5. Document timeline.
  6. Calculate downtime from metrics (see the query sketch below).
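
For step 6, one way to approximate downtime from the collected metrics is to count the scrape samples during which the database was unreachable (a sketch assuming the 15-second scrape interval configured earlier):

TEXT
# Seconds of downtime per instance over the last hour (15s scrape interval)
sum_over_time((pg_up == bool 0)[1h:15s]) * 15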

10. Summary

Key Metrics Summary

Category      Metric         Threshold
Replication   Lag bytes      < 10MB
Replication   Lag time       < 10s
Connections   Usage %        < 80%
Cache         Hit ratio      > 95%
Queries       Long-running   < 1 hour
Disk          Usage %        < 85%
CPU           Usage %        < 80% sustained

Monitoring Stack

TEXT
Prometheus: Metrics collection and storage
├─ postgres_exporter: PostgreSQL metrics
├─ node_exporter: System metrics
├─ Patroni: HA cluster metrics (built-in)
└─ etcd: DCS metrics (built-in)

Grafana: Visualization and dashboards

Alertmanager: Alert routing and notifications

Next Steps

Lesson 18 will cover Performance Tuning:

  • PostgreSQL configuration optimization
  • Connection pooling with PgBouncer
  • Load balancing with HAProxy
  • Query optimization techniques
  • Read replica scaling strategies
