CloudTadaInsights

PostgreSQL HA Cluster - Monitoring Stack

PostgreSQL HA Cluster - Monitoring Stack

Full monitoring stack with Prometheus + Grafana to monitor PostgreSQL HA cluster.

Components

1. Prometheus (Port 9090)

  • Time-series database
  • Scrapes metrics every 15 seconds
  • Stores data for 30 days (configurable)
  • Alert rules for critical events

2. Grafana (Port 3000)

  • Visualization dashboard
  • Pre-configured dashboards:
    • PostgreSQL Overview
    • Patroni HA Cluster
    • System Metrics (Node Exporter)
    • etcd Cluster
  • Alert notifications (optional)

3. Exporters (On each PostgreSQL node)

  • node_exporter (9100): CPU, RAM, Disk, Network
  • postgres_exporter (9187): Connections, queries, replication lag
  • pgbouncer_exporter (9127): Connection pool stats
  • patroni metrics (8008): Leader/replica state, failover events
  • etcd metrics (2379): Cluster health, leader changes

Installation

Step 1: Configure Environment

BASH
cd /path/to/postgres-patroni-etcd-install

# Copy và edit .env
cp .env.example .env
nano .env

Key monitoring variables in .env:

BASH
# Enable/Disable Monitoring
MONITORING_ENABLED=true

# Monitoring Server (can be a separate node or use 1 of 3 postgres nodes)
MONITORING_SERVER_IP=172.23.202.11
MONITORING_SERVER_NAME=pg-node1

# Prometheus
PROMETHEUS_VERSION=3.0.1
PROMETHEUS_PORT=9090
PROMETHEUS_RETENTION_TIME=30d
PROMETHEUS_RETENTION_SIZE=50GB

# Grafana
GRAFANA_PORT=3000
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=ChangeMe@AdminPass#2024

# Exporters
NODE_EXPORTER_PORT=9100
POSTGRES_EXPORTER_PORT=9187
PGBOUNCER_EXPORTER_PORT=9127

Note:

  • Set MONITORING_ENABLED=true to enable monitoring
  • MONITORING_SERVER_IP and MONITORING_SERVER_NAME to specify a dedicated monitoring server
  • If no dedicated server, you can use pg-node1 (or any node with sufficient resources)

Step 2: Deploy Monitoring Stack

BASH
# Load environment
set -a && source .env && set +a

# Deploy full stack (recommended)
./scripts/deploy_monitoring.sh --all

# Or deploy individual components
./scripts/deploy_monitoring.sh --exporters    # Exporters only
./scripts/deploy_monitoring.sh --prometheus   # Prometheus only
./scripts/deploy_monitoring.sh --grafana      # Grafana only

Step 3: Verify Deployment

BASH
# Check health of all services
./scripts/deploy_monitoring.sh --check

Expected output:

TEXT
✓ Prometheus is healthy
✓ Grafana is healthy
✓ Node Exporter is running (all nodes)
✓ PostgreSQL Exporter is running (all nodes)
✓ PgBouncer Exporter is running (all nodes)

Access URLs

Prometheus

TEXT
URL: http://<node-ip>:9090
Targets: http://<node-ip>:9090/targets
Alerts: http://<node-ip>:9090/alerts

Grafana

TEXT
URL: http://<node-ip>:3000
Username: admin (default)
Password: Check GRAFANA_ADMIN_PASSWORD in .env

Exporters (per node)

TEXT
Node Exporter: http://<node-ip>:9100/metrics
PostgreSQL Exporter: http://<node-ip>:9187/metrics
PgBouncer Exporter: http://<node-ip>:9127/metrics
Patroni Metrics: http://<node-ip>:8008/metrics
etcd Metrics: http://<node-ip>:2379/metrics

Grafana Dashboards

After logging into Grafana, there are 4 pre-configured dashboards:

1. PostgreSQL Overview

  • Database status (up/down)
  • Active connections per database
  • Replication lag
  • Transaction rate (commits/rollbacks)
  • Cache hit ratio
  • Dead tuples count

2. Patroni HA Cluster

  • Current leader node
  • Cluster member states
  • Timeline changes (failover events)
  • DCS (etcd) connectivity

3. Node Exporter - System Metrics

  • CPU usage per core
  • Memory usage (total/available)
  • Disk usage per mount point
  • Network traffic (RX/TX)
  • Disk I/O

4. etcd Cluster

  • Leader status
  • Leader change rate
  • RPC traffic
  • Disk sync duration

Alert Rules

Prometheus has built-in alert rules for:

Critical Alerts

  • PostgreSQLDown: Database instance down > 1 minute
  • PatroniNoLeader: Cluster has no leader > 1 minute
  • EtcdNoLeader: etcd has no leader > 1 minute
  • NodeDown: Server does not respond > 2 minutes
  • PgBouncerDown: Connection pooler down > 2 minutes

Warning Alerts

  • PostgreSQLReplicationLag: Lag > 60 seconds
  • PostgreSQLTooManyConnections: > 80% max connections
  • HighCPUUsage: CPU > 80% for 5 minutes
  • HighMemoryUsage: RAM > 85% for 5 minutes
  • LowDiskSpace: Disk < 15% free space

Maintenance

Update Exporters

BASH
# Update version in .env
nano .env
# Change NODE_EXPORTER_VERSION, POSTGRES_EXPORTER_VERSION, etc.

# Re-deploy
set -a && source .env && set +a
./scripts/deploy_monitoring.sh --exporters

Restart Services

BASH
# Prometheus
ssh root@<node-ip> "systemctl restart prometheus"

# Grafana
ssh root@<node-ip> "systemctl restart grafana-server"

# Exporters (per node)
ssh root@<node-ip> "systemctl restart node_exporter"
ssh root@<node-ip> "systemctl restart postgres_exporter"
ssh root@<node-ip> "systemctl restart pgbouncer_exporter"

Check Logs

BASH
# Prometheus
ssh root@<node-ip> "journalctl -u prometheus -f"

# Grafana
ssh root@<node-ip> "journalctl -u grafana-server -f"

# Exporters
ssh root@<node-ip> "journalctl -u node_exporter -f"
ssh root@<node-ip> "journalctl -u postgres_exporter -f"

Troubleshooting

Prometheus không scrape được metrics

BASH
# Check firewall
ufw status

# Open ports if needed
ufw allow 9100/tcp  # node_exporter
ufw allow 9187/tcp  # postgres_exporter
ufw allow 9127/tcp  # pgbouncer_exporter

PostgreSQL Exporter không connect được

BASH
# Check DSN in /etc/default/postgres_exporter
cat /etc/default/postgres_exporter

# Test connection manually
psql "postgresql://admin:<password>@localhost:5432/postgres"

# Restart exporter
systemctl restart postgres_exporter

Grafana not showing dashboards

BASH
# Check provisioning directory
ls -la /var/lib/grafana/dashboards/

# Check Grafana logs
journalctl -u grafana-server -n 100

Performance Impact

Monitoring stack has minimal impact:

ComponentCPURAMDisk I/O
Prometheus< 2%~1GBLow
Grafana< 1%~200MBVery Low
node_exporter< 0.5%~20MBVery Low
postgres_exporter< 1%~50MBLow
pgbouncer_exporter< 0.5%~20MBVery Low

Total overhead per node: ~2-3% CPU, ~100MB RAM

Security Recommendations

  1. Change default Grafana password in .env
  2. Enable authentication for Prometheus if exposing to internet
  3. Use firewall to restrict access to monitoring ports
  4. Enable SSL/TLS for Grafana in production
  5. Rotate secrets in GRAFANA_SECRET_KEY

Integration with Alertmanager (Optional)

If you want to send alerts via Slack/Email:

BASH
# Install Alertmanager
# Set PROMETHEUS_ALERTMANAGER_TARGETS in .env
PROMETHEUS_ALERTMANAGER_TARGETS=localhost:9093

# Re-deploy Prometheus
./scripts/deploy_monitoring.sh --prometheus

Files Structure

TEXT
roles/
├── prometheus/
│   ├── tasks/main.yml
│   ├── templates/
│   │   ├── prometheus.yml.j2
│   │   ├── prometheus.service.j2
│   │   └── alert_rules.yml.j2
│   ├── handlers/main.yml
│   └── defaults/main.yml
│
├── grafana/
│   ├── tasks/main.yml
│   ├── templates/
│   │   ├── grafana.ini.j2
│   │   ├── datasource.yml.j2
│   │   ├── dashboard_provisioning.yml.j2
│   │   └── dashboards/
│   │       ├── postgresql_dashboard.json.j2
│   │       ├── patroni_dashboard.json.j2
│   │       ├── node_exporter_dashboard.json.j2
│   │       └── etcd_dashboard.json.j2
│   ├── handlers/main.yml
│   └── defaults/main.yml
│
└── exporters/
    ├── tasks/main.yml
    ├── templates/
    │   ├── node_exporter.service.j2
    │   ├── postgres_exporter.service.j2
    │   ├── postgres_exporter.env.j2
    │   ├── pgbouncer_exporter.service.j2
    │   └── pgbouncer_exporter.env.j2
    ├── handlers/main.yml
    └── defaults/main.yml

Additional Resources

Share this article

You might also like

Browse all articles

Lesson 4: Infrastructure Preparation for PostgreSQL HA

Setting up the infrastructure for PostgreSQL High Availability with Patroni and etcd, including hardware requirements, network configuration, firewall, SSH keys, and time synchronization.

#Database#PostgreSQL#Infrastructure

Lesson 3: Introduction to Patroni and etcd

Understanding Patroni and etcd for PostgreSQL High Availability, including DCS, Raft consensus algorithm, leader election, and split-brain prevention mechanisms.

#Database#PostgreSQL#Patroni
Series

PostgreSQL Security Hardening Guide

Essential security features and hardening measures for PostgreSQL HA cluster deployment with Patroni, etcd, and PgBouncer. Follow this guide for production security best practices.

#Database#PostgreSQL#Security