CloudTada | Infrastructure & DevOps Insights

PostgreSQL HA Cluster - Monitoring Stack

Full monitoring stack with Prometheus + Grafana to monitor PostgreSQL HA cluster.

Components

1. Prometheus (Port 9090)

Time-series database
Scrapes metrics every 15 seconds
Stores data for 30 days (configurable)
Alert rules for critical events

2. Grafana (Port 3000)

Visualization dashboard
Pre-configured dashboards:
- PostgreSQL Overview
- Patroni HA Cluster
- System Metrics (Node Exporter)
- etcd Cluster
Alert notifications (optional)

3. Exporters (On each PostgreSQL node)

node_exporter (9100): CPU, RAM, Disk, Network
postgres_exporter (9187): Connections, queries, replication lag
pgbouncer_exporter (9127): Connection pool stats
patroni metrics (8008): Leader/replica state, failover events
etcd metrics (2379): Cluster health, leader changes

Installation

Step 1: Configure Environment

BASH

cd /path/to/postgres-patroni-etcd-install

# Copy và edit .env
cp .env.example .env
nano .env

Key monitoring variables in .env:

BASH

# Enable/Disable Monitoring
MONITORING_ENABLED=true

# Monitoring Server (can be a separate node or use 1 of 3 postgres nodes)
MONITORING_SERVER_IP=172.23.202.11
MONITORING_SERVER_NAME=pg-node1

# Prometheus
PROMETHEUS_VERSION=3.0.1
PROMETHEUS_PORT=9090
PROMETHEUS_RETENTION_TIME=30d
PROMETHEUS_RETENTION_SIZE=50GB

# Grafana
GRAFANA_PORT=3000
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=ChangeMe@AdminPass#2024

# Exporters
NODE_EXPORTER_PORT=9100
POSTGRES_EXPORTER_PORT=9187
PGBOUNCER_EXPORTER_PORT=9127

Note:

Set MONITORING_ENABLED=true to enable monitoring
MONITORING_SERVER_IP and MONITORING_SERVER_NAME to specify a dedicated monitoring server
If no dedicated server, you can use pg-node1 (or any node with sufficient resources)

Step 2: Deploy Monitoring Stack

BASH

# Load environment
set -a && source .env && set +a

# Deploy full stack (recommended)
./scripts/deploy_monitoring.sh --all

# Or deploy individual components
./scripts/deploy_monitoring.sh --exporters    # Exporters only
./scripts/deploy_monitoring.sh --prometheus   # Prometheus only
./scripts/deploy_monitoring.sh --grafana      # Grafana only

Step 3: Verify Deployment

BASH

# Check health of all services
./scripts/deploy_monitoring.sh --check

Expected output:

TEXT

✓ Prometheus is healthy
✓ Grafana is healthy
✓ Node Exporter is running (all nodes)
✓ PostgreSQL Exporter is running (all nodes)
✓ PgBouncer Exporter is running (all nodes)

Access URLs

Prometheus

TEXT

URL: http://<node-ip>:9090
Targets: http://<node-ip>:9090/targets
Alerts: http://<node-ip>:9090/alerts

Grafana

TEXT

URL: http://<node-ip>:3000
Username: admin (default)
Password: Check GRAFANA_ADMIN_PASSWORD in .env

Exporters (per node)

TEXT

Node Exporter: http://<node-ip>:9100/metrics
PostgreSQL Exporter: http://<node-ip>:9187/metrics
PgBouncer Exporter: http://<node-ip>:9127/metrics
Patroni Metrics: http://<node-ip>:8008/metrics
etcd Metrics: http://<node-ip>:2379/metrics

Grafana Dashboards

After logging into Grafana, there are 4 pre-configured dashboards:

1. PostgreSQL Overview

Database status (up/down)
Active connections per database
Replication lag
Transaction rate (commits/rollbacks)
Cache hit ratio
Dead tuples count

2. Patroni HA Cluster

Current leader node
Cluster member states
Timeline changes (failover events)
DCS (etcd) connectivity

3. Node Exporter - System Metrics

CPU usage per core
Memory usage (total/available)
Disk usage per mount point
Network traffic (RX/TX)
Disk I/O

4. etcd Cluster

Leader status
Leader change rate
RPC traffic
Disk sync duration

Alert Rules

Prometheus has built-in alert rules for:

Critical Alerts

PostgreSQLDown: Database instance down > 1 minute
PatroniNoLeader: Cluster has no leader > 1 minute
EtcdNoLeader: etcd has no leader > 1 minute
NodeDown: Server does not respond > 2 minutes
PgBouncerDown: Connection pooler down > 2 minutes

Warning Alerts

PostgreSQLReplicationLag: Lag > 60 seconds
PostgreSQLTooManyConnections: > 80% max connections
HighCPUUsage: CPU > 80% for 5 minutes
HighMemoryUsage: RAM > 85% for 5 minutes
LowDiskSpace: Disk < 15% free space

Maintenance

Update Exporters

BASH

# Update version in .env
nano .env
# Change NODE_EXPORTER_VERSION, POSTGRES_EXPORTER_VERSION, etc.

# Re-deploy
set -a && source .env && set +a
./scripts/deploy_monitoring.sh --exporters

Restart Services

BASH

# Prometheus
ssh root@<node-ip> "systemctl restart prometheus"

# Grafana
ssh root@<node-ip> "systemctl restart grafana-server"

# Exporters (per node)
ssh root@<node-ip> "systemctl restart node_exporter"
ssh root@<node-ip> "systemctl restart postgres_exporter"
ssh root@<node-ip> "systemctl restart pgbouncer_exporter"

Check Logs

BASH

# Prometheus
ssh root@<node-ip> "journalctl -u prometheus -f"

# Grafana
ssh root@<node-ip> "journalctl -u grafana-server -f"

# Exporters
ssh root@<node-ip> "journalctl -u node_exporter -f"
ssh root@<node-ip> "journalctl -u postgres_exporter -f"

Troubleshooting

Prometheus không scrape được metrics

BASH

# Check firewall
ufw status

# Open ports if needed
ufw allow 9100/tcp  # node_exporter
ufw allow 9187/tcp  # postgres_exporter
ufw allow 9127/tcp  # pgbouncer_exporter

PostgreSQL Exporter không connect được

BASH

# Check DSN in /etc/default/postgres_exporter
cat /etc/default/postgres_exporter

# Test connection manually
psql "postgresql://admin:<password>@localhost:5432/postgres"

# Restart exporter
systemctl restart postgres_exporter

Grafana not showing dashboards

BASH

# Check provisioning directory
ls -la /var/lib/grafana/dashboards/

# Check Grafana logs
journalctl -u grafana-server -n 100

Performance Impact

Monitoring stack has minimal impact:

Component	CPU	RAM	Disk I/O
Prometheus	`< 2%`	~1GB	Low
Grafana	`< 1%`	~200MB	Very Low
node_exporter	`< 0.5%`	~20MB	Very Low
postgres_exporter	`< 1%`	~50MB	Low
pgbouncer_exporter	`< 0.5%`	~20MB	Very Low

Total overhead per node: ~2-3% CPU, ~100MB RAM

Security Recommendations

Change default Grafana password in .env
Enable authentication for Prometheus if exposing to internet
Use firewall to restrict access to monitoring ports
Enable SSL/TLS for Grafana in production
Rotate secrets in GRAFANA_SECRET_KEY

Integration with Alertmanager (Optional)

If you want to send alerts via Slack/Email:

BASH

# Install Alertmanager
# Set PROMETHEUS_ALERTMANAGER_TARGETS in .env
PROMETHEUS_ALERTMANAGER_TARGETS=localhost:9093

# Re-deploy Prometheus
./scripts/deploy_monitoring.sh --prometheus

Files Structure

TEXT

roles/
├── prometheus/
│   ├── tasks/main.yml
│   ├── templates/
│   │   ├── prometheus.yml.j2
│   │   ├── prometheus.service.j2
│   │   └── alert_rules.yml.j2
│   ├── handlers/main.yml
│   └── defaults/main.yml
│
├── grafana/
│   ├── tasks/main.yml
│   ├── templates/
│   │   ├── grafana.ini.j2
│   │   ├── datasource.yml.j2
│   │   ├── dashboard_provisioning.yml.j2
│   │   └── dashboards/
│   │       ├── postgresql_dashboard.json.j2
│   │       ├── patroni_dashboard.json.j2
│   │       ├── node_exporter_dashboard.json.j2
│   │       └── etcd_dashboard.json.j2
│   ├── handlers/main.yml
│   └── defaults/main.yml
│
└── exporters/
    ├── tasks/main.yml
    ├── templates/
    │   ├── node_exporter.service.j2
    │   ├── postgres_exporter.service.j2
    │   ├── postgres_exporter.env.j2
    │   ├── pgbouncer_exporter.service.j2
    │   └── pgbouncer_exporter.env.j2
    ├── handlers/main.yml
    └── defaults/main.yml

Series

PostgreSQL High Availability Mastery

Automating PostgreSQL High Availability with Ansible, Patroni, and etcd

PostgreSQL Security Hardening Guide

PostgreSQL HA Cluster - Monitoring Stack

Building a Highly Available PostgreSQL Cluster with Patroni, etcd, and PgBouncer

Share this article

You might also like