PostgreSQL HA Cluster - Monitoring Stack
A full monitoring stack, built on Prometheus and Grafana, for the PostgreSQL HA cluster.
Components
1. Prometheus (Port 9090)
- Time-series database
- Scrapes metrics every 15 seconds
- Stores data for 30 days (configurable)
- Alert rules for critical events
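The scrape interval and retention period together determine how much disk Prometheus needs. As a rough sketch, the usual sizing rule is retention time multiplied by ingested samples per second multiplied by bytes per sample. The series count and the 2 bytes per sample below are assumptions for illustration, not measurements from this cluster.

```python
# Rough Prometheus disk estimate: retention * ingestion rate * bytes per sample.
SCRAPE_INTERVAL_S = 15   # from the component description above
RETENTION_DAYS = 30      # from the component description above
BYTES_PER_SAMPLE = 2     # assumption: upper end of the commonly quoted 1-2 bytes


def samples_per_series(retention_days=RETENTION_DAYS,
                       scrape_interval_s=SCRAPE_INTERVAL_S):
    """Number of samples a single series accumulates over the retention window."""
    return retention_days * 24 * 60 * 60 // scrape_interval_s


def estimated_disk_bytes(active_series,
                         retention_days=RETENTION_DAYS,
                         scrape_interval_s=SCRAPE_INTERVAL_S,
                         bytes_per_sample=BYTES_PER_SAMPLE):
    """Approximate on-disk size for a given number of active series."""
    per_series = samples_per_series(retention_days, scrape_interval_s)
    return active_series * per_series * bytes_per_sample


if __name__ == "__main__":
    series = 10_000  # hypothetical series count, replace with your own
    print(samples_per_series())                      # 172800
    print(estimated_disk_bytes(series) / 1024 ** 3)  # roughly 3.2 (GiB)
```

Halving the scrape interval or doubling the retention period doubles the estimate, which is worth keeping in mind before changing either setting.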
2. Grafana (Port 3000)
- Visualization dashboard
- Pre-configured dashboards:
  - PostgreSQL Overview
  - Patroni HA Cluster
  - System Metrics (Node Exporter)
  - etcd Cluster
- Alert notifications (optional)
3. Exporters (On each PostgreSQL node)
- node_exporter (9100): CPU, RAM, Disk, Network
- postgres_exporter (9187): Connections, queries, replication lag
- pgbouncer_exporter (9127): Connection pool stats
- patroni metrics (8008): Leader/replica state, failover events
- etcd metrics (2379): Cluster health, leader changes
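The port list above maps directly onto Prometheus scrape targets. The following sketch expands a list of node names into `host:port` targets per exporter. Only `pg-node1` appears in this document; the other node names are hypothetical placeholders.

```python
# Exporter ports as listed in the Components section.
EXPORTER_PORTS = {
    "node_exporter": 9100,
    "postgres_exporter": 9187,
    "pgbouncer_exporter": 9127,
    "patroni": 8008,
    "etcd": 2379,
}


def scrape_targets(nodes, ports=EXPORTER_PORTS):
    """Return {job_name: ["host:port", ...]} for every exporter on every node."""
    return {
        job: [f"{host}:{port}" for host in nodes]
        for job, port in ports.items()
    }


if __name__ == "__main__":
    # pg-node2 and pg-node3 are hypothetical names used for illustration.
    nodes = ["pg-node1", "pg-node2", "pg-node3"]
    for job, targets in scrape_targets(nodes).items():
        print(job, targets)
```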
Installation
Step 1: Configure Environment
Key monitoring variables in .env:
Note:
- Set MONITORING_ENABLED=true to enable monitoring
- Set MONITORING_SERVER_IP and MONITORING_SERVER_NAME to specify a dedicated monitoring server
- If no dedicated server is available, you can use pg-node1 (or any node with sufficient resources)
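The selection logic described in the note can be sketched as follows. This is an illustration of the rules only, not the deployment script's actual code: the parser handles plain `KEY=value` lines, and the fallback to `pg-node1` mirrors the note above.

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def monitoring_host(env, fallback="pg-node1"):
    """Return the server name that should run the stack, or None if disabled."""
    if env.get("MONITORING_ENABLED", "false").lower() != "true":
        return None
    # An empty or missing name means no dedicated server was configured.
    return env.get("MONITORING_SERVER_NAME") or fallback


if __name__ == "__main__":
    # mon-server and the IP address are hypothetical example values.
    sample = """
    MONITORING_ENABLED=true
    MONITORING_SERVER_IP=10.0.0.50
    MONITORING_SERVER_NAME=mon-server
    """
    print(monitoring_host(parse_env(sample)))
```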
Step 2: Deploy Monitoring Stack
Step 3: Verify Deployment
Expected output:
Access URLs
Prometheus
Grafana
Exporters (per node)
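One way to confirm that Prometheus can reach every exporter is to query its HTTP API at `/api/v1/targets` and list any target whose health is not `up`. The sketch below assumes Prometheus is reachable on port 9090 as described earlier; the host name in the usage example is a placeholder.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url, timeout=5.0):
    """Fetch the target list from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, scrape_url, health) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("health"))
        for t in targets
        if t.get("health") != "up"
    ]


if __name__ == "__main__":
    # Placeholder host: replace with your monitoring server address.
    payload = fetch_targets("http://localhost:9090")
    for job, url, health in unhealthy_targets(payload):
        print(f"{job}: {url} is {health}")
```

An empty result means every configured target was scraped successfully on the last attempt.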
Grafana Dashboards
After logging into Grafana, there are 4 pre-configured dashboards:
1. PostgreSQL Overview
- Database status (up/down)
- Active connections per database
- Replication lag
- Transaction rate (commits/rollbacks)
- Cache hit ratio
- Dead tuples count
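The cache hit ratio panel is typically derived from the `blks_hit` and `blks_read` counters in `pg_stat_database`. A minimal sketch of the calculation follows; the exact query used by the dashboard may differ.

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block reads served from shared buffers.

    Returns None when no blocks have been read yet, to avoid dividing by zero.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


if __name__ == "__main__":
    # Example counters: 990 buffer hits and 10 reads from disk.
    print(cache_hit_ratio(990, 10))  # 0.99
```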
2. Patroni HA Cluster
- Current leader node
- Cluster member states
- Timeline changes (failover events)
- DCS (etcd) connectivity
3. Node Exporter - System Metrics
- CPU usage per core
- Memory usage (total/available)
- Disk usage per mount point
- Network traffic (RX/TX)
- Disk I/O
4. etcd Cluster
- Leader status
- Leader change rate
- RPC traffic
- Disk sync duration
Alert Rules
The stack ships with Prometheus alert rules for:
Critical Alerts
- PostgreSQLDown: Database instance down > 1 minute
- PatroniNoLeader: Cluster has no leader > 1 minute
- EtcdNoLeader: etcd has no leader > 1 minute
- NodeDown: Server does not respond > 2 minutes
- PgBouncerDown: Connection pooler down > 2 minutes
Warning Alerts
- PostgreSQLReplicationLag: Lag > 60 seconds
- PostgreSQLTooManyConnections: > 80% max connections
- HighCPUUsage: CPU > 80% for 5 minutes
- HighMemoryUsage: RAM > 85% for 5 minutes
- LowDiskSpace: Disk < 15% free space
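Rules such as HighCPUUsage fire only when the condition holds continuously for the stated duration; a single sample below the threshold resets the timer. The sketch below is a simplified model of that behavior, not Prometheus code, and the sample values are invented for illustration.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would first fire, or None.

    samples: iterable of (timestamp_seconds, value) in chronological order.
    The alert becomes pending when value > threshold and fires once the
    condition has held without interruption for at least for_seconds.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                return ts
        else:
            pending_since = None  # condition cleared, restart the timer
    return None


if __name__ == "__main__":
    # Invented CPU readings (percent), one per minute.
    cpu = [(0, 85), (60, 90), (120, 70), (180, 88), (240, 91),
           (300, 86), (360, 95), (420, 97), (480, 83)]
    # HighCPUUsage: CPU > 80% for 5 minutes.
    print(first_firing_time(cpu, threshold=80, for_seconds=300))  # 480
```

The dip to 70% at the 120-second mark resets the pending state, so the alert fires at 480 seconds rather than 300.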
Maintenance
Update Exporters
Restart Services
Check Logs
Troubleshooting
Prometheus cannot scrape metrics
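A common cause is that the exporter port is not reachable from the monitoring server, for example because a firewall blocks it or the exporter is not running. As a first check, the following sketch tests TCP connectivity to each exporter port listed in the Components section. It only verifies that the port accepts connections, not that metrics are valid.

```python
import socket

# Exporter ports as listed in the Components section.
EXPORTER_PORTS = {
    "node_exporter": 9100,
    "postgres_exporter": 9187,
    "pgbouncer_exporter": 9127,
    "patroni": 8008,
    "etcd": 2379,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host, ports=EXPORTER_PORTS, check=port_open):
    """Return the names of exporters whose port does not accept connections."""
    return [name for name, port in ports.items() if not check(host, port)]


if __name__ == "__main__":
    # Run this from the monitoring server against each database node.
    print(closed_ports("pg-node1"))
```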
PostgreSQL Exporter cannot connect
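One frequent cause, assuming the exporter is configured through a URI-style `DATA_SOURCE_NAME`, is a password containing characters such as `@`, `/` or `#` that must be percent-encoded. The sketch below builds such a connection string; the user name, password and default values are hypothetical and should be replaced with the ones from your deployment.

```python
from urllib.parse import quote


def build_dsn(user, password, host="localhost", port=5432,
              dbname="postgres", sslmode="disable"):
    """Build a URI-style PostgreSQL connection string with encoded credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}?sslmode={sslmode}"
    )


if __name__ == "__main__":
    # Hypothetical credentials for illustration only.
    print(build_dsn("postgres_exporter", "p@ss/w0rd#1"))
    # postgresql://postgres_exporter:p%40ss%2Fw0rd%231@localhost:5432/postgres?sslmode=disable
```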
Grafana not showing dashboards
Performance Impact
The monitoring stack has minimal impact:
| Component | CPU | RAM | Disk I/O |
|---|---|---|---|
| Prometheus | < 2% | ~1GB | Low |
| Grafana | < 1% | ~200MB | Very Low |
| node_exporter | < 0.5% | ~20MB | Very Low |
| postgres_exporter | < 1% | ~50MB | Low |
| pgbouncer_exporter | < 0.5% | ~20MB | Very Low |
Total overhead per node: ~2-3% CPU, ~100MB RAM
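The per-node figure can be cross-checked by summing the upper bounds of the three exporters in the table. This sketch assumes Prometheus and Grafana run on a separate monitoring server; the Patroni and etcd metrics endpoints are not listed in the table and are therefore not counted.

```python
# (CPU percent upper bound, approximate RAM in MB) taken from the table above.
EXPORTER_OVERHEAD = {
    "node_exporter": (0.5, 20),
    "postgres_exporter": (1.0, 50),
    "pgbouncer_exporter": (0.5, 20),
}


def per_node_overhead(overhead=EXPORTER_OVERHEAD):
    """Sum the CPU and RAM figures of all exporters running on one node."""
    cpu = sum(c for c, _ in overhead.values())
    ram = sum(r for _, r in overhead.values())
    return cpu, ram


if __name__ == "__main__":
    cpu, ram = per_node_overhead()
    print(f"CPU below {cpu}%, RAM about {ram} MB")  # CPU below 2.0%, RAM about 90 MB
```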
Security Recommendations
- Change default Grafana password in .env
- Enable authentication for Prometheus if exposing to internet
- Use firewall to restrict access to monitoring ports
- Enable SSL/TLS for Grafana in production
- Rotate secrets in GRAFANA_SECRET_KEY
Integration with Alertmanager (Optional)
If you want to send alerts via Slack/Email: