Monitoring Patroni Cluster
After this lesson, you will be able to:
- Understand critical metrics of a PostgreSQL HA cluster.
- Set up Prometheus + Grafana for monitoring.
- Configure postgres_exporter and patroni_exporter.
- Create dashboards and alerting rules.
- Monitor etcd cluster health.
- Implement best practices for observability.
1. Why Monitoring Matters
1.1. Monitoring goals
Visibility:
Key questions to answer:
- Is the cluster healthy?
- Is replication working?
- How large is the replication lag?
- Has a failover occurred?
- Are connections saturated?
- How are queries performing?
- Is etcd healthy?
- Are backups running?
1.2. The four golden signals
Latency: How long do requests take?
Traffic: How many requests?
Errors: What's failing?
Saturation: How full are resources?
2. Metrics to Monitor
2.1. Cluster-level metrics
Cluster health
- ✅ Number of nodes up/down
- ✅ Current leader
- ✅ Failover count
- ✅ Timeline number
- ✅ Cluster configuration version
Replication health
- ✅ Replication lag (bytes and time)
- ✅ WAL sender/receiver status
- ✅ Sync vs async replica count
- ✅ Replication slot usage
- ✅ WAL segment generation rate
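Replication lag in bytes is the distance between two WAL positions (LSNs). PostgreSQL computes this itself with pg_wal_lsn_diff(); the sketch below only illustrates the arithmetic, assuming the usual textual LSN form of two hexadecimal numbers separated by a slash. The function names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after it the lower 32 bits.
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica is behind the primary (clamped at zero)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


# Example: primary at 0/5000000, replica has replayed up to 0/4F00000.
print(lag_bytes("0/5000000", "0/4F00000"))  # 1048576 bytes, i.e. 1 MiB
```

On a live cluster the two inputs would typically come from pg_current_wal_lsn() on the primary and the replay_lsn column of pg_stat_replication.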
2.2. PostgreSQL metrics
Connection metrics
Database size and growth
Transaction rate
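Transaction rate is usually derived from the cumulative xact_commit and xact_rollback counters in pg_stat_database, sampled at two points in time. A minimal sketch of that calculation, assuming the caller has already read the counters; the Sample type and function name are illustrative only.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    ts: float            # Unix time of the reading, in seconds
    xact_commit: int     # cumulative commits from pg_stat_database
    xact_rollback: int   # cumulative rollbacks from pg_stat_database


def transactions_per_second(prev: Sample, curr: Sample) -> Optional[float]:
    """Average TPS between two readings, or None if the interval is unusable."""
    elapsed = curr.ts - prev.ts
    delta = (curr.xact_commit + curr.xact_rollback) - (prev.xact_commit + prev.xact_rollback)
    if elapsed <= 0 or delta < 0:
        # Non-increasing time or a counter reset (e.g. after a statistics reset).
        return None
    return delta / elapsed


print(transactions_per_second(Sample(0, 1000, 10), Sample(60, 7000, 10)))  # 100.0
```

In Prometheus the same idea is normally expressed with rate() over the exporter's counters rather than computed by hand.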
Cache hit ratio
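The cache hit ratio is commonly computed from the blks_hit and blks_read counters in pg_stat_database as hits divided by total block requests. A small sketch of the formula, with a hypothetical function name:

```python
from typing import Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Fraction of block requests served from shared buffers (0.0 to 1.0)."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no block activity yet, so the ratio is undefined
    return blks_hit / total


ratio = cache_hit_ratio(blks_hit=950, blks_read=50)
print(f"{ratio:.1%}")  # 95.0%
```

Note that blks_read counts reads that missed shared buffers; some of those may still be served from the operating system page cache.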
Index usage
Vacuum and autovacuum
Locks
Long-running queries
2.3. Patroni metrics
Via REST API (http://node:8008/metrics):
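The endpoint returns plain text in the Prometheus exposition format, so it can be inspected without Prometheus. The following is a rough sketch that parses such output into a dictionary. The sample text is hand-written for illustration, the cluster and node names in it are made up, and the exact set of metric names depends on the Patroni version.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are dropped, which is adequate for a single Patroni node where each
    metric appears once. Lines are assumed to carry no trailing timestamp.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0].strip()
        samples[name] = float(value)
    return samples


def fetch_metrics(url: str, timeout: float = 5.0) -> Dict[str, float]:
    """Fetch and parse a metrics endpoint, e.g. http://node:8008/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative sample only; real output contains more metrics.
SAMPLE = """\
# HELP patroni_postgres_running Value is 1 if Postgres is running, 0 otherwise.
# TYPE patroni_postgres_running gauge
patroni_postgres_running{scope="demo",name="node1"} 1
patroni_postgres_timeline{scope="demo",name="node1"} 3
patroni_pending_restart{scope="demo",name="node1"} 0
"""

metrics = parse_metrics(SAMPLE)
print(metrics["patroni_postgres_running"], metrics["patroni_postgres_timeline"])
```

Against a real node, fetch_metrics("http://node:8008/metrics") would return the same kind of dictionary.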
2.4. etcd metrics
Via etcd metrics endpoint (http://node:2379/metrics):
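A few etcd metrics are especially useful for a quick health judgment: etcd_server_has_leader, etcd_server_leader_changes_seen_total, and etcd_server_proposals_failed_total. The sketch below compares two consecutive scrapes, assumed to be already parsed into dictionaries. The function name and the leader-change limit are assumptions to adapt, not recommended values.

```python
from typing import Dict, List


def etcd_warnings(prev: Dict[str, float], curr: Dict[str, float],
                  max_leader_changes: int = 3) -> List[str]:
    """Return human-readable warnings derived from two consecutive scrapes."""
    warnings: List[str] = []

    # Gauge: 1 when this member sees a leader, 0 otherwise.
    if curr.get("etcd_server_has_leader") != 1:
        warnings.append("member reports no leader")

    def increase(name: str) -> float:
        # Counters only grow, so the difference is the activity since the last scrape.
        return curr.get(name, 0) - prev.get(name, 0)

    if increase("etcd_server_leader_changes_seen_total") > max_leader_changes:
        warnings.append("frequent leader changes")
    if increase("etcd_server_proposals_failed_total") > 0:
        warnings.append("failed proposals since last scrape")
    return warnings


before = {"etcd_server_has_leader": 1, "etcd_server_leader_changes_seen_total": 2,
          "etcd_server_proposals_failed_total": 0}
after = {"etcd_server_has_leader": 1, "etcd_server_leader_changes_seen_total": 2,
         "etcd_server_proposals_failed_total": 4}
print(etcd_warnings(before, after))
```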
2.5. System metrics
3. Prometheus Setup
3.1. Install Prometheus
3.2. Configure Prometheus
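One way to keep the scrape targets for every node consistent is to generate the configuration rather than edit it by hand. The sketch below renders a minimal prometheus.yml. The node names and the 15s interval are placeholders, and the ports are the usual defaults (9187 for postgres_exporter, 9100 for node_exporter, 8008 for Patroni, 2379 for etcd); adjust them if your installation differs.

```python
from typing import Iterable

# Job name -> port; every job uses the default /metrics path.
JOBS = {"postgres": 9187, "node": 9100, "patroni": 8008, "etcd": 2379}


def render_scrape_config(nodes: Iterable[str], interval: str = "15s") -> str:
    """Render a minimal Prometheus configuration with one job per exporter."""
    nodes = list(nodes)
    lines = ["global:", f"  scrape_interval: {interval}", "scrape_configs:"]
    for job, port in JOBS.items():
        targets = ", ".join(f"'{node}:{port}'" for node in nodes)
        lines += [
            f"  - job_name: '{job}'",
            "    static_configs:",
            f"      - targets: [{targets}]",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical host names for a three-node cluster.
config_text = render_scrape_config(["node1", "node2", "node3"])
print(config_text)
```

The generated text can be written to the Prometheus configuration file and validated with promtool check config before reloading.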
3.3. Create systemd service
4. Exporters Setup
4.1. postgres_exporter
Custom queries (optional):
Systemd service:
4.2. node_exporter
4.3. Patroni metrics endpoint
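No separate exporter is needed: Patroni's REST API already exposes Prometheus metrics at http://node:8008/metrics (the endpoint listed in section 2.3).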
5. Grafana Setup
5.1. Install Grafana
5.2. Add Prometheus data source
- Log in to Grafana (http://localhost:3000)
- Go to Configuration → Data Sources
- Click "Add data source"
- Select "Prometheus"
- URL: http://localhost:9090
- Click "Save & Test"
5.3. Import dashboards
PostgreSQL dashboard:
- Go to Dashboards → Import
- Enter dashboard ID: 9628 (PostgreSQL Database)
- Select Prometheus data source
- Click Import
Patroni dashboard (custom):
etcd dashboard:
- Dashboard ID: 3070 (etcd by Prometheus)
Node exporter dashboard:
- Dashboard ID: 1860 (Node Exporter Full)
6. Alerting Rules
6.1. PostgreSQL alerts
6.2. Patroni alerts
6.3. etcd alerts
7. Alertmanager Setup
7.1. Install Alertmanager
7.2. Configure Alertmanager
7.3. Start Alertmanager
8. Best Practices
✅ DO
- Monitor proactively: Don't wait for users to report issues.
- Set meaningful thresholds: Based on your workload.
- Test alerts: Ensure notifications work.
- Document runbooks: Link alerts to resolution steps.
- Retain metrics long enough: 30 days minimum, 1 year recommended.
- Use labels wisely: For filtering and grouping.
- Monitor the monitors: Alert if Prometheus/Grafana goes down.
- Regular dashboard reviews: Update as needs change.
- Track SLOs/SLIs: Define and measure service levels.
- Correlate metrics: CPU + disk + query time together.
❌ DON'T
- Don't over-alert: Alert fatigue is real.
- Don't ignore warnings: They become criticals.
- Don't let dashboards and alerts go stale: Update them as the system evolves.
- Don't expose metrics publicly: Security risk.
- Don't rely on a single monitoring system: Have a fallback.
- Don't collect everything: Focus on what matters.
- Don't ignore baselines: Know your normal.
- Don't skip testing: Test failover detection.
9. Lab Exercises
Lab 1: Setup monitoring stack
Tasks:
- Install Prometheus on monitoring server.
- Install postgres_exporter on all nodes.
- Install node_exporter on all nodes.
- Configure scrape targets.
- Verify metrics collection.
- Install Grafana.
- Add Prometheus data source.
- Import PostgreSQL dashboard.
Lab 2: Create custom dashboard
Tasks:
- Create new dashboard in Grafana.
- Add panel for replication lag.
- Add panel for connection count.
- Add panel for TPS.
- Add panel for cache hit ratio.
- Create variables for node selection.
- Save and share dashboard.
Lab 3: Configure alerting
Tasks:
- Install Alertmanager.
- Create alert rules for PostgreSQL.
- Create alert rules for Patroni.
- Configure Slack notifications.
- Test alerts by triggering conditions.
- Verify notification delivery.
Lab 4: Simulate and monitor failover
Tasks:
- Open Grafana dashboard.
- Stop primary node.
- Watch metrics during failover.
- Verify alerts triggered.
- Document timeline.
- Calculate downtime from metrics.
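For the last task, downtime can be estimated from any time series that says whether the cluster had a writable primary at each scrape. A minimal sketch, assuming the samples have been exported as (timestamp, is_up) pairs sorted by time; how is_up is derived (for example, whether any node reported itself as primary) is left to the reader.

```python
from typing import Iterable, Tuple


def downtime_seconds(samples: Iterable[Tuple[float, bool]]) -> float:
    """Sum the time between the first 'down' sample and the next 'up' sample.

    An outage that is still open at the end of the series is not counted.
    """
    total = 0.0
    down_since = None
    for ts, is_up in samples:
        if not is_up and down_since is None:
            down_since = ts              # outage starts
        elif is_up and down_since is not None:
            total += ts - down_since     # outage ends
            down_since = None
    return total


# Scrapes every 15 seconds; the primary is missing at t=30 and t=45.
series = [(0, True), (15, True), (30, False), (45, False), (60, True), (75, True)]
print(downtime_seconds(series))  # 30.0
```

The result is only as precise as the scrape interval: with 15-second scrapes, the true outage could be up to one interval shorter or longer than the computed value.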
10. Summary
Key Metrics Summary
| Category | Metric | Threshold |
|---|---|---|
| Replication | Lag bytes | < 10MB |
| Replication | Lag time | < 10s |
| Connections | Usage % | < 80% |
| Cache | Hit ratio | > 95% |
| Queries | Long-running | < 1 hour |
| Disk | Usage % | < 85% |
| CPU | Usage % | < 80% sustained |
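The thresholds in the table can be encoded as data and checked mechanically. The sketch below does this; the metric keys are hypothetical names chosen for the example, and "10MB" is interpreted as 10 MiB.

```python
import operator
from typing import Dict, List

# metric key -> (comparison that must hold, limit), taken from the table above
THRESHOLDS = {
    "replication_lag_bytes": (operator.lt, 10 * 1024 * 1024),
    "replication_lag_seconds": (operator.lt, 10),
    "connection_usage_pct": (operator.lt, 80),
    "cache_hit_ratio_pct": (operator.gt, 95),
    "longest_query_seconds": (operator.lt, 3600),
    "disk_usage_pct": (operator.lt, 85),
    "cpu_usage_pct": (operator.lt, 80),
}


def violations(observed: Dict[str, float]) -> List[str]:
    """Return the sorted names of observed metrics that break their threshold."""
    broken = []
    for name, value in observed.items():
        if name not in THRESHOLDS:
            continue  # ignore metrics the table does not cover
        holds, limit = THRESHOLDS[name]
        if not holds(value, limit):
            broken.append(name)
    return sorted(broken)


observed = {"replication_lag_seconds": 42, "cache_hit_ratio_pct": 99.1, "disk_usage_pct": 91}
print(violations(observed))  # ['disk_usage_pct', 'replication_lag_seconds']
```

The CPU threshold in the table applies to sustained usage, so a real check would compare an average over a time window rather than a single reading.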
Monitoring Stack
Next Steps
Lesson 18 will cover Performance Tuning:
- PostgreSQL configuration optimization
- Connection pooling with PgBouncer
- Load balancing with HAProxy
- Query optimization techniques
- Read replica scaling strategies