-- Essential logging settings ALTER SYSTEM SET logging_collector = on; ALTER SYSTEM SET log_directory = 'log'; ALTER SYSTEM SET log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'; ALTER SYSTEM SET log_rotation_age = '1d'; ALTER SYSTEM SET log_rotation_size = '100MB';
-- What to log ALTER SYSTEM SET log_min_duration_statement = '1000'; -- Slow queries (>1s) ALTER SYSTEM SET log_checkpoints = on; ALTER SYSTEM SET log_connections = off; -- Too verbose in production ALTER SYSTEM SET log_disconnections = off; ALTER SYSTEM SET log_lock_waits = on; ALTER SYSTEM SET log_temp_files = 0; -- Log all temp file usage ALTER SYSTEM SET log_autovacuum_min_duration = '1s';
-- Log line format ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '; ALTER SYSTEM SET log_statement = 'none'; -- 'none', 'ddl', 'mod', 'all' ALTER SYSTEM SET log_duration = off;
-- Error detail ALTER SYSTEM SET log_error_verbosity = 'default'; -- 'terse', 'default', 'verbose'
-- Reload SELECT pg_reload_conf();
1.2. Log file location
1.3. Common log patterns
Connection issues
Slow queries
Deadlocks
Checkpoint warnings
Replication lag
Out of disk space
1.4. Analyze logs with pgBadger
- Patroni Logging
2.1. Patroni log levels
2.2. View Patroni logs
2.3. Common Patroni log patterns
Successful bootstrap
Failed to acquire lock (replica)
Failover detected
DCS connection issues
pg_rewind execution
Configuration reload
2.4. Enable DEBUG logging
- etcd Logging
3.1. View etcd logs
3.2. etcd log levels
3.3. Common etcd log patterns
Cluster healthy
Leader election
Network partition
Slow operations
3.4. etcd debugging
- Common Issues and Solutions
4.1. Issue: Patroni won't start
Symptoms: systemctl status patroni shows failed
Diagnosis:
4.2. Issue: No leader in cluster
Symptoms: patronictl list shows no Leader
Diagnosis:
4.3. Issue: High replication lag
Symptoms: Replica lag > 100MB
Diagnosis:
4.4. Issue: Split-brain detected
Symptoms: Multiple nodes claim to be leader
Diagnosis:
4.5. Issue: Failover not happening
Symptoms: Primary down but no promotion
Diagnosis:
4.6. Issue: Cannot connect to PostgreSQL
Symptoms: Connection refused or timeout
Diagnosis:
4.7. Issue: pg_rewind failed
Symptoms: Node cannot rejoin after failover
Error: pg_rewind: error: could not find common ancestor
Diagnosis:
5.1. patronictl commands
5.2. etcdctl commands
5.3. PostgreSQL diagnostic queries
5.4. System diagnostic commands
- Troubleshooting Runbook
6.1. Primary node down
6.2. Replica not replicating
6.3. etcd cluster unhealthy
- Best Practices
✅ DO
- Enable appropriate logging - Balance detail vs volume
- Centralize logs - Use ELK/Grafana Loki
- Set up alerts - Proactive notification
- Regular log review - Weekly analysis
- Document issues - Build knowledge base
- Test scenarios - Practice troubleshooting
- Keep runbooks updated - Living documents
- Monitor disk space - Logs can fill disk
❌ DON'T
-
Don't enable DEBUG in production - Too verbose
-
Don't ignore warnings - They become errors
-
Don't delete logs immediately - Keep for analysis
-
Don't skip log rotation - Prevent disk full
-
Don't troubleshoot blind - Check logs first
-
Don't make changes without logs - Document actions
-
Lab Exercises
Lab 1: Log analysis
Tasks: 1. Configure PostgreSQL slow query logging 2. Generate workload with slow queries 3. Analyze logs with pgBadger 4. Identify top issues 5. Document findings
Lab 2: Simulate and debug failover
Tasks: 1. Stop primary node 2. Monitor Patroni logs during failover 3. Trace timeline of events 4. Calculate downtime from logs 5. Create timeline diagram
Lab 3: Debug replication lag
Tasks: 1. Simulate high write load on primary 2. Observe lag increase on replica 3. Use diagnostic queries to identify cause 4. Fix the issue 5. Verify lag reduced
Lab 4: Troubleshoot connection issues
Tasks: 1. Misconfigure pg_hba.conf 2. Attempt connections (will fail) 3. Use logs to identify issue 4. Fix configuration 5. Verify connections work
- Summary
Key Logging Locations
Essential Diagnostic Commands
Troubleshooting Workflow
- Identify - What's the symptom?
- Isolate - Which component is failing?
- Investigate - Check logs, metrics
- Diagnose - What's the root cause?
- Fix - Apply solution
- Verify - Confirm resolution
- Document - Record for future
Next Steps
Bài 20 sẽ cover Security Best Practices:
- SSL/TLS encryption
- Authentication methods
- Network security
- Encryption at rest
- Audit logging
- Security hardening