Lesson 19: Logging and Troubleshooting

1. PostgreSQL Logging

1.1. Logging configuration

SQL
-- Essential logging settings
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_directory = 'log';
ALTER SYSTEM SET log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log';
ALTER SYSTEM SET log_rotation_age = '1d';
ALTER SYSTEM SET log_rotation_size = '100MB';

-- What to log
ALTER SYSTEM SET log_min_duration_statement = '1000';  -- Slow queries (>1s)
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_connections = off;                -- Too verbose in production
ALTER SYSTEM SET log_disconnections = off;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0;                   -- Log all temp file usage
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';

-- Log line format
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
ALTER SYSTEM SET log_statement = 'none';  -- 'none', 'ddl', 'mod', 'all'
ALTER SYSTEM SET log_duration = off;

-- Error detail
ALTER SYSTEM SET log_error_verbosity = 'default';  -- 'terse', 'default', 'verbose'

-- Reload configuration (logging_collector itself requires a full server restart)
SELECT pg_reload_conf();
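
After the reload, it is worth confirming which values actually took effect; a minimal check via pg_settings might look like this (pending_restart marks parameters, such as logging_collector, that only apply after a full restart):

TEXT
# List the active logging-related settings and flag any that still need a restart
sudo -u postgres psql -c "SELECT name, setting, pending_restart FROM pg_settings WHERE name LIKE 'log%' ORDER BY name;"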

1.2. Log file location

TEXT
# Default location
/var/lib/postgresql/18/data/log/

# Or configured location
sudo -u postgres psql -c "SHOW log_directory;"
sudo -u postgres psql -c "SHOW data_directory;"

# View logs
sudo tail -f /var/lib/postgresql/18/data/log/postgresql-*.log

# Or via journalctl
sudo journalctl -u postgresql -f

1.3. Common log patterns

Connection issues

TEXT
LOG: connection received: host=10.0.1.50 port=54321
FATAL: password authentication failed for user "app_user"
DETAIL: Connection matched pg_hba.conf line 95: "host all all 0.0.0.0/0 md5"

→ Solution: Check password, pg_hba.conf configuration
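
If pg_hba.conf is the suspect, the pg_hba_file_rules view shows exactly which rules the server has loaded and flags lines with parse errors; a minimal check might be:

TEXT
# Inspect the loaded pg_hba rules (requires superuser); the error column flags bad lines
sudo -u postgres psql -c "SELECT line_number, type, database, user_name, address, auth_method, error FROM pg_hba_file_rules;"

# Reload after editing pg_hba.conf so changes take effect
sudo -u postgres psql -c "SELECT pg_reload_conf();"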

Slow queries

TEXT
LOG: duration: 5432.123 ms  statement: SELECT * FROM large_table WHERE ...

→ Solution: Analyze query, add indexes, optimize
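
The usual next step is to re-run the logged statement under EXPLAIN (ANALYZE, BUFFERS) to see where the time goes; the query and names below are placeholders standing in for the statement from the log:

TEXT
# Re-run the slow statement with timing and buffer details
# (mydb, large_table and created_at are placeholder names for this example)
sudo -u postgres psql -d mydb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large_table WHERE created_at > now() - interval '1 day';"

# A sequential scan over a large table often suggests a missing index, e.g.:
# CREATE INDEX CONCURRENTLY idx_large_table_created_at ON large_table (created_at);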

Deadlocks

TEXT
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890; blocked by process 12346.
Process 12346 waits for ShareLock on transaction 67891; blocked by process 12345.
HINT: See server log for query details.

→ Solution: Review transaction order, add proper locking

Checkpoint warnings

TEXT
LOG: checkpoints are occurring too frequently (24 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".

→ Solution: Increase max_wal_size
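
max_wal_size can be raised without a restart; in a Patroni-managed cluster it is usually cleaner to change it through the shared DCS configuration so every node gets the same value. A sketch, with example values:

TEXT
# Standalone PostgreSQL: raise max_wal_size (example value) and reload
sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '4GB';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"

# Patroni cluster: edit the shared configuration instead
sudo -u postgres patronictl edit-config postgres
# then under postgresql -> parameters set, for example:
#   max_wal_size: 4GB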

Replication lag

TEXT
WARNING: could not send data to WAL receiver; retrying
DETAIL: Connection to replica timed out

→ Solution: Check network, replica health

Out of disk space

TEXT
FATAL: could not write to file "base/16384/12345": No space left on device

→ Solution: Free up disk space, increase storage
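
Before deleting anything, it helps to see what is actually consuming the space: data files, pg_wal, logs, or WAL retained by an inactive replication slot. A quick survey might look like:

TEXT
# Overall usage and the biggest directories inside the data directory
df -h /var/lib/postgresql/18/data
sudo du -sh /var/lib/postgresql/18/data/* | sort -rh | head -10

# WAL retained by replication slots (an inactive slot can keep pg_wal growing)
sudo -u postgres psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"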

1.4. Analyze logs with pgBadger

TEXT
# Install pgBadger
sudo apt-get install -y pgbadger

# Or from source
wget https://github.com/darold/pgbadger/archive/v12.2.tar.gz
tar -xzf v12.2.tar.gz
cd pgbadger-12.2
perl Makefile.PL
make && sudo make install

# Generate report
pgbadger /var/lib/postgresql/18/data/log/postgresql-*.log \
  -o /tmp/pgbadger-report.html

# Open in browser
# file:///tmp/pgbadger-report.html

# Report includes:
# - Top slow queries
# - Connection statistics
# - Lock analysis
# - Checkpoint activity
# - Temporary file usage
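
For ongoing use, pgBadger can also run in incremental mode from cron, appending each day's logs to a rolling report directory; a sketch, with the output directory as an example path:

TEXT
# Incremental mode keeps per-day reports under an output directory (example path)
pgbadger -I -O /var/www/pgbadger -q /var/lib/postgresql/18/data/log/postgresql-*.log

# Example cron entry, run nightly by a user that can read the logs:
# 0 1 * * * pgbadger -I -O /var/www/pgbadger -q /var/lib/postgresql/18/data/log/postgresql-*.log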

2. Patroni Logging

2.1. Patroni log levels

TEXT
# In patroni.yml
log:
  level: INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: '%(asctime)s %(levelname)s: %(message)s'
  dateformat: '%Y-%m-%d %H:%M:%S'
  max_queue_size: 1000
  dir: /var/log/patroni
  file_num: 10
  file_size: 25000000  # 25MB

2.2. View Patroni logs

TEXT
# Via journalctl (if systemd service)
sudo journalctl -u patroni -f

# Show last 100 lines
sudo journalctl -u patroni -n 100

# Show logs since specific time
sudo journalctl -u patroni --since "1 hour ago"
sudo journalctl -u patroni --since "2024-11-25 10:00:00"

# Search for specific keywords
sudo journalctl -u patroni | grep -i "error\|warning\|fail"

# Export to file
sudo journalctl -u patroni --since today > /tmp/patroni-today.log

2.3. Common Patroni log patterns

Successful bootstrap

TEXT
2024-11-25 10:00:00 INFO: Selected new etcd server http://10.0.1.11:2379
2024-11-25 10:00:01 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-11-25 10:00:02 INFO: Lock owner: None; I am node1
2024-11-25 10:00:03 INFO: trying to bootstrap a new cluster
2024-11-25 10:00:05 INFO: Running custom bootstrap script: /usr/lib/postgresql/18/bin/initdb
2024-11-25 10:00:15 INFO: postmaster pid=12345
2024-11-25 10:00:16 INFO: initialized a new cluster
2024-11-25 10:00:17 INFO: acquired session lock as a leader ✅

Failed to acquire lock (replica)

TEXT
2024-11-25 10:05:00 INFO: Lock owner: node1; I am node2
2024-11-25 10:05:01 INFO: does not have lock
2024-11-25 10:05:02 INFO: starting as a secondary
2024-11-25 10:05:05 INFO: replica has been created using basebackup

Failover detected

TEXT
2024-11-25 11:30:00 WARNING: Request failed to node1:8008: connection refused
2024-11-25 11:30:10 INFO: promoted self to leader by acquiring session lock ✅
2024-11-25 11:30:11 INFO: cleared rewind state after becoming the leader
2024-11-25 11:30:12 INFO: running post_promote script

DCS connection issues

TEXT
2024-11-25 12:00:00 ERROR: Failed to get list of machines from etcd
2024-11-25 12:00:01 WARNING: Could not activate Linux watchdog device: No such file or directory
2024-11-25 12:00:02 ERROR: watching on etcd failed
2024-11-25 12:00:03 INFO: trying to connect to etcd server http://10.0.1.11:2379

→ Solution: Check etcd cluster health
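
When Patroni logs DCS errors like these, checking etcd directly from the same host helps separate an etcd problem from a network problem; a minimal check (the endpoints are the example addresses used above):

TEXT
# Check every etcd endpoint from the node that logged the error
etcdctl --endpoints=http://10.0.1.11:2379,http://10.0.1.12:2379,http://10.0.1.13:2379 endpoint health

# Confirm the etcd client port is reachable at all
nc -zv 10.0.1.11 2379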

pg_rewind execution

TEXT
2024-11-25 13:00:00 INFO: running pg_rewind from node2
2024-11-25 13:00:05 INFO: pg_rewind: servers diverged at WAL position 0/5000000 on timeline 2
2024-11-25 13:00:10 INFO: pg_rewind: rewinding from last common checkpoint at 0/4000000
2024-11-25 13:00:45 INFO: pg_rewind: Done!
2024-11-25 13:00:46 INFO: starting PostgreSQL

Configuration reload

TEXT
2024-11-25 14:00:00 INFO: Reloading PostgreSQL configuration.
2024-11-25 14:00:01 INFO: Reload configuration: max_connections changed from 100 to 200
2024-11-25 14:00:02 INFO: PostgreSQL configuration reload succeeded

2.4. Enable DEBUG logging

TEXT
# Temporarily enable debug logging
sudo -u postgres patronictl edit-config postgres

# Add:
log:
  level: DEBUG

# Or set via environment variable
sudo systemctl edit patroni

# Add:
[Service]
Environment="PATRONI_LOGLEVEL=DEBUG"

sudo systemctl daemon-reload
sudo systemctl restart patroni

# Warning: DEBUG is VERY verbose!
# Use only for troubleshooting, then disable

3. etcd Logging

3.1. View etcd logs

TEXT
# Via journalctl
sudo journalctl -u etcd -f

# Last 100 entries
sudo journalctl -u etcd -n 100

# Errors only
sudo journalctl -u etcd -p err

# Specific time range
sudo journalctl -u etcd --since "10:00" --until "11:00"

3.2. etcd log levels

TEXT
# Configure in etcd service file
# /etc/systemd/system/etcd.service

ExecStart=/usr/local/bin/etcd \
  --log-level info \
  --logger zap \
  --log-outputs stderr

3.3. Common etcd log patterns

Cluster healthy

TEXT
2024-11-25 10:00:00 INFO: raft.node: elected leader at term 5
2024-11-25 10:00:01 INFO: rafthttp: established a TCP streaming connection with peer node2
2024-11-25 10:00:02 INFO: etcdserver: published member info to cluster

Leader election

TEXT
2024-11-25 11:00:00 WARN: rafthttp: lost TCP streaming connection with peer node1
2024-11-25 11:00:01 INFO: raft.node: node3 elected leader at term 6
2024-11-25 11:00:02 INFO: rafthttp: peer node1 became active

Network partition

TEXT
2024-11-25 12:00:00 WARN: rafthttp: health check for peer node2 could not connect: dial tcp: i/o timeout
2024-11-25 12:00:05 ERROR: rafthttp: failed to read from node2 on stream: EOF
2024-11-25 12:00:10 WARN: etcdserver: failed to reach quorum

Slow operations

TEXT
2024-11-25 13:00:00 WARN: etcdserver: apply request took too long [1.234s] expected less than [100ms]
2024-11-25 13:00:01 WARN: etcdserver: read-only range request took too long to execute [523ms]

→ Solution: Check disk I/O, CPU load
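
etcd includes a built-in performance check that helps confirm whether the backing storage and CPU can keep up; pairing it with iostat shows whether slow disk is the culprit:

TEXT
# Built-in etcd performance check against the local member
etcdctl check perf

# Watch disk latency while etcd is busy; high await/%util points at slow storage
iostat -x 1 10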

3.4. etcd debugging

TEXT
# Check cluster health
etcdctl endpoint health --cluster

# Check cluster status
etcdctl endpoint status --cluster --write-out=table

# Check member list
etcdctl member list --write-out=table

# Check alarms
etcdctl alarm list

# Compact history (if too large)
etcdctl compact $(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')

# Defrag
etcdctl defrag --cluster

4. Common Issues and Solutions

4.1. Issue: Patroni won't start

Symptoms: systemctl status patroni shows failed

Diagnosis:

TEXT
# Check logs
sudo journalctl -u patroni -n 50

# Common causes:

# 1. Configuration error
# Error: "yaml.scanner.ScannerError: mapping values are not allowed here"
→ Fix YAML syntax in patroni.yml

# 2. PostgreSQL already running
# Error: "postmaster.pid already exists"
sudo systemctl stop postgresql
rm /var/lib/postgresql/18/data/postmaster.pid

# 3. DCS not reachable
# Error: "Failed to get list of machines from etcd"
→ Check etcd status: systemctl status etcd

# 4. Permission denied
# Error: "could not open file postgresql.conf: Permission denied"
sudo chown -R postgres:postgres /var/lib/postgresql/18/data

4.2. Issue: No leader in cluster

Symptoms: patronictl list shows no Leader

Diagnosis:

TEXT
# Check all Patroni instances
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "sudo systemctl status patroni"
done

# Check DCS
etcdctl endpoint health --cluster

# Check logs for election issues
sudo journalctl -u patroni | grep -i "leader\|election"

# Common causes:

# 1. DCS has no quorum
→ Ensure at least 2 of 3 etcd nodes running

# 2. All nodes tagged nofailover
→ Check tags in patroni.yml or DCS

# 3. synchronous_mode_strict with no sync replica
→ Disable strict mode or fix replication
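
To confirm whether cause 2 applies, the tags are defined in each node's local Patroni configuration file; a quick sweep across the nodes (the config path is an assumption, adjust to your layout):

TEXT
# Look for nofailover tags on every node
# (/etc/patroni/patroni.yml is an assumed location for the config file)
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "grep -A5 '^tags:' /etc/patroni/patroni.yml"
done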

4.3. Issue: High replication lag

Symptoms: Replica lag > 100MB

Diagnosis:

SQL
-- Check lag
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- Check replica activity
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;

-- Common causes:

-- 1. Long-running queries on replica
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 min';
-- Kill if needed: SELECT pg_terminate_backend(pid);

-- 2. Network bandwidth
-- Check with iftop or netstat

-- 3. Disk I/O on replica
-- Check with iostat -x 1

-- 4. max_wal_senders limit
SHOW max_wal_senders;  -- Increase if needed

-- 5. wal_keep_size too small
SHOW wal_keep_size;  -- Increase to prevent WAL deletion
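
For causes 4 and 5, change these parameters through Patroni's shared configuration rather than ALTER SYSTEM, so all members stay consistent; a sketch with example values:

TEXT
# Edit the cluster-wide configuration stored in the DCS
sudo -u postgres patronictl edit-config postgres

# Under postgresql -> parameters, add or adjust (example values):
#   max_wal_senders: 10
#   wal_keep_size: 2GB

# Members that need a restart for the change show up as pending restart
patronictl list postgres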

4.4. Issue: Split-brain detected

Symptoms: Multiple nodes claim to be leader

Diagnosis:

TEXT
# Check cluster status
patronictl list postgres

# Check each node's view
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:8008/patroni | jq '.role'
done

# CRITICAL: Stop all Patroni immediately if split-brain confirmed
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop patroni"
done

# Recovery procedure: See Lesson 15

4.5. Issue: Failover not happening

Symptoms: Primary down but no promotion

Diagnosis:

TEXT
# Check Patroni status
patronictl list postgres

# Check logs
sudo journalctl -u patroni | grep -i "failover\|promote"

# Common causes:

# 1. All replicas have nofailover tag
patronictl show-config postgres | grep nofailover

# 2. Replication lag too high
patronictl show-config postgres | grep maximum_lag_on_failover

# 3. DCS connection lost on all replicas
etcdctl endpoint health --cluster

# 4. synchronous_mode_strict enabled with no sync replica
patronictl show-config postgres | grep synchronous_mode

4.6. Issue: Cannot connect to PostgreSQL

Symptoms: Connection refused or timeout

Diagnosis:

TEXT
# 1. Check PostgreSQL running
sudo systemctl status postgresql
ps aux | grep postgres

# 2. Check listening port
sudo netstat -tuln | grep 5432
sudo ss -tuln | grep 5432

# 3. Test connection locally
psql -h localhost -U postgres -c "SELECT 1"

# 4. Check pg_hba.conf
sudo cat /var/lib/postgresql/18/data/pg_hba.conf | grep -v "^#"

# 5. Check postgresql.conf
sudo -u postgres psql -c "SHOW listen_addresses;"
sudo -u postgres psql -c "SHOW port;"

# 6. Check firewall
sudo iptables -L -n | grep 5432
sudo ufw status

# 7. Test from remote
telnet 10.0.1.11 5432
nc -zv 10.0.1.11 5432

4.7. Issue: pg_rewind failed

Symptoms: Node cannot rejoin after failover

Error: pg_rewind: error: could not find common ancestor

Diagnosis:

TEXT
# Check logs
sudo journalctl -u patroni | grep pg_rewind

# Common causes:

# 1. wal_log_hints not enabled
sudo -u postgres psql -c "SHOW wal_log_hints;"
# If off, enable and restart

# 2. Timeline too diverged
# Solution: Full reinit
patronictl reinit postgres node1

# 3. Data directory corrupt
# Solution: Remove and rebuild
sudo systemctl stop patroni
sudo rm -rf /var/lib/postgresql/18/data
sudo systemctl start patroni  # Patroni will rebuild

5. Diagnostic Commands

5.1. patronictl commands

TEXT
# List cluster
patronictl list postgres

# Show configuration
patronictl show-config postgres

# Edit configuration
patronictl edit-config postgres

# Restart node
patronictl restart postgres node1

# Reinitialize node
patronictl reinit postgres node1

# Reload configuration
patronictl reload postgres node1

# Pause/resume cluster
patronictl pause postgres
patronictl resume postgres

# History
patronictl history postgres

5.2. etcdctl commands

TEXT
# Cluster health
etcdctl endpoint health --cluster

# Cluster status
etcdctl endpoint status --cluster --write-out=table

# Member list
etcdctl member list --write-out=table

# Get Patroni data
etcdctl get /service/postgres/ --prefix --keys-only
etcdctl get /service/postgres/leader

# Watch for changes
etcdctl watch /service/postgres/ --prefix

# Delete key (dangerous!)
etcdctl del /service/postgres/leader  # Force re-election

5.3. PostgreSQL diagnostic queries

SQL
-- Current activity
SELECT * FROM pg_stat_activity WHERE state != 'idle';

-- Blocking queries
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Replication status
SELECT * FROM pg_stat_replication;

-- WAL status
SELECT * FROM pg_stat_wal_receiver;

-- Database size
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Table bloat
SELECT schemaname, tablename, 
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

5.4. System diagnostic commands

TEXT
# CPU usage
top -bn1 | head -20
htop

# Memory
free -h
ps aux --sort=-%mem | head -20

# Disk I/O
iostat -x 1 10
iotop

# Disk space
df -h
du -sh /var/lib/postgresql/18/data/*

# Network connections
netstat -tuln | grep 5432
ss -tuln | grep 5432

# Network traffic
iftop -i eth0
nethogs

# Check for OOM kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"

6. Troubleshooting Runbook

6.1. Primary node down

TEXT
# 1. Verify primary is down
patronictl list postgres

# 2. Check if automatic failover occurred
# Timeline should increment, new leader elected

# 3. If no failover, check why:
# - DCS healthy?
etcdctl endpoint health --cluster

# - Replicas eligible?
patronictl show-config postgres | grep nofailover

# - Replication lag acceptable?
patronictl show-config postgres | grep maximum_lag_on_failover

# 4. Check old primary logs
sudo journalctl -u patroni -n 200 | grep -i "error\|fatal"

# 5. Attempt manual intervention if needed
patronictl failover postgres --candidate node2

6.2. Replica not replicating

TEXT
# 1. Check replica status
patronictl list postgres

# 2. On replica, check replication status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"

# 3. Check if timeline matches
sudo -u postgres psql -c "SELECT timeline_id FROM pg_control_checkpoint();"

# 4. Check primary for replication slot
sudo -u postgres psql -h primary -c "SELECT * FROM pg_replication_slots;"

# 5. Check network connectivity
ping primary
telnet primary 5432

# 6. Restart replication
sudo systemctl restart patroni

# 7. If still failing, reinitialize
patronictl reinit postgres replica-node

6.3. etcd cluster unhealthy

TEXT
# 1. Check cluster health
etcdctl endpoint health --cluster

# 2. Check cluster status
etcdctl endpoint status --cluster --write-out=table

# 3. Check logs
sudo journalctl -u etcd -n 100

# 4. If quorum lost (< 2 of 3 healthy):
# - Fix or restart unhealthy members
sudo systemctl restart etcd

# 5. If member corrupt:
# - Remove and re-add member
etcdctl member remove <member-id>
etcdctl member add node3 --peer-urls=http://10.0.1.13:2380
# Then restart etcd on node3

# 6. Last resort: Rebuild etcd cluster
# (Requires stopping all services)

7. Best Practices

✅ DO

  1. Enable appropriate logging - Balance detail vs volume
  2. Centralize logs - Use ELK/Grafana Loki
  3. Set up alerts - Proactive notification
  4. Regular log review - Weekly analysis
  5. Document issues - Build knowledge base
  6. Test scenarios - Practice troubleshooting
  7. Keep runbooks updated - Living documents
  8. Monitor disk space - Logs can fill disk

❌ DON'T

  1. Don't enable DEBUG in production - Too verbose

  2. Don't ignore warnings - They become errors

  3. Don't delete logs immediately - Keep for analysis

  4. Don't skip log rotation - Prevent disk full

  5. Don't troubleshoot blind - Check logs first

  6. Don't make changes without logs - Document actions

8. Lab Exercises


Lab 1: Log analysis

Tasks:

  1. Configure PostgreSQL slow query logging
  2. Generate a workload with slow queries (see the sketch below)
  3. Analyze the logs with pgBadger
  4. Identify the top issues
  5. Document findings
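
One way to produce the workload for step 2 is pgbench plus a deliberately slow statement that crosses log_min_duration_statement; a minimal sketch, assuming a scratch database named labdb:

TEXT
# Create a scratch database and run a mixed pgbench workload (labdb is an assumed name)
sudo -u postgres createdb labdb
sudo -u postgres pgbench -i -s 50 labdb
sudo -u postgres pgbench -c 10 -T 300 labdb

# Guarantee at least one entry above the 1s threshold
sudo -u postgres psql -d labdb -c "SELECT pg_sleep(2);"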

Lab 2: Simulate and debug failover

Tasks:

  1. Stop the primary node
  2. Monitor the Patroni logs during the failover
  3. Trace the timeline of events
  4. Calculate the downtime from the logs
  5. Create a timeline diagram

Lab 3: Debug replication lag

Tasks:

  1. Simulate a high write load on the primary
  2. Observe the lag increase on the replica
  3. Use the diagnostic queries to identify the cause
  4. Fix the issue
  5. Verify the lag is reduced

Lab 4: Troubleshoot connection issues

Tasks:

  1. Misconfigure pg_hba.conf
  2. Attempt connections (they will fail)
  3. Use the logs to identify the issue
  4. Fix the configuration
  5. Verify connections work

9. Summary

Key Logging Locations

TEXT
PostgreSQL: /var/lib/postgresql/18/data/log/
Patroni: journalctl -u patroni
etcd: journalctl -u etcd
System: /var/log/syslog, dmesg

Essential Diagnostic Commands

TEXT
# Cluster status
patronictl list postgres

# PostgreSQL connection
psql -h localhost -U postgres

# Replication status
SELECT * FROM pg_stat_replication;

# etcd health
etcdctl endpoint health --cluster

# Logs
sudo journalctl -u patroni -f
sudo journalctl -u etcd -f

Troubleshooting Workflow

  1. Identify - What's the symptom?
  2. Isolate - Which component is failing?
  3. Investigate - Check logs, metrics
  4. Diagnose - What's the root cause?
  5. Fix - Apply solution
  6. Verify - Confirm resolution
  7. Document - Record for future

Next Steps

Lesson 20 will cover Security Best Practices:

  • SSL/TLS encryption
  • Authentication methods
  • Network security
  • Encryption at rest
  • Audit logging
  • Security hardening