Lesson 19: Logging and Troubleshooting

1. PostgreSQL Logging

1.1. Logging configuration

SQL
-- Essential logging settings
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_directory = 'log';
ALTER SYSTEM SET log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log';
ALTER SYSTEM SET log_rotation_age = '1d';
ALTER SYSTEM SET log_rotation_size = '100MB';

-- What to log
ALTER SYSTEM SET log_min_duration_statement = '1000';  -- Slow queries (>1s)
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_connections = off;                -- Too verbose in production
ALTER SYSTEM SET log_disconnections = off;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0;                   -- Log all temp file usage
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';

-- Log line format
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
ALTER SYSTEM SET log_statement = 'none';  -- 'none', 'ddl', 'mod', 'all'
ALTER SYSTEM SET log_duration = off;

-- Error detail
ALTER SYSTEM SET log_error_verbosity = 'default';  -- 'terse', 'default', 'verbose'

-- Reload configuration (logging_collector itself requires a full server restart)
SELECT pg_reload_conf();
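
After the reload, it is worth confirming which values actually took effect; a minimal check via pg_settings might look like this (pending_restart marks parameters, such as logging_collector, that only apply after a full restart):

TEXT
# List the active logging-related settings and flag any that still need a restart
sudo -u postgres psql -c "SELECT name, setting, pending_restart FROM pg_settings WHERE name LIKE 'log%' ORDER BY name;"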

1.2. Log file location

TEXT
# Default location
/var/lib/postgresql/18/data/log/

# Or configured location
sudo -u postgres psql -c "SHOW log_directory;"
sudo -u postgres psql -c "SHOW data_directory;"

# View logs
sudo tail -f /var/lib/postgresql/18/data/log/postgresql-*.log

# Or via journalctl
sudo journalctl -u postgresql -f

1.3. Common log patterns

Connection issues

TEXT
LOG: connection received: host=10.0.1.50 port=54321
FATAL: password authentication failed for user "app_user"
DETAIL: Connection matched pg_hba.conf line 95: "host all all 0.0.0.0/0 md5"

→ Solution: Check password, pg_hba.conf configuration
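
If pg_hba.conf is the suspect, the pg_hba_file_rules view shows exactly which rules the server has loaded and flags lines with parse errors; a minimal check might be:

TEXT
# Inspect the loaded pg_hba rules (requires superuser); the error column flags bad lines
sudo -u postgres psql -c "SELECT line_number, type, database, user_name, address, auth_method, error FROM pg_hba_file_rules;"

# Reload after editing pg_hba.conf so changes take effect
sudo -u postgres psql -c "SELECT pg_reload_conf();"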

Slow queries

TEXT
LOG: duration: 5432.123 ms  statement: SELECT * FROM large_table WHERE ...

→ Solution: Analyze query, add indexes, optimize
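
The usual next step is to re-run the logged statement under EXPLAIN (ANALYZE, BUFFERS) to see where the time goes; the query and names below are placeholders standing in for the statement from the log:

TEXT
# Re-run the slow statement with timing and buffer details
# (mydb, large_table and created_at are placeholder names for this example)
sudo -u postgres psql -d mydb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large_table WHERE created_at > now() - interval '1 day';"

# A sequential scan over a large table often suggests a missing index, e.g.:
# CREATE INDEX CONCURRENTLY idx_large_table_created_at ON large_table (created_at);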

Deadlocks

TEXT
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890; blocked by process 12346.
Process 12346 waits for ShareLock on transaction 67891; blocked by process 12345.
HINT: See server log for query details.

→ Solution: Review transaction order, add proper locking

Checkpoint warnings

TEXT
LOG: checkpoints are occurring too frequently (24 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".

→ Solution: Increase max_wal_size
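
max_wal_size can be raised without a restart; in a Patroni-managed cluster it is usually cleaner to change it through the shared DCS configuration so every node gets the same value. A sketch, with example values:

TEXT
# Standalone PostgreSQL: raise max_wal_size (example value) and reload
sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '4GB';"
sudo -u postgres psql -c "SELECT pg_reload_conf();"

# Patroni cluster: edit the shared configuration instead
sudo -u postgres patronictl edit-config postgres
# then under postgresql -> parameters set, for example:
#   max_wal_size: 4GB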

Replication lag

TEXT
WARNING: could not send data to WAL receiver; retrying
DETAIL: Connection to replica timed out

→ Solution: Check network, replica health

Out of disk space

TEXT
FATAL: could not write to file "base/16384/12345": No space left on device

→ Solution: Free up disk space, increase storage
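
Before deleting anything, it helps to see what is actually consuming the space: data files, pg_wal, logs, or WAL retained by an inactive replication slot. A quick survey might look like:

TEXT
# Overall usage and the biggest directories inside the data directory
df -h /var/lib/postgresql/18/data
sudo du -sh /var/lib/postgresql/18/data/* | sort -rh | head -10

# WAL retained by replication slots (an inactive slot can keep pg_wal growing)
sudo -u postgres psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"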

1.4. Analyze logs with pgBadger

TEXT
# Install pgBadger
sudo apt-get install -y pgbadger

# Or from source
wget https://github.com/darold/pgbadger/archive/v12.2.tar.gz
tar -xzf v12.2.tar.gz
cd pgbadger-12.2
perl Makefile.PL
make && sudo make install

# Generate report
pgbadger /var/lib/postgresql/18/data/log/postgresql-*.log \
  -o /tmp/pgbadger-report.html

# Open in browser
# file:///tmp/pgbadger-report.html

# Report includes:
# - Top slow queries
# - Connection statistics
# - Lock analysis
# - Checkpoint activity
# - Temporary file usage
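
For ongoing use, pgBadger can also run in incremental mode from cron, appending each day's logs to a rolling report directory; a sketch, with the output directory as an example path:

TEXT
# Incremental mode keeps per-day reports under an output directory (example path)
pgbadger -I -O /var/www/pgbadger -q /var/lib/postgresql/18/data/log/postgresql-*.log

# Example cron entry, run nightly by a user that can read the logs:
# 0 1 * * * pgbadger -I -O /var/www/pgbadger -q /var/lib/postgresql/18/data/log/postgresql-*.log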

2. Patroni Logging

2.1. Patroni log levels

TEXT
# In patroni.yml
log:
  level: INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: '%(asctime)s %(levelname)s: %(message)s'
  dateformat: '%Y-%m-%d %H:%M:%S'
  max_queue_size: 1000
  dir: /var/log/patroni
  file_num: 10
  file_size: 25000000  # 25MB

2.2. View Patroni logs

TEXT
# Via journalctl (if systemd service)
sudo journalctl -u patroni -f

# Show last 100 lines
sudo journalctl -u patroni -n 100

# Show logs since specific time
sudo journalctl -u patroni --since "1 hour ago"
sudo journalctl -u patroni --since "2024-11-25 10:00:00"

# Search for specific keywords
sudo journalctl -u patroni | grep -i "error\|warning\|fail"

# Export to file
sudo journalctl -u patroni --since today > /tmp/patroni-today.log

2.3. Common Patroni log patterns

Successful bootstrap

TEXT
2024-11-25 10:00:00 INFO: Selected new etcd server http://10.0.1.11:2379
2024-11-25 10:00:01 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-11-25 10:00:02 INFO: Lock owner: None; I am node1
2024-11-25 10:00:03 INFO: trying to bootstrap a new cluster
2024-11-25 10:00:05 INFO: Running custom bootstrap script: /usr/lib/postgresql/18/bin/initdb
2024-11-25 10:00:15 INFO: postmaster pid=12345
2024-11-25 10:00:16 INFO: initialized a new cluster
2024-11-25 10:00:17 INFO: acquired session lock as a leader ✅

Failed to acquire lock (replica)

TEXT
2024-11-25 10:05:00 INFO: Lock owner: node1; I am node2
2024-11-25 10:05:01 INFO: does not have lock
2024-11-25 10:05:02 INFO: starting as a secondary
2024-11-25 10:05:05 INFO: replica has been created using basebackup

Failover detected

TEXT
2024-11-25 11:30:00 WARNING: Request failed to node1:8008: connection refused
2024-11-25 11:30:10 INFO: promoted self to leader by acquiring session lock ✅
2024-11-25 11:30:11 INFO: cleared rewind state after becoming the leader
2024-11-25 11:30:12 INFO: running post_promote script

DCS connection issues

TEXT
2024-11-25 12:00:00 ERROR: Failed to get list of machines from etcd
2024-11-25 12:00:01 WARNING: Could not activate Linux watchdog device: No such file or directory
2024-11-25 12:00:02 ERROR: watching on etcd failed
2024-11-25 12:00:03 INFO: trying to connect to etcd server http://10.0.1.11:2379

→ Solution: Check etcd cluster health
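
When Patroni logs DCS errors like these, checking etcd directly from the same host helps separate an etcd problem from a network problem; a minimal check (the endpoints are the example addresses used above):

TEXT
# Check every etcd endpoint from the node that logged the error
etcdctl --endpoints=http://10.0.1.11:2379,http://10.0.1.12:2379,http://10.0.1.13:2379 endpoint health

# Confirm the etcd client port is reachable at all
nc -zv 10.0.1.11 2379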

pg_rewind execution

TEXT
2024-11-25 13:00:00 INFO: running pg_rewind from node2
2024-11-25 13:00:05 INFO: pg_rewind: servers diverged at WAL position 0/5000000 on timeline 2
2024-11-25 13:00:10 INFO: pg_rewind: rewinding from last common checkpoint at 0/4000000
2024-11-25 13:00:45 INFO: pg_rewind: Done!
2024-11-25 13:00:46 INFO: starting PostgreSQL

Configuration reload

TEXT
2024-11-25 14:00:00 INFO: Reloading PostgreSQL configuration.
2024-11-25 14:00:01 INFO: Reload configuration: max_connections changed from 100 to 200
2024-11-25 14:00:02 INFO: PostgreSQL configuration reload succeeded

2.4. Enable DEBUG logging

TEXT
# Temporarily enable debug logging
sudo -u postgres patronictl edit-config postgres

# Add:
log:
  level: DEBUG

# Or set via environment variable
sudo systemctl edit patroni

# Add:
[Service]
Environment="PATRONI_LOGLEVEL=DEBUG"

sudo systemctl daemon-reload
sudo systemctl restart patroni

# Warning: DEBUG is VERY verbose!
# Use only for troubleshooting, then disable

3. etcd Logging

3.1. View etcd logs

TEXT
# Via journalctl
sudo journalctl -u etcd -f

# Last 100 entries
sudo journalctl -u etcd -n 100

# Errors only
sudo journalctl -u etcd -p err

# Specific time range
sudo journalctl -u etcd --since "10:00" --until "11:00"

3.2. etcd log levels

TEXT
# Configure in etcd service file
# /etc/systemd/system/etcd.service

ExecStart=/usr/local/bin/etcd \
  --log-level info \
  --logger zap \
  --log-outputs stderr

3.3. Common etcd log patterns

Cluster healthy

TEXT
2024-11-25 10:00:00 INFO: raft.node: elected leader at term 5
2024-11-25 10:00:01 INFO: rafthttp: established a TCP streaming connection with peer node2
2024-11-25 10:00:02 INFO: etcdserver: published member info to cluster

Leader election

TEXT
2024-11-25 11:00:00 WARN: rafthttp: lost TCP streaming connection with peer node1
2024-11-25 11:00:01 INFO: raft.node: node3 elected leader at term 6
2024-11-25 11:00:02 INFO: rafthttp: peer node1 became active

Network partition

TEXT
2024-11-25 12:00:00 WARN: rafthttp: health check for peer node2 could not connect: dial tcp: i/o timeout
2024-11-25 12:00:05 ERROR: rafthttp: failed to read from node2 on stream: EOF
2024-11-25 12:00:10 WARN: etcdserver: failed to reach quorum

Slow operations

TEXT
2024-11-25 13:00:00 WARN: etcdserver: apply request took too long [1.234s] expected less than [100ms]
2024-11-25 13:00:01 WARN: etcdserver: read-only range request took too long to execute [523ms]

→ Solution: Check disk I/O, CPU load
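
etcd includes a built-in performance check that helps confirm whether the backing storage and CPU can keep up; pairing it with iostat shows whether slow disk is the culprit:

TEXT
# Built-in etcd performance check against the local member
etcdctl check perf

# Watch disk latency while etcd is busy; high await/%util points at slow storage
iostat -x 1 10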

3.4. etcd debugging

TEXT
# Check cluster health
etcdctl endpoint health --cluster

# Check cluster status
etcdctl endpoint status --cluster --write-out=table

# Check member list
etcdctl member list --write-out=table

# Check alarms
etcdctl alarm list

# Compact history (if too large)
etcdctl compact $(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')

# Defrag
etcdctl defrag --cluster

4. Common Issues and Solutions

4.1. Issue: Patroni won't start

Symptoms: systemctl status patroni shows failed

Diagnosis:

TEXT
# Check logs
sudo journalctl -u patroni -n 50

# Common causes:

# 1. Configuration error
# Error: "yaml.scanner.ScannerError: mapping values are not allowed here"
→ Fix YAML syntax in patroni.yml

# 2. PostgreSQL already running
# Error: "postmaster.pid already exists"
sudo systemctl stop postgresql
rm /var/lib/postgresql/18/data/postmaster.pid

# 3. DCS not reachable
# Error: "Failed to get list of machines from etcd"
→ Check etcd status: systemctl status etcd

# 4. Permission denied
# Error: "could not open file postgresql.conf: Permission denied"
sudo chown -R postgres:postgres /var/lib/postgresql/18/data

4.2. Issue: No leader in cluster

Symptoms: patronictl list shows no Leader

Diagnosis:

TEXT
# Check all Patroni instances
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "sudo systemctl status patroni"
done

# Check DCS
etcdctl endpoint health --cluster

# Check logs for election issues
sudo journalctl -u patroni | grep -i "leader\|election"

# Common causes:

# 1. DCS has no quorum
→ Ensure at least 2 of 3 etcd nodes running

# 2. All nodes tagged nofailover
→ Check tags in patroni.yml or DCS

# 3. synchronous_mode_strict with no sync replica
→ Disable strict mode or fix replication
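
To confirm whether cause 2 applies, the tags are defined in each node's local Patroni configuration file; a quick sweep across the nodes (the config path is an assumption, adjust to your layout):

TEXT
# Look for nofailover tags on every node
# (/etc/patroni/patroni.yml is an assumed location for the config file)
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "grep -A5 '^tags:' /etc/patroni/patroni.yml"
done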

4.3. Issue: High replication lag

Symptoms: Replica lag > 100MB

Diagnosis:

SQL
-- Check lag
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- Check replica activity
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;

-- Common causes:

-- 1. Long-running queries on replica
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 min';
-- Kill if needed: SELECT pg_terminate_backend(pid);

-- 2. Network bandwidth
-- Check with iftop or netstat

-- 3. Disk I/O on replica
-- Check with iostat -x 1

-- 4. max_wal_senders limit
SHOW max_wal_senders;  -- Increase if needed

-- 5. wal_keep_size too small
SHOW wal_keep_size;  -- Increase to prevent WAL deletion
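
For causes 4 and 5, change these parameters through Patroni's shared configuration rather than ALTER SYSTEM, so all members stay consistent; a sketch with example values:

TEXT
# Edit the cluster-wide configuration stored in the DCS
sudo -u postgres patronictl edit-config postgres

# Under postgresql -> parameters, add or adjust (example values):
#   max_wal_senders: 10
#   wal_keep_size: 2GB

# Members that need a restart for the change show up as pending restart
patronictl list postgres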

4.4. Issue: Split-brain detected

Symptoms: Multiple nodes claim to be leader

Diagnosis:

TEXT
# Check cluster status
patronictl list postgres

# Check each node's view
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:8008/patroni | jq '.role'
done

# CRITICAL: Stop all Patroni immediately if split-brain confirmed
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop patroni"
done

# Recovery procedure: See Lesson 15

4.5. Issue: Failover not happening

Symptoms: Primary down but no promotion

Diagnosis:

TEXT
# Check Patroni status
patronictl list postgres

# Check logs
sudo journalctl -u patroni | grep -i "failover\|promote"

# Common causes:

# 1. All replicas have nofailover tag
patronictl show-config postgres | grep nofailover

# 2. Replication lag too high
patronictl show-config postgres | grep maximum_lag_on_failover

# 3. DCS connection lost on all replicas
etcdctl endpoint health --cluster

# 4. synchronous_mode_strict enabled with no sync replica
patronictl show-config postgres | grep synchronous_mode

4.6. Issue: Cannot connect to PostgreSQL

Symptoms: Connection refused or timeout

Diagnosis:

TEXT
# 1. Check PostgreSQL running
sudo systemctl status postgresql
ps aux | grep postgres

# 2. Check listening port
sudo netstat -tuln | grep 5432
sudo ss -tuln | grep 5432

# 3. Test connection locally
psql -h localhost -U postgres -c "SELECT 1"

# 4. Check pg_hba.conf
sudo cat /var/lib/postgresql/18/data/pg_hba.conf | grep -v "^#"

# 5. Check postgresql.conf
sudo -u postgres psql -c "SHOW listen_addresses;"
sudo -u postgres psql -c "SHOW port;"

# 6. Check firewall
sudo iptables -L -n | grep 5432
sudo ufw status

# 7. Test from remote
telnet 10.0.1.11 5432
nc -zv 10.0.1.11 5432

4.7. Issue: pg_rewind failed

Symptoms: Node cannot rejoin after failover

Error: pg_rewind: error: could not find common ancestor

Diagnosis:

TEXT
# Check logs
sudo journalctl -u patroni | grep pg_rewind

# Common causes:

# 1. wal_log_hints not enabled
sudo -u postgres psql -c "SHOW wal_log_hints;"
# If off, enable and restart

# 2. Timeline too diverged
# Solution: Full reinit
patronictl reinit postgres node1

# 3. Data directory corrupt
# Solution: Remove and rebuild
sudo systemctl stop patroni
sudo rm -rf /var/lib/postgresql/18/data
sudo systemctl start patroni  # Patroni will rebuild

5. Diagnostic Commands

5.1. patronictl commands

TEXT
# List cluster
patronictl list postgres

# Show configuration
patronictl show-config postgres

# Edit configuration
patronictl edit-config postgres

# Restart node
patronictl restart postgres node1

# Reinitialize node
patronictl reinit postgres node1

# Reload configuration
patronictl reload postgres node1

# Pause/resume cluster
patronictl pause postgres
patronictl resume postgres

# History
patronictl history postgres

5.2. etcdctl commands

TEXT
# Cluster health
etcdctl endpoint health --cluster

# Cluster status
etcdctl endpoint status --cluster --write-out=table

# Member list
etcdctl member list --write-out=table

# Get Patroni data
etcdctl get /service/postgres/ --prefix --keys-only
etcdctl get /service/postgres/leader

# Watch for changes
etcdctl watch /service/postgres/ --prefix

# Delete key (dangerous!)
etcdctl del /service/postgres/leader  # Force re-election

5.3. PostgreSQL diagnostic queries

SQL
-- Current activity
SELECT * FROM pg_stat_activity WHERE state != 'idle';

-- Blocking queries
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Replication status
SELECT * FROM pg_stat_replication;

-- WAL status
SELECT * FROM pg_stat_wal_receiver;

-- Database size
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Table bloat
SELECT schemaname, tablename, 
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

5.4. System diagnostic commands

TEXT
# CPU usage
top -bn1 | head -20
htop

# Memory
free -h
ps aux --sort=-%mem | head -20

# Disk I/O
iostat -x 1 10
iotop

# Disk space
df -h
du -sh /var/lib/postgresql/18/data/*

# Network connections
netstat -tuln | grep 5432
ss -tuln | grep 5432

# Network traffic
iftop -i eth0
nethogs

# Check for OOM kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"

6. Troubleshooting Runbook

6.1. Primary node down

TEXT
# 1. Verify primary is down
patronictl list postgres

# 2. Check if automatic failover occurred
# Timeline should increment, new leader elected

# 3. If no failover, check why:
# - DCS healthy?
etcdctl endpoint health --cluster

# - Replicas eligible?
patronictl show-config postgres | grep nofailover

# - Replication lag acceptable?
patronictl show-config postgres | grep maximum_lag_on_failover

# 4. Check old primary logs
sudo journalctl -u patroni -n 200 | grep -i "error\|fatal"

# 5. Attempt manual intervention if needed
patronictl failover postgres --candidate node2

6.2. Replica not replicating

TEXT
# 1. Check replica status
patronictl list postgres

# 2. On replica, check replication status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"

# 3. Check if timeline matches
sudo -u postgres psql -c "SELECT timeline_id FROM pg_control_checkpoint();"

# 4. Check primary for replication slot
sudo -u postgres psql -h primary -c "SELECT * FROM pg_replication_slots;"

# 5. Check network connectivity
ping primary
telnet primary 5432

# 6. Restart replication
sudo systemctl restart patroni

# 7. If still failing, reinitialize
patronictl reinit postgres replica-node

6.3. etcd cluster unhealthy

TEXT
# 1. Check cluster health
etcdctl endpoint health --cluster

# 2. Check cluster status
etcdctl endpoint status --cluster --write-out=table

# 3. Check logs
sudo journalctl -u etcd -n 100

# 4. If quorum lost (< 2 of 3 healthy):
# - Fix or restart unhealthy members
sudo systemctl restart etcd

# 5. If member corrupt:
# - Remove and re-add member
etcdctl member remove <member-id>
etcdctl member add node3 --peer-urls=http://10.0.1.13:2380
# Then restart etcd on node3

# 6. Last resort: Rebuild etcd cluster
# (Requires stopping all services)

7. Best Practices

✅ DO

  1. Enable appropriate logging - Balance detail vs volume
  2. Centralize logs - Use ELK/Grafana Loki
  3. Set up alerts - Proactive notification
  4. Regular log review - Weekly analysis
  5. Document issues - Build knowledge base
  6. Test scenarios - Practice troubleshooting
  7. Keep runbooks updated - Living documents
  8. Monitor disk space - Logs can fill disk

❌ DON'T

  1. Don't enable DEBUG in production - Too verbose

  2. Don't ignore warnings - They become errors

  3. Don't delete logs immediately - Keep for analysis

  4. Don't skip log rotation - Prevent disk full

  5. Don't troubleshoot blind - Check logs first

  6. Don't make changes without logs - Document actions

8. Lab Exercises


Lab 1: Log analysis

Tasks:

  1. Configure PostgreSQL slow query logging
  2. Generate a workload with slow queries (see the sketch below)
  3. Analyze the logs with pgBadger
  4. Identify the top issues
  5. Document findings
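
One way to produce the workload for step 2 is pgbench plus a deliberately slow statement that crosses log_min_duration_statement; a minimal sketch, assuming a scratch database named labdb:

TEXT
# Create a scratch database and run a mixed pgbench workload (labdb is an assumed name)
sudo -u postgres createdb labdb
sudo -u postgres pgbench -i -s 50 labdb
sudo -u postgres pgbench -c 10 -T 300 labdb

# Guarantee at least one entry above the 1s threshold
sudo -u postgres psql -d labdb -c "SELECT pg_sleep(2);"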

Lab 2: Simulate and debug failover

Tasks:

  1. Stop the primary node
  2. Monitor the Patroni logs during the failover
  3. Trace the timeline of events
  4. Calculate the downtime from the logs
  5. Create a timeline diagram

Lab 3: Debug replication lag

Tasks:

  1. Simulate a high write load on the primary
  2. Observe the lag increase on the replica
  3. Use the diagnostic queries to identify the cause
  4. Fix the issue
  5. Verify the lag is reduced

Lab 4: Troubleshoot connection issues

Tasks:

  1. Misconfigure pg_hba.conf
  2. Attempt connections (they will fail)
  3. Use the logs to identify the issue
  4. Fix the configuration
  5. Verify connections work

9. Summary

Key Logging Locations

TEXT
PostgreSQL: /var/lib/postgresql/18/data/log/
Patroni: journalctl -u patroni
etcd: journalctl -u etcd
System: /var/log/syslog, dmesg

Essential Diagnostic Commands

TEXT
# Cluster status
patronictl list postgres

# PostgreSQL connection
psql -h localhost -U postgres

# Replication status
SELECT * FROM pg_stat_replication;

# etcd health
etcdctl endpoint health --cluster

# Logs
sudo journalctl -u patroni -f
sudo journalctl -u etcd -f

Troubleshooting Workflow

  1. Identify - What's the symptom?
  2. Isolate - Which component is failing?
  3. Investigate - Check logs, metrics
  4. Diagnose - What's the root cause?
  5. Fix - Apply solution
  6. Verify - Confirm resolution
  7. Document - Record for future

Next Steps

Lesson 20 will cover Security Best Practices:

  • SSL/TLS encryption
  • Authentication methods
  • Network security
  • Encryption at rest
  • Audit logging
  • Security hardening