Real-world Case Studies
After this lesson, you will be able to:
- Draw lessons from production PostgreSQL HA deployments.
- Understand scaling strategies for high-traffic workloads.
- Analyze cost optimization techniques.
- Study incident post-mortems.
- Apply best practices from real scenarios.
1. Case Study 1: E-commerce Platform (High Transaction Volume)
1.1. Company profile
1.2. Architecture
1.3. Configuration highlights
1.4. Challenges and solutions
Challenge 1: Connection exhaustion
Problem:
- Peak traffic caused the `max_connections` limit to be hit.
- Application errors: "FATAL: too many connections".
Solution:
- Implemented PgBouncer with transaction pooling.
- Reduced `max_connections` from 1000 to 500.
- PgBouncer `pool_size=100` per database (see the config sketch after this list).
- Result: Handled 10K app connections with 500 DB connections.
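A minimal sketch of the PgBouncer side of this setup, written as a bash snippet that lays down the configuration. The host, database name, and file paths are assumptions; `pool_mode`, `pool_size`, and `max_client_conn` are standard PgBouncer settings matching the numbers above.

```bash
# Hypothetical pgbouncer.ini: transaction pooling, pool_size=100 per database,
# up to 10K client connections funnelled into at most a few hundred server connections.
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
; host/dbname are placeholders; pool_size caps server connections for this database
shop = host=10.0.0.10 port=5432 dbname=shop pool_size=100

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is held only for the duration of a transaction
pool_mode = transaction
; accept up to 10K application connections
max_client_conn = 10000
EOF
```

With transaction pooling, a server connection is returned to the pool at commit time, which is how 10K client connections can map onto roughly 500 actual database connections.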
Challenge 2: Replication lag during flash sales
Problem:
- Flash sales caused 50K writes/second.
- Replicas lagged by 5-10 seconds.
- Read queries returned stale data.
Solution:
- Increased `wal_sender_timeout` and `wal_receiver_timeout`.
- Tuned `checkpoint_completion_target` to 0.9.
- Added synchronous replication for critical tables (see the sketch after this list).
- Separated analytics queries to dedicated read replica.
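A minimal sketch of how synchronous replication can be scoped to critical writes; the standby names and the `orders` table are assumptions. PostgreSQL applies synchronous replication per transaction rather than per table, so the usual pattern is to enable a synchronous standby set and opt in with `SET LOCAL synchronous_commit` for transactions that touch critical tables.

```sql
-- Enable a quorum of synchronous standbys (names stand in for the standbys'
-- application_name values).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (replica1, replica2)';
ALTER SYSTEM SET synchronous_commit = 'local';  -- default: commits do not wait for standbys
SELECT pg_reload_conf();

-- Critical write path: wait until a standby has applied the commit, so a
-- follow-up read from the replica sees the new row.
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';
INSERT INTO orders (customer_id, total) VALUES (42, 199.00);  -- hypothetical table
COMMIT;
```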
Challenge 3: Disk I/O bottleneck
Problem:
- Disk I/O saturated at 95%+ during peak.
- Query latency increased from 5ms to 500ms.
Solution:
- Upgraded storage from gp3 to io2 (20K IOPS).
- Implemented table partitioning for orders table.
  - Range partitioning by `order_date` (daily).
  - Automated partition management (see the sketch after this list).
- Added B-tree indexes on frequently queried columns.
- Result: I/O dropped to 40%, latency back to 5-10ms.
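A minimal sketch of the partitioning scheme, assuming an `orders` table keyed by `order_date`; the column list is illustrative, and partition creation would be automated (for example with `pg_partman` or a scheduled job).

```sql
CREATE TABLE orders (
    order_id     bigint        NOT NULL,
    customer_id  bigint        NOT NULL,
    order_date   date          NOT NULL,
    total        numeric(12,2) NOT NULL,
    -- The partition key must be part of the primary key.
    PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);

-- One partition per day; old partitions can later be detached or dropped cheaply.
CREATE TABLE orders_2024_06_01 PARTITION OF orders
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');

-- B-tree index on frequently queried columns, created on the parent so it
-- cascades to every partition.
CREATE INDEX ON orders (customer_id, order_date);
```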
1.5. Key metrics
Performance:
- Query throughput: 10K qps peak
- Average query latency: 8ms
- 99th percentile latency: 50ms
- Replication lag: < 100ms
- Failover time: 15 seconds
Availability:
- Uptime: 99.99% (52 minutes downtime/year)
- Unplanned downtime: 2 incidents, 15 minutes total
- Planned maintenance: 0 downtime (rolling updates)
Cost:
- Infrastructure: $15K/month (AWS)
- Staffing: 2 DBAs + 1 SRE
- Total: ~$50K/month
2. Case Study 2: SaaS Application (Multi-tenant)
2.1. Company profile
2.2. Architecture
2.3. Multi-tenancy implementation
2.4. Challenges and solutions
Challenge 1: Large tenant impact
Problem:
- One tenant (10% of data) caused high CPU usage.
- Impacted all other tenants.
- "Noisy neighbor" problem.
Solution:
- Implemented a per-tenant query timeout (see the sketch after this list).
- Added a per-tenant `work_mem` limit.
- Moved the largest tenants to dedicated instances.
- Implemented fair queuing with `pg_cron`.
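A minimal sketch of the per-tenant limits, assuming each tenant connects with its own role (the role and database names are placeholders); `statement_timeout` and `work_mem` are standard settings that can be bound to a role.

```sql
-- Kill runaway queries from this tenant after 5 seconds.
ALTER ROLE tenant_4711 SET statement_timeout = '5s';

-- Cap per-operation memory for this tenant.
ALTER ROLE tenant_4711 SET work_mem = '16MB';

-- Settings can also be scoped to a role within a specific database.
ALTER ROLE tenant_4711 IN DATABASE saas SET statement_timeout = '5s';
```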
Challenge 2: Backup/restore for specific tenant
Problem:
- Needed to restore one tenant's data.
- Full restore would impact all tenants.
Solution:
- Implemented a per-schema backup script (see the sketch after this list).
- Daily logical backup per tenant, uploaded to S3.
- PITR for the full database, plus granular per-tenant restores from the logical backups.
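A minimal sketch of such a per-schema backup job; the bucket, database name, and the `tenant_%` schema naming convention are assumptions.

```bash
#!/usr/bin/env bash
# Dump every tenant schema separately so a single tenant can be restored
# without touching the others.
set -euo pipefail

DB="saas"
BUCKET="s3://example-tenant-backups"   # placeholder bucket
TODAY=$(date +%F)

for schema in $(psql -d "$DB" -At -c \
    "SELECT nspname FROM pg_namespace WHERE nspname LIKE 'tenant_%'"); do
    pg_dump -d "$DB" --schema="$schema" --format=custom \
        --file="/tmp/${schema}_${TODAY}.dump"
    aws s3 cp "/tmp/${schema}_${TODAY}.dump" "${BUCKET}/${TODAY}/${schema}.dump"
    rm -f "/tmp/${schema}_${TODAY}.dump"
done
```

A single tenant can then be restored with `pg_restore -d saas tenant_4711_2024-06-01.dump`, optionally into a scratch database first for validation.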
Challenge 3: Schema migration across 100K tenants
Problem:
- Needed to add a column to a table that exists in every tenant schema.
- 100K schemas = 100K migrations.
- Couldn't hold locks for that long.
Solution:
- Multi-phase migration (see the sketch after this list):
  1. Add column as nullable (fast, no rewrite).
  2. Backfill data in batches (chunked updates).
  3. Add default value (after backfill).
  4. Add NOT NULL constraint (after validation).
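A minimal sketch of the phases for a single tenant schema (schema, table, and column names are illustrative); the migration tooling then replays the same statements across all schemas.

```sql
-- Phase 1: add the column as nullable (metadata-only, takes only a brief lock).
ALTER TABLE tenant_4711.invoices ADD COLUMN currency text;

-- Phase 2: backfill in small batches; repeat until no rows are updated.
UPDATE tenant_4711.invoices
   SET currency = 'USD'
 WHERE id IN (SELECT id FROM tenant_4711.invoices
               WHERE currency IS NULL
               LIMIT 10000);

-- Phase 3: set the default for newly inserted rows.
ALTER TABLE tenant_4711.invoices ALTER COLUMN currency SET DEFAULT 'USD';

-- Phase 4: enforce NOT NULL (PostgreSQL scans the table to validate,
-- but does not rewrite it).
ALTER TABLE tenant_4711.invoices ALTER COLUMN currency SET NOT NULL;
```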
2.5. Key metrics
Performance:
- Query throughput: 2K qps average
- Average query latency: 15ms
- Replication lag: < 50ms
- Largest tenant: 50GB (isolated)
Availability:
- Uptime: 99.95%
- Failover time: 20 seconds
Cost:
- Infrastructure: $5K/month (GCP)
- Staffing: 1 DBA
- Cost per tenant: $0.05/month
3. Case Study 3: Financial Services (Compliance-heavy)
3.1. Company profile
3.2. Architecture
3.3. Compliance configuration
3.4. Challenges and solutions
Challenge 1: 7-year backup retention
Problem:
- Compliance requires 7 years of backups.
- 10TB database = 365 x 7 = 2,555 daily backups.
- Storage costs would be astronomical.
Solution:
- Implemented a tiered backup strategy (see the sketch after this list):
  - Daily full backups: 30 days (hot storage)
  - Weekly full backups: 1 year (warm storage)
  - Monthly full backups: 7 years (cold storage - tape)
- Compression with pgBackRest.
- Result: Reduced storage from 25PB to 5PB.
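A minimal sketch of the tiering logic, assuming pgBackRest >= 2.33 with three repositories configured in `pgbackrest.conf` (repo1 hot, repo2 warm, repo3 cold) and retention handled by the `repoN-retention-full` settings; the stanza name is a placeholder.

```bash
#!/usr/bin/env bash
set -euo pipefail

STANZA="main"
DOW=$(date +%u)   # day of week, 1 = Monday
DOM=$(date +%d)   # day of month

# Daily full backup to the hot repository (30-day retention).
pgbackrest --stanza="$STANZA" --repo=1 --type=full backup

# Weekly full backup to the warm repository (1-year retention).
if [ "$DOW" = "7" ]; then
    pgbackrest --stanza="$STANZA" --repo=2 --type=full backup
fi

# Monthly full backup to the cold repository (7-year retention).
if [ "$DOM" = "01" ]; then
    pgbackrest --stanza="$STANZA" --repo=3 --type=full backup
fi
```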
Challenge 2: Zero-tolerance for data loss (RPO = 0)
Problem:
- Banking regulations require no data loss.
- Async replication has lag window.
Solution:
- Synchronous replication to 2 replicas (see the sketch after this list).
- Trade-off: 10ms additional latency.
- Acceptable for financial transactions.
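A minimal sketch of the quorum setup; the standby names are placeholders for their `application_name` values. In a Patroni-managed cluster the equivalent behaviour is usually driven through Patroni's synchronous-mode settings rather than edited by hand.

```sql
-- Require every commit to be confirmed by at least 2 of the listed standbys
-- before it is acknowledged to the client (RPO = 0).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)';
ALTER SYSTEM SET synchronous_commit = 'on';   -- wait for WAL flush on the standbys
SELECT pg_reload_conf();

-- Verify which standbys are currently synchronous.
SELECT application_name, sync_state FROM pg_stat_replication;
```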
Challenge 3: Disaster recovery drills
Problem:
- Quarterly DR drills required by auditors.
- Can't disrupt production.
Solution:
- Automated DR failover testing (see the sketch after this list):
  1. Clone production to the DR site (logical replication).
  2. Promote the DR site to primary.
  3. Run smoke tests (read-only queries).
  4. Measure RTO (target: < 1 hour).
  5. Restore the production primary.
  6. Document results for audit.
- Implemented with Ansible playbooks.
- Full drill takes 2 hours (outside business hours).
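A minimal sketch of the promotion and smoke-test steps (steps 2-3), run against the DR instance; the table names are illustrative.

```sql
-- Promote the DR standby (PostgreSQL 12+); returns once promotion completes.
SELECT pg_promote(wait => true);

-- Confirm the instance now accepts writes.
SELECT pg_is_in_recovery();   -- expected: false

-- Read-only smoke tests against business-critical tables (hypothetical names).
SELECT count(*) FROM accounts;
SELECT max(created_at) FROM transactions;
```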
3.5. Key metrics
Performance:
- Query throughput: 500 qps
- Average query latency: 20ms (with sync replication)
- Replication lag: 0ms (synchronous)
- Failover time: 30 seconds
Availability:
- Uptime: 99.999% (5 minutes downtime/year)
- Unplanned downtime: 0 (in last 2 years)
Compliance:
- Audit log retention: 7 years
- Backup retention: 7 years
- DR drills: Quarterly (100% success rate)
Cost:
- Infrastructure: $30K/month (on-prem)
- Staffing: 3 DBAs + 2 security engineers
- Total: ~$100K/month
4. Case Study 4: Social Media App (Read-heavy)
4.1. Company profile
4.2. Architecture
4.3. Read scaling strategy
4.4. Challenges and solutions
Challenge 1: Replication lag visible to users
Problem:
- User posts content, immediately refreshes page.
- Content not visible (read from lagging replica).
- User thinks post failed.
Solution:
- Sticky sessions after a write (see the sketch after this list):
  1. User writes to the leader.
  2. Application stores the commit's LSN in a session cookie.
  3. Next read checks that the replica's replayed LSN >= the session LSN.
  4. If the replica is behind, route to the leader temporarily.
  5. After the replica catches up, route back to the replica.
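A minimal sketch of the LSN bookkeeping; the functions are standard PostgreSQL, and the literal LSN stands in for the value the application stored in the session cookie.

```sql
-- On the leader, right after the write: capture the current WAL position
-- and store it in the user's session cookie.
SELECT pg_current_wal_lsn();

-- On the candidate replica, before the next read: how far has replay progressed?
SELECT pg_last_wal_replay_lsn();

-- Routing check: a positive difference means the replica has not yet replayed
-- the session's write, so the read should go to the leader instead.
SELECT pg_wal_lsn_diff('0/3000060'::pg_lsn, pg_last_wal_replay_lsn()) > 0;
```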
Challenge 2: Hot partition (celebrity posts)
Problem:
- Celebrity with 100M followers posts content.
- Single partition overwhelmed.
- Query latency spikes to 10 seconds.
Solution:
- Identify hot users (> 1M followers).
- Replicate hot user data to all shards.
- Denormalize celebrity posts to separate table.
- Use materialized views for timeline generation (see the sketch after this list).
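A minimal sketch of such a materialized view, assuming hypothetical `posts` and `users` tables; the concurrent refresh keeps it readable while it is rebuilt.

```sql
CREATE MATERIALIZED VIEW celebrity_timeline AS
SELECT p.author_id, p.post_id, p.body, p.created_at
  FROM posts p
  JOIN users u ON u.user_id = p.author_id
 WHERE u.follower_count > 1000000;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON celebrity_timeline (author_id, post_id);

-- Refreshed on a schedule (e.g. every minute) without blocking readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY celebrity_timeline;
```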
Challenge 3: Managing 20 read replicas
Problem:
- Manual management of 20 replicas is error-prone.
- Need to add/remove replicas dynamically.
Solution:
- Kubernetes + Zalando Postgres Operator.
- Auto-scaling based on CPU/query load.
- Example: Scale from 20 to 30 replicas during peak hours (see the sketch after this list).
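A minimal sketch of how such a scale-out can be triggered against the Zalando operator, assuming a cluster named `acid-social` (the name is a placeholder); the operator reconciles the change in `numberOfInstances` by adding or removing replica pods.

```bash
# Scale the cluster to 30 instances for peak hours.
kubectl patch postgresql acid-social --type=merge \
    -p '{"spec": {"numberOfInstances": 30}}'

# Scale back down afterwards.
kubectl patch postgresql acid-social --type=merge \
    -p '{"spec": {"numberOfInstances": 20}}'
```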
4.5. Key metrics
Performance:
- Query throughput: 50K qps (48K reads, 2K writes)
- Average read latency: 5ms
- Average write latency: 15ms
- Replication lag: 100-500ms (acceptable for social media)
Availability:
- Uptime: 99.9%
- Read replicas can fail without user impact
Cost:
- Infrastructure: $80K/month (AWS)
- 50 shards x (1 leader + 5 replicas) = 300 instances
- Mostly r6g.xlarge (4 vCPU, 32GB RAM)
5. Lessons Learned (Cross-case Analysis)
5.1. Common patterns
✅ What works:
- Connection pooling (PgBouncer): Essential for high traffic.
- Read replicas: Cheapest way to scale reads.
- Monitoring with Prometheus: Early problem detection.
- Automated failover (Patroni): Reduces MTTR.
- Table partitioning: Improves query performance.
- Backup automation: Prevents human error.
- Regular DR drills: Validates procedures.
- Documentation: Critical for incident response.
❌ What doesn't work:
- Over-sharding: Adds complexity without benefit.
- Premature optimization: YAGNI applies to databases too.
- Ignoring replication lag: Causes data consistency issues.
- Manual processes: Error-prone and slow.
- Single point of failure: No HA = no production.
5.2. Cost optimization techniques
- Right-sizing instances:
  - Start small, scale up based on metrics.
  - Use burstable instances (t3/t4g) for dev/staging.
  - Reserved instances for predictable workloads (40% savings).
- Storage optimization:
  - gp3 instead of io2 for most workloads (60% cheaper).
  - Compress old partitions (`pg_squeeze`).
  - Archive to S3 for long-term retention.
- Reduce replica count:
  - 2-3 replicas are sufficient for most workloads.
  - Use a read cache (Redis) before adding replicas.
- Connection pooling:
  - Reduces instance size requirements.
  - 500 connections → 100 actual DB connections.
- Serverless options:
  - AWS RDS Proxy + Aurora Serverless for variable workloads.
  - Pay per request instead of fixed capacity.
5.3. When to NOT use Patroni
Consider alternatives if:
- A single instance is sufficient (< 100 qps).
- Cloud-managed HA is available (RDS, Cloud SQL).
- You don't have a skilled PostgreSQL DBA.
- Your budget is very limited.
- The environment is for development/testing only.
Use Patroni when:
- You need full control over configuration.
- You run on-premises or in a hybrid cloud.
- Compliance requires self-managed databases.
- You want to optimize costs compared to managed services.
- High availability is critical.
6. Lab Exercises
Lab 1: Calculate capacity planning
Tasks:
- Estimate queries per second for your use case.
- Calculate required connections.
- Size instance (CPU, RAM, storage).
- Estimate replication lag for replica count.
- Calculate total infrastructure cost.
Lab 2: Design multi-tenant architecture
Tasks:
- Choose tenancy model (shared vs dedicated).
- Implement row-level security.
- Create backup strategy per tenant.
- Design migration procedure.
- Test noisy neighbor mitigation.
Lab 3: Implement read replica scaling
Tasks:
- Add read replica to cluster.
- Implement read/write routing in application.
- Measure replication lag.
- Test failover with replicas.
- Monitor query distribution.
Lab 4: Cost optimization analysis
Tasks:
- Audit current infrastructure costs.
- Identify optimization opportunities.
- Implement connection pooling.
- Right-size instances.
- Calculate cost savings.
7. Summary
Architecture Patterns Summary
| Pattern | Best For | Complexity | Cost |
|---|---|---|---|
| Single Leader + Replicas | Read-heavy | Low | Low |
| Multi-datacenter | Geographic distribution | High | High |
| Sharding | Horizontal scaling | Very High | Medium |
| Multi-tenant | SaaS applications | Medium | Low |
Key Takeaways
- Connection pooling is non-negotiable at scale.
- Read replicas are the easiest way to scale.
- Monitoring and alerting prevent incidents.
- Backup and restore must be tested regularly.
- Documentation saves time during incidents.
- Automation reduces human error.
- Cost optimization is an ongoing effort.
- Right-sizing prevents over-provisioning.
Next Steps
Lesson 26 will cover Automation with Ansible:
- Ansible playbooks for Patroni deployment
- Configuration management automation
- Automated testing frameworks
- CI/CD integration for database changes
- Infrastructure as Code (IaC)