Real-world Case Studies
After this lesson, you will be able to:
- Draw lessons from production PostgreSQL HA deployments.
- Understand scaling strategies for high-traffic workloads.
- Analyze cost optimization techniques.
- Study incident post-mortems.
- Apply best practices from real scenarios.
1. Case Study 1: E-commerce Platform (High Transaction Volume)
1.1. Company profile
1.2. Architecture
1.3. Configuration highlights
1.4. Challenges and solutions
Challenge 1: Connection exhaustion
Problem:
- Peak traffic caused the `max_connections` limit to be hit.
- Application errors: "FATAL: too many connections".
Solution:
- Implemented PgBouncer with transaction pooling.
- Reduced `max_connections` from 1000 to 500.
- PgBouncer `pool_size=100` per database (see the config sketch after this list).
- Result: Handled 10K app connections with 500 DB connections.
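A minimal sketch of the PgBouncer side of this setup, written as a bash snippet that lays down the configuration. The host, database name, and file paths are assumptions; `pool_mode`, `pool_size`, and `max_client_conn` are standard PgBouncer settings matching the numbers above.

```bash
# Hypothetical pgbouncer.ini: transaction pooling, pool_size=100 per database,
# up to 10K client connections funnelled into at most a few hundred server connections.
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
; host/dbname are placeholders; pool_size caps server connections for this database
shop = host=10.0.0.10 port=5432 dbname=shop pool_size=100

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is held only for the duration of a transaction
pool_mode = transaction
; accept up to 10K application connections
max_client_conn = 10000
EOF
```

With transaction pooling, a server connection is returned to the pool at commit time, which is how 10K client connections can map onto roughly 500 actual database connections.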
Challenge 2: Replication lag during flash sales
Problem:
- Flash sales caused 50K writes/second.
- Replicas lagged by 5-10 seconds.
- Read queries returned stale data.
Solution:
- Increased `wal_sender_timeout` and `wal_receiver_timeout`.
- Tuned `checkpoint_completion_target` to 0.9.
- Added synchronous replication for critical tables (see the sketch after this list).
- Separated analytics queries to dedicated read replica.
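A minimal sketch of how synchronous replication can be scoped to critical writes; the standby names and the `orders` table are assumptions. PostgreSQL applies synchronous replication per transaction rather than per table, so the usual pattern is to enable a synchronous standby set and opt in with `SET LOCAL synchronous_commit` for transactions that touch critical tables.

```sql
-- Enable a quorum of synchronous standbys (names stand in for the standbys'
-- application_name values).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (replica1, replica2)';
ALTER SYSTEM SET synchronous_commit = 'local';  -- default: commits do not wait for standbys
SELECT pg_reload_conf();

-- Critical write path: wait until a standby has applied the commit, so a
-- follow-up read from the replica sees the new row.
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';
INSERT INTO orders (customer_id, total) VALUES (42, 199.00);  -- hypothetical table
COMMIT;
```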
Challenge 3: Disk I/O bottleneck
Problem:
- Disk I/O saturated at 95%+ during peak.
- Query latency increased from 5ms to 500ms.
Solution:
- Upgraded storage from gp3 to io2 (20K IOPS).
- Implemented table partitioning for orders table.
  - Range partitioning by `order_date` (daily).
  - Automated partition management (see the sketch after this list).
- Added B-tree indexes on frequently queried columns.
- Result: I/O dropped to 40%, latency back to 5-10ms.
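A minimal sketch of the partitioning scheme, assuming an `orders` table keyed by `order_date`; the column list is illustrative, and partition creation would be automated (for example with `pg_partman` or a scheduled job).

```sql
CREATE TABLE orders (
    order_id     bigint        NOT NULL,
    customer_id  bigint        NOT NULL,
    order_date   date          NOT NULL,
    total        numeric(12,2) NOT NULL,
    -- The partition key must be part of the primary key.
    PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);

-- One partition per day; old partitions can later be detached or dropped cheaply.
CREATE TABLE orders_2024_06_01 PARTITION OF orders
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');

-- B-tree index on frequently queried columns, created on the parent so it
-- cascades to every partition.
CREATE INDEX ON orders (customer_id, order_date);
```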
1.5. Key metrics
Performance:
- Query throughput: 10K qps peak
- Average query latency: 8ms
- 99th percentile latency: 50ms
- Replication lag: < 100ms
- Failover time: 15 seconds
Availability:
- Uptime: 99.99% (52 minutes downtime/year)
- Unplanned downtime: 2 incidents, 15 minutes total
- Planned maintenance: 0 downtime (rolling updates)
Cost:
- Infrastructure: $15K/month (AWS)
- Staffing: 2 DBAs + 1 SRE
- Total: ~$50K/month
2. Case Study 2: SaaS Application (Multi-tenant)
2.1. Company profile
2.2. Architecture
2.3. Multi-tenancy implementation
2.4. Challenges and solutions
Challenge 1: Large tenant impact
Problem:
- One tenant (10% of data) caused high CPU usage.
- Impacted all other tenants.
- "Noisy neighbor" problem.
Solution:
- Implemented a per-tenant query timeout (see the sketch after this list).
- Added a per-tenant `work_mem` limit.
- Moved the largest tenants to dedicated instances.
- Implemented fair queuing with `pg_cron`.
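A minimal sketch of the per-tenant limits, assuming each tenant connects with its own role (the role and database names are placeholders); `statement_timeout` and `work_mem` are standard settings that can be bound to a role.

```sql
-- Kill runaway queries from this tenant after 5 seconds.
ALTER ROLE tenant_4711 SET statement_timeout = '5s';

-- Cap per-operation memory for this tenant.
ALTER ROLE tenant_4711 SET work_mem = '16MB';

-- Settings can also be scoped to a role within a specific database.
ALTER ROLE tenant_4711 IN DATABASE saas SET statement_timeout = '5s';
```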
Challenge 2: Backup/restore for specific tenant
Problem:
- Needed to restore one tenant's data.
- Full restore would impact all tenants.
Solution:
- Implemented a per-schema backup script (see the sketch after this list).
- Daily logical backup per tenant, uploaded to S3.
- PITR for the full database, plus granular per-tenant restores from the logical backups.
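A minimal sketch of such a per-schema backup job; the bucket, database name, and the `tenant_%` schema naming convention are assumptions.

```bash
#!/usr/bin/env bash
# Dump every tenant schema separately so a single tenant can be restored
# without touching the others.
set -euo pipefail

DB="saas"
BUCKET="s3://example-tenant-backups"   # placeholder bucket
TODAY=$(date +%F)

for schema in $(psql -d "$DB" -At -c \
    "SELECT nspname FROM pg_namespace WHERE nspname LIKE 'tenant_%'"); do
    pg_dump -d "$DB" --schema="$schema" --format=custom \
        --file="/tmp/${schema}_${TODAY}.dump"
    aws s3 cp "/tmp/${schema}_${TODAY}.dump" "${BUCKET}/${TODAY}/${schema}.dump"
    rm -f "/tmp/${schema}_${TODAY}.dump"
done
```

A single tenant can then be restored with `pg_restore -d saas tenant_4711_2024-06-01.dump`, optionally into a scratch database first for validation.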
Challenge 3: Schema migration across 100K tenants
Problem:
- Needed to add a column to a table that exists in every tenant schema.
- 100K schemas = 100K migrations.
- Couldn't hold locks for that long.
Solution:
- Multi-phase migration (see the sketch after this list):
  1. Add column as nullable (fast, no rewrite).
  2. Backfill data in batches (chunked updates).
  3. Add default value (after backfill).
  4. Add NOT NULL constraint (after validation).
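A minimal sketch of the phases for a single tenant schema (schema, table, and column names are illustrative); the migration tooling then replays the same statements across all schemas.

```sql
-- Phase 1: add the column as nullable (metadata-only, takes only a brief lock).
ALTER TABLE tenant_4711.invoices ADD COLUMN currency text;

-- Phase 2: backfill in small batches; repeat until no rows are updated.
UPDATE tenant_4711.invoices
   SET currency = 'USD'
 WHERE id IN (SELECT id FROM tenant_4711.invoices
               WHERE currency IS NULL
               LIMIT 10000);

-- Phase 3: set the default for newly inserted rows.
ALTER TABLE tenant_4711.invoices ALTER COLUMN currency SET DEFAULT 'USD';

-- Phase 4: enforce NOT NULL (PostgreSQL scans the table to validate,
-- but does not rewrite it).
ALTER TABLE tenant_4711.invoices ALTER COLUMN currency SET NOT NULL;
```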
2.5. Key metrics
Performance:
- Query throughput: 2K qps average
- Average query latency: 15ms
- Replication lag: < 50ms
- Largest tenant: 50GB (isolated)
Availability:
- Uptime: 99.95%
- Failover time: 20 seconds
Cost:
- Infrastructure: $5K/month (GCP)
- Staffing: 1 DBA
- Cost per tenant: $0.05/month
3. Case Study 3: Financial Services (Compliance-heavy)
3.1. Company profile
3.2. Architecture
3.3. Compliance configuration
3.4. Challenges and solutions
Challenge 1: 7-year backup retention
Problem:
- Compliance requires 7 years of backups.
- 10TB database = 365 x 7 = 2,555 daily backups.
- Storage costs would be astronomical.
Solution:
- Implemented a tiered backup strategy (see the sketch after this list):
  - Daily full backups: 30 days (hot storage)
  - Weekly full backups: 1 year (warm storage)
  - Monthly full backups: 7 years (cold storage - tape)
- Compression with pgBackRest.
- Result: Reduced storage from 25PB to 5PB.
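A minimal sketch of the tiering logic, assuming pgBackRest >= 2.33 with three repositories configured in `pgbackrest.conf` (repo1 hot, repo2 warm, repo3 cold) and retention handled by the `repoN-retention-full` settings; the stanza name is a placeholder.

```bash
#!/usr/bin/env bash
set -euo pipefail

STANZA="main"
DOW=$(date +%u)   # day of week, 1 = Monday
DOM=$(date +%d)   # day of month

# Daily full backup to the hot repository (30-day retention).
pgbackrest --stanza="$STANZA" --repo=1 --type=full backup

# Weekly full backup to the warm repository (1-year retention).
if [ "$DOW" = "7" ]; then
    pgbackrest --stanza="$STANZA" --repo=2 --type=full backup
fi

# Monthly full backup to the cold repository (7-year retention).
if [ "$DOM" = "01" ]; then
    pgbackrest --stanza="$STANZA" --repo=3 --type=full backup
fi
```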
Challenge 2: Zero-tolerance for data loss (RPO = 0)
Problem:
- Banking regulations require no data loss.
- Async replication has lag window.
Solution:
- Synchronous replication to 2 replicas (see the sketch after this list).
- Trade-off: 10ms additional latency.
- Acceptable for financial transactions.
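A minimal sketch of the quorum setup; the standby names are placeholders for their `application_name` values. In a Patroni-managed cluster the equivalent behaviour is usually driven through Patroni's synchronous-mode settings rather than edited by hand.

```sql
-- Require every commit to be confirmed by at least 2 of the listed standbys
-- before it is acknowledged to the client (RPO = 0).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)';
ALTER SYSTEM SET synchronous_commit = 'on';   -- wait for WAL flush on the standbys
SELECT pg_reload_conf();

-- Verify which standbys are currently synchronous.
SELECT application_name, sync_state FROM pg_stat_replication;
```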
Challenge 3: Disaster recovery drills
Problem:
- Quarterly DR drills required by auditors.
- Can't disrupt production.
Solution:
- Automated DR failover testing (see the sketch after this list):
  1. Clone production to the DR site (logical replication).
  2. Promote the DR site to primary.
  3. Run smoke tests (read-only queries).
  4. Measure RTO (target: < 1 hour).
  5. Restore the production primary.
  6. Document results for audit.
- Implemented with Ansible playbooks.
- Full drill takes 2 hours (outside business hours).
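A minimal sketch of the promotion and smoke-test steps (steps 2-3), run against the DR instance; the table names are illustrative.

```sql
-- Promote the DR standby (PostgreSQL 12+); returns once promotion completes.
SELECT pg_promote(wait => true);

-- Confirm the instance now accepts writes.
SELECT pg_is_in_recovery();   -- expected: false

-- Read-only smoke tests against business-critical tables (hypothetical names).
SELECT count(*) FROM accounts;
SELECT max(created_at) FROM transactions;
```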
3.5. Key metrics
Performance:
- Query throughput: 500 qps
- Average query latency: 20ms (with sync replication)
- Replication lag: 0ms (synchronous)
- Failover time: 30 seconds
Availability:
- Uptime: 99.999% (5 minutes downtime/year)
- Unplanned downtime: 0 (in last 2 years)
Compliance:
- Audit log retention: 7 years
- Backup retention: 7 years
- DR drills: Quarterly (100% success rate)
Cost:
- Infrastructure: $30K/month (on-prem)
- Staffing: 3 DBAs + 2 security engineers
- Total: ~$100K/month
4. Case Study 4: Social Media App (Read-heavy)
4.1. Company profile
4.2. Architecture
4.3. Read scaling strategy
4.4. Challenges and solutions
Challenge 1: Replication lag visible to users
Problem:
- User posts content, immediately refreshes page.
- Content not visible (read from lagging replica).
- User thinks post failed.
Solution:
- Sticky sessions after a write (see the sketch after this list):
  1. User writes to the leader.
  2. Application stores the commit's LSN in a session cookie.
  3. Next read checks that the replica's replayed LSN >= the session LSN.
  4. If the replica is behind, route to the leader temporarily.
  5. After the replica catches up, route back to the replica.
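A minimal sketch of the LSN bookkeeping; the functions are standard PostgreSQL, and the literal LSN stands in for the value the application stored in the session cookie.

```sql
-- On the leader, right after the write: capture the current WAL position
-- and store it in the user's session cookie.
SELECT pg_current_wal_lsn();

-- On the candidate replica, before the next read: how far has replay progressed?
SELECT pg_last_wal_replay_lsn();

-- Routing check: a positive difference means the replica has not yet replayed
-- the session's write, so the read should go to the leader instead.
SELECT pg_wal_lsn_diff('0/3000060'::pg_lsn, pg_last_wal_replay_lsn()) > 0;
```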
Challenge 2: Hot partition (celebrity posts)
Problem:
- Celebrity with 100M followers posts content.
- Single partition overwhelmed.
- Query latency spikes to 10 seconds.
Solution:
- Identify hot users (> 1M followers).
- Replicate hot user data to all shards.
- Denormalize celebrity posts to separate table.
- Use materialized views for timeline generation (see the sketch after this list).
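A minimal sketch of such a materialized view, assuming hypothetical `posts` and `users` tables; the concurrent refresh keeps it readable while it is rebuilt.

```sql
CREATE MATERIALIZED VIEW celebrity_timeline AS
SELECT p.author_id, p.post_id, p.body, p.created_at
  FROM posts p
  JOIN users u ON u.user_id = p.author_id
 WHERE u.follower_count > 1000000;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON celebrity_timeline (author_id, post_id);

-- Refreshed on a schedule (e.g. every minute) without blocking readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY celebrity_timeline;
```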
Challenge 3: Managing 20 read replicas
Problem:
- Manual management of 20 replicas is error-prone.
- Need to add/remove replicas dynamically.
Solution:
- Kubernetes + Zalando Postgres Operator.
- Auto-scaling based on CPU/query load.
- Example: Scale from 20 to 30 replicas during peak hours (see the sketch after this list).
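A minimal sketch of how such a scale-out can be triggered against the Zalando operator, assuming a cluster named `acid-social` (the name is a placeholder); the operator reconciles the change in `numberOfInstances` by adding or removing replica pods.

```bash
# Scale the cluster to 30 instances for peak hours.
kubectl patch postgresql acid-social --type=merge \
    -p '{"spec": {"numberOfInstances": 30}}'

# Scale back down afterwards.
kubectl patch postgresql acid-social --type=merge \
    -p '{"spec": {"numberOfInstances": 20}}'
```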
4.5. Key metrics
Performance:
- Query throughput: 50K qps (48K reads, 2K writes)
- Average read latency: 5ms
- Average write latency: 15ms
- Replication lag: 100-500ms (acceptable for social media)
Availability:
- Uptime: 99.9%
- Read replicas can fail without user impact
Cost:
- Infrastructure: $80K/month (AWS)
- 50 shards x (1 leader + 5 replicas) = 300 instances
- Mostly r6g.xlarge (4 vCPU, 32GB RAM)
5. Lessons Learned (Cross-case Analysis)
5.1. Common patterns
✅ What works:
- Connection pooling (PgBouncer): Essential for high traffic.
- Read replicas: Cheapest way to scale reads.
- Monitoring with Prometheus: Early problem detection.
- Automated failover (Patroni): Reduces MTTR.
- Table partitioning: Improves query performance.
- Backup automation: Prevents human error.
- Regular DR drills: Validates procedures.
- Documentation: Critical for incident response.
❌ What doesn't work:
- Over-sharding: Adds complexity without benefit.
- Premature optimization: YAGNI applies to databases too.
- Ignoring replication lag: Causes data consistency issues.
- Manual processes: Error-prone and slow.
- Single point of failure: No HA = no production.
5.2. Cost optimization techniques
- Right-sizing instances:
  - Start small, scale up based on metrics.
  - Use burstable instances (t3/t4g) for dev/staging.
  - Reserved instances for predictable workloads (40% savings).
- Storage optimization:
  - gp3 instead of io2 for most workloads (60% cheaper).
  - Compress old partitions (`pg_squeeze`).
  - Archive to S3 for long-term retention.
- Reduce replica count:
  - 2-3 replicas are sufficient for most workloads.
  - Use a read cache (Redis) before adding replicas.
- Connection pooling:
  - Reduces instance size requirements.
  - 500 connections → 100 actual DB connections.
- Serverless options:
  - AWS RDS Proxy + Aurora Serverless for variable workloads.
  - Pay per request instead of fixed capacity.
5.3. When to NOT use Patroni
Consider alternatives if:
- A single instance is sufficient (< 100 qps).
- Cloud-managed HA is available (RDS, Cloud SQL).
- You don't have a skilled PostgreSQL DBA.
- Your budget is very limited.
- The environment is for development/testing only.
Use Patroni when:
- You need full control over configuration.
- You run on-premises or in a hybrid cloud.
- Compliance requires self-managed databases.
- You want to optimize costs compared to managed services.
- High availability is critical.
6. Lab Exercises
Lab 1: Calculate capacity planning
Tasks:
- Estimate queries per second for your use case.
- Calculate required connections.
- Size instance (CPU, RAM, storage).
- Estimate replication lag for replica count.
- Calculate total infrastructure cost.
Lab 2: Design multi-tenant architecture
Tasks:
- Choose tenancy model (shared vs dedicated).
- Implement row-level security.
- Create backup strategy per tenant.
- Design migration procedure.
- Test noisy neighbor mitigation.
Lab 3: Implement read replica scaling
Tasks:
- Add read replica to cluster.
- Implement read/write routing in application.
- Measure replication lag.
- Test failover with replicas.
- Monitor query distribution.
Lab 4: Cost optimization analysis
Tasks:
- Audit current infrastructure costs.
- Identify optimization opportunities.
- Implement connection pooling.
- Right-size instances.
- Calculate cost savings.
7. Summary
Architecture Patterns Summary
| Pattern | Best For | Complexity | Cost |
|---|---|---|---|
| Single Leader + Replicas | Read-heavy | Low | Low |
| Multi-datacenter | Geographic distribution | High | High |
| Sharding | Horizontal scaling | Very High | Medium |
| Multi-tenant | SaaS applications | Medium | Low |
Key Takeaways
- Connection pooling is non-negotiable at scale.
- Read replicas are the easiest way to scale.
- Monitoring and alerting prevent incidents.
- Backup and restore must be tested regularly.
- Documentation saves time during incidents.
- Automation reduces human error.
- Cost optimization is an ongoing effort.
- Right-sizing prevents over-provisioning.
Next Steps
Lesson 26 will cover Automation with Ansible:
- Ansible playbooks for Patroni deployment
- Configuration management automation
- Automated testing frameworks
- CI/CD integration for database changes
- Infrastructure as Code (IaC)