Overview
Our finance application relied on a single-region Aurora PostgreSQL cluster (us-east-1). Scheduled maintenance or transient issues risked minutes of downtime, impacting thousands of daily transactions.
Architected a fully automated, cross-region failover solution using Aurora Global Database to meet a 99.99% availability SLA.
Deployed an Aurora Global Database with a read-only secondary cluster in eu-central-1 that can be promoted to accept writes. A Route 53 health check monitors the primary writer endpoint; if the check fails, DNS automatically switches to the secondary cluster's endpoint within 60 seconds.
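As a rough sanity check on the 60-second figure, the time for DNS to move over can be approximated as health check detection time plus the record's TTL. The sketch below assumes a 10-second request interval, a failure threshold of 3, and a 30-second TTL; these values are illustrative and are not taken from the actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverBudget:
    """Inputs that bound how long clients keep resolving the failed endpoint."""
    request_interval_s: int  # seconds between health check requests (assumed)
    failure_threshold: int   # consecutive failures before "unhealthy" (assumed)
    dns_ttl_s: int           # TTL on the failover record (assumed)

    def detection_s(self) -> int:
        # Time for the health check to observe enough consecutive failures.
        return self.request_interval_s * self.failure_threshold

    def worst_case_s(self) -> int:
        # A resolver may have cached the old answer just before the flip,
        # so the full TTL is added on top of detection time.
        return self.detection_s() + self.dns_ttl_s


budget = FailoverBudget(request_interval_s=10, failure_threshold=3, dns_ttl_s=30)
print(budget.detection_s(), budget.worst_case_s())
```

With these assumed values the worst case lands at 60 seconds. A longer interval or TTL would push it past that target, so both settings are worth pinning explicitly.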
1. Wrote Terraform modules to provision Global Database and DNS failover.
2. Configured CloudWatch alarms on replica lag > 5 seconds.
3. Automated IAM roles for cross-region KMS access.
4. Deployed via CI/CD pipeline using GitHub Actions.
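The replica lag alarm from the steps above could be expressed roughly as follows. This is a sketch only: the real alarm was provisioned through Terraform. The function name, alarm name, period, and evaluation settings are hypothetical, and it assumes the alarm targets the `AuroraGlobalDBReplicationLag` metric, which CloudWatch reports in milliseconds.

```python
def replica_lag_alarm(cluster_id: str, threshold_seconds: float = 5.0) -> dict:
    """Build keyword arguments for a CloudWatch alarm on global replication lag.

    The returned dict is shaped for boto3's CloudWatch `put_metric_alarm`,
    but nothing here calls AWS.
    """
    return {
        "AlarmName": f"{cluster_id}-global-replica-lag",  # hypothetical name
        "Namespace": "AWS/RDS",
        "MetricName": "AuroraGlobalDBReplicationLag",     # unit: milliseconds
        "Dimensions": [{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,      # consecutive breaching periods (assumed)
        "Threshold": threshold_seconds * 1000,  # convert seconds to ms
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # treat a silent replica as a problem
    }


params = replica_lag_alarm("finance-secondary")  # hypothetical cluster id
print(params["Threshold"])
```

The main detail to get right is the unit conversion: a threshold of 5 entered against a millisecond metric would fire almost constantly.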
Ran simulated failover drills by rebooting the primary node. Measured recovery time in each drill against the RTO (Recovery Time Objective) and verified zero data loss. Documented the playbook and trained on-call engineers.
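One way to measure recovery time during a drill is to run a write probe at a fixed interval and compute the gap between the first failed probe and the next successful one. The sketch below illustrates that calculation. The probe log is made-up sample data, and `outage_seconds` is a hypothetical helper rather than the tooling actually used.

```python
def outage_seconds(samples: list[tuple[float, bool]]) -> float:
    """Return the longest outage in a time-ordered probe log.

    Each sample is (unix_timestamp, write_succeeded). An outage runs from
    the first failed probe to the next successful probe.
    """
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts  # outage starts at the first failed probe
        elif ok and down_since is not None:
            longest = max(longest, float(ts - down_since))
            down_since = None  # service recovered
    return longest


# Hypothetical drill: probes every 5 s, writes fail from t=10 until t=35.
drill = [(0, True), (5, True), (10, False), (15, False), (20, False),
         (25, False), (30, False), (35, True), (40, True)]
recovery = outage_seconds(drill)
print(recovery)
```

The measurement is only as precise as the probe interval, so a shorter interval gives a tighter bound on the observed recovery time.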
• Achieved mean RTO of 25 seconds (vs. 5-minute window previously).
• Replica lag consistently under 1 second.
• Uptime improved from 99.90% to 99.99%.
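To put the uptime figures above in concrete terms, the short calculation below converts an availability percentage into an annual downtime budget. It assumes a 365-day year and is plain arithmetic, not a measurement from the system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes per year a service may be unavailable at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.90, 99.99):
    print(f"{pct:.2f}% -> {downtime_budget_minutes(pct):.1f} min/year")
```

Moving from 99.90% to 99.99% shrinks the yearly allowance from roughly 526 minutes to roughly 53 minutes. At the previous 5-minute recovery window, about ten incidents a year would use up the entire 99.99% budget.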
Terraform state locking across regions required careful S3 versioning and DynamoDB lock tables. We standardised naming conventions to avoid drift.
Integrate Lambda for automated failback once the primary returns healthy. Add a tertiary failover region in ap-southeast-2 to cover Asia Pacific.
Aurora PostgreSQL • Terraform • AWS Route 53 • CloudWatch Alarms • KMS • GitHub Actions