Amazon Aurora HA and DR Design Patterns for Global Resilience (DAT324)

Title

AWS re:Invent 2023 - Amazon Aurora HA and DR design patterns for global resilience (DAT324)

Summary

  • The session focused on building resilient systems using Amazon Aurora, emphasizing high availability (HA) and disaster recovery (DR) patterns.
  • Resilience is defined as the ability to recover from disruptions, dynamically acquire resources to meet demand, and mitigate issues such as transient network problems.
  • Availability and disaster recovery are the two pillars of resilience, with availability measured in nines (e.g., 99.99% uptime) and disaster recovery focusing on recovery time objective (RTO) and recovery point objective (RPO).
  • Aurora separates storage from compute; the storage layer keeps six copies of the data distributed across three Availability Zones for durability.
  • Aurora provides continuous backup to S3, allowing for point-in-time recovery within a retention window.
  • The volume clone feature creates test environments with production-like data without impacting production performance or doubling storage costs.
  • Multi-AZ deployments improve availability by adding database instances in separate AZs; durability is unaffected because it is already provided by the storage layer.
  • Read replicas can be used to offload read-only queries and scale out read performance.
  • Aurora Global Database replicates data asynchronously across regions, which lowers the RPO for a regional failure.
  • Write forwarding enables applications to perform writes in read-only regions by forwarding them to the primary region.
  • AWS Backup simplifies cross-region backup and replication processes.
  • The managed RPO parameter (rds.global_db_rpo) in Aurora PostgreSQL bounds replication lag across regions, so that potential data loss in a regional failure stays within a defined limit.
  • Account-level resilience can be achieved by copying backups to a separate AWS account.
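
The "nines" mentioned above translate directly into a yearly downtime budget. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name downtime_minutes_per_year is illustrative and not something from the session.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    # The unavailable fraction of the year, converted to minutes.
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

Under this assumption, a 99.99% target leaves roughly 52.6 minutes of downtime per year, which is the budget that any failover or recovery procedure has to fit inside.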

Insights

  • Aurora's design separates storage and compute, which underpins many HA and DR features, such as continuous backup and fast recovery.
  • The ability to create volume clones for testing or batch processing moves that work off the production cluster, improving its resilience without duplicating storage costs.
  • Multi-AZ deployments and read replicas are key strategies for achieving high availability and scaling read performance in Aurora.
  • Global database replication is crucial for achieving low RPO and ensuring data durability across multiple regions.
  • Write forwarding is a powerful feature that allows for global deployment of applications with read-write capabilities without complex application changes.
  • The managed RPO parameter (rds.global_db_rpo) suits applications with high-value transactions, because it keeps replication lag within a defined threshold.
  • Cross-account backup strategies provide an additional layer of resilience against account-specific issues, ensuring business continuity.
  • The session highlighted the importance of testing resilience strategies, such as using the switchover command to simulate region failovers and ensure the application can handle such events.
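
One way to make a switchover drill measurable is to run a small test write against the application at a fixed interval during the drill, then compute how long writes were unavailable and compare that with the RTO. The sketch below assumes such probe results have already been collected; Probe, observed_outage_seconds, and meets_rto are hypothetical names for illustration, not part of any AWS API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Probe:
    timestamp: float  # seconds since the drill started
    ok: bool          # True if the test write succeeded


def observed_outage_seconds(probes: Iterable[Probe]) -> float:
    """Return the longest span between a success and the next success
    that has at least one failed probe in between.

    Failures before the first success, or after the last one, are not
    counted because the outage boundaries are unknown in those cases.
    """
    longest = 0.0
    last_ok: Optional[float] = None
    in_outage = False
    for probe in sorted(probes, key=lambda p: p.timestamp):
        if probe.ok:
            if in_outage and last_ok is not None:
                longest = max(longest, probe.timestamp - last_ok)
            last_ok = probe.timestamp
            in_outage = False
        else:
            in_outage = True
    return longest


def meets_rto(probes: Iterable[Probe], rto_seconds: float) -> bool:
    """Check whether the observed outage stayed within the RTO."""
    return observed_outage_seconds(probes) <= rto_seconds
```

Measuring from the last successful probe rather than the first failed one gives a conservative (upper-bound) estimate of the outage, which seems the safer choice when validating an RTO.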