Title

AWS re:Invent 2023 - Using Zonal AutoShift to Automatically Recover from an AZ Impairment (ARC309)

Summary

Deepak Suri, general manager for Application Recovery Controller at AWS, introduced the session and the concept of automatically recovering from AZ impairments with the newly launched Zonal AutoShift feature.
Gavin McCullough, with 12 years at AWS, shared the history of Amazon's use of multiple data centers and the evolution of Availability Zones (AZs).
Gavin discussed the importance of recovery-oriented computing, which focuses on quickly shifting away from failures rather than fixing them immediately.
He explained the difference between hard failures and gray failures and the strategies AWS uses to handle AZ resilience, including pre-scaling and minimizing coordination between AZs.
Gavin introduced the concept of shared responsibility in AZ resilience, where AWS manages certain aspects while customers manage others, such as ensuring sufficient EC2 instance capacity.
Deepak presented Zonal Shift, a feature that allows customers to shift traffic away from an impaired AZ, and the new Zonal AutoShift, which automates this process.
Practice Run, which simulates an AZ impairment every week to ensure preparedness, was introduced as a mandatory feature alongside AutoShift.
Gavin concluded with key takeaways, emphasizing the importance of AZ-resilient design, managed regional services, pre-scaling, and regular practice to ensure swift and safe recovery from AZ impairments.
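
As an illustration of the manual Zonal Shift that Deepak described, the following is a minimal sketch of starting a shift away from one AZ. It assumes the caller passes in a boto3 client created with `boto3.client("arc-zonal-shift")` and uses that client's `start_zonal_shift` operation. The helper names, the example ARN and AZ ID, and the default duration are hypothetical and not taken from the session.

```python
import re

# expiresIn is a number followed by "m" (minutes) or "h" (hours), e.g. "30m" or "2h".
# This local check is a convenience of the sketch, not part of the AWS SDK.
_EXPIRES_IN = re.compile(r"^[1-9][0-9]*(m|h)$")


def build_zonal_shift_request(resource_arn, away_from_az_id, expires_in="1h",
                              comment="Shifting away from impaired AZ"):
    """Build the keyword arguments for a StartZonalShift call (hypothetical helper)."""
    if not _EXPIRES_IN.match(expires_in):
        raise ValueError("expires_in must look like '30m' or '2h'")
    return {
        "resourceIdentifier": resource_arn,  # e.g. a load balancer ARN
        "awayFrom": away_from_az_id,         # AZ ID such as "use1-az1"
        "expiresIn": expires_in,
        "comment": comment,
    }


def shift_away(client, resource_arn, away_from_az_id, expires_in="1h"):
    """Start a zonal shift using the supplied arc-zonal-shift client.

    Assumption: `client` behaves like boto3.client("arc-zonal-shift").
    """
    request = build_zonal_shift_request(resource_arn, away_from_az_id, expires_in)
    return client.start_zonal_shift(**request)
```

Passing the client in as a parameter keeps the sketch free of AWS credentials and makes it easy to exercise with a stub during a practice exercise.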

The introduction of Zonal AutoShift represents a significant advancement in AWS's ability to provide high availability and resilience for customer applications.
AWS's approach to recovery-oriented computing, which prioritizes quick recovery over immediate repair, is a critical strategy for maintaining uptime and customer trust.
The session highlights the shared responsibility model in cloud computing, in which AWS provides the infrastructure and services while customers are responsible for their application-level decisions and configurations.
Pre-scaling, meaning enough capacity is provisioned in advance for the remaining AZs to absorb shifted traffic instead of relying on auto-scaling during an event, is a strategic choice that customers need to make based on their business needs and tolerance for downtime.
Regular testing and practice runs are essential for ensuring that recovery strategies are effective and that teams are prepared for real-world incidents.
The session underscores the importance of designing applications with multi-AZ resilience in mind, leveraging AWS's managed services to handle potential AZ impairments.
The introduction of Practice Run as a mandatory feature with AutoShift ensures that customers are not only equipped with the tools to handle AZ impairments but are also regularly testing and validating their resilience strategies.
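
The pre-scaling trade-off mentioned above can be made concrete with a little arithmetic. The sketch below assumes load is spread evenly across AZs and that the goal is to survive the loss of exactly one AZ at peak. The function names and the example numbers are illustrative and not from the session.

```python
import math


def prescaled_instances_per_az(peak_instances, az_count):
    """Instances to run in each AZ so the surviving AZs can carry peak load alone.

    Assumes even load distribution and the loss of a single AZ.
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    if peak_instances < 0:
        raise ValueError("peak_instances must be non-negative")
    surviving_azs = az_count - 1
    return math.ceil(peak_instances / surviving_azs)


def total_prescaled_instances(peak_instances, az_count):
    """Total fleet size across all AZs under the pre-scaling assumption."""
    return prescaled_instances_per_az(peak_instances, az_count) * az_count


# Example: a service that needs 12 instances at peak.
print(total_prescaled_instances(12, 3))  # 6 per AZ, 18 in total
print(total_prescaled_instances(12, 2))  # 12 per AZ, 24 in total
```

Under these assumptions, a three-AZ deployment carries 50% extra capacity while a two-AZ deployment carries 100% extra, which is the cost customers weigh against their tolerance for downtime.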