Title
AWS re:Invent 2022 - Operating highly available Multi-AZ applications (ARC329)
Summary
- Gavin McCullagh, who has spent 11 years at Amazon, discusses strategies for building highly available Multi-AZ applications.
- He introduces Zonal Shift, a new feature of Amazon Route 53 Application Recovery Controller that lets an application recover from single-zone issues by shifting traffic away from the affected Availability Zone.
- The talk covers building resilient systems, handling single points of failure, and the importance of redundancy across multiple servers and Availability Zones.
- McCullagh stresses that human-initiated changes, such as code deployments and configuration changes, are also potential sources of failure.
- He explains how AWS designs Availability Zones so that a failure in one zone does not affect the others, and shares real-life examples from AWS services such as Route 53.
- The session includes a demonstration of Zonal Shift using a toy application and concludes with best practices for using Zonal Shift in practice.
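The demonstration itself is not reproduced in these notes, but starting a zonal shift programmatically can be sketched as follows. This is a minimal illustration, not the code shown in the session: it assumes the boto3 `arc-zonal-shift` client and its `start_zonal_shift` call, and the helper names, the load balancer ARN, and the zone ID are hypothetical placeholders. The request builder is kept separate so it can be checked without AWS credentials.

```python
import re

# Duration format accepted by the StartZonalShift API: a positive integer
# followed by "m" (minutes) or "h" (hours), for example "90m" or "4h".
_EXPIRES_IN = re.compile(r"^[1-9][0-9]*[mh]$")


def build_zonal_shift_request(resource_arn, away_from_az_id, expires_in, comment):
    """Validate inputs and return keyword arguments for start_zonal_shift.

    Hypothetical helper: only the four keys in the returned dict come from
    the API; the function name and validation are illustrative.
    """
    if not _EXPIRES_IN.match(expires_in):
        raise ValueError(f"expires_in must look like '90m' or '4h', got {expires_in!r}")
    if not comment:
        raise ValueError("a comment explaining the shift is required")
    return {
        "resourceIdentifier": resource_arn,  # e.g. the ARN of a load balancer
        "awayFrom": away_from_az_id,         # zone ID such as "use1-az1"
        "expiresIn": expires_in,             # shifts are temporary by design
        "comment": comment,
    }


def start_zonal_shift(request):
    """Send the request. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("arc-zonal-shift")
    return client.start_zonal_shift(**request)


if __name__ == "__main__":
    # Placeholder ARN and zone ID; substitute real values before calling AWS.
    request = build_zonal_shift_request(
        "arn:aws:elasticloadbalancing:us-east-1:111122223333:"
        "loadbalancer/app/example/0123456789abcdef",
        "use1-az1",
        "2h",
        "Elevated errors in one zone; shifting away while investigating",
    )
    print(request)
```

Because the shift carries an expiry, traffic returns to the zone on its own unless the shift is extended or cancelled, which fits the idea of recovering first and diagnosing afterwards.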
Insights
- Zonal Shift is designed to be a reliable feature, available at no additional charge, for recovering from zone-specific problems before the root cause is understood.
- Deploying and operating independent replicas in different Availability Zones is what makes quick recovery possible: when one replica fails, traffic can move to the others.
- AWS emphasizes monitoring each Availability Zone separately so that a problem confined to one zone is detected and addressed promptly.
- The talk suggests turning off cross-zone load balancing so that zonal replicas are more clearly defined and interactions between zones are reduced, although this requires careful attention to load distribution and capacity in each zone.
- Deep health checks and a minimum number of healthy targets are recommended for handling gray failures, which are harder to detect and whose definition depends on the application's requirements.
- Best practices include provisioning enough capacity to absorb the loss of one Availability Zone, testing Zonal Shift in advance, and automating it only with caution because of potential edge cases.
- Running monitoring from a different Region is suggested, both to get a more accurate picture of the customer experience and to decouple monitoring from the application.
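Two of the points above lend themselves to simple arithmetic: how much capacity each zone needs so that the remaining zones can carry peak load, and how per-zone metrics can single out one unhealthy zone. The sketch below is illustrative only; the function names, the 5% error floor, and the 3x ratio are assumptions chosen for the example, not values from the talk.

```python
import math


def per_zone_capacity(peak_load, zones):
    """Capacity each zone needs so the other zones can absorb the loss of one.

    If one of `zones` zones is lost, the remaining (zones - 1) must together
    serve `peak_load`, so each needs peak_load / (zones - 1), rounded up.
    """
    if zones < 2:
        raise ValueError("surviving a zone loss needs at least two zones")
    return math.ceil(peak_load / (zones - 1))


def suspect_zone(error_rates, min_rate=0.05, ratio=3.0):
    """Return the one zone whose error rate stands out, or None.

    `error_rates` maps a zone ID to its error rate (0.0 to 1.0). A zone is
    flagged only if its rate is at least `min_rate` AND at least `ratio`
    times the worst of the other zones. Both thresholds are assumptions.
    If every zone is failing, nothing is flagged, because shifting away
    from a single zone would not help.
    """
    if len(error_rates) < 2:
        return None
    worst = max(error_rates, key=error_rates.get)
    baseline = max(rate for zone, rate in error_rates.items() if zone != worst)
    if error_rates[worst] >= min_rate and error_rates[worst] >= ratio * baseline:
        return worst
    return None


if __name__ == "__main__":
    # 300 units of peak load across 3 zones: each zone needs 150 units,
    # so the fleet totals 450, which is 50% above peak.
    print(per_zone_capacity(300, 3))
    print(suspect_zone({"use1-az1": 0.20, "use1-az2": 0.01, "use1-az3": 0.01}))
```

The same formula shows why more zones reduce the overhead: two zones need 100% spare capacity, three zones need 50%. The outlier check is deliberately conservative, in line with the caution about automating shifts.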