Operating Highly Available Multi-AZ Applications (ARC329)

Title

AWS re:Invent 2022 - Operating highly available Multi-AZ applications (ARC329)

Summary

  • Gavin McCullagh, who has spent 11 years at Amazon, discusses strategies for building highly available Multi-AZ applications.
  • He introduces Zonal Shift, a then-new feature of Amazon Route 53 Application Recovery Controller that lets operators recover from single-zone issues by shifting traffic away from the affected zone.
  • The talk covers building resilient systems, handling single points of failure, and the importance of redundancy across multiple servers and availability zones.
  • McCullagh emphasizes that human activity, such as code deployments and configuration changes, must also be treated as a potential source of failure.
  • He explains how AWS uses availability zones to keep failures in one zone from affecting others, and shares real-life examples from AWS services such as Route 53.
  • The session includes a demonstration of Zonal Shift using a toy application and concludes with best practices for using Zonal Shift in real-life scenarios.
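
The demonstration itself is not reproduced in this summary. As a rough illustration of what starting a zonal shift can look like programmatically, here is a minimal Python sketch. It assumes the boto3 "arc-zonal-shift" client and its start_zonal_shift call; the load balancer ARN, the zone ID "use1-az1", and the helper names are hypothetical placeholders rather than anything shown in the talk.

```python
def build_zonal_shift_request(resource_arn, away_from_az_id,
                              expires_in="1h", comment="manual zonal shift"):
    """Build keyword arguments for a StartZonalShift call.

    resource_arn and away_from_az_id are placeholders supplied by the caller;
    away_from_az_id is an Availability Zone ID (for example "use1-az1").
    """
    # Accept only simple durations such as "30m" or "2h" in this sketch.
    valid_unit = bool(expires_in) and expires_in[-1] in ("m", "h")
    if not valid_unit or not expires_in[:-1].isdigit():
        raise ValueError("expires_in must look like '30m' or '2h'")
    return {
        "resourceIdentifier": resource_arn,
        "awayFrom": away_from_az_id,
        "expiresIn": expires_in,
        "comment": comment,
    }


def start_zonal_shift(request):
    """Send the request to AWS (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    client = boto3.client("arc-zonal-shift")
    return client.start_zonal_shift(**request)


# Hypothetical example values, for illustration only.
example = build_zonal_shift_request(
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/0123456789abcdef",
    "use1-az1",
    expires_in="2h",
    comment="elevated errors in one zone",
)
```

Keeping the request builder separate from the network call makes it possible to validate and review the parameters before any traffic is actually shifted. Giving the shift an expiry means it ends on its own if nobody extends or cancels it.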

Insights

  • Zonal Shift is designed as a reliable feature, offered at no additional charge, for moving traffic away from an impaired zone before the root cause is understood.
  • Deploying and operating independent replicas in different availability zones is what makes quick recovery possible: when one zone fails, traffic can move to the others.
  • AWS emphasizes the importance of monitoring each availability zone separately to detect and address issues promptly.
  • The talk suggests turning off cross-zone load balancing so that each zone's replica is clearly defined and zones interact less; the trade-off is that load distribution and per-zone capacity then need careful planning.
  • Deep health checks and minimum-healthy-target thresholds are recommended for handling gray failures, which are hard to detect and whose definition depends on what the application requires.
  • Best practices include provisioning enough capacity to absorb the loss of one availability zone, testing Zonal Shift in advance, and approaching automation with caution because of potential edge cases.
  • Running monitoring from a different region is suggested for a more accurate picture of the customer experience and to decouple monitoring from the application.
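
Two of the insights above, capacity headroom and per-zone monitoring, come down to simple arithmetic. The Python sketch below shows one way to express them. The function names, the 5% minimum error rate, and the 3x ratio are illustrative assumptions, not values from the talk.

```python
import math


def instances_per_zone(peak_instances_needed, zone_count):
    """Instances to run in each zone so that the remaining zones
    still cover peak load after one zone is lost."""
    if zone_count < 2:
        raise ValueError("need at least two zones to survive losing one")
    return math.ceil(peak_instances_needed / (zone_count - 1))


def find_outlier_zone(error_rates, min_rate=0.05, ratio=3.0):
    """Return the zone whose error rate is clearly worse than every other
    zone, or None if no single zone stands out.

    error_rates maps a zone ID to an error rate between 0 and 1.
    min_rate and ratio are arbitrary thresholds chosen for this sketch.
    """
    if len(error_rates) < 2:
        return None
    worst = max(error_rates, key=error_rates.get)
    worst_rate = error_rates[worst]
    # Compare against the worst of the *other* zones.
    baseline = max(rate for zone, rate in error_rates.items() if zone != worst)
    if worst_rate >= min_rate and worst_rate >= ratio * baseline:
        return worst
    return None


# With a peak need of 12 instances across 3 zones, each zone runs 6,
# so any two surviving zones still provide 12.
per_zone = instances_per_zone(12, 3)

# One zone is clearly unhealthy compared with the other two.
suspect = find_outlier_zone({"use1-az1": 0.20, "use1-az2": 0.01, "use1-az3": 0.02})
```

When every zone shows a similar error rate, find_outlier_zone returns None. That case suggests the problem is not confined to one zone, so shifting away from a single zone is unlikely to help.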