Title
AWS re:Invent 2023 - 5 things you should know about resilience at scale (ARC327)
Summary
-
Dependencies and Modes: Dependencies are essential to service-oriented architecture, but the points where they intersect can become failure points. Modes are significant changes in application behavior, and they can be problematic at scale. Route 53's dependency on S3 for DNS updates is the example given: a modal shift caused problems when the database fell behind. The lesson is to avoid fallback paths that can harm resilience.
-
Blast Radius: The concept of blast radius involves understanding the potential impact of a change or failure within a system. AWS emphasizes preparing for failures and limiting their impact. An example is using IAM policies to control access to S3, where a bad policy update can cause widespread outages. Strategies like partial deployments and rollback-first approaches can help reduce blast radius.
-
Queues: Queues are useful for decoupling systems, but during an outage they can build up a backlog that lengthens recovery time. Techniques like sidelining and back pressure help keep queues manageable under high load.
-
Errors: Classifying errors correctly (4xx status codes for client errors, 5xx for server errors) is crucial for detecting outages quickly and recovering from them. CloudWatch Contributor Insights can provide valuable information on error patterns and customer behavior.
-
Retries: Retries are a response to transient errors but can amplify load during outages, leading to slower recovery. Strategies to mitigate this include being aware of the retry behavior of intermediate services and considering proactive retries when necessary, as demonstrated by the Route 53 Resolver's dual DNS resolver approach.
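The Queues item above names sidelining and back pressure without showing what they look like. The following is a minimal sketch, not the speakers' implementation: the class `SideliningQueue`, its `max_depth` and `max_age_s` parameters, and the `QueueFull` exception are all hypothetical names chosen for illustration. It assumes that "sidelining" means moving stale messages to a secondary queue so fresh work is served first, and that "back pressure" means rejecting new work once the queue is full.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    enqueued_at: float


class QueueFull(Exception):
    """Raised so the producer sees back pressure instead of silent buffering."""


@dataclass
class SideliningQueue:
    max_depth: int = 1000      # hypothetical bound on the main queue
    max_age_s: float = 30.0    # hypothetical freshness limit
    main: deque = field(default_factory=deque)
    sideline: deque = field(default_factory=deque)

    def put(self, payload, now=None):
        now = time.monotonic() if now is None else now
        # Back pressure: refuse new work rather than grow without bound.
        if len(self.main) >= self.max_depth:
            raise QueueFull("main queue is full; slow down or shed load")
        self.main.append(Message(payload, now))

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        # Sidelining: move stale messages aside so fresh work is served first.
        while self.main and now - self.main[0].enqueued_at > self.max_age_s:
            self.sideline.append(self.main.popleft())
        if self.main:
            return self.main.popleft()
        # Drain the sidelined backlog only when there is no fresh work.
        if self.sideline:
            return self.sideline.popleft()
        return None
```

With this design, a backlog that built up during an outage does not delay new requests once the system recovers; the old messages are still processed, but only when the main queue is idle.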
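The Errors item above turns on separating client-side faults from server-side faults. As a rough illustration, the sketch below classifies HTTP status codes and counts the top contributors to one error class. It is plain Python and does not call CloudWatch or Contributor Insights; the function names `classify`, `server_fault_rate`, and `top_error_contributors`, and the `(client_id, status)` record shape, are assumptions made for this example.

```python
from collections import Counter


def classify(status: int) -> str:
    """Map an HTTP status code to a coarse error class."""
    if 100 <= status < 400:
        return "ok"
    if 400 <= status < 500:
        return "client_error"   # caller's fault: should not page the service owner
    if 500 <= status < 600:
        return "server_error"   # service's fault: the signal to alarm on
    return "unknown"


def server_fault_rate(statuses) -> float:
    """Fraction of requests that failed because of the service itself."""
    counts = Counter(classify(s) for s in statuses)
    total = sum(counts.values())
    return counts["server_error"] / total if total else 0.0


def top_error_contributors(records, kind: str, n: int = 3):
    """records: iterable of (client_id, status) pairs (hypothetical shape)."""
    counts = Counter(
        client for client, status in records if classify(status) == kind
    )
    return counts.most_common(n)
```

If a server-side failure is reported as a 4xx, `server_fault_rate` stays flat during an outage and an alarm built on it never fires, which is why the classification matters for detection.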
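To make the load amplification in the Retries item concrete: if every layer in a call chain makes up to `a` attempts, the deepest service can see up to `a ** layers` requests for a single original call. The sketch below computes that worst case and shows one common mitigation, a capped retry loop with exponential backoff and full jitter. This is a general technique, not something the summary attributes to the speakers; `TransientError`, `call_with_retries`, and all parameter names and defaults are hypothetical.

```python
import random
import time


def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on requests reaching the deepest layer per original call."""
    return attempts_per_layer ** layers


class TransientError(Exception):
    """Stand-in for an error that is worth retrying."""


def call_with_retries(op, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call op(), retrying TransientError with capped backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error instead of piling on
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)
```

For example, three attempts at each of three layers gives `worst_case_amplification(3, 3) == 27`, which is why retrying at only one layer of the stack is often preferable to retrying at all of them.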
Insights
-
Operational Experience: The speakers emphasize the importance of operational experience in building resilient systems. They suggest that while some lessons can be learned the hard way, sharing experiences can help others avoid common pitfalls.
-
Resilience at Scale: The talk highlights that systems behave differently at scale than they did at earlier stages. This requires a different approach to resilience, focusing on understanding dependencies, managing blast radius, handling queues, classifying errors correctly, and implementing effective retry strategies.
-
Customer-Centric Approach: AWS's approach to resilience is customer-centric, with a focus on minimizing the impact of outages on customers. This is evident in their strategies for managing blast radius and queues, as well as their emphasis on error classification and retries.
-
Monitoring and Alarms: The importance of monitoring and setting appropriate alarms is a recurring theme. The speakers discuss using CloudWatch and other AWS tools to detect anomalies and respond quickly to issues.
-
Proactive Measures: AWS advocates measures such as canary policy deployments, rollback-first approaches, and proactive retries so that systems can recover quickly from failures and maintain high availability.
-
Trade-offs in Resilience: The speakers acknowledge that there are trade-offs in building resilient systems, such as sacrificing some availability for faster recovery or doubling capacity to allow for constant retries. These decisions are made based on the criticality of the service and the potential impact on customers.
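The partial-deployment and rollback-first ideas mentioned under Blast Radius and Proactive Measures can be sketched as a staged rollout loop. This is an illustrative outline under stated assumptions, not AWS's deployment tooling: `staged_rollout` and its `apply`, `healthy`, and `rollback` callbacks are hypothetical, and a real system would add bake time and monitoring between stages.

```python
def staged_rollout(stages, apply, healthy, rollback):
    """Apply a change stage by stage; on the first unhealthy stage, roll back.

    stages:   list of lists of targets, smallest (canary) stage first
    apply:    callable(target) that deploys the change to one target
    healthy:  callable(stage) returning True if the stage looks healthy
    rollback: callable(target) that reverts the change on one target
    Returns True if every stage was deployed, False if the change was reverted.
    """
    applied = []
    for stage in stages:
        for target in stage:
            apply(target)
            applied.append(target)
        if not healthy(stage):
            # Rollback first: revert everything before investigating the cause.
            for target in reversed(applied):
                rollback(target)
            return False
    return True
```

Because later stages are never touched once an earlier stage fails its health check, the blast radius of a bad change is limited to the targets deployed so far.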