Title
AWS re:Invent 2023 - Improve application resilience with AWS Fault Injection Service (ARC317)
Summary
- Adrian Hornsby, a principal engineer with the AWS Reliability team, and Iris, a senior product manager, presented on improving application resilience using AWS Fault Injection Service (formerly AWS Fault Injection Simulator).
- They discussed the high cost of downtime for enterprises and the importance of resilience in maintaining a good reputation and avoiding financial losses.
- The concept of resilience was broken down into four pillars: anticipating, monitoring, responding, and learning.
- AWS's Resilience Lifecycle Framework was introduced, emphasizing the continuous process of improving system resilience.
- The importance of fault isolation boundaries, such as regions and availability zones (AZs), was highlighted, along with the concept of static stability in system design.
- AWS Fault Injection Service (FIS) was presented as a tool for resilience testing, allowing controlled experiments to inject faults into systems to uncover hidden issues and improve operational practices.
- Iris introduced new features in FIS, including Scenarios for predefined experiment templates and multi-account experiments.
- Two new scenarios were announced, "AZ Availability: Power Interruption" and "Cross-Region: Connectivity", designed to test multi-AZ and multi-region applications respectively.
- The session concluded with a call to practice resilience and provided resources for further learning.
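
As a rough illustration of what a controlled FIS experiment looks like in practice, the sketch below builds a request body shaped like an FIS experiment template: one target, one action, and a stop condition that halts the experiment if an alarm fires. It was not shown in the session. The function name, tag values, role ARN, and alarm ARN are hypothetical placeholders, and the choice of action (stopping a single tagged EC2 instance) is an assumption made for the example.

```python
import json


def build_stop_instances_template(role_arn, alarm_arn, tag_key="Env", tag_value="test"):
    """Return a dict shaped like an FIS experiment template request.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "description": "Stop one tagged EC2 instance and restart it after 5 minutes",
        "roleArn": role_arn,
        # Guardrail: FIS stops the experiment if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Target: pick exactly one EC2 instance carrying the given tag.
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        # Action: stop the selected instance, then start it again after 5 minutes.
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


# Hypothetical ARNs, for illustration only.
template = build_stop_instances_template(
    role_arn="arn:aws:iam::111122223333:role/ExampleFisRole",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:ExampleAlarm",
)
print(json.dumps(template, indent=2))
```

A dict like this could then be passed as keyword arguments to the `create_experiment_template` call of an FIS client in an AWS SDK. The stop condition is the part that keeps the experiment controlled: the fault is injected only while the chosen alarm stays healthy.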
Insights
- The cost of downtime is significant: enterprises can lose hundreds of thousands to millions of dollars per hour.
- Resilience is not just about technology; it involves culture, mechanisms, and tools.
- The Resilience Lifecycle Framework is a holistic approach to improving system resilience, which can be entered at any stage.
- Fault isolation boundaries are crucial for minimizing the impact of failures, and statically stable designs let systems absorb the traffic shifted by a failure without depending on control plane operations.
- AWS FIS is a powerful tool for resilience testing, allowing users to simulate faults and improve their systems' reliability and performance.
- The introduction of Scenarios in FIS simplifies the process for customers to start testing their applications' resilience.
- The new multi-AZ and multi-region scenarios in FIS enable customers to test complex applications and ensure they can handle real-world failure modes.
- Building resilience is an ongoing process that requires regular practice and testing to ensure systems can withstand and recover from failures.
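
One common way to reason about the static stability idea mentioned above is to pre-provision enough capacity in each AZ that losing any single AZ still leaves enough to serve peak load, so nothing new has to be launched during the failure. The sketch below works through that arithmetic. The function name and the example numbers are hypothetical, and the assumption that load is spread evenly across AZs is a simplification that was not stated in the session.

```python
import math


def instances_per_az(peak_instances, az_count):
    """Instances to run in each AZ so that the remaining AZs can carry
    peak load after any single AZ is lost (assumes even load spread)."""
    if az_count < 2:
        raise ValueError("surviving an AZ loss needs at least two AZs")
    # After one AZ fails, az_count - 1 AZs must hold the full peak capacity.
    return math.ceil(peak_instances / (az_count - 1))


# Hypothetical workload: 12 instances needed at peak, spread over 3 AZs.
per_az = instances_per_az(peak_instances=12, az_count=3)
total = per_az * 3
print(per_az, total)  # prints "6 18"
```

In this example the fleet runs 18 instances to serve a 12-instance peak, which is 50% extra capacity paid for in advance. That cost is what buys independence from the control plane during an AZ failure.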