Improve Application Resilience with AWS Fault Injection Service (ARC317)

Title

AWS re:Invent 2023 - Improve application resilience with AWS Fault Injection Service (ARC317)

Summary

  • Adrian Hornsby, a principal engineer with the AWS Reliability team, and Iris, a senior product manager, presented on improving application resilience using AWS Fault Injection Service (formerly AWS Fault Injection Simulator).
  • They discussed the high cost of downtime for enterprises and the importance of resilience in maintaining a good reputation and avoiding financial losses.
  • The concept of resilience was broken down into four pillars: anticipating, monitoring, responding, and learning.
  • AWS's Resilience Lifecycle Framework was introduced, emphasizing the continuous process of improving system resilience.
  • The importance of fault isolation boundaries, such as AWS Regions and Availability Zones (AZs), was highlighted, along with the concept of static stability in system design.
  • AWS Fault Injection Service (FIS) was presented as a tool for resilience testing, allowing controlled experiments to inject faults into systems to uncover hidden issues and improve operational practices.
  • Iris introduced new features in FIS, including Scenarios for predefined experiment templates and multi-account experiments.
  • Two new scenarios were announced: "AZ Availability: Power Interruption" and "Cross-Region: Connectivity", designed to test multi-AZ and multi-region applications respectively.
  • The session concluded with a call to practice resilience and provided resources for further learning.
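
To make the idea of a controlled FIS experiment more concrete, here is a minimal sketch (not taken from the session) of an experiment template that stops one tagged EC2 instance and restarts it after five minutes. The helper name, logical target and action names, tag values, and ARNs are hypothetical placeholders. The dictionary keys, the `aws:ec2:stop-instances` action, and the `aws:cloudwatch:alarm` stop condition follow the FIS `CreateExperimentTemplate` request shape as far as I know it, so check them against the current API reference before use.

```python
import json
import uuid


def build_stop_instances_template(role_arn, alarm_arn, tag_key, tag_value,
                                  restart_after="PT5M"):
    """Build a request body for an FIS experiment template (hypothetical helper).

    The experiment stops one EC2 instance carrying the given tag and
    restarts it after `restart_after` (an ISO 8601 duration).
    """
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop one tagged EC2 instance, then restart it",
        "roleArn": role_arn,  # IAM role that FIS assumes to inject the fault
        # Guardrail: FIS halts the experiment if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # "oneInstance" is an arbitrary logical name chosen for this sketch.
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "oneInstance"},
            }
        },
        "tags": {"Purpose": "resilience-test"},
    }


if __name__ == "__main__":
    # Placeholder ARNs: replace with real ones from your own account.
    template = build_stop_instances_template(
        role_arn="arn:aws:iam::111122223333:role/example-fis-role",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
        tag_key="Env",
        tag_value="test",
    )
    print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("fis").create_experiment_template(**template)`. The stop condition and the `COUNT(1)` selection mode are what keep the experiment controlled: the fault touches a single instance and is halted if the alarm goes off.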

Insights

  • The cost of downtime is significant, with enterprises potentially losing hundreds of thousands to millions of dollars per hour of downtime.
  • Resilience is not just about technology; it involves culture, mechanisms, and tools.
  • The Resilience Lifecycle Framework is a holistic approach to improving system resilience, which can be entered at any stage.
  • Fault isolation boundaries are crucial for containing the impact of failures, and statically stable designs let systems absorb a failure and the resulting traffic shift without depending on control plane operations.
  • AWS FIS is a powerful tool for resilience testing, allowing users to simulate faults and improve their systems' reliability and performance.
  • The introduction of Scenarios in FIS simplifies the process for customers to start testing their applications' resilience.
  • The new multi-AZ and multi-region scenarios in FIS enable customers to test complex applications and ensure they can handle real-world failure modes.
  • Building resilience is an ongoing process that requires regular practice and testing to ensure systems can withstand and recover from failures.
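
The static stability point can be illustrated with a small capacity calculation. This is a sketch under one common assumption, not a formula quoted from the session: to survive the loss of an AZ without launching new instances (a control plane operation), the remaining AZs must already hold enough capacity for peak load. The function name and parameters are hypothetical.

```python
import math


def instances_per_az(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to pre-provision in each AZ so that peak load is still
    covered after `az_failures_tolerated` AZs are lost, with no scaling."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    # Each surviving AZ must carry an equal share of the full peak load.
    return math.ceil(peak_instances / surviving)


if __name__ == "__main__":
    peak = 6  # instances needed to serve peak traffic
    for azs in (2, 3):
        per_az = instances_per_az(peak, azs)
        print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

Under these assumptions, a workload that needs 6 instances at peak requires 12 instances across two AZs but only 9 across three, which is one reason spreading across more AZs lowers the cost of static stability.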