Reducing Your Area of Impact and Surviving Difficult Days (ARC306)

Title

AWS re:Invent 2023 - Reducing your area of impact and surviving difficult days (ARC306)

Summary

  • The session focused on strategies for reducing the impact of system impairments on critical workloads.
  • The presenters, Bruno Emmer and Byron Arneo, used a case study of Alice's coffee shop to illustrate the evolution from a small business to a larger enterprise and the corresponding need for resilient IT systems.
  • Key concepts discussed included resilience, fault tolerance, capacity, timely and correct output, fault isolation, and categories of failure such as single points of failure, additional load, excessive latency, misconfigurations, and bugs.
  • The talk covered the transition from monolithic architectures to microservices, the use of AWS fault isolation boundaries such as Regions and Availability Zones, and the adoption of cell-based architectures and shuffle sharding to further reduce the area of impact.
  • The AWS Resilience Lifecycle Framework was introduced. Its stages are set objectives, design and implement, evaluate and test, operate, and respond and learn.
  • AWS tools such as AWS Resilience Hub, AWS Elastic Disaster Recovery, AWS Backup, Amazon Route 53 Application Recovery Controller, and the AWS Solutions Library were recommended for improving resilience.
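
The cell-based idea mentioned in the summary can be illustrated with a minimal sketch. This is not the presenters' implementation: the cell names, the `cell_for` and `affected_customers` helpers, and the hash-modulo mapping are all assumptions chosen for brevity. It only shows that when each customer is pinned to one cell, an impaired cell affects just the customers mapped to it.

```python
import hashlib

# Hypothetical cell names; a real system would keep this mapping in a thin routing layer.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to exactly one cell using a stable hash of its ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_customers(customers: list[str], failed_cell: str) -> list[str]:
    """Return only the customers whose requests are routed to the failed cell."""
    return [c for c in customers if cell_for(c) == failed_cell]


if __name__ == "__main__":
    stores = [f"store-{i}" for i in range(200)]
    impacted = affected_customers(stores, "cell-b")
    print(f"{len(impacted)} of {len(stores)} customers are affected if cell-b is impaired")
```

Hash-modulo routing is a simplification: adding or removing a cell remaps many customers, so a production router would more likely use an explicit customer-to-cell table.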

Insights

  • Microservices Architecture: Decomposing applications into microservices can reduce the impact of failures by isolating them to specific features.
  • Fault Isolation Boundaries: Leveraging AWS Regions and Availability Zones can help avoid shared-fate scenarios and improve resilience.
  • Cell-based Architectures: Implementing cell-based architectures can provide even greater isolation and limit the impact of failures to smaller segments of the business.
  • Shuffle Sharding: This advanced technique can further reduce the area of impact, potentially down to individual users, by assigning each customer a small, pseudo-randomly chosen combination of servers so that very few customers share exactly the same set.
  • Resilience Lifecycle Framework: AWS provides a structured approach to building and maintaining resilient systems, emphasizing continuous improvement and learning.
  • AWS Resilience Tools: AWS offers a suite of tools designed to help customers achieve their resilience objectives, including automated backups, disaster recovery, and traffic management during impairments.
  • Practical Application: The case study of Alice's coffee shop effectively illustrates how businesses can evolve their IT infrastructure to support growth and maintain resilience in the face of system impairments.
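
The shuffle sharding insight can be made concrete with a short sketch. It is a simplified illustration, not the method shown in the session: the fleet of eight workers, the shard size of two, and the helper names (`shuffle_shard`, `possible_shards`, `fully_overlapping`) are all hypothetical. Each customer gets a stable, pseudo-random pair of workers, so a misbehaving customer fully overlaps only with customers that drew the identical pair.

```python
import hashlib
import random
from math import comb

WORKERS = [f"worker-{i}" for i in range(8)]  # hypothetical fleet of 8 workers
SHARD_SIZE = 2                               # hypothetical workers per customer


def shuffle_shard(customer_id, workers=WORKERS, shard_size=SHARD_SIZE):
    """Pick a stable pseudo-random subset of workers for one customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))  # seeded by customer ID
    return frozenset(rng.sample(workers, shard_size))


def possible_shards(workers=WORKERS, shard_size=SHARD_SIZE):
    """Number of distinct worker combinations: C(len(workers), shard_size)."""
    return comb(len(workers), shard_size)


def fully_overlapping(customers, noisy_customer):
    """Customers whose entire shard matches the noisy customer's shard."""
    bad = shuffle_shard(noisy_customer)
    return [c for c in customers if c != noisy_customer and shuffle_shard(c) == bad]


if __name__ == "__main__":
    stores = [f"store-{i}" for i in range(500)]
    print("distinct shards:", possible_shards())
    print("fully impacted by store-0:", len(fully_overlapping(stores, "store-0")))
```

With 8 workers and a shard size of 2 there are C(8, 2) = 28 distinct shards, compared with 4 shards under plain non-overlapping sharding of the same fleet. Customers that share only one worker with the noisy customer can still be served by retrying against their other worker.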