Title
AWS re:Invent 2022 - Building modern apps: Architecting for observability & resilience (ARC217-L)
Summary
- Francessca Vasquez, VP at AWS, opens the session by dedicating it to David Grimm and uses the James Webb Space Telescope as an example of resilience and observability.
- Observability and resilience are critical elements of the AWS Well-Architected Framework, specifically its reliability and operational excellence pillars.
- Shaown Nandi, Director of Solutions Architecture at AWS, discusses the importance of building systems that are highly available and can recover from rare failure scenarios, and introduces the concept of continuous resilience.
- The session highlights the AWS shared responsibility model for resilience: AWS is responsible for the resilience of the cloud, and customers are responsible for the resilience of their workloads in the cloud.
- AWS's internal service ownership model, Operational Readiness Review (ORR) process, and Correction of Error (COE) process are explained.
- Four key areas of focus for resilience in the cloud are discussed: anticipating, monitoring, responding, and learning.
- AWS's investment in resilience services, including AWS Fault Injection Simulator, AWS Resilience Hub, and Amazon Route 53 Application Recovery Controller, is mentioned.
- Will Meyer from Capital One and Kim Weiland from FINRA share their experiences and best practices in building resilient systems on AWS.
- The session concludes with strategies for continuously improving resilience, including the AWS Well-Architected Framework, game days, AWS Infrastructure Event Management, and other AWS resources.
Insights
- The James Webb Space Telescope serves as a powerful metaphor for the importance of resilience and observability in systems that cannot afford failure.
- AWS emphasizes the shared responsibility model, which is crucial for customers to understand their role in maintaining the resilience of their applications.
- The session introduces the concept of "continuous resilience": ongoing automation, testing, and observability that keep systems robust and able to handle failures.
- AWS's internal practices, such as the ORR and COE processes, demonstrate a culture of resilience and continuous improvement that customers can learn from.
- The session highlights the importance of anticipating potential failures, monitoring systems effectively, responding quickly to incidents, and learning from each event to improve future resilience.
- AWS provides a suite of tools and services designed to help customers build and maintain resilient systems, including the AWS Fault Injection Simulator and AWS Resilience Hub.
- Customer stories from Capital One and FINRA provide real-world examples of how large organizations are implementing resilience and observability on AWS.
- The session underscores the importance of regular testing, such as game days, to ensure that teams are prepared to handle failures in production environments.
- The AWS Well-Architected Framework is a recurring theme, serving as a guide for customers to build and evaluate their workloads for resilience and operational excellence.
- The session concludes with a message that resilience is a journey, not a destination, and AWS is committed to partnering with customers to continuously improve their resilience posture.