Title

AWS re:Invent 2022 - Building modern apps: Architecting for observability & resilience (ARC217-L)

Summary

Francessca Vasquez, VP at AWS, opens the session by dedicating it to David Grimm and presenting the James Webb Space Telescope as an example of resilience and observability.
Observability and resilience are critical elements of the AWS Well-Architected Framework, specifically its Reliability and Operational Excellence pillars.
Shaown Nandi, Director of Solutions Architecture at AWS, discusses the importance of building systems that are highly available and can recover from rare failure scenarios, and introduces the concept of continuous resilience.
AWS's shared responsibility model is highlighted, where AWS is responsible for the resilience of the cloud, and customers are responsible for the resilience of their workloads in the cloud.
AWS's internal service ownership model, operational readiness review (ORR) process, and Correction of Error (COE) process are explained.
Four key areas of focus for resilience in the cloud are discussed: anticipating, monitoring, responding, and learning.
AWS's investment in resilient services, including the AWS Fault Injection Simulator, AWS Resilience Hub, and Amazon Route 53 Application Recovery Controller, is mentioned.
Will Meyer from Capital One and Kim Weiland from FINRA share their experiences and best practices in building resilient systems on AWS.
The session concludes with strategies for continuously improving resilience, including the AWS Well-Architected Framework, game days, AWS Infrastructure Event Management, and other AWS resources.

Insights

The James Webb Space Telescope serves as a powerful metaphor for the importance of resilience and observability in systems that cannot afford failure.
AWS emphasizes the shared responsibility model, which is crucial for customers to understand their role in maintaining the resilience of their applications.
The session introduces the concept of "continuous resilience": ongoing automation, testing, and observability that keep systems robust and able to handle failures.
AWS's internal practices, such as the ORR and COE processes, demonstrate a culture of resilience and continuous improvement that customers can learn from.
The session highlights the importance of anticipating potential failures, monitoring systems effectively, responding quickly to incidents, and learning from each event to improve future resilience.
AWS provides a suite of tools and services designed to help customers build and maintain resilient systems, including the AWS Fault Injection Simulator and AWS Resilience Hub.
Customer stories from Capital One and FINRA provide real-world examples of how large organizations are implementing resilience and observability on AWS.
The session underscores the importance of regular testing, such as game days, to ensure that teams are prepared to handle failures in production environments.
The AWS Well-Architected Framework is a recurring theme, serving as a guide for customers to build and evaluate their workloads for resilience and operational excellence.
The session concludes with a message that resilience is a journey, not a destination, and AWS is committed to partnering with customers to continuously improve their resilience posture.
