Title
AWS re:Invent 2022 - A real-world resilience evolution in the cloud framework (ARC309)
Summary
- Anderson Motta from Itaú and Gus Santana and Robert Fuente from AWS presented their collaborative resilience program.
- Itaú Unibanco, a 98-year-old Brazilian bank, is the largest financial institution in Latin America, with operations in 20 countries and a strong digital customer base.
- The journey to the cloud began in 2016 with private cloud frameworks and data centers, evolving to a public cloud-first mandate in 2020.
- Itaú has migrated 45% of its applications to AWS, aiming for 60% by the end of 2023.
- The Resilience Evolution Program was created to address high-impact events affecting customers, with AWS working backwards from Itaú's needs to design it.
- The program established five pillars: monitoring, testing (chaos engineering), understanding dependencies, architectural enhancements, and governance.
- A "mechanisms" pillar was added to scale the lessons learned across the bank.
- The program led to a 75% reduction in high-impact events and a significant reduction in Mean Time to Recovery (MTTR).
- AWS assigned a Single Threaded Leader (STL) to the program, which ran on daily stand-up meetings and a strong focus on communication.
- Key activities included chaos engineering, critical dependencies analysis, blue-green deployment strategies, and observability improvements.
- AWS Incident Detection and Response (IDR) was introduced to enhance support for critical workloads.
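
The two headline metrics in the summary above, the reduction in high-impact events and Mean Time to Recovery (MTTR), can both be derived from plain incident records. The following is a minimal sketch of that arithmetic, not the method used in the program: the record layout, the function names, and the sample timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (detected_at, recovered_at) pairs.

    The pair layout is an assumption made for this sketch.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual outage durations, starting from a zero timedelta.
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def percent_reduction(before, after):
    """Percentage drop from a baseline count to a later count."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100


# Illustrative sample data only; these are not figures from the talk.
incidents = [
    (datetime(2022, 1, 10, 9, 0), datetime(2022, 1, 10, 10, 30)),  # 90 minutes
    (datetime(2022, 2, 3, 14, 0), datetime(2022, 2, 3, 14, 30)),   # 30 minutes
]

mttr = mean_time_to_recovery(incidents)
reduction = percent_reduction(before=8, after=2)
```

With these sample values, `mttr` is a 60-minute `timedelta`, and a drop from 8 events to 2 is a 75% reduction. Tracking MTTR alongside the raw event count matters because the two can move independently: fewer incidents do not by themselves mean faster recovery.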
Insights
- Itaú's digital transformation required decoupling monolithic applications into microservices and adopting a data-driven approach with observability.
- The Resilience Evolution Program is an example of AWS's customer-centric approach, creating a tailored solution for Itaú's specific challenges.
- The program's success was measured not only by the reduction in high-impact events but also by the establishment of a sustainable model for future application development.
- The STL role was crucial in ensuring the program's focus and accountability, demonstrating the importance of dedicated leadership in large-scale projects.
- The use of chaos engineering and dependency mapping highlights the proactive approach to resilience, preparing for potential failures rather than reacting to them.
- The creation of executive dashboards and traffic light systems (farol) for application health visibility indicates a high level of maturity in operational monitoring.
- The introduction of AWS IDR shows a commitment to continuous improvement and the value of feedback loops between AWS and its customers.
- The program's emphasis on governance and executive sponsorship underscores the importance of aligning technical initiatives with organizational support and change management.
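
The traffic light (farol) view of application health mentioned above can be pictured as a mapping from a few health signals to a color. The sketch below is one possible way to do that, assuming error rate and p99 latency as the signals. The `HealthSample` type, the `farol_status` function, and every threshold are hypothetical, since the summary does not describe how Itaú's dashboards derive their status.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Hypothetical health snapshot for one application."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def farol_status(sample, warn_error=0.01, crit_error=0.05,
                 warn_latency_ms=500.0, crit_latency_ms=2000.0):
    """Map a sample to "green", "yellow", or "red".

    The default thresholds are placeholders, not values from the talk.
    The worst signal wins: one critical metric turns the light red.
    """
    if sample.error_rate >= crit_error or sample.p99_latency_ms >= crit_latency_ms:
        return "red"
    if sample.error_rate >= warn_error or sample.p99_latency_ms >= warn_latency_ms:
        return "yellow"
    return "green"


healthy = HealthSample(error_rate=0.001, p99_latency_ms=120.0)
degraded = HealthSample(error_rate=0.02, p99_latency_ms=300.0)
failing = HealthSample(error_rate=0.002, p99_latency_ms=2500.0)

statuses = [farol_status(s) for s in (healthy, degraded, failing)]
```

Letting the worst signal decide the color keeps the dashboard conservative: an application with a low error rate but very slow responses still shows as red.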