Title: AWS re:Inforce 2024 - Continuous resilience: Managing your application risks (GRC322)
Insights:
- Importance of Resilience: Resilience is crucial for maintaining revenue and brand reputation. Downtime can lead to significant financial losses, as evidenced by Fortune 1000 companies losing between $1.5 and $2.5 billion due to system downtime.
- Shared Responsibility Model: Resilience is a shared responsibility between AWS and its customers. AWS ensures the resilience of the cloud infrastructure, while customers are responsible for the resilience of their applications built on top of AWS services.
- Complexity and Challenges: The adoption of microservices and complex systems increases the challenges of ensuring resilience. Observability becomes more complex, and the variations between deployments make it harder to anticipate failures.
- AWS Resilience Tools: AWS offers several tools to help customers assess and test the resilience of their applications, including AWS Resilience Hub, AWS Fault Injection Service, AWS Elastic Disaster Recovery, AWS Backup, and Amazon Route 53 Application Recovery Controller.
- Resilience Lifecycle: AWS introduced the Resilience Lifecycle, which provides tools and mechanisms to ensure continuous resilience. The lifecycle has five stages: setting objectives (recovery time objective, RTO, and recovery point objective, RPO), designing and implementing resilience strategies, testing and evaluating, operating with observability, and responding to and learning from events.
- Vanguard's Approach: Vanguard operates in a poly-cloud, multi-region environment to support its global client base. They have integrated continuous resilience into every step of their software development lifecycle, moving from a reactive to a proactive model.
- Homegrown Tools: Vanguard built two homegrown tools: PTAS (Performance Testing as a Service), to democratize performance testing, and Climate of Chaos, a fault injection tool, to verify that applications can withstand failures.
- Observability and Policy Engine: Vanguard implemented observability across its stack and built a policy engine to enforce compliance with resilience standards. An enterprise health check dashboard monitors critical metrics.
- Resilience Wins: Vanguard has seen significant improvements, including a 5x acceleration in feature delivery, a 30% reduction in failures, and a 60% decrease in incident recovery time.
- Future Plans: Vanguard plans to continue integrating resilience tools, improving programmatic consumption of tools, and evolving resilience practices in line with industry innovations.
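Example (illustrative, not from the talk): the "setting objectives" and "testing and evaluating" stages of the Resilience Lifecycle can be sketched as a small check that compares a recovery test result against RTO and RPO targets. This is a minimal sketch under stated assumptions: the names `ResilienceObjectives`, `RecoveryTestResult`, and `evaluate`, and all sample values, are hypothetical and do not come from the session, from Vanguard's tooling, or from any AWS API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ResilienceObjectives:
    """Hypothetical container for an application's targets."""
    rto: timedelta  # recovery time objective: max tolerable time to restore service
    rpo: timedelta  # recovery point objective: max tolerable window of data loss


@dataclass(frozen=True)
class RecoveryTestResult:
    """Hypothetical measurements taken during a recovery or fault injection test."""
    recovery_time: timedelta      # how long the service took to recover
    data_loss_window: timedelta   # age of the most recent recoverable data


def evaluate(objectives: ResilienceObjectives, result: RecoveryTestResult) -> list[str]:
    """Return a list of violated objectives; an empty list means both were met."""
    violations = []
    if result.recovery_time > objectives.rto:
        violations.append(
            f"RTO exceeded: recovered in {result.recovery_time}, target {objectives.rto}"
        )
    if result.data_loss_window > objectives.rpo:
        violations.append(
            f"RPO exceeded: lost {result.data_loss_window} of data, target {objectives.rpo}"
        )
    return violations


if __name__ == "__main__":
    # Sample values chosen only for illustration.
    objectives = ResilienceObjectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=5))
    result = RecoveryTestResult(
        recovery_time=timedelta(minutes=20),
        data_loss_window=timedelta(minutes=2),
    )
    for violation in evaluate(objectives, result):
        print(violation)
```

A check of this shape could run after each fault injection experiment so that a missed objective surfaces during testing rather than during a real incident.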
Quotes:
- "Resilience equals revenue. That's what Gartner said."
- "Like security, resilience is a shared responsibility between AWS and the customer."
- "The more the systems are complex, there are more challenges to ensure resiliency with the adoption of microservices."
- "We developed a new operating model. We went from a reactive model to a proactive model, integrating continuous resilience into every step of the software development lifecycle."
- "We have accelerated feature delivery by 5x while reducing failures by 30%."
- "Not only are we failing less frequently, we're recovering from those failures more quickly."
- "We plan to continue easing the adoption of resilient libraries and resilience tools by improving the availability of programmatic consumption of our tools."
- "We of course also want to build onto that policy engine that I mentioned. Having that policy engine foundation layer allows us to add new policies and change them all the time now that it's there."