Title
AWS re:Invent 2023 - A consistent approach to resilience analysis for critical workloads (ARC313)
Summary
- John Formento opened the session by highlighting the importance of resilience, drawing an analogy with the WWII analysis of bullet holes on returning aircraft.
- AWS released a resilience lifecycle framework, which is detailed in a white paper accessible via a QR code.
- The session focused on the "design and implement" and "operate" stages of the resilience lifecycle.
- The session introduced the resilience analysis framework (RAF), which helps assess and improve the resilience of workloads.
- RAF is based on the SEEMS model, which stands for Shared fate, Excessive load, Excessive latency, Misconfiguration and bugs, and Single points of failure.
- The RAF process involves understanding the workload, identifying critical user stories, and analyzing potential failure modes.
- Trade-offs in resilience strategies were discussed, including cost, effort, complexity, operational burden, and consistency vs. latency.
- The RAF process also involves assessing the likelihood and impact of potential failures and deciding on preventative or corrective mitigations.
- Real-world customer examples were presented by Mike Haken, demonstrating how RAF was applied to address resilience issues.
- Mike Golovnik shared insights on implementing RAF within AWS, emphasizing the need for proactive resilience thinking and integration with operational and development processes.
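The analysis steps summarized above (critical user stories, failure modes grouped by category, likelihood and impact) can be captured in a small risk register. The Python sketch below is illustrative only: the 1-5 scoring scale, the likelihood-times-impact score, the class and field names, and the example entries are all assumptions made for this example, not something the session prescribes. Each failure category is paired with the resilience property it threatens.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """Failure categories; each value names the resilience property at risk."""
    SHARED_FATE = "fault isolation"
    EXCESSIVE_LOAD = "sufficient capacity"
    EXCESSIVE_LATENCY = "timely output"
    MISCONFIGURATION_AND_BUGS = "correct output"
    SINGLE_POINT_OF_FAILURE = "redundancy"


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical RAF-style risk register."""
    user_story: str
    component: str
    category: FailureCategory
    description: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-5 scale.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def risk(self) -> int:
        # Assumed scoring rule: simple product of likelihood and impact.
        return self.likelihood * self.impact


def prioritize(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda mode: mode.risk, reverse=True)


# Hypothetical entries for a made-up "place order" user story.
register = [
    FailureMode("place order", "orders database",
                FailureCategory.SINGLE_POINT_OF_FAILURE,
                "single database instance with no standby", 2, 5),
    FailureMode("place order", "pricing cache",
                FailureCategory.EXCESSIVE_LOAD,
                "cold cache sends all traffic to the backend", 4, 4),
    FailureMode("place order", "payment client",
                FailureCategory.EXCESSIVE_LATENCY,
                "no timeout on calls to the payment provider", 3, 3),
]

for mode in prioritize(register):
    print(f"{mode.risk:>2}  {mode.component}: {mode.description} "
          f"(threatens {mode.category.value})")
```

Sorting by a single score is only one way to rank risks; a team could equally review likelihood and impact separately before choosing preventative or corrective mitigations.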
Insights
- The RAF provides a structured approach to proactively analyze and improve the resilience of systems, which is crucial for mission-critical applications.
- The SEEMS model within RAF helps identify common failure modes that can compromise resilience properties such as fault isolation, sufficient capacity, timely output, correct output, and redundancy.
- Understanding the trade-offs involved in resilience strategies is essential for making informed decisions about where to invest resources for maximum impact.
- Real-world examples highlighted the practical application of RAF and the benefits of patterns like constant work, hedging, and fault-isolated deployments.
- RAF requires executive support, dedicated resources, and a team with a deep understanding of the system and a passion for resilience.
- RAF is not just a technical process but also a cultural shift towards a resilience mindset, which involves continuous learning and improvement.
- The RAF process can lead to significant improvements in system resilience by identifying and addressing risks that might otherwise be overlooked.
- Integrating RAF with operational and feature development processes ensures that resilience analysis is not an isolated activity but part of the ongoing lifecycle of the application.
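Of the patterns mentioned above, hedging is the easiest to show in a few lines. The sketch below assumes one common form of it: send a request to one replica, and if no response arrives within a short delay, send a backup request to another replica and use whichever succeeds first. The function name `hedged_call`, its parameters, and the simulated replicas are hypothetical and are not taken from the session.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def hedged_call(
    attempts: Sequence[Callable[[], Awaitable[T]]],
    hedge_delay: float,
) -> T:
    """Return the first successful result among staggered attempts.

    The first attempt starts immediately. Each further attempt starts when
    `hedge_delay` seconds pass without a result, or as soon as an earlier
    attempt fails. Attempts still running when a result arrives are cancelled.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")

    remaining = list(attempts)
    pending = set()
    last_error = None
    try:
        while remaining or pending:
            if remaining:
                # Launch the next attempt (the primary on the first pass).
                pending.add(asyncio.ensure_future(remaining.pop(0)()))
            # Only time out while there is still a backup left to launch.
            timeout = hedge_delay if remaining else None
            done, pending = await asyncio.wait(
                pending, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
        # Every attempt finished with an error; surface the most recent one.
        assert last_error is not None
        raise last_error
    finally:
        for task in pending:
            task.cancel()


async def _demo() -> None:
    # Simulated replicas: the first is slow, the second answers quickly.
    async def slow_replica() -> str:
        await asyncio.sleep(0.3)
        return "slow replica"

    async def fast_replica() -> str:
        await asyncio.sleep(0.01)
        return "fast replica"

    print(await hedged_call([slow_replica, fast_replica], hedge_delay=0.05))


if __name__ == "__main__":
    asyncio.run(_demo())
```

Hedging trades extra load for lower tail latency, so the delay is usually chosen so that only a small fraction of requests trigger a backup, and it is only safe for operations that can be repeated without side effects.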