Title
AWS re:Invent 2023 - AWS Resilience Partners: Best practices to create a resilient organization - PEX210
Summary
- Ashu, the leader of the Worldwide Partners Team for Resilience at AWS, introduces the session on building a resilient organization, joined by Steve Sefton from Cigna and Nitin Gupta from Deloitte.
- Resilience is defined as the ability to recover from disruptions such as cyber attacks, human error, or unauthorized access, with an emphasis on always-on availability.
- AWS announced the AWS Resilience Competency, which validates partners' capabilities in designing, operating, and recovering resilient AWS workloads.
- Steve Sefton from Cigna discusses the importance of system stability and resilience for their healthcare services, which serve over 100 million patients.
- Nitin Gupta from Deloitte outlines their technology resiliency services and the importance of reliable design, intelligent visibility, high availability, disaster preparedness, and fault tolerance.
- Cigna's guiding principles for resilience include integrating defensively, testing completely, deploying pessimistically, running cautiously, observing obsessively, recovering urgently, and updating frequently.
- Deloitte and Cigna collaborated on defining Service Level Objectives (SLOs), performing Failure Mode Analysis (FMA), and creating a Reliability Guide for consistent remediation.
- The importance of including vendors in the resilience journey is highlighted, ensuring they meet Cigna's resiliency requirements.
- Chaos testing and game days are conducted to test system reactions to faults, with a focus on critical applications.
- Resiliency training, including mandatory training, is provided to different personas within the organization so that everyone understands their role in resilience.
- Cigna introduces its application resiliency certification process, which certifies software as resilient and requires certain resiliency steps to be completed before production deployment.
- The resiliency program has led to a 25% reduction in the count and duration of high and critical production incidents, exceeding the initial goal of 15%.
- The program's success on the Evernorth side of Cigna will be applied to the Cigna healthcare side and infrastructure services.
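
The SLOs mentioned in the summary are usually paired with an error budget: the amount of unreliability the objective still permits over a time window. The sketch below shows that arithmetic for an availability SLO. The 99.9% target, the 30-day window, and the function names are illustrative assumptions, not figures or tooling from the session.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability target over a 30-day window.
    print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes of allowed downtime
    print(round(budget_remaining(0.999, 10.8), 2))  # 0.75 of the budget is left
```

A negative result from `budget_remaining` means the window's budget is overspent, which teams commonly treat as a signal to prioritize reliability work over new releases.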
Insights
- The AWS Resilience Competency is a significant development for AWS partners, providing a structured framework for validating their expertise in resilience best practices.
- Cigna's approach to resilience emphasizes the critical nature of their services and the direct impact on patient health, highlighting the real-world consequences of system failures.
- Deloitte's involvement in Cigna's resilience journey demonstrates the value of external expertise in assessing and improving an organization's resilience posture.
- The guiding principles outlined by Cigna provide a comprehensive approach to resilience, covering all aspects from design to recovery, and could serve as a model for other organizations.
- The concept of Service Level Objectives (SLOs) and error budgets is a strategic approach to managing and measuring application performance and reliability.
- Failure Mode Analysis (FMA) is a proactive method to identify and prioritize potential points of failure, allowing for targeted improvements.
- The inclusion of vendors in the resilience strategy acknowledges the interconnected nature of modern IT ecosystems and the need for end-to-end resilience.
- The emphasis on chaos testing and game days reflects a shift towards more aggressive and realistic testing methods to ensure system robustness.
- The focus on resiliency training and the creation of a resiliency culture within the organization is crucial for sustaining long-term resilience efforts.
- The application resiliency certification process introduced by Cigna is an innovative approach to ensuring software resilience and could inspire similar initiatives in other organizations.
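
The prioritization step of Failure Mode Analysis can be sketched as a simple scoring exercise. The summary does not say which scoring scheme Cigna or Deloitte used; the example below assumes a risk priority number (severity x likelihood x detection difficulty), a convention borrowed from FMEA practice. The failure modes and their scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    description: str
    severity: int    # 1 (minor) .. 10 (critical impact)
    likelihood: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (caught immediately) .. 10 (likely to go unnoticed)

    @property
    def risk_priority(self) -> int:
        # Assumed scoring: product of the three ratings, higher means more urgent.
        return self.severity * self.likelihood * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest risk priority."""
    return sorted(modes, key=lambda m: m.risk_priority, reverse=True)


# Hypothetical failure modes for illustration only.
EXAMPLE_MODES = [
    FailureMode("database", "primary becomes unavailable", 9, 3, 2),
    FailureMode("vendor API", "third-party dependency times out", 6, 6, 4),
    FailureMode("cache", "node eviction storm", 4, 5, 3),
]

if __name__ == "__main__":
    for mode in prioritize(EXAMPLE_MODES):
        print(mode.risk_priority, mode.component, "-", mode.description)
```

With these made-up scores the vendor dependency ranks first, which illustrates why a failure-mode review tends to pull third parties into the remediation plan.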