AWS Resilience Partners: Best Practices to Create a Resilient Organization (PEX210)

Title

AWS re:Invent 2023 - AWS Resilience Partners: Best practices to create a resilient organization - PEX210

Summary

  • Ashu, who leads the Worldwide Partners Team for Resilience at AWS, opens the session on building a resilient organization and is joined by Steve from Cigna and Nitin from Deloitte.
  • Resilience is defined as the ability to recover from disruptions such as cyberattacks, human error, or unauthorized access, with an emphasis on always-on availability.
  • AWS announced the AWS Resilience Competency, which validates partners' capabilities in designing, operating, and recovering resilient AWS workloads.
  • Steve Sefton from Cigna discusses the importance of system stability and resilience for their healthcare services, which serve over 100 million patients.
  • Nitin Gupta from Deloitte outlines Deloitte's technology resiliency services and the capabilities they emphasize: reliable design, intelligent visibility, high availability, disaster preparedness, and fault tolerance.
  • Cigna's guiding principles for resilience include integrating defensively, testing completely, deploying pessimistically, running cautiously, observing obsessively, recovering urgently, and updating frequently.
  • Deloitte and Cigna collaborated on defining Service Level Objectives (SLOs), performing Failure Mode Analysis (FMA), and creating a Reliability Guide for consistent remediation.
  • The speakers stress including vendors in the resilience journey so that third-party services also meet Cigna's resiliency requirements.
  • Cigna runs chaos tests and game days to observe how systems react to injected faults, focusing on critical applications.
  • Resiliency training is tailored to different personas within the organization and includes mandatory training so that everyone understands their role in resilience.
  • Cigna introduces an application resiliency certification process that certifies software as resilient and requires specified resiliency steps to be completed before production deployment.
  • The resiliency program has reduced both the count and the duration of high- and critical-severity production incidents by 25%, exceeding the initial goal of 15%.
  • After its success on the Evernorth side of Cigna, the program will be extended to the Cigna Healthcare side and to infrastructure services.
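The talk does not describe how the Failure Mode Analysis results were scored or ranked. One widely used way to prioritize failure modes is the Risk Priority Number (RPN) from classic FMEA: rate each failure mode's severity, likelihood of occurrence, and difficulty of detection on a 1 to 10 scale and rank by their product. The sketch below assumes that scheme; the `FailureMode` class, the `prioritize` function, and all component names and scores are hypothetical illustrations, not Cigna's or Deloitte's actual method or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One potential failure, scored with the classic FMEA 1-10 scales (an assumed scheme)."""
    component: str
    description: str
    severity: int    # 1 = negligible impact, 10 = severe (e.g. patient-facing) impact
    occurrence: int  # 1 = very unlikely, 10 = expected to happen often
    detection: int   # 1 = always caught before impact, 10 = effectively undetectable

    def __post_init__(self) -> None:
        # Reject scores outside the 1-10 scale so rankings stay comparable.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of the three scores."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN (highest = fix first)."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    # Hypothetical failure modes for an illustrative claims workload.
    modes = [
        FailureMode("claims-api", "database primary fails over", 8, 3, 4),
        FailureMode("claims-api", "vendor eligibility service times out", 7, 6, 5),
        FailureMode("web-portal", "stale cache served after deploy", 4, 5, 2),
    ]
    for m in prioritize(modes):
        print(f"{m.rpn:>4}  {m.component}: {m.description}")
```

With these made-up scores, the vendor timeout (RPN 210) ranks above the database failover (96) and the cache issue (40), which shows how a vendor dependency can surface as a top remediation target. RPN is only one heuristic; teams often also treat any high-severity failure mode as a priority regardless of its product score.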

Insights

  • The AWS Resilience Competency is a significant development for AWS partners, providing a structured framework for validating their expertise in resilience best practices.
  • Cigna's approach to resilience is driven by how critical its services are and how directly outages affect patient health, underscoring the real-world consequences of system failures.
  • Deloitte's involvement in Cigna's resilience journey demonstrates the value of external expertise in assessing and improving an organization's resilience posture.
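  • The guiding principles outlined by Cigna provide a comprehensive approach to resilience, covering the lifecycle from integration and testing through deployment, operations, recovery, and updates, and could serve as a model for other organizations.
  • Service Level Objectives (SLOs) and error budgets offer a strategic way to measure and manage application performance and reliability; an error budget is the amount of unreliability an SLO allows over a given period (for example, 0.1% of the time for a 99.9% availability target).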
  • Failure Mode Analysis (FMA) is a proactive method to identify and prioritize potential points of failure, allowing for targeted improvements.
  • The inclusion of vendors in the resilience strategy acknowledges the interconnected nature of modern IT ecosystems and the need for end-to-end resilience.
  • The emphasis on chaos testing and game days reflects a shift towards more aggressive and realistic testing methods to ensure system robustness.
  • Resiliency training and a resiliency culture within the organization are crucial for sustaining long-term resilience efforts.
  • The application resiliency certification process introduced by Cigna is an innovative approach to ensuring software resilience and could inspire similar initiatives in other organizations.
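To make the error-budget idea from the insights above concrete, the sketch below converts an availability SLO into allowed downtime over a rolling window and reports how much of that budget has been spent. It assumes a simple time-based availability SLO; the function names `error_budget_minutes` and `budget_consumed` and the example figures are hypothetical, not taken from Cigna's program.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed minutes of unavailability in the window for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability target.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    # The budget is the complement of the target, applied to the whole window.
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def budget_consumed(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget already spent; values above 1.0 mean the SLO was missed."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, 30)
    print(f"99.9% over 30 days allows {budget:.1f} minutes of downtime")
    print(f"20 minutes of downtime consumes {budget_consumed(0.999, 30, 20):.0%} of it")
```

A 99.9% target over 30 days leaves about 43.2 minutes of downtime, so 20 minutes of incidents consumes roughly 46% of the budget. Teams commonly use the remaining budget to decide whether to keep shipping changes or to shift effort toward reliability work.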