Title
AWS re:Invent 2023 - Improve resilience of SAP workloads with AWS Support (SUP312)
Summary
- 3M collaborated with AWS Support to enhance the resilience of its SAP workload.
- Kim Otto from 3M and AWS technical account managers Manik Chopra and Vijay Sitaram shared their experiences.
- AWS Support ran engagements such as tabletop exercises, fault testing, and runbook reviews to improve 3M's disaster recovery (DR) exercises.
- 3M used AWS Resilience Hub, AWS Fault Injection Service, and CloudWatch Application Insights to modernize its SAP resiliency management.
- The session emphasized the importance of high availability, continuity of operations, and continuous resilience.
- Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are critical metrics for system resiliency.
- Different categories of failures were discussed, including code deployment, core infrastructure, data corruption, dependencies, and regional outages.
- 3M used AWS services such as AWS Health, Trusted Advisor, and CloudWatch.
- Key support engagements included a Well-Architected review with the SAP Lens, assessing incident response, building and reviewing runbooks, managing and operating resilience, testing and validating recovery, and monitoring and observing availability.
- 3M's SAP architecture includes two regions for production and non-production workloads, with a focus on identifying single points of failure.
- Trusted Advisor Priority and Resilience Hub were used to assess and improve the resilience of 3M's SAP systems.
- Fault Injection Service (FIS) was used to simulate and test various failure scenarios.
- CloudWatch Application Insights for SAP provided monitoring capabilities for both infrastructure and application metrics.
- Kim Otto highlighted the journey with AWS, focusing on reducing Mean Time to Recovery (MTTR), modernizing operations, and fostering a culture of resilience testing.
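
As a rough illustration of how the RPO, RTO, and MTTR figures mentioned above relate to timestamps captured during a DR exercise, here is a minimal Python sketch. It is not taken from the session: the `RecoveryDrill` class, its field names, and the target values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryDrill:
    """Timestamps recorded during one (hypothetical) DR exercise."""
    last_good_backup: datetime   # most recent recoverable data point
    failure_start: datetime      # when the outage began
    service_restored: datetime   # when the workload was usable again


def achieved_rpo(drill: RecoveryDrill) -> timedelta:
    """Data-loss window: time between the last recoverable point and the failure."""
    return drill.failure_start - drill.last_good_backup


def achieved_rto(drill: RecoveryDrill) -> timedelta:
    """Downtime: time between the failure and restored service."""
    return drill.service_restored - drill.failure_start


def meets_targets(drill: RecoveryDrill, rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True when both the data-loss window and the downtime are within target."""
    return achieved_rpo(drill) <= rpo_target and achieved_rto(drill) <= rto_target


def mean_time_to_recovery(drills: list[RecoveryDrill]) -> timedelta:
    """Average downtime across a non-empty list of drills."""
    total = sum((achieved_rto(d) for d in drills), timedelta())
    return total / len(drills)


if __name__ == "__main__":
    # Assumed example values, not figures from 3M or AWS.
    drill = RecoveryDrill(
        last_good_backup=datetime(2023, 11, 1, 10, 0),
        failure_start=datetime(2023, 11, 1, 10, 10),
        service_restored=datetime(2023, 11, 1, 11, 0),
    )
    print("RPO achieved:", achieved_rpo(drill))
    print("RTO achieved:", achieved_rto(drill))
    print("Within targets:", meets_targets(drill, timedelta(minutes=15), timedelta(hours=1)))
```

Tracking these values across repeated exercises gives a simple way to see whether recovery time is actually trending down.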
Insights
- The collaboration between 3M and AWS demonstrates the value of AWS Support in enhancing the resilience of critical workloads.
- AWS Resilience Hub and AWS Fault Injection Service are powerful tools for assessing and testing the resilience of systems, allowing for proactive identification and mitigation of potential failure points.
- The use of AWS services for monitoring and observability, such as CloudWatch Application Insights, is crucial for maintaining system health and quickly identifying issues.
- The integration of Trusted Advisor and Resilience Hub provides a comprehensive view of system resilience and actionable recommendations for improvement.
- The session highlighted the importance of continuous resilience practices, including regular DR testing and updating runbooks for cloud environments.
- The focus on RPO and RTO metrics underscores the need for businesses to align their technical resilience strategies with their business continuity requirements.
- The journey of 3M serves as a case study for other organizations looking to improve their system resilience on AWS, particularly for complex and mission-critical applications like SAP.
- The emphasis on cross-team communication and collaboration is key to successful resilience management, as it involves both infrastructure and application teams.
- The session provided insights into the evolving nature of cloud services and the importance of staying current with new features and best practices to maintain a resilient infrastructure.