Title
AWS re:Invent 2022 - Improving resiliency with the correction of error process (ARC308)
Summary
- Juan Ossa and Johnny Hanley from AWS presented on improving resiliency through the Correction of Error (COE) process.
- The COE process is used to identify root causes, document them, and share what was learned so that issues do not recur.
- The session covered the importance of creating a culture that encourages learning from mistakes without assigning blame.
- The speakers detailed the components of a COE, including detection, response, and learning, and stressed decoupling the analysis from the event itself.
- An example scenario, involving an application built on Amazon S3 and AWS Lambda, illustrated the COE process in action.
- A demo of AWS Systems Manager Incident Manager showed how to document COEs within the AWS console.
- The talk concluded with strategies for cultivating a COE culture within an organization, including establishing a community of practice and identifying champions.
Insights
- The COE process is a structured approach to learning from operational failures, emphasizing the importance of documentation and knowledge sharing.
- AWS Systems Manager Incident Manager can be used to manage and document incidents and COEs, providing templates and integration with other AWS services.
- Creating a COE culture requires a safe environment where team members can freely share information without fear of blame or punishment.
- A COE involves a detailed analysis, including a timeline of events, metrics, event questions, and the "five whys" technique to drill down to root causes.
- Action items derived from COEs should have clear ownership and deadlines to ensure that improvements are implemented effectively.
- Cultivating a COE culture involves training and empowering champions within the organization to spread best practices and facilitate continuous improvement.
- The session highlighted the importance of mechanisms over good intentions, suggesting that structured processes like COEs are necessary to drive organizational learning and improvement.
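
The elements listed above (a timeline, the "five whys", and action items with an owner and a deadline) can be sketched as a small data structure. This is a minimal illustration, not the template used by AWS or by Incident Manager: the class names, field names, and sample incident are all hypothetical, and treating the last "why" as the root cause is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    description: str


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None   # who is accountable for the fix
    due: Optional[date] = None    # when the fix is expected
    done: bool = False


@dataclass
class CorrectionOfError:
    title: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    five_whys: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def root_cause(self) -> Optional[str]:
        # Assumption: the last recorded "why" is the working root cause.
        return self.five_whys[-1] if self.five_whys else None

    def unassigned_items(self) -> List[ActionItem]:
        # Action items missing an owner or a deadline.
        return [a for a in self.action_items if not a.owner or a.due is None]

    def overdue_items(self, today: date) -> List[ActionItem]:
        # Open action items whose deadline has passed.
        return [
            a for a in self.action_items
            if a.due is not None and not a.done and a.due < today
        ]


# Hypothetical incident data, invented purely for illustration.
coe = CorrectionOfError(
    title="Hypothetical: uploaded files were not processed",
    timeline=[
        TimelineEntry(datetime(2022, 11, 1, 9, 0), "Alarm fired"),
        TimelineEntry(datetime(2022, 11, 1, 9, 40), "Mitigation applied"),
    ],
    five_whys=[
        "Uploaded files were not processed",
        "The processing function timed out",
        "The timeout was left at its default value",
    ],
    action_items=[
        ActionItem("Raise the function timeout", owner="team-a", due=date(2022, 11, 15)),
        ActionItem("Add an alarm on processing errors"),
    ],
)

print(coe.root_cause())
print([a.title for a in coe.unassigned_items()])
print([a.title for a in coe.overdue_items(date(2022, 12, 1))])
```

Checks like `unassigned_items` are one way to turn the point about mechanisms over good intentions into something enforceable: a COE with unowned or undated action items can be flagged before it is closed.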