Title

AWS re:Invent 2022 - Improving resiliency with the correction of error process (ARC308)

Summary

  • Juan Osa and Johnny Hanley from AWS presented on improving resiliency through the Correction of Error (COE) process.
  • The COE process is crucial for identifying root causes, documenting them, and sharing knowledge so that issues do not recur.
  • The session covered the importance of creating a culture that encourages learning from mistakes without assigning blame.
  • The speakers detailed the components of a COE, including detection, response, and learning, and discussed decoupling the analysis from the event itself.
  • An example scenario was provided to illustrate the COE process in action, involving an application with S3 and Lambda services.
  • A demo of AWS Systems Manager Incident Manager was given to show how to document COEs digitally within the AWS console.
  • The talk concluded with strategies for cultivating a COE culture within an organization, including establishing a community of practice and identifying champions.

Insights

  • The COE process is a structured approach to learning from operational failures, emphasizing the importance of documentation and knowledge sharing.
  • AWS Systems Manager Incident Manager can be used to manage and document incidents and COEs, providing templates and integration with other AWS services.
  • Creating a COE culture requires a safe environment where team members can freely share information without fear of blame or punishment.
  • A COE involves detailed analysis: a timeline of events, metrics, event questions, and the "five whys" technique for drilling down to root causes.
  • Action items derived from COEs should have clear ownership and deadlines to ensure that improvements are implemented effectively.
  • Cultivating a COE culture involves training and empowering champions within the organization to spread best practices and facilitate continuous improvement.
  • The session highlighted the importance of mechanisms over good intentions, suggesting that structured processes like COEs are necessary to drive organizational learning and improvement.
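
The COE components listed above (timeline, "five whys", and action items with owners and deadlines) can be sketched as a small data structure. This is a minimal illustration, not the format used by AWS or by Incident Manager: the class names, field names, and the sample incident are all hypothetical and were not part of the session.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    """One follow-up task from a COE (hypothetical structure)."""
    description: str
    owner: Optional[str] = None   # named person or team responsible
    due: Optional[date] = None    # deadline for completion


@dataclass
class CorrectionOfError:
    """A minimal COE record (hypothetical structure)."""
    title: str
    # Each timeline entry is (timestamp, what happened).
    timeline: List[Tuple[str, str]] = field(default_factory=list)
    # Each entry answers "why?" for the entry before it.
    five_whys: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def root_cause(self) -> Optional[str]:
        """Treat the last answer in the 'why' chain as the candidate root cause."""
        return self.five_whys[-1] if self.five_whys else None

    def incomplete_action_items(self) -> List[ActionItem]:
        """Return action items missing an owner or a deadline."""
        return [a for a in self.action_items if a.owner is None or a.due is None]


# Invented sample data, for illustration only.
coe = CorrectionOfError(
    title="Uploads failed to process",
    timeline=[
        ("10:02", "Error-rate alarm fired"),
        ("10:15", "On-call engineer engaged"),
        ("10:40", "Configuration rolled back"),
    ],
    five_whys=[
        "Function invocations were throttled",
        "Concurrency limit was lower than peak traffic",
        "Limit was never reviewed after launch",
    ],
    action_items=[
        ActionItem("Add alarm on throttles", owner="team-a", due=date(2023, 1, 31)),
        ActionItem("Schedule a quarterly limit review"),  # no owner or deadline yet
    ],
)

print(coe.root_cause())
for item in coe.incomplete_action_items():
    print("Needs owner and/or deadline:", item.description)
```

A check like `incomplete_action_items` is one way to turn "clear ownership and deadlines" into a mechanism rather than a good intention: a COE review could refuse to close while that list is non-empty.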