The Amazon Builders’ Library: 25 Yrs of Amazon Operational Excellence (DOP301)

Title

AWS re:Invent 2022 - The Amazon Builders’ Library: 25 yrs of Amazon operational excellence (DOP301)

Summary

  • Event Management: Amazon runs an event management process for operational incidents, built on audio conference calls, ticket tracking, and chat systems. It also conducts game days to simulate incidents in new AWS Regions and to practice managing events in degraded states.
  • Deploying Infrastructure and Software: Amazon has moved to automated deployments to reduce human error and shorten the time from check-in to production. A centralized deployment system handles both code and infrastructure changes.
  • Heavy Lifts and Backward-Incompatible Changes: Amazon treats every service and feature as a promise to customers, so backward-incompatible changes are rare and carefully managed. Extensive instrumentation shows how systems are actually used, and that data guides how such changes are handled.
  • SAFE Framework: Amazon has a framework for managing operational events effectively. It emphasizes maintaining urgency without panic, participating actively, restoring service before pursuing root cause analysis, and escalating issues when necessary.
  • Deployment Safety Best Practices: Amazon has learned to deploy incrementally, roll back quickly, and ensure good observability to minimize customer impact.
  • Infrastructure Change Safety Practices: As with deployments, infrastructure changes are managed with pre-checks, incremental changes, and the ability to roll back deletes.
  • Long-term API Support: Amazon supports APIs for many years, studies how changes would affect customers, and uses that knowledge to build better systems.
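
The deployment safety practices above (deploy incrementally, watch health signals, roll back quickly) can be illustrated with a minimal sketch. This is not Amazon's deployment system; the function name `deploy_in_waves`, the wave layout, and the `deploy`, `rollback`, and `is_healthy` callbacks are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Sequence


def deploy_in_waves(
    waves: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Deploy wave by wave; undo every deployed target if a wave is unhealthy.

    Hypothetical sketch: `waves` is an ordered list of target groups,
    typically starting small (one host) and growing with each wave.
    """
    deployed: List[str] = []
    for wave in waves:
        for target in wave:
            deploy(target)
            deployed.append(target)
        # Observability gate: check the whole wave before widening the rollout.
        if not all(is_healthy(target) for target in wave):
            # Roll back in reverse order of deployment, newest first.
            for target in reversed(deployed):
                rollback(target)
            return False
    return True
```

Starting with a small first wave limits how many customers a bad change can reach, and the health gate between waves is what makes a fast rollback possible before the change spreads further.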

Insights

  • Operational Excellence: Amazon's operational excellence is built on a foundation of event management, deployment safety, and infrastructure change safety practices, which have been refined over 25 years.
  • Automation and Risk Reduction: Automating deployments and infrastructure changes has significantly reduced the risk of human error and improved the speed and safety of changes.
  • Customer-Centric Approach: Amazon's approach to managing backward incompatible changes reflects a strong customer-centric philosophy, ensuring minimal disruption to customer operations.
  • Learning from Incidents: Amazon's culture of writing detailed post-incident reports, and keeping them in a shared library, enables cross-team learning and continuous improvement in operational practices.
  • Instrumentation and Observability: Deep instrumentation and observability are key to Amazon's ability to manage changes and understand system usage, which informs better system design and optimization.
  • DevOps Benefits: Amazon's DevOps approach not only ensures developers have skin in the game but also provides them with a deeper understanding of system usage, leading to better system design and customer experience.
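
As a rough illustration of how usage instrumentation can inform a backward-incompatible change, the sketch below counts which callers still use a given operation. The record format (pairs of caller ID and operation name) and the function names are assumptions made for this example, not a description of Amazon's tooling.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def callers_of_operation(
    records: Iterable[Tuple[str, str]], operation: str
) -> Dict[str, int]:
    """Return call counts per caller for one operation.

    Hypothetical input: each record is a (caller_id, operation) pair,
    as might be extracted from request logs.
    """
    counts = Counter(caller for caller, op in records if op == operation)
    return dict(counts)


def safe_to_remove(records: Iterable[Tuple[str, str]], operation: str) -> bool:
    """An operation is a removal candidate only if no caller used it."""
    return not callers_of_operation(records, operation)
```

A per-caller breakdown like this identifies who would be affected by a change, so those customers can be contacted or migrated before anything is removed.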