Beyond Five 9s: Lessons from Our Highest Available Data Planes (ARC310)

Title

AWS re:Invent 2022 - Beyond five 9s: Lessons from our highest available data planes (ARC310)

Summary

  • Colm MacCárthaigh, a VP and Distinguished Engineer at AWS, shares lessons from building highly available AWS services.
  • He challenges the traditional "nines" model of availability, arguing for a deeper understanding of how systems fail, including how often incidents occur and how long they last.
  • AWS uses partitioning and cell-based architectures to avoid global outages, with a focus on keeping Regions independent of one another.
  • He recommends plotting incident data to understand both the rate and the expected duration of incidents.
  • He discusses compartmentalization, shuffle sharding, and minimizing change as ways to improve system reliability.
  • Testing and operational safety are highlighted as critical for avoiding defects and operational mistakes.
  • He stresses the importance of culture, high standards, and a supportive environment in building reliable systems.
  • He concludes by encouraging builders to push past the nines model, shepherd system evolution, and foster elite, happy teams.
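
To make the point about rate and duration concrete, here is a minimal sketch of the arithmetic behind the nines model. The function names and the two incident profiles are illustrative assumptions, not figures from the talk. It shows that two systems can report the same number of nines while failing in very different ways.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability target of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def availability(incidents_per_year: float, minutes_per_incident: float) -> float:
    """Availability implied by an incident rate and a mean incident duration."""
    downtime = incidents_per_year * minutes_per_incident
    return 1 - downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines allows {downtime_budget_minutes(n):.3f} minutes per year")

    # Hypothetical profiles: one 5-minute outage vs. 300 one-second blips.
    one_long = availability(incidents_per_year=1, minutes_per_incident=5.0)
    many_short = availability(incidents_per_year=300, minutes_per_incident=1 / 60)
    print(f"one long outage:  {one_long:.7f}")
    print(f"many short blips: {many_short:.7f}")
```

Both profiles add up to about five minutes of downtime per year, so a single availability percentage cannot tell them apart. Tracking rate and duration separately can.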

Insights

  • The traditional nines model is insufficient for modern cloud services because a single availability percentage does not capture partial availability or the complexity of distributed systems.
  • AWS's approach to high availability includes regional partitioning, shuffle sharding, and designing systems to always run as if at maximum load, so that their behavior changes as little as possible under stress.
  • Testing and operational safety are paramount, with AWS employing extensive automated testing and cautious deployment processes.
  • Culture and team dynamics play a significant role in system reliability. Elite, supportive teams with high standards are more likely to produce maintainable and operable systems.
  • The talk suggests that system design is an evolutionary process, and that continually folding lessons learned back into the design is crucial for long-term reliability.
  • The speaker's lessons align with AWS's broader philosophy of building resilient, customer-centric services, which calls for both technical excellence and a strong team culture.
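
The shuffle sharding mentioned in the points above can be illustrated with a short sketch. This is not AWS's implementation: the `shuffle_shard` function, the hash-seeded selection, and the worker names are all assumptions made for illustration. The idea shown is that each customer is mapped to a small, deterministic subset of workers, so two customers rarely share exactly the same subset.

```python
import hashlib
import random
from math import comb


def shuffle_shard(customer_id: str, workers: list[str], shard_size: int) -> frozenset[str]:
    """Pick a deterministic pseudo-random subset of workers for one customer.

    Hypothetical helper: the customer ID is hashed to seed a private RNG,
    which then samples `shard_size` distinct workers.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


if __name__ == "__main__":
    workers = [f"worker-{i}" for i in range(8)]  # illustrative fleet of 8
    shard_size = 2

    # Number of distinct shards available: C(8, 2).
    print("possible shards:", comb(len(workers), shard_size))

    for customer in ("alpha", "bravo", "charlie"):
        print(customer, sorted(shuffle_shard(customer, workers, shard_size)))
```

With 8 workers and shards of size 2 there are 28 possible shards. If one customer's traffic takes down both of its workers, only customers assigned that exact pair lose all of their capacity, and customers that share a single worker still have one healthy worker to retry against.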