Behind the Scenes of Amazon EBS Innovation and Operational Excellence (STG210)

Title

AWS re:Invent 2023 - Behind the scenes of Amazon EBS innovation and operational excellence (STG210)

Summary

  • Cami Novella-Kuhlin, director of product management, and John Hayden, director of engineering for EBS, discuss the evolution of Amazon Elastic Block Store (EBS) over 15 years.
  • EBS is a high-scale distributed system providing persistent block storage for EC2, with features like snapshots and data resilience.
  • The talk covers the history of EBS, its scaling challenges, architectural evolution, feature set expansion, and team culture.
  • EBS has grown to handle 100 trillion I/O operations and move 13 exabytes of data every day, so its architecture must keep evolving to avoid bottlenecks.
  • Key milestones include the transition from HDDs to SSDs, the introduction of provisioned IOPS with io1 volumes, and the launch of io2 volumes with 99.999% (five nines) durability.
  • The Nitro System and the Scalable Reliable Datagram (SRD) protocol have been instrumental in improving EBS performance, reducing tail latencies, and increasing throughput.
  • Operational excellence is emphasized as critical for innovation, with a focus on measuring everything, ownership, and continuous improvement.
  • The talk concludes with insights on balancing operational excellence with innovation and the importance of rethinking architectural risks as services scale.
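
To put the daily scale figures in the summary in more familiar units, the short sketch below converts them to average per-second rates. It assumes the figures are daily totals, that an exabyte is the decimal 10^18 bytes, and that load is spread evenly over the day (real traffic will have peaks); the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(daily_total: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


# Figures quoted in the summary, treated as daily totals.
daily_ios = 100e12    # 100 trillion I/O operations per day
daily_bytes = 13e18   # 13 exabytes per day (assumes decimal exabytes)

ios_per_second = per_second(daily_ios)
bytes_per_second = per_second(daily_bytes)

print(f"{ios_per_second / 1e9:.2f} billion I/O operations per second")
print(f"{bytes_per_second / 1e12:.0f} TB per second")
```

Under these assumptions the averages come to roughly 1.16 billion I/O operations per second and about 150 TB per second.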

Insights

  • EBS's journey reflects the broader trend of cloud services evolving to meet increasing customer demands for performance, reliability, and scalability.
  • The transition from HDDs to SSDs and the introduction of provisioned IOPS were pivotal in addressing customer needs for consistent and predictable performance.
  • The Nitro system and SRD protocol highlight AWS's commitment to leveraging in-house innovations to enhance service offerings.
  • AWS's approach to operational excellence, which involves meticulous measurement, ownership, and iterative improvement, is a model for managing large-scale distributed systems.
  • The concept of "no edge case" at scale underscores the importance of designing systems that can handle rare events, as they become inevitable with growth.
  • AWS's culture of reevaluating and adapting architectures to mitigate risks and improve service is a key factor in its ability to innovate and maintain customer trust.
  • The talk suggests that investing in operational excellence can create a virtuous cycle that enables more rapid innovation and growth, challenging the notion that operations and innovation are mutually exclusive.
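
The "no edge case" point above can be illustrated with a small probability sketch. The one-in-a-billion failure probability is a hypothetical value chosen for illustration, not a figure from the talk, and the sketch assumes operations fail independently; only the 100 trillion daily operations come from the summary.

```python
import math


def expected_occurrences(p: float, n: float) -> float:
    """Expected number of times an event with per-operation probability p occurs in n operations."""
    return p * n


def prob_at_least_once(p: float, n: float) -> float:
    """P(at least one occurrence) = 1 - (1 - p)^n, computed stably for tiny p and huge n."""
    return -math.expm1(n * math.log1p(-p))


rare_event_probability = 1e-9   # hypothetical "one in a billion" edge case
daily_operations = 100e12       # 100 trillion I/O operations per day (from the summary)

print(expected_occurrences(rare_event_probability, daily_operations))
print(prob_at_least_once(rare_event_probability, daily_operations))
```

With these inputs, an event that looks negligible per operation would be expected about 100,000 times a day, which is why rare events have to be treated as routine at this scale.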