Title
AWS re:Invent 2023 - Behind the scenes of Amazon EBS innovation and operational excellence (STG210)
Summary
- Cami Novella-Kuhlin, director of product management, and John Hayden, director of engineering for EBS, discuss the evolution of Amazon Elastic Block Store (EBS) over 15 years.
- EBS is a high-scale distributed system providing persistent block storage for EC2, with features like snapshots and data resilience.
- The talk covers the history of EBS, its scaling challenges, architectural evolution, feature set expansion, and team culture.
- EBS now handles 100 trillion I/O operations and moves 13 exabytes of data every day, which requires continuous architectural innovation to avoid bottlenecks.
- Key milestones include the transition from HDDs to SSDs, the introduction of provisioned IOPS with io1 volumes, and the launch of io2 volumes with 99.999% (five nines) durability.
- The Nitro system and the Scalable Reliable Datagram (SRD) protocol have been instrumental in improving EBS performance, reducing tail latencies, and increasing throughput.
- Operational excellence is emphasized as critical for innovation, with a focus on measuring everything, ownership, and continuous improvement.
- The talk concludes with insights on balancing operational excellence with innovation and the importance of rethinking architectural risks as services scale.
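
To put the daily figures quoted above in perspective, the following is a rough back-of-the-envelope sketch, not anything shown in the talk. It assumes decimal exabytes (10^18 bytes) and traffic spread evenly over the day, and it reads "five nines durability" as a 0.001% annual volume failure rate. The function names and the one-million-volume fleet size are hypothetical and chosen only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(daily_total):
    """Average per-second rate for a daily total, assuming an even spread."""
    return daily_total / SECONDS_PER_DAY

def expected_annual_losses(volume_count, durability):
    """Expected volumes lost per year at a given annual durability (0..1)."""
    return volume_count * (1 - durability)

# Figures quoted in the summary.
daily_ios = 100e12    # 100 trillion I/O operations per day
daily_bytes = 13e18   # 13 exabytes per day (assumption: decimal exabytes)

ios_per_second = per_second(daily_ios)
bytes_per_second = per_second(daily_bytes)

# Hypothetical fleet of one million volumes at five nines durability.
losses = expected_annual_losses(1_000_000, 0.99999)

print(f"{ios_per_second:.3e} I/O operations per second on average")
print(f"{bytes_per_second / 1e12:.1f} TB moved per second on average")
print(f"{losses:.1f} expected volume losses per year per million volumes")
```

Under these assumptions the averages come out above one billion I/O operations per second and roughly 150 TB per second. Real traffic is bursty, so peak rates would be higher than these averages.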
Insights
- EBS's journey reflects the broader trend of cloud services evolving to meet increasing customer demands for performance, reliability, and scalability.
- The transition from HDDs to SSDs and the introduction of provisioned IOPS were pivotal in addressing customer needs for consistent and predictable performance.
- The Nitro system and SRD protocol highlight AWS's commitment to leveraging in-house innovations to enhance service offerings.
- AWS's approach to operational excellence, which involves meticulous measurement, ownership, and iterative improvement, is a model for managing large-scale distributed systems.
- The concept of "no edge case" at scale underscores the importance of designing systems that can handle rare events, as they become inevitable with growth.
- AWS's culture of reevaluating and adapting architectures to mitigate risks and improve service is a key factor in its ability to innovate and maintain customer trust.
- The talk suggests that investing in operational excellence can create a virtuous cycle that enables more rapid innovation and growth, challenging the notion that operations and innovation are mutually exclusive.
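
The "no edge case" point above can be made concrete with a little probability. The sketch below is illustrative only: the one-in-a-billion per-operation failure probability is a hypothetical number, not a figure from the talk, and operations are assumed to be independent. It uses the 100 trillion daily I/O figure from the summary.

```python
import math

def expected_occurrences(per_op_probability, operations):
    """Expected number of times a rare event occurs across many operations."""
    return per_op_probability * operations

def prob_at_least_one(per_op_probability, operations):
    """P(event occurs at least once) = 1 - (1 - p)^n, assuming independence.

    Computed with log1p/expm1 so that tiny probabilities do not lose precision.
    """
    return -math.expm1(operations * math.log1p(-per_op_probability))

p = 1e-9               # hypothetical: a one-in-a-billion event per I/O
daily_ios = 100e12     # 100 trillion I/O operations per day, from the summary

print(expected_occurrences(p, daily_ios))   # about 100,000 occurrences per day
print(prob_at_least_one(p, 1_000))          # about one in a million at small scale
print(prob_at_least_one(p, daily_ios))      # effectively certain at EBS scale
```

An event that a small system might never observe becomes, under these assumptions, something that happens on the order of a hundred thousand times a day, which is why it has to be treated as a normal code path rather than an exception.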