Title

AWS re:Invent 2023 - How Netflix uses AWS for multi-Region cache replication (NFX304)

Summary

Prateek Sharma, a Principal Solutions Architect at AWS, introduces the session, highlighting how Netflix uses AWS to deliver personalized user experiences and resiliency.
Sriram Chettykarnaburuvan and Prithvirajani, engineers from Netflix, discuss the challenges and solutions for global cache replication across multiple AWS regions.
They explain the importance of cache replication for maintaining low latency and reducing database load during region failovers.
The session covers the design and architecture of Netflix's EVCache system, which is a distributed, sharded key-value store based on Memcached with in-region and global replication capabilities.
The engineers discuss the use of Kafka for event streaming, SQS for retries, and the importance of auto-scaling, observability, and alerting in managing the system.
They share efficiency improvements, such as compressing replication events in batches and removing Network Load Balancers from the replication path, which resulted in significant cost savings.
The session concludes with a look into the life of an EVCache engineer at Netflix, detailing incident response and testing strategies, and future plans for the replication service.
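
The batch compression idea mentioned above can be illustrated with a minimal sketch. It is not Netflix's implementation: the event fields, the JSON encoding, and the use of zlib are assumptions chosen only to show why compressing many small, similar events together tends to beat compressing each one separately.

```python
import json
import zlib

def compress_individually(events):
    """Total compressed size when every event is compressed on its own."""
    return sum(len(zlib.compress(json.dumps(e).encode("utf-8"))) for e in events)

def compress_batch(events):
    """Serialize the whole batch once and compress it as a single payload."""
    return zlib.compress(json.dumps(events).encode("utf-8"))

def decompress_batch(blob):
    """Inverse of compress_batch: recover the original list of events."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical replication events; the field names are illustrative only.
events = [
    {"op": "SET", "key": f"profile:{i}", "ttl": 3600, "region": "us-east-1"}
    for i in range(200)
]

batch_blob = compress_batch(events)
individual_total = compress_individually(events)

print("batched bytes:   ", len(batch_blob))
print("individual bytes:", individual_total)
```

Small events share most of their structure, so a single compressed batch can reuse repeated field names and values across events, while per-event compression pays the format overhead every time.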

Netflix's EVCache system is a critical component for delivering personalized content quickly to users, handling 30 million replication events per second and storing 2 trillion items.
The replication service is designed with high availability as its top priority and offers only best-effort consistency, given the nature of cache data and the difficulty of achieving strong consistency.
The use of Kafka and SQS in the replication service architecture allows for efficient handling of replication events and provides a robust retry mechanism for failed events.
Auto-scaling policies based on CPU, network, and queue size metrics are crucial for managing the replication service's resources effectively and ensuring it can handle traffic spikes without manual intervention.
Efficiency improvements such as batch compression and the elimination of Network Load Balancers have led to significant cost reductions, showing the value of continuous optimization in cloud services.
The engineers' approach to incident response and testing in production environments emphasizes the need for thorough observability and the ability to test fixes in a live setting without affecting users.
Netflix's future plans for the replication service include migration to IPv6, containerization, and expanding its capabilities to support additional use cases like write-ahead logs and delayed queues.
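
The retry path described above (a streaming queue for replication events, with a separate queue for failed ones) can be sketched as follows. This is a simplified illustration under stated assumptions: in-memory deques stand in for Kafka and SQS, and `FlakyRemoteCache`, `replicate`, `drain_retries`, and `max_attempts` are hypothetical names, not part of EVCache or any AWS API.

```python
from collections import deque

class FlakyRemoteCache:
    """Stand-in for the destination region's cache.

    The first write to any key listed in fail_once raises an error,
    simulating a transient cross-region failure.
    """
    def __init__(self, fail_once):
        self.store = {}
        self.fail_once = set(fail_once)

    def set(self, key, value):
        if key in self.fail_once:
            self.fail_once.discard(key)
            raise ConnectionError(f"transient failure writing {key}")
        self.store[key] = value

def replicate(events, remote, retry_queue):
    """Apply each event to the remote cache; park failures on the retry queue."""
    for event in events:
        try:
            remote.set(event["key"], event["value"])
        except ConnectionError:
            retry_queue.append(event)

def drain_retries(retry_queue, remote, max_attempts=3):
    """Retry parked events; give up after max_attempts (best-effort delivery)."""
    dropped = []
    while retry_queue:
        event = retry_queue.popleft()
        attempts = event.get("attempts", 0) + 1
        try:
            remote.set(event["key"], event["value"])
        except ConnectionError:
            if attempts < max_attempts:
                retry_queue.append({**event, "attempts": attempts})
            else:
                dropped.append(event["key"])
    return dropped

# Deques stand in for the event stream (Kafka) and the retry queue (SQS).
stream = deque([
    {"key": "a", "value": 1},
    {"key": "b", "value": 2},
    {"key": "c", "value": 3},
])
remote = FlakyRemoteCache(fail_once={"b"})
retry_queue = deque()

replicate(stream, remote, retry_queue)
dropped = drain_retries(retry_queue, remote)
print(remote.store, dropped)
```

Keeping failed events on a separate queue lets the main consumer keep moving through the stream instead of blocking on one bad write, and capping the attempts matches the best-effort consistency goal: an event that keeps failing is eventually dropped rather than stalling replication.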