How Netflix Uses AWS for Multi-Region Cache Replication (NFX304)

Title

AWS re:Invent 2023 - How Netflix uses AWS for multi-Region cache replication (NFX304)

Summary

  • Prateek Sharma, a Principal Solutions Architect at AWS, introduces the session, highlighting how Netflix uses AWS to deliver personalized user experiences and resiliency.
  • Sriram Chettykarnaburuvan and Prithvirajani, engineers from Netflix, discuss the challenges and solutions for global cache replication across multiple AWS regions.
  • They explain the importance of cache replication for maintaining low latency and reducing database load during region failovers.
  • The session covers the design and architecture of Netflix's EVCache, a distributed, sharded key-value store based on Memcached that supports both in-Region and global replication.
  • The engineers discuss the use of Kafka for event streaming, SQS for retries, and the importance of auto-scaling, observability, and alerting in managing the system.
  • They share insights into efficiency improvements, such as batch compression and removing network load balancers, which resulted in significant cost savings.
  • The session concludes with a look into the life of an EVCache engineer at Netflix, detailing incident response and testing strategies, and future plans for the replication service.
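
The event-stream-plus-retry pattern mentioned in the summary can be illustrated with a small sketch. This is not Netflix's implementation: the in-memory deques only stand in for Kafka and SQS, and every name here (ReplicationEvent, RemoteCache, replicate, max_attempts) is hypothetical. It only shows the general shape of best-effort replication, where a failed event is parked on a retry queue and dropped after a bounded number of attempts.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicationEvent:
    """Hypothetical replication event; the real event format is not given in the source."""
    key: str
    op: str                      # "SET" or "DELETE"
    value: Optional[str] = None
    attempts: int = 0


class RemoteCache:
    """Stand-in for a cache cluster in another Region (assumption for this sketch)."""

    def __init__(self, fail_first_n=0):
        self.data = {}
        self._failures_left = fail_first_n

    def apply(self, event):
        # Simulate transient cross-Region failures for the first N calls.
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("simulated cross-Region failure")
        if event.op == "SET":
            self.data[event.key] = event.value
        elif event.op == "DELETE":
            self.data.pop(event.key, None)


def replicate(events, remote, max_attempts=3):
    """Apply events to the remote cache, retrying failures a bounded number of times.

    Returns the events that were dropped after max_attempts (best effort).
    """
    stream = deque(events)   # stands in for the Kafka topic
    retry_queue = deque()    # stands in for the SQS retry queue
    dropped = []
    while stream or retry_queue:
        # Prefer fresh events; fall back to retries once the stream is drained.
        event = stream.popleft() if stream else retry_queue.popleft()
        event.attempts += 1
        try:
            remote.apply(event)
        except ConnectionError:
            if event.attempts < max_attempts:
                retry_queue.append(event)
            else:
                dropped.append(event)
    return dropped
```

Dropping an event after a few attempts, rather than blocking the stream, is one way to favor availability over consistency; a production system would presumably also add backoff and metrics around the retry path.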

Insights

  • Netflix's EVCache system is a critical component for delivering personalized content quickly to users, handling 30 million replication events per second and storing 2 trillion items.
  • The replication service treats high availability as its top priority and offers only best-effort consistency, given the nature of cache data and the difficulty of providing strong consistency.
  • The use of Kafka and SQS in the replication service architecture allows for efficient handling of replication events and provides a robust retry mechanism for failed events.
  • Auto-scaling policies based on CPU, network, and queue size metrics are crucial for managing the replication service's resources effectively and ensuring it can handle traffic spikes without manual intervention.
  • Efficiency improvements like batch compression and the elimination of network load balancers have led to significant cost reductions, showcasing the importance of continuous optimization in cloud services.
  • The engineers' approach to incident response and testing in production environments emphasizes the need for thorough observability and the ability to test fixes in a live setting without affecting users.
  • Netflix's future plans for the replication service include migration to IPv6, containerization, and expanding its capabilities to support additional use cases like write-ahead logs and delayed queues.
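
To make the batch compression point above concrete, here is a minimal sketch comparing per-event compression with compressing a whole batch at once. The event shape, the use of JSON and zlib, and all function names are assumptions chosen for illustration; the source does not say which serialization format or codec Netflix uses.

```python
import json
import zlib


def make_sample_events(n):
    """Build n small, similar-looking events (hypothetical shape)."""
    return [
        {"key": f"user:{i}:home", "op": "SET", "region": "us-east-1"}
        for i in range(n)
    ]


def _encode(obj):
    # Compact JSON encoding, as bytes.
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def compress_individually(events):
    """Compress each event on its own; shared structure is not exploited."""
    return [zlib.compress(_encode(event)) for event in events]


def compress_batch(events):
    """Serialize the whole batch once, then compress it as a single payload."""
    return zlib.compress(_encode(events))


def decompress_batch(blob):
    """Inverse of compress_batch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    events = make_sample_events(100)
    individual_total = sum(len(blob) for blob in compress_individually(events))
    batched = len(compress_batch(events))
    print(f"individually compressed: {individual_total} bytes")
    print(f"batch compressed:        {batched} bytes")
```

Because the events in a batch repeat the same field names and similar values, the compressor can reuse that redundancy across events, which is why batching before compressing tends to shrink the bytes sent between Regions.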