Title

AWS re:Invent 2022 - The evolution of chaos engineering at Netflix (NFX303)

Summary

Rob Hilton introduces the session, highlighting Netflix's innovation in application resilience and chaos engineering.
Aleš Plšek discusses the history and evolution of chaos engineering at Netflix, starting with Chaos Monkey and progressing to more sophisticated tools like FIT (Failure Injection Testing).
Chaos Monkey was initially effective but became less valuable as services adapted to instance failures.
Netflix's architecture evolved to microservices, necessitating more granular failure injection, leading to the development of FIT.
FIT allows precise failure injections at various points (e.g., IPC libraries, databases) and scopes (e.g., instances, clusters, requests).
Game Days became a practice where teams would simulate outages to observe service behavior.
The introduction of request context (CRR) allowed for even more precise failure scoping to individual requests.
Netflix engineers began using FIT for chaos testing directly on their devices, integrating it into smoke tests and release cycles.
Canary deployments were combined with FIT to create automated chaos experiments, comparing baseline and canary clusters.
Netflix developed a monitoring system focused on member experience, collecting device and service events in real-time.
The experimentation platform CHAP (Chaos Automation Platform) allows engineers to run experiments in production and measure the impact of changes on users.
Infrastructure experimentation has expanded beyond chaos to include various types of experiments (e.g., sticky experiments, unscoped chaos, data experiments, squeeze tests).
The talk concludes with advice on adopting chaos engineering and infrastructure experimentation, emphasizing the journey and the next steps organizations can take.
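
The idea of scoping an injected failure to a single injection point and a single request, as described in the summary above, can be illustrated with a minimal sketch. This is not Netflix's FIT implementation or API. Every name here (`FailureScenario`, `RequestContext`, `call_dependency`, the `ipc:ratings-service` injection point, the `device_id` attribute) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field


class InjectedFailure(Exception):
    """Raised when a simulated failure fires at an injection point."""


@dataclass(frozen=True)
class FailureScenario:
    # Hypothetical model of "what to fail, and for whom".
    injection_point: str  # where to fail, e.g. an IPC call to one service
    scope_key: str        # request-context attribute used for scoping
    scope_value: str      # only requests carrying this value are affected


@dataclass
class RequestContext:
    # Metadata assumed to travel with a request across service calls.
    attributes: dict = field(default_factory=dict)


def in_scope(ctx, scenario, point):
    """True when this request, at this injection point, should fail."""
    return (
        point == scenario.injection_point
        and ctx.attributes.get(scenario.scope_key) == scenario.scope_value
    )


def call_dependency(ctx, point, scenarios, real_call):
    """Wrap a dependency call; fail it only for in-scope requests."""
    if any(in_scope(ctx, s, point) for s in scenarios):
        raise InjectedFailure(point)
    return real_call()


def get_ratings(ctx, scenarios):
    """Example caller that degrades gracefully when the dependency fails."""
    try:
        return call_dependency(
            ctx, "ipc:ratings-service", scenarios, lambda: ["5 stars"]
        )
    except InjectedFailure:
        return []  # fallback: render the page without ratings


scenarios = [FailureScenario("ipc:ratings-service", "device_id", "test-device-1")]
test_request = RequestContext({"device_id": "test-device-1"})
other_request = RequestContext({"device_id": "living-room-tv"})
```

With this setup, `get_ratings(test_request, scenarios)` exercises the fallback path while `get_ratings(other_request, scenarios)` is untouched, which is the property that makes it safe for an engineer to run a failure scenario against their own device.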

Insights

Netflix's approach to chaos engineering has significantly evolved, moving from simple instance failure tests to complex, granular, and automated experiments that can measure the direct impact on user experience.
The development of FIT and its integration into Netflix's deployment and testing processes demonstrates the importance of precision and control in chaos engineering.
The use of canary deployments combined with FIT represents a shift towards automated and safe deployment practices that can quickly identify and mitigate potential issues before they affect a large number of users.
Netflix's real-time monitoring system for member experience is a critical component of their infrastructure experimentation, allowing for immediate detection and response to issues during experiments.
The modular approach to designing experiments at Netflix, varying treatments, allocations, and scopes, provides a flexible framework that can be adapted to different needs and scenarios.
The talk emphasizes that adopting chaos engineering is a journey, suggesting that organizations can start with simple practices and gradually build up to more sophisticated experiments as they develop the necessary infrastructure and expertise.
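
The baseline-versus-canary comparison mentioned above can be sketched in a few lines. This is only an assumed, simplified decision rule: the metric (a failure count over a request count), the `max_delta` threshold, and the function names are all hypothetical, and the talk does not specify how Netflix's analysis is actually computed.

```python
def failure_rate(failures, requests):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return failures / requests if requests else 0.0


def canary_verdict(baseline, canary, max_delta=0.01):
    """Compare (failures, requests) pairs for the two clusters.

    Returns "abort" when the canary's failure rate exceeds the
    baseline's by more than max_delta, otherwise "continue".
    """
    delta = failure_rate(*canary) - failure_rate(*baseline)
    return "abort" if delta > max_delta else "continue"


# Baseline cluster receives no failure injection; canary cluster does.
healthy = canary_verdict(baseline=(10, 1000), canary=(12, 1000))
degraded = canary_verdict(baseline=(10, 1000), canary=(50, 1000))
```

Here `healthy` evaluates to "continue" and `degraded` to "abort". A real system would also need statistical significance checks and a way to stop the experiment as soon as the gap appears, both of which this sketch leaves out.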
