Title

AWS re:Invent 2023 - Practice like you play: How Amazon scales resilience to new heights (ARC316)

Summary

Downtime costs companies an average of $300,000 per hour across industries, and 46% of companies cannot serve their customers while they are down.
Amazon Prime Video streams sports events, including Thursday Night Football, and has seen a 26% increase in audience, equating to over 12 million viewers.
Prime Video applies lessons from sports team preparations to engineering team preparations for peak workloads, creating a resilience playbook.
Olga Hall, Director of Availability and Resilience Engineering at Prime Video Sports, and Lauren Domb, Chief Technologist at AWS, discuss building a winning mindset and training teams for resilience.
Prime Video's approach includes a global strategy with local execution, predictive customer behavior analysis, and preparation for unpredictable events.
The session covers operational readiness scores, automated load testing, chaos engineering, and the importance of observability and reporting.
Prime Video's resilience portal, game day simulations, and Correction of Error (COE) process are highlighted as key tools for maintaining service availability.
The talk concludes with the importance of practicing like you play, creating muscle memory, and being prepared for both predictable and unpredictable scenarios.

Insights

The cost of downtime is significant across industries, emphasizing the need for resilience and reliability in services.
Prime Video's success in streaming sports events at scale is attributed to their resilience playbook, which includes preparation strategies similar to those used by sports teams.
The concept of "Think global, act local" is crucial for distributed teams working on global events, ensuring alignment on goals and independent execution.
Prime Video's operational readiness score is a structured way to measure and improve service availability. It combines deployment safety, code coverage, completion and review of operational readiness checks, and follow-through on Correction of Error (COE) actions.
Automated load testing, known as "game days," is conducted regularly to ensure system readiness and identify issues at scale.
Chaos engineering is integrated into Prime Video's resilience strategy, with both low-risk and high-risk experiments that test system behavior under various fault conditions.
Observability tools, such as service graphs and real-time availability dashboards, are essential for monitoring service health and responding to incidents.
The importance of creating a culture of proactive reliability through continuous training, experimentation, and analysis is emphasized, drawing parallels to sports teams' practice routines.
The session encourages attendees to develop their own resilience playbooks tailored to their teams and industries, using insights from Prime Video's approach to resilience and chaos engineering.
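
As an illustration of the operational readiness score described above, here is a minimal sketch of how such a score might be computed. The summary names the components but not how they are combined, so the equal weights, the 0-to-1 scale for each component, the field names, and the functions `readiness_score` and `weakest_component` are all assumptions made for this example, not Prime Video's actual method.

```python
from dataclasses import dataclass

# Hypothetical equal weights: the talk summary lists these components
# but does not say how they are weighted.
WEIGHTS = {
    "deployment_safety": 0.25,
    "code_coverage": 0.25,
    "orr_completion": 0.25,
    "coe_actions_closed": 0.25,
}


@dataclass
class ServiceReadiness:
    """Per-service inputs, each expressed as a fraction from 0.0 to 1.0."""
    name: str
    deployment_safety: float   # share of deployment safety checks in place
    code_coverage: float       # test coverage
    orr_completion: float      # operational readiness checks completed and reviewed
    coe_actions_closed: float  # Correction of Error action items closed


def readiness_score(service: ServiceReadiness) -> float:
    """Return a weighted score from 0 to 100 for one service."""
    total = 0.0
    for component, weight in WEIGHTS.items():
        value = getattr(service, component)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{component} must be between 0 and 1, got {value}")
        total += weight * value
    return round(100 * total, 1)


def weakest_component(service: ServiceReadiness) -> str:
    """Return the lowest-scoring component, i.e. where to invest first."""
    return min(WEIGHTS, key=lambda component: getattr(service, component))


if __name__ == "__main__":
    playback = ServiceReadiness(
        name="playback",  # hypothetical service name
        deployment_safety=1.0,
        code_coverage=0.8,
        orr_completion=0.6,
        coe_actions_closed=0.4,
    )
    print(playback.name, readiness_score(playback), weakest_component(playback))
```

A single number per service makes it easy to compare teams ahead of a peak event, while the weakest component points each team at its most useful next action. Teams adapting this idea would choose weights and inputs that reflect their own risks.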
