Title
AWS re:Invent 2023 - Practice like you play: How Amazon scales resilience to new heights (ARC316)
Summary
- Downtime costs industries an average of $300,000 per hour, with 46% of companies unable to serve customers during downtime.
- Amazon Prime Video streams sports events, including Thursday Night Football, and has seen a 26% increase in audience, equating to over 12 million viewers.
- Prime Video applies lessons from how sports teams prepare for games to how its engineering teams prepare for peak workloads, and has captured them in a resilience playbook.
- Olga Hall, Director of Availability and Resilience Engineering at Prime Video Sports, and Lauren Domb, Chief Technologist at AWS, discuss building a winning mindset and training teams for resilience.
- Prime Video's approach includes a global strategy with local execution, predictive customer behavior analysis, and preparation for unpredictable events.
- The session covers operational readiness scores, automated load testing, chaos engineering, and the importance of observability and reporting.
- Prime Video's resilience portal, game day simulations, and Correction of Error (COE) process are highlighted as key tools for maintaining service availability.
- The talk concludes with the importance of practicing like you play, creating muscle memory, and being prepared for both predictable and unpredictable scenarios.
Insights
- The cost of downtime is significant across industries, emphasizing the need for resilience and reliability in services.
- Prime Video's success in streaming sports events at scale is attributed to their resilience playbook, which includes preparation strategies similar to those used by sports teams.
- The concept of "Think global, act local" is crucial for distributed teams working on global events, ensuring alignment on goals and independent execution.
- Prime Video's operational readiness score is a structured way to measure and improve service availability. It focuses on deployment safety, code coverage, operational readiness completion and review, and Correction of Error (COE) actions.
- Automated load testing, known as "game days," is conducted regularly to ensure system readiness and identify issues at scale.
- Chaos engineering is integrated into Prime Video's resilience strategy, with both low-risk and high-risk experiments that test system behavior under various fault conditions.
- Observability tools, such as service graphs and real-time availability dashboards, are essential for monitoring service health and responding to incidents.
- The importance of creating a culture of proactive reliability through continuous training, experimentation, and analysis is emphasized, drawing parallels to sports teams' practice routines.
- The session encourages attendees to develop their own resilience playbooks tailored to their teams and industries, using insights from Prime Video's approach to resilience and chaos engineering.
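
As an illustration of the operational readiness score mentioned in the Insights, the sketch below combines the four focus areas into a single 0-100 number. The talk summary does not say how Prime Video weights or combines these inputs, so the weights, field names, and the `readiness_score` function are all hypothetical and only show one way such a score might be computed.

```python
from dataclasses import dataclass

# Hypothetical weights: the source names the four focus areas but not
# how they are combined. These values are assumptions and sum to 1.0.
WEIGHTS = {
    "deployment_safety": 0.3,
    "code_coverage": 0.2,
    "readiness_review": 0.3,
    "coe_actions_closed": 0.2,
}


@dataclass
class ServiceReadiness:
    """Per-service inputs, each expressed as a fraction from 0.0 to 1.0."""
    name: str
    deployment_safety: float   # share of deployment safety checks passing
    code_coverage: float       # test coverage
    readiness_review: float    # share of readiness review items completed
    coe_actions_closed: float  # share of Correction of Error actions closed


def readiness_score(service: ServiceReadiness) -> float:
    """Return a weighted score from 0 to 100 (assumed scoring scheme)."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        value = getattr(service, field)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{field} must be between 0.0 and 1.0, got {value}")
        total += weight * value
    return round(100 * total, 1)


if __name__ == "__main__":
    playback = ServiceReadiness(
        name="playback",  # hypothetical service name
        deployment_safety=1.0,
        code_coverage=0.8,
        readiness_review=0.5,
        coe_actions_closed=1.0,
    )
    print(playback.name, readiness_score(playback))
```

A team building its own playbook would likely choose different inputs and weights; the point of a single score is that services can be compared and tracked over time ahead of a peak event.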