Practice like You Play How Amazon Scales Resilience to New Heights Arc316

Title

AWS re:Invent 2023 - Practice like you play: How Amazon scales resilience to new heights (ARC316)

Summary

  • Downtime costs companies across industries an average of $300,000 per hour, and 46% of companies cannot serve customers while they are down.
  • Amazon Prime Video streams sports events, including Thursday Night Football, and has seen a 26% increase in audience, equating to over 12 million viewers.
  • Prime Video applies lessons from sports team preparations to engineering team preparations for peak workloads, creating a resilience playbook.
  • Olga Hall, Director of Availability and Resilience Engineering at Prime Video Sports, and Lauren Domb, Chief Technologist at AWS, discuss building a winning mindset and training teams for resilience.
  • Prime Video's approach includes a global strategy with local execution, predictive customer behavior analysis, and preparation for unpredictable events.
  • The session covers operational readiness scores, automated load testing, chaos engineering, and the importance of observability and reporting.
  • Prime Video's resilience portal, game day simulations, and Correction of Error (COE) process are highlighted as key tools for maintaining service availability.
  • The talk concludes with the importance of practicing like you play, creating muscle memory, and being prepared for both predictable and unpredictable scenarios.
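
To put the downtime figure in the summary into perspective, the arithmetic can be sketched as follows. This is an illustrative calculation only: the $300,000 per hour average comes from the talk, while the availability targets and the function names are assumptions chosen for the example, not figures or tooling from Prime Video.

```python
# Illustrative arithmetic: expected yearly downtime and its cost for a given
# availability target, using the $300,000/hour average cited in the talk.
# The availability targets below are hypothetical examples.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year implied by an availability fraction (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float = 300_000.0) -> float:
    """Estimated yearly cost of downtime at the given hourly cost."""
    return annual_downtime_hours(availability) * cost_per_hour


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        hours = annual_downtime_hours(target)
        cost = annual_downtime_cost(target)
        print(f"{target:.2%} available: {hours:.2f} h/year down, about ${cost:,.0f}")
```

Under these assumptions, each additional "nine" of availability cuts the expected downtime cost by a factor of ten, which is one way to frame the investment in resilience practice.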

Insights

  • The cost of downtime is significant across industries, emphasizing the need for resilience and reliability in services.
  • Prime Video's success in streaming sports events at scale is attributed to its resilience playbook, which includes preparation strategies similar to those used by sports teams.
  • The concept of "Think global, act local" is crucial for distributed teams working on global events, ensuring alignment on goals and independent execution.
  • Prime Video's operational readiness score is a structured way to measure and improve service availability. It covers deployment safety, code coverage, operational readiness completion and review, and Correction of Error (COE) actions.
  • Automated load testing, known as "game days," is conducted regularly to ensure system readiness and identify issues at scale.
  • Chaos engineering is integrated into Prime Video's resilience strategy, including both low- and high-risk experiments to test system behavior under various fault conditions.
  • Observability tools, such as service graphs and real-time availability dashboards, are essential for monitoring service health and responding to incidents.
  • The importance of creating a culture of proactive reliability through continuous training, experimentation, and analysis is emphasized, drawing parallels to sports teams' practice routines.
  • The session encourages attendees to develop their own resilience playbooks tailored to their teams and industries, using insights from Prime Video's approach to resilience and chaos engineering.
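
For teams building their own playbook, the operational readiness score described above could be prototyped along these lines. This is a minimal sketch under stated assumptions: the four inputs mirror the areas named in the talk, but the equal weights, the 0 to 100 scale, and all names (ReadinessInputs, WEIGHTS, readiness_score) are hypothetical and do not describe Prime Video's actual formula.

```python
# Hypothetical readiness score: a weighted average of four inputs, each
# expressed as a fraction between 0 and 1. Weights and scale are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessInputs:
    deployment_safety: float   # e.g. share of deployment safety checks enabled
    code_coverage: float       # e.g. test coverage fraction
    orr_completion: float      # e.g. share of readiness review items completed
    coe_actions_closed: float  # e.g. share of COE action items closed


# Equal weights are an assumption; adjust them to your own priorities.
WEIGHTS = {
    "deployment_safety": 0.25,
    "code_coverage": 0.25,
    "orr_completion": 0.25,
    "coe_actions_closed": 0.25,
}


def readiness_score(inputs: ReadinessInputs) -> float:
    """Return a 0-100 score as the weighted average of the four inputs."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(inputs, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        total += weight * value
    return round(100.0 * total, 1)


if __name__ == "__main__":
    service = ReadinessInputs(
        deployment_safety=1.0,
        code_coverage=0.8,
        orr_completion=0.5,
        coe_actions_closed=0.5,
    )
    print(f"Readiness score: {readiness_score(service)}")
```

A single number like this is mainly useful for tracking a service over time and for comparing services before a major event. The individual inputs still need review, since a high average can hide one weak area.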