Title
AWS re:Invent 2022 - Building resilient multi-site workloads using AWS global services (SUP401)
Summary
- Presenters: Mike Hagen (Principal Technologist), James Aitken (Senior Software Development Manager), Ryan Schroeder (Site Reliability Engineer at Netflix).
- Key Topics: Fault isolation, static stability, AWS global services, best practices for multi-site resilience, AWS Health, Netflix case study.
- Fault Isolation: At the time of the session, AWS spanned 30 geographic Regions, each containing multiple Availability Zones (AZs). Regions and AZs act as fault isolation boundaries designed to prevent correlated failures.
- Static Stability: Systems should operate normally even with impaired dependencies, achieved by eliminating circular dependencies, pre-provisioning capacity, maintaining existing state, and eliminating synchronous interaction.
- AWS Services: Categorized into zonal, regional, and global services, with a focus on understanding control plane and data plane separation.
- Global Services: Avoid control plane actions in recovery paths, rely on data planes for recovery, pre-provision resources, and understand what will and won't work during control plane impairments.
- AWS Health: Provides notifications about AWS events impacting workloads, supports observability best practices, and offers a highly available API endpoint.
- Netflix Case Study: Demonstrates a multi-region active-active architecture, reliance on the AWS Health API, key metrics for service health, regular failover exercises, and internal operations adapted to support its impact mitigation strategy.
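The pre-provisioning point under Static Stability can be made concrete with a little arithmetic. The sketch below is not from the session; it assumes a workload that spreads identical instances evenly across AZs and wants the surviving AZs to carry peak load with no scaling action (a control plane call) during the event. The function names and the example figure of 12 instances are hypothetical.

```python
import math


def instances_per_az(peak_instances: int, az_count: int, az_losses_tolerated: int = 1) -> int:
    """Instances to run in every AZ so the surviving AZs alone can carry peak load."""
    surviving = az_count - az_losses_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive the tolerated losses")
    # Round up: a fractional instance still needs a whole instance.
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances: int, az_count: int, az_losses_tolerated: int = 1) -> float:
    """Total provisioned capacity divided by the capacity needed at peak."""
    total = instances_per_az(peak_instances, az_count, az_losses_tolerated) * az_count
    return total / peak_instances


if __name__ == "__main__":
    # Hypothetical workload that needs 12 instances at peak.
    for azs in (2, 3, 4):
        per_az = instances_per_az(12, azs)
        ratio = overprovision_ratio(12, azs)
        print(f"{azs} AZs: {per_az} per AZ, {ratio:.2f}x peak capacity provisioned")
```

Under these assumptions, spreading across more AZs lowers the extra capacity that has to sit idle to tolerate the loss of one AZ, which is one reason the trade-off is worth computing before choosing a layout.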
Insights
- Fault Isolation Boundaries: AWS's design of regions and AZs is critical for building resilient systems that can withstand localized failures without affecting the entire workload.
- Static Stability Importance: A statically stable workload keeps operating, without needing to make changes, when a dependency is impaired. This prevents any single service or component from becoming a single point of failure.
- Global Services Architecture: Understanding the architecture of global services, particularly the separation of control and data planes, is crucial for designing resilient and scalable applications.
- Best Practices for Recovery: Keeping control plane actions out of the recovery path, and relying on data planes and pre-provisioned resources instead, helps maintain uptime when a global service's control plane is impaired.
- AWS Health Integration: The integration of AWS Health into operational practices is a proactive measure to detect and respond to AWS-related issues, enhancing overall system resilience.
- Netflix's Resilience Strategy: Netflix's approach to resilience, including active-active deployment, data replication, and regular failover exercises, provides a practical example of how to implement the concepts discussed in the session.
- Operational Readiness: The importance of operational readiness and regular testing of failover mechanisms is highlighted by Netflix's example, showing that resilience is as much about people and processes as it is about technology.
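To illustrate the AWS Health Integration point, here is a minimal sketch of filtering health events down to the ones that matter for a given workload. It is not taken from the session and makes no network calls. It assumes events have already been fetched and are plain dictionaries whose field names (`service`, `region`, `eventTypeCategory`, `statusCode`) follow the event shape in the AWS Health API documentation, and that events for global services carry the region value `global`. The function name and the sample events are hypothetical.

```python
from typing import Iterable


def open_issues_for_workload(
    events: Iterable[dict],
    regions_in_use: set,
    services_in_use: set,
) -> list:
    """Return open 'issue' events that touch the workload's Regions and services."""
    # Assumption: global services report their events with region == "global".
    watched_regions = set(regions_in_use) | {"global"}
    relevant = []
    for event in events:
        if event.get("eventTypeCategory") != "issue":
            continue  # skip scheduled changes and account notifications
        if event.get("statusCode") != "open":
            continue  # skip closed and upcoming events
        if event.get("service") not in services_in_use:
            continue
        if event.get("region") not in watched_regions:
            continue
        relevant.append(event)
    return relevant


# Hypothetical sample events, shaped like the assumed fields above.
SAMPLE_EVENTS = [
    {"service": "EC2", "region": "us-east-1", "eventTypeCategory": "issue", "statusCode": "open"},
    {"service": "EC2", "region": "eu-west-1", "eventTypeCategory": "issue", "statusCode": "open"},
    {"service": "IAM", "region": "global", "eventTypeCategory": "issue", "statusCode": "open"},
    {"service": "RDS", "region": "us-east-1", "eventTypeCategory": "scheduledChange", "statusCode": "upcoming"},
    {"service": "EC2", "region": "us-east-1", "eventTypeCategory": "issue", "statusCode": "closed"},
]

if __name__ == "__main__":
    hits = open_issues_for_workload(SAMPLE_EVENTS, {"us-east-1", "us-west-2"}, {"EC2", "IAM", "RDS"})
    for event in hits:
        print(event["service"], event["region"])
```

A filter like this could feed an alerting or failover-decision pipeline so that operators see AWS-side issues next to their own service health metrics, rather than discovering them separately.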