Title
AWS re:Invent 2022 - Reducing your area of impact and surviving difficult days (ARC305)
Summary
- Bruno Emerrer and Byron Arnello, Solutions Architects at AWS, discuss architectural patterns to reduce the impact of failures and ensure resilience.
- They categorize failures into code deployments, data and state issues, third-party dependencies, core infrastructure, and highly unlikely scenarios.
- The session covers the importance of anticipating failures and protecting against them using architectural patterns like bulkhead and cell-based architectures.
- They emphasize the need for workload isolation and failure containment within cells, which are independent units of deployment with their own data and dependencies.
- Sharding and shuffle sharding are introduced as ways to further reduce the impact of failures.
- A routing layer, implemented with DNS or a custom cell router, is needed to direct traffic to the appropriate cell.
- The discussion includes real-world scenarios, operational considerations, and the trade-offs between smaller and larger cells.
- They conclude that cell-based architectures are not a one-size-fits-all solution and should be applied to critical applications where the added complexity is justified.
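To make the routing idea from the summary concrete, here is a minimal sketch of a custom cell router in Python. It is an illustration under stated assumptions, not the speakers' implementation: the cell names, the `route_to_cell` function, and the use of a hashed customer ID as the partition key are all hypothetical.

```python
import hashlib

# Hypothetical cell names; the session does not prescribe a naming scheme.
CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]


def route_to_cell(partition_key: str, cells: list[str] = CELLS) -> str:
    """Map a partition key (assumed here to be a customer ID) to exactly one cell.

    A stable hash keeps the mapping deterministic across router instances,
    so the router itself needs no shared state.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


if __name__ == "__main__":
    for customer in ("customer-1", "customer-2", "customer-3"):
        print(customer, "->", route_to_cell(customer))
```

Modulo hashing is used only to keep the sketch short. It remaps many keys whenever a cell is added or removed, so a production router would more plausibly use an explicit mapping table or consistent hashing to keep each customer pinned to its cell.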
Insights
- Cell-based architectures provide logical isolation, unlike the physical isolation offered by Availability Zones or Regions.
- Cells should not share data or interdependent logic to ensure true isolation and containment of failures.
- The complexity of managing cell-based architectures increases with the number of cells, requiring careful consideration of data modeling, routing, and operational overhead.
- Shuffle sharding is an advanced technique that gives each customer its own combination of nodes, so any two customers rarely share all of them and a failure caused by one customer affects far fewer of the others.
- The decision to use cell-based architectures should be based on the criticality of the application and the need for extreme resilience, as it introduces additional complexity and management challenges.
- The session highlights AWS's internal use of shuffle sharding in services like Amazon Route 53, demonstrating the practical application of these patterns at scale.
- Developers and architects need to consider the trade-offs between cell size, manageability, cost efficiency, and the degree of isolation when designing cell-based systems.
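The effect of shuffle sharding on the area of impact can be sketched in a few lines of Python. This is a simplified illustration, not the mechanism used by Amazon Route 53 or any other AWS service: the node names, the shard size, and the `shuffle_shard` and `full_overlap_fraction` helpers are assumptions made for the example.

```python
import hashlib
import random
from math import comb

# Hypothetical fleet of eight nodes.
NODES = [f"node-{i}" for i in range(8)]


def shuffle_shard(customer_id: str, nodes: list[str], shard_size: int) -> frozenset[str]:
    """Deterministically pick `shard_size` nodes for a customer.

    Seeding the generator from a hash of the customer ID means the same
    customer always receives the same combination of nodes.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(nodes, shard_size))


def full_overlap_fraction(node_count: int, shard_size: int) -> float:
    """Chance that another customer was assigned exactly the same node set,
    assuming combinations are chosen uniformly at random."""
    return 1 / comb(node_count, shard_size)


if __name__ == "__main__":
    print(sorted(shuffle_shard("customer-a", NODES, 2)))
    print(comb(8, 2))  # 28 possible two-node combinations
    print(full_overlap_fraction(8, 2))
```

With eight nodes split into four fixed shards of two, a customer that takes down its shard affects everyone on that shard, roughly a quarter of all customers. With shuffle sharding over the same eight nodes there are 28 possible two-node combinations, so only about 1 in 28 customers shares both nodes with the failing one. Customers that share a single node can still be served by their other node, provided clients retry.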