Reducing Your Area of Impact and Surviving Difficult Days (ARC305)

Title

AWS re:Invent 2022 - Reducing your area of impact and surviving difficult days (ARC305)

Summary

  • Bruno Emer and Byron Arnao, Solutions Architects at AWS, discuss architectural patterns that reduce the impact of failures and improve resilience.
  • They categorize failures into code deployments, data and state issues, third-party dependencies, core infrastructure, and highly unlikely scenarios.
  • The session covers the importance of anticipating failures and protecting against them using architectural patterns like bulkhead and cell-based architectures.
  • They emphasize the need for workload isolation and failure containment within cells, which are independent units of deployment with their own data and dependencies.
  • Sharding and shuffle sharding are introduced as ways to further reduce the impact of failures.
  • A routing layer is needed to direct each request to the appropriate cell; it can be implemented using DNS or a custom cell router.
  • The discussion includes real-world scenarios, operational considerations, and the trade-offs between smaller and larger cells.
  • They conclude that cell-based architectures are not a one-size-fits-all solution and should be applied to critical applications where the added complexity is justified.
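
The cell routing idea summarized above can be illustrated with a minimal sketch. It is not the speakers' implementation: the cell names, the `cell_for` and `impacted_customers` helpers, and the choice of a hash-based mapping are all assumptions made for illustration. The point it shows is that when each customer maps to exactly one cell, a failure in that cell touches only the customers routed to it.

```python
import hashlib

# Hypothetical cell names. Each cell is assumed to be a complete,
# independent deployment of the workload with its own data store.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to exactly one cell using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def impacted_customers(customers: list[str], failed_cell: str) -> list[str]:
    """Return the customers whose requests are routed to the failed cell."""
    return [c for c in customers if cell_for(c) == failed_cell]


customers = [f"customer-{i}" for i in range(1000)]
impacted = impacted_customers(customers, "cell-2")
print(f"{len(impacted)} of {len(customers)} customers are routed to the failed cell")
```

One design caveat: a modulo-based hash remaps many customers whenever the number of cells changes. A router that keeps an explicit customer-to-cell mapping table would avoid that and allow customers to be migrated between cells, at the cost of having state to manage in the routing layer.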

Insights

  • Cell-based architectures provide logical isolation rather than the physical isolation that Availability Zones or Regions provide.
  • Cells should not share data or interdependent logic to ensure true isolation and containment of failures.
  • The complexity of managing cell-based architectures increases with the number of cells, requiring careful consideration of data modeling, routing, and operational overhead.
  • Shuffle sharding is an advanced technique that can significantly reduce the impact of failures by assigning each customer its own pseudo-random combination of nodes, so that few customers share exactly the same set.
  • The decision to use cell-based architectures should be based on the criticality of the application and the need for extreme resilience, because the pattern introduces additional complexity and management challenges.
  • The session highlights AWS's internal use of shuffle sharding in services like Amazon Route 53, demonstrating the practical application of these patterns at scale.
  • Developers and architects need to consider the trade-offs between cell size, manageability, cost efficiency, and the degree of isolation when designing cell-based systems.
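
The effect of shuffle sharding on the area of impact can be sketched with a small simulation. This is an illustration under stated assumptions, not how Route 53 or any AWS service implements it: the pool of 8 nodes, the shard size of 2, and the `shuffle_shard` helper are all hypothetical. With plain sharding into 4 fixed pairs, losing one pair affects about a quarter of customers; with shuffle sharding, only customers assigned the identical pair lose all of their capacity.

```python
import hashlib
import random
from math import comb

NODES = [f"node-{i}" for i in range(8)]  # hypothetical worker pool
SHARD_SIZE = 2                           # assumed nodes per customer


def shuffle_shard(customer_id: str) -> frozenset[str]:
    """Pick a stable pseudo-random set of SHARD_SIZE nodes for a customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(NODES, SHARD_SIZE))


customers = [f"customer-{i}" for i in range(10_000)]
assignments = {c: shuffle_shard(c) for c in customers}

# Suppose a bad request pattern from customer-0 takes down both of its nodes.
bad_shard = assignments["customer-0"]
fully_impacted = [c for c, s in assignments.items() if s == bad_shard]
partially_impacted = [
    c for c, s in assignments.items() if s != bad_shard and s & bad_shard
]

# 8 nodes taken 2 at a time give 28 distinct shards.
print("possible shards:", comb(len(NODES), SHARD_SIZE))
print("share fully impacted:", len(fully_impacted) / len(customers))
print("share with one healthy node left:", len(partially_impacted) / len(customers))
```

Customers in the partially impacted group still have one healthy node, so this benefit relies on clients retrying or failing over to the remaining node in their shard.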