Title

AWS re:Invent 2023 - Deep dive into Amazon ECS resilience and availability (CON401)

Summary

Introduction: Maish Saidel-Keesing and Malcolm Featonby introduced the session, which focuses on the resilience and availability of Amazon ECS.
Amazon ECS Overview: ECS is AWS's native container orchestration service, supporting several deployment options including EC2, Fargate, and ECS Anywhere. It is integral to AWS, with over 2.25 billion tasks launched weekly, and is used by 65% of new AWS container customers.
Architectural Resilience and Availability: Emphasizing the need to embrace failure, the speakers discussed AWS's design principles for building resilient systems, including the use of AWS regions and availability zones.
ECS Architecture: ECS is deployed in every AWS Region and operates independently in at least three Availability Zones per Region. It is pre-scaled to 150% of peak capacity to ensure static stability: with three zones each sized for 50% of peak, losing one zone still leaves 100% of peak capacity.
ECS Service and Task Placement: ECS services describe application workloads with a desired count of tasks, which ECS ensures are spread across availability zones for resilience.
Operational Resilience: The speakers detailed how ECS uses rolling deployments, automated monitoring, and bake time to ensure changes are safely deployed and quickly rolled back if necessary.
Continuous Improvement: The session concluded with a discussion of the importance of continuous improvement through chaos engineering, game days, and the Correction of Error (COE) process.
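
The 150% figure and the zone spread described above can be checked with a little arithmetic. The following is a minimal sketch, assuming capacity is split evenly across zones and that tasks are spread as evenly as possible; the function names are illustrative and are not ECS APIs.

```python
from typing import Dict, List


def per_zone_capacity(peak_load: float, zones: int) -> float:
    """Capacity each zone needs so the remaining zones still cover
    peak load after any single zone is lost (even split assumed)."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone loss")
    return peak_load / (zones - 1)


def total_prescale_ratio(zones: int) -> float:
    """Total provisioned capacity as a multiple of peak load."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone loss")
    return zones / (zones - 1)


def spread_tasks(desired_count: int, zones: List[str]) -> Dict[str, int]:
    """Simplified model of an even spread of tasks across zones.
    Any remainder goes to the first zones in the list."""
    base, extra = divmod(desired_count, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


if __name__ == "__main__":
    # Three zones: each sized for 50% of peak, 150% in total.
    print(per_zone_capacity(100.0, 3))   # 50.0
    print(total_prescale_ratio(3))       # 1.5
    print(spread_tasks(7, ["az-a", "az-b", "az-c"]))
```

With three zones the total comes to 1.5 times peak, which matches the 150% pre-scaling mentioned in the summary. Adding zones lowers the overhead (four zones would need about 133%), at the cost of operating in more zones.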

Embracing Failure: AWS's approach to resilience involves expecting and planning for failure, rather than trying to prevent it entirely.
Static Stability: Pre-scaling services so they can absorb the loss of an Availability Zone without having to change anything (for example, launching new capacity) during the failure is a key principle for maintaining static stability.
Partitioning for Isolation: ECS uses partitions to isolate workloads and limit the blast radius of any failures, which also aids in scaling and software isolation.
Rolling Deployments: AWS employs a cautious approach to deploying changes, using rolling deployments to gradually introduce changes and monitor their impact.
Correction of Error (COE): AWS uses COEs to learn from incidents, focusing on understanding root causes and sharing knowledge to prevent recurrences.
Customer-Centric Improvements: Feedback from customers drives the creation of sessions like this one and the continuous improvement of AWS services.
Resources for Learning: AWS provides extensive resources, such as the Amazon Builders' Library and various re:Invent sessions, to help users understand and use AWS services effectively.
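
To make the partitioning and rolling-deployment points concrete, here is a minimal sketch of rolling a change through partitions one at a time, checking health after each step and rolling back on the first failure. This is an illustration under stated assumptions, not a description of the ECS deployment system: the `apply`, `healthy`, and `rollback` callbacks are hypothetical, and the health check stands in for the monitored bake time.

```python
from typing import Callable, List


def rolling_deploy(
    partitions: List[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change one partition at a time.

    Returns the partitions left running the new version. If any
    partition is unhealthy after the change, every partition touched
    so far is rolled back (newest first) and an empty list is returned.
    """
    deployed: List[str] = []
    for partition in partitions:
        apply(partition)
        deployed.append(partition)
        # Hypothetical stand-in for the bake period: in practice this
        # would watch alarms and metrics for some time before moving on.
        if not healthy(partition):
            for touched in reversed(deployed):
                rollback(touched)
            return []
    return deployed


if __name__ == "__main__":
    log: List[str] = []
    result = rolling_deploy(
        ["p1", "p2", "p3"],
        apply=lambda p: log.append(f"apply {p}"),
        healthy=lambda p: p != "p2",          # simulate a failure in p2
        rollback=lambda p: log.append(f"rollback {p}"),
    )
    print(result)
    print(log)
```

In this run the change never reaches `p3`: the failure in `p2` stops the rollout, so only the partitions already touched are affected and reverted. That is the blast-radius benefit the summary attributes to partitioning.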