Title: AWS re:Inforce 2024 - Expect the unexpected: Building resilience with AWS (CFS122)
Insights:
- Resilience Definition and Importance: Resilience in AWS refers to the ability of a workload or application to quickly respond to and recover from failure. This includes handling hardware failures, software bugs, network outages, and cyber-attacks. Unplanned system downtime carries significant financial and brand costs, which is why resilient systems matter.
- Three Pillars of Resilience: AWS defines resilience through three pillars:
  - High Availability: Ensuring applications can withstand common errors without impacting functionality.
  - Recoverability: Protecting and recovering applications from severe incidents like data center floods or ransomware attacks.
  - Continuous Resilience: Implementing operational guardrails, CI/CD pipelines, and observability to detect and respond to incidents proactively.
- Shared Responsibility Model: AWS handles the resilience of the cloud infrastructure, while customers are responsible for resilience within the cloud. This includes setting up backups, defining recovery point objectives (RPOs) and recovery time objectives (RTOs), and ensuring operational teams are prepared for emergencies.
- Resilience as a Continuous Journey: Building resilience is not a one-time task but requires constant iteration and learning from failures to improve architecture and operations.
- PwC’s Resiliency Journey Framework: PwC provides a structured approach to adopting resiliency, which includes evaluating resilience capabilities, designing resilient cloud architecture, building core infrastructure, enabling resilient applications, and continuously testing resiliency.
- Multi-Region and Multi-AZ Considerations: Organizations need to evaluate business impact, risk, and cost when designing disaster recovery strategies. Critical applications should have a secondary environment that mirrors the primary to ensure seamless failover.
- Pre-Recovery Steps: Setting up foundational environments, security, networking, identity, CI/CD tools, and continuous replication ahead of time can significantly reduce RTO and RPO during a disaster.
- PwC’s DR Orchestrator: An automated solution that reduces RTO from weeks to minutes and decreases operational overhead by automating failover and failback processes using AWS Step Functions and Lambda. It allows frequent DR testing with minimal developer impact and provides regulatory compliance evidence.
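
The RPO and RTO targets mentioned in the insights above can be checked against the timings recorded during a DR test. The following is a minimal Python sketch, not something shown in the talk. The class names, fields, and sample timestamps are all hypothetical, and it assumes that achieved RTO is measured from outage start to service restoration and achieved RPO from the last replicated write to outage start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Targets the customer defines (hypothetical structure)."""
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class DrillResult:
    """Timestamps captured during a DR test (hypothetical structure)."""
    outage_start: datetime
    service_restored: datetime
    last_replicated_write: datetime

    @property
    def achieved_rto(self) -> timedelta:
        # Time the workload was unavailable.
        return self.service_restored - self.outage_start

    @property
    def achieved_rpo(self) -> timedelta:
        # Data written after the last replicated write is assumed lost.
        return self.outage_start - self.last_replicated_write


def evaluate_drill(objectives: RecoveryObjectives, result: DrillResult) -> dict:
    """Compare achieved recovery times against the stated objectives."""
    return {
        "rto_met": result.achieved_rto <= objectives.rto,
        "rpo_met": result.achieved_rpo <= objectives.rpo,
        "achieved_rto": result.achieved_rto,
        "achieved_rpo": result.achieved_rpo,
    }


if __name__ == "__main__":
    # Sample values are illustrative only.
    objectives = RecoveryObjectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=1))
    drill = DrillResult(
        outage_start=datetime(2024, 6, 10, 10, 0, 0),
        service_restored=datetime(2024, 6, 10, 10, 12, 0),
        last_replicated_write=datetime(2024, 6, 10, 9, 59, 30),
    )
    print(evaluate_drill(objectives, drill))
```

A report like this, produced on every DR test, is one simple way to track whether pre-recovery work is actually moving RTO and RPO toward their targets.
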
Quotes:
- "Everything fails all the time."
- "Luck is not a resilience strategy."
- "Unplanned system downtime for the Fortune 1000 companies in the US costs about 1.2 to 1.5 billion dollars per year."
- "In AWS terms, resilience refers to the ability of a workload or application to be able to quickly respond and to recover from failure."
- "You need to detect incidents before your customers detect the incident."
- "Resilience is not a one and done type of job. It's something that requires constant iteration, constant learnings."
- "Deploying new workloads in a new region is now giving organizations an opportunity to revisit some of those legacy landing zones, adopt new control tower features, enhance their security policies, and apply more modern security policies."
- "The DR Orchestrator automates the disaster recovery of your critical AWS services to a healthy region in the event of a disaster, and also allows frequent DR testing with minimum developer impact."