Title

AWS re:Invent 2022 - AWS Incident Detection and Response (SUP201)

Summary

Michael Proctor, a site reliability engineer at Chase, discusses the importance of resiliency and redundancy in IT systems.
A customer survey reveals that many respondents have dealt with major IT outages in critical systems, highlighting the need for better incident detection and response.
AWS has released a new service called AWS Support Incident Detection and Response (IDR) to address these issues.
IDR helps customers define alerts for leading indicators of problems, automate detection, and integrate with AWS's incident management systems.
The service promises a response time of 15 minutes or less for incidents it detects.
Onboarding to IDR involves a Well-Architected review, setting up CloudWatch alarms, and creating runbooks for incident response.
Chase has successfully migrated Chase.com to AWS, leveraging IDR for incident detection and response.
The migration focused on achieving four nines (99.99%) of availability, cost-effectiveness, and end-to-end solution engineering.
Preparedness strategies include failure modes and effects analysis, game days, and continuous improvement through resiliency testing.
Chase's architecture for Chase.com includes multi-region, multi-AZ, and multi-account strategies for redundancy and isolation.
Multiple monitoring tools are used for observability, and alerts are integrated into Chase's corporate incident management process.
Best practices include clear measures for business impact, maturity in alerting, validation of runbooks, and allowing time for alert tuning in production.
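
To put the four-nines target from the Summary in concrete terms, the short sketch below converts an availability percentage into a downtime budget. The function name and the 365-day and 30-day periods are illustrative assumptions, not figures from the talk.

```python
def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Return the minutes of downtime allowed by an availability target.

    availability is a fraction (0.9999 for four nines); period_days is the
    length of the measurement window. Both are illustrative inputs.
    """
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


FOUR_NINES = 0.9999  # the availability target named in the Summary

per_year = downtime_budget_minutes(FOUR_NINES, 365)     # assumed 365-day year
per_30_days = downtime_budget_minutes(FOUR_NINES, 30)   # assumed 30-day month

print(f"Four nines allows about {per_year:.1f} minutes per year")
print(f"Four nines allows about {per_30_days:.2f} minutes per 30 days")
```

Under these assumptions the budget is roughly 52.6 minutes per year, or about 4.3 minutes per 30 days.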

Insights

The frequency and duration of IT incidents are higher than desired, leading to significant cost and an erosion of customer trust.
IDR aims to reduce the time taken to detect and respond to incidents by automating the detection process and integrating with AWS's internal systems.
The service is designed to be proactive rather than reactive, with the goal of preventing incidents from becoming customer-impacting.
Chase's experience with IDR during the migration of Chase.com demonstrates the service's effectiveness in a large-scale, real-world scenario.
The importance of preparedness is emphasized, with strategies such as failure modes and effects analysis and game days being crucial for resilience.
The use of multiple monitoring tools can be beneficial for comprehensive observability but requires careful selection and tuning of alerts to avoid information overload.
The cost of IDR is approximately 40% of the base cost of enterprise support, making it a potentially valuable investment for critical applications.
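
As a rough illustration of the CloudWatch alerting on leading indicators described above, the sketch below builds the arguments for one alarm in the shape of CloudWatch's PutMetricAlarm request. The application name, load balancer identifier, metric choice, and thresholds are all hypothetical; the talk does not specify them, and this is not presented as the configuration Chase or IDR uses.

```python
def leading_indicator_alarm(app: str, load_balancer: str) -> dict:
    """Build keyword arguments shaped like a CloudWatch PutMetricAlarm request.

    Every value here is an assumption chosen for illustration: the alarm
    fires when the load balancer returns more than 50 HTTP 5xx responses
    per minute for 3 consecutive minutes.
    """
    return {
        "AlarmName": f"{app}-elb-5xx-leading-indicator",  # hypothetical naming scheme
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 3,       # consecutive breaching datapoints required
        "Threshold": 50.0,            # hypothetical threshold, to be tuned in production
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Hypothetical identifiers, for illustration only.
params = leading_indicator_alarm("example-app", "app/example-lb/0123456789abcdef")

# With boto3 installed and AWS credentials configured, the alarm could be
# created with: boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```

Keeping the alarm definition as plain data makes it easy to review thresholds during the alert-tuning period before anything is created in an account.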
