Title
AWS re:Invent 2022 - AWS Incident Detection and Response (SUP201)
Summary
- Michael Proctor, a site reliability engineer at Chase, discusses the importance of resiliency and redundancy in IT systems.
- A customer survey reveals many have dealt with major IT outages in critical systems, highlighting the need for better incident detection and response.
- AWS has released a new service called AWS Support Incident Detection and Response (IDR) to address these issues.
- IDR helps customers define alerts for leading indicators of problems, automate detection, and integrate with AWS's incident management systems.
- The service promises a response time of 15 minutes or less for incidents it detects.
- Onboarding to IDR involves a Well-Architected review, setting up CloudWatch alarms, and creating runbooks for incident response.
- Chase has successfully migrated Chase.com to AWS, leveraging IDR for incident detection and response.
- The migration focused on achieving four nines (99.99%) of availability, cost-effectiveness, and end-to-end solution engineering.
- Preparedness strategies include failure modes and effects analysis, game days, and continuous improvement through resiliency testing.
- Chase's architecture for Chase.com includes multi-region, multi-AZ, and multi-account strategies for redundancy and isolation.
- Multiple monitoring tools are used for observability, and alerts are integrated into Chase's corporate incident management process.
- Best practices include clear measures for business impact, maturity in alerting, validation of runbooks, and allowing time for alert tuning in production.
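To make the four-nines availability target mentioned above concrete, the short sketch below converts an availability percentage into a downtime budget. It is plain arithmetic rather than anything shown in the talk; the function name `downtime_budget_minutes` and the 30-day and 365-day periods are illustrative assumptions.

```python
# Minimal sketch: how much downtime does an availability target allow?
# The function name and the chosen periods are illustrative assumptions.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Return the minutes of downtime permitted over `days` at `availability`.

    `availability` is a fraction, e.g. 0.9999 for four nines.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        budget = downtime_budget_minutes(0.9999, days)
        print(f"{label}: {budget:.2f} minutes of allowed downtime")
```

At 99.99%, the budget works out to roughly 4.3 minutes per 30-day month and roughly 52.6 minutes per year, which is why fast detection and response matter so much for an application held to this target.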
Insights
- IT incidents are more frequent and last longer than organizations want, which imposes significant costs and erodes customer trust.
- IDR aims to reduce the time taken to detect and respond to incidents by automating the detection process and integrating with AWS's internal systems.
- The service is designed to be proactive rather than reactive, with the goal of preventing incidents from becoming customer-impacting.
- Chase's experience with IDR during the migration of Chase.com demonstrates the service's effectiveness in a large-scale, real-world scenario.
- The importance of preparedness is emphasized, with strategies such as failure modes and effects analysis and game days being crucial for resilience.
- The use of multiple monitoring tools can be beneficial for comprehensive observability but requires careful selection and tuning of alerts to avoid information overload.
- IDR costs approximately 40% of the base cost of Enterprise Support, which may be a worthwhile investment for critical applications.
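The point above about tuning alerts to avoid information overload can be illustrated with an "M out of N" rule, where an alarm fires only if at least M of the last N datapoints breach a threshold. The sketch below is a simplified stand-in for that idea, not CloudWatch's actual evaluation logic (it ignores missing data, for example). The function name `should_alarm`, the 5.0 threshold, and the sample error rates are all hypothetical.

```python
# Minimal sketch of an "M out of N" alarm rule for reducing alert noise.
# This is a simplified illustration, not CloudWatch's real evaluation logic;
# the function name, threshold, and sample data are hypothetical.
from typing import Sequence


def should_alarm(
    datapoints: Sequence[float],
    threshold: float,
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> bool:
    """Return True if at least M of the last N datapoints exceed the threshold.

    M is `datapoints_to_alarm` and N is `evaluation_periods`.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    window = datapoints[-evaluation_periods:]  # most recent N datapoints
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


if __name__ == "__main__":
    # Hypothetical per-minute error rates (percent).
    single_spike = [0.2, 0.3, 0.4, 0.2, 6.1]
    sustained = [0.3, 5.5, 6.2, 4.8, 7.0]

    # A 1-of-1 rule fires on a single noisy datapoint.
    print(should_alarm(single_spike, 5.0, 1, 1))
    # A 3-of-5 rule ignores the lone spike but catches a sustained problem.
    print(should_alarm(single_spike, 5.0, 3, 5))
    print(should_alarm(sustained, 5.0, 3, 5))
```

A stricter rule trades detection speed for fewer false alarms, which is one reason to allow time for alert tuning in production before relying on an alert to trigger incident response.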