AWS Incident Detection and Response (SUP201)

Title

AWS re:Invent 2022 - AWS Incident Detection and Response (SUP201)

Summary

  • Michael Proctor, a site reliability engineer at Chase, discusses the importance of resiliency and redundancy in IT systems.
  • A customer survey reveals that many respondents have dealt with major outages in critical IT systems, highlighting the need for better incident detection and response.
  • AWS has released a new AWS Support offering, AWS Incident Detection and Response (IDR), to address these issues.
  • IDR helps customers define alerts for leading indicators of problems, automate detection, and integrate with AWS's incident management systems.
  • The service promises a response time of 15 minutes or less for incidents it detects.
  • Onboarding to IDR involves an AWS Well-Architected review, setting up Amazon CloudWatch alarms, and creating runbooks for incident response.
  • Chase has successfully migrated Chase.com to AWS, leveraging IDR for incident detection and response.
  • The migration focused on achieving four nines (99.99%) of availability, cost-effectiveness, and end-to-end solution engineering.
  • Preparedness strategies include failure modes and effects analysis, game days, and continuous improvement through resiliency testing.
  • Chase's architecture for Chase.com includes multi-region, multi-AZ, and multi-account strategies for redundancy and isolation.
  • Multiple monitoring tools are used for observability, and alerts are integrated into Chase's corporate incident management process.
  • Best practices include clear measures for business impact, maturity in alerting, validation of runbooks, and allowing time for alert tuning in production.
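
To make the "four nines" availability target in the summary concrete, the sketch below works out how much downtime 99.99% availability allows. It is illustrative arithmetic only; the function name and the 365-day and 30-day windows are assumptions chosen for this example, not figures from the talk.

```python
# Allowed downtime for a given availability target (illustrative arithmetic only).
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Return the minutes of downtime permitted over `days` at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * days * MINUTES_PER_DAY


four_nines = 0.9999  # 99.99% availability
per_year = downtime_budget_minutes(four_nines, 365)
per_30_days = downtime_budget_minutes(four_nines, 30)
print(f"per year: {per_year:.2f} min, per 30 days: {per_30_days:.2f} min")
```

Under these assumptions the budget is roughly 52.56 minutes per year, or about 4.32 minutes in a 30-day window, which helps explain why fast detection and response matter for a target at this level.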

Insights

  • The frequency and duration of IT incidents are higher than desired, leading to significant cost and eroding customer trust.
  • IDR aims to reduce the time taken to detect and respond to incidents by automating the detection process and integrating with AWS's internal systems.
  • The service is designed to be proactive rather than reactive, with the goal of preventing incidents from becoming customer-impacting.
  • Chase's experience with IDR during the migration of Chase.com demonstrates the service's effectiveness in a large-scale, real-world scenario.
  • The importance of preparedness is emphasized, with strategies such as failure modes and effects analysis and game days being crucial for resilience.
  • The use of multiple monitoring tools can be beneficial for comprehensive observability but requires careful selection and tuning of alerts to avoid information overload.
  • IDR costs approximately 40% of the base cost of AWS Enterprise Support, which may be a worthwhile investment for critical applications.
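
One common way to tune alerts so that brief spikes do not add to information overload is an "M out of N" rule: alarm only when at least M of the last N datapoints breach a threshold. The sketch below is a simplified, hypothetical illustration of that idea. The parameter names echo the EvaluationPeriods and DatapointsToAlarm settings of CloudWatch alarms, but this is not CloudWatch's implementation (it ignores missing-data handling, for instance), and the sample error rates are invented for the example rather than taken from the talk.

```python
from typing import Sequence


def should_alarm(datapoints: Sequence[float], threshold: float,
                 evaluation_periods: int, datapoints_to_alarm: int) -> bool:
    """Return True if at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above `threshold`."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # not enough data yet to evaluate the full window
    breaches = sum(1 for value in window if value > threshold)
    return breaches >= datapoints_to_alarm


# Hypothetical error-rate samples (percent), one per minute.
single_spike = [0.2, 0.3, 4.1, 0.2, 0.3]
sustained = [0.2, 3.8, 4.1, 0.4, 5.0]

spike_fires = should_alarm(single_spike, threshold=1.0,
                           evaluation_periods=5, datapoints_to_alarm=3)
sustained_fires = should_alarm(sustained, threshold=1.0,
                               evaluation_periods=5, datapoints_to_alarm=3)
print(spike_fires, sustained_fires)
```

With a 3-out-of-5 rule, the single spike does not fire while the sustained breach does. The trade-off is that a stricter rule reduces noise but delays detection, which is one reason to allow time for alert tuning in production.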