Detecting and Mitigating Gray Failures (ARC310)

Title

AWS re:Invent 2023 - Detecting and mitigating gray failures (ARC310)

Summary

  • Presenter: Michael Haken, Senior Principal Solutions Architect at AWS.
  • Topic: Gray failures and their impact on system resilience.
  • Key Points:
    • Gray failures are characterized by differential observability, where the system may appear healthy from one perspective but not from another.
    • A single impaired host can cause a significant drop in overall service availability.
    • Gray failures often occur along fault isolation boundaries such as regions, AZs, instances, and software modules.
    • Detecting gray failures requires deep health checks and custom observability beyond what AWS provides out of the box.
    • Outlier detection and composite alarms are effective for identifying gray failures.
    • Mitigation strategies include replacing instances and evacuating affected AZs.
    • Data plane actions are preferred over control plane actions for AZ evacuation.
    • AWS offers several resources and services to aid in resilience, including AWS Resilience Hub, AWS Fault Injection Simulator, AWS Elastic Disaster Recovery, AWS Backup, and the zonal shift feature of Amazon Route 53 Application Recovery Controller.
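
The single-host and outlier-detection points above can be illustrated with a small sketch. It is not the method from the talk: the median-based rule, the function names, the thresholds, and the fleet data are all assumptions chosen for brevity.

```python
from statistics import median


def fleet_availability(per_host):
    """Overall availability for a dict of host -> (successes, total)."""
    ok = sum(successes for successes, _ in per_host.values())
    total = sum(total for _, total in per_host.values())
    return ok / total


def find_outliers(per_host, factor=3.0, min_error_rate=0.01):
    """Flag hosts whose error rate is far above the fleet median.

    Both `factor` and `min_error_rate` are illustrative values, not
    recommendations from the talk.
    """
    rates = {h: 1 - s / t for h, (s, t) in per_host.items() if t > 0}
    baseline = median(rates.values())
    # The floor keeps a near-zero median from flagging ordinary noise.
    threshold = max(baseline * factor, min_error_rate)
    return sorted(h for h, r in rates.items() if r > threshold)


# Hypothetical fleet: ten hosts with 1,000 requests each.
# Nine hosts fail 1 request; one host fails half of its requests.
fleet = {f"host-{i}": (999, 1000) for i in range(9)}
fleet["host-9"] = (500, 1000)

print(fleet_availability(fleet))
print(find_outliers(fleet))
```

In this made-up fleet, one bad host out of ten pulls availability down to roughly 94.9%, and the rule flags only that host. The same rule also shows why caution is needed: if every host degrades together, the median rises with them and nothing is flagged, so outlier detection complements absolute alarms rather than replacing them.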

Insights

  • Gray Failures:

    • They are subtle and often go undetected by standard health checks.
    • They can have a disproportionate impact on user experience despite appearing minor or isolated.
    • Differential observability is a key concept, highlighting the need for perspective-aware monitoring.
  • Detection:

    • Requires explicitly instrumenting application code for observability.
    • CloudWatch's embedded metric format and Contributor Insights are valuable for monitoring and detecting anomalies.
    • Outlier detection algorithms can automate the identification of gray failures, but they must be used with caution to avoid masking real issues or generating false positives.
  • Mitigation:

    • The simplest mitigation strategy is to replace the instance experiencing a gray failure.
    • For single-AZ failures, options include waiting it out, evacuating the AZ, or failing over to another Region.
    • Evacuating an AZ requires an AZ-independent architecture. It can rely on data plane actions (for example, a solution built on API Gateway and DynamoDB) or on control plane actions for resources such as Auto Scaling groups.
  • AWS Services and Tools:

    • AWS provides a suite of tools to help design, implement, and operate resilient systems.
    • The Resilience Life Cycle Framework offers a structured approach to resilience.
    • The zonal shift feature of Amazon Route 53 Application Recovery Controller facilitates evacuating an AZ during a gray failure.
  • Best Practices:

    • Enrich metrics with dimensions aligned to fault isolation boundaries for better visibility and troubleshooting.
    • Use a combination of deep health checks, outlier detection, and composite alarms for comprehensive monitoring.
    • Prefer data plane actions over control plane actions for recovery operations to increase reliability.
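
As a rough illustration of enriching metrics with dimensions aligned to fault isolation boundaries, the sketch below builds one log record in CloudWatch embedded metric format using only the standard library. The namespace, dimension names, metric names, and sample values are hypothetical, and how the JSON line reaches CloudWatch Logs (for example through the CloudWatch agent or a Lambda log stream) is left out.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one embedded metric format (EMF) record as a dict.

    dimensions: dict of dimension name -> string value
    metrics:    dict of metric name -> (numeric value, unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension and metric values sit at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return record


# Hypothetical names and values, keyed by AZ and instance.
record = emf_record(
    namespace="MyService",
    dimensions={"AZ-ID": "use1-az1", "InstanceId": "i-0123456789abcdef0"},
    metrics={"Latency": (42.0, "Milliseconds"), "Fault": (0, "Count")},
    timestamp_ms=1700000000000,
)
print(json.dumps(record))
```

Because each record carries the AZ and instance as dimensions, the resulting metrics can be alarmed on per AZ or per instance, which is what makes a fault confined to one boundary visible.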