Title
AWS re:Invent 2023 - Detecting and mitigating gray failures (ARC310)
Summary
- Presenter: Michael Haken, Senior Principal Solutions Architect at AWS.
- Topic: Gray failures and their impact on system resilience.
- Key Points:
- Gray failures are characterized by differential observability, where the system may appear healthy from one perspective but not from another.
- A single host can cause significant drops in service availability.
- Gray failures often occur along fault isolation boundaries such as Regions, Availability Zones (AZs), instances, and software modules.
- Detection of gray failures requires deeper health checks and observability tools beyond what AWS provides out of the box.
- Outlier detection and composite alarms are effective for identifying gray failures.
- Mitigation strategies include replacing instances and evacuating affected AZs.
- Data plane actions are preferred over control plane actions for AZ evacuation.
- AWS offers several resources and services to aid in resilience, including AWS Resilience Hub, AWS Fault Injection Simulator, AWS Elastic Disaster Recovery, AWS Backup, and the zonal shift feature of Amazon Route 53 Application Recovery Controller.
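The outlier detection mentioned in the key points can be sketched with a simple static rule: flag an AZ when it accounts for a disproportionate share of all errors. This is only an illustration, not the method shown in the session. The function name, the 70% threshold, the minimum error count, and the sample AZ IDs are all assumptions chosen for the example.

```python
from typing import Dict, Optional


def find_outlier_az(errors_by_az: Dict[str, int],
                    share_threshold: float = 0.7,
                    min_errors: int = 20) -> Optional[str]:
    """Return the AZ responsible for a disproportionate share of errors, if any."""
    total = sum(errors_by_az.values())
    # With very few errors, any skew is likely noise, so report nothing.
    if total < min_errors:
        return None
    worst_az = max(errors_by_az, key=errors_by_az.get)
    share = errors_by_az[worst_az] / total
    # Only flag the AZ when its share of errors crosses the (assumed) threshold.
    return worst_az if share >= share_threshold else None


# Hypothetical per-AZ error counts for one evaluation period.
skewed = {"use1-az1": 4, "use1-az2": 93, "use1-az3": 3}
even = {"use1-az1": 30, "use1-az2": 35, "use1-az3": 35}

print(find_outlier_az(skewed))  # use1-az2
print(find_outlier_az(even))    # None
```

The minimum error count is the guard against false positives: at low volume, a single failed request can make one AZ look like an outlier.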
Insights
- Gray Failures:
- They are subtle and often go undetected by standard health checks.
- They can have a disproportionate impact on user experience despite appearing minor or isolated.
- Differential observability is a key concept, highlighting the need for perspective-aware monitoring.
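Differential observability and the outsized effect of one bad host can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the `Host` class, its fields, and the even spread of requests across hosts are hypothetical simplifications, not a model of any specific load balancer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Host:
    name: str
    process_up: bool = True     # what a shallow health check observes
    dependency_ok: bool = True  # what a customer request actually needs

    def shallow_health_check(self) -> bool:
        return self.process_up

    def handle_request(self) -> bool:
        return self.process_up and self.dependency_ok


def fleet_view(hosts: List[Host]) -> Tuple[int, float]:
    """Return (hosts passing the health check, customer-observed availability)."""
    in_service = [h for h in hosts if h.shallow_health_check()]
    if not in_service:
        return 0, 0.0
    # Assumption: requests are spread evenly over hosts that pass the check.
    successes = sum(h.handle_request() for h in in_service)
    return len(in_service), successes / len(in_service)


hosts = [Host(f"host-{i}") for i in range(10)]
hosts[3].dependency_ok = False  # one host has a gray failure

healthy, availability = fleet_view(hosts)
print(f"{healthy}/{len(hosts)} hosts pass the health check")    # 10/10
print(f"customer-observed availability: {availability:.0%}")    # 90%
```

From the health check's perspective the fleet is fully healthy, while customers see one request in ten fail.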
- Detection:
- Requires explicit code for observability and instrumentation.
- CloudWatch's embedded metric format and Contributor Insights are valuable for monitoring and detecting anomalies.
- Outlier detection algorithms can automate the identification of gray failures, but they must be used with caution to avoid masking real issues or generating false positives.
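As a rough illustration of the instrumentation described above, the sketch below builds one log record in CloudWatch's embedded metric format. The namespace, metric name, dimension names, and instance ID are placeholder assumptions. The instance ID is kept as a plain log property rather than a metric dimension, on the assumption that high-cardinality fields are better analyzed from the logs (for example with Contributor Insights) than turned into separate metrics.

```python
import json
import time
from typing import Dict, Optional


def emf_record(namespace: str, dimensions: Dict[str, str], metric_name: str,
               value: float, unit: str = "Count",
               properties: Optional[Dict[str, str]] = None) -> dict:
    """Build one embedded metric format (EMF) log record as a dict."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing every dimension key.
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
    }
    record.update(dimensions)        # dimension values live at the root
    record.update(properties or {})  # extra fields stay searchable in logs
    return record


record = emf_record(
    namespace="ExampleService",  # hypothetical namespace
    dimensions={"AZ-ID": "use1-az1", "Operation": "GetItem"},
    metric_name="Fault",
    value=1,
    properties={"InstanceId": "i-0123456789abcdef0"},  # placeholder ID
)
print(json.dumps(record))
```

The record only becomes a metric when the JSON line reaches CloudWatch Logs through a path that processes the format, such as Lambda logging or the CloudWatch agent.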
- Mitigation:
- The simplest mitigation strategy is to replace the instance experiencing a gray failure.
- For a single-AZ failure, options include waiting it out, evacuating the AZ, or failing over to another Region.
- Evacuating an AZ requires an AZ-independent architecture. Traffic can be shifted with data plane actions (for example, an evacuation signal built on API Gateway and DynamoDB), while resources such as Auto Scaling groups require control plane actions.
- AWS Services and Tools:
- AWS provides a suite of tools to help design, implement, and operate resilient systems.
- The Resilience Life Cycle Framework offers a structured approach to resilience.
- AWS services like Application Recovery Controller's Zonal Shift feature facilitate the evacuation of AZs during gray failures.
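A zonal shift can be started programmatically. The sketch below wraps the `start_zonal_shift` call of the boto3 `arc-zonal-shift` client. The wrapper function, the recording stub used for a dry run, the load balancer ARN, and the one-hour expiry are assumptions made for illustration.

```python
def start_az_evacuation(client, resource_arn: str, az_id: str,
                        hours: int = 1, reason: str = "gray failure in AZ"):
    """Ask zonal shift to move traffic for one resource away from one AZ."""
    return client.start_zonal_shift(
        resourceIdentifier=resource_arn,
        awayFrom=az_id,            # AZ ID, for example "use1-az1"
        expiresIn=f"{hours}h",     # shifts are temporary and must expire
        comment=reason,
    )


class RecordingClient:
    """Hypothetical stand-in for the real client, so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def start_zonal_shift(self, **kwargs):
        self.calls.append(kwargs)
        return {"status": "ACTIVE", **kwargs}


# Real use (requires AWS credentials and a supported resource):
#   import boto3
#   client = boto3.client("arc-zonal-shift")
client = RecordingClient()
response = start_az_evacuation(
    client,
    # Placeholder ARN, not a real load balancer.
    resource_arn="arn:aws:elasticloadbalancing:us-east-1:111122223333:"
                 "loadbalancer/app/example-alb/0123456789abcdef",
    az_id="use1-az1",
)
print(response["awayFrom"], response["expiresIn"])  # use1-az1 1h
```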
- Best Practices:
- Enrich metrics with dimensions aligned to fault isolation boundaries for better visibility and troubleshooting.
- Use a combination of deep health checks, outlier detection, and composite alarms for comprehensive monitoring.
- Prefer data plane actions over control plane actions for recovery operations to increase reliability.
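One way to combine per-AZ alarms into a composite alarm is sketched below: the rule fires only when one AZ's alarm is in ALARM state and the other AZs' alarms are not, which suggests an impairment isolated to that AZ. The alarm names and the helper function are hypothetical. The rule syntax (`ALARM(...)`, `AND`, `OR`, `NOT`) is the CloudWatch composite alarm rule expression language.

```python
from typing import List


def az_impairment_rule(az_alarm: str, other_az_alarms: List[str]) -> str:
    """Build a composite alarm rule: this AZ is alarming and the others are not."""
    this_az = f'ALARM("{az_alarm}")'
    if not other_az_alarms:
        return this_az
    others = " OR ".join(f'ALARM("{name}")' for name in other_az_alarms)
    return f"{this_az} AND NOT ({others})"


# Hypothetical names of existing per-AZ availability alarms.
rule = az_impairment_rule(
    "az1-availability",
    ["az2-availability", "az3-availability"],
)
print(rule)

# The string could then be passed as AlarmRule to the CloudWatch
# put_composite_alarm API, for example:
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="az1-isolated-impairment", AlarmRule=rule)
```

If every AZ is alarming at once, the rule stays quiet, since the problem is then unlikely to be confined to a single AZ and evacuating one of them would not help.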