Title

AWS re:Invent 2023 - Detecting and mitigating gray failures (ARC310)

Summary

Presenter: Michael Haken, Senior Principal Solutions Architect at AWS.
Topic: Gray failures and their impact on system resilience.
Key Points:
- Gray failures are characterized by differential observability, where the system may appear healthy from one perspective but not from another.
- A single host can cause significant drops in service availability.
- Gray failures often occur along fault isolation boundaries such as regions, AZs, instances, and software modules.
- Detection of gray failures requires deeper health checks and observability tools beyond what AWS provides out of the box.
- Outlier detection and composite alarms are effective for identifying gray failures.
- Mitigation strategies include replacing instances and evacuating affected AZs.
- Data plane actions are preferred over control plane actions for AZ evacuation.
- AWS offers several services that support resilience, including AWS Resilience Hub, AWS Fault Injection Simulator, AWS Elastic Disaster Recovery, AWS Backup, and the zonal shift feature of Application Recovery Controller.
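
To make the single-host point concrete, here is a rough calculation under assumed conditions: a hypothetical fleet where requests are spread evenly across hosts and one host fails every request it receives while still passing its health check. The function name and fleet sizes are illustrative, not figures from the talk.

```python
def fleet_availability(total_hosts: int, failing_hosts: int,
                       failing_error_rate: float = 1.0) -> float:
    """Fraction of requests that succeed, assuming even load distribution."""
    if total_hosts <= 0 or not 0 <= failing_hosts <= total_hosts:
        raise ValueError("host counts must satisfy 0 <= failing_hosts <= total_hosts")
    # Share of all requests that land on a failing host and actually fail.
    failed_share = (failing_hosts / total_hosts) * failing_error_rate
    return 1.0 - failed_share


for hosts in (5, 10, 50):
    print(f"{hosts} hosts, 1 bad: {fleet_availability(hosts, 1):.1%} availability")
# 5 hosts, 1 bad: 80.0% availability
# 10 hosts, 1 bad: 90.0% availability
# 50 hosts, 1 bad: 98.0% availability
```

Even in the 50-host case, one quietly failing host keeps the fleet at 98%, well short of a typical multi-nines availability target.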

Insights

Gray Failures:
- They are subtle and often go undetected by standard health checks.
- They can have a disproportionate impact on user experience despite appearing minor or isolated.
- Differential observability is a key concept, highlighting the need for perspective-aware monitoring.
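
Differential observability can be illustrated with a minimal sketch that contrasts a shallow check with a deeper one. The probe callables and their names ("database", "cache") are hypothetical stand-ins; a real service would plug in its own dependency checks.

```python
from typing import Callable, Dict, List, Tuple


def shallow_health_check(process_is_running: Callable[[], bool]) -> bool:
    """Roughly what a basic health check sees: is the process answering at all?"""
    return process_is_running()


def deep_health_check(
    process_is_running: Callable[[], bool],
    dependency_probes: Dict[str, Callable[[], bool]],
) -> Tuple[bool, List[str]]:
    """Also exercises each dependency; returns (healthy, names of failed probes)."""
    failed: List[str] = []
    if not process_is_running():
        failed.append("process")
    for name, probe in dependency_probes.items():
        try:
            ok = probe()
        except Exception:
            # A probe that raises is treated the same as a probe that fails.
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# The process is up, but the (hypothetical) database probe fails.
probes = {"database": lambda: False, "cache": lambda: True}
print(shallow_health_check(lambda: True))       # True
print(deep_health_check(lambda: True, probes))  # (False, ['database'])
```

The same host looks healthy from the shallow perspective and unhealthy from the deep one, which is the gap a gray failure hides in.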
Detection:
- Requires explicit code for observability and instrumentation.
- CloudWatch's embedded metric format and Contributor Insights are valuable for monitoring and detecting anomalies.
- Outlier detection algorithms can automate the identification of gray failures, but they must be used with caution to avoid masking real issues or generating false positives.
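
As a sketch of outlier detection across hosts, the example below flags hosts whose error rate sits far from the fleet median, using a modified z-score based on the median absolute deviation (MAD). This is a simple stand-in technique chosen for illustration, not necessarily the algorithm presented in the talk. The thresholds and instance names are assumptions.

```python
from statistics import median
from typing import Dict, List


def find_outlier_hosts(
    error_rates: Dict[str, float],
    threshold: float = 3.5,        # assumed cutoff for the modified z-score
    min_error_rate: float = 0.01,  # assumed floor to ignore noise on healthy hosts
) -> List[str]:
    """Return hosts whose error rate is unusually high relative to the fleet."""
    values = list(error_rates.values())
    if len(values) < 3:
        return []  # too few hosts to say what "normal" looks like
    med = median(values)
    mad = median(abs(v - med) for v in values)
    outliers = []
    for host, rate in error_rates.items():
        if rate < min_error_rate:
            continue  # below the floor: never flag, reduces false positives
        if mad == 0:
            # The rest of the fleet is identical; flag anything above the median.
            is_outlier = rate > med
        else:
            is_outlier = 0.6745 * (rate - med) / mad > threshold
        if is_outlier:
            outliers.append(host)
    return sorted(outliers)


rates = {"i-a": 0.002, "i-b": 0.003, "i-c": 0.001, "i-d": 0.25, "i-e": 0.002}
print(find_outlier_hosts(rates))  # ['i-d']
```

Note the limitation that the caution above points at: if every host degrades together, nothing stands out from the median and this function returns an empty list, so it should complement, not replace, absolute availability alarms.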
Mitigation:
- The simplest mitigation strategy is to replace the instance experiencing a gray failure.
- For single-AZ failures, options include waiting it out, evacuating the AZ, or failing over to another Region.
- Evacuating an AZ requires an AZ-independent architecture. The evacuation can be driven by data plane actions (for example, ones built on API Gateway and DynamoDB) or by control plane actions on resources such as Auto Scaling groups.
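
One way to evacuate an AZ is a zonal shift, sketched below. The request fields match the `start_zonal_shift` call of the boto3 `arc-zonal-shift` client. The client is passed in rather than created here so the sketch runs without AWS credentials. The ARN, AZ ID, expiry, and comment are placeholder assumptions.

```python
from typing import Any, Dict


def build_zonal_shift_request(
    resource_arn: str,
    az_id: str,
    expires_in: str = "1h",
    comment: str = "Evacuating AZ due to suspected gray failure",
) -> Dict[str, str]:
    """Assemble keyword arguments for a StartZonalShift request."""
    return {
        "resourceIdentifier": resource_arn,  # e.g. a load balancer ARN
        "awayFrom": az_id,                   # AZ ID such as "use1-az1"
        "expiresIn": expires_in,             # shifts are temporary by design
        "comment": comment,
    }


def evacuate_az(client: Any, resource_arn: str, az_id: str) -> Any:
    """Start a zonal shift using an injected client.

    With boto3 installed and credentials configured, the client would come
    from boto3.client("arc-zonal-shift").
    """
    return client.start_zonal_shift(**build_zonal_shift_request(resource_arn, az_id))


# Placeholder ARN for illustration only.
print(build_zonal_shift_request(
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/abc",
    "use1-az1",
))
```

Because the shift expires on its own, traffic returns to the AZ unless the shift is extended, which limits the blast radius of a mistaken evacuation.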
AWS Services and Tools:
- AWS provides a suite of tools to help design, implement, and operate resilient systems.
- The Resilience Lifecycle Framework offers a structured approach to resilience.
- The zonal shift feature of Application Recovery Controller facilitates evacuating an AZ during a gray failure.
Best Practices:
- Enrich metrics with dimensions aligned to fault isolation boundaries for better visibility and troubleshooting.
- Use a combination of deep health checks, outlier detection, and composite alarms for comprehensive monitoring.
- Prefer data plane actions over control plane actions for recovery operations to increase reliability.
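
The first two practices can be sketched together: an embedded metric format (EMF) log record whose dimension sets follow fault isolation boundaries, and a composite alarm rule that fires only when a single AZ is impaired. The namespace, metric name, dimension names, and alarm names are hypothetical. The record layout follows the CloudWatch EMF structure, and the rule uses CloudWatch composite alarm rule syntax.

```python
import json
import time
from typing import Dict, List, Optional


def emf_record(namespace: str, metric_name: str, value: float, unit: str,
               dimensions: Dict[str, str],
               timestamp_ms: Optional[int] = None) -> str:
    """Build one EMF log line with cumulative dimension sets.

    Dimensions are expected from widest boundary to narrowest, so the same
    data point can be aggregated per Region, per AZ, and per instance.
    """
    keys = list(dimensions)
    dimension_sets = [keys[: i + 1] for i in range(len(keys))]
    record = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": dimension_sets,
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dimensions,  # dimension values live at the root of the record
    }
    return json.dumps(record)


def single_az_rule(impaired_alarm: str, other_az_alarms: List[str]) -> str:
    """Composite alarm rule: this AZ is in ALARM and no other AZ is."""
    parts = [f'ALARM("{impaired_alarm}")']
    parts += [f'NOT ALARM("{name}")' for name in other_az_alarms]
    return " AND ".join(parts)


print(emf_record("ExampleService", "SuccessLatency", 42.0, "Milliseconds",
                 {"Region": "us-east-1", "AZ-ID": "use1-az1", "InstanceId": "i-0123"}))
print(single_az_rule("az1-availability", ["az2-availability", "az3-availability"]))
```

Requiring that the other AZ alarms are not firing helps distinguish a single-AZ gray failure, where evacuation helps, from a Region-wide problem, where shifting traffic between AZs would not.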
