Building Confidence through Chaos Engineering on AWS (ARC307)

Title

AWS re:Invent 2022 - Building confidence through chaos engineering on AWS (ARC307)

Summary

  • Chaos Engineering is a discipline that helps uncover unknown system deficiencies by intentionally injecting faults in a controlled manner, aiming to improve system resilience and operational readiness.
  • Continuous Resilience is the ongoing process of building and maintaining resilient systems, which is essential for ensuring that applications can handle real-world events and failures.
  • Shared Responsibility Model for resilience: AWS is responsible for the resilience of the cloud (the underlying infrastructure), while customers are responsible for the resilience of the workloads they run in the cloud.
  • Observability is a key component of chaos engineering, involving metrics, logging, and tracing to understand system behavior and identify issues.
  • Organizational Awareness is necessary for chaos engineering, with a need for executive sponsorship, understanding of real-world events, and commitment to remediate any identified deficiencies.
  • Chaos Engineering Program combines chaos engineering with continuous resilience, enabling organizations to scale their practices and build robust workloads.
  • AWS Fault Injection Simulator (FIS) is a managed AWS service for running controlled fault injection experiments, with safeguards such as stop conditions that halt an experiment when a defined alarm fires.
  • Game Days are planned events where teams run chaos experiments to test system resilience and learn from the outcomes.
  • Automation of chaos experiments is encouraged to ensure resilience is tested regularly and not just during code deployment.
  • Resources such as workshops, templates, and white papers are available to help organizations get started with chaos engineering.
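
To make the controlled fault injection point above concrete, here is a minimal sketch of what an experiment definition for AWS Fault Injection Simulator (FIS) might look like, written as a Python dictionary in the shape of a CreateExperimentTemplate request. It is illustrative only and was not shown in the session summary: the account ID, role ARN, alarm ARN, and the ChaosReady tag are placeholder assumptions, and the function name is hypothetical. The sketch only builds and prints the definition; it does not call AWS.

```python
import json


def build_stop_instance_template(role_arn, alarm_arn, tag_key, tag_value):
    """Build a dict shaped like an FIS CreateExperimentTemplate request.

    Hypothetical helper for illustration; it makes no AWS calls.
    """
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,
        # Guardrail: the experiment is halted if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Limit the blast radius to a single instance carrying the tag.
        "targets": {
            "oneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: start the instance again after 5 minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneTaggedInstance"},
            }
        },
    }


template = build_stop_instance_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",  # placeholder
    alarm_arn=(
        "arn:aws:cloudwatch:us-east-1:123456789012:"
        "alarm:example-steady-state"  # placeholder
    ),
    tag_key="ChaosReady",  # hypothetical tag
    tag_value="true",
)
print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as keyword arguments to the FIS client's create_experiment_template call. Treat the field values as a starting point to check against the current FIS documentation rather than a production-ready template.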

Insights

  • Chaos Engineering is Not Random: It is a misconception that chaos engineering is about randomly breaking things in production. Instead, it is a thoughtful process of hypothesis-driven experiments to improve system resilience.
  • Importance of Observability: Without proper observability, it is impossible to understand the impact of experiments and to ensure that systems are behaving as expected.
  • Cultural Shift: Adopting chaos engineering requires a cultural shift within an organization, where resilience becomes a shared responsibility and part of the software development lifecycle.
  • Real-World Examples: Companies like Capital One and Intuit have successfully implemented chaos engineering, demonstrating its value in highly regulated industries like financial services.
  • Gartner's Prediction: According to Gartner, 40% of companies will adopt chaos engineering by 2023 (the year following the talk), with the expectation of increasing customer satisfaction by 20%.
  • Chaos Engineering as a Learning Tool: The process of running chaos experiments and game days is as much about learning and improving team communication and processes as it is about testing system resilience.
  • Integration with Existing Tools: AWS Fault Injection Simulator's integration with tools like LitmusChaos and Chaos Mesh expands the scope of experiments that can be conducted, particularly in Kubernetes environments.
  • Sharing Learnings: Documenting and sharing the outcomes of chaos experiments across the organization is crucial for collective learning and avoiding repeated mistakes.
  • Starting Small: When beginning with chaos engineering, it is recommended to start with less critical workloads and gradually build up to more significant experiments, including those in production environments.
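
The hypothesis-driven approach described in the insights above can be sketched as a small experiment loop: confirm steady state, inject a fault, observe a metric, abort if a guardrail is crossed, and always remove the fault. This is a minimal sketch under stated assumptions, not the method presented in the session. All names (run_experiment, FakeService, the thresholds) are hypothetical, and the simulated service stands in for a real workload and its observability tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis_held: bool
    aborted: bool
    observations: List[float]


def run_experiment(
    read_metric: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    steady_state_min: float,
    abort_below: float,
    samples: int = 5,
) -> ExperimentResult:
    """Run one experiment against the hypothesis
    'the metric stays at or above steady_state_min during the fault'."""
    # 1. Verify steady state first; never inject into an unhealthy system.
    if read_metric() < steady_state_min:
        return ExperimentResult(False, True, [])

    observations: List[float] = []
    aborted = False
    # 2. Inject the fault, then observe while it is active.
    inject_fault()
    try:
        for _ in range(samples):
            value = read_metric()
            observations.append(value)
            # 3. Stop condition: abort early if impact exceeds the guardrail.
            if value < abort_below:
                aborted = True
                break
    finally:
        # 4. Always roll back, even on abort or error.
        remove_fault()

    held = not aborted and all(v >= steady_state_min for v in observations)
    return ExperimentResult(held, aborted, observations)


class FakeService:
    """Simulated workload for illustration; replace with real metrics."""

    def __init__(self) -> None:
        self.fault_active = False

    def success_rate(self) -> float:
        return 97.0 if self.fault_active else 99.5

    def inject(self) -> None:
        self.fault_active = True

    def recover(self) -> None:
        self.fault_active = False


service = FakeService()
result = run_experiment(
    service.success_rate,
    service.inject,
    service.recover,
    steady_state_min=99.0,
    abort_below=90.0,
)
print(result)
```

In this simulated run the success rate drops below the steady-state threshold while the fault is active, so the hypothesis is rejected without tripping the abort guardrail. A rejected hypothesis is a useful outcome: it is the kind of finding a team would document, share, and remediate before rerunning the experiment.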