Title
AWS re:Invent 2022 - Benefiting from Chaos Engineering at Capital One (PRT026)
Summary
- Speaker: Kyle Smith, a member of the engineering community at Capital One.
- Topic: Chaos Engineering and its implementation at Capital One to enhance system resiliency.
- Key Points:
  - Chaos Engineering involves intentionally injecting failures to test system resilience.
  - Capital One uses it to uncover hidden vulnerabilities, validate resiliency requirements, test failover mechanisms, and ensure proper alert configurations.
  - Four types of chaos events at Capital One:
    - Regional testing events (multiple times a year, mandatory participation).
    - Game day events (monthly, smaller groups, specific resiliency objectives).
    - Self-service chaos tooling (for teams to validate and calibrate on their own).
    - No-notice events (unannounced tests by the PRE team to check compliance with resiliency requirements).
  - Key components of Capital One's chaos practice:
    - Chaos Experiment Framework
    - Experiment Designer
    - Fault Injection Engine
    - Testing Results Assessment
    - Chaos Playbook
  - Emphasis on an objective-driven approach rather than a fault-driven one.
  - Types of failures tested include resource failures, network outages/degradation, configuration errors, AWS quota ceilings, and API throttling.
- Session takeaways:
  - Minimize the need for developers to be chaos experts.
  - Simulate real-world failure scenarios.
  - Focus on the resiliency objective, not just the failure.
  - Provide actionable results.
  - Integrate chaos testing early in application development.
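The objective-driven approach described above can be illustrated with a minimal Python sketch. It is not Capital One's Experiment Designer: every name here (`ResiliencyObjective`, `FAULT_CATALOG`, `design_experiment`, `assess`) and every threshold is a hypothetical chosen for illustration. The idea it shows is that a team states the resiliency objective first, the tooling derives the faults from it, and the outcome is reported as an actionable result rather than a raw measurement.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ResiliencyObjective:
    """What the team wants to prove, stated before any fault is chosen."""
    key: str            # hypothetical objective identifier
    metric: str         # metric observed while the fault is active
    max_allowed: float  # highest acceptable value for that metric


@dataclass
class Experiment:
    objective: ResiliencyObjective
    faults: List[str]


# Hypothetical catalog; fault names mirror the failure types listed above.
FAULT_CATALOG: Dict[str, List[str]] = {
    "regional_failover": ["network_outage", "resource_failure"],
    "graceful_throttling": ["api_throttling", "quota_ceiling"],
}


def design_experiment(objective: ResiliencyObjective) -> Experiment:
    """Derive the faults from the objective so developers do not pick them by hand."""
    faults = FAULT_CATALOG.get(objective.key)
    if faults is None:
        raise ValueError(f"no faults registered for objective {objective.key!r}")
    return Experiment(objective=objective, faults=list(faults))


def assess(experiment: Experiment, observed: float) -> Dict[str, object]:
    """Turn a raw measurement into a pass/fail result with a next step."""
    obj = experiment.objective
    passed = observed <= obj.max_allowed
    next_step = "none" if passed else (
        f"investigate {obj.metric} under {', '.join(experiment.faults)}"
    )
    return {
        "objective": obj.key,
        "passed": passed,
        "detail": f"{obj.metric}={observed} (limit {obj.max_allowed})",
        "next_step": next_step,
    }


# Example: the objective is a failover within 60 seconds; the run measured 95.
objective = ResiliencyObjective("regional_failover", "failover_seconds", 60.0)
experiment = design_experiment(objective)
result = assess(experiment, observed=95.0)
print(result)
```

Keeping the fault selection inside a catalog is one way to reduce how much chaos expertise a developer needs: the developer supplies only the objective and the acceptable limit.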
Insights
- Chaos Engineering as a Proactive Measure: Capital One's approach to chaos engineering is proactive, aiming to identify and mitigate potential issues before they impact customers. This aligns with the broader industry trend of shifting left with testing, including resiliency testing, to catch issues earlier in the development lifecycle.
- Balancing Risk and Learning: The use of no-notice events indicates a willingness to balance the risk of potential disruption against the learning opportunities such events provide. This suggests a mature understanding of the trade-offs involved in chaos engineering.
- Tooling and Frameworks: Capital One's use of a Chaos Experiment Framework and a Fault Injection Engine, such as AWS Fault Injection Simulator (AWS FIS), highlights the importance of having robust tooling to safely conduct chaos experiments in production environments.
- Developer Experience and Adoption: By minimizing the need for developers to be chaos experts and providing clear objectives and actionable results, Capital One is likely to see higher adoption rates of chaos engineering practices. This focus on developer experience is crucial for the success of any engineering practice.
- Objective-Driven Approach: The emphasis on defining clear resiliency objectives and desired outcomes before selecting the type of fault to inject is a strategic approach that ensures experiments are aligned with business goals and operational requirements.
- Real-World Relevance: Capital One's strategy of using real incident data to inform its chaos experiments ensures that the scenarios tested are relevant and reflective of actual risks, increasing the value of the testing.
- Integration into Development Lifecycle: The recommendation to integrate chaos testing early in the application development process is a best practice that can lead to more resilient systems and aligns with the principles of DevOps and continuous delivery.
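As one way to picture chaos testing early in development, the sketch below injects a simulated API throttling fault into an ordinary function call and checks that retry logic recovers. It is an assumption-laden illustration, not Capital One's Fault Injection Engine and not the AWS FIS API: `ThrottlingError`, `throttle_first_n`, `call_with_retries`, and `get_balance` are all hypothetical names, and the fault is deterministic (the first `n` calls fail) so the check can run in a unit test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottlingError(Exception):
    """Stands in for a provider's 'rate exceeded' response."""


def throttle_first_n(func: Callable[[], T], n: int) -> Callable[[], T]:
    """Wrap func so that its first n calls fail with a simulated throttle."""
    calls = {"count": 0}

    def wrapper() -> T:
        calls["count"] += 1
        if calls["count"] <= n:
            raise ThrottlingError(f"simulated throttle on call {calls['count']}")
        return func()

    return wrapper


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry on throttling with exponential backoff; re-raise when attempts run out."""
    for attempt in range(attempts):
        try:
            return func()
        except ThrottlingError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def get_balance() -> int:
    """Hypothetical downstream call used only for this example."""
    return 42


def no_sleep(seconds: float) -> None:
    """Skip real waiting so the example runs instantly."""


# Objective: the caller survives two consecutive throttled responses.
recovered = call_with_retries(throttle_first_n(get_balance, 2), attempts=3, sleep=no_sleep)

# Five throttled responses exceed the retry budget, so the error surfaces.
try:
    call_with_retries(throttle_first_n(get_balance, 5), attempts=3, sleep=no_sleep)
    exhausted = False
except ThrottlingError:
    exhausted = True

print(recovered, exhausted)
```

Because the fault is injected by wrapping the call rather than by changing infrastructure, a check like this can sit alongside regular unit tests long before an application reaches a shared environment.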