Title
AWS re:Invent 2023 - Capital One: Achieving resiliency to run mission-critical applications (FSI314)
Summary
- Capital One has transitioned to a cloud-native approach, eliminating data centers and adopting public cloud services.
- The company emphasizes a resiliency-centric environment, focusing on customer and associate experiences.
- Resiliency is integrated from the business team level, through product development, to architecture and SRE (Site Reliability Engineering).
- Capital One combines architecture and SRE teams to ensure resilient design from the start.
- The company handles over 7 billion card transactions daily, necessitating a robust resiliency posture.
- Resiliency is not just about achieving high availability (e.g., four nines or five nines) but also about designing for failure, including degraded services and handling stale data.
- The CAP theorem is a constant consideration: when a network partition occurs, a distributed system has to trade consistency against availability.
- Capital One employs various strategies for resiliency, such as caching, monitoring, observability, circuit breakers, and fallbacks.
- The company has adopted a microservices architecture and is exploring cell-based architectures for better end-to-end transaction handling.
- The reliability of the control plane is crucial, as demonstrated by an incident in December 2021 when Amazon Route 53 experienced issues.
- Capital One focuses on continuous improvement, operational excellence, and a culture of engineering excellence.
- The company practices game days, fire drills, chaos engineering, and maintains playbooks for incident response.
- Capital One's approach to resiliency is driven by a technology company mindset, from the CEO to the developers.
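The availability targets mentioned above (four nines, five nines) translate into a concrete downtime budget. The following is a minimal sketch of that arithmetic, not something taken from the talk; the function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime (in minutes) allowed by an availability target.

    Hypothetical helper for illustration; assumes a 365-day year by default.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # Fraction of the period during which the service may be unavailable.
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    for label, target in [("four nines", 99.99), ("five nines", 99.999)]:
        minutes = downtime_budget_minutes(target)
        # Four nines allows roughly 52.56 minutes per year, five nines roughly 5.26.
        print(f"{label} ({target}%): about {minutes:.2f} minutes per year")
```

A budget of only a few minutes per year leaves little room for manual recovery, which is one reason the summary stresses designing for failure (degraded service, stale data) rather than chasing the availability number alone.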
Insights
- Capital One's cloud-native journey and resiliency focus highlight the importance of cloud services for modern financial institutions.
- The integration of resiliency into every level of the organization, from business teams to SRE, indicates a holistic approach to system reliability.
- The combination of architecture and SRE under one leadership suggests a trend towards more collaborative and cross-functional teams in IT.
- The discussion of the CAP theorem and its implications on system design reflects the trade-offs required in distributed systems.
- Capital One's use of caching, global tables, and other resiliency strategies demonstrates the need for innovative solutions to maintain uptime and customer satisfaction.
- The incident involving Amazon Route 53 underscores the importance of a resilient control plane and the potential impact of cloud service outages on customers.
- The emphasis on continuous improvement and operational excellence indicates a shift towards a proactive and learning-oriented culture in IT operations.
- The adoption of practices like game days, fire drills, and chaos engineering shows a commitment to preparedness and the ability to handle unexpected failures.
- Capital One's approach to scaling SRE practices by involving domain experts and fostering a community of practice could serve as a model for other large organizations.
- The presentation reinforces the idea that financial institutions must think and operate like technology companies to stay competitive and meet customer expectations in the digital age.
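The circuit breaker, fallback, and stale-data ideas mentioned above can be combined in one small pattern. The sketch below is illustrative only and does not represent Capital One's implementation; the class name `CircuitBreaker`, its parameters, and the in-memory last-known-good cache are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CircuitBreaker:
    """Hypothetical circuit breaker that falls back to the last good (stale) value."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._last_good: Dict[str, object] = {}

    def is_open(self) -> bool:
        """True while calls to the dependency should be skipped."""
        if self._opened_at is None:
            return False
        # After the timeout the breaker is half-open: one trial call is allowed.
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, key: str, fetch: Callable[[], object]) -> Tuple[object, bool]:
        """Return (value, is_stale); raise if there is no fresh or cached value."""
        if not self.is_open():
            try:
                value = fetch()
            except Exception:
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    self._opened_at = self._clock()  # open (or re-open) the breaker
            else:
                self._failures = 0
                self._opened_at = None
                self._last_good[key] = value
                return value, False
        # Degraded path: serve stale data instead of failing the request.
        if key in self._last_good:
            return self._last_good[key], True
        raise RuntimeError("no fresh or cached value available")
```

A caller would wrap each dependency lookup, for example `value, is_stale = breaker.call("balance:123", fetch_balance)`, where `fetch_balance` is a hypothetical function, and use `is_stale` to decide whether to tell the user the data may be out of date. Serving stale data favors availability over consistency, which is the CAP trade-off the talk refers to.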