Title

AWS re:Invent 2023 - Capital One: Achieving resiliency to run mission-critical applications (FSI314)

Summary

Capital One has transitioned to a cloud-native approach, eliminating data centers and adopting public cloud services.
The company emphasizes a resiliency-centric environment, focusing on customer and associate experiences.
Resiliency is integrated from the business team level, through product development, to architecture and SRE (Site Reliability Engineering).
Capital One combines architecture and SRE teams to ensure resilient design from the start.
The company handles over 7 billion card transactions daily, necessitating a robust resiliency posture.
Resiliency is not just about achieving high availability (e.g., four nines, 99.99%, or five nines, 99.999%) but also about designing for failure, including running degraded services and tolerating stale data.
The CAP theorem is a constant consideration: when a network partition occurs, a distributed system has to trade consistency against availability.
Capital One employs various strategies for resiliency, such as caching, monitoring, observability, circuit breakers, and fallbacks.
The company has adopted a microservices architecture and is exploring cell-based architectures for better end-to-end transaction handling.
The control plane's reliability is crucial, as demonstrated by an incident in December 2021 when AWS Route 53 experienced issues.
Capital One focuses on continuous improvement, operational excellence, and a culture of engineering excellence.
The company practices game days, fire drills, chaos engineering, and maintains playbooks for incident response.
Capital One's approach to resiliency is driven by a technology-company mindset that extends from the CEO to the developers.
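
To make the "four nines" and "five nines" targets mentioned above concrete, the following is a minimal sketch that converts an availability target into a yearly downtime budget. It also shows how availability compounds across a chain of services, under the simplifying assumption that the services fail independently. The function names are illustrative and do not come from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(per_service: float, services: int) -> float:
    """Availability of a request that must pass through `services` dependencies
    in series. Assumption: the dependencies fail independently of each other."""
    return per_service ** services


if __name__ == "__main__":
    for label, target in (("four nines", 0.9999), ("five nines", 0.99999)):
        budget = allowed_downtime_minutes(target)
        print(f"{label}: {budget:.2f} minutes of downtime per year")

    # Ten four-nines services chained in series fall short of four nines.
    chained = serial_availability(0.9999, 10)
    print(f"10 chained services: {chained:.5f}")
```

Four nines leaves about 52.6 minutes of downtime per year and five nines about 5.3 minutes. A request that crosses ten four-nines services in series lands at roughly 99.9%, which is one reason designing for degraded operation matters alongside the headline availability number.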

Insights

Capital One's cloud-native journey and resiliency focus highlight the importance of cloud services for modern financial institutions.
The integration of resiliency into every level of the organization, from business teams to SRE, indicates a holistic approach to system reliability.
The combination of architecture and SRE under one leadership suggests a trend towards more collaborative and cross-functional teams in IT.
The discussion of the CAP theorem and its implications on system design reflects the trade-offs required in distributed systems.
Capital One's use of caching, global tables, and other resiliency strategies demonstrates the need for innovative solutions to maintain uptime and customer satisfaction.
The incident involving AWS Route 53 underscores the importance of having a resilient control plane and the potential impact of cloud service outages on customers.
The emphasis on continuous improvement and operational excellence indicates a shift towards a proactive and learning-oriented culture in IT operations.
The adoption of practices like game days, fire drills, and chaos engineering shows a commitment to preparedness and the ability to handle unexpected failures.
Capital One's approach to scaling SRE practices by involving domain experts and fostering a community of practice could serve as a model for other large organizations.
The presentation reinforces the idea that financial institutions must think and operate like technology companies to stay competitive and meet customer expectations in the digital age.
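
The summary lists circuit breakers, fallbacks, caching, and tolerance for stale data among the resiliency strategies. The sketch below shows one common way these pieces fit together. It is a generic illustration, not Capital One's implementation: the class, the function, the thresholds, and the "fresh"/"stale" labels are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial call through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def get_with_fallback(key: str, fetch: Callable[[str], Any],
                      breaker: CircuitBreaker,
                      cache: Dict[str, Any]) -> Tuple[Any, str]:
    """Try the dependency; on failure or an open circuit, serve the cached value."""
    if breaker.allow():
        try:
            value = fetch(key)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            cache[key] = value  # refresh the cache on every successful call
            return value, "fresh"
    if key in cache:
        return cache[key], "stale"  # degraded but still available
    raise RuntimeError(f"no fresh or cached value for {key!r}")
```

Returning the "stale" label alongside the value lets a caller decide whether slightly old data is acceptable for a given request, which is the availability-over-consistency side of the CAP trade-off described in the summary.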
