Title

AWS re:Invent 2023 - Capital One: Achieving resiliency to run mission-critical applications (FSI314)

Summary

  • Capital One has transitioned to a cloud-native approach, eliminating data centers and adopting public cloud services.
  • The company emphasizes a resiliency-centric environment, focusing on customer and associate experiences.
  • Resiliency is integrated from the business team level, through product development, to architecture and SRE (Site Reliability Engineering).
  • Capital One combines architecture and SRE teams to ensure resilient design from the start.
  • The company handles over 7 billion card transactions daily, necessitating a robust resiliency posture.
  • Resiliency is not just about achieving high availability (e.g., four nines or five nines) but also about designing for failure, including degraded services and handling stale data.
  • The CAP theorem is a constant consideration: when a network partition occurs, a distributed system has to trade consistency against availability, so each design must decide which one to favor.
  • Capital One employs various strategies for resiliency, such as caching, monitoring, observability, circuit breakers, and fallbacks.
  • The company has adopted a microservices architecture and is exploring cell-based architectures for better end-to-end transaction handling.
  • The reliability of the control plane is crucial, as demonstrated by an incident in December 2021 when Amazon Route 53 experienced issues.
  • Capital One focuses on continuous improvement, operational excellence, and a culture of engineering excellence.
  • The company practices game days, fire drills, chaos engineering, and maintains playbooks for incident response.
  • Capital One's approach to resiliency is driven by a technology company mindset, from the CEO to the developers.
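
To make the "four nines or five nines" targets in the summary concrete, the arithmetic below converts an availability target into a yearly downtime budget. It is a minimal sketch: the function name is illustrative and not from the talk, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for N nines of availability (4 -> 99.99%)."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (4, 5):
        print(f"{n} nines: about {allowed_downtime_minutes(n):.2f} minutes per year")
```

Four nines leaves roughly 52.56 minutes of downtime per year and five nines roughly 5.26 minutes, which is why the summary stresses designing for degraded operation rather than relying on an uptime target alone.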

Insights

  • Capital One's cloud-native journey and resiliency focus highlight the importance of cloud services for modern financial institutions.
  • The integration of resiliency into every level of the organization, from business teams to SRE, indicates a holistic approach to system reliability.
  • The combination of architecture and SRE under one leadership suggests a trend towards more collaborative and cross-functional teams in IT.
  • The discussion of the CAP theorem and its implications on system design reflects the trade-offs required in distributed systems.
  • Capital One's use of caching, global tables, and other resiliency strategies demonstrates the need for innovative solutions to maintain uptime and customer satisfaction.
  • The incident involving Amazon Route 53 underscores the importance of having a resilient control plane and the potential impact of cloud service outages on customers.
  • The emphasis on continuous improvement and operational excellence indicates a shift towards a proactive and learning-oriented culture in IT operations.
  • The adoption of practices like game days, fire drills, and chaos engineering shows a commitment to preparedness and the ability to handle unexpected failures.
  • Capital One's approach to scaling SRE practices by involving domain experts and fostering a community of practice could serve as a model for other large organizations.
  • The presentation reinforces the idea that financial institutions must think and operate like technology companies to stay competitive and meet customer expectations in the digital age.
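
The caching, circuit breaker, and fallback strategies listed in the summary can be sketched together as follows. This is an illustrative sketch only, not Capital One's implementation: the class name, thresholds, and the idea of serving the last good value as stale data are assumptions chosen to show the pattern.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: after repeated failures, skip the dependency
    for a cool-down period and serve a fallback built from stale data."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.last_good: Any = None  # cached copy of the last successful result

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a trial call (half-open).
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary: Callable[[], Any],
             fallback: Callable[[Any], Any]) -> Any:
        if self.is_open():
            return fallback(self.last_good)  # degraded: do not touch the dependency
        try:
            value = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback(self.last_good)
        # Success: close the breaker and refresh the cached value.
        self.failures = 0
        self.opened_at = None
        self.last_good = value
        return value


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after_s=5.0)

    def healthy() -> str:
        return "balance: 120.00"

    def failing() -> str:
        raise TimeoutError("dependency unavailable")

    def serve_stale(stale: Any) -> str:
        return f"{stale} (may be out of date)"

    print(breaker.call(healthy, serve_stale))
    print(breaker.call(failing, serve_stale))
    print(breaker.call(failing, serve_stale))
    print("breaker open:", breaker.is_open())
```

Serving the last good value keeps the customer-facing path available at the cost of possibly stale data, which is the availability-over-consistency side of the CAP trade-off the talk refers to.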