Title

AWS re:Invent 2023 - Goldman Sachs: The journey to zero downtime (FSI310)

Summary

  • Introduction: Manjula Nagineni, a senior solutions architect at AWS, introduces the session on Goldman Sachs' journey to zero downtime, emphasizing the complexity of achieving this in distributed systems.
  • Problem Statement: Zero downtime is critical in the banking industry, where clients expect services to be available around the clock. The main drivers are resiliency, client trust, and meeting SLAs.
  • Goldman Sachs Transaction Banking (TXB): Rob Carson explains that TXB is built on AWS, with hundreds of microservices and a zero recovery point objective (RPO). The team aims for a minimal recovery time objective (RTO) and ships thousands of deployments annually.
  • High Availability Architecture: Rob discusses the architecture, including the use of Amazon ECS, Terraform, microaccounts, VPC endpoint services, and DevOps practices.
  • Deployment Strategies: Akarshit Bachu covers zero downtime deployment strategies, focusing on stateless services running on Amazon ECS with AWS Fargate and on blue-green deployment with AWS CodeDeploy and Amazon CloudWatch Synthetics.
  • Stateful Resources: Strategies for Amazon RDS, Amazon MSK, and DynamoDB are discussed, including a homegrown solution that achieves high availability for Amazon Aurora.
  • Release Procedures: The use of infrastructure as code, security as code, and Amazon CloudWatch Synthetics for validating releases is explained.
  • Game Days: Game days are a ritual in TXB in which the team performs multi-region deployments and failovers to test resiliency and update runbooks.
  • Key Wins and Lessons Learned: Rob shares the benefits of their approach, including shorter production release validation time, fewer release hours for development teams, and less Aurora downtime per deployment. He also shares lessons learned about deep health checks, NLB vs. ALB flips, and event-driven applications.
  • Looking Forward: Manjula outlines future enhancements, including blue-green deployments for Aurora PostgreSQL, chaos testing with AWS Fault Injection Simulator, and zero-touch release processes.
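
The blue-green flow described under Deployment Strategies can be illustrated with a small, self-contained sketch. It is not Goldman Sachs' implementation and makes no AWS calls; the `Environment`, `Router`, and `blue_green_release` names are hypothetical, and the check functions merely stand in for synthetic canaries. The idea shown is that traffic moves to the new (green) environment only after every validation check passes, and otherwise stays on the current (blue) one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Environment:
    """A deployed copy of a stateless service (hypothetical model)."""
    name: str
    version: str
    healthy: bool = True


@dataclass
class Router:
    """Stands in for the load balancer listener that points at the live environment."""
    live: Environment


# A check takes an environment and reports whether it passed.
Check = Callable[[Environment], bool]


def run_checks(env: Environment, checks: List[Check]) -> bool:
    """Run every validation check against the candidate environment."""
    return all(check(env) for check in checks)


def blue_green_release(router: Router, green: Environment, checks: List[Check]) -> bool:
    """Flip traffic to green only if all checks pass; otherwise leave blue live."""
    if not run_checks(green, checks):
        return False          # blue keeps serving traffic, so no downtime
    router.live = green       # single switch of the live target
    return True


# Hypothetical checks standing in for synthetic canaries.
def is_healthy(env: Environment) -> bool:
    return env.healthy


def has_version(env: Environment) -> bool:
    return bool(env.version)


blue = Environment("blue", "1.0.0")
router = Router(live=blue)

bad_green = Environment("green", "1.1.0", healthy=False)
released_bad = blue_green_release(router, bad_green, [is_healthy, has_version])
live_after_bad = router.live.name

good_green = Environment("green", "1.1.1")
released_good = blue_green_release(router, good_green, [is_healthy, has_version])
live_after_good = router.live.name
```

In this sketch a failed validation is a no-op for clients because the router never leaves blue, which is the property that makes blue-green releases attractive for zero downtime.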

Insights

  • Resiliency is Key: The emphasis on resiliency, together with the quote from Amazon CTO Werner Vogels, "everything fails all the time," highlights the importance of designing systems that can withstand failures.
  • Client Trust and SLAs: The banking industry's reliance on 24/7 service availability underscores the need for architectures that can maintain client trust and meet stringent SLAs.
  • Microservices and Microaccounts: Goldman Sachs runs hundreds of microservices and uses microaccounts to isolate them so that failures do not cascade, which demonstrates a commitment to modular and resilient design.
  • Blue-Green Deployments: The detailed explanation of blue-green deployments for both stateless and stateful services shows a sophisticated approach to minimizing downtime during updates.
  • Automation and DevOps: The focus on automating technical and functional processes and embracing a DevOps culture indicates a modern approach to software development and operations.
  • Game Days: The practice of conducting game days to test the system's resiliency and update runbooks is a proactive approach to ensuring system reliability.
  • Continuous Improvement: The session concludes with a look at future enhancements, suggesting that the journey to zero downtime is ongoing and requires continuous innovation and adaptation.
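
One way to make a game day measurable is to time each failover step and compare the total against the RTO target. The sketch below is purely illustrative: the step names, durations, and thresholds are invented for the example and do not come from the session, and `evaluate_failover_drill` is a hypothetical helper, not a tool the speakers described.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DrillResult:
    recovery_seconds: float   # total time from failure to restored service
    met_rto: bool             # True if the total is within the RTO target
    slow_steps: List[str]     # steps worth revisiting in the runbook


def evaluate_failover_drill(
    steps: List[Tuple[str, float]],
    rto_seconds: float,
    step_budget_seconds: float,
) -> DrillResult:
    """Summarize a failover drill from (step name, duration in seconds) pairs."""
    total = sum(duration for _, duration in steps)
    slow = [name for name, duration in steps if duration > step_budget_seconds]
    return DrillResult(
        recovery_seconds=total,
        met_rto=total <= rto_seconds,
        slow_steps=slow,
    )


# Hypothetical timings recorded during a drill.
drill_steps = [
    ("detect regional failure", 20.0),
    ("promote secondary database", 45.0),
    ("shift traffic to secondary region", 15.0),
]

result = evaluate_failover_drill(drill_steps, rto_seconds=120.0, step_budget_seconds=30.0)
```

Steps flagged in `slow_steps` are natural candidates for runbook updates or further automation before the next game day.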