Title
AWS re:Invent 2022 - Bridgewater Associates: Building Inspectable Models at Scale (FSI311)
Summary
- Bridgewater Associates, a global macroeconomic investment firm, manages over $100 billion in assets and has been investing for over 45 years.
- They focus on understanding the global economy through an economic model that is continuously run for portfolio management and research.
- Inspectable models are crucial for Bridgewater to understand and evaluate their economic model, as mistakes can be costly.
- Bridgewater's compute architecture runs entirely on AWS and supports forward testing, backtesting, hypothetical scenario testing, and daily trade simulations.
- They store over 20 petabytes of data, manage over 120 billion time series, and run about 10,000 mini-model executions daily.
- Inspectability was initially implemented by persisting intermediate values and traces at model runtime, which caused performance problems and sharply rising costs as the models grew more complex.
- Bridgewater and AWS collaborated to re-architect the system using static analysis and Presto on Amazon EMR, which decoupled the provenance system from the model runtime.
- The new architecture cut average job runtime in half, raised machine utilization above 75%, and supported a 5x increase in peak jobs, enabling researchers to ask more complex questions efficiently.
- The cost of managing the data lake decreased by 35% despite a 42% growth in data, and compute costs remained the same while running 20% more jobs year over year.
- The project followed a structured process that included roadmap planning, architecture reviews, executive alignment, AWS Data Lab sessions, and optimization exercises.
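The storage and compute figures in the summary above imply a larger drop in unit cost than the headline percentages suggest. The following is a small worked calculation, assuming the percentages are simple year-over-year ratios against a baseline normalised to 1.0; the function name is illustrative and no absolute dollar amounts are given in the summary.

```python
def relative_unit_cost(cost_change: float, volume_change: float) -> float:
    """Return the new cost per unit as a fraction of the old cost per unit.

    Both arguments are fractional changes, e.g. -0.35 for a 35% decrease
    and 0.42 for 42% growth.
    """
    return (1.0 + cost_change) / (1.0 + volume_change)


# Data lake: cost down 35% while data volume grew 42%.
storage = relative_unit_cost(cost_change=-0.35, volume_change=0.42)

# Compute: cost flat while job count grew 20%.
compute = relative_unit_cost(cost_change=0.0, volume_change=0.20)

print(f"storage cost per unit of data: {storage:.3f} of baseline")
print(f"compute cost per job:          {compute:.3f} of baseline")
```

Under these assumptions, storage cost per unit of data comes out at roughly 0.458 of the baseline (about 54% lower), and compute cost per job at roughly 0.833 of the baseline (about 17% lower).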
Insights
- Inspectable models are essential for Bridgewater to maintain transparency and trust in their economic model, which is critical for their investment decisions.
- The initial approach to creating inspectable models by persisting runtime traces was not scalable and led to performance degradation and increased costs.
- By leveraging AWS technologies and re-architecting the system, Bridgewater was able to decouple the provenance system from the model runtime, leading to significant performance and cost improvements.
- The use of static analysis and Presto on Amazon EMR allowed Bridgewater to compute traces on demand, eliminating the need to store them and reducing the burden on the model runtime.
- The collaboration with AWS and the structured process followed during the project played a crucial role in the successful re-architecture and optimization of the system.
- The new architecture not only improved performance and reduced costs but also increased the resilience of the system by enabling multi-AZ and multi-region designs and the ability to scale infrastructure on demand.
- Bridgewater's experience highlights the importance of continuous innovation and optimization in technology projects, especially when dealing with large-scale, complex systems.
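The idea of computing traces on demand instead of persisting them at runtime can be illustrated with a toy example. This is a minimal sketch only: the summary does not describe how Bridgewater's static analysis works or how Presto is queried, so the dependency table, series names, and functions below are all hypothetical. It assumes the dependencies of each output can be known without executing the model, so a trace is answered by walking that graph when someone asks.

```python
from typing import Dict, List, Set

# Hypothetical static description of a model: each derived series maps to the
# series its formula reads. A real system would extract this by analysing the
# model code; here it is written by hand for illustration.
DEPENDENCIES: Dict[str, List[str]] = {
    "growth_estimate": ["gdp", "employment"],
    "inflation_estimate": ["cpi", "wages"],
    "rate_view": ["growth_estimate", "inflation_estimate"],
}


def trace(series: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every upstream series that `series` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(deps.get(series, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Follow the node's own dependencies; raw inputs have none.
        stack.extend(deps.get(node, []))
    return seen


def raw_inputs(series: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return only the upstream series that are not themselves derived."""
    return {s for s in trace(series, deps) if s not in deps}


print(sorted(trace("rate_view", DEPENDENCIES)))
print(sorted(raw_inputs("rate_view", DEPENDENCIES)))
```

Nothing is written during a model run in this sketch; the trace is computed only when requested, which is the decoupling the insights above describe. In a production setting the graph walk would presumably be expressed as queries over stored data, which this example does not attempt to model.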