Title
AWS re:Invent 2023 - Building a machine learning team and platform at Cash App (AIM228)
Summary
- Speakers: Jason Hand (Senior Developer Advocate at Datadog) and James (Machine Learning team at Cash App).
- Topic: How Cash App built its machine learning (ML) platform, from fixing model hosting problems to creating a scalable, platform-wide solution.
- Cash App: A financial app for sending, spending, and investing money, including stocks and Bitcoin.
- ML Platform Origin: Began with Cash App's support ML team, which focused on customer support and was struggling with inadequate model hosting infrastructure.
- Key Requirements: Stability, observability, simplicity, and co-location with other services.
- SageMaker: Chosen after stress testing showed it solved the problems encountered and fit the requirements.
- Integration Challenge: Deciding where to call SageMaker from within Cash App's microservices architecture.
- Team Structure: Composed of modelers and engineers with overlapping roles, focusing on different ML applications.
- Gondola Service: An internal project for deploying containerized Python models on Kubernetes, which evolved into a platform-wide solution integrating SageMaker.
- Model Lifecycle: Packaging, deploying, communicating with models, monitoring, and scaling.
- Packaging: A Python library for easy packaging, plus a metadata file (gondola.json) for configuration (a hypothetical example follows this list).
- Deployment: Two approaches on SageMaker: isolated single-model endpoints for models with consistent traffic, and multi-model endpoints for shared infrastructure (see the deployment sketch after this list).
- Inference Architecture: Simplified by a reverse proxy inside each container that standardizes request/response signatures (a minimal proxy sketch follows this list).
- Monitoring: Datadog for system load, network load, auto-scaling, and application performance monitoring (APM); see the instrumentation sketch after this list.
- Scaling: Vertical scaling for large models and horizontal scaling for shared infrastructure, depending on model type and traffic (an auto-scaling sketch follows this list).
- Customer Focus: Treating internal teams as customers, prioritizing impact, managing expectations, and aligning work to benefit multiple teams.
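
The session did not show the contents of gondola.json, so the sketch below is purely hypothetical: every field name is an assumption about the kind of metadata (model identity, runtime, serving configuration) such a file might carry.

```python
# Purely hypothetical sketch of a gondola.json metadata file; the real schema
# was not shown in the talk, so all field names here are assumptions.
import json

gondola_metadata = {
    "name": "support-intent-classifier",   # assumed: model name
    "version": "3.2.0",                     # assumed: model version
    "owner": "support-ml",                  # assumed: owning team
    "runtime": {"python": "3.10", "entry_point": "model.predict"},
    "serving": {
        "endpoint_type": "multi-model",     # single-model vs. multi-model endpoint
        "instance_type": "ml.m5.xlarge",
        "min_instances": 2,
        "max_instances": 10,
    },
}

with open("gondola.json", "w") as f:
    json.dump(gondola_metadata, f, indent=2)
```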
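The two deployment modes map directly onto SageMaker's endpoint APIs. The boto3 sketch below illustrates both; the endpoint names, image URI, role ARN, and S3 paths are placeholders, not Cash App's actual values.

```python
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/sagemaker-exec"  # placeholder
IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/gondola:latest"  # placeholder

sm = boto3.client("sagemaker")

# Single-model endpoint: one artifact on dedicated instances, suited to models
# with consistent traffic.
sm.create_model(
    ModelName="support-intent-v3",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "ModelDataUrl": "s3://models-bucket/support-intent/v3/model.tar.gz",
    },
)
sm.create_endpoint_config(
    EndpointConfigName="support-intent-v3-cfg",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "support-intent-v3",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 2,
    }],
)
sm.create_endpoint(
    EndpointName="support-intent",
    EndpointConfigName="support-intent-v3-cfg",
)

# Multi-model endpoint: many artifacts under one S3 prefix share the same
# container fleet; the artifact is selected per request via TargetModel.
sm.create_model(
    ModelName="shared-small-models",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://models-bucket/shared/",  # prefix with many model.tar.gz files
    },
)

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="shared-small-models",
    ContentType="application/json",
    Body=b'{"inputs": ["example"]}',
    TargetModel="churn/v1/model.tar.gz",  # which artifact under the prefix to load
)
```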
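The talk described the reverse proxy only at a high level; the Flask sketch below is a minimal illustration (not Cash App's code) of how a layer inside the container can pin every model to the same /invocations request/response signature.

```python
# Minimal sketch of a proxy layer that gives every model container the same
# request/response signature; predict() stands in for the packaged model.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(inputs):
    # Placeholder for the packaged model's inference function.
    return [{"label": "refund_request", "score": 0.92} for _ in inputs]

@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker container health check.
    return "", 200

@app.route("/invocations", methods=["POST"])
def invocations():
    # Normalize every request to {"inputs": [...]} and every response to
    # {"predictions": [...]}, regardless of the underlying model.
    payload = request.get_json(force=True)
    inputs = payload.get("inputs", payload)
    outputs = predict(inputs)
    return jsonify({"predictions": outputs})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```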
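For monitoring, a typical way to combine Datadog APM with custom metrics around an inference call looks like the sketch below; the span and metric names are illustrative, not the ones used at Cash App.

```python
# Minimal sketch of Datadog instrumentation around an inference call; span and
# metric names are illustrative assumptions.
import time

from datadog import statsd
from ddtrace import tracer

@tracer.wrap(name="gondola.inference", service="gondola")
def run_inference(predict_fn, inputs):
    start = time.time()
    outputs = predict_fn(inputs)
    # Emit custom latency/throughput metrics alongside the APM span.
    statsd.histogram("gondola.inference.latency_ms", (time.time() - start) * 1000)
    statsd.increment("gondola.inference.requests", tags=["model:support-intent"])
    return outputs
```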
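For horizontal scaling of a dedicated endpoint, SageMaker production variants can be scaled through the Application Auto Scaling API; the sketch below uses a hypothetical endpoint name and target value.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant; the policy tracks invocations per instance.
resource_id = "endpoint/support-intent/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # average invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```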
Insights
- Collaboration is Key: The convergence of efforts from different teams (support ML and Gondola) led to a more robust and unified ML platform.
- Platform Abstraction: Abstracting the complexity of SageMaker behind a service like Gondola simplifies the process for other teams and reduces vendor lock-in.
- Tooling for Independence: Creating tools that allow teams to independently manage parts of the ML stack (e.g., packaging models) empowers them and streamlines the deployment process.
- Monitoring and Observability: The emphasis on monitoring and observability using Datadog highlights the importance of these practices in maintaining a healthy ML platform.
- Scalability Considerations: The discussion on scaling strategies (vertical vs. horizontal) and the separation of batch and online inference workloads demonstrate a nuanced approach to resource management.
- Internal Customer Service: The approach to treating internal teams as customers, with a focus on documentation, availability, and aligning work for collective benefit, is a valuable insight for any internal platform team.
- Continuous Improvement: The mention of promotion and demotion of model versions and shadow traffic handling indicates a mature approach to continuous improvement and deployment in ML operations.