Title
AWS re:Invent 2023 - Building a machine learning team and platform at Cash App (AIM228)
Summary
- Speakers: Jason Hand (Senior Developer Advocate at Datadog) and James (Machine Learning team at Cash App).
- Topic: How Cash App built its machine learning (ML) platform, from fixing model hosting problems to creating a scalable, platform-wide solution.
- Cash App: A financial app for sending, spending, and investing money, including stocks and Bitcoin.
- ML Platform Origin: Began with Cash App's support ML team, which focused on customer support and was struggling with inadequate model hosting infrastructure.
- Key Requirements: Stability, observability, simplicity, and co-location with other services.
- SageMaker: Chosen after stress testing showed it solved the problems encountered and fit the requirements.
- Integration Challenge: Deciding where to call SageMaker from within Cash App's microservices architecture.
- Team Structure: Composed of modelers and engineers with overlapping roles, focusing on different ML applications.
- Gondola Service: An internal project for deploying containerized Python models on Kubernetes, which evolved into a platform-wide solution integrating SageMaker.
- Model Lifecycle: Packaging, deploying, communicating with models, monitoring, and scaling.
- Packaging: A Python library for easy packaging, plus a metadata file (gondola.json) for configuration (a hypothetical example follows this list).
- Deployment: Two approaches on SageMaker: isolated single-model endpoints for models with consistent traffic, and multi-model endpoints for shared infrastructure (see the deployment sketch after this list).
- Inference Architecture: Simplified by a reverse proxy inside each container that standardizes request/response signatures (a minimal proxy sketch follows this list).
- Monitoring: Datadog for system load, network load, auto-scaling, and application performance monitoring (APM); see the instrumentation sketch after this list.
- Scaling: Vertical scaling for large models and horizontal scaling for shared infrastructure, depending on model type and traffic (an auto-scaling sketch follows this list).
- Customer Focus: Treating internal teams as customers, prioritizing impact, managing expectations, and aligning work to benefit multiple teams.
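
The session did not show the contents of gondola.json, so the sketch below is purely hypothetical: every field name is an assumption about the kind of metadata (model identity, runtime, serving configuration) such a file might carry.

```python
# Purely hypothetical sketch of a gondola.json metadata file; the real schema
# was not shown in the talk, so all field names here are assumptions.
import json

gondola_metadata = {
    "name": "support-intent-classifier",   # assumed: model name
    "version": "3.2.0",                     # assumed: model version
    "owner": "support-ml",                  # assumed: owning team
    "runtime": {"python": "3.10", "entry_point": "model.predict"},
    "serving": {
        "endpoint_type": "multi-model",     # single-model vs. multi-model endpoint
        "instance_type": "ml.m5.xlarge",
        "min_instances": 2,
        "max_instances": 10,
    },
}

with open("gondola.json", "w") as f:
    json.dump(gondola_metadata, f, indent=2)
```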
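The two deployment modes map directly onto SageMaker's endpoint APIs. The boto3 sketch below illustrates both; the endpoint names, image URI, role ARN, and S3 paths are placeholders, not Cash App's actual values.

```python
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/sagemaker-exec"  # placeholder
IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/gondola:latest"  # placeholder

sm = boto3.client("sagemaker")

# Single-model endpoint: one artifact on dedicated instances, suited to models
# with consistent traffic.
sm.create_model(
    ModelName="support-intent-v3",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "ModelDataUrl": "s3://models-bucket/support-intent/v3/model.tar.gz",
    },
)
sm.create_endpoint_config(
    EndpointConfigName="support-intent-v3-cfg",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "support-intent-v3",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 2,
    }],
)
sm.create_endpoint(
    EndpointName="support-intent",
    EndpointConfigName="support-intent-v3-cfg",
)

# Multi-model endpoint: many artifacts under one S3 prefix share the same
# container fleet; the artifact is selected per request via TargetModel.
sm.create_model(
    ModelName="shared-small-models",
    ExecutionRoleArn=ROLE_ARN,
    PrimaryContainer={
        "Image": IMAGE_URI,
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://models-bucket/shared/",  # prefix with many model.tar.gz files
    },
)

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="shared-small-models",
    ContentType="application/json",
    Body=b'{"inputs": ["example"]}',
    TargetModel="churn/v1/model.tar.gz",  # which artifact under the prefix to load
)
```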
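The talk described the reverse proxy only at a high level; the Flask sketch below is a minimal illustration (not Cash App's code) of how a layer inside the container can pin every model to the same /invocations request/response signature.

```python
# Minimal sketch of a proxy layer that gives every model container the same
# request/response signature; predict() stands in for the packaged model.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(inputs):
    # Placeholder for the packaged model's inference function.
    return [{"label": "refund_request", "score": 0.92} for _ in inputs]

@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker container health check.
    return "", 200

@app.route("/invocations", methods=["POST"])
def invocations():
    # Normalize every request to {"inputs": [...]} and every response to
    # {"predictions": [...]}, regardless of the underlying model.
    payload = request.get_json(force=True)
    inputs = payload.get("inputs", payload)
    outputs = predict(inputs)
    return jsonify({"predictions": outputs})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```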
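For monitoring, a typical way to combine Datadog APM with custom metrics around an inference call looks like the sketch below; the span and metric names are illustrative, not the ones used at Cash App.

```python
# Minimal sketch of Datadog instrumentation around an inference call; span and
# metric names are illustrative assumptions.
import time

from datadog import statsd
from ddtrace import tracer

@tracer.wrap(name="gondola.inference", service="gondola")
def run_inference(predict_fn, inputs):
    start = time.time()
    outputs = predict_fn(inputs)
    # Emit custom latency/throughput metrics alongside the APM span.
    statsd.histogram("gondola.inference.latency_ms", (time.time() - start) * 1000)
    statsd.increment("gondola.inference.requests", tags=["model:support-intent"])
    return outputs
```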
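For horizontal scaling of a dedicated endpoint, SageMaker production variants can be scaled through the Application Auto Scaling API; the sketch below uses a hypothetical endpoint name and target value.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant; the policy tracks invocations per instance.
resource_id = "endpoint/support-intent/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # average invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```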
Insights
- Collaboration is Key: The convergence of efforts from different teams (support ML and Gondola) led to a more robust and unified ML platform.
- Platform Abstraction: Abstracting the complexity of SageMaker behind a service like Gondola simplifies the process for other teams and reduces vendor lock-in.
- Tooling for Independence: Creating tools that allow teams to independently manage parts of the ML stack (e.g., packaging models) empowers them and streamlines the deployment process.
- Monitoring and Observability: The emphasis on monitoring and observability using Datadog highlights the importance of these practices in maintaining a healthy ML platform.
- Scalability Considerations: The discussion on scaling strategies (vertical vs. horizontal) and the separation of batch and online inference workloads demonstrate a nuanced approach to resource management.
- Internal Customer Service: The approach to treating internal teams as customers, with a focus on documentation, availability, and aligning work for collective benefit, is a valuable insight for any internal platform team.
- Continuous Improvement: The mention of promotion and demotion of model versions and shadow traffic handling indicates a mature approach to continuous improvement and deployment in ML operations.