Title
AWS re:Invent 2023 - Large model training on AWS Deep Learning AMIs & PyTorch, ft. Pinterest -AIM326
Summary
- The session covered the challenges and solutions for training large models on AWS, featuring insights from Pinterest's experience.
- The presenters were Aruntham and Karthik (ML platform engineering manager at Pinterest) and Zlatan (principal solutions architect at AWS).
- They discussed how large models have become ubiquitous and now shape both digital and physical experiences.
- Key challenges in training large models include the need for large compute and storage, GPU shortages, data hosting, training run stability, and orchestrating infrastructure.
- AWS technologies like Deep Learning AMIs (DL AMIs), Deep Learning Containers (DLCs), and AWS Batch were highlighted for their role in efficient model training.
- Pinterest's ML infrastructure is built around three pillars: ML application development, compute orchestration, and training hardware.
- Pinterest uses PyTorch, DL AMIs, and other AWS services to train large models, focusing on recommendation models and content understanding.
- They have optimized their ML stack for faster upgrades and reduced total cost of ownership.
- Pinterest's use of AWS Batch and UltraCluster has led to significant improvements in job submission, scaling, and execution speeds.
- The session concluded with an open Q&A and resources for further information.
Insights
- Pinterest's approach to ML infrastructure emphasizes developer velocity and efficiency, with a focus on enabling ML engineers to train models without operational overhead.
- The use of Docker containers for ML applications at Pinterest ensures reproducibility and seamless transition from development to production.
- Pinterest's Training Compute Platform, built on Kubernetes, allows for efficient job management and resource scheduling, including quotas and potential for preemption and gang scheduling.
- The diversification of GPU instance strategy at Pinterest, including the adoption of G5 instances, has helped reduce costs and align training and serving infrastructure.
- Ray, an open-source framework, has been effectively used by Pinterest for decoupling data processing from GPU compute, leading to higher throughput and better GPU utilization.
- Pinterest's use of AWS's UltraCluster and EFA has enabled them to train large models with significant speed improvements due to the high bandwidth and low-latency network.
- The partnership with AWS DL AMI team allowed Pinterest to standardize their ML stack across development, serving, and training, leading to faster PyTorch upgrades and reduced ownership costs.
- AWS Batch has evolved to better support ML workloads, with features like fair share scheduling and dynamic compute environment updates, influenced by customer feedback and collaboration with Pinterest.
- Pinterest's strategic use of object storage (S3) for training data departs from the more common reliance on file systems, pointing to a broader trend toward cloud-native storage for ML workloads.
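The Ray-based decoupling of CPU data processing from GPU compute mentioned above can be illustrated with a minimal sketch. This uses only the Python standard library (threads and a bounded queue) to show the producer/consumer pattern; Pinterest's actual setup uses Ray tasks and actors on separate CPU nodes, and all function names here are hypothetical stand-ins.

```python
import queue
import threading
import time

def preprocess(batch_id):
    """CPU-side work (decode, tokenize, augment) -- hypothetical stand-in."""
    time.sleep(0.001)  # simulate CPU cost
    return [batch_id * 10 + i for i in range(4)]

def cpu_producer(out_q, num_batches):
    # Runs concurrently with training; in the Ray version this role is
    # played by remote tasks/actors on CPU nodes, not a local thread.
    for b in range(num_batches):
        out_q.put(preprocess(b))
    out_q.put(None)  # sentinel: no more batches

def gpu_consumer(in_q, results):
    # Stand-in for the GPU training step; it consumes batches as they
    # arrive so the accelerator never idles waiting on preprocessing.
    while True:
        batch = in_q.get()
        if batch is None:
            break
        results.append(sum(batch))  # pretend "train step"

batches = queue.Queue(maxsize=8)   # bounded: backpressure on the producer
results = []
t = threading.Thread(target=cpu_producer, args=(batches, 5))
t.start()
gpu_consumer(batches, results)
t.join()
print(results)  # -> [6, 46, 86, 126, 166]
```

The bounded queue is the key design choice: it lets preprocessing run ahead of the GPU without unbounded memory growth, which is the same throughput benefit the session attributes to Ray.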
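The fair-share scheduling feature of AWS Batch noted above can be sketched as a scheduling-policy request. The dict below mirrors the shape of the AWS Batch CreateSchedulingPolicy API (boto3's `batch.create_scheduling_policy`); the policy name, share identifiers, and weights are hypothetical, not values from the session.

```python
# Hedged sketch: field names follow the AWS Batch fairsharePolicy schema;
# the specific teams and numbers are illustrative assumptions.
fair_share_policy = {
    "name": "ml-training-fair-share",        # hypothetical policy name
    "fairsharePolicy": {
        "shareDecaySeconds": 3600,           # past usage decays over an hour
        "computeReservation": 10,            # hold back capacity for idle shares
        "shareDistribution": [
            # A lower weightFactor means a larger share of compute.
            {"shareIdentifier": "recsys", "weightFactor": 0.5},
            {"shareIdentifier": "content-understanding", "weightFactor": 1.0},
        ],
    },
}

# Jobs then carry a shareIdentifier when submitted, e.g.:
# batch.submit_job(..., shareIdentifier="recsys")
shares = sorted(s["shareIdentifier"]
                for s in fair_share_policy["fairsharePolicy"]["shareDistribution"])
print(shares)  # -> ['content-understanding', 'recsys']
```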
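The S3-for-training-data point above comes down to sharding object keys across data-loader workers and streaming records lazily instead of staging files on a local filesystem. The sketch below stubs out the S3 calls (in practice they would be boto3 `list_objects_v2` / `get_object`); all keys and record shapes are hypothetical.

```python
# Sketch, assuming a bucket of line-oriented training shards; the two
# functions below are stubs for S3 listing and object reads.

def list_training_objects():
    # Stub for listing a bucket prefix; keys are hypothetical.
    return [f"training/part-{i:05d}.jsonl" for i in range(8)]

def read_records(key):
    # Stub for GetObject + decode; yields fake records per object.
    for row in range(2):
        yield {"key": key, "row": row}

def shard_stream(worker_rank, num_workers):
    """Each worker streams only its slice of the object keys."""
    for key in list_training_objects()[worker_rank::num_workers]:
        yield from read_records(key)

# Worker 1 of 4 sees objects 1 and 5 only.
seen = sorted({r["key"] for r in shard_stream(worker_rank=1, num_workers=4)})
print(seen)  # -> ['training/part-00001.jsonl', 'training/part-00005.jsonl']
```

Because each worker reads disjoint keys directly from object storage, there is no shared POSIX filesystem to provision or scale, which is the cloud-native trade-off the insight describes.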