Title
AWS re:Invent 2023 - Large model training on AWS Deep Learning AMIs & PyTorch, ft. Pinterest -AIM326
Summary
- The session covered the challenges and solutions for training large models on AWS, featuring insights from Pinterest's experience.
- The presenters were Aruntham and Karthik (ML platform engineering manager at Pinterest) and Zlatan (principal solutions architect at AWS).
- They discussed how large models have become ubiquitous and now shape both digital and physical experiences.
- Key challenges in training large models include the need for large compute and storage, GPU shortages, data hosting, training run stability, and orchestrating infrastructure.
- AWS technologies like Deep Learning AMIs (DL AMIs), Deep Learning Containers (DLCs), and AWS Batch were highlighted for their role in efficient model training.
- Pinterest's ML infrastructure is built around three pillars: ML application development, compute orchestration, and training hardware.
- Pinterest uses PyTorch, DL AMIs, and other AWS services to train large models, focusing on recommendation models and content understanding.
- They have optimized their ML stack for faster upgrades and reduced total cost of ownership.
- Pinterest's use of AWS Batch and UltraCluster has led to significant improvements in job submission, scaling, and execution speeds.
- The session concluded with an open Q&A and resources for further information.
Insights
- Pinterest's approach to ML infrastructure emphasizes developer velocity and efficiency, with a focus on enabling ML engineers to train models without operational overhead.
- The use of Docker containers for ML applications at Pinterest ensures reproducibility and seamless transition from development to production.
- Pinterest's Training Compute Platform, built on Kubernetes, allows for efficient job management and resource scheduling, including quotas and potential for preemption and gang scheduling.
- The diversification of GPU instance strategy at Pinterest, including the adoption of G5 instances, has helped reduce costs and align training and serving infrastructure.
- Ray, an open-source framework, has been effectively used by Pinterest for decoupling data processing from GPU compute, leading to higher throughput and better GPU utilization.
- Pinterest's use of AWS's UltraCluster and EFA has enabled them to train large models with significant speed improvements due to the high bandwidth and low-latency network.
- The partnership with AWS DL AMI team allowed Pinterest to standardize their ML stack across development, serving, and training, leading to faster PyTorch upgrades and reduced ownership costs.
- AWS Batch has evolved to better support ML workloads, with features like fair share scheduling and dynamic compute environment updates, influenced by customer feedback and collaboration with Pinterest.
- Pinterest's strategic use of object storage (S3) for training data departs from the more common reliance on file systems, pointing to a broader trend toward cloud-native storage for ML workloads.
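The Ray-based decoupling of CPU data processing from GPU compute mentioned above can be illustrated with a minimal sketch. This uses only the Python standard library (threads and a bounded queue) to show the producer/consumer pattern; Pinterest's actual setup uses Ray tasks and actors on separate CPU nodes, and all function names here are hypothetical stand-ins.

```python
import queue
import threading
import time

def preprocess(batch_id):
    """CPU-side work (decode, tokenize, augment) -- hypothetical stand-in."""
    time.sleep(0.001)  # simulate CPU cost
    return [batch_id * 10 + i for i in range(4)]

def cpu_producer(out_q, num_batches):
    # Runs concurrently with training; in the Ray version this role is
    # played by remote tasks/actors on CPU nodes, not a local thread.
    for b in range(num_batches):
        out_q.put(preprocess(b))
    out_q.put(None)  # sentinel: no more batches

def gpu_consumer(in_q, results):
    # Stand-in for the GPU training step; it consumes batches as they
    # arrive so the accelerator never idles waiting on preprocessing.
    while True:
        batch = in_q.get()
        if batch is None:
            break
        results.append(sum(batch))  # pretend "train step"

batches = queue.Queue(maxsize=8)   # bounded: backpressure on the producer
results = []
t = threading.Thread(target=cpu_producer, args=(batches, 5))
t.start()
gpu_consumer(batches, results)
t.join()
print(results)  # -> [6, 46, 86, 126, 166]
```

The bounded queue is the key design choice: it lets preprocessing run ahead of the GPU without unbounded memory growth, which is the same throughput benefit the session attributes to Ray.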
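The fair-share scheduling feature of AWS Batch noted above can be sketched as a scheduling-policy request. The dict below mirrors the shape of the AWS Batch CreateSchedulingPolicy API (boto3's `batch.create_scheduling_policy`); the policy name, share identifiers, and weights are hypothetical, not values from the session.

```python
# Hedged sketch: field names follow the AWS Batch fairsharePolicy schema;
# the specific teams and numbers are illustrative assumptions.
fair_share_policy = {
    "name": "ml-training-fair-share",        # hypothetical policy name
    "fairsharePolicy": {
        "shareDecaySeconds": 3600,           # past usage decays over an hour
        "computeReservation": 10,            # hold back capacity for idle shares
        "shareDistribution": [
            # A lower weightFactor means a larger share of compute.
            {"shareIdentifier": "recsys", "weightFactor": 0.5},
            {"shareIdentifier": "content-understanding", "weightFactor": 1.0},
        ],
    },
}

# Jobs then carry a shareIdentifier when submitted, e.g.:
# batch.submit_job(..., shareIdentifier="recsys")
shares = sorted(s["shareIdentifier"]
                for s in fair_share_policy["fairsharePolicy"]["shareDistribution"])
print(shares)  # -> ['content-understanding', 'recsys']
```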
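The S3-for-training-data point above comes down to sharding object keys across data-loader workers and streaming records lazily instead of staging files on a local filesystem. The sketch below stubs out the S3 calls (in practice they would be boto3 `list_objects_v2` / `get_object`); all keys and record shapes are hypothetical.

```python
# Sketch, assuming a bucket of line-oriented training shards; the two
# functions below are stubs for S3 listing and object reads.

def list_training_objects():
    # Stub for listing a bucket prefix; keys are hypothetical.
    return [f"training/part-{i:05d}.jsonl" for i in range(8)]

def read_records(key):
    # Stub for GetObject + decode; yields fake records per object.
    for row in range(2):
        yield {"key": key, "row": row}

def shard_stream(worker_rank, num_workers):
    """Each worker streams only its slice of the object keys."""
    for key in list_training_objects()[worker_rank::num_workers]:
        yield from read_records(key)

# Worker 1 of 4 sees objects 1 and 5 only.
seen = sorted({r["key"] for r in shard_stream(worker_rank=1, num_workers=4)})
print(seen)  # -> ['training/part-00001.jsonl', 'training/part-00005.jsonl']
```

Because each worker reads disjoint keys directly from object storage, there is no shared POSIX filesystem to provision or scale, which is the cloud-native trade-off the insight describes.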