Large Model Training on AWS Deep Learning AMIs & PyTorch, ft. Pinterest (AIM326)

Title

AWS re:Invent 2023 - Large model training on AWS Deep Learning AMIs & PyTorch, ft. Pinterest -AIM326

Summary

  • The session covered the challenges and solutions for training large models on AWS, featuring insights from Pinterest's experience.
  • Karthik Aruntham (ML platforms engineering manager at Pinterest) and Zlatan (principal solutions architect at AWS) presented.
  • They discussed how ubiquitous large models have become and their growing impact on both digital and physical life.
  • Key challenges in training large models include the need for large compute and storage, GPU shortages, data hosting, training run stability, and orchestrating infrastructure.
  • AWS technologies like Deep Learning AMIs (DL AMIs), Deep Learning Containers (DLCs), and AWS Batch were highlighted for their role in efficient model training (see the job-submission sketch after this list).
  • Pinterest's ML infrastructure is built around three pillars: ML application development, compute orchestration, and training hardware.
  • Pinterest uses PyTorch, DL AMIs, and other AWS services to train large models, focusing on recommendation models and content understanding.
  • They have optimized their ML stack for faster upgrades and reduced total cost of ownership.
  • Pinterest's use of AWS Batch and UltraCluster has led to significant improvements in job submission, scaling, and execution speeds.
  • The session concluded with an open Q&A and resources for further information.
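
The AWS Batch flow described above comes down to a short job-submission call. Here is a minimal sketch using boto3; the queue, job definition, and command are hypothetical placeholders rather than values from the session, and the job definition is assumed to point at a GPU compute environment running a PyTorch Deep Learning Container image.

```python
# Minimal AWS Batch job submission for a GPU training run (hypothetical names).
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="large-model-train-001",
    jobQueue="gpu-training-queue",          # hypothetical queue backed by GPU instances
    jobDefinition="pytorch-dlc-training",   # hypothetical definition using a DLC image
    containerOverrides={
        "command": ["torchrun", "--nproc_per_node=8", "train.py"],
        "resourceRequirements": [{"type": "GPU", "value": "8"}],
        "environment": [{"name": "NCCL_DEBUG", "value": "INFO"}],
    },
)
print("Submitted job:", response["jobId"])
```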

Insights

  • Pinterest's approach to ML infrastructure emphasizes developer velocity and efficiency, with a focus on enabling ML engineers to train models without operational overhead.
  • The use of Docker containers for ML applications at Pinterest ensures reproducibility and seamless transition from development to production.
  • Pinterest's Training Compute Platform, built on Kubernetes, allows for efficient job management and resource scheduling, including per-team quotas, with room for preemption and gang scheduling (a quota sketch follows this list).
  • Pinterest's diversified GPU instance strategy, including the adoption of G5 instances, has helped reduce costs and align training and serving infrastructure.
  • Pinterest has used Ray, an open-source framework, to decouple data processing from GPU compute, yielding higher throughput and better GPU utilization (a minimal sketch of the pattern follows this list).
  • AWS UltraClusters with Elastic Fabric Adapter (EFA) give Pinterest a high-bandwidth, low-latency network that has sped up large-model training significantly (see the DDP sketch after this list).
  • The partnership with the AWS DL AMI team allowed Pinterest to standardize their ML stack across development, serving, and training, leading to faster PyTorch upgrades and reduced ownership costs.
  • AWS Batch has evolved to better support ML workloads, with features like fair share scheduling and dynamic compute environment updates, shaped by customer feedback and collaboration with Pinterest (a scheduling-policy sketch follows this list).
  • Pinterest trains directly against object storage (S3) rather than a shared file system, a less common choice that signals a broader trend toward cloud-native storage for ML workloads (a streaming-dataset sketch closes this section).
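
The Training Compute Platform insight above mentions quotas; that piece is directly expressible with the official kubernetes Python client. This is a minimal sketch under assumed names and limits; preemption and gang scheduling would come from a batch scheduler layer (for example Volcano or Kueue), not from a quota object.

```python
# Cap a team's GPU/CPU/memory requests in its namespace (illustrative values).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="ml-team-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.nvidia.com/gpu": "32",  # at most 32 GPUs requested at once
            "requests.cpu": "512",
            "requests.memory": "2Ti",
        }
    ),
)
client.CoreV1Api().create_namespaced_resource_quota(namespace="ml-team-a", body=quota)
```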
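
The Ray insight describes a producer/consumer split: CPU workers transform data while the GPU loop only consumes ready batches. A minimal sketch with Ray Data follows; the S3 path, column names, and transform are assumptions, not Pinterest's actual pipeline.

```python
# Decouple CPU preprocessing (Ray Data workers) from GPU consumption.
import ray

ray.init()

def preprocess(batch):
    # Runs in parallel on CPU workers; `raw` is a hypothetical input column.
    batch["feature"] = batch["raw"] * 2.0
    return batch

ds = (
    ray.data.read_parquet("s3://example-bucket/training-data/")  # hypothetical path
    .map_batches(preprocess, batch_format="numpy")
)

# The GPU process iterates finished batches; Ray overlaps preprocessing with training.
for batch in ds.iter_torch_batches(batch_size=1024, device="cuda"):
    features = batch["feature"]
    # ... forward/backward pass on the GPU would go here ...
```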
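
The UltraCluster/EFA insight is about the network under collective communication. A minimal multi-node PyTorch DDP sketch is below; the environment variables are commonly used libfabric/NCCL knobs for EFA, the model is a stand-in, and exact tuning is cluster-specific.

```python
# Multi-node DDP over NCCL; on EFA-enabled instances, NCCL traffic rides EFA
# via the aws-ofi-nccl plugin. Launch with:
#   torchrun --nnodes=<N> --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("FI_PROVIDER", "efa")  # prefer the EFA libfabric provider
os.environ.setdefault("NCCL_DEBUG", "INFO")

dist.init_process_group(backend="nccl")      # ranks come from torchrun's env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for a real model
model = DDP(model, device_ids=[local_rank])
```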
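
The fair share scheduling mentioned above is configured through an AWS Batch scheduling policy. A minimal boto3 sketch follows; the share identifiers, weights, and decay are illustrative. A job queue opts in by referencing the policy's ARN, and each submitted job then carries a shareIdentifier.

```python
# Create a fair share scheduling policy (illustrative shares and weights).
import boto3

batch = boto3.client("batch")

policy = batch.create_scheduling_policy(
    name="ml-fair-share",
    fairsharePolicy={
        "shareDecaySeconds": 3600,   # how quickly past usage stops counting
        "computeReservation": 10,    # hold back capacity for inactive shares
        "shareDistribution": [
            {"shareIdentifier": "research", "weightFactor": 1.0},
            {"shareIdentifier": "prod", "weightFactor": 0.5},  # lower factor = larger share
        ],
    },
)
print("Policy ARN:", policy["arn"])
```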
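
Finally, training straight from S3 means streaming shards instead of mounting a file system. A minimal sketch with boto3 and a torch IterableDataset is below; the bucket, prefix, and shard format (torch-saved tensors) are assumptions.

```python
# Stream training shards directly from S3, sharding object keys across
# DataLoader workers so no worker reads duplicate data.
import io
import boto3
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class S3Shards(IterableDataset):
    def __init__(self, bucket, prefix):
        self.bucket, self.prefix = bucket, prefix

    def __iter__(self):
        s3 = boto3.client("s3")  # one client per DataLoader worker
        info = get_worker_info()
        paginator = s3.get_paginator("list_objects_v2")
        keys = (
            obj["Key"]
            for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix)
            for obj in page.get("Contents", [])
        )
        for i, key in enumerate(keys):
            if info is not None and i % info.num_workers != info.id:
                continue  # another worker owns this shard
            body = s3.get_object(Bucket=self.bucket, Key=key)["Body"]
            for row in torch.load(io.BytesIO(body.read())):  # assumed shard format
                yield row

loader = DataLoader(S3Shards("example-bucket", "training/"), batch_size=256, num_workers=4)
```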