Title
AWS re:Invent 2023 - Deploy FMs on Amazon SageMaker for price performance (AIM330)
Summary
- Introduction: Venkatesh Krishnan and Rama Thaman introduced the session, which focuses on deploying foundation models (FMs) on Amazon SageMaker for optimal price performance in generative AI applications. Travis Mellinger from Cisco Systems shared their success story with SageMaker.
- Generative AI and SageMaker: 2023 is highlighted as the year generative AI gained significant attention. The session covered the journey from experimentation to production-scale deployment, emphasizing the importance of model selection, customization, deployment, integration, and maintenance.
- Performance Tuning: The talk stressed the importance of performance tuning for latency, throughput, and scale while minimizing costs. It also discussed the challenges of deploying large models, including the need for powerful compute instances and the associated high costs.
- Cost of Inference: Inference on FMs is costly because of its computational demands, particularly the large matrix multiplications that call for efficient accelerators such as GPUs. GPUs, however, are expensive, and the session illustrated this with the high monthly cost of deploying a model on a P4d instance (a rough cost illustration follows after this list).
- SageMaker Inference Options: SageMaker offers a range of inference options, including real-time, batch, and asynchronous inference; supports deploying single or multiple models per endpoint; and provides infrastructure choices such as CPUs, GPUs, Inferentia- and Graviton-based instances, or serverless inference (a minimal real-time deployment sketch appears after this list).
- Cost-Saving Features: SageMaker's cost-saving features include deploying multiple models on the same instance, using Inferentia- or Trainium-based instances for better price performance, and dynamic auto-scaling based on traffic (see the auto-scaling sketch after this list).
- Large Model Inference (LMI) Container: The LMI container bundles libraries and tools for optimizing large language models, such as DeepSpeed and TensorRT-LLM, to improve latency and throughput (an illustrative configuration follows after this list).
- Post-Deployment Maintenance: SageMaker supports rolling deployments for model updates without impacting production traffic or requiring double the instance count (see the rolling-update sketch after this list).
- Demo: Rama demonstrated deploying a large language model on SageMaker using both a notebook and the UI, highlighting the ease of use and cost benefits (a JumpStart-style deployment sketch appears after this list).
- Cisco's Use Case: Travis Mellinger shared how Cisco leverages SageMaker for AI capabilities in the Webex ecosystem, emphasizing the benefits of SageMaker for research, experimentation, and scaling.
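For rough scale on the inference-cost point above (illustrative figures, not quoted from the session): an ml.p4d.24xlarge instance lists at roughly $37 per hour on demand, so one always-on instance costs about $37 × 24 hours × 30 days ≈ $26,600 per month, before any scale-out for traffic.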
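A minimal sketch of the real-time inference option using the SageMaker Python SDK. The model id, container versions, and instance type below are illustrative assumptions, not the session's exact demo:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Assumes this runs in an environment with a SageMaker execution role.
role = sagemaker.get_execution_role()

# Illustrative model, pulled from the Hugging Face Hub at container startup.
model = HuggingFaceModel(
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    env={"HF_MODEL_ID": "google/flan-t5-large"},
)

# Deploy to a real-time endpoint on a single GPU instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.predict({"inputs": "Summarize SageMaker's inference options."}))
```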
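The dynamic auto-scaling feature is configured through Application Auto Scaling. A sketch with hypothetical endpoint and variant names and illustrative policy values:

```python
import boto3

# Hypothetical names; "AllTraffic" is the default variant name for a single-variant endpoint.
resource_id = "endpoint/llm-endpoint/variant/AllTraffic"

autoscaling = boto3.client("application-autoscaling")

# Allow the variant to scale between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance per minute (target value is illustrative).
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```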
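One way to use the LMI container is to point a generic SageMaker Model at an LMI image and pass settings as OPTION_* environment variables, which the container maps onto its serving.properties keys. The model id, parallelism, and instance type below are placeholder assumptions:

```python
import sagemaker
from sagemaker import Model, image_uris

role = sagemaker.get_execution_role()

# Resolve an LMI (DJL DeepSpeed) container image for the region (version is illustrative).
image = image_uris.retrieve(framework="djl-deepspeed", region="us-east-1", version="0.23.0")

model = Model(
    image_uri=image,
    role=role,
    env={
        "OPTION_MODEL_ID": "tiiuae/falcon-7b-instruct",  # hypothetical model
        "OPTION_TENSOR_PARALLEL_DEGREE": "4",            # shard across 4 GPUs
        "OPTION_DTYPE": "fp16",
        "OPTION_ROLLING_BATCH": "auto",                  # continuous batching for throughput
    },
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
```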
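Rolling deployments are driven through the UpdateEndpoint API's DeploymentConfig, which replaces capacity in batches instead of provisioning a full second fleet. Endpoint, config, and alarm names here are hypothetical:

```python
import boto3

sagemaker_client = boto3.client("sagemaker")

sagemaker_client.update_endpoint(
    EndpointName="llm-endpoint",
    EndpointConfigName="llm-endpoint-config-v2",  # new config with the updated model
    DeploymentConfig={
        "RollingUpdatePolicy": {
            # Shift 25% of capacity at a time, waiting 5 minutes between batches.
            "MaximumBatchSize": {"Type": "CAPACITY_PERCENT", "Value": 25},
            "WaitIntervalInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            # Roll back automatically if this CloudWatch alarm fires.
            "Alarms": [{"AlarmName": "llm-endpoint-errors"}],
        },
    },
)
```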
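The notebook portion of the demo can be approximated with SageMaker JumpStart in a few lines; the model id below is an illustrative JumpStart identifier, not necessarily the one used on stage:

```python
from sagemaker.jumpstart.model import JumpStartModel

# Illustrative JumpStart model id; the session's demo model may differ.
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")

# deploy() picks the model's default instance type and container.
predictor = model.deploy()

print(predictor.predict({"inputs": "What is Amazon SageMaker?"}))
```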
Insights
- Generative AI's Hype: Generative AI is at the peak of inflated expectations, indicating a high level of interest and potential overestimation of capabilities. This presents both opportunities and challenges for businesses looking to leverage generative AI.
- Trilemma of Complexity, Performance, and Cost: The session highlighted the difficulty of optimizing all three factors simultaneously; prioritizing two often compromises the third. This trilemma is a key consideration when deploying FMs.
- GPU Scarcity and Cost: GPU scarcity and high prices are a significant challenge for deploying large models. SageMaker's cost-saving features and support for alternative accelerators such as Inferentia and Trainium can mitigate this issue.
- SageMaker as a Managed Service: The session underscored the benefits of using SageMaker as a managed service for deploying FMs, reducing operational overhead and accelerating innovation by abstracting away the complexities of model deployment and optimization.
- Interactive User Experiences: SageMaker's ability to stream responses from models enables interactive user experiences such as chatbots, which is crucial for customer engagement (see the streaming sketch at the end of this section).
- Real-world Application: Cisco's use case provided a practical example of SageMaker in a large-scale, real-world environment, demonstrating the platform's scalability and cost-effectiveness.
- Future Improvements: The session hinted at future improvements in SageMaker, such as better developer experiences and more efficient model scaling, indicating ongoing innovation in AWS services to support AI and ML applications.
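Response streaming is exposed through the SageMaker runtime's invoke_endpoint_with_response_stream API. A sketch assuming a hypothetical endpoint whose container supports streaming (for example, an LMI- or TGI-based LLM container):

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="llm-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "Tell me about AWS re:Invent.",
                     "parameters": {"max_new_tokens": 128}}),
)

# Tokens arrive as a stream of PayloadPart events rather than one final payload,
# so the client can render text as it is generated.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```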