Scaling FM Inference to Hundreds of Models with Amazon SageMaker (AIM327)

Title

AWS re:Invent 2023 - Scaling FM inference to hundreds of models with Amazon SageMaker (AIM327)

Summary

  • Dhawal Patel, a leader on AWS's machine learning specialist team, and Alan Tan of the SageMaker product team presented new SageMaker features for scaling foundation model (FM) inference.
  • Bhavesh Doshi from Salesforce shared how his team uses SageMaker to scale FM inference cost-efficiently.
  • Foundation models are large, pre-trained models that require significant memory and computational resources.
  • SageMaker offers multiple inference options, including real-time, batch (offline), asynchronous, and serverless inference, as well as multi-model endpoints, with hardware support for CPUs, GPUs, and AWS Inferentia.
  • New features in SageMaker's large model inference container reduce latency by 20% on average, through optimizations such as an improved all-reduce algorithm and a TensorRT-LLM backend.
  • SageMaker's multi-model inference endpoint dynamically loads models and uses smart routing to minimize cold-start latency.
  • SageMaker inference components allow multiple foundation models to be packed into a single endpoint, reducing operational overhead and costs (a deployment sketch follows this list).
  • Salesforce's Einstein 1 Platform leverages SageMaker for generative AI use cases, optimizing model inference and scaling up to hundreds of foundation models.
  • The session concluded with a demonstration of SageMaker's new features, including auto-scaling, smart routing, and streaming responses, and a call for feedback on the session.
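
The deployment flow for inference components was only described verbally in the session; the sketch below shows roughly how it maps onto the SageMaker boto3 API. The endpoint, model, component, and role names (fm-shared-endpoint, llama-7b-model, llama-7b-ic, and the role ARN) are hypothetical placeholders, and the instance type and resource sizes are illustrative, not recommendations.

```python
import boto3

sm = boto3.client("sagemaker")

# 1. Create an endpoint config with no model attached: models are added later
#    as inference components. LEAST_OUTSTANDING_REQUESTS enables the "smart
#    routing" behavior described in the session.
sm.create_endpoint_config(
    EndpointConfigName="fm-shared-config",                     # hypothetical name
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SMRole",  # placeholder role
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)
sm.create_endpoint(EndpointName="fm-shared-endpoint",
                   EndpointConfigName="fm-shared-config")

# 2. Pack a foundation model onto the endpoint as an inference component,
#    reserving a slice of the instance (accelerators, memory) for it.
sm.create_inference_component(
    InferenceComponentName="llama-7b-ic",       # hypothetical component name
    EndpointName="fm-shared-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "llama-7b-model",          # an existing SageMaker model
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 24 * 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},             # copies scale independently
)

# 3. Invoke a specific model on the shared endpoint by component name.
rt = boto3.client("sagemaker-runtime")
resp = rt.invoke_endpoint(
    EndpointName="fm-shared-endpoint",
    InferenceComponentName="llama-7b-ic",
    ContentType="application/json",
    Body=b'{"inputs": "Hello"}',
)
print(resp["Body"].read())
```

Repeating step 2 with different component names and resource slices is what allows hundreds of models to share the same fleet; the streaming responses shown in the demo use the analogous invoke_endpoint_with_response_stream call.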

Insights

  • The ability to scale foundation model inference efficiently is critical for organizations embedding generative AI into their operations.
  • SageMaker's new features address the challenges of hosting large foundation models by optimizing resource utilization and reducing latency.
  • Salesforce's use case demonstrates the practical application of SageMaker's new features in a real-world scenario, highlighting the importance of cost efficiency and performance at scale.
  • The session emphasized the importance of a unified container for various types of foundation models, which simplifies deployment and management.
  • The new features in SageMaker, such as auto-scaling policies for individual models and smart routing, are designed to handle the variability in traffic and inference latency that comes with foundation models (see the auto-scaling sketch after this list).
  • The session's focus on practical demonstrations and customer stories underscores AWS's commitment to providing solutions that meet the needs of enterprise customers in the field of machine learning and AI.
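
Per-model auto-scaling was discussed but not shown in code; below is a minimal sketch of how it is typically wired up with Application Auto Scaling, assuming the hypothetical llama-7b-ic component from the earlier sketch and an illustrative target of 4 invocations per copy.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Each inference component scales on its own copy count, independently of
# the other models sharing the endpoint.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama-7b-ic",  # hypothetical component
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking on invocations per copy: copies are added when a model
# gets hot and removed when its traffic dies down.
aas.put_scaling_policy(
    PolicyName="llama-7b-ic-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama-7b-ic",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 4.0,  # illustrative invocations-per-copy target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```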