Scale Interactive Data Analysis with Step Functions Distributed Map (API310)

Title

AWS re:Invent 2023 - Scale interactive data analysis with Step Functions Distributed Map (API310)

Summary

  • Speakers: Adam Wagner (Principal Serverless Solutions Architect at AWS) and Roberto Iturralde (Senior Director of Software Engineering at Vertex Pharmaceuticals).
  • Topic: Scaling interactive data analysis using AWS Step Functions Distributed Map.
  • Use Case: Processing 500,000 invoice files stored in S3 by extracting data from each file, running calculations, and loading the results into a database for reporting.
  • Solution: Distributed processing using AWS Step Functions to handle large-scale data processing efficiently and cost-effectively.
  • Benefits: Faster processing, scalability, fault tolerance, and cost-effectiveness.
  • Challenges: Managing concurrency, climbing the learning curve of big data frameworks, and balancing cost, security, and speed.
  • Serverless Solution: AWS Step Functions, a serverless workflow service that integrates with more than 200 AWS services, including direct integrations with Amazon Bedrock and HTTPS API endpoints.
  • Distributed Map Feature: Iterates over objects in S3 and runs a child workflow for each item, scaling up to 10,000 concurrent child workflow executions.
  • Vertex Pharmaceuticals Use Case: Accelerating drug discovery by analyzing images from experiments using machine learning for image segmentation.
  • Results: Improved scalability, 11 times faster processing, 90% cost reduction, and better system visibility.
  • Challenges and Solutions: Vertex ran into friction from adopting a new service and from tooling that lagged behind it, and resolved these issues by working with AWS support.
  • Resources: Free training guides, Powertools for AWS Lambda, and AWS expo showcases.
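
The invoice use case and the Distributed Map bullet above can be sketched as a workflow definition. This is a minimal illustration, not the definition shown in the talk: it builds an Amazon States Language (ASL) document in Python with a Map state in DISTRIBUTED mode that lists objects under an S3 prefix and invokes a Lambda function for each one. The bucket name, prefix, function ARN, state names, and the default concurrency of 1,000 are all hypothetical placeholders.

```python
import json

# Limit quoted in the summary: up to 10,000 concurrent child workflows.
MAX_CHILD_CONCURRENCY = 10_000


def build_invoice_workflow(bucket, prefix, function_arn, max_concurrency=1_000):
    """Return an ASL definition (as a dict) containing one Distributed Map state."""
    if not 1 <= max_concurrency <= MAX_CHILD_CONCURRENCY:
        raise ValueError(
            f"max_concurrency must be between 1 and {MAX_CHILD_CONCURRENCY}"
        )

    return {
        "Comment": "Hypothetical invoice pipeline sketch",
        "StartAt": "ProcessInvoices",
        "States": {
            "ProcessInvoices": {
                "Type": "Map",
                # ItemReader lists the S3 objects; each object becomes one item.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:listObjectsV2",
                    "Parameters": {"Bucket": bucket, "Prefix": prefix},
                },
                # ItemProcessor is the child workflow started for each item.
                "ItemProcessor": {
                    "ProcessorConfig": {
                        "Mode": "DISTRIBUTED",
                        "ExecutionType": "STANDARD",
                    },
                    "StartAt": "ExtractAndLoad",
                    "States": {
                        "ExtractAndLoad": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": function_arn,
                                "Payload.$": "$",
                            },
                            "End": True,
                        }
                    },
                },
                "MaxConcurrency": max_concurrency,
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    definition = build_invoice_workflow(
        bucket="example-invoice-bucket",  # hypothetical bucket
        prefix="invoices/",               # hypothetical prefix
        function_arn="arn:aws:lambda:us-east-1:123456789012:function:extract-invoice",
    )
    print(json.dumps(definition, indent=2))
```

Capping MaxConcurrency below the service limit is one way to protect a downstream database from being overwhelmed by thousands of simultaneous writers; the right value depends on the workload.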

Insights

  • Serverless First Approach: Vertex Pharmaceuticals adopts a serverless-first approach so that its small engineering team can focus on differentiating work, relying on AWS for high availability, elasticity, and cost optimization.
  • Performance and Cost: The transition to a serverless architecture using AWS Step Functions Distributed Map resulted in significant performance improvements and cost savings for Vertex Pharmaceuticals.
  • Operational Efficiency: The serverless solution reduced the operational burden, eliminating the need for OS patching and infrastructure maintenance.
  • Visibility and Error Handling: The new system provided better visibility into operations and robust error handling out of the box, which was a significant improvement over the legacy system.
  • Learning Curve and Adoption: There was an initial learning curve and some friction due to the newness of the Distributed Map feature and the lag in tooling support, but the benefits outweighed these challenges.
  • Community and Support: Close collaboration with AWS support and the community helped overcome challenges and influence product improvements.
  • Infrastructure as Code: Emphasis on defining Step Functions workflows as code in the Amazon States Language (ASL) to ensure consistency and ease of deployment.
  • Wide Applicability: Step Functions Distributed Map has a broad range of use cases beyond life sciences, including financial modeling, unstructured file processing, data transformation, and migration.
  • Continuous Learning: AWS encourages continuous learning and skill development through free training resources and community engagement.
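
The error handling and infrastructure-as-code points above can be illustrated with the Retry and Catch fields that ASL provides on Task states. The sketch below is an assumption-laden example rather than Vertex's configuration: the helper names, the fallback state name "RecordFailure", and the retry values are hypothetical. It adds a retrier and a catcher to a task definition and computes the nominal wait before each retry (interval multiplied by the backoff rate raised to the retry index, ignoring any jitter).

```python
def with_error_handling(task_state, fallback_state,
                        interval_seconds=2, max_attempts=4, backoff_rate=2.0):
    """Return a copy of an ASL Task state with Retry and Catch added."""
    handled = dict(task_state)
    # Retry failed task attempts with exponential backoff.
    handled["Retry"] = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": interval_seconds,
        "MaxAttempts": max_attempts,
        "BackoffRate": backoff_rate,
    }]
    # Once retries are exhausted, route any error to a fallback state.
    # The fallback state must be defined in the same workflow (hypothetical here).
    handled["Catch"] = [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": fallback_state,
    }]
    return handled


def retry_delays(retrier):
    """Nominal seconds waited before each retry, ignoring jitter."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** n
        for n in range(retrier["MaxAttempts"])
    ]


if __name__ == "__main__":
    task = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "End": True,
    }
    handled = with_error_handling(task, fallback_state="RecordFailure")
    print(retry_delays(handled["Retry"][0]))  # [2.0, 4.0, 8.0, 16.0]
```

Because the policy lives in the workflow definition rather than in application code, it can be reviewed and versioned alongside the rest of the infrastructure.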