How Disney Used AWS Glue as a Data Integration and ETL Framework (ANT335)

Title

AWS re:Invent 2022 - How Disney used AWS Glue as a data integration and ETL framework (ANT335)

Summary

  • Alona Nadler, AWS Glue Head of Product, and Ralph Peterkin, Principal Technical Architect from Disney, presented how Disney leverages AWS Glue for data integration and ETL.
  • Alona provided an overview of data integration challenges, the evolution of data, and the spectrum of ETL solutions from traditional to do-it-yourself platforms.
  • AWS Glue was introduced as a serverless service that combines the benefits of traditional ETL and open-source platforms, offering cost efficiency and ease of use.
  • AWS Glue features include serverless infrastructure, per-second billing, auto-scaling, support for multiple engines (Spark, Python Shell, Ray), and a variety of connectors.
  • Glue provides different ways to author and operationalize jobs, catering to various user skills and needs, and integrates with Git, job monitoring, and workflow orchestration.
  • Data management in Glue includes a Data Catalog, crawlers for schema recognition, and tools for sensitive data detection.
  • Ralph Peterkin shared Disney's journey with AWS Glue, which began during COVID-19 when new data sets and analytical processes were needed for capacity management.
  • Disney's technical architecture includes a developer pipeline, data storage layer, control plane, and an internal Glue framework.
  • The internal Glue framework uses YAML as a declarative language and caters to data engineers, scientists, and analysts.
  • Disney faced limitations with AWS Glue, which led to custom solutions using EventBridge, Lambda, and DynamoDB for job status tracking and metadata management.
  • Disney's dashboard provides job tracking, error handling, and performance insights.
  • The success with AWS Glue led Disney to migrate workloads from Hadoop to Glue and expand the platform to Disneyland Paris.
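
The job status tracking mentioned above (EventBridge, Lambda, and DynamoDB) might look roughly like the following. This is a minimal sketch and not Disney's implementation: the table name `glue_job_status`, the item attribute names, and the `table` injection parameter are all assumptions made for illustration. It assumes an EventBridge rule forwards "Glue Job State Change" events to a Lambda function, and relies on the `detail.jobName`, `detail.jobRunId`, `detail.state`, and `detail.message` fields that those events carry.

```python
from typing import Any, Dict, Optional

# States treated as final for a job run in this sketch (an assumption).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"}


def to_status_item(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a Glue Job State Change event into one status record."""
    detail = event["detail"]
    return {
        "job_name": detail["jobName"],        # hypothetical partition key
        "job_run_id": detail["jobRunId"],     # hypothetical sort key
        "state": detail["state"],
        "message": detail.get("message", ""),
        "event_time": event["time"],
        "is_terminal": detail["state"] in TERMINAL_STATES,
    }


def handler(event: Dict[str, Any], context: Any = None,
            table: Optional[Any] = None) -> Dict[str, Any]:
    """Lambda entry point: store the latest status of a job run."""
    if table is None:
        # Imported lazily so the mapping logic can be tested without AWS.
        import boto3
        table = boto3.resource("dynamodb").Table("glue_job_status")
    item = to_status_item(event)
    table.put_item(Item=item)
    return item
```

Passing the table in as a parameter keeps the event-to-record mapping testable locally; a dashboard like the one described above could then query the same table by job name and run ID.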

Insights

  • AWS Glue's serverless nature and scalability have been crucial for Disney to manage a large number of jobs efficiently, reducing the need for infrastructure management.
  • Disney's approach to abstracting Glue jobs into clusters and using a control plane for job submission and tracking demonstrates a sophisticated use of AWS services to overcome Glue's limitations.
  • The internal Glue framework developed by Disney highlights the need for customization in large enterprises to meet specific compliance and operational requirements.
  • Disney's migration from Hadoop to AWS Glue signifies a shift towards more managed, serverless data platforms that can offer faster time to market and lower operational overhead.
  • The ability to use AWS Glue for both ETL and data extraction (as a replacement for tools like Apache Sqoop) showcases the versatility of the service.
  • Disney's experience with AWS Glue underscores the importance of monitoring and error handling in data pipelines, as well as the need for detailed job tracking and performance analysis.
  • The successful deployment of AWS Glue at Disneyland Paris indicates that the solutions developed by Disney are scalable and transferable across different geographic locations.
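
To illustrate how a declarative job definition could feed a control plane that submits Glue jobs, here is a hedged sketch. The talk does not show the schema of Disney's YAML framework, so every field name below (`job`, `worker_type`, `workers`, `parameters`) and the example values are hypothetical. The sketch assumes the YAML has already been parsed into a dictionary, and translates it into keyword arguments for the Glue `start_job_run` API, where job arguments are passed as string values under keys prefixed with `--`.

```python
from typing import Any, Dict

# Hypothetical declarative spec, shown here as the YAML it might come from:
#
#   job: daily_capacity_load
#   worker_type: G.1X
#   workers: 4
#   parameters:
#     source_table: reservations
#     run_date: 2022-11-30
#
EXAMPLE_SPEC: Dict[str, Any] = {
    "job": "daily_capacity_load",
    "worker_type": "G.1X",
    "workers": 4,
    "parameters": {"source_table": "reservations", "run_date": "2022-11-30"},
}


def build_start_job_run_kwargs(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a parsed spec into keyword arguments for start_job_run."""
    params = spec.get("parameters", {})
    kwargs: Dict[str, Any] = {
        "JobName": spec["job"],
        # Glue job arguments are strings keyed with a leading "--".
        "Arguments": {f"--{key}": str(value) for key, value in params.items()},
    }
    # Capacity settings are optional; omit them to use the job's defaults.
    if "worker_type" in spec:
        kwargs["WorkerType"] = spec["worker_type"]
    if "workers" in spec:
        kwargs["NumberOfWorkers"] = int(spec["workers"])
    return kwargs


if __name__ == "__main__":
    print(build_start_job_run_kwargs(EXAMPLE_SPEC))
```

A control plane could then submit the run with `boto3.client("glue").start_job_run(**kwargs)` and record the returned run ID for tracking. Keeping the translation in one pure function means users only edit the declarative spec, while validation and compliance checks can live in a single place.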