Title
AWS re:Invent 2022 - How Disney used AWS Glue as a data integration and ETL framework (ANT335)
Summary
- Alona Nadler, AWS Glue Head of Product, and Ralph Peterkin, Principal Technical Architect from Disney, presented how Disney leverages AWS Glue for data integration and ETL.
- Alona provided an overview of data integration challenges, the evolution of data, and the spectrum of ETL solutions from traditional to do-it-yourself platforms.
- AWS Glue was introduced as a serverless service that combines the benefits of traditional ETL and open-source platforms, offering cost efficiency and ease of use.
- AWS Glue features include serverless infrastructure, per-second billing, auto-scaling, support for multiple engines (Spark, Python Shell, Ray), and a variety of connectors.
- Glue provides different ways to author and operationalize jobs, catering to various user skills and needs, and integrates with Git, job monitoring, and workflow orchestration.
- Data management in Glue includes a Data Catalog, crawlers for schema recognition, and tools for sensitive data detection.
- Ralph Peterkin shared Disney's journey with AWS Glue, which began during COVID-19 when new data sets and analytical processes were needed for capacity management.
- Disney's technical architecture includes a developer pipeline, data storage layer, control plane, and an internal Glue framework.
- The internal Glue framework uses YAML as a declarative language and caters to data engineers, scientists, and analysts.
- Disney faced limitations with AWS Glue, which led to custom solutions using EventBridge, Lambda, and DynamoDB for job status tracking and metadata management.
- Disney's dashboard provides job tracking, error handling, and performance insights.
- The success with AWS Glue led Disney to migrate workloads from Hadoop to Glue and expand the platform to Disneyland Paris.
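The job status tracking mentioned above (EventBridge, Lambda, and DynamoDB) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Disney's implementation: the function names, the item attribute names, and the key layout are all hypothetical. It assumes an EventBridge rule that forwards Glue's "Glue Job State Change" events, whose `detail` field carries `jobName`, `jobRunId`, `state`, and `message`. The table object is passed in so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, Protocol


class StatusTable(Protocol):
    """Anything exposing a DynamoDB-Table-style put_item method."""

    def put_item(self, Item: Dict[str, Any]) -> Any: ...


def build_status_item(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a Glue job state change event into one status record.

    The attribute names on the returned item are hypothetical.
    """
    detail = event["detail"]
    return {
        "job_name": detail["jobName"],      # assumed partition key
        "job_run_id": detail["jobRunId"],   # assumed sort key
        "state": detail["state"],           # e.g. SUCCEEDED, FAILED, TIMEOUT
        "message": detail.get("message", ""),
        "event_time": event.get("time", ""),
    }


def handle_event(event: Dict[str, Any], table: StatusTable) -> Dict[str, Any]:
    """Store the event if it is a Glue job state change; ignore anything else."""
    is_glue = event.get("source") == "aws.glue"
    is_state_change = event.get("detail-type") == "Glue Job State Change"
    if not (is_glue and is_state_change):
        return {}
    item = build_status_item(event)
    table.put_item(Item=item)
    return item
```

Inside a Lambda function, `handle_event` could be called with `boto3.resource("dynamodb").Table(...)` as the `table` argument; the table name and schema would depend on the deployment.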
Insights
- AWS Glue's serverless nature and scalability have been crucial for Disney to manage a large number of jobs efficiently, reducing the need for infrastructure management.
- Disney's approach to abstracting Glue jobs into clusters and using a control plane for job submission and tracking demonstrates a sophisticated use of AWS services to overcome Glue's limitations.
- The internal Glue framework developed by Disney highlights the need for customization in large enterprises to meet specific compliance and operational requirements.
- Disney's migration from Hadoop to AWS Glue signifies a shift towards more managed, serverless data platforms that can offer faster time to market and lower operational overhead.
- The ability to use AWS Glue for both ETL and data extraction (as a replacement for tools like Apache Sqoop) showcases the versatility of the service.
- Disney's experience with AWS Glue underscores the importance of monitoring and error handling in data pipelines, as well as the need for detailed job tracking and performance analysis.
- The successful deployment of AWS Glue at Disneyland Paris indicates that the solutions developed by Disney are scalable and transferable across different geographic locations.
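The idea of a declarative framework sitting in front of Glue, as described for Disney's internal framework and control plane, can be illustrated with a small sketch. Everything about the spec format here is an assumption: the field names (`job`, `workers`, `params`) and the function `to_start_job_run_kwargs` are hypothetical and do not reflect Disney's actual schema. The spec is shown as the Python dict that parsing a YAML file would produce, and the function maps it to the request parameters of Glue's StartJobRun API (`JobName`, `Arguments`, `WorkerType`, `NumberOfWorkers`).

```python
from typing import Any, Dict

# Hypothetical job spec, shown as the dict a parsed YAML file would yield.
EXAMPLE_SPEC: Dict[str, Any] = {
    "job": "example_job",
    "workers": {"type": "G.1X", "count": 10},
    "params": {"source_table": "raw.example", "run_date": "2022-11-30"},
}


def to_start_job_run_kwargs(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a declarative spec into keyword arguments for StartJobRun."""
    if "job" not in spec:
        raise ValueError("spec must name a Glue job under 'job'")
    kwargs: Dict[str, Any] = {"JobName": spec["job"]}

    # Glue job arguments are string pairs; keys are conventionally prefixed with "--".
    params = spec.get("params", {})
    if params:
        kwargs["Arguments"] = {f"--{key}": str(value) for key, value in params.items()}

    # Worker settings are optional; when omitted, the job's own defaults apply.
    workers = spec.get("workers")
    if workers:
        kwargs["WorkerType"] = workers["type"]
        kwargs["NumberOfWorkers"] = int(workers["count"])
    return kwargs
```

The resulting dict could then be passed to `boto3.client("glue").start_job_run(**kwargs)`. Keeping the translation a pure function means a control plane can validate and log the request before anything is submitted.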