Build Your Data Lakehouse with Starburst Galaxy (PRT014)

Title

AWS re:Invent 2022 - Build your data lakehouse with Starburst Galaxy (PRT014)

Summary

  • Monica, a developer advocate at Starburst, discusses how to enhance data lake architecture using Starburst Galaxy.
  • She highlights how difficult it is to navigate and extract information from data lakes that lack proper configuration and organization.
  • Monica introduces the concept of a data lakehouse, which combines the benefits of data lakes and warehouses by adding governance, security, and controls.
  • Starburst Galaxy is recommended for its flexibility, its support for open table formats (Apache Iceberg, Delta Lake, and Apache Hudi), and its seamless integration with AWS services.
  • The talk covers the importance of using open table formats for warehouse-like functionality, implementing native security for governance, and building a reporting structure for organization.
  • A demo using Pokémon GO data illustrates how to create a data lakehouse with Starburst Galaxy, covering data ingestion, cleaning, optimization, and visualization with ThoughtSpot.
  • The demo showcases the creation of different data layers (land, structure, consume), the use of AWS S3 and MongoDB, and the configuration of security roles and permissions within Starburst Galaxy.
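
The three layers from the demo (land, structure, consume) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the SQL or tooling shown in the talk: the field names (`name`, `type`, `cp`), the sample rows, and the helper functions `to_structure` and `to_consume` are all hypothetical.

```python
from collections import defaultdict

# "Land" layer: hypothetical raw records, kept exactly as they arrived.
LAND = [
    {"name": " Pikachu ", "type": "electric", "cp": "320"},
    {"name": "Eevee", "type": "Normal", "cp": "210"},
    {"name": "Raichu", "type": "ELECTRIC", "cp": "n/a"},  # unparseable value
    {"name": "Snorlax", "type": "normal", "cp": "1500"},
]


def to_structure(rows):
    """"Structure" layer: trim and normalize text, cast cp to int, drop bad rows."""
    cleaned = []
    for row in rows:
        try:
            cp = int(row["cp"])
        except ValueError:
            continue  # skip rows whose cp cannot be parsed as an integer
        cleaned.append(
            {
                "name": row["name"].strip(),
                "type": row["type"].strip().lower(),
                "cp": cp,
            }
        )
    return cleaned


def to_consume(rows):
    """"Consume" layer: a query-ready aggregate, here the average cp per type."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(row["cp"])
    return {t: sum(values) / len(values) for t, values in sorted(by_type.items())}


structure = to_structure(LAND)
consume = to_consume(structure)
print(consume)  # {'electric': 320.0, 'normal': 855.0}
```

In a lakehouse, each layer would typically be a separate schema or storage location, with the structure and consume layers stored in an open table format. The point here is only the direction of flow: raw data is never edited in place, and each later layer is derived from the one before it.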

Insights

  • Data lakes can be challenging to navigate without proper organization, but by applying governance and controls, they can be transformed into more functional data lakehouses.
  • Starburst Galaxy supports both interactive and long-running queries, which matters for businesses that need to handle a variety of data workloads.
  • Open table formats like Iceberg, Delta Lake, and Hudi are essential because they provide ACID transactions and metadata management, which are key features of data warehouses.
  • Security is a critical aspect of data management, and Starburst Galaxy allows for granular access control down to the table or location level, supporting common SSO identity providers.
  • The concept of structured data layers (land, structure, consume) helps in organizing the data lifecycle from raw ingestion to query-ready formats, which is a best practice for data lakehouse architecture.
  • The use of visual analytics tools like ThoughtSpot in conjunction with Starburst Galaxy can provide valuable insights and visualizations from the data stored in a data lakehouse.
  • By using real-world data (Pokémon GO) to walk through building a data lakehouse with Starburst Galaxy, the demo makes the concept more relatable and easier to understand.
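
The idea of granting access at the table or location level can be illustrated with a small Python model. This is a sketch under stated assumptions and does not represent Starburst Galaxy's actual API or permission model: the `Role` class, its methods, the role name, the table names, and the `s3://example-lake/...` paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Hypothetical role holding table-level and location-level read grants."""

    name: str
    tables: set = field(default_factory=set)     # fully qualified table names
    locations: set = field(default_factory=set)  # storage location prefixes

    def can_read_table(self, table: str) -> bool:
        # Table-level check: the exact table must have been granted.
        return table in self.tables

    def can_read_location(self, path: str) -> bool:
        # Location-level check: the path must sit under a granted prefix.
        return any(path.startswith(prefix) for prefix in self.locations)


# An analyst who may only see the query-ready "consume" layer.
analyst = Role(
    name="analyst",
    tables={"lakehouse.consume.pokemon_by_type"},
    locations={"s3://example-lake/consume/"},
)

print(analyst.can_read_table("lakehouse.consume.pokemon_by_type"))  # True
print(analyst.can_read_table("lakehouse.land.pokemon_raw"))         # False
print(analyst.can_read_location("s3://example-lake/consume/pokemon/part-0.parquet"))  # True
print(analyst.can_read_location("s3://example-lake/land/raw.csv"))  # False
```

Pairing roles with layers in this way keeps raw data restricted to the people who ingest and clean it, while report consumers see only the curated tables.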