Build Your Open Data Lakehouse with Dremio and AWS (PRT085)

Title

AWS re:Invent 2022 - Build your open data lakehouse with Dremio and AWS (PRT085)

Summary

  • Brett Roberts, Principal Partner Solutions Architect at Dremio, discusses building an open data lakehouse with Dremio and AWS.
  • Dremio is an open data lakehouse platform that allows high-performance BI directly on the data lake without data movement or copying.
  • Dremio contributes to open source with projects like Apache Arrow, Project Nessie, and Apache Iceberg.
  • Data lakehouses combine the best features of data warehouses and data lakes into a unified architecture.
  • Key components of a data lakehouse include scalable data lake storage (e.g., Amazon S3), open file and table formats (e.g., Parquet, Apache Iceberg), query engines, and a self-service semantic layer.
  • Dremio's reference architecture with AWS includes Dremio Arctic (an intelligent metastore) and integrates with AWS services such as Glue and Lake Formation for security and governance.
  • An open architecture provides flexibility and helps future-proof the platform as new technologies emerge.
  • Dremio offers a SaaS version, Dremio Cloud, that lets users try out its platform.
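
The layering described above (object storage, immutable data files, table-format metadata, and independent query engines) can be sketched with a small toy model. This is only an illustration under stated assumptions: `object_store`, `Snapshot`, `Table`, `bi_engine_total`, and `ml_engine_row_count` are hypothetical names invented for this example, a plain dictionary stands in for Amazon S3, tuples of rows stand in for Parquet files, and the snapshot list is only loosely inspired by how a table format such as Apache Iceberg tracks table versions. It does not use or represent the Dremio, Iceberg, or AWS APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

# Toy stand-in for object storage such as Amazon S3: key -> immutable rows.
# A real lakehouse would hold Parquet files here (assumption for illustration).
object_store: Dict[str, Tuple[dict, ...]] = {}


@dataclass(frozen=True)
class Snapshot:
    """One table version: the data-file keys visible at that version."""
    snapshot_id: int
    file_keys: Tuple[str, ...]


@dataclass
class Table:
    """Hypothetical table-format metadata: an ordered list of snapshots."""
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def append(self, key: str, rows: List[dict]) -> Snapshot:
        # Data files are written once and never modified in place.
        object_store[key] = tuple(rows)
        previous = self.snapshots[-1].file_keys if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + (key,))
        self.snapshots.append(snapshot)
        return snapshot

    def scan(self, snapshot_id: Optional[int] = None) -> Iterator[dict]:
        # Readers resolve a snapshot, then stream rows straight from storage.
        if snapshot_id is None:
            snapshot = self.snapshots[-1]
        else:
            snapshot = self.snapshots[snapshot_id - 1]
        for key in snapshot.file_keys:
            yield from object_store[key]


# Two independent "engines" read the same files; neither keeps its own copy.
def bi_engine_total(table: Table) -> float:
    return sum(row["amount"] for row in table.scan())


def ml_engine_row_count(table: Table, snapshot_id: Optional[int] = None) -> int:
    return sum(1 for _ in table.scan(snapshot_id))


orders = Table("orders")
orders.append("orders/part-0.parquet", [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}])
orders.append("orders/part-1.parquet", [{"id": 3, "amount": 4.5}])
```

Both functions scan the same two stored files through the table metadata, and passing `snapshot_id=1` to `ml_engine_row_count` reads the table as it was before the second append. The point of the sketch is that the metadata layer, not any single engine, defines what the table contains.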

Insights

  • The concept of a data lakehouse is gaining traction as organizations seek to modernize their data architectures by combining the benefits of data lakes and data warehouses.
  • Dremio's open source contributions, such as Apache Arrow, are widely used in the data community; Arrow alone sees around 60 million downloads per month.
  • The move towards open architectures is driven by the need for flexibility, cost-effectiveness, and the ability to avoid vendor lock-in.
  • Dremio's partnership with AWS and the use of AWS services like S3, Glue, and Lake Formation indicate a strong integration with the AWS ecosystem, which is beneficial for AWS customers looking to implement a data lakehouse.
  • The emphasis on a self-service semantic layer and the ability to perform Git-like operations on data catalogs (Project Nessie) highlight the importance of agility and collaboration in modern data teams.
  • The presentation suggests a growing trend towards decoupling compute from data, which lets multiple query engines work on the same data sets without creating redundant copies, improving efficiency and reducing costs.
  • Dremio's offering of a free SaaS version for users to test its platform demonstrates confidence in its product and a customer-centric approach to adoption.
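
The Git-like catalog operations mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Catalog` class, its `create_branch`, `commit`, and `merge` methods, and the metadata paths are hypothetical and are not the Project Nessie or Dremio Arctic API. The sketch only shows the general idea that a branch is a set of pointers to table metadata, so branching and merging move pointers rather than copy data.

```python
from typing import Dict


class Catalog:
    """Toy catalog with Git-like branches; NOT the Project Nessie API."""

    def __init__(self) -> None:
        # Each branch maps a table name to a pointer to that table's metadata.
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Branching copies pointers only; no data files are duplicated.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch: str, table: str, metadata_pointer: str) -> None:
        # A commit moves one table's pointer on one branch.
        self.branches[branch][table] = metadata_pointer

    def merge(self, source: str, target: str = "main") -> None:
        # Simplification: the source branch's pointers win.
        # A real catalog would detect and report conflicting changes.
        self.branches[target].update(self.branches[source])


catalog = Catalog()
catalog.commit("main", "orders", "orders/metadata/v1.json")  # hypothetical path

# Work on an isolated branch; readers of "main" are unaffected until the merge.
catalog.create_branch("etl")
catalog.commit("etl", "orders", "orders/metadata/v2.json")  # hypothetical path
visible_on_main_before_merge = catalog.branches["main"]["orders"]

catalog.merge("etl")
visible_on_main_after_merge = catalog.branches["main"]["orders"]
```

In this model, a data team could validate changes on the `etl` branch while consumers keep reading `main`, then publish the result with one pointer update. Conflict handling, commit history, and access control are deliberately left out.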